scylladb

Author	SHA1	Message	Date
Yaron Kaikov	6c0825e2a6	release: prepare for 4.6.11	2022-11-28 15:45:26 +02:00
Nadav Har'El	db3dd3bdf6	Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek When filtering with multi column restriction present all other restrictions were ignored. So a query like: `SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;` would ignore the restriction `regular_col = 0`. This was caused by a bug in the filtering code: `2779a171fc/cql3/selection/selection.cc (L433-L449)` When multi column restrictions were detected, the code checked if they are satisfied and returned immediately. This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied. This code was introduced back in 2019, when fixing #3574. Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct. Fixes: #6200 Fixes: #12014 Closes #12031 * github.com:scylladb/scylladb: cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works boost/restrictions-test: uncomment part of the test that passes now cql-pytest: enable test for filtering combined multi column and regular column restrictions cql3: don't ignore other restrictions when a multi column restriction is present during filtering (cherry picked from commit `2d2034ea28`) Closes #12086	2022-11-27 00:15:04 +02:00
Pavel Emelyanov	4ad24180f5	Merge '[branch-4.6] multishard_mutation_query: don't unpop partition header of spent partition ' from Botond Dénes When stopping the read, the multishard reader will dismantle the compaction state, pushing back (unpopping) the currently processed partition's header to its originating reader. This ensures that if the reader stops in the middle of a partition, on the next page the partition-header is re-emitted as the compactor (and everything downstream from it) expects. It can happen however that there is nothing more for the current partition in the reader and the next fragment is another partition. Since we only push back the partition header (without a partition-end) this can result in two partitions being emitted without being separated by a partition end. We could just add the missing partition-end when needed but it is pointless, if the partition has no more data, just drop the header, we won't need it on the next page. The missing partition-end can generate an "IDL frame truncated" message as it ends up causing the query result writer to create a corrupt partition entry. Fixes: https://github.com/scylladb/scylladb/issues/9482 Closes #11914 * github.com:scylladb/scylladb: test/cql-pytest: add regression test for "IDL frame truncated" error mutation_compactor: detach_state(): make it no-op if partition was exhausted treewide: fix headers	2022-11-16 11:52:51 +03:00
Anna Mikhlin	755c7eeb6a	release: prepare for 4.6.10	2022-11-14 10:30:20 +02:00
Eliran Sinvani	8914ca8c58	cql: Fix crash upon use of the word empty for service level name Wrong access to an uninitialized token instead of the actual generated string caused the parser to crash. This wasn't detected by the ANTLR3 compiler because all the temporary variables defined in the ANTLR3 statements are global in the generated code. This essentially caused a null dereference. Tests: 1. The fixed issue scenario from github. 2. Unit tests in release mode. Fixes #11774 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Message-Id: <20190612133151.20609-1-eliransin@scylladb.com> Closes #11777 (cherry picked from commit `ab7429b77d`)	2022-11-10 20:43:44 +02:00
Botond Dénes	e82e4bbed3	test/cql-pytest: add regression test for "IDL frame truncated" error (cherry picked from commit `11af489e84`)	2022-11-07 16:51:14 +02:00
Botond Dénes	f9c457778e	mutation_compactor: detach_state(): make it no-op if partition was exhausted detach_state() allows the user to resume a compaction process later, without having to keep the compactor object alive. This happens by generating and returning the mutation fragments the user has to re-feed to a newly constructed compactor to bring it into the exact same state the current compactor was at the point of stopping the compaction. This state includes the partition-header (partition-start and static-row if any) and the currently active range tombstone. Detaching the state is pointless however when the compaction was stopped such that the currently compacted partition was completely exhausted. Allowing the state to be detached in this case seems benign but it caused a subtle bug in the main user of this feature: the partition range scan algorithm, where the fragments included in the detached state were pushed back into the reader which produced them. If the partition happened to be exhausted -- meaning the next fragment in the reader was a partition-start or EOS -- this resulted in the partition being re-emitted later without a partition-end, resulting in corrupt query-result being generated, in turn resulting in an obscure "IDL frame truncated" error. This patch solves this seemingly benign but sinister bug by making the return value of `detach_state()` an std::optional and returning a disengaged optional when the partition was exhausted. (cherry picked from commit `70b4158ce0`)	2022-11-07 16:51:14 +02:00
Botond Dénes	8315a7b164	treewide: fix headers To fix CI.	2022-11-07 16:51:14 +02:00
Nadav Har'El	291ca8db60	cql3: fix cql3::util::maybe_quote() for keywords cql3::util::maybe_quote() is a utility function formatting an identifier name (table name, column name, etc.) that needs to be embedded in a CQL statement - and might require quoting if it contains non-alphanumeric characters, uppercase characters, or a CQL keyword. maybe_quote() made an effort to only quote the identifier name if necessary, e.g., a lowercase name usually does not need quoting. But lowercase names that are CQL keywords - e.g., to or where - cannot be used as identifiers without quoting. This can cause problems for code that wants to generate CQL statements, such as the materialized-view problem in issue #9450 - where a user had a column called "to" and wanted to create a materialized view for it. So in this patch we fix maybe_quote() to recognize invalid identifiers by using the CQL parser, and quote them. This will quote reserved keywords, but not so-called unreserved keywords, which are allowed as identifiers and don't need quoting. This addition slows down maybe_quote(), but maybe_quote() is anyway only used in heavy operations which need to generate CQL. This patch also adds two tests that reproduce the bug and verify its fix: 1. Add to the low-level maybe_quote() test (a C++ unit test) also tests that maybe_quote() quotes reserved keywords like "to", but doesn't quote unreserved keywords like "int". 2. Add a test reproducing issue #9450 - creating a materialized view whose key column is a keyword. This new test passes on Cassandra, failed on Scylla before this patch, and passes after this patch. It is worth noting that maybe_quote() now has a "forward compatibility" problem: If we save CQL statements generated by maybe_quote(), and a future version introduces a new reserved keyword, the parser of the future version may not be able to parse the saved CQL statement that was generated with the old maybe_quote() and didn't quote what is now a keyword.
This problem can be solved in two ways: 1. Try hard not to introduce new reserved keywords. Instead, introduce unreserved keywords. We've been doing this even before recognizing this maybe_quote() future-compatibility problem. 2. In the next patch we will introduce quote() - which unconditionally quotes identifier names, even if lowercase. These quoted names will be uglier for lowercase names - but will be safe from future introduction of new keywords. So we can consider switching some or all uses of maybe_quote() to quote(). Fixes #9450 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220118161217.231811-1-nyh@scylladb.com> (cherry picked from commit `5d2f694a90`)	2022-11-07 10:38:10 +02:00
Jadw1	4da5fbaa24	CQL3: fromJson accepts string as bool The problem was incompatibility with Cassandra, which accepts bool as a string in `fromJson()` UDF. The difference between Cassandra and Scylla now is that Scylla accepts whitespace around the word in the string and Cassandra doesn't. Both are case insensitive. Fixes: #7915 (cherry picked from commit `1902dbc9ff`)	2022-11-07 10:38:10 +02:00
Takuya ASADA	fc16664d81	locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service EC2 instance metadata service can be busy, let's retry connecting with an interval, just like we do in scylla-machine-image. Fixes #10250 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #11688 (cherry picked from commit `6b246dc119`) (cherry picked from commit `e2809674d2`)	2022-11-06 15:43:58 +02:00
Botond Dénes	80bea5341e	Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted. This series fixes this bug (it was a bug in our materialized views implementation) and adds a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it). Fixes #11801 Closes #11808 * github.com:scylladb/scylladb: test/alternator: add test for issue 11801 MV: fix handling of view update which reassign the same key value materialized views: inline used-once and confusing function, replace_entry() (cherry picked from commit `e981bd4f21`)	2022-11-01 13:31:51 +02:00
Botond Dénes	6ecc772b56	mutation_partition: deletable_row::apply(shadowable_tombstone): remove redundant maybe_shadow() Shadowing is already checked by the underlying row_tombstone::apply(). This redundant check was introduced by a previous fix to #9483 (`6a76e12768`). The rest of that patch is good. Refs: #9483 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211115091513.181233-1-bdenes@scylladb.com> (cherry picked from commit `b136746040`)	2022-10-16 11:53:04 +03:00
Benny Halevy	0b2e951954	range_tombstone_list: insert_from: correct rev.update range_tombstone in not overlapping case 2nd std::move(start) looks like a typo in `fe2fa3f20d`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220404124741.1775076-1-bhalevy@scylladb.com> (cherry picked from commit `2d80057617`)	2022-10-14 12:29:56 +02:00
Pavel Emelyanov	f2a738497f	compaction_manager: Swallow ENOSPCs in ::stop() When being stopped, the compaction manager may step on ENOSPC. This is not a reason to fail the stopping process with an abort; it is better to log a warning and proceed as if nothing happened. refs: #11245 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 16:02:33 +03:00
Pavel Emelyanov	badf7c816f	exceptions: Mark storage_io_error::code() with noexcept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 16:02:32 +03:00
Pavel Emelyanov	bfb86f2c78	compaction_manager: Shuffle really_do_stop() Make it a future-returning method and set up the _stop_future in its only caller. Makes the next patch much simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-13 16:02:31 +03:00
Beni Peled	18e7a46038	release: prepare for 4.6.9	2022-10-09 08:54:33 +03:00
Nadav Har'El	cbcfa31e51	cql: validate bloom_filter_fp_chance up-front Scylla's Bloom filter implementation has a minimal false-positive rate that it can support (6.71e-5). When setting bloom_filter_fp_chance any lower than that, the compute_bloom_spec() function, which writes the bloom filter, throws an exception. However, this is too late - it only happens while flushing the memtable to disk, and a failure at that point causes Scylla to crash. Instead, we should refuse the table creation with the unsupported bloom_filter_fp_chance. This is also what Cassandra did six years ago - see CASSANDRA-11920. This patch also includes a regression test, which crashes Scylla before this patch but passes after the patch (and also passes on Cassandra). Fixes #11524. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11576 (cherry picked from commit `4c93a694b7`)	2022-10-04 16:23:25 +03:00
Nadav Har'El	5ee69ff3a9	alternator: return ProvisionedThroughput in DescribeTable DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable operation must return a ProvisionedThroughput structure, listing both ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not stated in some DynamoDB documentation but is explicitly mentioned in https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html Also, empirically, DynamoDB returns ProvisionedThroughput with zeros even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this. The ProvisionedThroughput structure being missing was a problem for applications like DynamoDB connectors for Spark, if they implicitly assume that ProvisionedThroughput is returned by DescribeTable, and fail (as described in issue #11222) if it's outright missing. So this patch adds the missing ProvisionedThroughput structure, and the xfailing test starts to pass. Note that this patch doesn't change the fact that attempting to set a table to PROVISIONED billing mode is ignored: DescribeTable continues to always return PAY_PER_REQUEST as the billing mode and zero as the provisioned capacities. Fixes #11222 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11298 (cherry picked from commit `941c719a23`)	2022-10-03 14:29:22 +03:00
Tomasz Grabiec	949103d22a	test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone The generator was first setting the marker then applied tombstones. The marker was set like this: row.marker() = random_row_marker(); Later, when shadowable tombstones were applied, they were compacted with the marker as expected. However, the key for the row was chosen randomly in each iteration and there are multiple keys set, so there was a possibility of a key clash with an earlier row. This could override the marker without applying any tombstones, which is conditional on random choice. This could generate rows with markers uncompacted with shadowable tombstones. This broke row_cache_test::test_concurrent_reads_and_eviction on comparison between expected and read mutations. The latter was compacted because it went through an extra merge path, which compacts the row. Fix by making sure there are no key clashes. Closes #11663 (cherry picked from commit `5268f0f837`)	2022-10-03 09:00:28 +03:00
Botond Dénes	549cb60f4c	sstables: crawling mx-reader: make on_out_of_clustering_range() no-op Said method currently emits a partition-end. This method is only called when the last fragment in the stream is a range tombstone change with a position after all clustered rows. The problem is that consume_partition_end() is also called unconditionally, resulting in two partition-end fragments being emitted. The fix is simple: make this method a no-op, there is nothing to do there. Also add two tests: one targeted to this bug and another one testing the crawling reader with random mutations generated for random schema. Fixes: #11421 Closes #11422 (cherry picked from commit `be9d1c4df4`)	2022-09-30 17:56:58 +03:00
Botond Dénes	37633c5576	test/lib/random_schema: add a simpler overload for fixed partition count Some tests want to generate a fixed amount of random partitions, make their life easier. (cherry picked from commit `98f3d516a2`) Ref #11421 (prerequisite)	2022-09-30 17:56:10 +03:00
Michael Livshin	abd9f43fa7	batchlog_manager: warn when a batch fails to replay Only for reasons other than "no such KS", i.e. when the failure is presumed transient and the batch in question is not deleted from batchlog and will be retried in the future. (Would info be more appropriate here than warning?) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10556 Fixes #10636 (cherry picked from commit `00ed4ac74c`)	2022-09-29 12:13:21 +03:00
Raphael S. Carvalho	d41d4db5c0	compaction: Make cleanup withstand better disk pressure scenario It's not uncommon for cleanup to be issued against an entire keyspace, which may be composed of tons of tables. To increase chances of success if low on space, cleanup will now start from smaller tables first, such that bigger tables will have more space available, once they're reached, to satisfy their space requirement. parallel_for_each() is dropped and wasn't needed given that manager performs per-shard serialization of cleanup jobs. Refs #9504. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211130133712.64517-1-raphaelsc@scylladb.com> (cherry picked from commit `0d5ac845e1`)	2022-09-29 10:15:29 +03:00
Michał Radwański	c500043a78	flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO. In functions such as upgrade_to_v2 (excerpt below), if the constructor of transforming_reader throws, r needs to be destroyed, however it hasn't been closed. However, if a reader didn't start any operations, it is safe to destruct such a reader. This issue can potentially manifest itself in many more readers and might be hard to track down. This commit adds a bool indicating whether a close is anticipated, thus avoiding errors in the destructor. Code excerpt: flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) { class transforming_reader : public flat_mutation_reader_v2::impl { // ... }; return make_flat_mutation_reader_v2<transforming_reader>(std::move(r)); } Fixes #9065. (cherry picked from commit `9ada63a9cb`)	2022-09-29 09:40:07 +03:00
Pavel Emelyanov	af4752a526	messaging_service: Fix gossiper verb group When configuring tcp-nodelay unconditionally, messaging service thinks gossiper uses group index 1, though it had changed some time ago and now those verbs belong to group 0. fixes: #11465 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `2c74062962`)	2022-09-19 10:32:49 +03:00
Anna Mikhlin	0aa9a8c266	release: prepare for 4.6.8	2022-09-19 09:30:09 +03:00
Michał Chojnowski	85fd6ab377	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys.
Fixes #11202 (cherry picked from commit `cdb3e71045`)	2022-09-18 13:30:28 +03:00
Beni Peled	7c79c513d1	release: prepare for 4.6.7	2022-09-07 11:17:55 +03:00
Karol Baryła	9a8e73f0c3	transport/server.cc: Return correct size of decompressed lz4 buffer An incorrect size is returned from the function, which could lead to crashes or undefined behavior. Fix by erroring out in these cases. Fixes #11476 (cherry picked from commit `1c2eef384d`)	2022-09-07 10:58:54 +03:00
Benny Halevy	fac0443200	snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema Detecting a secondary index by checking for a dot in the table name is wrong as tables generated by Alternator may contain a dot in their name. Instead detect both materialized views and secondary indexes using the schema()->is_view() method. Fixes #10526 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `aa127a2dbb`)	2022-09-06 17:56:30 +03:00
Piotr Sarna	6bcfef2cfa	cql3: fix misleading error message for service level timeouts The error message incorrectly stated that the timeout value cannot be longer than 24h, but it can - the actual restriction is that the value cannot be expressed in units like days or months, which was done in order to significantly simplify the parsing routines (and the fact that timeouts counted in days are not expected to be common). Fixes #10286 Closes #10294 (cherry picked from commit `85e95a8cc3`)	2022-09-01 20:34:22 +03:00
Juliusz Stasiewicz	d2c67a2429	cdc/check_and_repair_cdc_streams: ignore LEFT endpoints When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla would throw. This behavior is fixed so that LEFT nodes are simply ignored. Fixes #9771 Closes #9778 (cherry picked from commit `351f142791`)	2022-09-01 15:44:35 +03:00
Avi Kivity	d6c2f228e7	Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec Scenario: cache = [ row(pos=2, continuous=false), row(pos=after(2), dummy=true) ] Scanning read starts, starts populating [-inf, before(2)] from sstables. row(pos=2) is evicted. cache = [ row(pos=after(2), dummy=true) ] Scanning read finishes reading from sstables. Refreshes cache cursor via partition_snapshot_row_cursor::maybe_refresh(), which calls partition_snapshot_row_cursor::advance_to() because iterators are invalidated. This advances the cursor to after(2). no_clustering_row_between(2, after(2)) returns true, so advance_to() returns true, and maybe_refresh() returns true. This is interpreted by the cache reader as "the cursor has not moved forward", so it marks the range as complete, without emitting the row with pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads will also miss the row. The bug is in advance_to(), which is using no_clustering_row_between(a, b) to determine its result, which by definition excludes the starting key. Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction with reduced key range in the random_mutation_generator (1024 -> 16). Fixes #11239 Closes #11240 * github.com:scylladb/scylladb: test: mvcc: Fix illegal use of maybe_refresh() tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range() tests: row_cache_test: Introduce one_shot mode to throttle row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy	2022-08-11 19:19:30 +02:00
Yaron Kaikov	a1b1df2074	release: prepare for 4.6.6	2022-08-07 16:24:51 +03:00
Avi Kivity	14e13ecbd4	Merge 'Backport: Fix map subscript crashes when map or subscript is null' from Nadav Har'El This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0. Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but nevertheless was fairly straightforward - just copy the exact same checking code to its right place, and keep the exact same tests to see we indeed fixed the bug. Refs #10535. The original cover letter from https://github.com/scylladb/scylla/pull/10420: In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically. In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL. However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes https://github.com/scylladb/scylla/issues/10361 Fixes https://github.com/scylladb/scylla/issues/10399 Fixes https://github.com/scylladb/scylla/pull/10401 Closes #11142 * github.com:scylladb/scylla: test/cql-pytest: reproducer for CONTAINS NULL bug expressions: don't dereference invalid map subscript in filter expressions: fix invalid dereference in map subscript evaluation test/cql-pytest: improve tests for map subscripts and nulls (cherry picked from commit `23a34d7e42`)	2022-07-31 15:44:00 +03:00
Benny Halevy	b8740bde6e	multishard_mutation_query: do_query: stop ctx if lookup_readers fails lookup_readers might fail after populating some readers and those better be closed before returning the exception. Fixes #10351 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10425 (cherry picked from commit `055141fc2e`)	2022-07-25 14:52:58 +03:00
Benny Halevy	1b23f8d038	sstables: time_series_sstable_set: insert: make exception safe Need to erase the shared sstable from _sstables if insertion to _sstables_reversed fails. Fixes #10787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `cd68b04fbf`)	2022-07-25 14:22:08 +03:00
Tomasz Grabiec	05a228e4c5	memtable: Fix missing range tombstones during reads under certain rare conditions There is a bug introduced in `e74c3c8` (4.6.0) which makes memtable reader skip a range tombstone for a certain pattern of deletions and under certain sequence of events. _rt_stream contains the result of deoverlapping range tombstones which had the same position, which were sipped from all the versions. The result of deoverlapping may produce a range tombstone which starts later, at the same position as a more recent tombstone which has not been sipped from the partition version yet. If we consume the old range tombstone from _rt_stream and then refresh the iterators, the refresh will skip over the newer tombstone. The fix is to drop the logic which drains _rt_stream so that _rt_stream is always merged with partition versions. For the problem to trigger, there have to be multiple MVCC versions (at least 2) which contain deletions of the following form: [a, c] @ t0 [a, b) @ t1, [b, d] @ t2 c > b The proper sequence for such versions is (assuming d > c): [a, b) @ t1, [b, d] @ t2 Due to the bug, the reader will produce: [a, b) @ t1, [b, c] @ t0 The reader also needs to be preempted right before processing [b, d] @ t2 and iterators need to get invalidated so that lsa_partition_reader::do_refresh_state() is called and it skips over [b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it does emit the proper range tombstone, it's possible that it will violate fragment order in the stream if _rt_stream accumulated remainders (possible with 3 MVCC versions). The problem goes away once MVCC versions merge. Fixes #10913 Fixes #10830 Closes #10914 (cherry picked from commit `a6aef60b93`) [avi: backport prerequisite position_range_to_clustering_range() too]	2022-07-19 19:27:15 +03:00
Yaron Kaikov	2ec293ab0e	release: prepare for 4.6.5	2022-07-19 16:02:46 +03:00
Pavel Emelyanov	b60f14601e	azure_snitch: Do nothing on non-io-cpu All snitch drivers are supposed to snitch info on some shard and replicate the dc/rack info across others. All but the azure one really do so. The azure one gets dc/rack on all shards, which is excessive but not terrible, but when all shards start to replicate their data to all the others, this may lead to use-after-frees. fixes: #10494 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `c6d0bc87d0`)	2022-07-17 14:22:29 +03:00
Raphael S. Carvalho	284dd21ef7	compaction_manager: Fix race when selecting sstables for rewrite operations Rewrite operations are scrub, cleanup and upgrade. Race can happen because 'selection of sstables' and 'mark sstables as compacting' are decoupled. So any deferring point in between can lead to a parallel compaction picking the same files. After commit `2cf0c4bbf`, files are marked as compacting before rewrite starts, but it didn't take into account the commit `c84217ad` which moved retrieval of candidates to a deferring thread, before rewrite_sstables() is even called. Scrub isn't affected by this because it uses a coarse grained approach where whole operation is run with compaction disabled, which isn't good because regular compaction cannot run until its completion. From now on, selection of files and marking them as compacting will be serialized by running them with compaction disabled. Now cleanup will also retrieve sstables with compaction disabled, meaning it will no longer leave uncleaned files behind, which is important to avoid data resurrection if node regains ownership of data in uncleaned files. Fixes #8168. Refs #8155. [backport notes: - minor conflict around run_with_compaction_disabled() - bumped into our old friend https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95111, so I had to use std::ref() on local copy of lambda - with the yielding part of candidate retrieval now happening in rewrite_sstables(), task registration is moved to after run_with_compaction_disabled() call, so the latter won't incorrectly try to stop the task that called it, which triggers an assert in debug mode. ] Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com> (cherry picked from commit `80a1ebf0f3`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10963	2022-07-13 18:45:36 +03:00
Pavel Emelyanov	8b52f1d6e7	view: Fix trace-state pointer use after move It's moved into .mutate_locally() but is then captured and used in its continuation. It works only because a moved-from pointer looks like nullptr and all the tracing code checks that it is non-null. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/ (CI job failed on post-actions thus it's red) Fixes #11015 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220711134152.30346-1-xemul@scylladb.com> (cherry picked from commit `5526738794`)	2022-07-12 14:21:11 +03:00
Piotr Sarna	157951f756	view: exclude using static columns in the view filter The code which applied view filtering (i.e. a condition placed on a view column, e.g. "WHERE v = 42") erroneously used a wildcard selection, which also assumes that static columns are needed, if the base table contains any such columns. The filtering code currently assumes that no such columns are fetched, so the selection is amended to only ask for regular columns (primary key columns are sent anyway, because they are enabled via slice options, so no need to ask for them explicitly). Fixes #10851 Closes #10855 (cherry picked from commit `bc3a635c42`)	2022-07-11 17:07:22 +03:00
Juliusz Stasiewicz	4f643ed4a5	cdc: `check_and_repair_cdc_streams`: regenerate if too many streams are present If the number of streams exceeds the number of token ranges it indicates that some spurious streams from decommissioned nodes are present. In such a situation - simply regenerate. Fixes #9772 Closes #9780 (cherry picked from commit `ea46439858`)	2022-07-07 18:53:14 +02:00
Avi Kivity	b598629b7f	messaging: do isolate default tenants In `10dd08c9` ("messaging_service: supply and interpret rpc isolation_cookies", 4.2), we added a mechanism to perform rpc calls in remote scheduling groups based on the connection identity (rather than the verb), so that connection processing itself can run in the correct group (not just verb processing), and so that one verb can run in different groups according to need. In `16d8cdadc` ("messaging_service: introduce the tenant concept", 4.2), we changed the way isolation cookies are sent: scheduling_group messaging_service::scheduling_group_for_verb(messaging_verb verb) const { return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group; @@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge if (must_compress) { opts.compressor_factory = &compressor_factory; } opts.tcp_nodelay = must_tcp_nodelay; opts.reuseaddr = true; - opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + // We send cookies only for non-default statement tenant clients. + if (idx > 3) { + opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + } This effectively disables the mechanism for the default tenant. As a result some verbs will be executed in whatever group the messaging service listener was started in. This used to be the main group, but in `554ab03` ("main: Run init_server and join_cluster inside maintenance scheduling group", 4.5), this was changed to the maintenance group. As a result normal reads/writes now compete with maintenance operations, raising their latency significantly. Fix by sending the isolation cookie for all connections. With this, a 2-node cassandra-stress load has its 99th percentile latency increase by just 3ms during repair, compared to 10ms+ before. Fixes #9505. Closes #10673 (cherry picked from commit `c83393e819`)	2022-07-05 13:42:10 +03:00
Nadav Har'El	43f82047b9	Merge 'types: fix is_string for reversed types' from Piotr Sarna Checking if the type is string is subtly broken for reversed types, and these types will not be recognized as strings, even though they are. As a result, if somebody creates a column with DESC order and then tries to use operator LIKE on it, it will fail because the type would not be recognized as a string. Fixes #10183 Closes #10181 * github.com:scylladb/scylla: test: add a case for LIKE operator on a descending order column types: fix is_string for reversed types (cherry picked from commit `733672fc54`)	2022-07-03 17:59:56 +03:00
Benny Halevy	ec3c07de6e	compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group It was assumed that offstrategy compaction is always triggered by streaming/repair where it would inherit the caller's scheduling group. However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see how the expiration of this timer will inherit anything from streaming/repair. Also, since `d309a86`, offstrategy compaction may be triggered by the api where it will run in the default scheduling group. The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`. Fixes #10151 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com> (cherry picked from commit `0764e511bb`)	2022-07-03 14:30:54 +03:00
Takuya ASADA	82572e8cfe	scylla_coredump_setup: support new format of Storage field The Storage field of "coredumpctl info" changed in systemd v248: it now appends "(present)" to the end of the line when the coredump file is available. Fixes #10669 Closes #10714 (cherry picked from commit `ad2344a864`)	2022-07-03 13:55:25 +03:00
Nadav Har'El	2b9ed79c6f	alternator: forbid empty AttributesToGet In DynamoDB one can retrieve only a subset of the attributes using the AttributesToGet or ProjectionExpression parameters to read requests. Neither allows an empty list of attributes - if you don't want any attributes, you should use Select=COUNT instead. Currently we correctly refuse an empty ProjectionExpression - and have a test for it: test_projection_expression.py::test_projection_expression_toplevel_syntax However, Alternator is missing the same empty-forbidding logic for AttributesToGet. An empty AttributesToGet is currently allowed, and basically says "retrieve everything", which is sort of unexpected. So this patch adds the missing logic, and the missing test (actually two tests for the same thing - one using GetItem and the other Query). Fixes #10332 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-1-nyh@scylladb.com> (cherry picked from commit `9c1ebdceea`)	2022-07-03 13:36:02 +03:00
Avi Kivity	ab0b6fd372	Update seastar submodule (json crash in describe_ring) * seastar 7a430a0830...8b2c13b346 (1): > Merge 'stream_range_as_array: always close output stream' from Benny Halevy Fixes #10592.	2022-06-08 16:49:53 +03:00
Nadav Har'El	12f1718ef4	alternator: allow DescribeTimeToLive even without TTL enabled We still consider the TTL support in Alternator to be experimental, so we don't want to allow a user to enable TTL on a table without turning on a "--experimental-features" flag. However, there is no reason not to allow the DescribeTimeToLive call when this experimental flag is off - this call would simply reply with the truth - that the TTL feature is disabled for the table! This is important for client code (such as the Terraform module described in issue #10660) which uses DescribeTimeToLive for information, even when it never intends to actually enable TTL. The patch is trivial - we simply remove the flag check in DescribeTimeToLive, the code works just as before. After this patch, the following test now works on Scylla without experimental flags turned on: test/alternator/run test_ttl.py::test_describe_ttl_without_ttl Refs #10660 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `8ecf1e306f`)	2022-05-30 20:40:34 +03:00
Tomasz Grabiec	322dfe8403	sstable: partition_index_cache: Fix abort on bad_alloc during page loading When entry loading fails and there is another request blocked on the same page, attempt to erase the failed entry will abort because that would violate entry_ptr guarantees, which is supposed to keep the entry alive. The fix in `92727ac36c` was incomplete. It only helped for the case of a single loader. This patch makes a more general approach by relaxing the assert. The assert manifested like this: scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed. Fixes #10617 Closes #10653 (cherry picked from commit `f87274f66a`)	2022-05-30 13:00:46 +03:00
Beni Peled	11f008e8fd	release: prepare for 4.6.4	2022-05-16 15:20:35 +03:00
Benny Halevy	fd7314a362	table: clear: serialize with ongoing flush Get all flush permits to serialize with any ongoing flushes and preventing further flushes during table::clear, in particular calling discard_completed_segments for every table and clearing the memtables in clear_and_add. Fixes #10423 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `aae532a96b`)	2022-05-15 13:43:43 +03:00
Raphael S. Carvalho	d27468f078	compaction: LCS: don't write to disengaged optional on compaction completion Dtest triggers the problem by: 1) creating table with LCS 2) disabling regular compaction 3) writing a few sstables 4) running maintenance compaction, e.g. cleanup Once the maintenance compaction completes, disengaged optional _last_compacted_keys triggers an exception in notify_completion(). _last_compacted_keys is used by regular compaction for its round-robin file picking policy. It stores the last compacted key for each level, meaning it's irrelevant for any other compaction type. Regular compaction is responsible for initializing it when it runs for the first time to pick files. But with it disabled, notify_completion() will find it uninitialized, therefore resulting in bad_optional_access. To fix this, the procedure is skipped if _last_compacted_keys is disengaged. Regular compaction, once re-enabled, will be able to fill _last_compacted_keys by looking at metadata of the files. compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy] now passes. Fixes #10378. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #10508 (cherry picked from commit `8e99d3912e`)	2022-05-15 13:20:30 +03:00
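The guard described in the commit above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual LeveledCompactionStrategy code: `lcs_state_sketch`, `notify_completion_sketch` and the use of `std::string` keys are hypothetical stand-ins. Only the idea of skipping the update when `_last_compacted_keys` is disengaged comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the strategy state; not Scylla's real type.
struct lcs_state_sketch {
    // Engaged only after regular compaction has initialized it (one key per level).
    std::optional<std::vector<std::string>> last_compacted_keys;
};

// Records the last compacted key for `level`.
// Returns false and leaves the state untouched when the optional is disengaged,
// instead of calling .value() and throwing std::bad_optional_access.
inline bool notify_completion_sketch(lcs_state_sketch& st, std::size_t level,
                                     const std::string& last_key) {
    if (!st.last_compacted_keys) {
        return false;  // nothing to update: regular compaction has not run yet
    }
    auto& keys = *st.last_compacted_keys;
    if (level >= keys.size()) {
        keys.resize(level + 1);  // grow so that `level` is a valid index
    }
    keys[level] = last_key;
    return true;
}
```

With a default-constructed `lcs_state_sketch`, a call such as `notify_completion_sketch(st, 1, "k")` is a no-op, which mirrors a maintenance compaction finishing while regular compaction is disabled.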
Juliusz Stasiewicz	74ef1ee961	CQL: Replace assert by exception on invalid auth opcode One user observed this assertion fail, but it's an extremely rare event. The root cause - interlacing of processing STARTUP and OPTIONS messages - is still there, but now it's harmless enough to leave it as is. Fixes #10487 Closes #10503 (cherry picked from commit `603dd72f9e`)	2022-05-10 14:03:03 +02:00
Benny Halevy	07549d159c	compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates _estimated_remaining_tasks gets updated via get_next_non_expired_sstables -> get_compaction_candidates, but otherwise if we return earlier from get_sstables_for_compaction, it does not get updated and may go out of sync. Refs #10418 (to be closed when the fix reaches branch-4.6) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10419 (cherry picked from commit `01f41630a5`)	2022-05-09 09:36:22 +03:00
Eliran Sinvani	189bbcd82d	prepared_statements: Invalidate batch statement too It seems that batch prepared statements always return false for depends_on. This in turn makes the removal criterion for the prepared statements cache always false, which results in the queries not being evicted. Here we change the function to return the true state, meaning that it returns true if one of the sub-queries depends on the keyspace and/or column family. Fixes #10129 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> (cherry picked from commit `4eb0398457`)	2022-05-08 12:33:00 +03:00
Eliran Sinvani	70e6921125	cql3 statements: Change dependency test API to express better its purpose Cql statements used to have two API functions, depends_on_keyspace and depends_on_column_family. The latter took as a parameter only a table name, which makes no sense. There could be multiple tables with the same name, each in a different keyspace, and it doesn't make sense to generalize the test - i.e. to ask "Does a statement depend on any table named XXX?" In this change we unify the two calls into one - depends_on, which takes a keyspace name and optionally also a table name. That way every logical dependency test that makes sense is supported by a single API call. (cherry picked from commit `bf50dbd35b`) Ref #10129	2022-05-08 12:32:41 +03:00
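The two commits above (a single dependency check taking a keyspace and an optional table, and a batch that reports a dependency when any of its sub-statements has one) can be sketched as below. All type and member names here are hypothetical and do not reproduce Scylla's actual classes or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical single-table statement; not Scylla's cql_statement.
struct statement_sketch {
    std::string keyspace;
    std::string table;

    // True if the statement touches keyspace `ks` and, when `cf` is given,
    // also that specific table within the keyspace.
    bool depends_on(std::string_view ks, std::optional<std::string_view> cf) const {
        return keyspace == ks && (!cf || table == *cf);
    }
};

// Hypothetical batch: depends on (ks, cf) if any sub-statement does.
struct batch_sketch {
    std::vector<statement_sketch> statements;

    bool depends_on(std::string_view ks, std::optional<std::string_view> cf) const {
        return std::any_of(statements.begin(), statements.end(),
                           [&](const statement_sketch& s) { return s.depends_on(ks, cf); });
    }
};
```

A cache invalidation pass could then evict a prepared batch whenever `depends_on(ks, cf)` is true for the changed keyspace or table. A batch that always answered false would never be evicted, which is the behavior the first commit describes.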
Calle Wilund	e314158708	cdc: Ensure columns removed from log table are registered as dropped If we are redefining the log table, we need to ensure any dropped columns are registered in "dropped_columns" table, otherwise clients will not be able to read data older than now. Includes unit test. Should probably be backported to all CDC enabled versions. Fixes #10473 Closes #10474 (cherry picked from commit `78350a7e1b`)	2022-05-05 11:34:56 +02:00
Tomasz Grabiec	46586532c9	loading_cache: Make invalidation take immediate effect There are two issues with current implementation of remove/remove_if: 1) If it happens concurrently with get_ptr(), the latter may still populate the cache using value obtained from before remove() was called. remove() is used to invalidate caches, e.g. the prepared statements cache, and the expected semantic is that values calculated from before remove() should not be present in the cache after invalidation. 2) As long as there is any active pointer to the cached value (obtained by get_ptr()), the old value from before remove() will be still accessible and returned by get_ptr(). This can make remove() have no effect indefinitely if there is persistent use of the cache. One of the user-perceived effects of this bug is that some prepared statements may not get invalidated after a schema change and still use the old schema (until next invalidation). If the schema change was modifying UDT, this can cause statement execution failures. CQL coordinator will try to interpret bound values using old set of fields. If the driver uses the new schema, the coordinator will fail to process the value with the following exception: User Defined Type value contained too many fields (expected 5, got 6) The patch fixes the problem by making remove()/remove_if() erase old entries from _loading_values immediately. The predicate-based remove_if() variant has to also invalidate values which are concurrently loading to be safe. The predicate cannot be evaluated on values which are not ready. This may invalidate some values unnecessarily, but I think it's fine. Fixes #10117 Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com> (cherry picked from commit `8fa704972f`)	2022-05-04 15:38:11 +03:00
Avi Kivity	0114244363	Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop. Fixes: #10450 Closes #10451 * github.com:scylladb/scylla: replica/database: drop_column_family(): drop querier cache entries after waiting for ops replica/database: finish coroutinizing drop_column_family() replica/database: make remove(const column_family&) private (cherry picked from commit `7f1e368e92`)	2022-05-01 17:11:52 +03:00
Avi Kivity	f154c8b719	Update tools/java submodule (bad IPv6 addresses in nodetool) * tools/java 05ec511bbb...46744a92ff (1): > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index Fixes #10442	2022-04-28 11:35:09 +03:00
Beni Peled	8bf149fdd6	release: prepare for 4.6.3	2022-04-14 14:16:52 +03:00
Tomasz Grabiec	0265d56173	utils/chunked_managed_vector: Fix sigsegv during reserve() Fixes the case of make_room() invoked with last_chunk_capacity_deficit but _size not in the last reserved chunk. Found during code review, no user impact. Fixes #10364. Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com> (cherry picked from commit `0c365818c3`)	2022-04-13 10:29:30 +03:00
Tomasz Grabiec	e50452ba43	utils/chunked_vector: Fix sigsegv during reserve() Fixes the case of make_room() invoked with last_chunk_capacity_deficit but _size not in the last reserved chunk. Found during code review, no known user impact. Fixes #10363. Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com> (cherry picked from commit `01eeb33c6e`) [avi: make max_chunk_capacity() public for backport]	2022-04-13 10:29:03 +03:00
Avi Kivity	a205f644cb	transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3 we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since the client won't understand the v4 errors), but we still send the new error codes. This causes the client to become confused. Fix by updating the error codes. A better fix is to move the error code from the constructor parameter list and hard-code it in the constructor, but that is left for a follow-up after this minimal fix. Fixes #5610. Closes #10362 (cherry picked from commit `987e6533d2`)	2022-04-13 09:49:02 +03:00
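The downgrade described in the commit above amounts to a small mapping from error code and protocol version to the code actually sent. The sketch below is an illustration only: the numeric values are the ones defined in the CQL native protocol specification, while `cql_error_sketch` and `error_code_for_version` are hypothetical names, not Scylla's transport code.

```cpp
#include <cassert>
#include <cstdint>

// Error codes as defined by the CQL native protocol specification.
enum class cql_error_sketch : std::int32_t {
    write_timeout = 0x1100,
    read_timeout  = 0x1200,
    read_failure  = 0x1300,  // added in protocol v4
    write_failure = 0x1500,  // added in protocol v4
};

// Hypothetical helper: picks the error code to put on the wire.
// Clients older than v4 do not know the *_FAILURE codes, so both the
// exception and its code are downgraded to the matching *_TIMEOUT.
constexpr cql_error_sketch error_code_for_version(cql_error_sketch code, int protocol_version) {
    if (protocol_version >= 4) {
        return code;
    }
    switch (code) {
    case cql_error_sketch::read_failure:
        return cql_error_sketch::read_timeout;
    case cql_error_sketch::write_failure:
        return cql_error_sketch::write_timeout;
    default:
        return code;
    }
}
```

The bug the commit fixes corresponds to downgrading the exception type while still sending the original `0x1300` or `0x1500` code to a v3 client.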
Tomasz Grabiec	f136b5b950	utils/chunked_managed_vector: Fix corruption in case there is more than one chunk If reserve() allocates more than one chunk, push_back() should not work with the last chunk. This can result in items being pushed to the wrong chunk, breaking internal invariants. Also, pop_back() should not work with the last chunk. This breaks when there is more than one chunk. Currently, the container is only used in the sstable partition index cache. Manifests by crashes in sstable reader which touch sstables which have partition index pages with more than 1638 partition entries. Introduced in `78e5b9fd85` (4.6.0) Fixes #10290 Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com> (cherry picked from commit `41fe01ecff`)	2022-04-08 10:53:52 +03:00
Takuya ASADA	69a1325884	docker: enable --log-to-stdout which was mistakenly disabled Since our Docker image moved to Ubuntu, we mistakenly copy dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not used in Ubuntu (it should be /etc/default). So /etc/default/scylla-server is just the default configuration of the scylla-server .deb package, and --log-to-stdout is 0, same as in a normal installation. We don't want to keep the duplicated configuration file anyway, so let's drop dist/docker/etc/sysconfig/scylla-server and configure /etc/default/scylla-server in build_docker.sh. Fixes #10270 Closes #10280 (cherry picked from commit `bdefea7c82`)	2022-04-07 12:13:35 +03:00
Avi Kivity	ab153c9b94	Update seastar submodule (logger deadlock with large messages) * seastar 34e58f9995...94a462d94b (2): > log: Fix silencer to be shard-local and logger-global > log: Silence logger when logging Fixes #10336.	2022-04-05 19:43:49 +03:00
Beni Peled	eb372d7f03	release: prepare for 4.6.2	2022-04-05 16:59:53 +03:00
Takuya ASADA	e232711e7e	docker: run scylla as root Previous versions of the Docker image ran scylla as root, but `cb19048` accidentally changed it to the scylla user. To keep compatibility we need to revert this to root. Fixes #10261 Closes #10325 (cherry picked from commit `f95a531407`)	2022-04-05 12:46:12 +03:00
Takuya ASADA	0a440b6d4a	docker: revert scylla-server.conf service name change We changed supervisor service name at `cb19048`, but this breaks compatibility with scylla-operator. To fix the issue we need to revert the service name to previous one. Fixes #10269 Closes #10323 (cherry picked from commit `41edc045d9`)	2022-04-05 12:42:36 +03:00
Piotr Sarna	00bb1e8145	cql3: fix qualifying restrictions with IN for indexing When a query contains an IN restriction on its partition key, it's currently not eligible for indexing. It was however erroneously qualified as such, which led to fetching incorrect results. This commit fixes the issue by not allowing such queries to undergo indexing, and comes with a regression test. Fixes #10300 Closes #10302 (cherry picked from commit `c0fd53a9d7`)	2022-04-03 11:21:43 +03:00
Avi Kivity	e30dbee2db	Update seastar submodule (pidof command not installed) * seastar 50e1549b2c...34e58f9995 (1): > seastar-cpu-map.sh: switch from pidof to pgrep Fixes #10238.	2022-03-29 12:40:17 +03:00
Beni Peled	2309d6b51e	release: prepare for 4.6.1	2022-03-28 10:57:31 +03:00
Benny Halevy	b77ca07709	atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal Following up on `a57c087c89`, compare_atomic_cell_for_merge should compare the ttl value in the reverse order since, when comparing two cells that are identical in all attributes but their ttl, we want to keep the cell with the smaller ttl value rather than the larger ttl, since it was written at a later (wall-clock) time, and so would remain longer after it expires, until purged after gc_grace seconds. Fixes #10173 Test: mutation_test.test_cell_ordering, unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com> (cherry picked from commit `a085ef74ff`)	2022-03-24 18:08:07 +02:00
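The tie-break introduced by this commit and its predecessor `a57c087c89` can be illustrated with a simplified comparator. This is a sketch under assumptions: the real compare_atomic_cell_for_merge also considers liveness and cell values and returns `std::strong_ordering`, whereas `cell_sketch`, `three_way` and `compare_for_merge_sketch` are hypothetical and return an `int`. The sketch also assumes that the cell comparing greater wins the merge and that a later timestamp or later expiry compares greater.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of an expiring live cell.
struct cell_sketch {
    std::int64_t timestamp;  // write timestamp
    std::int64_t expiry;     // absolute expiry time
    std::int64_t ttl;        // time-to-live the cell was written with
};

// Classic three-way comparison: negative, zero or positive.
constexpr int three_way(std::int64_t a, std::int64_t b) {
    return (a > b) - (a < b);
}

// Positive result: `left` wins the merge; negative: `right` wins; zero: equal.
constexpr int compare_for_merge_sketch(const cell_sketch& left, const cell_sketch& right) {
    if (int c = three_way(left.timestamp, right.timestamp)) {
        return c;
    }
    if (int c = three_way(left.expiry, right.expiry)) {
        return c;
    }
    // Equal expiry: compare ttl in reverse, so the cell with the smaller ttl
    // (written later in wall-clock time) compares greater and is kept.
    return three_way(right.ttl, left.ttl);
}
```

Because cells that differ only in ttl no longer compare equal, the merge result is deterministic and matches what the cells' differing hashes already imply, which is the repair issue described in `a57c087c89`.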
Benny Halevy	bb0a38f889	atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge currently returns std::strong_ordering::equal if two cells are equal in every way except their ttl:s. The problem with that is that the cells' hashes are different and this will cause repair to keep trying to repair discrepancies caused by the ttl being different. This may be triggered by e.g. the spark migrator that computes the ttl based on the expiry time by subtracting the current time from the expiry time to produce a respective ttl. If the cell is migrated multiple times at different times, it will generate cells that have the same expiry (by design) but have different ttl values. Fixes #10156 Test: mutation_test.test_cell_ordering, unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com> (cherry picked from commit `a57c087c89`)	2022-03-24 18:08:07 +02:00
Benny Halevy	c48fd03463	atomic_cell: compare_atomic_cell_for_merge: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com> (cherry picked from commit `d43da5d6dc`) Ref #10156	2022-03-24 18:07:54 +02:00
Benny Halevy	eb78e6d4b8	atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deletion_time comparison No need to first check that the cells' expiry is different or that deletion_time is different before comparing them with `<=>`. If they are the same the function returns std::strong_ordering::equal anyhow and that is the same as `<=>` comparing identical values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com> (cherry picked from commit `be865a29b8`) Ref #10156	2022-03-24 18:07:32 +02:00
Avi Kivity	4b1b0a55c0	replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc compare_atomic_cell_for_merge() was placed in database.cc, before atomic_cell.cc existed. Move it to its correct place. Closes #9889 (cherry picked from commit `6c53717a39`)	2022-03-24 18:07:11 +02:00
Benny Halevy	172a8628d5	main: shutdown: do not abort on certain system errors Currently any unhandled error during deferred shutdown is rethrown in a noexcept context (in ~deferred_action), generating a core dump. The core dump is not helpful if the cause of the error is "environmental", i.e. in the system, rather than in scylla itself. This change detects several such errors and calls _Exit(255) to exit the process early, without leaving a coredump behind. Otherwise, call abort() explicitly, rather than letting terminate() be called implicitly by the destructor exception handling code. Fixes #9573 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com> (cherry picked from commit `132c9d5933`)	2022-03-24 14:49:24 +02:00
Nadav Har'El	5688b125e6	Seastar: backport Seastar fix for missing string escape in JSON output Backported Seastar fix: > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz Fixes #9061 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-23 21:27:13 +02:00
Piotr Sarna	6da4acb41e	expression: fix get_value for mismatched column definitions As observed in #10026, after schema changes it somehow happened that a column definition that does not match any of the base table columns was passed to expression verification code. The function that looks up the index of a column happens to return -1 when it doesn't find anything, so using this returned index without checking if it's nonnegative results in accessing invalid vector data, and a segfault or silent memory corruption. Therefore, an explicit check is added to see if the column was actually found. This serves two purposes: - avoiding segfaults/memory corruption - making it easier to investigate the root cause of #10026 Closes #10039 (cherry picked from commit 7b364fec9849e9a342af1c240e3a7185bf5401ef)	2022-03-21 10:46:34 +01:00
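The failure mode in the commit above, using a "not found" index of -1 to subscript a vector, and the explicit check that prevents it can be sketched like this. The names `index_of_sketch` and `get_value_sketch` and the container types are hypothetical. The sketch reports a missing column through an empty `std::optional`, which is only one possible way to surface the error and is not necessarily what the actual patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical lookup that, like the function in the commit message,
// returns -1 when the column is not found.
inline int index_of_sketch(const std::vector<std::string>& columns, const std::string& name) {
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (columns[i] == name) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Returns the value of column `name` in `row`, or std::nullopt when the
// column is unknown. Without the idx < 0 check, row[-1] would read
// out-of-bounds memory (undefined behavior).
inline std::optional<int> get_value_sketch(const std::vector<std::string>& columns,
                                           const std::vector<int>& row,
                                           const std::string& name) {
    const int idx = index_of_sketch(columns, name);
    if (idx < 0 || static_cast<std::size_t>(idx) >= row.size()) {
        return std::nullopt;
    }
    return row[static_cast<std::size_t>(idx)];
}
```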
Botond Dénes	f09cc9a01d	Merge 'service: storage_service: announce new CDC generation immediately with RBNO' from Kamil Braun When a new CDC generation is created (during bootstrap or otherwise), it is assigned a timestamp. The timestamp must be propagated as soon as possible, so all live nodes can learn about the generation before their clocks reach the generation's timestamp. The propagation mechanism for generation timestamps is gossip. When bootstrap RBNO was enabled this was not the case: the generation timestamp was inserted into gossiper state too late, after the repair phase finished. Fix this. Also remove an obsolete comment. Fixes https://github.com/scylladb/scylla/issues/10149. Closes #10154 * github.com:scylladb/scylla: service: storage_service: announce new CDC generation immediately with RBNO service: storage_service: fix indentation (cherry picked from commit `f1b2ff1722`)	2022-03-16 12:27:24 +01:00
Raphael S. Carvalho	cd2e33ede4	compaction_manager: Abort reshape for tables waiting for a chance to run Tables waiting for a chance to run reshape wouldn't trigger stop exception, as the exception was only being triggered for ongoing compactions. Given that stop reshape API must abort all ongoing tasks and all pending ones, let's change run_custom_job() to trigger the exception if it found that the pending task was asked to stop. Tests: dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces unit: dev Fixes #9836. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com> (cherry picked from commit `07fba4ab5d`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220311183053.46625-1-raphaelsc@scylladb.com>	2022-03-15 16:58:47 +02:00
Benny Halevy	32d0698d78	compaction_manager: rewrite_sstables: do not acquire table write lock Since regular compaction may run in parallel no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain `6737c88045`. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in #10060, and possibly #10166). Fixes #10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10177 (cherry picked from commit `11ea2ffc3c`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>	2022-03-14 18:15:49 +02:00
Piotr Jastrzebski	93cf43ae4b	cdc: Handle compact storage correctly in preimage Base tables that use compact storage may have a special artificial column that has an empty type. `c010cefc4d` fixed the main CDC path to handle such columns correctly and to not include them in the CDC Log schema. This patch makes sure that generation of preimage ignores such empty column as well. Fixes #9876 Closes #9910 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> (cherry picked from commit `09d4438a0d`)	2022-03-10 14:25:02 +02:00
Nadav Har'El	2f2d22a864	cql: INSERT JSON should refuse empty-string partition key Add the missing partition-key validation in INSERT JSON statements. Scylla, following the lead of Cassandra, forbids an empty-string partition key (please note that this is not the same as a null partition key, and that null clustering keys are allowed). Trying to INSERT, UPDATE or DELETE a partition with an empty string as the partition key fails with a "Key may not be empty". However, we had a loophole - you could insert such empty-string partition keys using an "INSERT ... JSON" statement. The problem was that the partition key validation was done in one place - `modification_statement::build_partition_keys()`. The INSERT, UPDATE and DELETE statements all inherited this same method and got the correct validation. But the INSERT JSON statement - insert_prepared_json_statement overrode the build_partition_keys() method and this override forgot to call the validation function. So in this patch we add the missing validation. Note that the validation function checks for more than just empty strings - there is also a length limit for partition keys. This patch also adds a cql-pytest reproducer for this bug. Before this patch, the test passed on Cassandra but failed on Scylla. Reported by @FortTell Fixes #9853. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220116085216.21774-1-nyh@scylladb.com> (cherry picked from commit `8fd5041092`)	2022-03-02 22:00:15 +02:00
Avi Kivity	5f92f54f06	Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec cached_page::on_evicted() is invoked in the LSA allocator context, set in the reclaimer callback installed by the cache_tracker. However, cached_pages are allocated in the standard allocator context (note: page content is allocated inside LSA via lsa_buffer). The LSA region will happily deallocate these, thinking that these are large objects which were delegated to the standard allocator. But the _non_lsa_memory_in_use metric will underflow. When it underflows enough, shard_segment_pool.total_memory() will become 0 and memory reclamation will stop doing anything, leading to apparent OOM. The fix is to switch to the standard allocator context inside cached_page::on_evicted(). evict_range() was also given the same treatment as a precaution, it currently is only invoked in the standard allocator context. The series also adds two safety checks to LSA to catch such problems earlier. Fixes #10056 \cc @slivne @bhalevy Closes #10130 * github.com:scylladb/scylla: lsa: Abort when trying to free a standard allocator object not allocated through the region lsa: Abort when _non_lsa_memory_in_use goes negative tests: utils: cached_file: Validate occupancy after eviction test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch utils: cached_file: Fix alloc-dealloc mismatch during eviction (cherry picked from commit `ff2cd72766`)	2022-02-26 11:28:53 +02:00
Benny Halevy	395f2459b4	cql3: result_set: remove std::ref from comperator& Applying std::ref on `RowComparator& cmp` hits the following compilation error on Fedora 34 with libstdc++-devel-11.2.1-9.fc34.x86_64 ``` FAILED: build/dev/cql3/statements/select_statement.o clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1 -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20 -ffile-prefix-map=/home/bhalevy/dev/scylla=. -march=westmere -DBOOST_TEST_DYN_LINK -Iabseil -fvisibility=hidden -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc In file included from cql3/statements/select_statement.cc:14: In file included from 
./cql3/statements/select_statement.hh:16: In file included from ./cql3/statements/raw/select_statement.hh:16: In file included from ./cql3/statements/raw/cf_statement.hh:16: In file included from ./cql3/cf_name.hh:16: In file included from ./cql3/keyspace_element_name.hh:16: In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13: In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58: /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))> ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const 
std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here reference_wrapper(_Up&& __uref) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)] = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>; ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here : public __is_nothrow_constructible_impl<_Tp, _Args...>::type ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here : public conditional<_B1::value, _B2, _B1>::type ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all) return __and_<typename 
_Base::_Local_storage, ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__partial_sort(__first, __last, __last, __comp); ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__introsort_loop(__first, __last, ^ /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const 
std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp)); ^ ./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here std::sort(_rows.begin(), _rows.end(), std::ref(cmp)); ^ cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here rs->sort(_ordering_comparator); ^ 1 error generated. ninja: build stopped: subcommand failed. ``` Fixes #10079. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com> (cherry picked from commit `3e20fee070`) [avi: backport for developer quality-of-life rather than as a bug fix]	2022-02-16 10:08:24 +02:00
Raphael S. Carvalho	019d50bb5c	Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME" This reverts commit `4c05e5f966`. Moving cleanup to maintenance group made its operation time up to 10x slower than previous release. It's a blocker to 4.6 release, so let's revert it until we figure this all out. Probably this happens because maintenance group is fixed at a relatively small constant, and cleanup may be incrementally generating backlog for regular compaction, where the former is fighting for resources against the latter. Fixes #10060. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220213165147.56204-1-raphaelsc@scylladb.com> Ref: `a9427f150a`	2022-02-14 12:10:38 +02:00
Avi Kivity	bbe775b926	utils: logalloc: correct and adjust timing unit in stall report The stall report uses the millisecond unit, but actually reports nanoseconds. Switch to microseconds (milliseconds are a bit too coarse) and use the safer "duration / 1us" style rather than "duration::count()" that leads to unit confusion. Fixes #9733. Closes #9734 (cherry picked from commit `f907205b92`)	2022-02-12 15:56:42 +02:00
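The "duration / 1us" style mentioned in this commit can be illustrated with a minimal sketch using only `std::chrono`. The helper names `to_micros` and `raw_ticks` are hypothetical and do not come from the Scylla source; the sketch only shows why dividing by a unit duration keeps the unit explicit while `count()` does not.

```cpp
#include <chrono>
#include <cstdint>

using namespace std::chrono_literals;

// Hypothetical helper (not from the Scylla source): express a duration as a
// whole number of microseconds by dividing it by a 1us unit duration.
// Dividing one duration by another yields a plain number, so the unit is
// stated at the call site regardless of the duration's own period.
template <typename Rep, typename Period>
std::int64_t to_micros(std::chrono::duration<Rep, Period> d) {
    return d / 1us;
}

// The error-prone alternative: count() returns ticks in whatever period the
// duration type happens to have (nanoseconds here), which may not match the
// unit named in the log message.
inline std::int64_t raw_ticks(std::chrono::nanoseconds d) {
    return d.count();
}
```

With these definitions, `to_micros(std::chrono::nanoseconds(2500000))` evaluates to 2500, while `raw_ticks` on the same value returns the nanosecond tick count 2500000.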
Yaron Kaikov	469c94ea17	release: prepare for 4.6.0	2022-02-08 16:45:50 +02:00
Nadav Har'El	4c780d0265	alternator: allow REMOVE of non-existent nested attribute DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x exists in the item, but x.y doesn't - the removal silently does nothing. Alternator incorrectly generated an error in this case, and unfortunately we didn't have a test for this case. So in this patch we add the missing test (which fails on Alternator before this patch - and passes on DynamoDB) and then fix the behavior. After this patch, "REMOVE x.y" will remain an error if "x" doesn't exist (saying "document paths not valid for this item"), but if "x" exists and is a map, but "x.y" doesn't, the removal will silently do nothing and will not be an error. Fixes #10043. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220207133652.181994-1-nyh@scylladb.com> (cherry picked from commit `9982a28007`)	2022-02-08 11:48:18 +02:00
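The "REMOVE x.y" rule described in this commit can be sketched as follows. This is only an illustration under simplifying assumptions: the item model (every top-level attribute is a map of strings), the `remove_result` enum, and the `remove_nested` helper are all hypothetical and are not Alternator's actual types or API.

```cpp
#include <map>
#include <string>

// Simplified item model (an assumption for illustration): every top-level
// attribute is a map from a nested key to a string value.
using nested_map = std::map<std::string, std::string>;
using item = std::map<std::string, nested_map>;

enum class remove_result { removed, nothing_to_remove, invalid_path };

// Hypothetical helper mirroring the behaviour described for "REMOVE x.y":
// - if the top-level map "x" is missing, the document path is invalid;
// - if "x" exists but "x.y" does not, the removal silently does nothing;
// - otherwise the nested attribute is erased.
inline remove_result remove_nested(item& doc, const std::string& top,
                                   const std::string& key) {
    auto parent = doc.find(top);
    if (parent == doc.end()) {
        return remove_result::invalid_path;
    }
    return parent->second.erase(key) > 0 ? remove_result::removed
                                         : remove_result::nothing_to_remove;
}
```

Returning an enum instead of throwing keeps the three outcomes visible to the caller; a real implementation would map `invalid_path` to the "document paths not valid for this item" error quoted in the commit.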
Michael Livshin	0181de1f2c	shard_reader: check that _reader is valid before dereferencing After `fc729a804`, `shard_reader::close()` is not interrupted with an exception any more if read-ahead fails, so `_reader` may in fact be null. Fixes #9923 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220117120405.152927-1-michael.livshin@scylladb.com> (cherry picked from commit `d7a993043d`)	2022-02-07 10:10:58 +02:00
Benny Halevy	7597a79ef9	shard_reader: Continue after read_ahead error If read ahead failed, just issue a log warning and proceed to close the reader. Currently co_await will throw and the evictable reader won't be closed. This is seen occasionally in testing, e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/1010/artifact/logs-all.debug.2/1640918573898_lwt_banking_load_test.py%3A%3ATestLWTBankingLoad%3A%3Atest_bank_with_nemesis/node2.log ``` ERROR 2021-12-31 02:40:56,160 [shard 0] mutation_reader - shard_reader::close(): failed to stop reader on shard 1: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem) ``` Fixes #9865. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220102124636.2791544-1-bhalevy@scylladb.com> (cherry picked from commit `fc729a804b`)	2022-02-07 10:09:05 +02:00
Nadav Har'El	8f5148e921	docker: don't repeat "--alternator-address" option twice If the Docker startup script is passed both "--alternator-port" and "--alternator-https-port", a combination which is supposed to be allowed, it passes to Scylla the "--alternator-address" option twice. This isn't necessary, and worse - not allowed. So this patch fixes the scyllasetup.py script to only pass this parameter once. Fixes #10016. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220202165814.1700047-1-nyh@scylladb.com> (cherry picked from commit `cb6630040d`)	2022-02-03 18:39:47 +02:00
Yaron Kaikov	5694ec189f	release: prepare for 4.6.rc5	2022-02-03 16:19:46 +02:00
Calle Wilund	34d470967a	commitlog: Fix double clearing of _segment_allocating shared_future. Fixes #10020 Previous fix `445e1d3` tried to close one double invocation, but added another, since it failed to ensure all potential nullings of the opt shared_future happened before a new allocator could reset it. This simplifies the code by making clearing the shared_future a pre-requisite for resolving its contents (as read by waiters). Also removes any need for try-catch etc. Closes #10024 (cherry picked from commit `1e66043412`)	2022-02-03 07:43:18 +02:00
Calle Wilund	61db571a44	commitlog: Ensure we never have more than one new_segment call at a time Refs #9896 Found by @eliransin. Call to new_segment was wrapped in with_timeout. This means that if primary caller timed out, we would leave new_segment calls running, but potentially issue new ones for next caller. This could lead to reserve segment queue being read simultaneously. And it is not what we want. Change to always use the shared_future wait, all callers, and clear it only on result (exception or segment) Closes #10001 (cherry picked from commit `445e1d3e41`)	2022-02-01 09:10:27 +02:00
Tomasz Grabiec	5b5a300a9e	util: cached_file: Fix corruption after memory reclamation was triggered from population If memory reclamation is triggered inside _cache.emplace(), the _cache btree can get corrupted. Reclaimers erase from it, and emplace() assumes that the tree is not modified during its execution. It first locates the target node and then does memory allocation. Fix by running emplace() under allocating section, which disables memory reclamation. The bug manifests with assert failures, e.g: ./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed. Fixes #9915 Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com> (cherry picked from commit `b734615f51`)	2022-01-31 01:24:47 +02:00
Avi Kivity	148a65d0d6	Update seastar submodule (gratuitous exceptions on allocation failure) * seastar a189cdc45d...a375681303 (1): > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed Fixes #9982.	2022-01-30 20:02:24 +02:00
Avi Kivity	e3ad14d55f	Point seastar submodule at scylla-seastar.git This allows us to backport fixes to seastar selectively.	2022-01-30 20:01:12 +02:00
Calle Wilund	2b506c2d4a	commitlog: Ensure we don't run continuation (task switch) with queues modified Fixes #9955 In #9348 we handled the problem of failing to delete segment files on disk, and the need to recompute disk footprint to keep data flow consistent across intermittent failures. However, because _reserve_segments and _recycled_segments are queues, we have to empty them to inspect the contents. One would think it is ok for these queues to be empty for a while, whilst we do some recalculating, including disk listing -> continuation switching. But then one (i.e. I) misses the fact that these queues use the pop_eventually mechanism, which does _not_ handle a scenario where we push something into an empty queue, thus triggering the future that resumes a waiting task, but then pop the element immediately, before the waiting task is run. In fact, _iff_ one does this, not only will things break, they will in fact start creating undefined behaviour, because the underlying std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push operations -> we will pop an empty queue, immediately making it non-empty, but using undefined memory (with luck null/zeroes). Strictly speaking, seastar::queue::pop_eventually should be fixed to handle the scenario, but nonetheless we can fix the usage here as well, by simply copying objects and doing the calculation "in background" while we potentially start popping queue again. Closes #9966 (cherry picked from commit `43f51e9639`)	2022-01-27 10:24:03 +02:00
Avi Kivity	50aad1c668	Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA We found that monitor mode of mdadm does not work on RAID0, and it is not a bug, expected behavior according to RHEL developer. Therefore, we should stop enabling mdmonitor when RAID0 is specified. Fixes #9540 ---- This reverts `0d8f932` and introduce correct fix. Closes #9970 * github.com:scylladb/scylla: scylla_raid_setup: use mdmonitor only when RAID level > 0 Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8" (cherry picked from commit `df22396a34`)	2022-01-27 10:21:25 +02:00
Yaron Kaikov	7bf3f37cd1	release: prepare for 4.6.rc4	2022-01-23 10:44:09 +02:00
Botond Dénes	0f7f8585f2	reader_permit: release_base_resources(): also update _resources If the permit was admitted, _base_resources was already accounted in _resource and therefore has to be deducted from it, otherwise the permit will think it leaked some resources on destruction. Test: dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count) Refs: #9751 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220119132550.532073-1-bdenes@scylladb.com> (cherry picked from commit `a65b38a9f7`)	2022-01-20 18:39:25 +02:00
Pavel Emelyanov	2c65c4a569	Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec Appending an empty range adjacent to an existing range tombstone would not deoverlap (by dropping the empty range tombstone) resulting in different (non-canonical) result depending on the order of appending. Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x. Appending [a, x) then [x, b) then [x, x) would give [a, b) Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b) The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical. Fixes #9661 Closes #9764 * github.com:scylladb/scylla: range_tombstone_list: Deoverlap adjacent empty ranges range_tombstone_list: Convert to work in terms of position_in_partition (cherry picked from commit `b2a62d2b59`)	2022-01-20 12:35:21 +02:00
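The order-dependence described in this commit, and the effect of dropping empty ranges, can be sketched with a toy model. Everything here is an assumption for illustration: ranges are half-open integer intervals, timestamps are ignored, and the names `range`, `append`, and `build` are hypothetical rather than the real `range_tombstone_list` interface.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-in for a range tombstone: a half-open interval [start, end) over
// integer positions. The real code works on position_in_partition and also
// carries a tombstone, which this sketch ignores.
struct range {
    int start;
    int end;
    bool empty() const { return start >= end; }
};

inline bool operator==(const range& a, const range& b) {
    return a.start == b.start && a.end == b.end;
}

// Append `r` to a list kept sorted by position, assuming appends arrive in
// non-decreasing start order. Empty ranges are dropped; a range that touches
// or overlaps the previous one is merged into it.
inline void append(std::vector<range>& list, range r) {
    if (r.empty()) {
        return;  // dropping [x, x) is what makes the result canonical
    }
    if (!list.empty() && r.start <= list.back().end) {
        list.back().end = std::max(list.back().end, r.end);
        return;
    }
    list.push_back(r);
}

// Convenience wrapper: append every input range in the given order.
inline std::vector<range> build(const std::vector<range>& input) {
    std::vector<range> out;
    for (const range& r : input) {
        append(out, r);
    }
    return out;
}
```

With a = 0, x = 5, b = 10, both append orders from the commit message produce the single range [0, 10) in this model, because the empty range [5, 5) never enters the list.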
Avi Kivity	f85cd289bc	Merge "repair: make sure there is one permit per repair with count res" from Botond " Repair obtains a permit for each repair-meta instance it creates. This permit is supposed to track all resources consumed by that repair as well as ensure concurrency limit is respected. However when the non-local reader path is used (shard config of master != shard config of follower), a second permit will be obtained -- for the shard reader of the multishard reader. This creates a situation where the repair-meta's permit can block the shard permit, creating a deadlock situation. This patch solves this by dropping the count resource on the repair-meta's permit when a non-local reader path is executed -- that is a multishard reader is created. Fixes: #9751 " * 'repair-double-permit-block/v4' of https://github.com/denesb/scylla: repair: make sure there is one permit per repair with count res reader_permit: add release_base_resource() (cherry picked from commit `52b7778ae6`)	2022-01-17 16:02:55 +02:00
Beni Peled	5e661af9a4	release: prepare for 4.6.rc3	2022-01-17 13:11:54 +02:00
Calle Wilund	5629b67d25	messaging_service: Make dc/rack encryption check for connection more strict Fixes #9653 When doing an outgoing connection, in a internode_encryption=dc/rack situation we should not use endpoint/local broadcast solely to determine if we can downgrade a connection. If gossip/message_service determines that we will connect to a different address than the "official" endpoint address, we should use this to determine association of target node, and similarly, if we bind outgoing connection to interface != bc we need to use this to decide local one. Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc until such time that gossip can give accurate info on dc/rack for "internal" ip addresses of nodes. (cherry picked from commit `4778770814`)	2022-01-16 19:10:57 +02:00
Takuya ASADA	ad632cf7fc	dist: fix scylla-housekeeping uuid file chmod call Should use chmod() on a file, not fchmod() Fixes #9683 Closes #9802 (cherry picked from commit `7064ae3d90`)	2022-01-10 16:57:34 +02:00
Botond Dénes	ca24bebcf2	sstables/partition_index_cache: destroy entry ptr on error The error-handling code removes the cache entry but this leads to an assertion because the entry is still referenced by the entry pointer instance which is returned on the normal path. To avoid this clear the pointer on the error path and make sure there are no additional references kept to it. Fixes #9887 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220105140859.586234-2-bdenes@scylladb.com> (cherry picked from commit `92727ac36c`)	2022-01-07 21:21:44 +01:00
Calle Wilund	7dc5abb6f8	commitlog: Don't allow error_handler to swallow exception Fixes #9798 If an exception in allocate_segment_ex is (sub)type of std::system_error, commit_error_handler might _not_ cause throw (doh), in which case the error handling code would forget the current exception and return an unusable segment. Now only used as an exception pointer replacer. Closes #9870 (cherry picked from commit `3c02cab2f7`)	2022-01-06 14:10:18 +02:00
Yaron Kaikov	e8a1cfb6f8	release: prepare for 4.6.rc2	2022-01-02 09:15:47 +02:00
Tomasz Grabiec	fc312b3021	lsa: Fix segment leak on memory reclamation during alloc_buf alloc_buf() calls new_buf_active() when there is no active segment to allocate a new active segment. new_buf_active() allocates memory (e.g. a new segment) so may cause memory reclamation, which may cause segment compaction, which may call alloc_buf() and re-enter new_buf_active(). The first call to new_buf_active() would then override _buf_active and cause the segment allocated during segment compaction to be leaked. This then causes abort when objects from the leaked segment are freed because the segment is expected to be present in _closed_segments, but isn't. boost::intrusive::list::erase() will fail on assertion that the object being erased is linked. Introduced in `b5ca0eb2a2`. Fixes #9821 Fixes #9192 Fixes #9825 Fixes #9544 Fixes #9508 Refs #9573 Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com> (cherry picked from commit `7038dc7003`)	2021-12-30 18:56:28 +02:00
Nadav Har'El	7b82aaf939	alternator: fix error on UpdateTable for non-existent table When the UpdateTable operation is called for a non-existent table, the appropriate error is ResourceNotFoundException, but before this patch we ran into an exception, which resulted in an ugly "internal server error". In this patch we use the existing get_table() function which most other operations use, and which does all the appropriate verifications and generates the appropriate Alternator api_error instead of letting internal Scylla exceptions escape to the user. This patch also includes a test for UpdateTable on a non-existent table, which used to fail before this patch and pass afterwards. We also add a test for DeleteTable in the same scenario, and see it didn't have this bug. As usual, both tests pass on DynamoDB, which confirms we generate the right error codes. Fixes #9747. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211206181605.1182431-1-nyh@scylladb.com> (cherry picked from commit `31eeb44d28`)	2021-12-29 22:59:25 +02:00
Nadav Har'El	894a4abfae	commitlog: fix missing wait for semaphore units Commit `dcc73c5d4e` introduced a semaphore for excluding concurrent recalculations - _reserve_recalculation_guard. Unfortunately, the two places in the code which tried to take this guard just called get_units() - which returns a future<units>, not units - and never waited for this future to become available. So this patch adds the missing "co_await" needed to wait for the units to become available. Fixes #9770. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211214122612.1462436-1-nyh@scylladb.com> (cherry picked from commit `b8786b96f4`)	2021-12-29 13:18:59 +02:00
Takuya ASADA	4dcf023470	scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8 On CentOS8, mdmonitor.service does not work correctly when using mdadm-4.1-15.el8.x86_64 and later versions. Until we find a solution, let's pin the package version to an older one which does not cause the issue (4.1-14.el8.x86_64). Fixes #9540 Closes #9782 (cherry picked from commit `0d8f932f0b`)	2021-12-28 11:38:04 +02:00
Benny Halevy	283788828e	compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested Currently when scrub/validate is stopped (e.g. via the api), scrub_validate_mode_validate_reader co_return:s without closing the reader passed to it - causing a crash due to internal error check, see #9766. A thrown compaction_stopped_exception, unlike a plain co_return, will be handled like any other exception, including closing the reader. Fixes #9766 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com> (cherry picked from commit `c89876c975`)	2021-12-15 15:03:59 +02:00
Pavel Emelyanov	730a147ba6	row-cache: Handle exception (un)safety of rows_entry insertion The B-tree's insert_before() is a throwing operation, so its caller must account for that. When the rows_entry's collection was switched to B-tree all the risky places were fixed by `ee9e1045`, but a few places went under the radar. In the cache_flat_mutation_reader there's a place where a C-pointer is inserted into the tree, thus potentially leaking the entry. In the partition_snapshot_row_cursor there are two places that not only leak the entry, but also leave it in the LRU list. The latter is quite nasty, because those entries can be evicted, eviction code tries to get rows_entry iterator from "this", but the hook happens to be unattached (because insertion threw) and fails the assert. fixes: #9728 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `ee103636ac`)	2021-12-14 15:53:42 +02:00
Pavel Emelyanov	9897e83029	partition_snapshot_row_cursor: Shuffle ensure_result creation Both places get the C-pointer on the freshly allocated rows_entry, insert it where needed and return back the dereferenced pointer. The C-pointer is going to become smart-pointer that would go out of scope before return. This change prepares for that by constructing the ensure_result from the iterator, that's returned from insertion of the entry. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `9fd8db318d`) Ref #9728	2021-12-14 15:52:37 +02:00
Asias He	1a9b64e6f6	storage_service: Wait for seastar::get_units in node_ops The seastar::get_units returns a future, we have to wait for it. Fixes #9767 Closes #9768 (cherry picked from commit `9859c76de1`)	2021-12-12 18:42:20 +02:00
Takuya ASADA	49fe9e2c8e	dist: allow running scylla-housekeeping with strict umask setting To avoid failing scylla-housekeeping in strict umask environment, we need to chmod a+r on repository file and housekeeping.uuid. Fixes #9683 Closes #9739 (cherry picked from commit `ea20f89c56`)	2021-12-12 14:25:57 +02:00
Takuya ASADA	d0580c41ee	dist: add support im4gn/is4gen instance on AWS Add support next-generation, storage-optimized ARM64 instance types. Fixes #9711 Closes #9730 (cherry picked from commit `097a6ee245`)	2021-12-08 14:29:44 +02:00
Beni Peled	542394c82f	release: prepare for 4.6.rc1	2021-12-08 11:08:45 +02:00
Avi Kivity	018ad3f6f4	test: refine test suite names exposed via xunit format The test suite names seen by Jenkins are suboptimal: there is no distinction between modes, and the ".cc" suffix of file names is interpreted as a class name, which is converted to a tree node that must be clicked to expand. Massage the names to remove unnecessary information and add the mode. Closes #9696 (cherry picked from commit `ef3edcf848`) Fixes #9738.	2021-12-05 19:58:22 +02:00
Avi Kivity	9b8b7efb54	tests: consolidate boost xunit result files The recent parallelization of boost unit tests caused an increase in xml result files. This is challenging to Jenkins, since it appears to use rpc-over-ssh to read the result files, and as a result it takes more than an hour to read all result files when the Jenkins main node is not on the same continent as the agent. To fix this, merge the result files in test.py and leave one result file per mode. Later we can leave one result file overall (integrating the mode into the testsuite name), but that can wait. Tested on a local Jenkins instance (just reading the result files, not the entire build). Closes #9668 (cherry picked from commit `b23af15432`) Fixes #9738	2021-12-05 19:57:39 +02:00
Botond Dénes	1c3e63975f	Merge 'Backport of #9348 (xceptions in commitlog::segment_manager::delete_segments could cause footprint counters to loose track)' from Calle Wilund Backport of series to 4.6 Upstream merge commit: `e2c27ee743`. Refs #9348 Closes #9702 * github.com:scylladb/scylla: commitlog: Recalculate footprint on delete_segment exceptions commitlog_test: Add test for exception in alloc w. deleted underlying file commitlog: Ensure failed-to-create-segment is re-deleted commitlog::allocate_segment_ex: Don't re-throw out of function	2021-12-02 09:22:19 +02:00
Calle Wilund	11bb03e46d	commitlog: Recalculate footprint on delete_segment exceptions Fixes #9348 If we get exceptions in delete_segments, we can, and probably will, lose track of footprint counters. We need to recompute the used disk footprint, otherwise we will flush too often, and even block indefinitely on new_seg iff using hard limits.	2021-11-29 14:56:48 +00:00
Calle Wilund	810e410c5d	commitlog_test: Add test for exception in alloc w. deleted underlying file Tests that we can handle exception-in-alloc cleanup if the file actually does not exist. This however uncovers another weakness (addressed in next patch) - that we can lose track of disk footprint here, and w. hard limits end up waiting for disk space that never comes. Thus test does not use hard limit.	2021-11-29 14:56:43 +00:00
Calle Wilund	97f6da0c3e	commitlog: Ensure failed-to-create-segment is re-deleted Fixes #9343 If we fail in allocate_segment_ex, we should push the file opened/created to the delete set to ensure we reclaim the disk space. We should also ensure that if we did not recycle a file in delete_segments, we still wake up any recycle waiters iff we made a file delete instead. Included a small unit test.	2021-11-29 14:51:39 +00:00
Calle Wilund	c229fe9694	commitlog::allocate_segment_ex: Don't re-throw out of function Fixes #9342 commitlog_error_handler rethrows. But we want to not. And run post-handler cleanup (co_await)	2021-11-29 14:51:39 +00:00
Tomasz Grabiec	ee1ca8ae4d	lsa: Add sanity checks around lsa_buffer operations We've been observing hard to explain crashes recently around lsa_buffer destruction, where the containing segment is absent in _segment_descs which causes log_heap::adjust_up to abort. Add more checks to catch certain impossible scenarios which can lead to this sooner. Refs #9192. Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com> (cherry picked from commit `bf6898a5a0`)	2021-11-24 15:17:37 +01:00
Tomasz Grabiec	6bfd322e3b	lsa: Mark compact_segment_locked() as noexcept We cannot recover from a failure in this method. The implementation makes sure it never happens. Invariants will be broken if this throws. Detect violations early by marking as noexcept. We could make it exception safe and try to leave the data structures in a consistent state but the reclaimer cannot make progress if this throws, so it's pointless. Refs #9192 Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com> (cherry picked from commit `4d627affc3`)	2021-11-24 15:17:35 +01:00
Tomasz Grabiec	afc18d5070	cql: Fix missing data in indexed queries with base table short reads Indexed queries are using paging over the materialized view table. Results of the view read are then used to issue reads of the base table. If base table reads are short reads, the page is returned to the user and paging state is adjusted accordingly so that when paging is resumed it will query the view starting from the row corresponding to the next row in the base which was not yet returned. However, paging state's "remaining" count was not reset, so if the view read was exhausted the reading will stop even though the base table read was short. Fix by restoring the "remaining" count when adjusting the paging state on short read. Tests: - index_with_paging_test - secondary_index_test Fixes #9198 Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com> (cherry picked from commit `1e4da2dcce`)	2021-11-23 11:22:00 +02:00
Tomasz Grabiec	2ec22c2404	sstables: partition_index_cache: Avoid abort due to benign bad_alloc inside allocating section shared_promise::get_shared_future() is marked noexcept, but can allocate memory. It is invoked by sstable partition index cache inside an allocating section, which means that allocations can throw bad_alloc even though there is memory to reclaim, so under normal conditions. Fix by allocating the shared_promise in a stable memory, in the standard allocator via lw_shared_ptr<>, so that it can be accessed outside allocating section. Fixes #9666 Tests: - build/dev/test/boost/sstable_partition_index_cache_test Message-Id: <20211122165100.1606854-1-tgrabiec@scylladb.com> (cherry picked from commit `1d84bc6c3b`)	2021-11-23 11:21:27 +02:00
Avi Kivity	19da778271	Merge "Run gossiper message handlers in a gate" from Pavel E " When gossiper processes its messages in the background some of the continuations may pop up after the gossiper is shutdown. This, in turn, may result in unwanted code to be executed when it doesn't expect. In particular, storage_service notification hooks may try to update system keyspace (with "fresh" peer info/state/tokens/etc). This update doesn't work after drain because drain shuts down commitlog. The intention was that gossiper did _not_ notify anyone after drain, because it's shut down during drain too. But since there are background continuations left, it's not working as expected. refs: #9567 tests: unit(dev), dtest.concurrent_schema_changes.snapshot(dev) " * 'br-gossiper-background-messages-2' of https://github.com/xemul/scylla: gossiper: Guard background processing with gate gossiper: Helper for background messaging processing (cherry picked from commit `9e2b6176a2`)	2021-11-19 07:25:26 +02:00
Avi Kivity	cbd4c13ba6	Merge 'Revert "scylla_util.py: return bool value on systemd_unit.is_active()"' from Takuya ASADA On scylla_unit.py, we provide `systemd_unit.is_active()` to return `systemctl is-active` output. When we introduced systemd_unit class, we just returned `systemctl is-active` output as string, but we changed the return value to bool after that (`2545d7fd43`). This was because `if unit.is_active():` always becomes True even it returns "failed" or "inactive", to avoid such scripting bug. However, probably this was mistake. Because systemd unit state is not 2 state, like "start" / "stop", there are many state. And we already using multiple unit state ("activating", "failed", "inactive", "active") in our Cloud image login prompt: https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_login#L135 After we merged `2545d7fd43`, the login prompt is broken, because it does not return string as script expected (https://github.com/scylladb/scylla-machine-image/issues/241). I think we should revert `2545d7fd43`, it should return exactly same value as `systemctl is-active` says. Fixes #9627 Fixes scylladb/scylla-machine-image#241 Closes #9628 * github.com:scylladb/scylla: scylla_ntp_setup: use string in systemd_unit.is_active() Revert "scylla_util.py: return bool value on systemd_unit.is_active()" (cherry picked from commit `c17101604f`)	2021-11-18 11:44:11 +02:00
Pavel Emelyanov	338871802d	generic_server: Keep server alive during conn background processing There's at least one tiny race in generic_server code. The trailing .handle_exception after the conn->process() captures this, but since the whole continuation chain happens in the background, that this can be released thus causing the whole lambda to execute on freed generic_server instance. This, in turn, is not nice because captured this is used to get a _logger from. The fix is based on the observation that all connections pin the server in memory until all of them (connections) are destructed. Said that, to keep the server alive in the aforementioned lambda it's enough to make sure the conn variable (it's lw_shared_ptr on the connection) is alive in it. Not to generate a bunch of tiny continuations with identical set of captures -- tail the single .then_wrapped() one and do whatever is needed to wrap up the connection processing in it. tests: unit(dev) fixes: #9316 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211115105818.11348-1-xemul@scylladb.com> (cherry picked from commit `ba16318457`)	2021-11-17 10:21:11 +02:00
Yaron Kaikov	8b5b1b8af6	dist/docker/debian/build_docker.sh: debian version fix for rc releases When building a Docker image we rely on the `VERSION` value from `SCYLLA-VERSION-GEN`. For `rc` releases only, there is a difference between the configured version (X.X.rcX) and the actual Debian package we generate (X.X~rcX). Using a similar solution as I did in `dcb10374a5`. Fixes: #9616 Closes #9617 (cherry picked from commit `060a91431d`)	2021-11-12 20:07:19 +02:00
Takuya ASADA	ea89eff95d	dist/docker: fix bashrc filename for Ubuntu For Debian variants, correct filename is /etc/bash.bashrc. Fixes #9588 Closes #9589 (cherry picked from commit `201a97e4a4`)	2021-11-10 14:25:27 +02:00
Michał Radwański	96421e7779	memtable: fix gcc function argument evaluation order induced use after move clang evaluates function arguments from left to right, while gcc does so in reverse. Therefore, this code can be correct on clang and incorrect on gcc: ``` f(x.sth(), std::move(x)) ``` This patch fixes one such instance of this bug, in memtable.cc. Fixes #9605. Closes #9606 (cherry picked from commit `eff392073c`)	2021-11-10 08:58:09 +02:00
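The hazard described in the commit above can be sketched in isolation. The C++ standard leaves the relative order of function argument evaluation unspecified, so `f(x.sth(), std::move(x))` is only safe if no argument actually moves out of `x` during evaluation. The sketch below assumes a hypothetical `consume()` that takes its second parameter by value (so the move happens while arguments are being evaluated); the names `item`, `consume` and `consume_safely` are illustrative and are not taken from memtable.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical payload type, used only for illustration.
struct item {
    std::string payload;
    std::size_t size() const { return payload.size(); }
};

// Hypothetical consumer. The by-value parameter `it` is move-constructed
// while the call's arguments are being evaluated.
std::size_t consume(std::size_t size, item it) {
    return size + it.payload.size();
}

// Fragile form (not used here): consume(x.size(), std::move(x));
// If the compiler initializes `it` first, x.size() reads a moved-from object.
//
// Safe form: read what is needed into a local first, so the read is
// sequenced before the move regardless of argument evaluation order.
std::size_t consume_safely(item x) {
    const std::size_t size = x.size();
    return consume(size, std::move(x));
}
```

Hoisting the read into a named local is one common way to remove the dependency on evaluation order; whether the actual fix in the commit took this exact shape is not stated in the message.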
Botond Dénes	142336ca53	mutation_writer/feed_writer: don't drop readers with small amount of content Due to an error in transforming the above routine, readers who have <= a buffer worth of content are dropped without consuming them. This is due to the outer consume loop being conditioned on `is_end_of_stream()`, which will be set for readers that eagerly pre-fill their buffer and also have no more data than what is in their buffer. Change the condition to also check for `is_buffer_empty()` and only drop the reader if both of these are true. Fixes: #9594 Tests: unit(mutation_writer_test --repeat=200, dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211108092923.104504-1-bdenes@scylladb.com> (cherry picked from commit `4b6c0fe592`)	2021-11-09 14:13:21 +02:00
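A toy model may help show why the loop condition in the commit above matters. The `toy_reader` type below is an assumption made for illustration only; it borrows the method names `is_end_of_stream()` and `is_buffer_empty()` from the commit message but is not the real reader interface. A reader that pre-filled its buffer and has nothing further to fetch reports end-of-stream while still holding data.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Toy stand-in for a reader (hypothetical, not the Scylla class).
struct toy_reader {
    std::deque<int> buffer;        // fragments already fetched
    bool end_of_stream = false;    // true once nothing more can be fetched

    bool is_end_of_stream() const { return end_of_stream; }
    bool is_buffer_empty() const { return buffer.empty(); }
    int pop() { int v = buffer.front(); buffer.pop_front(); return v; }
};

// A loop written as `while (!rd.is_end_of_stream())` would return nothing
// for a pre-filled reader. The reader is finished only when it is at
// end-of-stream AND its buffer has been drained.
std::vector<int> drain(toy_reader& rd) {
    std::vector<int> out;
    while (!(rd.is_end_of_stream() && rd.is_buffer_empty())) {
        if (rd.is_buffer_empty()) {
            // A real reader would refill its buffer here; the toy one
            // simply has nothing more to offer.
            rd.end_of_stream = true;
            continue;
        }
        out.push_back(rd.pop());
    }
    return out;
}
```

With `toy_reader rd{{1, 2, 3}, true}`, `drain(rd)` yields all three buffered values, which is the case the original condition dropped.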
Calle Wilund	492f12248c	commitlog: Add explicit track var for "wasted space" to avoid double counting Refs #9331 In segment::close() we add space to managers "wasted" counter. In destructor, if we can cleanly delete/recycle the file we remove it. However, if we never went through close (shutdown - ok, exception in batch_cycle - not ok), we can end up subtracting numbers that were never added in the first place. Just keep track of the bytes added in a var. Observed behaviour in above issue is timeouts in batch_cycle, where we declare the segment closed early (because we cannot add anything more safely - chunks could get partial/misplaced). Exception will propagate to caller(s), but the segment will not go through actual close() call -> destructor should not assume such. Closes #9598 (cherry picked from commit `3929b7da1f`)	2021-11-09 14:07:04 +02:00
Yaron Kaikov	7eb7a0e5fe	release: prepare for 4.6.rc0	2021-11-08 09:18:26 +02:00
Botond Dénes	e991604918	schema: make private constructor invokable via make_lw_shared The schema has a private constructor, which means it can't be constructed with `make_lw_shared()` even by classes which are otherwise able to invoke the private constructor themselves. This results in such classes (`schema_builder`) resorting to building a local schema object, then invoking `make_lw_shared()` with the schema's public move constructor. Moving a schema is not cheap at all however, so each `schema_builder::build()` call results in two expensive schema construction operations. We could make `make_lw_shared()` a friend of `schema` to resolve this, but then we'd de-facto open the private constructor to the world. Instead this patch introduces a private tag type, which is added to the private constructor, which is then made public. Everybody can invoke the constructor but only friends can create the private tag instance required to actually call it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211105085940.359708-1-bdenes@scylladb.com>	2021-11-07 12:51:09 +02:00
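The private-tag technique from the commit above can be sketched as follows. This is a minimal illustration, not the actual `schema` code: `std::make_shared` stands in for Seastar's `make_lw_shared()`, and the class names `schema_like` and `schema_like_builder` are hypothetical. The tag's default constructor is marked `explicit` so that outsiders cannot sidestep the private name by passing `{}`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

class schema_like {
    // Private tag: only schema_like and its friends can name it, and the
    // explicit constructor prevents creating one from a bare `{}`.
    struct private_tag {
        explicit private_tag() = default;
    };
    friend class schema_like_builder;

    std::string _name;

public:
    // Public, so a generic factory can call it, but unusable without the tag.
    schema_like(private_tag, std::string name) : _name(std::move(name)) {}

    const std::string& name() const { return _name; }
};

class schema_like_builder {
    std::string _name;

public:
    explicit schema_like_builder(std::string name) : _name(std::move(name)) {}

    // Constructs the object directly inside the shared allocation: no local
    // temporary and no extra move of the built object.
    std::shared_ptr<schema_like> build() const {
        return std::make_shared<schema_like>(schema_like::private_tag{}, _name);
    }
};
```

The design choice is that access control moves from the constructor to the tag: the factory template needs no friendship, and code that is not a friend still cannot produce the argument the constructor requires.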
Tomasz Grabiec	31bc1eb681	Merge 'Memtable reversing reader: fix computing rt slice, if there was previously emitted range tombstone.' from Michał Radwański This PR started by realizing that in the memtable reversing reader, it never happened on tests that `do_refresh_state` was called with `last_row` and `last_rts` which are not `std::nullopt`. Changes - fix memtable test (`test_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes, - fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`). Fixes #9486 Closes #9572 * github.com:scylladb/scylla: partition_snapshot_reader: fix indentation in fill_buffer range_tombstone_list: {lower,upper,}slice share comparator implementation test: memtable: add full_compaction in background partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set range_tombstone_list: add lower_slice	2021-11-05 15:27:03 +01:00
Michał Radwański	ee601b7d87	partition_snapshot_reader: fix indentation in fill_buffer	2021-11-05 10:51:58 +01:00
Michał Radwański	35b1c3ff52	range_tombstone_list: {lower,upper,}slice share comparator implementation slice (2 overloads), upper_slice, lower_slice previously had implementations of a comparator. Move out the common structs, so that all 4 of them can share implementation.	2021-11-05 10:51:58 +01:00
Michael Livshin	60f76155a7	build: have configure.py create compile_commands.json compile_commands.json (a.k.a. "compdb", https://clang.llvm.org/docs/JSONCompilationDatabase.html) is intended to help stand-alone C-family LSP servers index the codebase as precisely as possible. The actively maintained LSP servers with good C++ support are: - Clangd (https://clangd.llvm.org/) - CCLS (https://github.com/MaskRay/ccls) This change causes a successful invocation of configure.py to create a unified Scylla+Seastar+Abseil compdb for every selected build mode, and to leave a valid symlink in the source root (if a valid symlink already exists, it will be left alone). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9558	2021-11-05 11:28:37 +02:00
Raphael S. Carvalho	4950ce539c	schema: replace outdated comment on default compaction strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211104210043.199156-1-raphaelsc@scylladb.com>	2021-11-05 00:35:41 +02:00
Nadav Har'El	5e52858295	rjson, alternator: rename set() functions add() The rjson::set() sounds like it can set any member of a JSON object (i.e., map), but that's not true :-( It calls the RapidJson function AddMember() so it can only add a member to an object which doesn't have a member with the same name (i.e., key). If it is called with a key that already has a value, the result may have two values for the same key, which is ill-formed and can cause bugs like issue #9542. So in this patch we begin by renaming rjson::set() and its variant to rjson::add() - to suggest to its user that this function only adds members, without checking if they already exist. After this rename, I was left with dozens of calls to the set() functions that need to be changed to either add() - if we're sure that the object cannot already have a member with the same name - or to replace() if it might. The vast majority of the set() calls were starting with an empty item and adding members with fixed (string constant) names, so these can be trivially changed to add(). It turns out that all other set() calls - except the one fixed in issue #9542 - can also use add() because there are various "excuses" why we know the member names will be unique. A typical example is a map with column-name keys, where we know that the column names are unique. I added comments in front of such non-obvious uses of add() which are safe. Almost all uses of rjson except a handful are in Alternator, so I verified that all Alternator test cases continue to pass after this patch. Fixes #9583 Refs #9542 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211104152540.48900-1-nyh@scylladb.com>	2021-11-04 16:35:38 +01:00
Nadav Har'El	b95e431228	alternator: fix bug in ReturnValues=ALL_NEW This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in some cases returned the OLD (pre-modification) value of some of the attributes, instead of its NEW value. The bug was caused by a confusion in our JSON utility function, rjson::set(), which sounds like it can set any member of a map, but in fact may only be used to add a new member - if a member with the same name (key) already existed, the result is undefined (two values for the same key). In ReturnValues=ALL_NEW we did exactly this: we started with a copy of the original item, and then used set() to override some of the members. This is not allowed. So in this patch, we introduce a new function, rjson::replace(), which does what we previously thought that rjson::set() does - i.e., replace a member if it exists, or if not, add it. We call this function in the ReturnValues=ALL_NEW code. This patch also adds a test case that reproduces the incorrect ALL_NEW results - and gets fixed by this patch. In an upcoming patch, we should rename the confusingly-named set() functions and audit all their uses. But we don't do this in this patch yet. We just add some comments to clarify what set() does - but don't change it, and just add one new function for replace(). Fixes #9542 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211104134937.40797-1-nyh@scylladb.com>	2021-11-04 16:34:58 +01:00
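The difference between "add" and "replace" semantics described in the two rjson commits above can be modeled without RapidJSON. The sketch below is an assumption-laden stand-in: a JSON object is represented as a plain vector of key/value pairs, and `add_member()` / `replace_member()` are hypothetical helpers that mimic the described behavior (append without checking versus overwrite-or-append). They are not the rjson API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using member = std::pair<std::string, std::string>;
using object = std::vector<member>;   // toy model of a JSON object

// "add" semantics: append unconditionally. Calling this twice with the same
// key leaves two members with that key, which is ill-formed.
void add_member(object& o, std::string key, std::string value) {
    o.emplace_back(std::move(key), std::move(value));
}

// "replace" semantics: overwrite the member if the key exists, else add it.
void replace_member(object& o, std::string key, std::string value) {
    auto it = std::find_if(o.begin(), o.end(),
                           [&](const member& m) { return m.first == key; });
    if (it != o.end()) {
        it->second = std::move(value);
    } else {
        o.emplace_back(std::move(key), std::move(value));
    }
}

// Counts how many members carry the given key.
std::size_t count_key(const object& o, const std::string& key) {
    return static_cast<std::size_t>(
        std::count_if(o.begin(), o.end(),
                      [&](const member& m) { return m.first == key; }));
}
```

Starting from a copy of an existing item and overriding an attribute, as the ALL_NEW path did, needs the replace form; the add form is only safe when the key is known not to be present.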
Michał Radwański	cac9ac5126	test: memtable: add full_compaction in background Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source in background. Without it, some paths in the partition snapshot reader weren't covered, as the tests always managed to read all range tombstones and rows which cover a given clustering range from just a single snapshot. Now, when full_compaction happens in process of reading from a clustering range, we can force state refresh with non-nullopt positions of last row and last range tombstone. Note: this inability to test affected only the reversing reader.	2021-11-04 16:19:54 +01:00
Michał Radwański	94b263e356	partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set If Reversing and _last_rts was set, the created rt_slice still contained range tombstones between _last_rts and (snapshot) clustering range end. This is wrong - the correct range is between (snapshot) clustering range begin and _last_rts.	2021-11-04 16:10:07 +01:00
Pavel Emelyanov	6e97d2ce87	Merge branch 'compaction_cleanup_and_improvements_v2' from Raphael S. Carvalho Cleanup and improvements for compaction * 'compaction_cleanup_and_improvements_v2' of https://github.com/raphaelsc/scylla: compaction: fix outdated doc of compact_sstables() table: fix indentation in compact_sstables() table: give a more descriptive name to compaction_data in compact_sstables() compaction_manager: rename submit_major_compaction to perform_major_compaction compaction: fix indentantion in compaction.hh compaction: move incremental_owned_ranges_checker into cleanup_compaction compaction: make owned ranges const in cleanup_compaction compaction: replace outdated comment in regular_compaction compaction: give a more descriptive name to compaction_data compaction_manager: simplify creation of compaction_data	2021-11-04 17:27:07 +03:00
Raphael S. Carvalho	132a840ed5	compaction: fix outdated doc of compact_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 11:09:24 -03:00
Raphael S. Carvalho	98dd57113f	table: fix indentation in compact_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 11:09:24 -03:00
Raphael S. Carvalho	51aa79e267	table: give a more descriptive name to compaction_data in compact_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 11:09:24 -03:00
Raphael S. Carvalho	8ce9cda391	compaction_manager: rename submit_major_compaction to perform_major_compaction For symmetry, let's call it perform_*, as it doesn't work like the submission functions, which don't wait for the result, like the one for minor compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:54:00 -03:00
Raphael S. Carvalho	0d745912d0	compaction: fix indentantion in compaction.hh Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:50:46 -03:00
Raphael S. Carvalho	5af9a690c1	compaction: move incremental_owned_ranges_checker into cleanup_compaction let's move checker into cleanup as it's not needed elsewhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:49:44 -03:00
Raphael S. Carvalho	04ef2124c6	compaction: make owned ranges const in cleanup_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:47:12 -03:00
Raphael S. Carvalho	d86c2491d4	compaction: replace outdated comment in regular_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:45:34 -03:00
Raphael S. Carvalho	b344db1696	compaction: give a more descriptive name to compaction_data info is no longer descriptive, as compaction now works with compaction_data instead of compaction_info. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:43:08 -03:00
Raphael S. Carvalho	63dc4e2107	compaction_manager: simplify creation of compaction_data there's no need for wrapping compaction_data in shared_ptr, also let's kill unused params in create_compaction_data to simplify its creation. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:35:49 -03:00
Takuya ASADA	9b4cf8c532	scylla_util.py: On is_gce(), return False when it's on GKE GKE metadata server does not provide same metadata as GCE, we should not return True on is_gce(). So try to fetch machine-type from metadata server, return False if it 404 not found. Fixes #9471 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #9582	2021-11-04 12:49:06 +02:00
Avi Kivity	a64458e71a	Merge "Run test cases in parallel by default" from Pavel E " Some time ago there was introduced the --parallel-cases option that was set to False by default. Now everything is ready for making it True. Running in a BYO job shows that it takes 30 minutes less to complete the debug tests. Other timings remain almost the same. tests: unit(dev), unit(debug) " * 'br-parallel-cases-by-default' of https://github.com/xemul/scylla: test.py: Run parallel cases by default test, raft: Keep many-400 case out of debug mode test.py: Cache collected test-cases	2021-11-04 10:10:08 +02:00
Pavel Emelyanov	d1679b66f2	test.py: Run parallel cases by default There were few missing bits before making this the default. - default max number of AIOs, now tests are run with the greatly reduced value - 1.5 hours single case from database_test, now it's split and scales with --parallel-cases - suite add_test methods called in a loop for --repeat options, patch #1 from this set fixes it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-04 10:47:13 +03:00
Pavel Emelyanov	12cf69e5f5	test, raft: Keep many-400 case out of debug mode This case takes 45+ minutes which is 1.5 times longer than the second longest case out there. I propose to keep the many-400 case out of debug runs, there's many-100 one which is configured the same way but uses 4 times fewer "nodes". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-04 10:47:13 +03:00
Pavel Emelyanov	0d0ccd50b5	test.py: Cache collected test-cases The add_test method of a suite can be called several times in a row e.g. in case of --repeat option or because there are more than one custom_args entries in the suite.yaml file. In any case it's pointless to re-collect the test cases by launching the test binary again, it's much faster (and 100% safe) to keep the list of cases from the previous call and re-use it if the test name matches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-04 10:47:13 +03:00
Avi Kivity	e1817b536f	build: clobber user/group info from node_exporter tarball node_exporter is packaged with some random uid/gid in the tarball. When extracting it as an ordinary user this isn't a problem, since the uid/gid are reset to the current user, but that doesn't happen under dbuild since `tar` thinks the current user is root. This causes a problem if one wants to delete the build directory later, since it becomes owned by some random user (see /etc/subuid). Reset the uid/gid information so this doesn't happen. Closes #9579	2021-11-04 09:27:13 +02:00
Raphael S. Carvalho	ab0217e30e	compaction: Improve overall efficiency by not diluting it with relatively inefficient jobs Compaction efficiency can be defined as how much backlog is reduced per byte read or written. We know a few facts about efficiency: 1) the more files are compacted together (the fan-in) the higher the efficiency will be, however... 2) the bigger the size difference of input files the worse the efficiency, i.e. higher write amplification. So compactions with similar-sized files are the most efficient ones, and their efficiency increases with a higher number of files. However, in order to not have bad read amplification, number of files cannot grow out of bounds. So we have to allow parallel compaction on different tiers, but to avoid "dilution" of overall efficiency, we will only allow a compaction to proceed if its efficiency is greater than or equal to the efficiency of ongoing compactions. For the time being, we'll assume that strategies don't pick candidates with wildly different sizes, so efficiency is only calculated as a function of compaction fan-in. Now when system is under heavy load, then fan-in threshold will automatically grow to guarantee that overall efficiency remains stable. Please note that fan-in is defined in number of runs. LCS compaction on higher levels will have a fan-in of 2. Under heavy load, it may happen that LCS will temporarily switch to size-tiered mode for compaction to keep up with amount of data being produced. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211103215110.135633-2-raphaelsc@scylladb.com>	2021-11-03 20:03:23 +02:00
Raphael S. Carvalho	0db70a8d98	compaction: STCS: pick bucket with largest fan-in instead STCS is considering the smallest bucket, out of the ones which contain more than min_threshold elements, to be the most interesting one to compact now. That's basically saying we'll only compact larger tiers once we're done with smaller ones. That can be problematic because under heavy load, larger tiers cannot be compacted in a timely manner even though they're the ones contributing the most to read amplification. For example, if we're producing sstables in smaller tiers at roughly the same rate that we can compact them, then it may happen that larger tiers will not be compacted even though new sstables are being pushed to them. Therefore, backlog will not be reduced in a satisfactory manner, so read latency is affected. By picking the bucket with largest fan-in instead, we'll choose the most efficient compaction, as we'll target buckets which can reduce more from backlog once compacted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211103215110.135633-1-raphaelsc@scylladb.com>	2021-11-03 20:03:19 +02:00
Raphael S. Carvalho	e9cb56cd81	table: Adjust partition estimation for segregation on memtable flush If memtable flush is segregated into multiple files, partition estimation becomes inaccurate and consequently bloom filters are bigger than needed, leading to an increase in memory consumption. To fix this, let's wire adjust_partition_estimate() into the flush procedure, such that original estimation will be adjusted if segregation is going to be performed. That's done by feeding mutation_source_metadata, which will leave original estimation unchanged if no segregation is needed, but will adjust it otherwise. Fixes #9581. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211103141600.65806-2-raphaelsc@scylladb.com>	2021-11-03 17:51:03 +02:00
Raphael S. Carvalho	2340cfa957	memtable-sstable: Extend interface to allow adjustment of estimated partitions Without tweaking interface, there was no way to adjust estimated partitions on flush. For example, when segregating a memtable for TWCS, all produced sstables would have an estimation equal to the memtable size, even though each only contains a subset of it, which leads to a significant increase in memory consumption for bloom filters. Subsequent work will use this interface to perform the adjustment. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211103141600.65806-1-raphaelsc@scylladb.com>	2021-11-03 17:51:03 +02:00
Avi Kivity	08ce119703	Merge "Fix twcs reshape disjoint test case" from Pavel E " There are 3 overlapping problems with the test case. It has a use after move that covers up a wrong window selection, and it relies on a time-since-epoch being aligned with the time window by chance. tests: unit(dev) " * 'br-twcs-test-fixes' of https://github.com/xemul/scylla: test, compaction: Do not rely on random timestamp test, compaction: Fix use after move in twcs reshape	2021-11-03 17:38:29 +02:00
Asias He	f5f5714aa6	repair: Return HTTP 400 when repair id is not found There are two APIs for checking the repair status and they behave differently in case the id is not found. ``` {"host": "192.168.100.11:10001", "method": "GET", "uri": "/storage_service/repair_async/system_auth?id=999", "duration": "1ms", "status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate: Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"} {"host": "192.168.100.11:10001", "method": "GET", "uri": "/storage_service/repair_status?id=999&timeout=1", "duration": "0ms", "status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate: Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"} ``` The correct status code is 400 as this is a parameter error and should not be retried. Returning status code 500 makes smarter http clients retry the request in hopes of server recovering. After this patch: curl -X PGET 'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999' {"message": "unknown repair id 9999", "code": 400} curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status?id=9999' {"message": "unknown repair id 9999", "code": 400} Fixes #9576 Closes #9578	2021-11-03 17:15:40 +02:00
Pavel Emelyanov	9628d72964	test, compaction: Do not rely on random timestamp Again, there's a sub-case with sequential time stamps that still works by chance. This time it's because splitting 256 sstables into buckets of maximum 8 ones is allowed to have the 1st and the last ones with less than 8 items in it, e.g. 3, 8, ..., 8, 5. The exact generation depends on the time-since-epoch at which it starts. When all the cases are run altogether this time luckily happens to be well-aligned with 8-hours and the generated buckets are filled perfectly. When this particular test-case is run all alone (e.g. by --run_test or --parallel-cases) then the starting time becomes different and it gets less than 4 sstables in its first bucket. The fix is in adjusting the starting time to be aligned with the 8 hours window. Actually, the 8 hours appeared in the previous patch, before which it was 24 hours. Nonetheless, the above reasoning applies to any size of the time window that's less than 256, so it's still an independent fix. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-03 15:41:19 +03:00
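The alignment argument in the commit above can be checked with a little arithmetic. The sketch below assumes sstables are generated one per hour starting at some time-since-epoch, and that buckets are fixed windows counted from the epoch; `align_to_window()` and `first_bucket_size()` are hypothetical helper names, not functions from the test.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::hours;

// Rounds a non-negative time-since-epoch down to the start of its window.
hours align_to_window(hours since_epoch, hours window) {
    return since_epoch - (since_epoch % window);
}

// Number of hourly sstables that fall into the first window when
// generation starts at `start`.
hours::rep first_bucket_size(hours start, hours window) {
    return (window - (start % window)).count();
}
```

For example, starting 13 hours after the epoch with an 8-hour window leaves 3 sstables in the first bucket, which matches the "3, 8, ..., 8, 5" split of 256 sstables mentioned in the message and falls below the minimum of 4. Starting from `align_to_window(hours(13), hours(8))` instead fills the first bucket with 8.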
Pavel Emelyanov	c6ce6b9ca1	test, compaction: Fix use after move in twcs reshape The options are std::move-d twice -- first into schema builder then into compaction strategy. Surprisingly, but the 2nd move makes the test work. There's a sub-case in this case that checks sstables with incremental timestamps with 1 hour step -- 0h, 1h, 2h, ... 255h. Next, the twcs buckets generator obeys a minimal threshold of 4 sstables per bucket. Those with less sstables in are not included in the job. Finally, since the options used to create the twcs are empty after the 1st move the default window of 24 hours is used. If they were configured correctly with 1 hour window then all buckets would contain 1 sstable and the generated job would become empty. So the fix is both -- don't move after move and make the window size large enough to fit more sstables than the mentioned minimum. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-03 15:09:15 +03:00
Piotr Sarna	f36bbe05b4	Merge 'alternator: add support for AttributeUpdates Add operation' from Nadav Har'El In UpdateItem's AttributeUpdates (old-style parameter) we were missing support for the ADD operation - which can increment a number, or add items to sets (or to lists, even though this fact isn't documented). This two-patch series add this missing feature. The first patch just moves an existing function to where we can reuse it, and the second patch is the actual implementation of the feature (and enabling its test). Fixes #5893 Closes #9574 * github.com:scylladb/scylla: alternator: add support for AttributeUpdates ADD operation alternator: move list_concatenate() function	2021-11-03 09:33:50 +01:00
Avi Kivity	075ceb8918	Merge 'AWS: add scylla_io_setup preset parameters for ARM instances' from Takuya ASADA Currently, scylla-server fails to start on ARM instances because scylla_io_setup does not have preset parameters for them, even though the instance types were added to the 'supported instance' list. To fix this, we need to add I/O parameter presets to scylla_io_setup. Also, we mistakenly added EBS-only instances at `a004b1da30` and need to remove them. Instances that do not have an ephemeral disk should be 'unsupported instance'; we still run our AMI on them, but we print a warning message on the login prompt, and the user is required to run scylla_io_setup. Fixes #9493 Closes #9532 * github.com:scylladb/scylla: scylla_util.py: remove EBS only ARM instances from support instance list scylla_io_setup: support ARM instances on AWS	2021-11-03 10:19:59 +02:00
Nadav Har'El	00335b1901	alternator: add support for AttributeUpdates ADD operation In UpdateItem's AttributeUpdates (old-style parameter) we were missing support for the ADD operation - which can increment a number, or add items to sets (or to lists, even though this fact isn't documented). This patch adds this feature, and the test for it begins to pass so its "xfail" marker is removed. Fixes #5893 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-03 10:19:26 +02:00
Nadav Har'El	7e6c5394f3	alternator: move list_concatenate() function The list_concatenate() function was only used for UpdateExpression's ADD operation, so we made it a static function in the source file where it was used. In the next patch, we'll want to use it in another place (AttributeUpdates' ADD operation), so let's move it to the same file where similar functions for sets exist. This patch is almost entirely a code move, but also makes one small change: list_concatenate() used to throw an exception if one of the arguments wasn't a list, but the text of this exception was specific to UpdateExpression. So in the new version, we return a null value in this case - and the caller checks for it and throws the right exception. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-03 10:19:26 +02:00
Nadav Har'El	56eb994d8f	alternator: allow Authorization header to be without spaces The "Authorization" HTTP header is used in DynamoDB API to sign requests. Our parser for this header, in server::verify_signature(), required the different components of this header to be separated by a comma followed by a whitespace - but it turns out that in DynamoDB both spaces and commas are optional - one of them is enough. At least one DynamoDB client library - the old "boto" (which predated boto3) - builds this header without spaces. In this patch we add a test that shows that an Authorization header with spaces removed works fine in DynamoDB but didn't work in Alternator, and after this patch modifies the parsing code for this header, the test begins to pass (and the other tests show that the previously-working cases didn't break). Fixes #9568 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211101214114.35693-1-nyh@scylladb.com>	2021-11-03 06:38:28 +02:00
Takuya ASADA	4a96a8145e	scylla_util.py: remove EBS only ARM instances from support instance list Since we require ephemeral disks for our AMI, these EBS-only ARM instances cannot be added to its 'supported instance' list. We are still able to run our AMI on these instance types, but the login message warns that it is an 'unsupported instance type' and requires running scylla_io_setup manually.	2021-11-03 10:26:42 +09:00
Takuya ASADA	4e8060ba72	scylla_io_setup: support ARM instances on AWS Add preset parameters for AWS ARM instances. Fixes #9493	2021-11-03 10:26:42 +09:00
Avi Kivity	b3d5651fd7	Update seastar submodule * seastar 083898a172...a189cdc45d (7): > print: deprecate print() family > treewide: replace uses of fmt_print() and fprint() with direct fmt calls > circular_buffer: mark clear noexcept > circular_buffer: mark trivial methods noexcept > Merge "file: allow destroying append_challenged_posix_file_impl following a close failure" from Benny > merge: Add parsing HTTP response status > inet_address: fix usage of `htonl` for clang	2021-11-02 19:26:09 +02:00
garanews	7a6a59eb7c	fix some typo in docs Closes #9510	2021-11-02 19:59:16 +03:00
Botond Dénes	6ad0a2989c	compaction/scrub: segregate input only in segregate mode scrub_compaction assumes that `make_interposer_consumer()` is called only when `use_interposer_consumer()` returns true. This is false, so in effect scrub always ends up using the segregating interposer. Fix this by short-circuiting the former method when the latter returns false, returning the passed-in consumer unchanged. Tests: unit(dev) Fixes #9541 Closes #9564	2021-11-02 15:25:22 +02:00
Avi Kivity	15a80bb5ce	Update tools/jmx submodule * tools/jmx 5c383b6...48d37f3 (1): > StorageService: scrub: fix scrubMode is empty condition Ref scylladb/scylla-jmx#180.	2021-11-02 15:21:31 +02:00
Avi Kivity	2f23d22739	Merge 'Scrub compaction: add a new in-memory partition segregation method' from Botond Dénes The current disk-based segregation method works well enough for most cases, but it struggles greatly when there are a lot of partitions in the input. When this is the case it will produce tons of buckets (sstables), in the order of hundreds or even thousands. This puts a huge strain on different parts of the system. This series introduces a new segregation method which specializes in the case of many small partitions. If the conditions are right, it can cause a drastic reduction of buckets. In one case I tested, a 1.1GB sstable with 3.6M partitions in it produced just 2 output sstables, down from the 500+ with the on-disk method. This new method uses a memtable to sort out-of-order partitions. In-order partitions bypass this sorting altogether and go to the disk directly. This method is not suitable for cases where either the partitions are large or the total amount of data is large. For those the disk-based method should be used. Scrub compaction decides on the method to use based on heuristics. Tests: unit(dev) Closes #9548 * github.com:scylladb/scylla: compaction: scrub_compaction: add bucket count to finish message test/boost: mutation_writer_test: harden the partition-based segregator test mutation_writer: remove now unused on-disk partition segregator compaction,test: use the new in-memory segregator for scrub mutation_writer/partition_based_splitting_writer: add memtable-based segregator	2021-11-02 15:18:41 +02:00
Tomasz Grabiec	00814dcadc	Merge "raft: randomized_nemesis_test: perform cluster reconfigurations" from Kamil We introduce a new operation to the framework: `reconfiguration`. The operation sends a reconfiguration request to a Raft cluster. It bounces a few times in case of `not_a_leader` results. A side effect of the operation is modifying a `known` set of nodes which the operation's state has a reference to. This `known` set can then be used by other operations (such as `raft_call`s) to find the current leader. For now we assume that reconfigurations are performed sequentially. If a reconfiguration succeeds, we change `known` to the new configuration. If it fails, we change `known` to be the set sum of the previous configuration and the current configuration (because we don't know what the configuration will eventually be - the old or the attempted one - so any member of the set sum may eventually become a leader). We use a dedicated thread (similarly to the network partitioning thread) to periodically perform random reconfigurations. * kbr/reconfig-v2: test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test test: raft: randomized_nemesis_test: improve the bouncing algorithm test: raft: randomized_nemesis_test: handle more error types test: raft: randomized_nemesis_test put `variant` and `monostate` `ostream` `operator<<`s into `std` namespace test: raft: randomized_nemesis_test: `reconfiguration` operation	2021-11-02 13:55:45 +01:00
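The bookkeeping of the `known` set described in the commit above can be sketched in Python as follows. This is a simplified illustration, not the test framework's C++ code; the function name `next_known` and the node names are hypothetical.

```python
def next_known(previous: set, attempted: set, succeeded: bool) -> set:
    """Return the set of nodes that may contain the leader after a reconfiguration.

    Hypothetical helper: on success only the new configuration matters; on
    failure we cannot tell which configuration will eventually apply, so any
    member of either one may become a leader.
    """
    if succeeded:
        return set(attempted)
    return set(previous) | set(attempted)
```

For example, a failed attempt to move from `{"a", "b", "c"}` to `{"b", "c", "d"}` leaves all four nodes as leader candidates.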
Botond Dénes	eaf4454ac8	compaction: scrub_compaction: add bucket count to finish message It is useful to know how many buckets (output sstables) scrub produced in total. The end compaction message will only report those still open when the scrub finished, but will omit those that were closed in the middle.	2021-11-02 12:24:37 +02:00
Botond Dénes	e4e369053b	test/boost: mutation_writer_test: harden the partition-based segregator test Test both methods: the "old" disk-based one and the recently added in-memory one, with different configurations and also add additional checks to ensure they don't lose data.	2021-11-02 12:24:37 +02:00
Botond Dénes	74f2290e49	mutation_writer: remove now unused on-disk partition segregator Also removes related tests, including the exception safety test which just spins forever with the memtable method.	2021-11-02 12:24:33 +02:00
Michał Radwański	07e78807e6	range_tombstone_list: add lower_slice lower_slice returns the range tombstones which have end inside range [start, before).	2021-11-02 10:50:31 +01:00
Botond Dénes	f2f529855d	compaction,test: use the new in-memory segregator for scrub	2021-11-02 09:00:44 +02:00
Botond Dénes	18599f26fa	mutation_writer/partition_based_splitting_writer: add memtable-based segregator The current method of segregating partitions doesn't work well for a huge number of small partitions. For especially bad input, it can produce hundreds or even thousands of buckets. This patch adds a new segregator specialized for this use-case. This segregator uses a memtable to sort out-of-order partitions in-memory. When the memtable size reaches the provided max-memory limit, it is flushed to disk and a new empty one is created. In-order partitions bypass the sorting altogether and go to the fast-path bucket. The new method is not used yet; this will come in the next patch.	2021-11-02 08:23:16 +02:00
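The segregation scheme described in the commit above can be sketched in Python as follows. This is a toy model under stated assumptions, not the actual C++ writer: partitions are represented by distinct integer keys, the memory limit is counted in partitions rather than bytes, and the names `segregate` and `max_memtable_size` are hypothetical.

```python
def segregate(keys, max_memtable_size):
    """Split a possibly out-of-order key stream into sorted buckets.

    Hypothetical sketch: in-order keys go straight to the fast-path bucket,
    out-of-order keys are collected in an in-memory "memtable" that is
    sorted and flushed as a new bucket whenever it reaches the size limit.
    """
    fast_path = []   # in-order keys, written through without sorting
    flushed = []     # each flushed memtable becomes one sorted bucket
    memtable = []
    for key in keys:
        if not fast_path or key > fast_path[-1]:
            fast_path.append(key)
        else:
            memtable.append(key)
            if len(memtable) >= max_memtable_size:
                flushed.append(sorted(memtable))
                memtable = []
    if memtable:
        flushed.append(sorted(memtable))
    return ([fast_path] if fast_path else []) + flushed
```

Every returned bucket is sorted, and the number of buckets grows only with the amount of out-of-order data, not with the total number of partitions.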
Asias He	9e8fc63585	repair: Fix range intersection for start_token and end_token option The range::subtract() was used as a trick to implement range::intersection() when intersection was not available at that time. The following code is problematic: dht::token_range given_range_complement(tok_end, tok_start); because dht::token_range should be non-wrapping range. To fix, use compat::unwrap_into() to unwrap the range and use the range::intersection() to calculate the intersection. Note even if the given_range_complement is problematic, current code generates correct intersections. Example 1: $ curl -X POST 'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=5&endToken=100' [shard 0] repair - starting user-requested repair for keyspace keyspace1, repair id [id=1, uuid=aa71a192-5967-4f05-99b8-5febd9d81d50], options {{ endToken -> 100}, { startToken -> 5}} [shard 0] repair - start=5, end=100, given_range_complement=(100, 5], wrange=(100, 5], is_wrap_around=true, ranges={(-inf, -7612759882658906007], (-7612759882658906007, -6766703710995023384], (-6766703710995023384, 2918449800065200439], (2918449800065200439, 8039072586540417979], (8039072586540417979, +inf)}, intersections={(5, 100]} [shard 0] repair - New method intersections={(5, 100]} Example 2: $ curl -X POST 'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=100&endToken=5' [shard 0] repair - starting user-requested repair for keyspace keyspace1, repair id [id=1, uuid=f6076438-015c-4bdc-8ebd-0a55664365fa], options {{ endToken -> 5}, { startToken -> 100}} [shard 0] repair - start=100, end=5, given_range_complement=(5, 100], wrange=(5, 100], is_wrap_around=false, ranges={(-inf, -7612759882658906007], (-7612759882658906007, -6766703710995023384], (-6766703710995023384, 2918449800065200439], (2918449800065200439, 8039072586540417979], (8039072586540417979, +inf)}, intersections={ (-inf, -7612759882658906007], (-7612759882658906007, -6766703710995023384], 
(-6766703710995023384, 5], (100, 2918449800065200439], (2918449800065200439, 8039072586540417979], (8039072586540417979, +inf)} [shard 0] repair - New method intersections={ (-inf, -7612759882658906007], (-7612759882658906007, -6766703710995023384], (-6766703710995023384, 5], (100, 2918449800065200439], (2918449800065200439, 8039072586540417979], (8039072586540417979, +inf)} Fixes #9560 Closes #9561	2021-11-01 12:43:49 +02:00
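The unwrap-then-intersect approach described in the commit above can be sketched in Python as follows. This is a simplified model, not Scylla's `compat::unwrap_into()` or `range::intersection()`: ranges are `(start, end]` tuples over integer tokens, infinities are represented with `math.inf`, and the function names are hypothetical.

```python
import math


def unwrap(start, end):
    """Turn a possibly wrapping (start, end] range into non-wrapping ones."""
    if start < end:
        return [(start, end)]
    # A wrapping range covers the tail of the ring and then its head.
    return [(start, math.inf), (-math.inf, end)]


def intersect(a, b):
    """Intersect two non-wrapping (start, end] ranges; None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def restrict(ranges, start_token, end_token):
    """Restrict the given ranges to the (start_token, end_token] selection."""
    pieces = unwrap(start_token, end_token)
    hits = []
    for r in ranges:
        for p in pieces:
            hit = intersect(r, p)
            if hit is not None:
                hits.append(hit)
    return sorted(hits)
```

Applied to the five ranges from the log output above, `startToken=5&endToken=100` yields the single range `(5, 100]`, while `startToken=100&endToken=5` yields the six ranges listed in Example 2.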
Nadav Har'El	e6d17d8de2	test/cql-pytest: remove "xfail" label on a reproducer for a fixed bug The two cql-pytest tests: test_frozen_collection.py::test_wrong_set_order_in_nested test_frozen_collection.py::test_wrong_set_order_in_nested_2 Which used to fail, and therefore marked "xfail", got fixed by commit `5589f348e7` ("cql3: expr: Implement evaluate(expr::bind_variable)"). That commit made the handling of bound variables in prepared statements more rigorous, and in particular made sure that sets are re-sorted not only if they are at the top level of the value (as happened in the old code), but also if they are nested inside some other container. This explains the surprising fact that we could only reproduce the bug with prepared statements, and only with nested sets - while top-level sets worked correctly. As the tests no longer failed and the bug tested by them really did get fixed, in this patch we remove the "xfail" marker from these tests. Closes #7856. This issue was really fixed by the aforementioned commit, but let's close it now. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211029221731.1113554-1-nyh@scylladb.com>	2021-11-01 10:29:32 +02:00
Avi Kivity	1aff7d19c2	treewide: replace seastar::fmt_print() with fmt::print() We shouldn't be using Seastar as a text formatting library; that's not its focus. Use fmt directly instead. fmt::print() doesn't return the output stream which is a minor inconvenience, but that's life. Closes #9556	2021-11-01 10:05:16 +02:00
Avi Kivity	bd0b573a92	Merge 'Partition based splitting writer exception safety' from Botond Dénes The partition based splitting writer (used by scrub) was found to be exception-unsafe, converting an `std::bad_alloc` to an assert failure. This series fixes the problem and adds a unit test checking the exception safety against `std::bad_alloc`:s fixing any other related problems found. Fixes: https://github.com/scylladb/scylla/issues/9452 Closes #9453 * github.com:scylladb/scylla: test: mutation_writer_test: add exception safety test for segregate_by_partition() mutation_writer: segregate_by_partition(): make exception safe mutation_reader: queue_reader_handle: make abandoned() exception safe mutation_writer: feed_writers(): make it a coroutine mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement	2021-10-31 21:15:19 +02:00
Takuya ASADA	c9499230c3	docker: add stopwaitsecs We need stopwaitsecs just like we do TimeoutStopSec=900 on scylla-server.service, to avoid timeout on scylla-server shutdown. Fixes #9485 Closes #9545	2021-10-31 20:38:10 +02:00
Vlad Zolotarov	79b0654d60	time_window_compaction_strategy: put expired sstables in a separate compaction task It's much more efficient to have a separate compaction task that consists entirely of expired sstables, and to make sure it gets a unique "weight", than to mix expired sstables with non-expired sstables, which adds unpredictable latency to the eviction of an expired sstable. This change also improves the visibility of eviction events because now they are always going to appear in the log as compactions that compact into an empty set. Fixes #9533 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Closes #9534	2021-10-31 17:54:40 +02:00
Nadav Har'El	6ae0ea0c48	alternator: return the correct Content-Type header Although the DynamoDB API responses are JSON, additional conventions apply to these responses - such as how error codes are encoded in JSON. For this reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead of the standard `application/json` in its responses. Until this patch, Scylla used `application/json` in its responses. This unexpected content-type didn't bother any of the AWS libraries which we tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27). Moreover, we should return the x-amz-json-1.0 content type for future proofing: It turns out that AWS already defined x-amz-json-1.1 - see: https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html The 1.1 content type differs (only) in how it encodes error replies. If one day DynamoDB starts to use this new reply format (it doesn't yet) and if DynamoDB libraries need to differentiate between the two reply formats, Alternator had better return the right one. This patch also includes a new test that the Content-Type header is returned with the expected value. The test passes on DynamoDB, and after this patch it starts to pass on Alternator as well. Fixes #9554. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>	2021-10-31 10:50:25 +01:00
Raphael S. Carvalho	2bf47c902e	cql: set configurable restriction of DateTieredCompactionStrategy to warn by default Setting a value of "warn" will still allow the create or alter commands, but will warn the user, with a message that will appear both at the log and also at cqlsh for example. This is another step towards deprecating DTCS. Users need to know we're moving towards this direction, and setting the default value to warn is needed for this. Next step is to set it to false, and finally remove it from the code base. Refs #8914. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211029184503.102936-1-raphaelsc@scylladb.com>	2021-10-31 09:28:17 +02:00
Nadav Har'El	034f79cfb4	alternator: make api_error an std::exception Objects of type "api_error" are used in Alternator when throwing an error which will be reported as-is to the user as part of the official DynamoDB protocol. Although api_error objects are often thrown, the api_error class was not derived from std::exception, because that's not necessary in C++. However, it is useful for this exception to derive from std::exception, so this is what this patch does. It is useful for api_error to inherit from std::exception because then our logging and debugging code knows how to print this exception with all its details. All we need to do is to implement a what() virtual function for api_error. Before this patch, logging an api_error just logs the type's name (i.e., the string "api_error"). After this patch, we get the full information stored in the api_error - the error's type and its message. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211017150555.225464-1-nyh@scylladb.com>	2021-10-29 10:23:55 +03:00
Nadav Har'El	666017f2f0	Merge 'Convert last uses of sprint() to fmt::format()' from Avi Kivity sprint() uses the printf-style formatting language while most of our code uses the Python-derived format language from fmt::format(). The last mass conversion of sprint() to fmt (in `1129134a4a`) missed some callers (principally those that were on multiple lines, and so the automatic converter missed them). Convert the remainder to fmt::format(), and some sprintf() and printf() calls, so we have just one format language in the code base. Seastar::sprint() ought to be deprecated and removed. Test: unit (dev) Closes #9529 * github.com:scylladb/scylla: utils: logalloc: convert debug printf to fmt::print() utils: convert fmt::fprintf() to fmt::print() main: convert fprint() to fmt::print() compress: convert fmt::sprintf() to fmt::format() tracing: replace seastar::sprint() with fmt::format() thrift: replace seastar::sprint() with fmt::format() test: replace seastar::sprint() with fmt::format() streaming: replace seastar::sprint() with fmt::format() storage_service: replace seastar::sprint() with fmt::format() repair: replace seastar::sprint() with fmt::format() redis: replace seastar::sprint() with fmt::format() locator: replace seastar::sprint() with fmt::format() db: replace seastar::sprint() with fmt::format() cql3: replace seastar::sprint() with fmt::format() cdc: replace seastar::sprint() with fmt::format() auth: replace seastar::sprint() with fmt::format()	2021-10-28 22:33:23 +03:00
Piotr Sarna	0b11771731	alternator: decouple auth from CQL query processor Alternator auth module used to piggy-back on top of CQL query processor to retrieve authentication data, but it's no longer the case. Instead, storage proxy is used directly. Closes #9538	2021-10-28 21:55:56 +03:00
Benny Halevy	a2fc3345bd	storage_service: futurize storage_service::describe_ring Convert storage_service::describe_ring to a coroutine to prevent reactor stalls as seen in #9280. Fixes #9280 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #9282	2021-10-28 16:51:57 +03:00
Avi Kivity	1e1e4f4934	Update abseil submodule * abseil 9c6a50f...f70eada (122): > Fix over-aligned layout test with older gcc compilers (#1049) > Export of internal Abseil changes > Initial support for Haiku (#1045) > Export of internal Abseil changes > Export of internal Abseil changes > Remove bazelbuild/rules_cc dependency (#1038) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Use FreeBSD macro definition for ElfW macro for compatibility. (#1037) > Export of internal Abseil changes > Fix hashing on big endian platforms (#1028) > Fix typedef of sig_t on AIX (#1030) > Export of internal Abseil changes > Fixed typo `constuct` to `construct` in 3 places. (#1022) > Export of internal Abseil changes > Export of internal Abseil changes > Initial support for AIX (#1021) > Export of internal Abseil changes > Update from_chars documentation with regard to whitespace (#1020) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Include immintrin.h instead of wmmintrin.h (#1015) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add -Wno-unknown-warning-option to ABSL_LLVM_FLAGS to disable warnings on unknown warning flags. 
(#1008) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add missing ABSL_DLL for a few functions (#1002) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Simplifies the construction of the value returned by GenerateRealFromBits() (#994) > CMake: option to use cxx_std_11 (minimum) that propagates. (#986) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix Bazel build on aarch64 (#984) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > CMake: add option to use Google Test already installed on system (#969) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Use CMAKE_INSTALL_FULL_{LIBDIR,INCLUDEDIR}. 
(#963) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Uses alignas for portability in dynamic_annotations.h (#947) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Call FailureSignalHandlerOptions.writenfn with nullptr at the end (#938) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add missing `add_subdirectory()` call for "cleanup" (#925) > Allowing to change the MSVC runtime (#921) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix C++/CLI build problem (#916) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add support for more Linux architectures (#904) > Export of internal Abseil changes > Export of internal Abseil changes > Add support for m68k (#900) > Add support for sparc and sparc64 (#899) > Fix uc_mcontext register access on 32-bit PowerPC (#898) > Export of internal Abseil changes > Export of internal Abseil changes > Export of 
internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes	2021-10-28 16:22:18 +03:00
Benny Halevy	531e32957d	compaction: time_window_compaction_strategy: get_reshaping_job: consider disjointness only when trimming With `062436829c`, we return all input sstables in strict mode if they are disjoint even if they don't need reshaping at all. This leads to an infinite reshape loop when uploading sstables with TWCS. The optimization for disjoint sstables is worth it also in relaxed mode, so this change first makes sorting of the input sstables by first_key order independent of reshape_mode, and then it adds a check for sstable_set_overlapping_count before trimming either the multi_window vector or any single_window bucket such that we don't trim the list if the candidates are disjoint. Adjust twcs_reshape_with_disjoint_set_test accordingly. And also add some debug logging in time_window_compaction_strategy::get_reshaping_job so one can figure out what's going on there. Test: unit(dev) DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>	2021-10-28 14:35:51 +03:00
Pavel Emelyanov	3c0a48ac38	Merge 'cdc stream generation: dont abandon stream description rewrite future' from Avi Kivity The cdc stream rewrite code launches work in the background. It's careful to make sure all the dependencies are preserved via shared pointers, but there are implicit dependencies (system_keyspace) that are not. To fix this problem, this series changes the lifetime of the background work from unbounded to be bounded by the lifetime of storage_service. [ xemul: Not the storage_service, but the cdc_generation_service, now all the dances with rewriting and its futures happen inside the cdc part. Storage_service coroutinization and call to .stop() from main are here for several reasons - They are awesome - Splitting a PR into two is frustrating (as compared to ML threads) - Shia LaBeouf is good motivation speaker ] As explained in the patches, I don't expect this to be a real problem and the series just paves the way for making system_keyspace an explicit dependency. Test: unit (dev) Closes #9356 * github.com:scylladb/scylla: cdc: don't allow background streams description rewrite to delay too far storage_service: coroutinize stop() main: stop storage_service on shutdown	2021-10-28 12:36:36 +03:00
Botond Dénes	7c95bd3343	Merge 'Rename 'system.status' and 'system.describe_ring' virtual tables' from Avi Kivity 'system.status' and 'system.describe_ring' are imperfect names for what they do, so rename them. Fortunately they aren't exposed in any released version so there is no compatibility concern. Closes #9530 * github.com:scylladb/scylla: system_keyspace: rename 'system.describe_ring' to 'system.token_ring' system_keyspace: rename 'system.status' to 'system.cluster_status'	2021-10-28 11:46:20 +03:00
Avi Kivity	c30be50252	utils: logalloc: convert debug printf to fmt::print() Standardize on one format language.	2021-10-28 10:48:08 +03:00
Takuya ASADA	13ffe3c094	scylla_util.py: detect ephemeral/EBS disks correctly on Nitro System Currently, aws_instance.ephemeral_disks() returns both ephemeral disks and EBS disks on Nitro System. This is because both are attached as NVMe disks, so we need to add disk type detection code to the NVMe handling logic. Fixes #9440 Closes #9462	2021-10-28 08:58:25 +03:00
Piotr Sarna	f4cb8191fa	cql3: include system distributed tables in system stats Some time ago we started gathering stats for system tables in a separate class in order to be able to distinguish which queries come from the user - e.g. if the unpaged queries are internal or not. Originally, only local system tables were moved into this class, i.e. system and system_schema. It would make sense, however, to also include other internal keyspaces in this separate class - which includes system_distributed, system_traces, etc. Fixes #9380 Closes #9490	2021-10-28 08:58:25 +03:00
Avi Kivity	5e6e4aed53	Merge 'Add Scylla Sphinx Theme 1.0' from David Garcia Replaces https://github.com/scylladb/scylla/pull/9477 Related issue https://github.com/scylladb/sphinx-scylladb-theme/issues/133 Sphinx ScyllaDB Theme 1.0 is now released 🥳 We’ve made a number of updates to the look and feel of the theme to improve the overall user experience. You can read more about all notable changes [here](https://sphinx-theme.scylladb.com/stable/CHANGELOG#september-2021). This PR also cleans the file ``conf.py``, removing several unused options. 1. Clone this PR. For more information, see [Cloning pull requests locally](https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally). 2. Enter the docs folder, and run: ``` make preview ``` 3. Open http://127.0.0.1:5500/ with your favorite browser. You will see the docs with the new look and feel. Closes #9515 * github.com:scylladb/scylla: Review docs config fix runtime errors upgrade theme to v1.x	2021-10-28 08:58:25 +03:00
Raphael S. Carvalho	affa1d9b04	utils/estimated_histogram.hh: fix division-by-zero in mean() if mean() is called when there are no elements in the histogram, a runtime error will happen due to division-by-zero. approx_exponential_histogram::mean() handles it but for some reason we forgot to do the same for estimated_histogram. this problem was found when adding a unit test which calls mean() on an empty histogram. Fixes #9531. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211027142813.56969-1-raphaelsc@scylladb.com>	2021-10-28 08:58:25 +03:00
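The empty-histogram guard described in the commit above can be sketched in Python as follows. This is an illustration only, not the code in `utils/estimated_histogram.hh`: the function name `histogram_mean`, the parallel-list representation, and the choice to return 0 for an empty histogram are all assumptions.

```python
def histogram_mean(bucket_offsets, bucket_counts):
    """Return the count-weighted mean of the bucket offsets.

    Hypothetical sketch: an empty histogram yields 0 instead of dividing
    by zero.
    """
    total = sum(bucket_counts)
    if total == 0:
        # Guard: no elements recorded, so there is nothing to average.
        return 0
    weighted = sum(o * c for o, c in zip(bucket_offsets, bucket_counts))
    return weighted / total
```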
Benny Halevy	b79e9b7396	tools: scylla-sstable: improve error reporting when loading schema from file Throw a proper exception from do_load_schemas if parse_statements fails to parse the schema cql. Catch it in scylla-sstable main() function so it won't be reported as seastar - unhandled exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211027124032.1787347-1-bhalevy@scylladb.com>	2021-10-28 08:58:25 +03:00
Avi Kivity	5ea0940ca9	system_keyspace: rename 'system.describe_ring' to 'system.token_ring' Table names are usually nouns, so SELECT/INSERT statements sound natural: "SELECT * FROM pets". 'system.describe_ring' defies this convention. Rename it to 'system.token_ring' so selects are natural. The name is not in any released version, so we can safely rename it.	2021-10-27 17:32:37 +03:00
Avi Kivity	5b21e4eb83	system_keyspace: rename 'system.status' to 'system.cluster_status' 'system.status' is too generic, it doesn't explain the status of what. 'system.node_status' is also ambiguous (this node? all nodes?) so I picked 'system.cluster_status'. The internal name, nodetool_status_table, was even worse (we're not querying the status of nodetool!) but fortunately wasn't exposed. The name is not in any released version, so we can safely rename it.	2021-10-27 17:31:45 +03:00
Avi Kivity	379454c235	utils: convert fmt::fprintf() to fmt::print() Standardizing on a common format language.	2021-10-27 17:02:00 +03:00
Avi Kivity	7bcc0b8d8b	main: convert fprint() to fmt::print() fprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	0131ae6b5d	compress: convert fmt::sprintf() to fmt::format() Standardize on one format language.	2021-10-27 17:02:00 +03:00
Avi Kivity	3a0f2091d7	tracing: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	d1616a7643	thrift: replace seastar::sprint() with fmt::format() sprint() is obsolete. Note InvalidRequestException used sprint() with runtime format, so both it and its callers were updated.	2021-10-27 17:02:00 +03:00
Avi Kivity	27a2c74b64	test: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	7abd105d79	streaming: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	16f2eadfd0	storage_service: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	2fb406138c	repair: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	bfa4535ba5	redis: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	36919a4ed7	locator: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	d9d03383fa	db: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 17:02:00 +03:00
Avi Kivity	9424f6e12f	cql3: replace seastar::sprint() with fmt::format() sprint() is obsolete. Note some calls were to helper functions that use sprint(), not to sprint() directly, so both the helpers and the callers were modified.	2021-10-27 17:02:00 +03:00
Avi Kivity	6b02aa72e2	cdc: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 14:30:06 +03:00
Avi Kivity	b9cc9bad4c	auth: replace seastar::sprint() with fmt::format() sprint() is obsolete.	2021-10-27 14:29:32 +03:00
Botond Dénes	9ec55e054d	treewide: distinguish truncated frame errors We have two identical "Truncated frame" errors, at: * read_frame_size() in serialization_visitors.hh; * cql_server::connection::read_and_decompress_frame() in transport/server.cc; When such an exception is thrown, it is impossible to tell where it was thrown from and it doesn't have any further information contained in it (beyond the basic information it being thrown implies). This patch solves both problems: it makes the exception messages unique per location and it adds information about why it was thrown (the expected vs. real size of the frame). Ref: #9482 Closes #9520	2021-10-27 12:27:16 +02:00
Alejo Sanchez	0a63e72fa4	api: (minor) fix typo bool instead of boolean In definition for /column_family/major_compaction/{name} there is an incorrect use of "bool" instead of "boolean". Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #9516	2021-10-27 12:25:59 +02:00
Benny Halevy	a21b1fbb2f	large_data_handle: add sstable name to log messages Although the sstable name is part of the system.large_* records, it is not printed in the log. In particular, this is essential for the "too many rows" warning that currently does not record a row in any large_* table so we can't correlate it with a sstable. Fixes #9524 Test: unit(dev) DTest: wide_rows_test.py Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>	2021-10-27 10:53:11 +03:00
Botond Dénes	6a76e12768	mutation_partition: row: make row marker shadowing symmetric Currently row marker shadowing the shadowable tombstone is only checked in `apply(row_marker)`. This means that shadowing will only be checked if the shadowable tombstone and row marker are set in the correct order. This at the very least can cause flakiness in tests when a mutation produced just the right way has a shadowable tombstone that can be eliminated when the mutation is reconstructed in a different way, leading to artificial differences when comparing those mutations. This patch fixes this by checking shadowing in `apply(shadowable_tombstone)` too, making the shadowing check symmetric. There is still one vulnerability left: `row_marker& row_marker()`, which allows overwriting the marker without triggering the corresponding checks. We cannot remove this overload as it is used by compaction so we just add a comment to it warning that `maybe_shadow()` has to be manually invoked if it is used to mutate the marker (compaction takes care of that). A caller which didn't do the manual check is mutation_source_test: this patch updates it to use `apply(row_marker)` instead. Fixes: #9483 Tests: unit(dev) Closes #9519	2021-10-26 20:40:31 +02:00
Benny Halevy	5f513ed28b	view_builder: consumer: flush_fragments: close reader on error Make sure to close the reader created by flush_fragments if an exception occurs before it's moved to `populate_views`. Note that it is also ok to close the reader _after_ it has been moved, in case populate_views itself throws after closing the reader that was moved to it. For convenience flat_mutation_reader::close supports close-after-move. Fixes #9479 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211024164138.1100304-1-bhalevy@scylladb.com>	2021-10-24 19:53:31 +03:00
Benny Halevy	4062cd17e0	test: hashers_test: mutation_fragment_sanity_check: stop semaphore To stop the semaphore as required we need to run the test in a seastar thread. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211024053402.990142-1-bhalevy@scylladb.com>	2021-10-24 11:29:23 +03:00
David Garcia	ff56b7e43e	Review docs config	2021-10-22 13:34:56 +01:00
Botond Dénes	0d744fd3fa	test: mutation_writer_test: add exception safety test for segregate_by_partition()	2021-10-21 06:50:22 +03:00
Botond Dénes	2ca6552909	mutation_writer: segregate_by_partition(): make exception safe Close reader if feed_writer() fails in the setup phase.	2021-10-21 06:50:22 +03:00
Botond Dénes	6c8e98e33d	mutation_reader: queue_reader_handle: make abandoned() exception safe Allocating the exception might fail, terminating the application as `abandoned()` is called in a noexcept context. Handle this case by catching the bad-alloc and aborting the reader with that instead when this happens.	2021-10-21 06:50:22 +03:00
Botond Dénes	de55ab571b	mutation_writer: feed_writers(): make it a coroutine The current code leaks exceptional futures. Instead of attempting to fix, just convert to cleaner and exception-safe coroutines.	2021-10-21 06:50:22 +03:00
Botond Dénes	40ca728a20	mutation_writer: partition_based_splitting_writer: erase old bucket if we fail to create replacement So we don't attempt to close already closed bucket again in `partition_based_splitting_writer::close()`.	2021-10-21 06:50:22 +03:00
Michał Radwański	9caf85f64a	partition_snapshot_reader: do not accidentally copy schema Functions `upper_bound` and `lower_bound` had signatures: ``` template<typename T, typename... Args> static rows_iter_type lower_bound(const T& t, Args... args); ``` This caused a decay from `const schema&` to `schema` as one of the args, which in turn copied the schema in a fair number of the queries. Fix that by setting the parameter type to `Args&&`, which doesn't discard the reference. Fixes #9502 Closes #9507	2021-10-20 19:09:08 +03:00
Avi Kivity	a9951588b4	Update seastar submodule * seastar 994b4b5a0c...083898a172 (24): > Revert "memory: always allocate buf using "malloc" for non reactor" > Revert dpdk update to 21.08. > tutorial: Fix typos > queue: add back template requirement for element type to be nothrow move-constructible > Revert "queue: require element type to be nothrow move-constructible" > build: add the closing "-Wl,--no-whole-archive" to the ldflags > build: add -Wno-error=volatile to CXX_FLAGS > build: Include dpdk as a single object in libseastar.a > Merge: queue: cleanup exception handling > build: drop dpdk-specific machine architecture names > reactor: call memory::configure() before initialize dpdk > core/loop: parallel_for_each(): make entire function critical alloc section > Merge 'scheduling groups: Add compile parameter for setting max scheduling groups count at compile time' from Eliran Sinvani > test: coroutines_test: assign spinner lambda to local variable > shared_ptr: mark shared_from_this functions noexcept > lw_shared_ptr: mark shared_from_this functions noexcept > build: update download URL for Boost > Merge "build: build with dpdk v21.08" from Kefu > cpu_stall_detector: handle wraparounds in Linux perf_event ring buffer > entry_point.cc: default-initialize sigaction struct > reactor: s/gettid()/syscall(SYS_gettid)/ > memory: always allocate buf using "malloc" for non reactor > Revert "memory: always allocate buf using "malloc" for non reactor" > memory: always allocate buf using "malloc" for non reactor	2021-10-20 18:38:18 +03:00
Benny Halevy	0746b5add6	storage_service: replicate_to_all_cores: update all keyspaces Currently we update the effective_replication_map only on non-system keyspace, leaving the system keyspace, that uses the local replication strategy, with the empty replication_map, as it was first initialized. This may lead to a crash when get_ranges is called later as seen in #9494 where get_ranges was called from the perform_sstable_upgrade path. This change updates the effective_replication_map on all keyspaces rather than just on the non-system ones and adds a unit test that reproduces #9494 without the fix and passes with it. Fixes #9494 Test: unit(dev), database_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211020143217.243949-1-bhalevy@scylladb.com>	2021-10-20 17:54:23 +03:00
Calle Wilund	940058d25a	transport::server: Handle nested exceptions in cql execution/query Fixes #9491 CQL server, when encountering a "general" exception (i.e. not thrown by cql error checks), reports a wire error with simply the what() part of exception. However, if we have nested exceptions, we will most likely lose info here (hello encryption). General exception case should unwind exception and give back full, concatenated message to avoid confusion. Closes #9492	2021-10-20 17:54:17 +03:00
Nadav Har'El	e4a6569258	config: experimental flag UNUSED_CDC shouldn't be distinct from UNUSED When an experimental feature graduates from being experimental, we want to continue to allow the old "--experimental-features=..." option to work, in case some user's configuration uses it - just do nothing. The way we do it is to map in db::experimental_features_t::map() the feature's name to the UNUSED value - this way the feature's name is accepted, but doesn't change anything. When the CDC feature graduated from being experimental, a new bit UNUSED_CDC was introduced to do the same thing. This separate bit was not actually necessary - if we ever check for UNUSED_CDC bit anywhere in the code it means the flag isn't actually unused ;-) And we don't check it. So simplify the code by conflating UNUSED_CDC into UNUSED. This will also make it easy to build from db::experimental_features_t::map() a list of current experimental features - now it will simply be those that do not map to UNUSED. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211013105107.123544-1-nyh@scylladb.com>	2021-10-20 17:54:17 +03:00
Nadav Har'El	88afcc7fe3	Merge 'cql-pytest: Forbid deletions based on secondary index' from Piotr Sarna This series fixes a bug which allowed using a secondary index in a restriction for a DELETE statement, which resulted in generating incorrect slices and deleting the whole partition instead. Secondary indexes are not meant to be used for deletes, which this series enforces by marking the indexes as not queriable. It also comes with a reproducing test case, originally provided by @fee-mendes (thanks!). Fixes #9495 Tests: unit(release) Closes #9496 * github.com:scylladb/scylla: cql-pytest: add reproducer for deleting based on secondary index cql3: forbid querying indexes for deletions	2021-10-20 17:54:17 +03:00
Botond Dénes	995a41d422	test/perf/perf_sstable: add support for compaction strategies So the compaction perf of different compaction strategies can be compared. Data timestamps are diversified such that they fall into four different buckets if TWCS is used, in order to be able to stress the timestamp based splitting code path. Closes #9488	2021-10-20 17:54:17 +03:00
Benny Halevy	dc091fc952	effective_replication_map, abstract_replication_strategy: get_ranges: call on_internal_error in empty sorted_tokens case Accessing tm.sorted_tokens().back() causes undefined behavior if tm.sorted_tokens is empty. Check that first and throw/abort using on_internal_error in this case. This will prevent the segfault but it doesn't fix the root cause which is getting here with empty token_metadata. That will be fixed by the following patch. Refs #9494 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211019075710.1626808-1-bhalevy@scylladb.com>	2021-10-19 18:52:59 +03:00
Piotr Sarna	7c35d47690	cql3: make column names readable for invalid delete statements This commit makes the column names from an invalid delete statement human readable. Before that, they were printed in their hex representation, which is not convenient for debugging. Before: InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid where clause contains non PRIMARY KEY columns: 76616c" After: InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid where clause contains non PRIMARY KEY columns: val" Message-Id: <52923335e8837295fd5ba2dfd0921196e21f7f16.1634626777.git.sarna@scylladb.com>	2021-10-19 10:13:43 +03:00
Piotr Sarna	83722b5563	cql-pytest: add reproducer for deleting based on secondary index This commit adds a test case for a bug reported by Felipe <felipemendes@scylladb.com>. The bug involves trying to delete an entry from a partition based on a secondary index created on a column which is part of the compound clustering key, and the unfortunate result is that the whole partition gets wiped. Cassandra's behavior is in this case correct - deletion based on a secondary index column is not allowed. Refs #9495	2021-10-19 08:50:20 +02:00
Piotr Sarna	7e3649202e	cql3: forbid querying indexes for deletions Using secondary indexes for the purpose of a DELETE statement was never expected to be well-defined, but an edge case in #9495 showed that the index may sometimes be inadvertently used, which causes the whole partition to be deleted. In order to prevent such errors, it's now explicitly defined that an index is not queriable if it's going to be used for the purpose of a DELETE statement.	2021-10-19 08:49:58 +02:00
Raphael S. Carvalho	4271c4edcd	sstables: Fix metric currently_open_for_writing metric currently_open_for_writing, used to inform # of sstables opened for writing, holds the same value as total_open_for_writing. that means we aren't actually decreasing the counter, so it is bogus. Moved to sstable_writer, because sstable is used by writer to open files, which are then extracted from sstable object, and later the same object is reused for read-only mode. Fixes #9455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>	2021-10-18 18:29:33 +03:00
Avi Kivity	e44057d5e1	cdc: don't allow background streams description rewrite to delay too far If we're upgrading from an older version with the previous CDC streams format, we'll upgrade it in the background. Background update is needed since we need the cluster to be available when performing the upgrade, but at this point we're just starting a node, and may not succeed in forming a cluster before we shut down. However, running in the background is dangerous since the objects we use may stop existing. The code is careful to use reference counting, but this does not guarantee that other dependencies are still alive, especially since not all dependencies are expressed via constructor parameters. Fix by waiting for the rewrite work in generation_service::stop(). As long as generation_service is up, the required dependencies should be working too. Note that there is another change here besides limiting the background work: checks that were previously done in the foreground (limited to local tables) are now also done in the background. I don't think this has any impact. Note: I expect this to have no real impact. Any CDC users will have long since upgraded. This is just preparing for other patches that bring in other dependencies, which cannot be passed via reference counted pointers, so they expose the existing problem.	2021-10-18 16:56:59 +03:00
Kamil Braun	22061831c1	Merge 'cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy' from Benny Halevy It was auto-expanded only if the strategy name was the short "NetworkTopologyStrategy" name. Fixes #9302. Closes #9304. * 'prepare_options' of https://github.com/bhalevy/scylla: cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy abstract_replication_strategy: add to_qualified_class_name	2021-10-18 16:40:57 +03:00
Raphael S. Carvalho	ec1a55ffae	compaction/TWCS: reduce write amp for reshape of sstables spanning multiple windows TWCS can reshape at most 32 sstables spanning multiple windows, in a single compaction round. Which sstables are compacted together, when there are more than 32 sstables, is random. If sstables with overlapping windows are compacted together, then write amplification can be reduced because we may be able to push all the data to a window W in a single compaction round, so we'll not have to perform another compaction round later in W, to reduce its number of files. This is also very good to reduce the amount of transient file descriptors opened, because TWCS reshape first reshapes all sstables spanning multiple windows, so if all windows temporarily grow large in number of files, then there's a risk that file descriptors can be exhausted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211013203046.233540-3-raphaelsc@scylladb.com>	2021-10-18 16:40:57 +03:00
Raphael S. Carvalho	062436829c	compaction/TWCS: optimize reshape for disjoint sstables spanning multiple windows After `a4053dbb72`, data segregation is postponed to offstrategy, so reshape procedure is called with disjoint sstables which belong to different windows, so let's extend the optimization for disjoint sstables which span more than one window. In this way, write amplification is reduced for offstrategy compaction, as all disjoint sstables will be compacted at once. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>	2021-10-18 16:40:57 +03:00
Raphael S. Carvalho	aa4aba40aa	sstables: sstable_run: introduce estimate_droppable_tombstone_ratio Make it possible to estimate droppable tombstones for sstable runs. The result is averaged by number of fragments composing the run. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211014143424.353357-1-raphaelsc@scylladb.com>	2021-10-18 12:24:08 +03:00
Benny Halevy	b9aa92edd4	cql3: keyspace prepare_options: expand replication_factor also for fully qualified NetworkTopologyStrategy It was auto-expanded only if the strategy name was the short "NetworkTopologyStrategy" name. Fixes #9302 Test: cql_query_test.test_rf_expand(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-18 12:18:07 +03:00
Benny Halevy	e4dc81ec04	abstract_replication_strategy: add to_qualified_class_name And use it from cql3 check_restricted_replication_strategy and keyspace_metadata ctor that defined their own `replication_class_strategy`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-18 12:13:25 +03:00
Piotr Sarna	4bfaa7d9fc	Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani This mini series contains two fixes that are bundled together since the second one assumes that the first one exists (or it will not fix anything really...), the two problems were: 1. When certain operations are called on a service level controller which doesn't have its data accessor set, it can lead to a crash since some operations will still try to dereference the accessor pointer. 2. The cql environment test initialized the accessor with a sharded<system_distributed_data>& however this sharded class as itself is not initialized (sharded::start wasn't called), so for the same operations that were unsafe for null dereference the accessor will now crash for trying to access uninitialized sharded instance. Closes #9468 * github.com:scylladb/scylla: CQL test environment: Fix bad initialization order Service Level Controller: Fix possible dereference of a null pointer	2021-10-18 08:53:53 +02:00
Nadav Har'El	1d751491a3	test/alternator: recognize when Scylla crashes Before this patch, if Scylla crashes during some test in test/alternator, all tests after it will fail because they can't connect to Scylla - and we can get a report on hundreds of failures without a clear sign of where the real problem was. This patch introduces an autouse fixture (i.e., a fixture automatically used by every test) which tries to run a do-nothing health-check request after each test. If this health-check request fails, we conclude that Scylla crashed and report the test in which this happened - and exit pytest instead of failing a hundred more tests. The failure report looks something like this: ``` ! _pytest.outcomes.Exit: Scylla appears to have crashed in test test_batch.py::test_batch_get_item ! ``` And the entire test run fails. These extra health checks are not free, but they come fairly close to being free: In my tests I measured less than 0.1 seconds slowdown of the entire test suite (which has 618 tests) caused by the extra health checks. Fixes #9489 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211017123222.217559-1-nyh@scylladb.com>	2021-10-17 20:45:30 +03:00
Avi Kivity	4b9a34051c	storage_service: coroutinize stop() In preparation for adding more stuff, convert stop() to a coroutine to avoid an unreadable chain of continuations. The code uses a finally() block which might not be needed (since .join() should not fail). Rather than risking getting it wrong I wrapped it in a try/catch and added logging.	2021-10-17 18:02:08 +03:00
Avi Kivity	e6b34527c1	main: stop storage_service on shutdown Just like other services, storage_service needs to be stopped on shutdown. cql_test_env already stops it, so there is some precedent for it working. I tested a shutdown while cassandra-stress was running and it worked okay for a few trials.	2021-10-17 18:02:08 +03:00
Nadav Har'El	86e8979ff2	test/alternator, test/cql-pytest: enable specific experimental features Issue #9467 deprecated the blanket "--experimental" option which we used to enable all experimental Scylla features for testing, and suggests that individual experimental features should be enabled instead. So this is what we do in this patch for the Scylla-running scripts in test/alternator and test/cql-pytest: We need to enable UDF for the CQL tests, and to enable Alternator Streams and Alternator TTL for the Alternator tests. Refs #9467 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211012110312.719654-2-nyh@scylladb.com>	2021-10-15 16:36:35 +03:00
Nadav Har'El	ddba510e64	config: add name for the experimental Alternator TTL feature Earlier we added experimental (and very incomplete) support for Alternator's TTL feature, but forgot to set a name for this experimental feature. As a result, this feature can be enabled only with the blanket "--experimental" option and not with a specific "--experimental-features=..." option. Since issue #9467 deprecated the blanket "--experimental" option and users are encouraged to only enable specific experimental features, it is important that we have a name for it. So the name chosen in this patch is "alternator-ttl". Eventually this feature might evolve beyond Alternator-only, but for now, I think it's a good name and we'll probably graduate the experimental Alternator TTL feature before supporting CQL, so it will be a new experimental feature anyway. Refs #9467. db/config.cc Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211012110312.719654-1-nyh@scylladb.com>	2021-10-15 16:36:23 +03:00
Avi Kivity	acfe0a3803	build: reinstate -Wunknown-attributes The warning was disabled during the migration to clang, but now it appears unnecessary (perhaps clang added support for the attributes it did not have then). It is valuable for detecting misspelled attributes, so enable it again. Closes #9480	2021-10-14 14:26:56 +03:00
Tomasz Grabiec	cc56a971e8	database, treewide: Introduce partition_slice::is_reversed() Cleanup, reduces noise. Message-Id: <20211014093001.81479-1-tgrabiec@scylladb.com>	2021-10-14 12:39:16 +03:00
Nadav Har'El	cad039421a	config: automate help-string listing experimental features The help string from the "--experimental-features" command-line option lists the available experimental features, to help a user who might want to enable them. But this help string was manually written, and has since drifted from reality: * Two of the listed "experimental" features, cdc and lwt, have actually graduated from being experimental long ago. Although technically a user may still use the words "cdc" and "lwt" in the "experimental-features" parameter, doing so is pointless, and worse: This text in the help string can mislead a user into thinking that these two features are still experimental - while they are not! * One experimental feature - alternator-ttl - is missing from this list. Instead of updating the help string text now - and needing to do this again and again in the future as we change experimental features - what this patch does is to construct the list of features automatically from the map of supported feature names - excluding any features which map to UNUSED. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211013122635.132582-1-nyh@scylladb.com>	2021-10-14 10:39:58 +03:00
Avi Kivity	4f3b8f38e2	Merge "Add effective_replication_map" from Benny " The current api design of abstract_replication_strategy provides a can_yield parameter to calls that may stall when traversing the token metadata in O(n^2) and even in O(n) for a large number of token ranges. But, to use this option the caller must run in a seastar thread. It can't be used if the caller runs a coroutine or plain async tasks. Rather than keep adding threads (e.g. in storage_service::load_and_stream or storage_service::describe_ring), the series offers an infrastructure change: precalculating the token->endpoints map once, using an async task, and keeping the results in an `effective_replication_map` object. The latter can be used for efficient and stall-free calls, like get_natural_endpoints, or get_ranges/get_primary_range, replacing their equivalents in abstract_replication_strategy, and dropping the public abstract_replication_strategy::calculate_natural_endpoints and its internal cached_endpoints map. Other than the performance benefits of: 1. The current calls require running a thread to yield. Precalculating the map (using async task) allows us to use synchronous calls without stalling the reactor. 2. The replication maps can and should be shared between keyspaces that use the same replication strategy. (Will be sent as a follow-up to the series) The bigger benefits (courtesy of Avi Kivity) are laying the groundwork for: 1. atomic replication metadata - an operation can capture a replication map once, and then use consistent information from the map without worrying that it changes under its feet. We may even be able to s/inet_address/replica_ptr/ later. 2. establish boundaries on the use of replication information - by making a replication map not visible, and observing when its reference count drops to zero, we can tell when the new replication map is fully in use. 
When we start writing to a new node we'll be able to locate a point in time where all writes that were not aware of the new node were completed (this is the point where we should start streaming). Notes: * The get_natural_endpoints method that uses the effective_replication_map is still provided as a abstract_replication_strategy virtual method so that local_strategy can override it and provide natural endpoints for any search token, even in the absence of token_metadata, when called early-on, before token_metadata has been established. The effective_replication_map materializes the replication strategy over a given replication strategy options and token_metadata. Whenever either of those change for a keyspace, we make a new effective_replication_map and keep it in the keyspace for later use. Methods that depend on an ad-hoc token_metadata (e.g. during node operations like bootstrap or replace) are still provided by abstract_replication_strategy. TODO: - effective_replication_map registry - Move pending ranges from token_metadata to replication map - get rid of abstract_replication_strategy::get_range_addresses(token_metadata&) - calculate replication map and use it instead. 
Test: unit(dev, debug) Dtest: next-gating, bootstrap_test.py update_cluster_layout_tests.py alternator_tests.py -a 'dtest-full,!dtest-heavy' (release) " * tag 'effective_replication_strategy-v6' of github.com:bhalevy/scylla: (44 commits) effective_replication_map: add get_range_addresses abstract_replication_strategy: get rid of shared_token_metadata member and ctor param abstract_replication_strategy: recognized_options: pass const topology& abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map token_metadata: get rid of now-unused sync methods abstract_replication_strategy: get rid of do_calculate_natural_endpoints abstract_replication_strategy: futurize get_address_ranges abstract_replication_strategy: futurize get_range_addresses abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr) abstract_replication_strategy: move get_ranges and get_primary_ranges to effective_replication_map compaction_manager: pass owned_ranges via cleanup/upgrade options abstract_replication_strategy: get rid of cached_endpoints all replication strategies: get rid of do_get_natural_endpoints storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map storage_service: bootstrap: add log messages storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings shared_token_metadata: set: check version monotonicity token_metadata: use static ring version token_metadata: get rid of copy constructor and assignment operator ...	2021-10-13 20:28:30 +03:00
Tomasz Grabiec	d8832b9fd8	Merge 'Memtable make reversing reader' from Michał Radwański Make a reader that reads from memtable in reverse order. This draft PR includes two commits, out of which only the second is relevant for review. Described in #9133. Refs #1413. Closes #9174 * github.com:scylladb/scylla: partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible. memtable: enable native reversing partition_snapshot_reader: reverse ck_range when needed by Reversing memtable, partition_snapshot_reader: read from partition in reverse partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration partition_snapshot_reader: split responsibility of ck_range partition_snapshot_reader: separate _schema into _query_schema and _partition_schema query: reverse clustering_range test: cql_query_test: fix test_query_limit for reversed queries	2021-10-13 20:24:02 +03:00
Nadav Har'El	ee8dc6847c	scylla.yaml: refresh list of experimental features Our scylla.yaml contains a comment listing the available experimental features, supposedly helping a user who might want to enable them. I think the usefulness of this comment is dubious, but as long as we have one, let's at least make it accurate: * Two of the listed "experimental" features, cdc and lwt, have actually graduated from being experimental long ago. Although technically a user may still use the words "cdc" and "lwt" in the "experimental-features" list, doing so is pointless, and worse: This comment suggests that these two features are still experimental - while they are not! * One experimental feature - alternator-ttl - is missing from this list. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211013083247.13223-1-nyh@scylladb.com>	2021-10-13 20:24:02 +03:00
Benny Halevy	17296cba4b	effective_replication_map: add get_range_addresses Equivalent to abstract_replication_strategy get_range_addresses, yet synchronous, as it uses the precalculated map. Call it from storage_service::get_new_source_ranges and range_streamer::get_all_ranges_with_sources_for. Consequently, get_new_source_ranges and removenode_add_ranges can become synchronous too. Unfortunately we can't entirely get rid of abstract_replication_strategy::get_range_addresses as it's still needed by range_streamer::get_all_ranges_with_strict_sources_for. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	8c85197c6c	abstract_replication_strategy: get rid of shared_token_metadata member and ctor param It is not used any more. Methods either use the token_metadata_ptr in the effective_replication_map, or receive an ad-hoc token_metadata. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	91f2fd5f2c	abstract_replication_strategy: recognized_options: pass const topology& Prepare for deleting the _shared_token_metadata member. All we need for recognized_options is the topology (for network_topology_strategy). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	4d2561ff75	abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	d953e7b01a	token_metadata: get rid of now-unused sync methods Now that abstract_replication_strategy methods are all async clone_only_token_map_sync, and update_normal_tokens_sync are unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	bdce6f93ca	abstract_replication_strategy: get rid of do_calculate_natural_endpoints It is no longer in use. And with it, the virtual calculate_natural_endpoint_sync method of which it was the only caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	cbe58345b9	abstract_replication_strategy: futurize get_*address_ranges Remaining callers of get_address_ranges and get_pending_address_ranges are all either from a seastar thread or from a coroutine so we can make the methods always async and drop the can_yield param. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	91581ba23a	abstract_replication_strategy: futurize get_range_addresses All remaining use sites are called in a seastar thread so we drop the can_yield param and make get_range_addresses always async. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	3040e0a038	abstract_replication_strategy: futurize get_ranges(inet_address ep, token_metadata_ptr) It is called only from repair, in a thread, so it can be made always async and the need_preempt param can be dropped. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Benny Halevy	dfdc8d4ddb	abstract_replication_strategy: move get_ranges and get_primary_ranges* to effective_replication_map Provide a sync get_ranges method by effective_replication_map that uses the precalculated map to get all token ranges owned by or replicated on a given endpoint. Reuse do_get_ranges as common infrastructure for all 3 cases: get_ranges, get_primary_ranges, and get_primary_ranges_within_dc. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:09:51 +03:00
Laura Novich	23886b2219	fix runtime errors	2021-10-13 15:08:24 +03:00
Laura Novich	d3e4b15530	upgrade theme to v1.x	2021-10-13 14:56:27 +03:00
Benny Halevy	5483269dfb	compaction_manager: pass owned_ranges via cleanup/upgrade options So they can be easily computed using an async task before constructing the compaction object in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:17:46 +03:00
Benny Halevy	0e5bb94e84	abstract_replication_strategy: get rid of cached_endpoints Now that do_get_natural_endpoints is gone, the cached endpoints are no longer in use. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:15:34 +03:00
Benny Halevy	25227ab5ea	all replication strategies: get rid of do_get_natural_endpoints Now that all flavors of get_natural_endpoints methods were moved to effective_replication_map, do_get_natural_endpoints and its overrides are unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:13:51 +03:00
Benny Halevy	facd5035f1	storage_proxy: use effective_replication_map token_metadata_ptr along with endpoints Use the same token_metadata used for get_natural_endpoints_without_node_being_replaced where used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:11:43 +03:00
Benny Halevy	aab363753f	abstract_replication_strategy: move get_natural_endpoints_without_node_being_replaced to effective_replication_map Use the precalculated endpoints map there as well as the token_metadata_ptr. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:10:01 +03:00
Benny Halevy	548719aac1	storage_service: bootstrap: add log messages Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:07:59 +03:00
Benny Halevy	08fef2a702	storage_service: get_mutable_token_metadata_ptr: always invalidate_cached_rings We should invalidate the cached rings every time the token metadata changes, not only on topology changes, so that cached token/replication mappings are invalidated when the modified token_metadata is committed. Currently we can do without it (apparently) but this will become a requirement for keeping versions of the effective_replication_map in a registry, indexed by the token_metadata ring version, among other things. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:05:57 +03:00
Benny Halevy	bb0ea0b1c0	shared_token_metadata: set: check version monotonicity Setting the ring version backwards means it got out of sync. Possibly concurrent updates weren't serialized properly using token_metadata_lock / mutate_token_metadata. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:03:51 +03:00
Benny Halevy	43160abaec	token_metadata: use static ring version For generating unique _ring_version. Currently when we clone a mutable token_metadata_ptr it remains with the same _ring_version and the ring version is updated only when the topology changes. To be able to distinguish these transient copies from the ones that got applied, be stricter about the ring version and change it to a unique number using a static counter. Next patch will update the ring version (and consequently invalidate the cached_endpoints on the replication strategy) every time the token_metadata changes, not only when the topology changes. Note that the _cached_endpoints will go away once the transition to effective_replication_map is finished, so this will not degrade performance. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:03:17 +03:00
Benny Halevy	685f5e7704	token_metadata: get rid of copy constructor and assignment operator Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:00:55 +03:00
Benny Halevy	d74ecfbc29	abstract_replication_strategy: get rid of legacy get_natural_endpoints implementation Now that all users of it were converted to use the effective_replication_map, the legacy abstract_replication_strategy::get_natural_endpoints method can be deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:58:18 +03:00
Benny Halevy	4afe8cad3c	repair: use effective_replication_map to get_natural_endpoints Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:57:16 +03:00
Benny Halevy	cddd16f22d	db: view: use effective_replication_map to get_natural_endpoints Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:55:50 +03:00
Benny Halevy	96aa6161d8	db: hints manager: use effective_replication_map to get_natural_endpoints Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:54:52 +03:00
Benny Halevy	c10a439f6c	storage_service: optimize get_effective_replication_map multi-usage Currently, we call find_keyspace and then get_effective_replication_map on the _same_ keyspace to get_natural_endpoints for multiple tokens. Get the effective_replication_map once in these cases and use it for each token. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:53:18 +03:00
Benny Halevy	fdaa891332	storage_service, sstables_loader: use effective_replication_map to get_natural_endpoints Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:50:27 +03:00
Benny Halevy	4b838197e2	storage_service: update keyspaces effective_replication_map on token_metadata change Every time the token_metadata changes we need to update the effective_replication_map on all non-system keyspaces. Do that in replicate_to_all_cores after the updated token_metadata has been replicated to all cores. We first prepare and clone the token_metadata, then prepare and clone the new effective_replication_maps. Any failure at this stage is recoverable: it is handled via rollback and the exception is returned. Note that any failure to _apply_ the pending token_metadata or the effective_replication_map will cause scylla to abort. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:05:28 +03:00
Benny Halevy	3393df45eb	token_metadata, storage_service: unify token_metadata_lock and merge_lock. Serialize the metadata changes with keyspace create, update, or drop. This will become necessary in the following patch when we update the effective_replication_map on all keyspaces and we want instances on all shards to end up with the same replication map. Note that storage_service::keyspace_changed is called from the scheme_merge path so it already holds the merge_lock. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 13:01:25 +03:00
Benny Halevy	4cba7195ee	storage_service: coroutinize mutate_token_metadata And fold with_token_metadata_lock into it, as it's its only caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:59:58 +03:00
Benny Halevy	045806cae7	storage_service: replicate_to_all_cores: use local pending_token_metadata_ptr Rather than a _pending_token_metadata_ptr member in the storage_service class. This is now much easier, since the function was converted to a coroutine. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:58:30 +03:00
Benny Halevy	52f48f47f6	storage_service: coroutinize replicate_to_all_cores Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:57:05 +03:00
Benny Halevy	991a6a8664	keyspace: update_effective_replication_map And use it to get_natural_endpoints. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:55:34 +03:00
Benny Halevy	970b0a50b5	keyspace: futurize create_replication_strategy And functions that use it, like: keyspace::update_from database::update_keyspace database::create_in_memory_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:53:41 +03:00
Benny Halevy	eb752c3f69	test: network_topology_strategy_test: use effective_replication_map to get_natural_endpoints Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:53:09 +03:00
Benny Halevy	1e1d7d7df5	abstract_replication_strategy: introduce effective_replication_map effective_replication_map holds the full replication_map resulting from applying the effective replication strategy over the given token_metadata and replication_strategy_config_options. It is calculated once, in make_effective_replication_map(), and then it can be used for retrieving the endpoints/token_ranges synchronously from the precalculated map. A new virtual get_natural_endpoints(const token&, const effective_replication_map&) method has been added to abstract_replication_strategy so that local_strategy and everywhere_replication_strategy can override it as they may be needed before the token_metadata is established. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:53:03 +03:00
Benny Halevy	d96a67eb57	abstract_replication_strategy: use shared_ptr in registry Enable creating shared_ptr<BaseClass> in nonstatic_class_registry using BaseClass::ptr_type and use that for abstract_replication_strategy. While at it, also clean up compressor in that respect to define compressor::ptr_type as shared_ptr<compressor>, thus simplifying compressor_registry. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	4511c9acdb	database.hh: convert ifdef block to pragma once Besides being more modern and more efficient for the compiler, this #ifndef block confuses my editor that greys out the whole block. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	a1c573e6d3	abstract_replication_strategy: make calculate_natural_endpoints_sync private And with that rename calculate_natural_endpoints(const token& search_token, const token_metadata&, can_yield) to do_calculate_natural_endpoints and make it protected. With this patch, all its external users call the async version, so rename it back to calculate_natural_endpoints, and make calculate_natural_endpoints_sync private since it's being called only within abstract_replication_strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	a1098c0094	replication strategies: calculate_natural_endpoints: split into sync and async variants calculate_natural_endpoints_sync and _async are both provided temporarily until all users of them are converted to use the async version which will remain. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	32c7314b80	network_topology_strategy: refactor calculate_natural_endpoints Extract natural_endpoints_tracker out of calculate_natural_endpoints so we can easily split the function into sync and async variants. Test: network_topology_strategy_test(dev, debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	416531cce7	network_topology_strategy: use rslogger to debug-log configuration Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	330d9772d4	abstract_replication_strategy: move logger to locator namespace To be used by network_topology_strategy and later, by effective_replication_map_registry. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	7401d03e8c	abstract_replication_strategy: define replication_map Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Benny Halevy	5001d261d4	abstract_replication_strategy: define replication_strategy_config_options To be used for searching effective replication strategy instances. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 12:39:36 +03:00
Eliran Sinvani	56981f2259	CQL test environment: Fix bad initialization order The service level controller was initialized with a data accessor that uses the system distributed keyspace before the latter has been initialized. If this accessor is used (for example by calling service_level_controller::get_distributed_service_levels()) it will fail miserably and crash. Not initializing the data accessor doesn't mean the same thing since we can deal with such a call when the accessor is not initialized. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-12 13:27:59 +03:00
Eliran Sinvani	6d3e8055f9	Service Level Controller: Fix possible dereference of a null pointer If the service level controller doesn't have its data accessor set, calls for getting distributed information might dereference this unset pointer for the accessor. Here we add code that will return a result as if there is no data available to the accessor (a behaviour which is roughly equivalent to a null data accessor). Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-12 13:27:50 +03:00
Pavel Solodovnikov	8b917f7c99	db: mark `--experimental` option deprecated The documentation for --experimental config option states that it enables all experimental features, but this is no longer true, i.e.: raft feature is not enabled with it and should be explicitly enabled via `--experimental-features=raft` switch (we don't want to enable it by default alongside other features). Since the flag doesn't do what it's intended to, we should mark it as "deprecated", because documenting each exception (there could be more than only raft in the future) will be a burden and docs will constantly go out-of-sync with the code. Adjust the description for the option to reflect that, mark it "deprecated" and suggest using --experimental-features, instead. Fixes: #9467 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20211012093005.20871-2-pa.solodovnikov@scylladb.com>	2021-10-12 13:22:12 +03:00
Pavel Solodovnikov	162f1899e8	db: update the list of supported experimental features `raft` and `alternator-streams` features were missing from the description for `experimental-features` config flag. Update `scylla.yaml` template comments to reflect that, too. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20211012093005.20871-1-pa.solodovnikov@scylladb.com>	2021-10-12 13:22:11 +03:00
Avi Kivity	0d48c39cb3	Merge 'tools/scylla-sstable: allow opening sstables from any path' from Botond Dénes Currently it is required that sstables (in particular la/mx ones) are located at a valid path. This is required because `sstables::entry_descriptor::make_descriptor()` extracts the keyspace and table names from the sstable dir components. This PR relaxes this by using a newly introduced `sstables::entry_descriptor::make_descriptor()` overload which allows the caller to specify keyspace and table names, not necessitating these to be extracted from the path. Tests: unit(dev), manual(testing that `scylla-sstables` can indeed load sstables from invalid path) Closes #9466 * github.com:scylladb/scylla: tools/scylla-sstable: allow loading sstables from any path sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf	2021-10-12 12:50:11 +03:00
Takuya ASADA	06c28585f9	dist: raise fs.file-max and fs.nr_open to enough size for scylla Currently, we configure LimitNOFILE on scylla-server.service, but we don't configure fs.nr_open and fs.file-max. When fs.nr_open or fs.file-max are smaller than LimitNOFILE, we may fail to allocate FDs. To fix this issue, raise fs.file-max and fs.nr_open to enough size for scylla. Fixes #9461 Closes #9461	2021-10-12 12:47:35 +03:00
Botond Dénes	cc65c9d0da	compaction: scrub/segregate: adjust partition-estimate as buckets accumulate Scrub compaction in segregate mode can split the input sstable into as many as hundreds or even thousands of output sstables in the extreme case. But even at a few dozen output sstables, most of these will only have a few partitions with a few rows. These sstables however will still have their bloom filter allocated according to the original partition-count estimate, causing memory bloat or even OOM in the extreme case. This patch solves this by aggressively adjusting the partition count downwards after the second bucket has been created. Each subsequent bucket will halve the partition estimate, which will quickly reach 1. Fixes: #9463 Closes #9464	2021-10-12 12:44:42 +03:00
Botond Dénes	d535346a6e	tools/scylla-sstable: allow loading sstables from any path Currently it is required that sstables (in particular la/mx ones) are located at a valid path. This is required because `sstables::entry_descriptor::make_descriptor()` extracts the keyspace and table names from the sstable dir components. This patch relaxes this by using the freshly introduced `sstables::entry_descriptor::make_descriptor()` overload which allows the caller to specify keyspace and table names.	2021-10-12 11:47:58 +03:00
Botond Dénes	1b7b3a81e6	sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf Not necessitating these to be extracted from the sstable dir path. This practically allows for la/mx sstables at non-standard paths to be opened. This will be used by the `scylla-sstable` tool which wants to be flexible about where the sstables it opens are located.	2021-10-12 11:43:23 +03:00
Nadav Har'El	e4bc97349c	cql-pytest: XFAILing test was fixed by a Python driver fix Issue #8203 describes a bug in a long scan which returns a lot of empty pages (e.g., because most of the results are filtered out). We have two cql-pytest test cases that reproduced this bug - one for a whole-table scan and one for a single-partition scan. It turned out that the bug was not in the Scylla server, but actually in the Python driver which incorrectly stopped the iteration after an empty page even though this page did contain the "more pages" flag. This driver bug was already fixed in the Datastax driver (see `6ed53d9f70`), and in the Scylla fork of the driver (`1d9077d3f4`). So in this patch we drop the XFAIL, and if the driver is not new enough to contain this fix - the test is skipped. Since our Jenkins machines have the latest Scylla fork of the driver and it already contains this fix, these tests will not be skipped - and will run and should pass. Developers who run these tests on their development machine will see these tests either passing or skipped - depending on which version of the driver they have installed. Closes #8203 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211011113848.698935-1-nyh@scylladb.com>	2021-10-12 10:04:02 +02:00
Nadav Har'El	33f8ec09df	Merge 'treewide: improve compatibility with gcc 11' from Avi Kivity Our source base drifted away from gcc compatibility; this mostly restores the ability to build with gcc. An important exception is coroutines that have an initializer list [1]; this still doesn't work. We aim to switch back to gcc 11 if/when this gives us better C++ compatibility and performance. Test: unit (dev) [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056 Closes #9459 * github.com:scylladb/scylla: test: radix_tree_printer: avoid template specialization in class context test: raft: avoid ignored variable errors test: reader_concurrency_semaphore_test: isolate from namespace of source_location test: cql_query_test: drop unused lambda assert_replication_not_contains test: commitlog_test: don't use deprecated seastar::unaligned_cast test: adjust signed/unsigned comparisons in loops and boost tests build: silence some gcc 11 warnings sstables: processing_result_generator: make coroutine support palatable for C++20 compilers managed_bytes: avoid compile-time loop in converting constructor service: service_level_controller: drop unused variable sl_compare raft: disambiguate promise name in raft::active_read locator: azure_snitch: use full type name in definition of globals cql3: statements: create_service_level_statement: don't ignore replace_defaults() cql3: statement_restrictions: adjust call to std::vector deduction guide types: remove recursive constraint in deserialize_value cql3: restrictions: relax constraint on visitor_with_binary_operator_content treewide: handle switch statements that return cql3: expr: correct type of captured map value_type cdc: adjust type of streams_count alternator: disambiguate attrs_to_get in table_requests	2021-10-11 16:54:01 +03:00
Nadav Har'El	5e4c60e19a	Merge: Unload storage service from irrelevant APIs Merged patch series from Pavel Emelyanov: There's a long-term (well, likely mid-term already) goal to keep a single role for the storage_service, namely -- managing the state of a node in the ring. Then rename it once it happens to stop people from loading new stuff into storage_service. There are at least three REST API endpoints that stand in the way. 1. load_new_ss_tables. This part is moved to a new sharded sstables loader that wraps existing distributed_loader 2. view_build_statuses. Statuses are maintained by view_builder so must be retrieved from the same place 3. enable_|disable_auto_compaction. This is purely database knob that used to be such some time ago This change also removes view_update_generator from storage_service list of dependencies and leaves the system_distributed_keyspace be the start-only one (another not yet published branch makes use of it and removes s.d.ks from storage service at all). branch: https://github.com/xemul/scylla/tree/br-unload-storage-service-api-3 tests: unit(dev) refs: #5489 * 'br-unload-storage-service-api-3' of github.com:xemul/scylla: storage_service, api: Move set-tables-autocompaction back into API api: Fix indentation after previous patch api, database, storage_service: Unify auto-compaction toggle api: Remove storage service from new APIs view_builder: Accept view_build_statuses storage_service: Move view_build_statuses code api, storage_service: Keep view builder API handlers separate storage_service: Remove view update generator from sstables_loader: Accept the sstables loading code storage_service: Move the sstables loading code storage_service, api: Keep sstables loading API handlers separate sstables_loader: Introduce distributed_loader, utils: Move verify_owner_and_mode distributed_loader: Fix methods visibility	2021-10-11 15:22:06 +03:00
Kamil Braun	339b9bc38a	sstables: mx: partition_reversing_data_source: close internal data consumers `partition_reversing_data_source` uses `continuous_data_consumer`s internally (`partition_header_context`, `row_body_skipping_context`) which hold `input_stream`s opened to sstable data files. These `input_stream`s must be closed before destruction. Right now they would sometimes cause "Assertion `_reads_in_progress == 0' failed" on destruction. Close the `continuous_data_consumer`s before they are destroyed so they can close their `input_stream`s. Fixes #9444. Closes #9451	2021-10-11 12:35:54 +02:00
Pavel Emelyanov	f0b5ab1c61	storage_service, api: Move set-tables-autocompaction back into API The global autocompaction toggle is no longer tied to the storage service. It naturally belongs to the database, but is small and tidy enough not to pollute database methods and can be placed into the api/ dir itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:13:59 +03:00
Pavel Emelyanov	fece1a2f9f	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:13:56 +03:00
Pavel Emelyanov	c5128eea67	api, database, storage_service: Unify auto-compaction toggle There are two knobs here -- global and per-table one. Both were added without any synchronisation, but the former one was later fixed to become serialized and not to be available "too early". This patch unifies both toggles to be serialized with each other and not be enabled too early. The justification for this change is to move the global toggle out of the storage service, as it really belongs to the database, not the storage service. Respectively, the current synchronization, that depends on storage service internals, should be replaced with something else. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:12:39 +03:00
Pavel Emelyanov	c53c74258a	api: Remove storage service from new APIs The APIs that had been recently switched to using relevant services no longer need the storage service reference capture, so remove it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:11:52 +03:00
Pavel Emelyanov	c504361c15	view_builder: Accept view_build_statuses The code itself is already in relevant .cc file, now move it to the relevant class. The only significant change is where to get token metadata from. In its old location tokens were provided by the storage service itself, now when it's in the view builder there's no "native" place to get them from, however the rest of the view building code gets tokens from global storage proxy, so do the same here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:11:40 +03:00
Pavel Emelyanov	3b6e8c7d93	storage_service: Move view_build_statuses code This code belongs to view builder, so put it into its .cc. No changes, just move. This needs some ugly namespace breakage, but they will be patched away with the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:11:29 +03:00
Pavel Emelyanov	540c6fa5ae	api, storage_service: Keep view builder API handlers separate There's the 'storage_service/view_build_statuses' endpoint. Its handler code sits in the storage_service, but the functionality belongs purely to view_builder. Same as with sstables loader, detach the endpoint's API set/unset code, next patches will fix the handler to use view_builder. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:09:07 +03:00
Pavel Emelyanov	99d8994835	storage_service: Remove view update generator from It's not used by storage service any longer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:09:02 +03:00
Pavel Emelyanov	68ecec0197	sstables_loader: Accept the sstables loading code The code was moved in the relevant .cc file by previous patch, now make it sit in the relevant class. One "significant" change is that the messaging service is available by local reference already, not by the sharded one. Other dependencies are already satisfied by the patch that introduced the sstables_loader class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:08:21 +03:00
Pavel Emelyanov	42f83f6669	storage_service: Move the sstables loading code Just cut-n-paste the code into sstables_loader.cc. No other changes but replace storage service logger with its own one. For now the code stays in storage_service class, but next patch will relocate the code into the sstables_loader one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:07:39 +03:00
Pavel Emelyanov	7e49359720	storage_service, api: Keep sstables loading API handlers separate Right now the handlers sit in one boat with the rest of the storage service APIs. Next patches will switch this particular endpoint to use previously introduced sstables_loader, before doing so here's the respective API set/unset stubs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:05:45 +03:00
Pavel Emelyanov	13ab22d3c7	sstables_loader: Introduce It's a sharded service that will be responsible for loading sstables via the respective REST API (the endpoint in question is in turn handling the nodetool refresh command). This patch adds the loader, equips it with the needed dependencies and starts/stops one from main. Next patches will move the loader code from storage_service into this new one. The list of dependencies that are introduced in this patch is exactly what's needed by the mentioned code move. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:03:54 +03:00
Pavel Emelyanov	581382edad	distributed_loader, utils: Move verify_owner_and_mode This method sits in dist.loader, but really belongs to util/ as it just works on an "abstract" path and doesn't need to know what this path is about. Another sign of layering violation is the inclusion of dist.loader code into util/ stuff. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:03:51 +03:00
Pavel Emelyanov	e106e0571a	distributed_loader: Fix methods visibility Most of the methods are marked public, but only a few of them should be. Test needs a bit more, however, so the distributed_loader_for_tests is declared as friend class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-11 11:03:29 +03:00
Michał Radwański	c04dffbc01	partition_snapshot_reader: pop_range_tombstone returns reference (instead of value) when possible.	2021-10-10 20:38:18 +02:00
Michał Radwański	771f3b12bd	memtable: enable native reversing This commit consists of changes, which need to reside in a single commit, so that the tests pass on each of the commits. 1. Remove do_make_flat_reader which disabled reverse reads by making the slice a forward one. Remove call to get_ranges which would do superfluous reversal of clustering ranges. 2. test: cql_query_test: remove expectation that the test_query_limit fails for reversed queries, since reversed queries no longer require linear memory wrt. the result size, when paginated.	2021-10-10 20:38:18 +02:00
Michał Radwański	cc5ea66957	partition_snapshot_reader: reverse ck_range when needed by Reversing Previous commits made it possible to split the responsibility of two kinds of clustering key ranges in read_next and next_range_tombstone. Here, the actual reversal takes place and we start passing the actually reversed ck_range, if Reversing. This reversed ck_range is stored as a class member, so that the reversal happens just once for each range.	2021-10-10 20:38:18 +02:00
Michał Radwański	5449982a0b	memtable, partition_snapshot_reader: read from partition in reverse In this commit, I add the ability to read from partition snapshots in reverse order. Before these changes, a reverse read from memtable has been handled as follows: - A reader higher in the hierarchy of readers performs a read from memtable in the forward order, which is not aware of the intention to read in reverse. - Later, some reader reverses the received mutation fragments. Memtable decides based on options in `slice`, whether to read forward or in reverse. Note that previous commit creates a killswitch which clears the `reverse` option from slice before running the logic of whether to reverse or not. This is due to the fact that this commit doesn't contain all the required code changes. The reversing partition snapshot reader maintains two schemas - one that is the reversed schema (called _query_schema) for the output, and the other one (forward one, called _snapshot_schema), which is used to access the memtable tree (which needs to be the same as the schema used to create memtable). The `partition_slice` provided by callers is provided in 'half-reversed' format for reversed queries, where the order of clustering ranges is reversed, but the ranges themselves are not.	2021-10-10 20:38:18 +02:00
Michał Radwański	6813c39927	partition_snapshot_reader: rows_position and rows_iter_type supporting reverse iteration Iterating in reverse is useful for native reverse memtable reader.	2021-10-10 20:38:18 +02:00
Avi Kivity	ef45a208ef	test: radix_tree_printer: avoid template specialization in class context gcc complains that it's illegal. It's unnecessary too - we can replace it with a simple overload.	2021-10-10 18:17:53 +03:00
Avi Kivity	11cc772388	test: raft: avoid ignored variable errors Avoid instantiating unused variables, and in one case ignore it, to avoid a gcc warning.	2021-10-10 18:17:53 +03:00
Avi Kivity	cdb50b1972	test: reader_concurrency_semaphore_test: isolate from namespace of source_location More modern gcc uses std::source_location instead of std::experimental::source_location. Rely on seastar::compat to get it right for us.	2021-10-10 18:17:53 +03:00
Avi Kivity	a08bcc0528	test: cql_query_test: drop unused lambda assert_replication_not_contains gcc complains that it exists.	2021-10-10 18:17:53 +03:00
Avi Kivity	9166d1ab1d	test: commitlog_test: don't use deprecated seastar::unaligned_cast unaligned_cast is deprecated, and gcc complains that it violates strict aliasing rules. Switch to std::copy_n() instead.	2021-10-10 18:17:53 +03:00
Avi Kivity	9907303bf5	test: adjust signed/unsigned comparisons in loops and boost tests gcc complains about comparing a signed loop induction variable with an unsigned limit, or comparing an expected value and measured value. Fix by using unsigned throughout, except in one case where the signed value was needed for the data_value constructor.	2021-10-10 18:16:50 +03:00
Avi Kivity	15ffd84473	build: silence some gcc 11 warnings These warnings are valuable, but limit the noise for now by disabling them.	2021-10-10 18:16:50 +03:00
Avi Kivity	029560c232	sstables: processing_result_generator: make coroutine support palatable for C++20 compilers clang implements the coroutine technical specification, in namespace std::experimental. gcc implements C++20 coroutines, in namespace std. Detect which one is in use and select the namespace accordingly.	2021-10-10 18:16:50 +03:00
Avi Kivity	c38f18163e	managed_bytes: avoid compile-time loop in converting constructor managed_bytes_basic_view is a template with a constructor that converts from one instantiation of the template to another. Unfortunately when gcc encounters the associated constraint, it instantiates the template which forces it to evaluate the constraint again, sending it into a loop. Fix that by making the converting constructor a template itself, delaying instantiation. The constraint is strengthened so the set of types on which the constructor works is unchanged.	2021-10-10 18:16:50 +03:00
Avi Kivity	f6d59c33ff	service: service_level_controller: drop unused variable sl_compare Reported by gcc 11.	2021-10-10 18:16:50 +03:00
Avi Kivity	cd4af0c722	raft: disambiguate promise name in raft::active_read gcc complains that the name 'promise' changes meaning (from type to variable) within active_read. Help it by disambiguating the use as type.	2021-10-10 18:16:50 +03:00
Avi Kivity	3f9ec5302a	locator: azure_snitch: use full type name in definition of globals Some globals in azure_snitch use std::string in the declaration and auto in the definition. gcc 11 complains. I don't know if it's correct, but it's easy to use the type in both declaration and definition.	2021-10-10 18:16:50 +03:00
Avi Kivity	d83b565938	cql3: statements: create_service_level_statement: don't ignore replace_defaults() We call replace_defaults on an object named 'slo', but then ignore it. Use the new object that replace_defaults() returned. Reported by gcc 11.	2021-10-10 18:16:50 +03:00
Avi Kivity	25f8e9c078	cql3: statement_restrictions: adjust call to std::vector deduction guide gcc 11 has a hard time parsing a deduction guide use with braced initializer. The bug [1] was already fixed in gcc 12, and I've requested a backport, but reduce friction meanwhile by switching to a form that works in gcc 11. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89062	2021-10-10 18:16:50 +03:00
Avi Kivity	df73d12272	types: remove recursive constraint in deserialize_value deserialize_value() has a constraint that depends on another deserialize_value() implementation. Apparently gcc wants to instantiate the deserialize_value() instance we're constraining while evaluating the constraint, leading to a loop. Since this deserialize_value() is just an internal helper, drop the constraint rather than fighting it.	2021-10-10 18:16:50 +03:00
Avi Kivity	58a0e80021	cql3: restrictions: relax constraint on visitor_with_binary_operator_content We require that v.current_binary_operator is a 'const binary_operator', but it's really a 'const binary_operator&'. Relax the constraint so it works with both gcc and clang.	2021-10-10 18:16:50 +03:00
Avi Kivity	fd8beeaea9	treewide: handle switch statements that return A switch statement where every case returns triggers a gcc warning if the surrounding function doesn't return/abort. Fix by adding an abort(). The abort() will never trigger since we have a warning on unhandled switch cases.	2021-10-10 18:16:50 +03:00
Michał Radwański	a672b8b86f	partition_snapshot_reader: split responsibility of ck_range Previously, next_range_tombstone took as an argument a clustering key range, which served two purposes. One was for accessing only specified key ranges from the partition, the other was for deciding in which order the mutation fragments should be emitted. This commit separates these responsibilities, since with the advent of native memtable reader, these two responsibilities are no longer common. The split is propagated to the rest of the partition_snapshot_reader.hh to avoid confusion.	2021-10-07 17:04:44 +02:00
Michał Radwański	fc51d2cc8c	partition_snapshot_reader: separate _schema into _query_schema and _partition_schema After memtable starts supporting reverse order queries, the schema provided to the readers will be reversed (reverse clustering order). Reading from memtable in reverse requires two schemas - one to access the memtable internal data structures (_partition_schema), and the other one (_query_schema), the schema imposing clustering order on returned mutation fragments. This commit prepares for introduction of native reverse queries for memtable, by separating these responsibilities. For now, they are still initialized with the schema passed from query.	2021-10-07 17:04:44 +02:00
Mikołaj Sielużycki	235c38e78f	sstables, gdb: Retire usage of sstable_tracker sstables_manager supersedes previous implementation of sstables_tracker for tracking lifetime of the tables. Update scylla-gdb.py to use sstables_manager in a backwards compatible way, as sstables_manager is not available in Scylla Enterprise 2020.1. Add explicit test for "scylla sstables" command, as previously only "scylla active-sstables" was tested. Closes #9439	2021-10-07 14:40:47 +02:00
Piotr Sarna	59bd25d1ea	transport: respond with overloaded exception during shedding This commit makes shedding always respond - with overloaded exception, instead of ignoring the request. Fixes #9442 Closes #9443	2021-10-07 15:38:40 +03:00
Nadav Har'El	d1505762df	cql-pytest: add to README an example of repeating a test pytest supports - if the "repeat" extension is installed - a convenient and efficient way to repeat the same test (or all of them) multiple times. Since it's very useful, let's document it in cql-pytest/README.md. By the way, our test.py also has a "--repeat" option, but it can only run all cql-pytest tests, not just repeat a single small test, and it is also slower (and arguably, different) because it restarts Scylla instead of running a test 100 times on the same Scylla. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211007122146.624210-1-nyh@scylladb.com>	2021-10-07 15:30:41 +03:00
Michael Livshin	e88891a8af	avoid race between compaction and table stop Also add a debug-only compaction-manager-side assertion that tests that no new compaction tasks were submitted for a table that is being removed (debug-only because not constant-time). Fixes #9448. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>	2021-10-07 14:36:39 +03:00
Kamil Braun	96f18c4bb0	test: test_sstable_reversing_reader_random_schema: fix the workaround for #9352 The test generates random mutations and eliminates mutations whose keys tokenize to 0, in particular it eliminates mutations with empty partition keys (which should not end up in sstables). However it would do that after using the randomly generated mutations to create their reversed versions. So the reversed versions of mutations with empty partition keys would stay. Fix by placing the workaround earlier in the test. Closes #9447	2021-10-07 14:01:43 +03:00
Raphael S. Carvalho	59693e6da3	compaction_manager: make rewrite_sstables() bail out when asked to stop rewrite_sstables() can be asked to stop either on shutdown or on a user-triggered compaction which forces all ongoing compaction to stop, like scrub. Turns out we weren't actually bailing out from do_until() when task cannot proceed. So rewrite_sstables() potentially runs into an infinite loop which in turn causes shutdown or something else waiting on it to hang forever. Found this while auditing code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>	2021-10-07 10:46:22 +03:00
Benny Halevy	90fd4d5ed7	test: sstable_conforms_to_mutation_source_test: test_sstable_reversing_reader_random_schema: auto-close reader on exception I stumbled upon this failure in dev mode: ``` test/boost/sstable_conforms_to_mutation_source_test.cc(0): Entering test case "test_sstable_reversing_reader_random_schema" sstable_conforms_to_mutation_source_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed. Aborting on shard 0. ``` Since dev mode has no debug symbols I can't decode the stack trace so I'm not 100% sure about the root cause and I couldn't reproduce it in release or debug modes yet. One vulnerability in the current code is that r1 won't be closed if an exception is thrown before r1 and r2 are moved to `compare_readers` so this change adds a deferred close of r1 in this case. Test: sstable_conforms_to_mutation_source_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211006144009.696412-1-bhalevy@scylladb.com>	2021-10-06 17:53:49 +03:00
Avi Kivity	b08c299713	cql3: expr: correct type of captured map value_type A map's value_type has const key, but in two places we omitted the const. This causes construction of a new value, plus gcc complaining that we're referring to a temporary. Fix by using the correct type.	2021-10-06 14:57:43 +03:00
Avi Kivity	eac95e2370	cdc: adjust type of streams_count streams_count has signed type, but it's compared against an unsigned type, annoying gcc. Since a count should be positive, convert it to an unsigned type.	2021-10-06 14:56:00 +03:00
Avi Kivity	5a5a47c4c7	alternator: disambiguate attrs_to_get in table_requests There is a table_requests::attrs_to_get type, and also a type named attrs_to_get used in the same struct, and gcc doesn't like this. Disambiguate the type by fully qualifying it.	2021-10-06 14:55:48 +03:00
Takuya ASADA	3b798afc1e	scylla_io_setup: handle nr_disks on GCP correctly nr_disks is int, should not be string. Fixes #9429 Closes #9430	2021-10-06 12:31:38 +03:00
Nadav Har'El	0f8d3ea459	cql-pytest: translate Cassandra's tests for ORDER BY This is a translation of Cassandra's CQL unit test source file validation/operations/SelectOrderByTest.java into our cql-pytest framework. This test file includes 17 tests for various features and corners of SELECT's "ORDER BY" feature. All these tests pass on Cassandra, but three fail on Scylla and are marked as xfail: One previously-unknown Scylla bug: Refs #9435: SELECT with IN, ORDER BY and function call does not obey the ORDER BY And two new reproducers for already known bugs: Refs #2247: ORDER BY should allow skipping equality-restricted clustering columns Refs #7751: Allow selecting map values and set elements, like in Cassandra 4.0 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211005174140.571056-1-nyh@scylladb.com>	2021-10-06 12:31:38 +03:00
Avi Kivity	0ea79559a6	Merge 'IDL: support generating boilerplate code for RPC verbs' from Pavel Solodovnikov Introduce new syntax in IDL compiler to allow generating registration/sending code for RPC verbs: ``` verb [[attr1, attr2...]] my_verb (args...) -> return_type; ``` `my_verb` RPC verb declaration corresponds to the `netw::messaging_verb::MY_VERB` enumeration value to identify the new RPC verb. For a given `idl_module.idl.hh` file, a registrator class named `idl_module_rpc_verbs` will be created if there are any RPC verbs registered within the IDL module file. These are the methods being created for each RPC verb: ``` static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&); static future<> unregister_my_verb(netw::messaging_service* ms); static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...); ``` Each method accepts a pointer to an instance of `messaging_service` object, which contains the underlying seastar RPC protocol implementation, that is used to register verbs and pass messages. There is also a method to unregister all verbs at once: ``` static future<> unregister(netw::messaging_service* ms); ``` The following attributes are supported when declaring an RPC verb in the IDL: * `[[with_client_info]]` - the handler will contain a const reference to an `rpc::client_info` as the first argument. * `[[with_timeout]]` - an additional `time_point` parameter is supplied to the handler function and `send` method uses `send_message__timeout` variant of internal function to actually send the message. * `[[one_way]]` - the handler function is annotated by `future<rpc::no_wait_type>` return type to designate that a client doesn't need to wait for an answer. The `-> return_type` clause is optional for two-way messages. If omitted, the return type is set to be `future<>`. For one-way verbs, the use of return clause is prohibited and the signature of `send` function always returns `future<>`. 
No existing code is affected. Ref: #1456 Closes #9359 github.com:scylladb/scylla: idl: support generating boilerplate code for RPC verbs idl: allow specifying multiple attributes in the grammar message: messaging_service: extract RPC protocol details and helpers into a separate header	2021-10-05 18:05:24 +03:00
Michał Radwański	dac2509a7f	query: reverse clustering_range	2021-10-05 16:47:04 +02:00
Tzach Livyatan	bd87c7d362	Update docker-hub text Mention aarch64 support Closes #9436	2021-10-05 17:35:02 +03:00
Raphael S. Carvalho	342bfbd65a	compaction: Make major compaction on keyspace resilient if low on space Let's major compact the smallest tables first, increasing chances of success if low on disk space. parallel_for_each() didn't have any effect on space requirement as compaction_manager serializes major compaction in a shard. As parallel_for_each() is no longer used, find_column_family() is now used before each compact_all_sstables() to avoid a race with table drop. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211005135257.31931-1-raphaelsc@scylladb.com>	2021-10-05 17:04:34 +03:00
Nadav Har'El	77bd4afda7	test/alternator: avoid client-side validation Ever since we started testing Alternator with tests written in Python and using Amazon's "boto3" library, one limitation kept annoying us: Boto3 verifies the validity of the request parameters before passing them on to the server. It verifies that mandatory parameters are not missing, that parameters have the right types, and sometimes even the right ranges - all in the library before ever sending the request. This meant that in many cases, we couldn't get good test coverage for Alternator's server-side handling of wrong parameters. As it turns out, it is trivial to tell boto3 to not do its client-side request validation, with the `parameter_validation=False` config flag. We just never noticed that such a flag existed :-) So this patch adds this flag. It then fixes a few tests which expected ParameterValidationError - this error is the client-side validation failure, but should now be replaced by checking the server-side error. The patch also adds a couple of invalid parameter checks that we couldn't do before because of boto3's eagerness to check them on the client side. We can add a lot more of these error tests in the future, now that we got rid of client-side validation. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211005095514.537226-1-nyh@scylladb.com>	2021-10-05 13:26:51 +02:00
Nadav Har'El	6dee86eade	test/alternator: another test for adding a GSI to an existing table This patch adds yet another test for Alternator's unimplemented feature of adding a GSI to an already existing table (issue #5022), but this test is for a very specific corner case - tables which contain string attributes with an empty value - the corner case described in issue #9424: DynamoDB used to forbid any string attributes from being set to an empty string, but this changed in May 2020, and since then empty strings are allowed - but NOT as keys. So although it is legal to set a string attribute to an empty string, if this table has a GSI whose key is that specific attribute, the update command is refused. We already had a test for this - test_gsi_empty_value. However, the case in this patch is the case where a GSI is added to a table after the table already has data. In this case (as this test demonstrates), we are supposed to drop the items which have the empty string key from the GSI. Even when #5022 (the ability to add GSIs to existing tables) will be done, this test will continue to fail. The unique problem of this test is that Scylla's materialized views do allow empty strings as clustering keys (right now) and even partition keys (after #9375 will be solved), while we don't want them to enter the GSI. We will probably need to add to the view's filter, which right now contains (as required) "x IS NOT NULL" also the filter "x != ''" (when x's type is a string or binary) so that items with empty-string keys will be dropped. Refs #5022 Refs #9375 Refs #9424 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211003170636.477582-1-nyh@scylladb.com>	2021-10-05 13:26:43 +02:00
Nadav Har'El	b136104298	alternator/test: test for invalid numeric values DynamoDB has a rather baroque definition of numbers, and in particular it does not allow numeric attributes to be set to infinity or NaN. Although I did check invalid numbers in the past, manually, I was never able to write a unit test for this in the past - because the boto3 library catches such errors on the client side, and prevents the test from sending broken requests to the server. So in this patch, I finally came up with a solution - a context manager client_no_transform() which yields a client which does NOT do any transformation or validation on the request's parameters, allowing us to use boto3 to create improper requests - and test the server's handling of them. The test in this patch passes - it did not discover a new bug, but it is a useful regression test and the client_no_transform() trick can be used in more error-case tests which until now we were unable to write. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211004161809.520236-1-nyh@scylladb.com>	2021-10-05 13:13:45 +02:00
Avi Kivity	2d25705db0	cql3: deinline non-trivial methods in selection.hh This allows us to forward-declare raw_selector, which in turn reduces indirect inclusions of expression.hh from 147 to 58, reducing rebuilds when anything in that area changes. Includes that were lost due to the change are restored in individual translation units. Closes #9434	2021-10-05 12:58:55 +02:00
Avi Kivity	d3f8148807	utils: untie rjson.hh from base64.hh base64.hh pulls in the huge rjson.hh, so if someone just wants a base64 codec they have to pull in the entire rapidjson library. Move the json related parts of base64.hh to rjson.hh and adjust includes and namespaces. In practice it doesn't make much difference, as all users of base64 appear to want json too. But it's cleaner not to mix the two. Closes #9433	2021-10-05 12:57:54 +02:00
Kamil Braun	cab3f2e2d2	test: raft: randomized_nemesis_test: perform reconfigurations in basic_generator_test We use a dedicated thread (similarly to the nemesis thread) to periodically perform reconfigurations.	2021-10-05 11:59:29 +02:00
Kamil Braun	fde26eb476	test: raft: randomized_nemesis_test: improve the bouncing algorithm The bouncing algorithm tries to send a request to other servers in the configuration after it receives a `not_a_leader` response. Improve the algorithm so it doesn't try the same server twice.	2021-10-05 11:54:16 +02:00
Kamil Braun	3ac8216a7b	test: raft: randomized_nemesis_test: handle more error types With reconfigurations the `commit_status_unknown` error may start appearing.	2021-10-05 11:54:16 +02:00
Kamil Braun	98add5a4fc	test: raft: randomized_nemesis_test: put `variant` and `monostate` `ostream` `operator<<`s into `std` namespace As a preparation for the following commits. Otherwise the definitions wouldn't be found during argument-dependent lookup (I don't understand why it worked before but won't after the next commit).	2021-10-05 11:54:16 +02:00
Kamil Braun	4956217341	test: raft: randomized_nemesis_test: `reconfiguration` operation The operation sends a reconfiguration request to a Raft cluster. It bounces a few times in case of `not_a_leader` results. A side effect of the operation is modifying a `known` set of nodes which the operation's state has a reference to. This `known` set can then be used by other operations (such as `raft_call`s) to find the current leader. For now we assume that reconfigurations are performed sequentially. If a reconfiguration succeeds, we change `known` to the new configuration. If it fails, we change `known` to be the set sum of the previous configuration and the current configuration (because we don't know what the configuration will eventually be - the old or the attempted one - so any member of the set sum may eventually become a leader).	2021-10-05 11:54:16 +02:00
Avi Kivity	3a67c661d4	Merge "Improve parallelism of mutation source tests" from Pavel E " There's a run_mutation_source_tests lib helper that runs a bunch of tests sequentially. The problem is that it does 4 different flavors of it each being a certain decoration over provided reader. This amplification makes some test cases run for an enormous amount of time without any chance for parallelism. The simplest way to help running those cases in parallel is to teach the slowest cases to run different flavors of mutation source tests in dedicated cases. This patch makes it so. The resulting timings are dev debug sequential run: 2m1s 53m50s --parallel-cases (+ this patch): 1m3s 31m15s tests: unit(dev, debug) " * 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla: test: Split multishard combining reader case test: Split database test case test: Split run_mutation_source_tests	2021-10-05 12:22:52 +03:00
Kamil Braun	0c24c18d0c	test: cql_query_test: fix test_query_limit for reversed queries (Single-partition) reversed queries are no longer unlimited but some places still treat them as such. This causes, for example, shorter pages for such queries, which breaks a test that expects certain results to come in a single page.	2021-10-05 11:22:39 +02:00
Tomasz Grabiec	17430795e8	Merge "test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot`" from Kamil It's possible that the server drops the snapshot in the same iteration of `io_fiber` loop as it tries to send it (the sending of messages happens after snapshot dropping). Handle this case by throwing an exception. As a preparation we also fix the code in `server_impl::send_snapshot` so it works correctly when `rpc::send_snapshot` throws or returns a ready future. Refs #9407. * kbr/snapshot-handle-errors: test: raft: randomized_nemesis_test: remove an obsolete comment test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot` raft: server: handle `rpc::send_snapshot` returning instantly	2021-10-05 11:19:14 +02:00
Kamil Braun	c9a7778497	test: raft: randomized_nemesis_test: remove an obsolete comment	2021-10-05 11:04:11 +02:00
Kamil Braun	961f5a904c	test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot` It's possible that the server drops the snapshot in the same iteration of `io_fiber` loop as it tries to send it (the sending of messages happens after snapshot dropping). Handle this case. Refs #9407.	2021-10-05 11:04:11 +02:00
Kamil Braun	36f3e26374	raft: server: handle `rpc::send_snapshot` returning instantly If `rpc::send_snapshot` returned immediately with a ready future, or if it threw, the code in `server_impl::send_snapshot` would not update `_snapshot_transfers` correctly. The code assumed that the continuation attached to `rpc::send_snapshot` (with `then_wrapped`) was executed after `_snapshot_transfer` below the `rpc::send_snapshot` call was updated. That would not necessarily be true (the continuation may even not have been executed at all if `rpc::send_snapshot` threw). Fix that by wrapping the `rpc::send_snapshot` call into a continuation attached to `later()`. Originally authored by Gleb <gleb@scylladb.com>, I added a comment.	2021-10-05 11:04:11 +02:00
Pavel Emelyanov	b742e6cbb6	test: Split multishard combining reader case All the cases in this test also run mutation source tests and the case with single-fragment buffer takes several times longer to execute than the others. Splitting this single case so that it runs mutation source tests flavours in different cases improves the test parallelism. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-05 11:57:02 +03:00
Pavel Emelyanov	30075094ac	test: Split database test case The test_database_with_data_in_sstables_is_a_mutation_source case runs the mutation source tests in one go. The problem is that on each step a whole new ks:cf is created which takes the majority of the test's time. At the end of the day this case is the slowest one in the suite being up to two times longer (depending on mode) than the #2 on this list. This patch splits the case into 4 so that each mutation source flavor is run in separate case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-05 11:53:18 +03:00
Pavel Emelyanov	1e09a2c925	test: Split run_mutation_source_tests There are 4 flavours of mutation source tests that are all run sequentially -- plain, reversed and upgrade/downgrade ones that check v1<->v2 conversions. This patch splits them all into individual calls so that some tests may want to have dedicated cases for each. "By default" they are all run as they were. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-10-05 11:51:43 +03:00
Pavel Emelyanov	4b4ce015aa	system-keyspace: Keep UUID value when saving The set_local_host_id() accepts a UUID reference and starts to save it in local keyspace and in all shards' local cache. Before it was coroutinized, the UUID was copied into captures and survived; after coroutinization it remains a reference. The problem is that callers pass local variables as arguments that go away "really soon". Fix it to accept UUID as value, it's short enough for safe and painless copy. fixes: #9425 tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211004145421.32137-1-xemul@scylladb.com>	2021-10-04 18:21:44 +03:00
Tomasz Grabiec	cd9b4d95fc	Merge "test: raft: randomized_nemesis_test: better liveness check at the end of generator test" from Kamil The previous check would find a leader once and assume that it does not change, and that the first attempt at sending a request to this leader succeeds. In reality the leader may change at the end of the test (e.g. it may be in the middle of stepping down when we find it) and in general it may take some time for the cluster to stabilize. The new check tries a few times to find a leader and perform a request - until a time limit is reached. * kbr/nemesis-liveness-check: test: raft: randomized_nemesis_test: better liveness check at the end of generator test test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader`	2021-10-04 16:05:37 +02:00
Kamil Braun	17e771c5f5	test: raft: randomized_nemesis_test: better liveness check at the end of generator test The previous check would find a leader once and assume that it does not change, and that the first attempt at sending a request to this leader succeeds. In reality the leader may change at the end of the test (e.g. it may be in the middle of stepping down when we find it) and in general it may take some time for the cluster to stabilize. The new check tries a few times to find a leader and perform a request - until a time limit is reached. The commit also removes an incorrect assertion inside `wait_for_leader`.	2021-10-04 15:57:54 +02:00
Kamil Braun	478a58e86d	test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader` To be used in the next commit, where we call `wait_for_leader` in a loop with the same deadline `time_point`.	2021-10-04 15:56:54 +02:00
Tomasz Grabiec	e89b9799b8	Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun Until now reversed queries were implemented inside `querier::consume_page` (more precisely, inside the free function `consume_page` used by `querier::consume_page`) by wrapping the passed-in reader into `make_reversing_reader` and then consuming fragments from the resulting reversed reader. The first couple of commits change that by pushing the reversing down below the `make_combined_reader` call in `table::query`. This allows working on improving reversing for memtables independently from reversing for sstables. We then extend the `index_reader` with functions that allow reading the promoted index in reverse. We introduce `partition_reversing_data_source`, which wraps an sstable data file and returns data buffers with contents of a single chosen partition as if the rows were stored in reverse order. We use the reversing source and the extended index reader in `mx_sstable_mutation_reader` to implement efficient (at least in theory) reversed single-partition reads. The patchset disables cache for reversed reads. Fast-forwarding is not supported in the mx reader for reversed queries at this point. Details in commit messages. Read the commits in topological order for best review experience. 
Refs: #9134 (not saying "Fixes" because it's only for single-partition queries without forwarding) Closes #9281 * github.com:scylladb/scylla: table: add option to automatically bypass cache for reversed queries test: reverse sstable reader with random schema and random mutations sstables: mx: implement reversed single-partition reads sstables: mx: introduce partition_reversing_data_source sstables: index_reader: add support for iterating over clustering ranges in reverse clustering_key_filter: clustering_key_filter_ranges owning constructor flat_mutation_reader: mention reversed schema in make_reversing_reader docstring clustering_key_filter: document clustering_key_filter_ranges::get_ranges	2021-10-04 15:37:34 +02:00
Kamil Braun	703aed3277	table: add option to automatically bypass cache for reversed queries Currently the new reversing sstable algorithms do not support fast forwarding and the cache does not yet handle reversed results. This forced us to disable the cache for reversed queries if we want to guarantee bounded memory. We introduce an option that does this automatically (without specifying `bypass cache` in the query) and turn it on by default. If the user decides that they prefer to keep the cache at the cost of fetching entire partitions into memory (which may be viable if their partitions are small) during reversed queries, the option can be turned off. It is live-updateable.	2021-10-04 15:24:12 +02:00
Kamil Braun	9bf6be5509	test: reverse sstable reader with random schema and random mutations The test generates a random set of mutations and creates two readers: - one by reversing the mutations, creating an sstable out of the result, and querying it in reverse, - one by creating an sstable directly from the mutations and querying it in forward mode. It checks that the readers give equal results. The test already managed to find a bug where offsets returned by the sstable index were interpreted incorrectly as absolute instead of relative. It also helped find another bug unrelated to reversing (#9352). Surprisingly few tests use the random schema and random mutation utilities which seem to be quite powerful.	2021-10-04 15:24:12 +02:00
Kamil Braun	27238eaa0f	sstables: mx: implement reversed single-partition reads We use partition_reversing_data_source and the new `index_reader` methods to implement single-partition reads in `mx_sstable_mutation_reader`. The parsing logic does not need to change: the buffers returned by the source already contain rows in reversed clustering order. Some changes were required in `mp_row_consumer_m` which processes the parsed rows and emits appropriate mutation fragments. The consumer uses `mutation_fragment_filter` underneath to decide whether a fragment should be ignored or not (e.g. the parsed fragment may come from outside the requested clustering range), among other things. Previously `mutation_fragment_filter` was provided a `partition_slice`. If the slice was reversed, the filter would use `clustering_key_filter_ranges::get_ranges` to obtain the clustering ranges from the slice in unreversed order (they were reversed in the slice) since we didn't perform any reversing in the reader. Now the reader provides the ranges directly instead of the slice; furthermore, the ranges are provided in native-reversed format (the order of ranges is reversed and the ranges themselves are also reversed), and the schema provided to the filter is also reversed. Thus to the filter everything appears as if it was used during a non-reversed query but on a table with reversed schema, which works correctly given the fact that the reader is feeding parsed rows into the consumer in reversed order. During reversed queries the reader uses alternative logic for skipping to a later range (or, speaking in non-reversed terms, to an earlier range), which happens in `advance_context`. It asks the index to advance its upper bound in reverse so that the reversing_data_source notices the change of the index end position and returns following buffers with rows from the new range. There is a slight difference in behavior of the reader from `mp_row_consumer_m`'s point of view. 
For non-reversed reads, after the consumer obtains the beginning of a row (`consume_row_start`) - which contains the row's position but not the columns - and tells the reader that the row won't be emitted because we need to skip to a later range, the reader would tell the data source (the 'context') immediately to skip to a later range by calling `skip_to`. This caused the source not to return the rest of the row, and the rest of the row would not be fed to the consumer (`consume_row_end`). However, for reversed reads, the data source performs skipping 'on its own', after it notices that the index end position has changed. This may happen 'too late', causing the rest of the row to be returned anyway. We are prepared for this situation inside `mp_row_consumer` by consulting the mutation fragment filter again when the rest of the row arrives. Fast forwarding is not supported at this point, which is fine given that the cache is disabled for reversed queries for now (and the cache is the only user of fast forwarding). The `partition_slice` provided by callers is provided in 'half-reversed' format for reversed queries, where the order of clustering ranges is reversed, but the ranges themselves are not. This means we need to modify the slice sometimes: for non-single-partition queries the mx reader must use a non-reversed slice, and for single-partition queries the mx reader must use a native-reversed slice (where the clustering ranges themselves are reversed as well). The modified slice must be stored somewhere; we store it inside the mx reader itself so we don't need to allocate more intermediate readers at the call sites. This causes the interface of `mx::make_reader` to be a bit weird: for non-single-partition queries where the provided slice is reversed the reader will actually return a non-reversed stream of fragments, telling the user to reverse the stream on their own. The interface has been documented in detail with appropriate comments.	2021-10-04 15:24:12 +02:00
Wojciech Mitros	64e703bb54	sstables: mx: introduce partition_reversing_data_source This patch adds an implementation of a data source that wraps an sstable data file and returns data buffers with contents of one partition in the sstable as if the rows of the partition were present in a reversed order. In other words, to the user of the source the partition appears to be reversed. We shall call this an 'intermediary' data source. As part of the interface of the intermediary source the user is also given read access to the source's current position over the data file, and the constructor of the source takes a reference to `index_reader`. This is necessary because the index operates directly on data file offsets and we want the user to be able to use the index to skip sequences of rows. In order to ask the source to skip a sequence of rows - e.g. when jumping between clustering ranges - the user must advance the index' upper bound in reverse (to an earlier position). The source will then notice that the end position of the index has changed and take appropriate action. An alternative would be to translate the data positions of `index_reader` to 'reversed positions' of the intermediary and then use `skip_to` for skipping, as we do for forward reads. However this solution would introduce more complexity to `index_reader` and the intermediary source. One reason for the complexity in the input stream is that we would have two kinds of skips: a single row skip, and a skip to a clustering range. We know the offset of the next row, so we could check that to differentiate them. We would also need to add an information about the position of first clustering row and end of the last one in the index_reader. Skipping by checking the index seems to be overall simpler. For simplicity, the intermediary stream always starts with parsing the partition header and (if present) the static row, and returning the corresponding bytes as a result of the first read. 
After partition header and static row we must find the last row entry of the requested range. If the range ends before the partition end (i.e. there are more row entries after the range) we can use the 'previous unfiltered size' of the row following the range; otherwise we must scan the last promoted index block and take its last row. After finding the data range of the last row, we parse rows consecutively in reversed order. We must parse the rows partially to learn their lengths and the positions of previous rows. We're using similar constructs as in the sstable parser, but it only contains a small part of the parsing coroutine and doesn't perform any correctness checks. The parser for rows still turned out rather big mostly because we can't always deduce the size of the clustering blocks without reading the block header. The parser allows reading rows while skipping their bodies also in non-reversed order, which we are making use of while reading the last promoted index block. The intermediary data source has one more utility: reversing range tombstones. When we read a tombstone bound/boundary, we modify the data buffer so that the resulting bound/boundary has the reversed kind (so we don't read ends before starts) and the boundaries have their before/after timestamps swapped.	2021-10-04 15:24:12 +02:00
Wojciech Mitros	8385f3eb21	sstables: index_reader: add support for iterating over clustering ranges in reverse In the sstable reader, we iterate over clustering ranges using the index_reader, which normally only accepts advancing to increasing positions. In this patch we add methods for advancing the index reader in reverse. To simplify our job we restrict our attention to a single implementation of the promoted index block cursor: `bsearch_clustered_cursor`. The `index_reader` methods for advancing in reverse will thus assume that this implementation is used. The assumption is correct given that we're working only with sstables of versions >= mc, which is indeed the intended use case. We add some documentation in appropriate places to make this obvious. We extend `bsearch_clustered_cursor` with two methods: `advance_past(pos)`, which advances the cursor to the first block after `pos` (or to the end if there is no such block), and `last_block_offset()`, which returns the data file offset of the first row from the last promoted index block. To efficiently find the position in the data file of the last row of the partition (which we need when performing a reversed query) the sstable reader may need to read the span of the entire last promoted index block in the data file. To learn where the block starts it can use `index_reader::last_block_offset()`, which is implemented in terms of `bsearch_clustered_cursor::last_block_offset()`. When performing a single partition read in forward order, the reader asks the index to position its lower bound at the start of the partition and its upper bound after the end of the slice. It starts by reading the first range. After exhausting a range it jumps to the next one by asking the index to advance the lower bound. For reverse single partition reads we'll take a similar approach: the initial bound positions are as in the forward case. 
However, we start with the last range and after exhausting a range we want to jump to a previous one; we will do it by advancing the upper bound in reverse (i.e. moving it closer to the beginning of the partition). For this we introduce the `index_reader::advance_reverse` function.	2021-10-04 15:24:12 +02:00
Avi Kivity	3cb865103d	Update seastar submodule * seastar 0ba6c36cc3...994b4b5a0c (2): > Merge "Improve smp::invoke_on_others" from Pavel E > file: keep hint pointer alive when calling fcntl()	2021-10-04 15:36:45 +03:00
Avi Kivity	148a12f3da	Merge "Keep storage_service less aware of cdc internals" from Pavel E " The storage_service is involved in the cdc_generation_service guts more than needed. - the bool _for_testing bit is cdc-only - there's API-only cdc_generation_service getter - cdc_g._s. startup code partially sits in s._s. one This patch cleans most of the above leaving only the startup _cdc_gen_id on board. tests: unit(dev) refs: #2795 " * 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla: api: Use local sharded<cdc::generation_service> reference main: Push cdc::generation_service via API storage_service: Ditch for_testing boolean cdc: Replace db::config with generation_service::config cdc: Drop db::config from description_generator cdc: Remove all arguments from maybe_rewrite_streams_descriptions cdc: Move maybe_rewrite_streams_descriptions into after_join cdc: Squash two methods into one cdc: Turn make_new_cdc_generation a service method cdc: Remove ring-delay arg from make_new_cdc_generation cdc: Keep database reference on generation_service	2021-10-04 14:56:05 +03:00
Piotr Dulikowski	6093c2378b	hints: assign _last_written_rp in ep manager's move constructor The end_point_hints_manager's field _last_written_rp is initialized in its regular constructor, but is not copied in the move constructor. Because the move constructor is always involved when creating a new endpoint manager, the _last_written_rp field is effectively always initialized with the zero-argument constructor, and is set to the zero value. This can cause the following erroneous situation to occur: - Node A accumulates hints towards B. - Sync point is created at A. It will be used later to wait for currently accumulated hints. - Node A is restarted. The endpoint manager A->B is created which has bogus value in the _last_written_rp (it is set to zero). - Node A replays its hints but does not write any new ones. - A hint flush occurs. If there are no hint segments on disk after flush, the endpoint manager sets its last sent position to the last written position, which is by design. However, the last written position has incorrect value, so the last sent position also becomes incorrect and too low. - Try to wait for the sync point created earlier. The sync point waiting mechanism waits until last sent hint position reaches or goes past the position encoded in the sync point, but it will not happen because the last sent position is incorrect. The above bug can be (sometimes) reproduced in hintedhandoff_sync_point_api_test dtest. Now, the _last_written_rp field is properly initialized in the move constructor, which prevents the bug described above. Fixes: #9320 Closes #9426	2021-10-04 13:21:34 +02:00
Kamil Braun	b2f33b3e0b	test: raft: randomized_nemesis_test: abort environment before ticker We must abort the environment before the ticker as the environment may require time to keep advancing during abort in order for all operations to finish, e.g. operations that can finish only due to timeout. Currently such operations may cause the test to hang indefinitely at the end. The test requires a small modification to ensure that `delivery_queue::push` is not called after the queue was aborted. Message-Id: <20210930143539.157727-1-kbraun@scylladb.com>	2021-10-04 12:31:26 +02:00
Avi Kivity	1bac93e075	Merge "simplifications and layer violation fix for compaction manager" from Raphael "This series removes layer violation in compaction, and also simplifies compaction manager and how it interacts with compaction procedure." * 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla: compaction: split compaction info and data for control compaction_manager: use task when stopping a given compaction type compaction: remove start_size and end_size from compaction_info compaction_manager: introduce helpers for task compaction_manager: introduce explicit ctor for task compaction: kill sstables field in compaction_info compaction: kill table pointer in compaction_info compaction: simplify procedure to stop ongoing compactions compaction: move management of compaction_info to compaction_manager compaction: move output run id from compaction_info into task	2021-10-04 13:09:31 +03:00
Avi Kivity	93b765f655	scripts/pull_github_pr.sh: don't guess git remote name The script assumes the remote name is "origin", a fair assumption, but not universally true. Read it from configuration instead of guessing it. Closes #9423	2021-10-04 12:32:39 +03:00
Nadav Har'El	414b672e22	test/alternator: verify that empty-string keys are NOT allowed Since May 2020 empty strings are allowed in DynamoDB as attribute values (see announcement in [1]). However, they are still not allowed as keys. We had tests that they are not allowed in keys of LSI or GSI, but missed tests that they are not allowed as keys (partition or sort key) of base tables. This patch adds these missing tests. These tests pass - we already had code that checked for empty keys and generated an appropriate error. Note that for compatibility with DynamoDB, Alternator will forbid empty strings as keys even though Scylla does support this possibility (Scylla always supported empty strings as clustering key, and empty partition keys will become possible with issue #9352). [1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/ Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211003122842.471001-1-nyh@scylladb.com>	2021-10-04 08:40:43 +02:00
Botond Dénes	61e7d3de90	Merge 'Cleanup compaction_stop_exception' from Benny Halevy The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception` and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown or api event vs. compaction aborting due to scrub validation error. While at it, cleanup the existing retry logic related to `compaction_stop_exception`. Test: unit(dev) Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267) Closes #9321 * github.com:scylladb/scylla: compaction: split compaction_aborted_exception from compaction_stopped_exception compaction_manager: maybe_stop_on_error: rely on retry=false default compaction_manager: maybe_stop_on_error: sync return value with error message. compaction: drop retry parameter from compaction_stop_exception compaction_manager: move errors stats accounting to maybe_stop_on_error	2021-10-04 07:27:11 +03:00
Takuya ASADA	9c830297ac	scylla_util.py: add persistent disk support for GCE Just like EBS disks for EC2, we want to use persistent disk on GCE. We won't recommend using it, but we still need to support it. Related scylladb/scylla-machine-image#215 Closes #9395	2021-10-03 17:58:18 +03:00
Takuya ASADA	d87b80ad14	scylla_util.py: add persistent disk support for Azure Just like EBS disks for EC2, we want to use persistent disk on Azure. We won't recommend using it, but we still need to support it. Related https://github.com/scylladb/scylla-machine-image/issues/218 Closes #9417	2021-10-03 17:56:31 +03:00
Avi Kivity	adcd5a69d6	Update seastar submodule * seastar e6db0cd587...0ba6c36cc3 (6): > semaphore: add try_get_units > build: adjust compilation for libfmt 8+ > alloc_failure_injector: add explicit zero-initialization > Change wakeup() from private to public in reactor.hh > app-template: separate seastar options into --seastar-help > files: Don't ignore FS info for read-only files	2021-10-03 13:14:43 +03:00
Piotr Sarna	1d353bd6e7	docs: mention scripts/pull_github_pr.sh The pull_github_pr.sh script is preferred over colorful github buttons, because it's designed to always assign proper authors. It also works for both single- and multi-patch series, which makes the merging process more universal. Message-Id: <b982b650442456b988e1cea59aa5ad221207b825.1633101849.git.sarna@scylladb.com>	2021-10-03 10:19:26 +03:00
Michał Radwański	0d5a2067ad	test/lib/failure_injecting_allocation_strategy: remove UB... by setting _alloc_count initially to 0. The initial value of _alloc_count was never explicitly specified. As the allocator has usually been an automatic variable, _alloc_count initially had some unspecified contents. This probably means that cases where the first few allocations passed and a later one failed might never have been tested. The good thing is that most of the users have been transferred to the Seastar failure injector, which (by accident) has been correct. Closes #9420	2021-10-01 13:25:05 +02:00
Pavel Emelyanov	85d86cc85f	scripts: Fix origin repo URL parsing It assumes that the origin URL is git@ one while it can be the https:// one as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211001082116.7214-1-xemul@scylladb.com>	2021-10-01 13:22:06 +02:00
Piotr Sarna	ec52e05eab	tracing: unify prepared statement info into a single struct The tracing code assumes that query_option_names and query_option_values vectors always have the same length as the prepared_statements vector, but it's not true. E.g. if one of the statements in a batch is incorrect, it will create a discrepancy between the number of prepared statements and the number of bound names and values, which currently leads to a segmentation fault. To overcome the problem, all three vectors are integrated into a single vector, which makes size mismatches impossible. Tested manually with code that triggers a failure while executing a batch statement, because the Python driver performs driver-side validation and thus it's hard to create a test case which triggers the problem. closes: #9221	2021-10-01 10:57:38 +03:00
Eliran Sinvani	c38ceafdcf	Service Level Controller: Add an extension point to the API (#9374) In order to ease future extensions to the information being sent by the service level configuration change API, we pack the additional parameters (other than the service level options) to the interface in a structure. This will allow an easy expansion in the future if more parameters need to be sent to the observer. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-01 10:20:28 +03:00
Raphael S. Carvalho	9067a13eac	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:57 -03:00
Raphael S. Carvalho	87ce0c5d43	compaction_manager: use task when stopping a given compaction type compaction_info will eventually only be used for exporting data about ongoing compactions, so task must be used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:52 -03:00
Raphael S. Carvalho	cbd78be2dd	compaction: remove start_size and end_size from compaction_info those stats aren't used in compaction stats API and therefore they can be removed. end_size is added to compaction_result (needed for updating history) and start_size can be calculated in advance. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:45 -03:00
Raphael S. Carvalho	18f703e94b	compaction_manager: introduce helpers for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:41 -03:00
Raphael S. Carvalho	d4572a1bb5	compaction_manager: introduce explicit ctor for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:37 -03:00
Raphael S. Carvalho	38df9c68f8	compaction: kill sstables field in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:33 -03:00
Raphael S. Carvalho	90cfe895d4	compaction: kill table pointer in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:29 -03:00
Raphael S. Carvalho	4ce745e0b6	compaction: simplify procedure to stop ongoing compactions Today, compactions are tracked by both _compactions and _tasks, where _compactions refers to actual ongoing compaction tasks, whereas _tasks refers to manager tasks, which are responsible for spawning new compactions, retrying them on failure, etc. As each task can only have one ongoing compaction at a time, let's move compaction into task, such that manager won't have to look at both when deciding to do something like stopping a task. So stopping a task becomes simpler, and duplication is naturally gone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:21 -03:00
Raphael S. Carvalho	efed06e2e4	compaction: move management of compaction_info to compaction_manager Today, compaction is calling compaction manager to register / deregister the compaction_info created by it. This is a layer violation because manager sits one layer above compaction, so manager should be responsible for managing compaction info. From now on, compaction_info will be created and managed by compaction_manager. compaction will only have a reference to info, which it can use to update the world about compaction progress. This will allow compaction_manager to be simplified as info can be coupled with its respective task, allowing duplication to be removed and layer violation to be fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:15:00 -03:00
Raphael S. Carvalho	1f5b17fdc5	compaction: move output run id from compaction_info into task this run id is used to track partial runs that are being written to. let's move it from info into task, as this is not an external info, but rather one that belongs to compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:13:20 -03:00
Raphael S. Carvalho	52302c3238	compaction_manager: prevent unbounded growth of pending tasks There will be unbounded growth of pending tasks if they are submitted faster than they are retired. That can potentially happen if memtables are frequently flushed too early. It was observed that this unbounded growth caused task queue violations, as the queue gets filled with tons of tasks being reevaluated. By avoiding duplication in the pending task list for a given table T, growth is no longer unbounded and consequently reevaluation is no longer aggressive. Refs #9331. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>	2021-09-30 16:49:52 +03:00
Pavel Emelyanov	037135316e	api: Use local sharded<cdc::generation_service> reference And remove the getter from storage_service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	5d8e05e7ae	main: Push cdc::generation_service via API This is not to mess with storage service in this API call. Next patch will make use of the passed reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	f669fbd230	storage_service: Ditch for_testing boolean Nowadays it purely controls whether or not to inject delays into timestamps generation by cdc. The same effect can be achieved by configuring the cdc::generation_service directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	db623c5f64	cdc: Replace db::config with generation_service::config This is to push the service towards general idea that each component should have its own config and db::config to stay in main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	b879d3f3a5	cdc: Drop db::config from description_generator It only needs one for murmur3_partitioner_ignore_msb_bits value, provide it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	2e7364b94f	cdc: Remove all arguments from maybe_rewrite_streams_descriptions All of them are references taken from 'this', since the function is the generation_service method it can use 'this' directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 16:04:12 +03:00
Pavel Emelyanov	6fe31d8eac	cdc: Move maybe_rewrite_streams_descriptions into after_join The generation service already has all it needs to do it. This keeps storage_service smaller and less aware about cdc internals. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 15:34:03 +03:00
Pavel Emelyanov	3b51c5c96a	cdc: Squash two methods into one The recently introduced make_new_generation() method just calls another one by passing more this->... stuff as arguments. Relax the flow by teaching the latter to use 'this' directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 15:34:03 +03:00
Pavel Emelyanov	7a7a87f24a	cdc: Turn make_new_cdc_generation a service method It has everything needed onboard. Only two arguments are required -- the bootstrap tokens and whether or not to inject a delay. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 15:34:03 +03:00
Pavel Emelyanov	b867a19da1	cdc: Remove ring-delay arg from make_new_cdc_generation It already has the db::config from where to get one (and even this will change soon). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 15:34:03 +03:00
Pavel Emelyanov	5e2a049266	cdc: Keep database reference on generation_service The service effectively depends on it when rewrites streams descriptions. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 15:34:03 +03:00
Piotr Sarna	e2fe8559ca	configure: temporarily disable wasm support for aarch64 There seems to be a problem with libwasmtime.a dependency on aarch64, causing occasional segfaults during tests - specifically, tests which exercise the path for halting wasm execution due to fuel exhaustion. As a temporary measure, wasm is disabled on this architecture to unblock the flow. Refs #9387 Closes #9414	2021-09-30 14:57:04 +03:00
Pavel Emelyanov	bbcf671276	config: Remove unused replacing options The --replace-token and --replace-node options were added some time ago, but have never been used since then, just parsed and immediately aborted. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210930102222.16294-1-xemul@scylladb.com>	2021-09-30 14:56:04 +03:00
Kamil Braun	5b011b1c2f	clustering_key_filter: clustering_key_filter_ranges owning constructor	2021-09-30 12:10:52 +02:00
Kamil Braun	43dac07253	flat_mutation_reader: mention reversed schema in make_reversing_reader docstring	2021-09-30 12:10:52 +02:00
Kamil Braun	1777d5de46	clustering_key_filter: document clustering_key_filter_ranges::get_ranges	2021-09-30 12:10:52 +02:00
Piotr Jastrzebski	79de151158	cache_tracker: remove unused parameter from on_remove Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f66ad391d86963b43b2a01e957887ea597e591e8.1632992165.git.piotr@scylladb.com>	2021-09-30 13:03:13 +03:00
Avi Kivity	83237894b7	Merge "Keep local_host_id in local_cache" from Pavel E " Most of the code gets local_host_id by querying system keyspace. There's one place (counters) that want future-less getter and that caches host id on database for that (it used to be cached on storage_service some time ago). This set relocates the value on local cache and frees the starting code from the need to mess with database for setting it. Also this cuts hints->qctx hidden dependency. tests: unit(dev) " * 'br-host-id-in-local-cache' of https://github.com/xemul/scylla: storage_proxy: Use future-less local_host_id getting database: Get local host id from system_keyspace system_keyspace: Keep local_host_id on local_cache code: Rename get_local_host_id() into load_...() system_keyspace: Coroutinize get_/set_local_host_id	2021-09-30 12:38:52 +03:00
Pavel Emelyanov	8a93a6de78	storage_proxy: Use future-less local_host_id getting The methods in question are called from the API handlers which are registered after start (and after host id load and cache), so they can safely be switched. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:55:20 +03:00
Pavel Emelyanov	e9002e1e61	database: Get local host id from system_keyspace It's now cached on database itself, and it can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:55:20 +03:00
Pavel Emelyanov	9f5fd8b5c0	system_keyspace: Keep local_host_id on local_cache Some places in the code want to have future-less access to the host id, now they do it all by themselves. Local cache seems to be a better place (for the record -- some time ago the "better place" argument justified cached host id relocation from the storage_service onto the database). While at it -- add the future-less getter for the host_id to be used further. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:54:38 +03:00
Nadav Har'El	1edcc3a218	test/alternator: add test for reverse queries This patch adds a reproducer for issue #7586 - that Alternator queries (Query) operating in reverse order (ScanIndexForward = false) are artificially limited to 100 MB partitions because of their memory use. This test generates a partition over 100 MB in size and then tries various reverse queries on it - with or without Limit, starting at the end or the middle of the partition. The test currently fails when a reverse query refuses to operate on such a large partition - the log reports this: ERROR ... Memory usage of reversed read exceeds hard limit of 104857600 (configured via max_memory_for_unlimited_query_hard_limit), while reading partition K1H6ON3A1C With yet-uncommitted reverse-scan improvements, the test proceeds further, but still fails where we test that a reverse query with Limit not explicitly specified should still be limited to a certain size (e.g. 1MB) and cannot return the entire 100 MB partition in one response. Please note that this is not a comprehensive test for Scylla's reverse scan implementation: In particular we do not have separate tests for reverse scan's implementation on different sources - memtables, sstables, or the cache. Nor do we check all sorts of edge cases. We assume that Scylla's reverse scan implementation will have its own unit tests elsewhere that will check these things - and this test can focus on the Alternator use case. This test is marked "xfail" because it still fails on Alternator. It is marked "veryslow" because it's a (relatively) slow test, taking multiple seconds to set up the 100 MB partition. So run the test with the pytest options "--runxfail --runveryslow" to see how it fails. Refs #7586 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210930063700.407511-1-nyh@scylladb.com>	2021-09-30 09:34:39 +02:00
Pavel Emelyanov	beb345c00a	code: Rename get_local_host_id() into load_...() There will appear the future-less method which better deserves the get_ prefix, so give the existing method the load_ one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:33:57 +03:00
Pavel Emelyanov	e49dc4ed0d	system_keyspace: Coroutinize get_/set_local_host_id Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:33:57 +03:00
Pavel Emelyanov	e6b920017a	main: Replace cql_config_updater with updateable_value The cql_config_updater is a sharded<> service that exists in main and whose goal is to make sure some db::config's values are propagated into cql_config. There's a more handy updateable_value<> glue for that. tests: unit(dev) refs: #2795 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210927090402.25980-1-xemul@scylladb.com>	2021-09-30 07:23:43 +03:00
Pavel Solodovnikov	88f9f2e9d0	idl: support generating boilerplate code for RPC verbs Introduce new syntax in IDL compiler to allow generating registration/sending code for RPC verbs: verb [[attr1, attr2...]] my_verb (args...) -> return_type; `my_verb` RPC verb declaration corresponds to the `netw::messaging_verb::MY_VERB` enumeration value to identify the new RPC verb. For a given `idl_module.idl.hh` file, a registrator class named `idl_module_rpc_verbs` will be created if there are any RPC verbs registered within the IDL module file. These are the methods being created for each RPC verb: static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&); static future<> unregister_my_verb(netw::messaging_service* ms); static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...); Each method accepts a pointer to an instance of `messaging_service` object, which contains the underlying seastar RPC protocol implementation, that is used to register verbs and pass messages. There is also a method to unregister all verbs at once: static future<> unregister(netw::messaging_service* ms); The following attributes are supported when declaring an RPC verb in the IDL: * [[with_client_info]] - the handler will contain a const reference to an `rpc::client_info` as the first argument. * [[with_timeout]] - an additional `time_point` parameter is supplied to the handler function and `send` method uses `send_message__timeout` variant of internal function to actually send the message. * [[one_way]] - the handler function is annotated by `future<rpc::no_wait_type>` return type to designate that a client doesn't need to wait for an answer. The `-> return_type` clause is optional for two-way messages. If omitted, the return type is set to be `future<>`. For one-way verbs, the use of return clause is prohibited and the signature of `send*` function always returns `future<>`. No existing code is affected. 
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-30 02:21:57 +03:00
Michał Radwański	b68a6c63e9	flat_mutation_reader: remove unused reserve_one method Closes #9410	2021-09-29 17:22:29 +02:00
Nadav Har'El	43b3c1b75d	CODEOWNERS: some fixes and additions Fixed some errors in .github/CODEOWNERS (which is used by Github to recommend who should review which pull request), and also add a few additional ownerships I thought of. This file could still use more work - if you can think of specific files or directories you'd like to review changes in, please send a patch for this file to add yourself to the appropriate paths. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210929141118.378930-1-nyh@scylladb.com>	2021-09-29 18:07:07 +03:00
Botond Dénes	970fe9a339	mutation_writer: partition_based_splitting_writer: limit number of max buckets Recently we observed an OOM caused by the partition based splitting writer going crazy, creating 1.7K buckets while scrubbing an especially broken sstable. To avoid situations like that in the future, this patch provides a max limit for the number of live buckets. When the number of buckets reaches this limit, the largest bucket is closed and replaced by a new bucket. This will end up creating more output sstables during scrub overall, but now they won't all be written at the same time causing insane memory pressure and possibly OOM. Scrub compaction sets this limit to 100, the same limit the TWCS's timestamp based splitting writer uses (implemented through the classifier - time_window_compaction_strategy::max_data_segregation_window_count). Fixes: #9400 Tests: unit(dev) Closes #9401	2021-09-29 16:31:29 +03:00
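Note: the bucket cap described in 970fe9a339 can be illustrated with a minimal sketch. The `bucket` struct and the `make_room()` helper below are hypothetical stand-ins, not the types or functions used by `partition_based_splitting_writer`; the sketch only shows the "close the largest bucket once the limit is reached" policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a live output bucket; only its size matters here.
struct bucket {
    std::size_t bytes = 0;
};

// Hypothetical helper: called before opening a new bucket.
// If the number of live buckets has reached max_buckets, the largest one is
// removed from `live` and returned as closed, so that opening a new bucket
// keeps the live count within the limit.
inline std::vector<bucket> make_room(std::vector<bucket>& live, std::size_t max_buckets) {
    assert(max_buckets > 0);
    std::vector<bucket> closed;
    if (live.size() >= max_buckets) {
        auto largest = std::max_element(live.begin(), live.end(),
            [](const bucket& a, const bucket& b) { return a.bytes < b.bytes; });
        closed.push_back(*largest);
        live.erase(largest);
    }
    return closed;
}
```

Closing the largest bucket frees the most buffered data per closed bucket, at the cost of producing more output sstables, which is the trade-off the commit message describes.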
Avi Kivity	b3c95a1fc6	commitlog: reduce inclusions of commitlog.hh due to db::commitlog::force_sync (#9379 ) There are now 231 translation units that indirectly include commitlog.hh due to the need to have access to db::commitlog::force_sync. Move that type to a new file commitlog_types.hh and make it available without access to the commitlog class. This reduces the number of translation units that depend on commitlog.hh to 84, improving compile time.	2021-09-29 16:13:44 +03:00
Nadav Har'El	5cbe9178fd	alternator: add missing BatchGetItem metric Unfortunately, defining metrics in Scylla requires some code duplication, with the metrics declared in one place but exported in a different place in the code. When we duplicated this code in Alternator, we accidentally dropped the first metric - for BatchGetItem. The metric was accounted in the code, but not exported to Prometheus. In addition to fixing the missing metric, this patch also adds a test that confirms that the BatchGetItem metric increases when the BatchGetItem operation is used. This test failed before this patch, and passes with it. The test only currently tests this for BatchGetItem (and BatchWriteItem) but it can be later expanded to cover all the other operations as well. Fixes #9406 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210929121611.373074-1-nyh@scylladb.com>	2021-09-29 14:16:54 +02:00
Tomasz Grabiec	11a3b411c5	Merge 'mutation_source_test: test reverse reads' from Botond Dénes Currently no mutation-source supports reading in reverse natively but we are working on changing that, adding native reverse read support to memtable, cache and sstable readers. To ensure that all mutation sources work in a correct and uniform manner when reading in reverse, we add a reverse test to the mutation source test suite. This test reverses the data that it passes to `populate()`, then reads in forward order (in reverse compared to the data order). For this we use the currently established reverse read API: reverse schema (schema order == query order) and half-reversed (legacy) slice. All mutation sources are prepared to work with reversed reads, using the `make_reversing_reader()` adapter. As we progress with our native reverse support, we will replace these adapters with native reversing support. As part of this, we push down the reversing reader adapter currently existing on the `query::consume_page()` level, to the individual mutation sources. 
Closes #9384 * github.com:scylladb/scylla: test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set querier: consume_page(): remove now unused max_size parameter test/lib: mutation_source_test: test reading in reverse test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema treewide: move reversing to the mutation sources mutation_query: reconcilable_result_builder: document reverse query preconditions sstable_set: time_series_sstable_set: reverse mode mutlishard_mutation_query: set max result size on used permits db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema flat_mutation_reader: make_reversing_reader(): add convenience stored slice mutation_reader: evictable_reader: add reverse read support flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions query-request: introduce `half_reverse_slice` flat_mutation_reader_assertions: log what's expected	2021-09-29 12:57:57 +02:00
Avi Kivity	d4aa6c2746	Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael " Backlog tracker isn't updated correctly when facing a schema change, and may leak an SSTable if compaction strategy is changed, which causes backlog to be computed incorrectly. Most of these problems happen because sstable set and tracker are updated independently, so it could happen that the tracker loses track (pun intended) of changes applied to the set. The first patch will fix the leak when strategy is changed, and the third patch will make sure that tracker is updated atomically with sstable set, so these kinds of problems will not happen anymore. Fixes #9157 " * 'fixes_to_backlog_tracker_v4' of github.com:raphaelsc/scylla: compaction: Update backlog tracker correctly when schema is updated compaction: Don't leak backlog of input sstable when compaction strategy is changed compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() compaction: simplify removal of monitors	2021-09-29 13:55:37 +03:00
Kamil Braun	075a894a89	test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set	2021-09-29 12:15:48 +03:00
Botond Dénes	42b677ef6f	querier: consume_page(): remove now unused max_size parameter	2021-09-29 12:15:48 +03:00
Botond Dénes	bc49c27a06	test/lib: mutation_source_test: test reading in reverse To ensure all mutation sources uniformly support the current API of reverse reading: reversed schema and half-reversed slice. This test will also ensure that once we switch to native-reverse slice, all mutation-sources will keep on working.	2021-09-29 12:15:48 +03:00
Kamil Braun	7d5273b044	test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse For reversed reads we must adjust the lower/upper bounds used by the `position_reader_queue` and `clustering_combined_reader`. The bounds are calculated using the mutation schema, but we need bounds calculated using the query schema which is reversed.	2021-09-29 12:15:48 +03:00
Botond Dénes	9399f379ec	test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse The mutation source test suite will soon test reads in reverse. Prepare for this by checking the reversed flag on the slice and not reversing the data when set. The test will effectively have two modes: * Forward mode: data is reversed before read, then reversed again during read. * Reverse mode: data is already reversed and it is reversed back during read.	2021-09-29 12:15:48 +03:00
Botond Dénes	c048d854d9	test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema The two might not be the same in case the schema was upgraded or if we are reading in reverse. It is important to use the passed-in query schema consistently during a read.	2021-09-29 12:15:48 +03:00
Botond Dénes	41facb3270	treewide: move reversing to the mutation sources Push down reversing to the mutation-sources proper, instead of doing it on the querier level. This will allow us to test reverse reads on the mutation source level. The `max_size` parameter of `consume_page()` is now unused but is not removed in this patch, it will be removed in a follow-up to reduce churn.	2021-09-29 12:15:45 +03:00
Nadav Har'El	88177d7be7	test/alternator: add test for too many items in BatchWriteItem DynamoDB limits the number of items that a BatchWriteItem call can write to 25. As noted in issue #5057, in Alternator we don't have this limit or any limit on the number of items in a BatchWriteItem - which probably isn't wise. This patch adds a simple xfailing test for this. Refs #5057 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210912140736.76995-1-nyh@scylladb.com>	2021-09-29 10:48:58 +02:00
Nadav Har'El	a1bab2c4c9	Merge 'cql3: improve expression ergonomics' from Avi Kivity The `expression` type (an std::variant) suffers from bad ergonomics: - std::variant has poor/no constraints, so compiler error messages are long and uninformative - it cannot be forward-declared (since std::variant does not support incomplete types) - the type name is long, polluting compiler error messages and debug symbols - it requires an artificial `nested_expression` when one expression is nested inside another This series fixes those drawbacks by wrapping the variant in a class, adding constraints, and adding an extra indirection. Test: unit (dev) Closes #9402 * github.com:scylladb/scylla: cql3: expr: drop nested_expression cql3: expr: make expression forward declarable, easier to use cql3: expr: construct column_value explicitly cql3: expr: introduce as/as_if/is cql3: expr: introduce expr::visit, replacing std::visit	2021-09-29 10:47:39 +03:00
Takuya ASADA	cd7fe9a998	scylla_cpuscaling_setup: disable ondemand.service on Ubuntu On Ubuntu, scaling_governor becomes powersave after a reboot, even if we configured cpufrequtils. This is because ondemand.service unconditionally changes scaling_governor to ondemand or powersave. cpufrequtils starts before ondemand.service, so scaling_governor is overwritten by ondemand.service. To configure scaling_governor correctly, we have to disable this service. Fixes #9324 Closes #9325	2021-09-29 10:32:34 +03:00
Avi Kivity	c72906a2ee	cql3: expr: drop nested_expression Now that expression can be nested in its component types directly, we can remove nested_expression. Most of the patch adjusts uses to drop the dereference that was needed for nested_expression.	2021-09-28 23:49:21 +03:00
Avi Kivity	448c06f150	cql3: expr: make expression forward declarable, easier to use Make expression a class, holding a unique_ptr to a variant, instead of just a variant. This has some advantages: - the constructor can be properly constrained - the type can be forward-declared - the type name is just "expression", rather than a huge variant. This makes compiler error messages easier to read. - the internal indirection allows removal of nested_expression (later in the series)	2021-09-28 23:49:21 +03:00
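Note: a minimal sketch of the design described in 448c06f150, assuming a toy grammar with only two node types. Everything in `namespace sketch` is hypothetical; in particular, the real class uses a constrained constructor, whereas this sketch uses one plain constructor per node type to stay short. It shows how holding a `std::unique_ptr` to the variant makes the type forward-declarable and lets a node contain an `expression` directly.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>

namespace sketch {

struct constant;
struct negation;

// A class rather than a bare variant: it can be forward-declared, and its
// name stays short in compiler diagnostics.
class expression {
    struct impl;                    // holds the variant; defined further down
    std::unique_ptr<impl> _v;
public:
    expression(constant c);
    expression(negation n);
    expression(const expression& o);
    expression(expression&&) noexcept;
    expression& operator=(expression o) noexcept;
    ~expression();
    // Returns a pointer to the held T, or nullptr if another type is held.
    template <typename T> const T* as_if() const;
};

struct constant { int value; };
// The indirection inside expression lets a node nest an expression directly.
struct negation { expression arg; };

struct expression::impl {
    std::variant<constant, negation> v;
};

inline expression::expression(constant c) : _v(std::make_unique<impl>(impl{std::move(c)})) {}
inline expression::expression(negation n) : _v(std::make_unique<impl>(impl{std::move(n)})) {}
// Deep copy; the source must not be a moved-from expression.
inline expression::expression(const expression& o) : _v(std::make_unique<impl>(*o._v)) {}
inline expression::expression(expression&&) noexcept = default;
inline expression& expression::operator=(expression o) noexcept {
    _v = std::move(o._v);
    return *this;
}
inline expression::~expression() = default;

template <typename T>
const T* expression::as_if() const {
    assert(_v && "as_if() called on a moved-from expression");
    return std::get_if<T>(&_v->v);
}

} // namespace sketch
```

The special member functions are defined only after `impl` is complete, which is what allows `expression` itself to be used while the variant is still an incomplete type.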
Avi Kivity	d43e72a747	cql3: expr: construct column_value explicitly We have a few cases where a column_definition* is converted directly to an expression without an explicit call to column_value{}. The new expression implementation will not allow this, so make these cases explicit. IMO this is better form than to rely on the compiler picking the right expression subtype.	2021-09-28 23:49:21 +03:00
Avi Kivity	be44b579a1	cql3: expr: introduce as/as_if/is Simple wrappers for std::get, std::get_if, std::holds_alternative. The new names are shorter and IMO more readable. Call sites are updated. We will later replace the implementation.	2021-09-28 23:49:11 +03:00
Avi Kivity	e7db3def4f	cql3: expr: introduce expr::visit, replacing std::visit The new expr::visit() is just a wrapper around std::visit(), but has better constraints. A call to expr::visit() with a visitor that misses an overload will produce an error message that points at the missing type. This is done using the new invocable_on_expression concept. Note it lists the expression types one by one rather than using template magic, since otherwise we won't get the nice messages. Later, we will change the implementation when expression becomes our own type rather than std::variant. Call sites are updated.	2021-09-28 23:48:42 +03:00
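Note: the idea behind `expr::visit()` in e7db3def4f, checking the visitor against each expression type one by one so the diagnostic names the missing type, can be sketched as follows. This is a C++17 approximation that uses `static_assert` with `std::is_invocable_v` in place of the `invocable_on_expression` concept from the commit; the two node types and the `describe` visitor are hypothetical.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

namespace expr {

// Hypothetical, reduced set of expression alternatives.
struct constant { int value; };
struct column { std::string name; };
using expression = std::variant<constant, column>;

// Thin wrapper around std::visit. Each alternative is checked separately, so
// a visitor lacking an overload fails with a message naming that type instead
// of a long std::visit instantiation trace.
template <typename Visitor>
decltype(auto) visit(Visitor&& visitor, const expression& e) {
    static_assert(std::is_invocable_v<Visitor, const constant&>,
                  "visitor has no overload for expr::constant");
    static_assert(std::is_invocable_v<Visitor, const column&>,
                  "visitor has no overload for expr::column");
    return std::visit(std::forward<Visitor>(visitor), e);
}

// Example visitor that covers every alternative.
struct describe {
    std::string operator()(const constant& c) const { return "constant " + std::to_string(c.value); }
    std::string operator()(const column& c) const { return "column " + c.name; }
};

} // namespace expr
```

Calls should be qualified as `expr::visit(...)`; an unqualified `visit(...)` on a `std::variant` argument would also consider `std::visit` through argument-dependent lookup.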
Botond Dénes	c7619de929	mutation_query: reconcilable_result_builder: document reverse query preconditions	2021-09-28 17:03:57 +03:00
Kamil Braun	7dc4ee35c9	sstable_set: time_series_sstable_set: reverse mode `time_series_sstable_set` uses `clustering_combined_reader` to implement efficient single-partition reads. It provides a `position_reader_queue` to the reader. This queue returns readers to the sstables from the set in order of the sstables' lower bounds, and with each reader it provides an upper bound for the positions-in-partition returned by the reader. Until now we would assume non-reversed queries only. Reversed queries were implemented by performing forward query in the lower layers and reversing the results at the upper-most layer of the reader stack. Before pushing the reversing down to the sources (in particular, to sstable readers), we need to support the reverse mode in `time_series_sstable_set` and the queue it provides to `clustering_combined_reader`. This requires using different lower and upper bounds in the queue. For non-reversed reads we used `sstable::min_position()` as the lower bound and `sstable::max_position()` as the upper bound. For reversed reads all comparisons performed by `clustering_combined_reader` will be reversed, as it will use a reversed schema. We can then use `sstable::max_position().reversed()` for the lower bound and `sstable::min_position().reversed()` for the upper bound.	2021-09-28 17:03:57 +03:00
Botond Dénes	22e216563a	mutlishard_mutation_query: set max result size on used permits `08042c1688` added the query max result size to the permit but only set it for single partition queries. This patch does the same for range-scans in preparation of `query::consume_page()` not propagating max size soon.	2021-09-28 17:03:57 +03:00
Botond Dénes	dec282e050	db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema The two might not be the same in case the schema was upgraded (unlikely for virtual tables) or if we are reading in reverse. It is important to use the passed-in query schema consistently during a read.	2021-09-28 17:03:57 +03:00
Botond Dénes	f5ef88c0c5	flat_mutation_reader: make_reversing_reader(): add convenience stored slice This serves as a convenience slice storage for reads that have to store an edited slice somewhere. This is common for reads that work with a native-reversed slice and so have to convert the one used in the query -- which is in half-reversed format.	2021-09-28 17:03:57 +03:00
Botond Dénes	2bd295ee80	mutation_reader: evictable_reader: add reverse read support Evictable reader has to be made aware of reverse reads as it checks/edits the slice. This shouldn't require reverse awareness normally, it is only required because we still use the half-reversed (legacy) slice format for reversed reads. Once we switch to the native format this commit can be reverted.	2021-09-28 17:03:57 +03:00
Botond Dénes	eeebe4ab63	flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support Implemented with the `make_reversing_reader()` adaptor.	2021-09-28 17:03:57 +03:00
Botond Dénes	cc222e5332	flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support Implemented with the `make_reversing_reader()` adaptor.	2021-09-28 17:03:57 +03:00
Botond Dénes	1a2bdba25f	flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions	2021-09-28 17:03:57 +03:00
Kamil Braun	4bd601c6fd	query-request: introduce `half_reverse_slice` A utility function for converting between forward and half-reversed (or 'legacy'-reversed) slices to be used in the next commit.	2021-09-28 17:03:57 +03:00
Kamil Braun	270093b251	flat_mutation_reader_assertions: log what's expected	2021-09-28 17:03:57 +03:00
Tomasz Grabiec	c4328ffc4d	tests: mutation_test: Add test for position_in_partition::reversed() Message-Id: <20210927154942.44236-1-tgrabiec@scylladb.com>	2021-09-28 13:09:39 +02:00
Tomasz Grabiec	6bf873b663	Merge "raft: misc documentation edits" from Kostja * scylla-dev/raft-misc-v4-docedit: raft: document pre-voting and protection against disruptive leaders raft: style edits of README.md. raft: document snapshot API	2021-09-28 12:12:46 +02:00
Konstantin Osipov	0adff23c21	raft: document pre-voting and protection against disruptive leaders	2021-09-27 22:04:18 +03:00
Konstantin Osipov	0e63e99b5a	raft: style edits of README.md.	2021-09-27 22:04:04 +03:00
Konstantin Osipov	de2beac6ca	raft: document snapshot API	2021-09-27 22:03:38 +03:00
Raphael S. Carvalho	9718173598	compaction: Update backlog tracker correctly when schema is updated Currently the following can happen: 1) there's ongoing compaction with input sstable A, so sstable set and backlog tracker both contain A. 2) ongoing compaction replaces input sstable A by B, so sstable set contains only B now. 3) schema is updated, so a new backlog tracker is built without A because sstable set now contains only B. 4) ongoing compaction tries to remove A from tracker, but it was excluded in step 3. 5) tracker can now have a negative value if table is decreasing in size, which leads to log(<negative number>) == -NaN This problem happens because backlog tracker updates are decoupled from sstable set updates. Given that the essential content of the backlog tracker should be the same as that of the sstable set, let's move tracker management to table. Whenever sstable set is updated, backlog tracker will be updated with the same changes, making their management less error prone. Fixes #9157 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:15:29 -03:00
Raphael S. Carvalho	afd45b9f49	compaction: Don't leak backlog of input sstable when compaction strategy is changed The generic backlog formula is: ALL + PARTIAL - COMPACTING With transfer_ongoing_charges() we already ignore the effect of ongoing compactions on COMPACTING as we judge them to be pointless. But ongoing compactions will run to completion, meaning that output sstables will be added to ALL anyway, in the formula above. With stop_tracking_ongoing_compactions(), input sstables are never removed from the tracker, but output sstables are added, which means we end up with duplicate backlog in the tracker. By removing this tracking mechanism, pointless ongoing compaction will be ignored as expected and the leaks will be fixed. Later, the intention is to force a stop on ongoing compactions if strategy has changed as they're pointless anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:03:28 -03:00
Raphael S. Carvalho	05126cfe29	compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() This new function makes it easier to remove monitor of exhausted sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:01:40 -03:00
Raphael S. Carvalho	35050a8217	compaction: simplify removal of monitors By switching to unordered_map, removal of generated monitors is made easier. This is a preparatory change for the patch which will remove monitors for all exhausted sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 13:59:30 -03:00
Tomasz Grabiec	2b3ae6aca4	position_in_partition: Introduce reversed() transformation It transforms the position from a forward-clustering-order schema domain to a reversed-clustering-order schema domain. The object still refers to the same element of the space of keys under this transformation. However, the identification of the position, the position_in_partition object, is schema-dependent, it is always interpreted relative to some schema. Hence the need to transform it when switching schema domains. Message-Id: <20210917102612.308149-1-tgrabiec@scylladb.com>	2021-09-27 14:23:09 +03:00
Gleb Natapov	78774a485a	raft: drop local snapshot if it cannot be installed If a locally taken snapshot cannot be installed because a newer one was received in the meantime, it should be dropped; otherwise it will take up space needlessly. Message-Id: <YUrWXxVfBjEio1Ol@scylladb.com>	2021-09-27 13:03:23 +02:00
Asias He	1657e7be14	gossiper: Send generation number with shutdown message Consider: - n1, n2 in the cluster - n2 shutdown - n2 sends gossip shutdown message to n1 - n1 delays processing of the handler of shutdown message - n2 restarts - n1 learns new gossip state of n2 - n1 resumes to handle the shutdown message - n1 will mark n2 as shutdown status incorrectly until n2 restarts again To prevent this, we can send the gossip generation number along with the shutdown message. If the generation number does not match the local generation number for the remote node, the shutdown message will be ignored. Since we use the rpc::optional to send the generation number, it works with mixed cluster. Fixes #8597 Closes #9381	2021-09-27 11:08:43 +03:00
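Note: the receiver-side rule from 1657e7be14 can be sketched as a single predicate. The function name and the use of `std::optional` are illustrative assumptions (the commit sends the value as an `rpc::optional`), and accepting a message that carries no generation is an assumption made here to model older senders in a mixed cluster.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical check run by the node that receives a gossip shutdown message.
//   local_generation: the generation this node currently knows for the sender.
//   msg_generation:   the generation carried by the message, if the sender
//                     is new enough to include one.
inline bool should_apply_shutdown(std::int64_t local_generation,
                                  std::optional<std::int64_t> msg_generation) {
    if (!msg_generation) {
        // Assumption: senders that predate the change omit the field, and
        // their messages are handled as before.
        return true;
    }
    // A mismatch means the sender has restarted since it sent the message,
    // so the stale shutdown notification is ignored.
    return *msg_generation == local_generation;
}
```

In the scenario from the commit message, n1 has already learned n2's new generation by the time it resumes the delayed handler, so the comparison fails and n2 is not marked as shut down.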
Avi Kivity	d7ac699a55	Revert "Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael" This reverts commit `b5cf0b4489`, reversing changes made to `e8493e20cb`. It causes segmentation faults when sstable readers are closed. Fixes #9388.	2021-09-26 18:31:49 +03:00
Avi Kivity	bf94c06fc7	Revert "Merge "simplifications and layer violation fix for compaction manager" from Raphael" This reverts commit `7127c92acc`, reversing changes made to `88480ac504`. We need to revert `b5cf0b4489` to fix #9388, and this stands in the way. Ref #9388.	2021-09-26 18:30:36 +03:00
Piotr Sarna	06f724857f	transport: remove unused map of stream_id->query states The map is never touched, so it only occupies precious space for each connection. Closes #9383	2021-09-26 13:41:58 +03:00
Avi Kivity	936de92876	Merge 'cql3: Add evaluate(expression) and use instead of term::bind()' from Jan Ciołek This PR adds the function: ```c++ constant evaluate(const expression&, const query_options&); ``` which evaluates the given expression to a constant value. It binds all the bound values, calls functions, and reduces the whole expression to just raw bytes and `data_type`, just like `bind()` and `get()` did for `term`. The code is often similar to the original `bind()` implementation in `lists.cc`, `sets.cc`, etc. * For some reason in the original code, when a collection contains `unset_value`, then the whole collection is evaluated to `unset_value`. I'm not sure why this is the case, considering it's impossible to have `unset_value` inside a collection, because we forbid bind markers inside collections. For example here: `cc8fc73761/cql3/lists.cc (L134)` This seems to have been introduced by Pekka Enberg in `50ec81ee67`, but he has left the company. I didn't change the behaviour, maybe there is a reason behind it, although maybe it would be better to just throw `invalid_request_exception`. * There was a strange limitation on map key size, it seems incorrect: `cc8fc73761/cql3/maps.cc (L150)`, but I left it in. * When evaluating a `user_type` value, the old code tolerated `unset_value` in a field, but it was later converted to NULL. This means that `unset_value` doesn't work inside a `user_type`, I didn't change it, will do in another PR. * We can't fully get rid of `bind()` yet, because it's used in `prepare_term` to return a `terminal`. It will be removed in the next PR, where we finally get rid of `term`. 
Closes #9353 * github.com:scylladb/scylla: cql3: types: Optimize abstract_type::contains_collection cql3: expr: Convert evaluate_IN_list to use evaluate(expression) cql3: expr: Use only evaluate(expression) to evaluate term cql3: expr: Implement evaluate(expr::function_call) cql3: expr: Implement evaluate(expr::usertype_constructor) cql3: expr: Implement evaluate(expr::collection_constructor) cql3: expr: Implement evaluate(expr::tuple_constructor) cql3: expr: Implement evaluate(expr::bind_variable) cql3: Add contains_collection/set_or_map to abstract_type cql3: expr: Add evaluate(expression, query_options) cql3: Implement term::to_expression for function_call cql3: Implement term::to_expression for user_type cql3: Implement term::to_expression for collections cql3: Implement term::to_expression for tuples cql3: Implement term::to_expression for marker classes cql3: expr: Add data_type to *_constructor structs cql3: Add term::to_expression method cql3: Reorganize term and expression includes	2021-09-26 12:58:11 +03:00
Eliran Sinvani	0b2861d014	Prepare for inheriting from reader_concurrency_semaphore Some future and enterprise features require us to inherit from reader_concurrency_semaphore. This might require additional "wrap up" operations to be done on stop, which serves as a barrier for the semaphore. Here we simply make stop virtual so it is inherited and can be augmented. This change has no significant impact on performance, since stop is called only once in the lifetime of a semaphore. The approach is to add two extension points to the reader_concurrency_semaphore class, one just before the stop code is executed and one just after. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Closes #9373	2021-09-26 12:57:48 +03:00
Avi Kivity	2d352820f4	Update tools/java and tools/jmx submodules * tools/java 9c5c0ad1fd...05ec511bbb (2): > reloc/build_reloc.sh: Add missing space > reloc: stop removing entire BUILDDIR * tools/jmx 658818b...5c383b6 (1): > reloc: stop removing entire $BUILDDIR	2021-09-26 12:33:55 +03:00
Pavel Emelyanov	88e5b7c547	database: Shutdown in tests There's a circular dependency: query processor needs database database owns large_data_handler and compaction_manager those two need qctx qctx owns a query_processor Respectively, the latter hidden dependency is not "tracked" by constructor arguments -- the query processor is started after the database and is deferred to be stopped before it. This works in scylla, because query processor doesn't really stop there, but in cql_test_env it's problematic as it stops everything, including the qctx. Recent database start-stop sanitation revealed this problem -- on database stop either l.d.h. or compaction manager try to start (or continue) messing with the query processor. One problem was faced immediately and plugged with the `75e1d7ea` safety check inside l.d.h., but still cql_test_env tests continue suffering from use after free on stopped query processor. The fix is to partially revert the `4b7846da` by making the tests stop some pieces of the database (including l.d.h. and compaction manager) as it used to before. In scylla this is, probably, not needed, at least now -- the database shutdown code was and still is run right before the stopping one. tests: unit(debug) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210924080248.11764-1-xemul@scylladb.com>	2021-09-26 11:09:01 +03:00
Benny Halevy	7498ac4869	dht: boot_strapper: bootstrap: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923144206.1690576-2-bhalevy@scylladb.com>	2021-09-26 11:09:01 +03:00
Benny Halevy	798aee6747	dht: boot_strapper: coroutinize bootstrap Prepare for futurizing get_pending_address_ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923144206.1690576-1-bhalevy@scylladb.com>	2021-09-26 11:09:01 +03:00
Kamil Braun	bf823e34a4	raft: disable sticky leadership rule The Raft PhD presents the following scenario. When we remove a server from the cluster configuration, it does not receive the configuration entry which removes it (because the leader appending this entry uses that entry's configuration to decide to which servers to send the entry to, and the entry does not contain the removed server). Therefore the server keeps believing it is a member but does not receive heartbeats from leaders in the new configuration. Therefore it will keep becoming a candidate, causing existing leaders to step down, harming availability. With many such candidates the cluster may even stop being able to proceed at all. We call such servers "disruptive". More concretely, consider the following example, adapted from the PhD for joint configuration changes (the original PhD considered a different algorithm which can only add/remove one server at once): Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint configuration (C_old, C_new). D is the leader. D managed to append C_joint to every server and commit it. D appends C_new. At this point, D stops sending heartbeats to A because C_new does not contain A, but A's last entry is still C_joint, so it still has the ability to become a candidate. A can now become a candidate and cause D, or any other leader in C_new, to step down. Even if D manages to commit C_new, A can keep disrupting the cluster until it is shut down. Prevoting changes the situation, which the authors admit. The "even if" above no longer applies: if D manages to commit C_new, or just append it to a majority of C_new, then A won't be able to succeed in the prevote phase because a majority of servers in C_new has a longer log than A (and A must obtain a prevote from a majority of servers in C_new because A is in C_joint which contains C_new). 
But the authors continue to argue that disruptions can still occur during the small period where C_new is only appended on D but not yet on a majority of C_new. As they say: "we also did not want to assume that a leader will reliably replicate entries fast enough to move past the scenario (...) quickly; that might have worked in practice, but it depends on stronger assumptions that we prefer to avoid about the performance (...) of replicating log entries". One could probably try debunking this by saying that if entries take longer to replicate than the election timeout we're in much bigger trouble, but nevermind. In any case, the authors propose a solution which we call "sticky leadership". A server will not grant a vote to a candidate if it has recently received a heartbeat from the currently known leader, even if the candidate's term is higher. In the above example, servers in C_new would not grant votes to A as long as D keeps sending them heartbeats, thus A is no longer disruptive. In our case the situation is a bit different: in original Raft, "heartbeats" have a very specific meaning - they are append_entries requests (possibly empty) sent by leaders. Thus if a node stops being a leader it stops sending heartbeats; similarly, if a node leaves the configuration, it stops receiving heartbeats from others still in the configuration. We instead use a "shared failure detector" interface, where nodes may still consider other nodes alive regardless of their configuration/leadership situation, as part of the general "MultiRaft" framework. This pretty much invalidates the original argument, as seen on the above example: A will still consider D alive, thus it won't become a candidate. Shared failure detector combined with sticky leadership actually makes the situation worse - it may cause cluster unavailability in certain scenarios (fortunately not a permanent one, it can be solved with server restarts, for example). 
Randomized nemesis testing with reconfigurations found the following scenario: Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration C1, B is the leader. B commits joint (C1, C2), then new C2 configuration. Note that C does not learn about the last entry (since it's not part of C2) but it keeps believing that B is alive, so it keeps believing that B is the leader. We then partition {A} from {B, C}. A appends (C2, C3) joint configuration to its log. It's not able to append it to B or C due to the partition. The partition holds long enough for A to revert to candidate state (or we may restart A at this point). Eventually the partition resolves. The only node which can become a candidate now is A: C does not become a candidate because it keeps believing that B is the leader, and B does not become a candidate because it saw the C2 non-joint entry being committed. However, A won't become a leader because C won't grant it a vote due to the sticky leadership rule. The cluster will remain unavailable until e.g. C is restarted. Note that this scenario requires allowing configuration changes which remove and then readd the same servers to the configuration. One may wonder if such reconfigurations should be allowed, but there doesn't seem to be any example of them breaking safety of Raft (and the PhD doesn't seem to mention them at all; perhaps it implicitly accepts them). It is unknown whether a similar scenario may be produced without such reconfigurations. In any case, disabling sticky leadership resolves the problem, and it is the last currently known availability problem found in randomized nemesis testing. There is no reason to keep this extension, both because the original Raft authors' argument does not apply for shared failure detector, and because one may even argue with the authors in vanilla Raft given that prevoting is enabled (see end of third paragraph of this commit message). 
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>	2021-09-26 11:09:01 +03:00
Jan Ciolek	e9f24edc9b	cql3: types: Optimize abstract_type::contains_collection contains_collection() and contains_set_or_map() used to be calculated on each call(). Now the result is calculated only once during type creation. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 13:45:38 +02:00
Jan Ciolek	c672c0b42d	cql3: expr: Convert evaluate_IN_list to use evaluate(expression) evaluate_IN_list used term::bind(), but now it's possible to make it use term::to_expression() and then evaluate(expression) Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	7ab14ca9c1	cql3: expr: Use only evaluate(expression) to evaluate term Finally we don't need term::bind() to evaluate a term. We can just convert the term to expression and call evaluate(expression). Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	ea02fd82bc	cql3: expr: Implement evaluate(expr::function_call) function_call can be evaluated now. The code matches the one from functions::function_call::bind. I needed to add cache id to function_call in order for it to work properly. See the blurb in struct function_call for more information. New code corresponds to bind() in cql3/functions/functions.cc. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	4a035b07d3	cql3: expr: Implement evaluate(expr::usertype_constructor) usertype_constructor can now be evaluated. To evaluate a usertype_constructor we need to know the type, because the fields have to be in the correct order. Type has been added to usertype_constructor. New code corresponds to old bind() of user_types::delayed_value in cql3/user_types.cc. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	f7ee40aa01	cql3: expr: Implement evaluate(expr::collection_constructor) collection_constructor can now be evaluated. There is a bit of a problem, because we don't know the type of an empty collection_constructor, but luckily empty collection constructors get converted to constants during preparation. For some reason in the original code when a collection contains unset_value, the whole collection is automatically evaluated to unset_value. I didn't change this behaviour. New code corresponds to old bind() of lists::delayed_value in cql3/lists.cc, sets::delayed_value etc. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	0f20d301d8	cql3: expr: Implement evaluate(expr::tuple_constructor) Tuple constructors can now be evaluated. New code corresponds to old bind() of tuples::delayed_value::marker in cql3/tuples.cc Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	5589f348e7	cql3: expr: Implement evaluate(expr::bind_variable) Implement evaluating a bind_variable. To be able to evaluate a bind_variable we need to know the type of the bound value. This is why a data_type has been added to the bind_variable struct. There are some quirks when evaluating a bind_variable. The first problem occurs when the variable has been sent with an older cql serialization format and contains collections. In that case the value has to be reserialized to use the newest cql serialization format. The second problem occurs when there is a set or a map in the value. The set value sent by the driver might not have the elements in the correct order, contain duplicates etc. When a set or map is detected in the value it is reserialized as well. collection_type_impl::reserialize doesn't work for this purpose, because it uses data_value which does not perform sorting or removal. New code corresponds to old bind() of lists::marker in cql3/lists.cc, sets::marker etc. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	e621cbaa32	cql3: Add contains_collection/set_or_map to abstract_type Sometimes we need to know whether some type contains some collection, set, or map inside. Introduce two functions that provide this information. Information about collection is useful for reserializing values with old serialization format. Information about set/map is useful for reserializing sets and maps to remove duplicates. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	f0e238f0a6	cql3: expr: Add evaluate(expression, query_options) Add a function that takes an expression and evaluates it to a constant. Evaluating specific expression variants will be implemented in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	4ee4dc10ed	cql3: Implement term::to_expression for function_call Each functions::function_call can now be converted to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	abd11b6fb4	cql3: Implement term::to_expression for user_type Each user_type::delayed_value can now be converted to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	d61b2dbf8a	cql3: Implement term::to_expression for collections Each collection delayed_value can now be converted to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	f17d003808	cql3: Implement term::to_expression for tuples Each tuples::delayed_value can now be converted to expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	c40f227c14	cql3: Implement term::to_expression for marker classes Implement to_expression for non terminals that represent a bind marker. For now each bind marker has a shape describing where it is used, but hopefully this can be removed in the future. In order to evaluate a bind_variable we need to know its type. The type is needed to pass to constant and to validate the value. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	499c9235fc	cql3: expr: Add data_type to _constructor structs It is useful to have a data_type in _constructor structs when evaluating. The resulting constant has a data_type, so we have to find it somehow. For tuple_constructor we don't have to create a separate tuple_type_impl instance. For collection_constructor we know what the type is even in case of an empty collection. For usertype_constructor we know the name, type and order of fields in the user type. Additionally without a data_type we wouldn't know whether the type is reversed or not. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	f86a1270b0	cql3: Add term::to_expression method Add a method that converts given term to the matching expression. It will be used as an intermediate step when implementing evaluate(expression). evaluate(term) will convert the term to the expression and then call evaluate(expression). For terminals this is simply calling get() to serialize the value. For non-terminals the implementation is more complicated and will be implemented in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Jan Ciolek	746e9c620f	cql3: Reorganize term and expression includes Make term.hh include expression.hh instead of the other way around. expression can't be forward declared. expression is needed in term.hh to declare term::to_expression(). Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-24 11:05:53 +02:00
Tomasz Grabiec	f582bfd453	Merge "test: raft: randomized_nemesis_test: generator test with linearizability checking" from Kamil The AppendReg state machine stores a sequence of integers. It supports `append` inputs which append a single integer to the sequence and return the previous state (before appending). The implementation uses the `append_seq` data structure representing an immutable sequence that uses a vector underneath which may be shared by multiple instances of `append_seq`. Appending to the sequence appends to the underlying vector, but there is no observable effect on the other instances since they use only the prefix of the sequence that wasn't changed. If two instances sharing the same vector try to append, the later one must perform a copy. This allows efficient appends if only one instance is appending, which is useful in the following context: - a Raft server stores a copy in the underlying state machine replica and appends to it, - clients send append operations to the server; the server returns the state of the sequence before it was appended to, - thanks to the sharing, we don't need to copy all elements when returning the sequence to the client, and only one instance (the server) is appending to the shared vector, - summarizing, all operations have amortized O(1) complexity. We use AppendReg instead of ExReg in `basic_generator_test` with a generator which generates a sequence of append operations with unique integers. This implies that the result of every operation uniquely identifies the operation (since it contains the appended integer, and different operations use different integers) and all operations that must have happened before it (since it contains the previous state of the append register), which allows us to reconstruct the "current state" of the register according to the results of operations coming from Raft calls, giving us an on-line serializability checker with O(1) amortized complexity on each operation completion. 
We also enforce linearizability by checking that every completed operation was previously invoked. We also perform a simple liveness check at the end of the test by ensuring that a leader becomes eventually elected and that we can successfully execute a call. * kbr/linearizability-v2: test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test test: raft: randomized_nemesis_test: introduce append register	2021-09-23 23:55:13 +02:00
Benny Halevy	7e9ca101ae	storage_service: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-31-bhalevy@scylladb.com>	2021-09-23 17:36:43 +03:00
Benny Halevy	ecbe9f1ef6	storage_service: coroutinize rebuild Prepare for futurizing get_ranges_for_endpoint. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-30-bhalevy@scylladb.com>	2021-09-23 17:36:42 +03:00
Benny Halevy	c8b12afe1b	storage_service: effective_ownership: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-29-bhalevy@scylladb.com>	2021-09-23 17:35:32 +03:00
Benny Halevy	add78a8cc0	storage_service: coroutinize effective_ownership Prepare for futurizing get_ranges_for_endpoint. Dtest: nodetool_additional_test:TestNodetool.status_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-28-bhalevy@scylladb.com>	2021-09-23 17:34:56 +03:00
Avi Kivity	7127c92acc	Merge "simplifications and layer violation fix for compaction manager" from Raphael "This series removes layer violation in compaction, and also simplifies compaction manager and how it interacts with compaction procedure." * 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla: compaction: split compaction info and data for control compaction_manager: use task when stopping a given compaction type compaction: remove start_size and end_size from compaction_info compaction_manager: introduce helpers for task compaction_manager: introduce explicit ctor for task compaction: kill sstables field in compaction_info compaction: kill table pointer in compaction_info compaction: simplify procedure to stop ongoing compactions compaction: move management of compaction_info to compaction_manager compaction: move output run id from compaction_info into task	2021-09-23 17:29:19 +03:00
Raphael S. Carvalho	5bf51ced14	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:56:18 -03:00
Raphael S. Carvalho	6e7729fa21	compaction_manager: use task when stopping a given compaction type compaction_info will eventually only be used for exporting data about ongoing compactions, so task must be used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:53:53 -03:00
Raphael S. Carvalho	6d1170ac94	compaction: remove start_size and end_size from compaction_info those stats aren't used in compaction stats API and therefore they can be removed. end_size is added to compaction_result (needed for updating history) and start_size can be calculated in advance. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:41:13 -03:00
Raphael S. Carvalho	2353f40f63	compaction_manager: introduce helpers for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:39 -03:00
Raphael S. Carvalho	6820fbf460	compaction_manager: introduce explicit ctor for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:36 -03:00
Raphael S. Carvalho	d73a241a4e	compaction: kill sstables field in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:32 -03:00
Raphael S. Carvalho	b6b4042faf	compaction: kill table pointer in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:11 -03:00
Raphael S. Carvalho	98f8673d4e	compaction: simplify procedure to stop ongoing compactions Today, compactions are tracked by both _compactions and _tasks, where _compactions refers to actual ongoing compaction tasks, whereas _tasks refers to manager tasks, which are responsible for spawning new compactions, retrying them on failure, etc. As each task can only have one ongoing compaction at a time, let's move compaction into task, such that manager won't have to look at both when deciding to do something like stopping a task. So stopping a task becomes simpler, and duplication is naturally gone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:25:51 -03:00
Raphael S. Carvalho	0885376a85	compaction: move management of compaction_info to compaction_manager Today, compaction is calling compaction manager to register / deregister the compaction_info created by it. This is a layer violation because manager sits one layer above compaction, so manager should be responsible for managing compaction info. From now on, compaction_info will be created and managed by compaction_manager. compaction will only have a reference to info, which it can use to update the world about compaction progress. This will allow compaction_manager to be simplified as info can be coupled with its respective task, allowing duplication to be removed and layer violation to be fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:00:49 -03:00
Raphael S. Carvalho	7688d0432c	compaction: move output run id from compaction_info into task this run id is used to track partial runs that are being written to. let's move it from info into task, as this is not an external info, but rather one that belongs to compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 09:56:01 -03:00
Piotr Sarna	88480ac504	cql-pytest: relax another condition for a failed wasm execution The previous commit already relaxed the condition for test_fib, but the same should be done for test_fib_called_on_null for an identical reason - more than 1 error can be expected in the case of calling heavily recursive function, and either fuel exhaustion, or hitting the stack limit, or any other InvalidRequest exception should be accepted. Closes #9363	2021-09-23 14:11:02 +03:00
Benny Halevy	ad46ff8e5e	database: coroutinize create_keyspace Prepare for futurizing on create_in_memory_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-10-bhalevy@scylladb.com>	2021-09-23 14:05:44 +03:00
Benny Halevy	91091e9d89	database: update_keyspace: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-9-bhalevy@scylladb.com>	2021-09-23 14:05:18 +03:00
Benny Halevy	c71cd2bed3	database: coroutinize update_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-8-bhalevy@scylladb.com>	2021-09-23 14:05:18 +03:00
Piotr Sarna	62948b7404	Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek Add new struct to the `expression` variant: ```c++ // A value serialized with the internal (latest) cql_serialization_format struct constant { cql3::raw_value value; data_type type; // Never nullptr, for NULL and UNSET might be empty_type }; ``` and use it where possible instead of `terminal`. This struct will eventually replace all classes deriving from `terminal`, but for now `terminal` can't be removed completely. We can't get rid of terminal yet, because sometimes `terminal` is converted back to `term`, which `constant` can't do. This won't be a problem once we replace term with expression. `bool` is removed from `expression`, now `constant` is used instead. This is a redesign of PR #9203, there is some discussion about the chosen representation there. Closes #9371 * github.com:scylladb/scylla: cql3: term: Remove get_elements and multi_item_terminal from terminals cql3: Replace most uses of terminal with expr::constant cql3: expr: Remove repetition from expr::get_elements cql3: expr: Add expr::get_elements(constant) cql3: term: remove term::bind_and_get cql3: Replace all uses of bind_and_get with evaluate_to_raw_view cql3: expr: Add evaluate_IN_list cql3: tuples: Implement tuples::in_value::get cql3: Move data_type to terminal, make get_value_type non-virtual cql3: user_types: Implement get_value_type in user_types.hh cql3: tuples: Implement get_value_type in tuples.hh cql3: maps: Implement get_value_type in maps.hh cql3: sets: Implement get_value_type in sets.hh cql3: lists: Implement get_value_type in lists.hh cql3: constants: Implement get_value_type in constants.hh cql3: expr: Add expr::evaluate cql3: Make collection term get() use the internal serialization format cql3: values: Add unset value to raw_value_view::make_temporary cql3: expr: Add constant to expression	2021-09-23 13:02:29 +02:00
Avi Kivity	369afe3124	treewide: use coroutine::maybe_yield() instead of co_await make_ready_future() The dedicated API shows the intent, and may be a tiny bit faster. Closes #9382	2021-09-23 12:28:56 +02:00
Avi Kivity	6702711d9c	Merge "Gossiper start-stop sanitation (+ bonus track)" from Pavel E " The main challenge here is to move messaging_service.start_listen() call from out of gossiper into main. Other changes are pretty minor compared to that and include - patch gossiper API towards a standard start-shutdown-stop form - gossiping "sharder info" in initial state - configure cluster name and seeds via gossip_config tests: unit(dev) dtest.bootstrap_test.start_stop_test_node(dev) manual(dev): start+stop, nodetool enable-/disablegossip refs: #2737 refs: #2795 refs: #5489 " * 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla: code: Expell gossiper.hh from other headers storage_service: Gossip "sharder" in initial states gossiper: Relax set_seeds() gossiper, main: Turn init_gossiper into get_seeds_from_config storage_service: Eliminate the do-bind argument from everywhere gossiper: Drop ms-registered manipulations messaging, main, gossiper: Move listening start into main gossiper: Do handlers reg/unreg from start/stop gossiper: Split (un)init_messaging_handler() gossiper: Relocate stop_gossiping() into .stop() gossiper: Introduce .shutdown() and use where appropriate gossiper: Set cluster_name via gossip_config gossiper, main: Straighten start/stop tests/cql_test_env: Open-code tst_init_ms_fd_gossiper tests/cql_test_env: De-global most of gossiper gossiper: Merge start_gossiping() overloads into one gossiper: Use is_... helpers gossiper: Fix do_shadow_round comment gossiper: Dispose dead code	2021-09-23 12:18:38 +03:00
Avi Kivity	bae9c042c2	Merge 'Add compaction stats to tracing data' from Botond Dénes Too many tombstones (row or range) are a common source of query performance problems, yet currently we have no visibility into the amount of tombstones a query has to process while constructing the results. This series addresses this by collecting stats about the compacted data in `compact_mutation_state`. This contains the number of partitions, static rows (live and dead), clustering rows (live and dead) and range tombstones. This data is then added to tracing on each query path. Example trace: ``` activity \| timestamp \| source \| source_elapsed \| client ---------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2021-09-22 12:06:24.089000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0] \| 2021-09-22 12:06:24.089552 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement [shard 0] \| 2021-09-22 12:06:24.089674 \| 127.0.0.1 \| 122 \| 127.0.0.1 Creating read executor for token -4069959284402364209 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] \| 2021-09-22 12:06:24.089724 \| 127.0.0.1 \| 173 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2021-09-22 12:06:24.089727 \| 127.0.0.1 \| 175 \| 127.0.0.1 Start querying singular range {{-4069959284402364209, pk{000400000001}}} [shard 0] \| 2021-09-22 12:06:24.089732 \| 127.0.0.1 \| 181 \| 127.0.0.1 Querying cache for range {{-4069959284402364209, pk{000400000001}}} and slice {(-inf, +inf)} [shard 0] \| 2021-09-22 12:06:24.089751 \| 127.0.0.1 \| 199 \| 127.0.0.1 Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 4 clustering row(s) (3 live, 1 dead) and 1 range tombstone(s) [shard 0] \| 2021-09-22 12:06:24.089838 \| 127.0.0.1 \| 286 \| 127.0.0.1 Querying is done [shard 0] \| 2021-09-22 12:06:24.089847 \| 127.0.0.1 \| 
295 \| 127.0.0.1 Done processing - preparing a result [shard 0] \| 2021-09-22 12:06:24.089862 \| 127.0.0.1 \| 311 \| 127.0.0.1 Request complete \| 2021-09-22 12:06:24.089326 \| 127.0.0.1 \| 326 \| 127.0.0.1 ``` Tests: unit(dev) Fixes: https://github.com/scylladb/scylla/issues/5471 Closes #9372 * github.com:scylladb/scylla: multishard_mutation_query: add tracepoint with compaction stats querier: add tracepoint with compaction stats mutation_compactor: collect stats about compacted data	2021-09-22 19:24:19 +03:00
Kamil Braun	ea172fe531	test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test Use AppendReg instead of ExReg for the state machine. Use a generator which generates a sequence of append operations with unique integers. This implies that the result of every operation uniquely identifies the operation (since it contains the appended integer, and different operations use different integers) and all operations that must have happened before it (since it contains the previous state of the append register), which allows us to reconstruct the "current state" of the register according to the results of operations coming from Raft calls, giving us an on-line linearizability checker with O(1) amortized complexity on each operation completion. We also perform a simple liveness check at the end of the test by ensuring that a leader becomes eventually elected and that we can successfully execute a call.	2021-09-22 17:56:23 +02:00
Avi Kivity	c0afdf3f15	Update seastar submodule * seastar c04a12edbd...e6db0cd587 (13): > Merge "Add kernel stack trace reporting for stalls" from Avi Ref #8828 > Merge "Keep XFS' dioattr cached" from Pavel E > coroutines: de-template maybe_yield() > sharded: Add const versions of map_reduce's > apps/io_tester: remove unused lambda capture > doc: exclude seastar::coroutine::internal namespace > deprecate unaligned_cast<> from unaligned.hh > reactor: adjust max_networking_aio_io_control_blocks to lower size when fs.aio-max-nr is small > build: clarify choice of C++ dialect, and change default to C++20 > coding_style: update concepts style to snake_case > Merge "Teach io_tester to submit requests-per-second flow" from Pavel E > cmake: find and link against Boost::filesystem > coroutine: add maybe_yield	2021-09-22 18:55:25 +03:00
Nadav Har'El	92570ea7d9	cql-pytest: add tests on behavior of empty-string keys We know (verified by existing tests) that null keys are not allowed - neither as partition keys nor clustering keys. In issue #9352 a question was raised of whether an empty string is allowed as a key on a base table (not a materialized view or index). The following tests confirm that the current situation is as follows: 1. An empty string is perfectly legal as a clustering key. 2. An empty string is NOT ALLOWED as a partition key - the error "Key may not be empty" is reported if this is attempted. 3. If the partition key is compound (multiple partition-key columns) then any or all of them may be empty strings. These tests pass the same on both Cassandra and Scylla, showing that this bizarre (and undocumented) behavior is identical in both. Refs #9352. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922131310.293846-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Avi Kivity	083279d9ab	Merge "Generalize sstable creation for tests" from Pavel E " There's a whole lot of places that create an sstable for tests like this auto sst = env.make_sstable(...); sst->write_components(...); sst->load(); Some of them are already generalized with the make_sstable_easy helper, but there are several instances of them. Found while hunting down the places that use default IO sched class behind the scenes. tests: unit(dev) " * 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla: test: Generalize make_sstable() and make_sstable_easy() test: Use now existing helpers elsewhere test: Generalize all make_sstable_easy()-s test: Set test change estimation to 1 test: Generalize make_sstable_easy in mutation tests test: Generalize make_sstable_easy in set tests test: Reuse make_sstable_easy in datafile tests test: Relax make_sstable_easy in compaction tests	2021-09-22 18:55:25 +03:00
Nadav Har'El	a99a774731	cql-pytest: test for secondary-index on empty-string value When a string column is indexed with a secondary index, the empty value for this column (an empty string '') is perfectly legal, and should be indexed as well. This is not the same as an unset (null) value which isn't indexed. The following test demonstrates that this case works in Cassandra, but does not in Scylla (so the test is marked "xfail"). In Scylla, a query that returns the expected results with ALLOW FILTERING suddenly returns a different (and wrong) result when an index is added on the table. This test reproduces issue #9364. Refs #9364. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922121510.291826-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Avi Kivity	b5cf0b4489	Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael " Backlog tracker isn't updated correctly when facing a schema change, and may leak an SSTable if compaction strategy is changed, which causes backlog to be computed incorrectly. Most of these problems happen because sstable set and tracker are updated independently, so it could happen that tracker loses track (pun intended) of changes applied to set. The first patch will fix the leak when strategy is changed, and the third patch will make sure that tracker is updated atomically with sstable set, so these kinds of problems will not happen anymore. Fixes #9157 test: mode(debug) " * 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla: compaction: Update backlog tracker correctly when schema is updated compaction: Don't leak backlog of input sstable when compaction strategy is changed compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() compaction: simplify removal of monitors	2021-09-22 18:55:25 +03:00
Nadav Har'El	e8493e20cb	cql-pytest: test for empty-string as partition key in materialized view Scylla and Cassandra do not allow an empty string as a partition key, but a materialized view might "convert" a regular string column into a partition key, and an empty string is a perfectly valid value for this column. This can result in a view row which has an empty string as a partition key. This case works in Cassandra, but doesn't in Scylla (the row with the empty string as a partition key doesn't appear). The following test demonstrates this difference between Scylla and Cassandra (it passes on Cassandra, fails on Scylla, and accordingly marked "xfail"). Refs #9375. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922115000.290387-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Piotr Jastrzebski	56888c8954	docs: clean up codeowners Recently we had to say goodbye to our dear friend Pekka. He orphaned a few subsystems that can't call for his help in code reviews anymore. This patch makes sure no one will bother Pekka in his afterlife. It also cleans up HACKING.md a little bit by removing Pekka and Duarte from the maintainer/reviewer lists. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <98ba1aed9ee8a87b9037b5032b82abc5bfddbd66.1632301309.git.piotr@scylladb.com>	2021-09-22 18:55:25 +03:00
Botond Dénes	3f4f408bcf	schema: add get_reversed() A variant of make_reversed() which goes through the schema registry, teaching the schema to the registry if necessary. This effectively caches the result of the reversing and as an added bonus double reversing yields the very same schema C++ object that was the starting point. Closes #9365	2021-09-22 18:55:25 +03:00
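The caching behaviour described for get_reversed() in 3f4f408bcf can be illustrated with a small Python sketch. This is not Scylla's schema registry: the `Schema` and `SchemaRegistry` classes and the `(name, reversed)` key are hypothetical stand-ins, chosen only to show why going through a registry makes double reversing return the very same object.

```python
class Schema:
    """Hypothetical stand-in for a table schema; only the fields needed here."""

    def __init__(self, name, reversed_order=False):
        self.name = name
        self.reversed_order = reversed_order


class SchemaRegistry:
    """Hypothetical registry that hands out one object per (name, order) key."""

    def __init__(self):
        self._by_key = {}

    def learn(self, schema):
        # Teach the schema to the registry unless an equivalent one is known.
        key = (schema.name, schema.reversed_order)
        return self._by_key.setdefault(key, schema)

    def get_reversed(self, schema):
        self.learn(schema)
        key = (schema.name, not schema.reversed_order)
        if key not in self._by_key:
            # First request: build the reversed schema and cache it.
            self._by_key[key] = Schema(schema.name, not schema.reversed_order)
        return self._by_key[key]


registry = SchemaRegistry()
forward = Schema("ks.tbl")
backward = registry.get_reversed(forward)
```

Because both directions live in the same registry, `registry.get_reversed(backward)` returns `forward` itself rather than a fresh copy.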
Kamil Braun	81b7ed23bb	test: raft: randomized_nemesis_test: introduce append register The AppendReg state machine stores a sequence of integers. It supports `append` inputs which append a single integer to the sequence and return the previous state (before appending). The implementation uses the `append_seq` data structure representing an immutable sequence that uses a vector underneath which may be shared by multiple instances of `append_seq`. Appending to the sequence appends to the underlying vector, but there is no observable effect on the other instances since they use only the prefix of the sequence that wasn't changed. If two instances sharing the same vector try to append, the later one must perform a copy. This allows efficient appends if only one instance is appending, which is useful in the following context: - a Raft server stores a copy in the underlying state machine replica and appends to it, - clients send append operations to the server; the server returns the state of the sequence before it was appended to, - thanks to the sharing, we don't need to copy all elements when returning the sequence to the client, and only one instance (the server) is appending to the shared vector, - summarizing, all operations have amortized O(1) complexity.	2021-09-22 17:54:07 +02:00
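The sharing scheme that 81b7ed23bb describes for `append_seq` can be sketched in Python as follows. This is an illustration only, not the test's C++ code: the class name `AppendSeq` and its methods are assumptions made for the example. Each instance sees a prefix of a backing list; an append extends the list in place when the instance sees all of it, and copies its prefix otherwise.

```python
class AppendSeq:
    """Immutable sequence view over a backing list that may be shared."""

    def __init__(self, backing=None, length=0):
        self._backing = backing if backing is not None else []
        self._length = length  # this instance only sees backing[:length]

    def append(self, value):
        if len(self._backing) == self._length:
            # Nobody appended past our prefix: extend the shared list in place.
            backing = self._backing
        else:
            # Another instance already appended: copy our prefix first.
            backing = self._backing[:self._length]
        backing.append(value)
        return AppendSeq(backing, self._length + 1)

    def to_list(self):
        return self._backing[:self._length]


base = AppendSeq().append(1).append(2)
first = base.append(3)   # shares the backing list with base
second = base.append(4)  # the later appender must copy
```

Here `base` still reads as `[1, 2]` after both appends, which is the "no observable effect on the other instances" property from the commit message.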
Botond Dénes	922295dd8e	multishard_mutation_query: add tracepoint with compaction stats Add the content of the compaction stats introduced in the previous patch to the tracing data. This will help diagnose query performance related problems caused by tombstones.	2021-09-22 14:00:24 +03:00
Botond Dénes	eba46e353d	querier: add tracepoint with compaction stats Add the content of the compaction stats introduced in the previous patch to the tracing data. This will help diagnose query performance related problems caused by tombstones.	2021-09-22 14:00:05 +03:00
Botond Dénes	f0ead81250	mutation_compactor: collect stats about compacted data Stats contain the number of partitions, static rows, clustering rows and range tombstones. For rows dead/live are counted separately.	2021-09-22 13:59:19 +03:00
Pavel Emelyanov	598841a5dd	code: Expel gossiper.hh from other headers This needs to add forward declarations of the gossiper class and re-include some other headers here and there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	6875a4b292	storage_service: Gossip "sharder" in initial states Right now the number of shards and ignore-msb-bits are gossiped with a separate call. It's simpler to include this data into the initial gossiping state. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	968e117315	gossiper: Relax set_seeds() It's much shorter and simpler to pass the seeds, obtained from the config, into gossiper via gossip_config rather than with the help of a special call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	2b63c4c16f	gossiper, main: Turn init_gossiper into get_seeds_from_config Looking into the init_gossiper() helper makes it clear that what it does is get seeds, provider and listen_address from config and generate a set of seeds for the gossiper. Then it calls gossiper.set_seeds(). This patch renames the helper into get_seeds_from_config(), removes all but the db::config& arguments from it and moves the call to set_seeds() into main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	7680274e02	storage_service: Eliminate the do-bind argument from everywhere The same as in previous patch -- the gossiper doesn't need to know if it should call messaging.start_listen() or not, neither should do the storage_service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	0607a2b84f	gossiper: Drop ms-registered manipulations Now it's no-op and can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	ca316f32f0	messaging, main, gossiper: Move listening start into main Before preparing the cluster join process the messaging should be put into listening state. Right now it's done "on-demand" by the call to the do_shadow_round(), also there's a safety call in the start_gossiping(). Tests, however, should not start listening, so the do_bind boolean exists and is passed all the way around. Make the main() code explicitly call the messaging.start_listen() and leave tests without it. This change makes messaging start listening a bit earlier, but in between these old and new places there's nothing that needs messaging to stay deaf. As the do_bind becomes useless, the wait_for_gossip_to_settle() is also moved into main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	f644eb1cf7	gossiper: Do handlers reg/unreg from start/stop On start handlers can be registered any time before the messaging starts to listen. On stop handlers can remain registered for any length of time, since the messaging service stops early in drain_on_shutdown(). One tricky place is API start_/stop_gossiping(). The latter calls gossiper::stop() thus unregistering the handlers. So to make the start_gossiping() work it must call gossiper::start() in advance. Overall the gossiper start/stop becomes this: gossiper.start() `- registers handlers gossiper.start_gossiping() `- // starts gossiping gossiper.shutdown() `- // stops gossiping gossiper.stop() `- calls shutdown() // re-entrant `- unregisters handlers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	9aba3e6f9f	gossiper: Split (un)init_messaging_handler() As a preparation for the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	dfe54207cb	gossiper: Relocate stop_gossiping() into .stop() The helper in question is called in two places: 1. In main() as a fuse against early exception before creating the drain_on_shutdown() defer 2. In the stop_gossiping() API call Both can be replaced with the stop_gossiping() call from the .stop() method, here's why: 1. In main the gossiper::stop() call is already deferred right after the gossiper is started. So this change moves it above. It may happen that an exception pops up before the old fuse was deferred, but that's OK -- the stop_gossiping() is safe against early- and re- entrances 2. The stop_gossiping() change is effectively a rename -- it calls the stop_gossiping() as it did before, but with the help of the .stop() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	e24c5034b5	gossiper: Introduce .shutdown() and use where appropriate The start/stop sequence we're moving towards assumes a shutdown (or drain) method that will be called early on stop to notify the service that the system is going down so it could prepare. For gossiper it already means calling stop_gossiping() on the shard-0 instance. So by and large this patch renames a few stop_gossiping() calls into .shutdown() ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	25210334b6	gossiper: Set cluster_name via gossip_config It's taken purely from the db::config and thus can be set up early. Right now the empty name is converted into "Test Cluster" one, but remains empty in the config and is later used by the system_keyspace code. This logic remains intact. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	084abb824e	gossiper, main: Straighten start/stop Turn the gossiper start/stop sequence into the canonical form gossiper.start(std::ref(dependencies)...).get(); auto stop_gossiper = defer({ gossiper.invoke_on_all(&gossiper::stop).get(); }); gossiper.invoke_on_all(&gossiper::start).get(); The deferred call should be gossiper.stop(); but for now keep the instances memory alive. This trick is safe at this point, because .start() and .stop() methods are both empty (still). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:05 +03:00
Jan Ciolek	3d23d6f9dd	cql3: term: Remove get_elements and multi_item_terminal from terminals terminal now isn't used as a final value anywhere. Remove things that are no longer needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:33:00 +02:00
Jan Ciolek	2523c9ba48	cql3: Replace most uses of terminal with expr::constant constant is now ready to replace terminal as a final value representation. Replace bind() with evaluate and shared_ptr<terminal> with constant. We can't get rid of terminal yet. Sometimes terminal is converted back to term, which constant can't do. This won't be a problem once we replace term with expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:28:15 +02:00
Jan Ciolek	c859ec2bdf	cql3: expr: Remove repetition from expr::get_elements There was some repeating code in expr::get_elements family of functions. It has been reduced into one function. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:28:15 +02:00
Jan Ciolek	2cbed7a679	cql3: expr: Add expr::get_elements(constant) We need to be able to access elements of a constant. Adds functions to easily do it. Those functions check all preconditions required to access elements and then use partially_deserialize_* or similar. It's much more convenient than using partially_deserialize directly. get_list_of_tuples_elements is useful with IN restrictions like (a, b) IN [(1, 2), (3, 4)]. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:28:15 +02:00
Jan Ciolek	d39b085428	cql3: term: remove term::bind_and_get term::bind_and_get is not needed anymore, remove it. Some classes use bind_and_get internally, those functions are left intact and renamed to bind_and_get_internal. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:28:14 +02:00
Jan Ciolek	221ed38e94	cql3: Replace all uses of bind_and_get with evaluate_to_raw_view Start using evaluate_to_raw_view instead of bind_and_get. This is a step towards using only evaluate. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:20:30 +02:00
Jan Ciolek	adaf6e5eec	cql3: expr: Add evaluate_IN_list A list representing IN values might contain NULLs before evaluation. We can remove them during evaluation, because nothing equals NULL. If we don't remove them, there are gonna be errors, because a list can't contain NULLs. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:20:29 +02:00
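The reasoning behind evaluate_IN_list in adaf6e5eec (NULLs can be dropped from an IN list because nothing equals NULL) can be shown with a short Python sketch. The function names below are hypothetical and `None` stands in for a CQL NULL; this does not reproduce the cql3 code.

```python
def evaluate_in_list(values):
    """Return the IN candidates with NULLs (None) removed."""
    return [v for v in values if v is not None]


def matches_in(column_value, values):
    """True if column_value equals one of the non-NULL IN candidates."""
    if column_value is None:
        # A NULL column value equals nothing, so it never matches.
        return False
    return column_value in evaluate_in_list(values)


candidates = [1, None, 2]
cleaned = evaluate_in_list(candidates)
```

Dropping the NULLs does not change which rows match, and it leaves a list that is legal to serialize since it no longer contains NULL elements.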
Jan Ciolek	33882cc716	cql3: tuples: Implement tuples::in_value::get To convert a terminal to expr::constant we need to be able to serialize it. tuples::in_value didn't have serialization implemented, do it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:20:29 +02:00
Jan Ciolek	2936adc570	cql3: Move data_type to terminal, make get_value_type non-virtual Every class now has implementation of get_value_type(). We can simply make base class keep the data_type. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:20:28 +02:00
Jan Ciolek	e683bf0379	cql3: user_types: Implement get_value_type in user_types.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in user_types.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	0ac0f11d64	cql3: tuples: Implement get_value_type in tuples.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in tuples.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	48e5277b2f	cql3: maps: Implement get_value_type in maps.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in maps.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	5aae370928	cql3: sets: Implement get_value_type in sets.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in sets.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	6bf6b03d12	cql3: lists: Implement get_value_type in lists.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in lists.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	da7ca5a760	cql3: constants: Implement get_value_type in constants.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in constants.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:36 +02:00
Jan Ciolek	a964827696	cql3: expr: Add expr::evaluate Adds the functions: constant evaluate(term, const query_options&); raw_value_view evaluate(term, const query_options&); These functions take a term, bind it and convert the terminal to constant or raw_value_view. In the future these functions will take expression instead of term. For that to happen bind() has to be implemented on expression, this will be done later. Also introduces terminal::get_value_type(). In order to construct a constant from terminal we need to know the type. It will be implemented in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:13:34 +02:00
Jan Ciolek	561e6b0a59	cql3: Make collection term get() use the internal serialization format A term should always be serialized using the internal cql serialization format. A term represents a value received from the driver, but for every use we are going to need it in the internal serialization format. Other places in the code already do this, for example see list_prepare_term, it calls value.bind(query_options::DEFAULT) to evaluate a collection_constructor. query_options::DEFAULT has the latest cql serialization format. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:05:09 +02:00
Jan Ciolek	2b9a9c8ff5	cql3: values: Add unset value to raw_value_view::make_temporary When unset_value is passed to make_temporary it gets converted to null_value. This looks like a mistake, it should be just unset_value. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:05:09 +02:00
Jan Ciolek	ad3d2ee47d	cql3: expr: Add constant to expression Adds constant to the expression variant: struct constant { raw_value value; data_type type; }; This struct will be used to represent constant values with known bytes and type. This corresponds to the terminal from current design. bool is removed from expression, now constant is used instead. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-21 16:05:09 +02:00
Pavel Emelyanov	c4d1022943	tests/cql_test_env: Open-code tst_init_ms_fd_gossiper The helper is called once. Keeping this code in the caller packs the code, helps it look more like main() and facilitates further patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 12:54:23 +03:00
Botond Dénes	c53c50e6f1	scylla-gdb.py: scylla memory: exclude too small object sizes Sizes too small to fit a ::seastar::memory::free_object won't contain any objects at all so they don't contribute anything to the listing beyond noise. Closes #9366	2021-09-21 11:21:10 +02:00
Pavel Emelyanov	83902f43ab	tests/cql_test_env: De-global most of gossiper Gossiper is still global and cql_test_env heavily exploits this fact. Clean that by getting the gossiper once and using the local reference everywhere else. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	89adb0df90	gossiper: Merge start_gossiping() overloads into one There are two of them and one is only called from the API with the do_bind always set to "yes". This fact makes it possible to remove it by adding relevant defaults for the other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	e71bd23b3d	gossiper: Use is_... helpers There are several state booleans on the service and some helpers to manipulate/check those. Make the code consistent by always using these helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	efb0ddff21	gossiper: Fix do_shadow_round comment Shadow round is used during each boot, not only during node replacement Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	f7ab1aa876	gossiper: Dispose dead code The debug_show() is unused, as well as the advertise_myself(). The _features_condvar used to be listened on before `f32f08c9`, now it's signal-only. Feature friendship with gossiper is not required. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Piotr Sarna	d3edca4b43	Merge 'alternator: add stub implementation of TTL's API operations' ... from Nadav Har'El This small series adds a stub implementation of Alternator's UpdateTimeToLive and DescribeTimeToLive operations. These operations can enable, disable, or inquire about, the chosen expiration-time attribute. Currently, the information about the chosen attribute is only saved, with no actual expiration of any items taking place. Because this is an incomplete implementation of this feature, it is not enabled unless an experimental flag is enabled on all nodes in the cluster. See the individual patches for more information on what this series does. Refs #5060. Closes #9345 * github.com:scylladb/scylla: test/alternator: rename utility function test_table_name() alternator: stub TTL operations alternator: make three utility functions in executor.cc non-static test/alternator: test another corner case of TTL	2021-09-21 09:58:17 +02:00
Raphael S. Carvalho	ff38f59f67	compaction: Update backlog tracker correctly when schema is updated Currently the following can happen: 1) there's ongoing compaction with input sstable A, so sstable set and backlog tracker both contain A. 2) ongoing compaction replaces input sstable A by B, so sstable set contains only B now. 3) schema is updated, so a new backlog tracker is built without A because sstable set now contains only B. 4) ongoing compaction tries to remove A from tracker, but it was excluded in step 3. 5) tracker can now have a negative value if table is decreasing in size, which leads to log(<negative number>) == -NaN This problem happens because backlog tracker updates are decoupled from sstable set updates. Given that the essential content of backlog tracker should be the same as one of sstable set, let's move tracker management to table. Whenever sstable set is updated, backlog tracker will be updated with the same changes, making their management less error prone. Fixes #9157 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:54:41 -03:00
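The fix described in ff38f59f67, where the table applies every sstable-set change to the backlog tracker as well, can be sketched in Python. Everything here is a simplified assumption: the class and method names are invented for illustration, and the backlog is modelled as a plain sum of sizes rather than Scylla's actual backlog computation.

```python
class BacklogTracker:
    """Toy tracker: backlog is modelled as the sum of tracked sstable sizes."""

    def __init__(self):
        self._sizes = {}

    def add(self, name, size):
        self._sizes[name] = size

    def remove(self, name):
        self._sizes.pop(name, None)

    def total(self):
        return sum(self._sizes.values())


class Table:
    """Owns both the sstable set and the tracker, and updates them together."""

    def __init__(self):
        self.sstables = {}
        self.tracker = BacklogTracker()

    def replace_sstables(self, removed, added):
        # One entry point: the same change is applied to set and tracker.
        for name in removed:
            del self.sstables[name]
            self.tracker.remove(name)
        for name, size in added.items():
            self.sstables[name] = size
            self.tracker.add(name, size)

    def on_schema_update(self):
        # Rebuild the tracker from the current set; no later out-of-band
        # removal is needed because compaction goes through replace_sstables.
        self.tracker = BacklogTracker()
        for name, size in self.sstables.items():
            self.tracker.add(name, size)


table = Table()
table.replace_sstables([], {"A": 100})
table.replace_sstables(["A"], {"B": 60})  # compaction replaces A with B
table.on_schema_update()
```

With a single update path, the scenario from steps 1 to 5 cannot leave the tracker referring to an sstable that the set no longer holds.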
Raphael S. Carvalho	0a3049908c	compaction: Don't leak backlog of input sstable when compaction strategy is changed The generic backlog formula is: ALL + PARTIAL - COMPACTING With transfer_ongoing_charges() we already ignore the effect of ongoing compactions on COMPACTING as we judge them to be pointless. But ongoing compactions will run to completion, meaning that output sstables will be added to ALL anyway, in the formula above. With stop_tracking_ongoing_compactions(), input sstables are never removed from the tracker, but output sstables are added, which means we end up with duplicate backlog in the tracker. By removing this tracking mechanism, pointless ongoing compaction will be ignored as expected and the leaks will be fixed. Later, the intention is to force a stop on ongoing compactions if strategy has changed as they're pointless anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:36:05 -03:00
Raphael S. Carvalho	3dc1821287	compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() This new function makes it easier to remove monitor of exhausted sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:16:41 -03:00
Raphael S. Carvalho	28ba8bde80	compaction: simplify removal of monitors by switching to unordered_map, removal of generated monitors is made easier. this is a preparatory change for patch which will remove monitor for all exhausted sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:06:37 -03:00
Pavel Emelyanov	1cb2b65205	test: Generalize make_sstable() and make_sstable_easy() The former constructs a memtable from the vector of mutations and then does exactly the same steps as the latter one -- creates an sstable corresponding to the memtable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	843dac0b8a	test: Use now existing helpers elsewhere There are several places in other tests that can make use of the new make_sstable_easy() helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	a2590368ce	test: Generalize all make_sstable_easy()-s There are already four of them. Those working with the mutation reader can be folded into one with some default args. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	e45f81ceb4	test: Set test change estimation to 1 The test intention is not to test how zero estimated partitions work, there's another case for that (in another test). Also it looks like 0 doesn't flow anywhere far, it's std::max-ed into 1 early inside mc::writer constructor. This change significantly simplifies the unification of the set of make_sstable_easy()-s in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	96feafabd4	test: Generalize make_sstable_easy in mutation tests The same trick as in the previous patch, but the new helper accepts a memtable instead of a mutation reader and makes the reader from the memtable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	ee91a8334c	test: Generalize make_sstable_easy in set tests There are a bunch of places in the test that do the same sequence of steps to create an sstable. Generalize them into a helper that resembles the one from previous patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	28e5307ce2	test: Reuse make_sstable_easy in datafile tests This patch is two-fold. First it changes the signature of the local helper to facilitate next patching. Second, it makes more relevant places in the test use this helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Pavel Emelyanov	44294accb6	test: Relax make_sstable_easy in compaction tests The version argument can be omitted, the env.make_sstable will default it to highest version. The generation argument is left and defaulted to 1. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-20 15:44:14 +03:00
Piotr Sarna	dd9d6c081e	cql-pytest: relax error conditions for a failed wasm execution Originally, the expected failure for a recursive invocation test case was to expect that fuel gets exhausted, but it's also possible to hit a stack limit first. All errors are equally expected here as long as the execution is halted, so let's relax the condition and accept any wasm-related InvalidRequest errors. Closes #9361	2021-09-20 15:20:52 +03:00
Avi Kivity	8c0f2f9e3d	Revert "Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek" This reverts commit `e9343fd382`, reversing changes made to `27138b215b`. It causes a regression in v2 serialization_format support: collection_serialization_with_protocol_v2_test fails with: marshaling error: read_simple_bytes - not enough bytes (requested 1627390306, got 3) Fixes #9360	2021-09-20 15:15:09 +03:00
Avi Kivity	15819e0304	Merge "Database start/stop code sanitation" from Pavel E " Currently database start and stop code is quite dispersed and exists in two slightly different forms -- one in main and the other one in cql_test_env. This set unifies both and makes them look almost the perfect way: sharded<database> db; db.start(<dependencies>); auto stop = defer([&db] { db.stop().get(); }); db.invoke_on_all(&database::start).get(); with all (well, most) other mentions of the "db" variable being arguments for other services' dependencies. tests: unit(dev, release), unit.cross_shard_barrier(debug) dtest.simple_boot_shutdown(dev) refs: #2737 refs: #2795 refs: #5489 " * 'br-database-teardown-unification-2' of https://github.com/xemul/scylla: (26 commits) main: Log when database starts view_update_generator: Register staging sstables in constructor database, messaging: Delete old connection drop notification database, proxy: Relocate connection-drop activity messaging, proxy: Notify connection drops with boost signal database, tests: Rework recommended format setting database, sstables_manager: Sow some noexcepts database: Eliminate unused helpers database: Merge the stop_database() into database::stop() database: Flatten stop_database() database: Equip with cross-shard-barrier database: Move starting bits into start() database: Add .start() method main: Initialize directories before database main, api: Detach set_server_config from database and move up main: Shorten commitlog creation database: Extract commitlog initialization from init_system_keyspace repair: Shutdown without database help main: Shift iosched verification upward database: Remove unused mm arg from init_non_system_keyspaces() ...	2021-09-20 10:26:13 +03:00
Nadav Har'El	58078a3f84	test/alternator: rename utility function test_table_name() We have a utility function test_table_name() to create a unique name for a test table. The funny thing is, that because this function starts with the string "test_", pytest believes it's a test. This doesn't cause any problems (it's considered a passing test), but it's nevertheless strange to see it listed on the list of tests. So in this patch, we trivially rename this function to unique_table_name(), a name which pytest doesn't think is the name of a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Nadav Har'El	4ffd8c1f2b	alternator: stub TTL operations This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive operations to Alternator. These operations can enable, disable, or inquire about, the chosen expiration-time attribute. Currently, the information about the chosen attribute is only saved, with no actual expiration of any items taking place. Some of the tests for the TTL feature start to pass, so their xfail tag is removed. Because this new feature is incomplete, it is not enabled unless the "alternator-ttl" experimental feature is enabled. Moreover, for these operations to be allowed, the entire cluster needs to support this experimental feature, because all nodes need to participate in the data expiration - if some old nodes don't support Alternator TTL, some of the data they hold won't get expired... So we don't allow enabling TTL until all the nodes in the cluster support this feature. The implementation is in a new source file, alternator/ttl.cc. This source file will continue to grow as we implement the expiration feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Nadav Har'El	7404c7a9c1	alternator: make three utility functions in executor.cc non-static Make three of the utility functions in alternator/executor.cc, which until now were static (local to the source files) external symbols (in the alternator namespace). This will allow using them in other Alternator source files - like the one in the next patch for TTL support, which we'll want to put in a separate source file. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Nadav Har'El	82d2942ac8	test/alternator: test another corner case of TTL Usually the TTL feature's expiration-time attribute is a schema-less attribute, implemented in Alternator as a JSON-serialized item in a bigger map column. However, key attributes are a special case because they are implemented as separate columns. We already had test cases showing that this case works too - for the case of hash and range keys. In this test we test another possibility of an attribute that is implemented as a schema column - the case of an LSI key. As the other TTL tests, this test too passes on DynamoDB but xfails on Alternator because the TTL feature is not yet implemented. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Beni Peled	e873bdbfe9	docker: fix entrypoint issue This commit fixes [0], in which an extra (redundant) keyword is added to the `--entrypoint` and causes scylla-server to fail to start [0] https://github.com/scylladb/scylla-pkg/issues/2395 Closes #9350 Fixes #9355	2021-09-19 15:39:08 +03:00
Kamil Braun	e3f1667744	sstables: remove use_binary_search_in_promoted_index This was a global variable that was potentially modified from a performance benchmark. It would modify the behavior of `index_reader` in certain scenarios. Remove the variable so we can specify the behavior of `index_reader` functions without relying on anything other than what's passed into the constructor and the function parameters.	2021-09-19 13:59:25 +03:00
Kamil Braun	28193805e5	mutation_partition: fix exception message in append_clustered_row	2021-09-19 13:47:19 +03:00
Benny Halevy	fa46bf3499	compaction: split compaction_aborted_exception from compaction_stopped_exception Indicate whether the compaction job should be aborted due to an error using a new, compaction_aborted_exception type, vs. compaction_stopped_exception that indicates the task should be stopped due to some external event that doesn't indicate an error (like shutdown or api call). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	eebe14e7bc	compaction_manager: maybe_stop_on_error: rely on retry=false default No need to set retry to false again in various catch clauses. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	ca2bb89180	compaction_manager: maybe_stop_on_error: sync return value with error message. It is misleading to set retry to true in the following statement and return it later on when the `will_stop` parameter is true. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	a1fe40278b	compaction: drop retry parameter from compaction_stop_exception Drop the retry parameter from compaction_stop_exception as it is always false. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	9800dbe871	compaction_manager: move errors stats accounting to maybe_stop_on_error Currently, _stats.errors is not accounted for non-retryable errors like storage_io_error. Fixes #9354 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:22 +03:00
Benny Halevy	ce3fcc121e	paxos_state: prepare: handle exception getting data or digest This exception is ignored by design, but if it's left unhandled, it generates `Exceptional future ignored` warnings, like the following. Also, ignore f2 if f1 failed since we return early in this case. ``` [shard 5] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem), backtrace: 0x431689e 0x4316d40 0x43170e8 0x3f35486 0x218d14a 0x3f8002f 0x3f81217 0x3f9f868 0x3f4b76a /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012 N7seastar12continuationINS_8internal22promise_base_with_typeISt7variantIJN5utils4UUIDEN7service5paxos7promiseEEEEEZZZZNS7_11paxos_state7prepareEN7tracing15trace_state_ptrENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandERK13partition_keyS5_bNSI_16digest_algorithmENSt6chrono10time_pointINS_12lowres_clockENSQ_8durationIlSt5ratioILl1ELl1000EEEEEEENK3$_0clEvENUlvE_clEvENKUlSB_E_clESB_EUlT_E_ZNS_6futureISt5tupleIJNS13_IvEENS13_IS14_IJNSE_INSI_6resultEEE17cache_temperatureEEEEEEE14then_impl_nrvoIS12_NS13_IS9_EEEET0_OS11_EUlOSA_RS12_ONS_12future_stateIS1B_EEE_S1B_EE#012 seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, unsigned long, seastar::lowres_clock::duration, 
std::result_of&&)::{lambda(seastar::basic_semaphore)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::basic_semaphore)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::future<std::variant<utils::UUID, service::paxos::promise> >&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012 seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::key_lock_map::with_locked_key<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(dht::token const&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::result_of)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, 
service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012 seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}>(service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, 
std::variant<utils::UUID, service::paxos::promise> >#012 seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}::operator()() const::{lambda(std::variant<utils::UUID, service::paxos::promise>)#1}, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_impl_nrvo<{lambda()#1}, {lambda()#1}<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > > >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012 seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, seastar::future<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >::finally_body<seastar::smp::submit_to<service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, 
seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}>(unsigned int, se ``` Refs #7779 Refs #9331 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210919053007.13960-1-bhalevy@scylladb.com>	2021-09-19 11:58:21 +03:00
Takuya ASADA	5ab7fb7f10	reloc: stop removing entire BUILDDIR We found that a user can mistakenly break the system with the --builddir option, with something like './reloc/build_deb.sh --builddir /'. To avoid that, we need to stop removing the entire $BUILDDIR and remove only the directories we have to clean up before building the deb package. See: https://github.com/scylladb/scylla-python3/pull/23#discussion_r707088453 Closes #9351	2021-09-19 10:33:33 +03:00
Pavel Emelyanov	8d7a907a65	main: Log when database starts Just to be consistent with other "services". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	0de69136d4	view_update_generator: Register staging sstables in constructor First, it's to fix the discarded future during the register. The future is not actually such, as it's always the no-op ready one as at that stage the view_update_generator is neither aborted nor is in throttling state. Second, this change is to keep database start-up code in main shorter and cleaner. Registering staging sstables belongs to the view_update_generator start code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	a4118a70ee	database, messaging: Delete old connection drop notification Database no longer needs it. Since the only user of the old-style notification is gone -- remove it as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	bfd91d7b81	database, proxy: Relocate connection-drop activity On start database is subscribed on messaging-service connection drop notification to drop the hit-rate from column families. However, the updater and reader of those hit-rates is the storage_proxy, so it must be the _proxy_ who drops the hit-rate. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	dd498273dc	messaging, proxy: Notify connection drops with boost signal The messaging_service keeps track of a list of connection-drop listeners. This list is not auto-removing and is thus not safe on stop (fortunately there's only 1 non-stopping client of it so far). This patch adds a safer notification based on boost/signals. Also storage_proxy is subscribed to it in advance to demonstrate what it looks like altogether and make the next patch shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	b78e9b51b7	database, tests: Rework recommended format setting Tests don't have an sstable format selector and enforce the needed format by hand with the help of a special database:: method. It's more natural to provide it via config. Doing this makes database initialization in main and cql_test_env closer to each other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	a42383b127	database, sstables_manager: Sow some noexcepts Setting sstables format into database and into sstables_manager is all plain assignments. Mark them as noexcept, next patch will become apparently exception safe after that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	9a76df96e3	database: Eliminate unused helpers There are some large-data-handler-related helpers left after previous patches, they can be removed altogether. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	4b7846da86	database: Merge the stop_database() into database::stop() After stop_database() became shard-local, it's possible to merge it with database::stop() as they are both called one after another on scylla stop. In cql-test-env there are few more steps in between, but they don't rely on the database being partially stopped. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	469c734155	database: Flatten stop_database() The method needs to perform four steps cross-shard synchronously: first stop the compaction manager, then close user and, after that, system tables, and finally shut down the large data handler. This patch reworks this synchronization with the help of the cross-shard barrier added to the database previously. The motivation is to merge .stop_database() with .stop(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	b1013e09b4	database: Equip with cross-shard-barrier Make sure a node-wide barrier exists on a database when scylla starts. Also provide a barrier for cql_test_env. In all other cases keep a solo-mode barrier so that single-shard db stop doesn't get blocked. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:49:06 +03:00
Pavel Emelyanov	634ea4b543	database: Move starting bits into start() These include large_data_handler::start, compaction_manager::enable and database::init_commitlog. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:48:48 +03:00
Pavel Solodovnikov	02f27260cc	idl: allow specifying multiple attributes in the grammar This patch extends the IDL grammar to allow multiple `[[...]]` attribute clauses, as well as more than one attribute inside a single attribute clause, e.g.: `[[attr1, attr2]]` will be parsed correctly now. For now, in all existing use cases only the first attribute is taken into account and the rest are ignored. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-15 17:47:27 +03:00
Pavel Solodovnikov	7a8cadcca8	message: messaging_service: extract RPC protocol details and helpers into a separate header Introduce a new header `message/rpc_protocol_impl.hh`, move here the following things from `message/messaging_service.cc`: * RPC protocol wrappers implementation * Serialization thunks * `register_handler` and `send_message*` functions This code will be used later for IDL-generated RPC verbs implementation. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-15 17:47:11 +03:00
Pavel Emelyanov	e2308034ff	database: Add .start() method Called right after the sharded::start(). For now empty, to be populated by next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:44:48 +03:00
Pavel Emelyanov	80983951fb	main: Initialize directories before database This is to keep all database start (and stop) code together. Right now directories startup breaks this into two pieces. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:42:20 +03:00
Pavel Emelyanov	c05c58d2b1	main, api: Detach set_server_config from database and move up The api::set_server_config() depends on sharded database to start, but really doesn't need it -- it needs only the db::config object which's available earlier. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:37:10 +03:00
Pavel Emelyanov	127e4fe8de	main: Shorten commitlog creation This does three things in one go: - converts `db.invoke_on_all([] (database& db) { return db.init_commitlog(); });` into a one-line version `db.invoke_on_all(&database::init_commitlog);` - removes the shard-0 pre-initialization for tests, because tests don't have the problem this pre-initialization solves - makes init_commitlog() re-entrant so that regular start does not need to check for shard-0 explicitly Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:37:07 +03:00
Pavel Emelyanov	f6ab69b7f8	database: Extract commitlog initialization from init_system_keyspace The intention is to keep all database initialization code in one place. The init_system_keyspace() is one of the obstacles -- it initializes db's commitlog as its first step. This patch moves the commitlog initialization out of the mentioned helper. The result looks clumsy, but it's temporary, next patches will brush it up. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:36:42 +03:00
Pavel Emelyanov	d156a8993f	repair: Shutdown without database help The sharded database reference is passed into repair_shutdown() just to have something to call .invoke_on_all() onto. There's the more appropriate sharded repair_service for this, so use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:48 +03:00
Pavel Emelyanov	6c54c868b8	main: Shift iosched verification upward There's a block of CLI options sanity checks in the beginning of main starting lambda, it's better to have the iosched validation in this block. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:39 +03:00
Pavel Emelyanov	bd2b7dca0e	database: Remove unused mm arg from init_non_system_keyspaces() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:37 +03:00
Pavel Emelyanov	dc92f220e4	database: Drop get_available_memory() helper It's only used on start to provide the total_memory() value to the repair configuration code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:32 +03:00
Pavel Emelyanov	7e5abb5096	main, scylla-gdb, cql-test-env: Unify debug::the_database All the debug:: inhabitants have their names look like "the_<classname>" This patch brings the database piece to this standard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:30 +03:00
Pavel Emelyanov	e69969b6c7	scylla-gdb: Use find_db helper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:26 +03:00
Pavel Emelyanov	75e1d7ea74	large_data_handler: Prepare for stopped qctx All the large data handler methods rely on global qctx thing to write down its notes. This creates circular dependency: query processor -> database -> large_data_handler -> qctx -> qp In scylla this is not a technical problem, neither qctx nor the query processor are stopped. It is a problem in cql_test_env that stops everything, including resetting qctx to null. To avoid tests stepping on nullptr qctx add the explicit check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:24 +03:00
Pavel Emelyanov	bb23986826	wasm: Localize it to database usage The wasm::engine exists as a sharded<> service in main, but it's only passed by local reference into database on start. There's not much profit in keeping it at main scope, things get much simpler if the engine is kept purely on database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:17 +03:00
Pavel Emelyanov	e324230648	utils: Introduce cross-shard barrier (with test) Add a synchronization facility to let shards wait for each other to pass through certain points in the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-15 17:35:12 +03:00
Avi Kivity	cc8fc73761	Merge 'hints: fix bugs in HTTP API for waiting for hints found by running dtest in debug mode' from Piotr Dulikowski This series of commits fixes a small number of bugs with current implementation of HTTP API which allows to wait until hints are replayed, found by running the `hintedhandoff_sync_point_api_test` dtest in debug mode. Refs: #9320 Closes #9346 * github.com:scylladb/scylla: commitlog: make it possible to provide base segment ID hints: fill up missing shards with zeros in decoded sync points hints: propagate abort signal correctly in wait_for_sync_point hints: fix use-after-free when dismissing replay waiters	2021-09-15 12:55:54 +03:00
Avi Kivity	daf028210b	build: enable -Winconsistent-missing-override warning This warning can catch a virtual function that thinks it overrides another, but doesn't, because the two functions have different signatures. This isn't very likely since most of our virtual functions override pure virtuals, but it's still worth having. Enable the warning and fix numerous violations. Closes #9347	2021-09-15 12:55:54 +03:00
Botond Dénes	bd8e2e6691	tools/utils.hh: make self-sufficient Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210915055910.167091-1-bdenes@scylladb.com>	2021-09-15 12:55:54 +03:00
Michał Radwański	7c8b895285	utils/small_vector: remove `noexcept` from the copy constructor, which potentially throws The copy constructor of small vector has a noexcept specifier, however it calls `reserve(size_t)`, which can throw `std::bad_alloc`. This causes issues when using it inside tests that use alloc_failure_injector, but could potentially also surface in production. Closes #9338	2021-09-15 12:55:54 +03:00
Piotr Dulikowski	91163fcfa5	commitlog: make it possible to provide base segment ID Adds a configuration option to the commitlog: base_segment_id. When provided, the commitlog uses this ID as a base of its segment IDs instead of calculating it based on the number of milliseconds between the epoch and boot time. This is needed in order for the feature that allows waiting for hints to be replayed to work - it relies on the replay positions monotonically increasing. An endpoint manager periodically re-creates its commitlog instance - if it is re-created when there are no segments on disk, currently it will choose the number of milliseconds between the epoch and boot time, which might result in segments being generated with the same IDs as some segments previously created and deleted during the same runtime.	2021-09-15 11:04:34 +02:00
Piotr Dulikowski	486421c58c	hints: fill up missing shards with zeros in decoded sync points Between encoding and decoding of a sync point, the node might have been restarted and resharded with increased shard count. During resharding, existing hints segments might have been moved to new shards. Because of that, we need to make sure that we wait for foreign segments to be replayed on the new shards too. This commit modifies the sync point decoding logic so that it places a zero replay position for new shards. Additionally, an (incorrect) shard count check is removed from `storage_proxy::wait_for_hint_sync_point` because now the shard count in a decoded sync point is guaranteed to be not less than the node's current shard count.	2021-09-15 11:04:34 +02:00
Avi Kivity	08042c1688	Merge 'reader_permit: make query max result size accessible from the permit' from Kamil Braun This will make it easier, for example, to enforce memory limits in lower levels of the `flat_mutation_reader` stack. By default, the query result size is unlimited. However, for specific queries it is possible to store a different value (e.g. obtained from a `read_command` object) through a setter. An example of this can be seen in the last commit of this PR, where we set the limit to `cmd.max_result_size` if engaged, or to the 'unlimited query' limit (using `database::get_unlimited_query_max_result_size()`) if not. Refs: #9281. The v2 version of the reverse sstable reader PR will be based on this PR: we'll use the query max result size parameter in one of the readers down the stack where `read_command` is not available but `reader_permit` is. Closes #9341 * github.com:scylladb/scylla: table, database: query, mutation_query: remove unnecessary class_config param reader_permit: make query max result size accessible from the permit reader_concurrency_semaphore: remove default parameter values from constructors query_class_config: remove query::max_result_size default constructor	2021-09-14 16:17:18 +03:00
Piotr Dulikowski	77f2448b2c	hints: propagate abort signal correctly in wait_for_sync_point When `manager::wait_for_sync_point` is called, the abort source from the arguments (`as`) might have already been triggered. In such a case, the subscription which was supposed to trigger the `local_as` abort source won't be run, and the code will wait indefinitely for hints to be replayed instead of checking the replay status and returning immediately. This commit fixes the problem by manually triggering `local_as` if `as` has been triggered.	2021-09-14 14:27:01 +02:00
Piotr Dulikowski	8e29ebc5d5	hints: fix use-after-free when dismissing replay waiters When the promise waited on in the `wait_until_hints_are_replayed_up_to` function is resolved, a continuation runs which prints a log line with information about this event. The continuation captures a pointer to the hints sender and uses it to get information about the endpoint whose hints are waited for. However, at this point the sender might have been deleted - for example, when the node is being stopped and everybody waiting for hints is dismissed. This commit fixes the use-after-free by getting all necessary information while the sender is guaranteed to be alive and captures it in the continuation's capture list.	2021-09-14 13:46:16 +02:00
Kamil Braun	c12e265eb8	table, database: query, mutation_query: remove unnecessary class_config param The semaphore inside was never accessed and `max_memory_for_unlimited_query` was always equal to `cmd.max_result_size` so the parameter was completely redundant. `cmd.max_result_size` is supposed to be always set in the affected functions - which are executed on the replica side - as soon as the replica receives the `read_command` object, in case the parameter was not set by the coordinator. However, we don't have a guarantee at the type level (it's still an `optional`). Many places used `cmd.max_result_size` without even an assertion. We make the code a bit safer, we check for `cmd.max_result_size` and if it's indeed engaged, store it in `reader_permit`. We then access it from `reader_permit` where necessary. If `cmd.max_result_size` is not set, we assume this is an unlimited query and obtain the limit from `get_unlimited_query_max_result_size`.	2021-09-14 13:39:56 +02:00
Kamil Braun	e8824986dd	reader_permit: make query max result size accessible from the permit This will make it easier, for example, to enforce memory limits in lower levels of the flat_mutation_reader stack. By default the size is unlimited. However, for specific queries it is possible to store a different value (for example, obtained from a `read_command` object) through a setter.	2021-09-14 13:27:25 +02:00
Kamil Braun	fbb83dd5ca	reader_concurrency_semaphore: remove default parameter values from constructors It's easy to forget about supplying the correct value for a parameter when it has a default value specified. It's safer if 'production code' is forced to always supply these parameters manually. The default values were mostly useful in tests, where some parameters didn't matter that much and where the majority of uses of the class are. Without default values adding a new parameter is a pain, forcing one to modify every usage in the tests - and there are a bunch of them. To solve this, we introduce a new constructor which requires passing the `for_tests` tag, marking that the constructor is only supposed to be used in tests (and the constructor has an appropriate comment). This constructor uses default values, but the other constructors - used in 'production code' - do not.	2021-09-14 12:20:28 +02:00
Kamil Braun	8386b55e9c	query_class_config: remove query::max_result_size default constructor The default values for the fields of this class didn't make much sense, and the default constructor was used only in a single place so removing it is trivial. It's safer when the user is forced to supply the limits.	2021-09-14 12:20:28 +02:00
Avi Kivity	3f2c680b70	Merge 'Add initial support for WebAssembly in user-defined functions (UDF)' from Piotr Sarna This series adds very basic support for WebAssembly-based user-defined functions. This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation. Example usage: ```cql CREATE FUNCTION ks.fibonacci (str text) RETURNS NULL ON NULL INPUT RETURNS boolean LANGUAGE xwasm AS ' (module (func $fibonacci (param $n i32) (result i32) (if (i32.lt_s (local.get $n) (i32.const 2)) (return (local.get $n)) ) (i32.add (call $fibonacci (i32.sub (local.get $n) (i32.const 1))) (call $fibonacci (i32.sub (local.get $n) (i32.const 2))) ) ) (export "fibonacci" (func $fibonacci)) ) ' ``` Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future. Closes #9108 * github.com:scylladb/scylla: docs: add a WebAssembly entry cql-pytest: add wasm-based tests for user-defined functions main: add wasm engine instantiation treewide: add initial WebAssembly support to UDF wasm: add initial WebAssembly runtime implementation db: add wasm_engine pointer to database lang: add wasm_engine service import wasmtime.hh lua: move to lang/ directory cql3: generalize user-defined functions for more languages	2021-09-14 11:34:20 +03:00
Avi Kivity	e9ae9279e8	system_keyspace: reindent after conversion to class Conversion to class left indentation in ruins, but that can be easily fixed. 'git diff -w' reports no changes. Closes #9339	2021-09-14 08:49:24 +03:00
Avi Kivity	64537beb38	Update tools/java submodule (nodetool stop reshape) * tools/java 3b378f7095...9c5c0ad1fd (1): > nodetool stop: Support Reshape	2021-09-13 21:17:01 +03:00
Piotr Sarna	6c4a71cdea	docs: add a WebAssembly entry The doc briefly describes the state of WASM support for user-defined functions.	2021-09-13 19:03:58 +02:00
Piotr Sarna	41b94d3cf3	cql-pytest: add wasm-based tests for user-defined functions A first set of wasm-based test cases is added. The tests include verifying that supported types work and that validation of the input wasm is performed.	2021-09-13 19:03:58 +02:00
Piotr Sarna	4959136afd	main: add wasm engine instantiation Once the engine is up, it can be used to execute user-defined functions.	2021-09-13 19:03:58 +02:00
Piotr Sarna	62e8c89a9c	treewide: add initial WebAssembly support to UDF This commit adds a very basic support for user-defined functions coded in wasm. The support is very limited (only a few types work) and was not tested against reactor stalls and performance in general.	2021-09-13 19:03:58 +02:00
Piotr Sarna	78afd518a8	wasm: add initial WebAssembly runtime implementation The engine is based on wasmtime and is able to: - compile wasm text format to bytecode - run a given compiled function with custom arguments This implementation is missing crucial features, like running on any other types than 32-bit integers. It serves as a skeleton for future full implementation.	2021-09-13 19:03:58 +02:00
Avi Kivity	e9343fd382	Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek Add new struct to the `expression` variant: ```c++ // A value serialized with the internal (latest) cql_serialization_format struct constant { cql3::raw_value value; data_type type; // Never nullptr, for NULL and UNSET might be empty_type }; ``` and use it where possible instead of `terminal`. This struct will eventually replace all classes deriving from `terminal`, but for now `terminal` can't be removed completely. We can't get rid of terminal yet, because sometimes `terminal` is converted back to `term`, which `constant` can't do. This won't be a problem once we replace term with expression. `bool` is removed from `expression`, now `constant` is used instead. This is a redesign of PR #9203, there is some discussion about the chosen representation there. Closes #9244 * github.com:scylladb/scylla: cql3: term: Remove get_elements and multi_item_terminal from terminals cql3: Replace most uses of terminal with expr::constant cql3: expr: Remove repetition from expr::get_elements cql3: expr: Add expr::get_elements(constant) cql3: term: remove term::bind_and_get cql3: Replace all uses of bind_and_get with evaluate_to_raw_view cql3: expr: Add evaluate_IN_list cql3: tuples: Implement tuples::in_value::get cql3: Move data_type to terminal, make get_value_type non-virtual cql3: user_types: Implement get_value_type in user_types.hh cql3: tuples: Implement get_value_type in tuples.hh cql3: maps: Implement get_value_type in maps.hh cql3: sets: Implement get_value_type in sets.hh cql3: lists: Implement get_value_type in lists.hh cql3: constants: Implement get_value_type in constants.hh cql3: expr: Add expr::evaluate cql3: values: Add unset value to raw_value_view::make_temporary cql3: expr: Add constant to expression	2021-09-13 19:26:09 +03:00
Nadav Har'El	27138b215b	Merge 'system_keyspace: convert from namespace to class' from Avi Kivity All the namespace scope functions in system_keyspace have no place to store context, so they must store their context in global variables. This prevents conversion of those global variables to constructor-provided dependencies. Take the first step towards providing a place to store the context by converting system_keyspace to a class. All the functions are static, so no context is yet available, but we can de-static-ify them incrementally in the future and store the context in class members. Closes #9335 * github.com:scylladb/scylla: system_keyspace: convert from namespace to class system_keyspace: prepare forward-declared members system_keyspace: rearrange legacy subnamespace system_keyspace: remove outdated java code	2021-09-13 19:01:42 +03:00
Avi Kivity	1b75e9312d	Update tools/java and tools/jmx submodules (load-and-stream support) * tools/java a2fe67fd42...3b378f7095 (1): > nodetool: add `--load-and-stream` option to `refresh` * tools/jmx 70b19e6...658818b (1): > Support `--load-and-stream` option from `nodetool refresh`	2021-09-13 18:48:11 +03:00
Tomasz Grabiec	890b861d20	Merge 'query::reverse_slice(): toggle reversed bit instead of setting it' from Botond Dénes The above mentioned method is supposed to work both ways: reversed <-> forward, so setting the reversed bit is not correct: it should be toggled, which is what this mini-series does. Closes #9327 * github.com:scylladb/scylla: reverse_slice(): toggle reversed bit instead of setting it partition_slice_builder(): add with_option_toggled() enum_set: add toggle()	2021-09-13 18:48:11 +03:00
Takuya ASADA	f93793da7e	configure.py: remove $builddir/release/{scylla_product}-python3-{arch}-package.tar.gz from dist-python3 target '$builddir/release/{scylla_product}-python3-package.tar.gz' on dist-python3 target is for compat-python3; we forgot to remove it in `35a14ab`. Fixes #9333 Closes #9334	2021-09-13 18:48:10 +03:00
Jan Ciolek	fd98d40b75	cql3: term: Remove get_elements and multi_item_terminal from terminals terminal now isn't used as a final value anywhere. Remove things that are no longer needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:47:17 +02:00
Jan Ciolek	a0ec2113ae	cql3: Replace most uses of terminal with expr::constant constant is now ready to replace terminal as a final value representation. Replace bind() with evaluate and shared_ptr<terminal> with constant. We can't get rid of terminal yet. Sometimes terminal is converted back to term, which constant can't do. This won't be a problem once we replace term with expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:47:17 +02:00
Jan Ciolek	b67f72037f	cql3: expr: Remove repetition from expr::get_elements There was some repeating code in expr::get_elements family of functions. It has been reduced into one function. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:47:17 +02:00
Jan Ciolek	8b475a966c	cql3: expr: Add expr::get_elements(constant) We need to be able to access elements of a constant. Adds functions to easily do it. Those functions check all preconditions required to access elements and then use partially_deserialize_* or similar. It's much more convenient than using partially_deserialize directly. get_list_of_tuples_elements is useful with IN restrictions like (a, b) IN [(1, 2), (3, 4)]. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:47:17 +02:00
Jan Ciolek	134b76f5d9	cql3: term: remove term::bind_and_get term::bind_and_get is not needed anymore, remove it. Some classes use bind_and_get internally, those functions are left intact and renamed to bind_and_get_internal. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:47:17 +02:00
Avi Kivity	ed396a31f3	Merge "Remove global storage proxy from cdc" from Pavel E " There's a single call to get_local_storage_proxy in cdc code, which is used to get the database. Fortunately, the database can be easily provided there via a call argument. tests: unit(dev) " * 'br-remove-proxy-from-cdc' of https://github.com/xemul/scylla: cdc: Add database argument to is_log_for_some_table client_state: Pass database into has_access() client_state: Add database argument to has_schema_access client_state: Add database argument to has_keyspace_access() cdc: Add database argument to check_for_attempt_to_create_nested_cdc_log	2021-09-13 18:45:46 +03:00
Avi Kivity	f3712d4767	Merge "Avoid nested seastar::async in tests" from Pavel E " There's a bunch of explicit and implicit async contexts nesting in sstables tests. This set turns them into a single level of async nesting (mostly with an awk script). The indentation in the first two patches is deliberately left as it was before patching, i.e. slightly broken. As a consolation, after the third patch it suddenly becomes fixed as the unneeded intermediate call with broken indent is removed. tests: unit(dev) " * 'br-sst-tests-no-nested-async' of https://github.com/xemul/scylla: test: Don't nest seastar::async calls (2nd cont) test: Don't nest seastar::async calls (cont) test: Don't nest seastar::async calls	2021-09-13 18:45:46 +03:00
Takuya ASADA	f928dced0c	scylla_cpuscaling_setup: add --force option To build an Ubuntu AMI with CPU scaling configuration, we need a force-running mode for scylla_cpuscaling_setup, which runs setup without checking scaling_governor support. See scylladb/scylla-machine-image#204 Closes #9326	2021-09-13 18:45:46 +03:00
Jan Ciolek	c3fb2f2b57	cql3: Replace all uses of bind_and_get with evaluate_to_raw_view Start using evaluate_to_raw_view instead of bind_and_get. This is a step towards using only evaluate. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:44:06 +02:00
Botond Dénes	6b5936812f	reverse_slice(): toggle reversed bit instead of setting it reverse_slice() works in both directions: reverse <-> forward, so it cannot unconditionally set the reversed bit, instead it should toggle it.	2021-09-13 18:05:11 +03:00
Botond Dénes	e16c388437	partition_slice_builder(): add with_option_toggled()	2021-09-13 18:05:11 +03:00
Botond Dénes	96c95119f9	enum_set: add toggle()	2021-09-13 18:05:11 +03:00
Jan Ciolek	25caa1950d	cql3: expr: Add evaluate_IN_list A list representing IN values might contain NULLs before evaluation. We can remove them during evaluation, because nothing equals NULL. If we don't remove them, there are gonna be errors, because a list can't contain NULLs. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	5a90fd097a	cql3: tuples: Implement tuples::in_value::get To convert a terminal to expr::constant we need to be able to serialize it. tuples::in_value didn't have serialization implemented, do it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	9b6b2899ed	cql3: Move data_type to terminal, make get_value_type non-virtual Every class now has implementation of get_value_type(). We can simply make base class keep the data_type. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	9b3478e1cd	cql3: user_types: Implement get_value_type in user_types.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in user_types.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	68b65771a7	cql3: tuples: Implement get_value_type in tuples.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in tuples.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	319b6608b0	cql3: maps: Implement get_value_type in maps.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in maps.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	5a755cda2b	cql3: sets: Implement get_value_type in sets.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in sets.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	0b3436598a	cql3: lists: Implement get_value_type in lists.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in lists.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	60a34236ee	cql3: constants: Implement get_value_type in constants.hh To convert a terminal to expr::constant we need to know the value type. Implement getting value type for terminals in constants.hh. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	844bf2d472	cql3: expr: Add expr::evaluate Adds the functions: constant evaluate(term, const query_options&); raw_value_view evaluate(term, const query_options&); These functions take a term, bind it and convert the terminal to constant or raw_value_view. In the future these functions will take expression instead of term. For that to happen bind() has to be implemented on expression, this will be done later. Also introduces terminal::get_value_type(). In order to construct a constant from terminal we need to know the type. It will be implemented in the following commits. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	da099dd922	cql3: values: Add unset value to raw_value_view::make_temporary When unset_value is passed to make_temporary it gets converted to null_value. This looks like a mistake, it should be just unset_value. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:23 +02:00
Jan Ciolek	79cb268ada	cql3: expr: Add constant to expression Adds constant to the expression variant: struct constant { raw_value value; data_type type; }; This struct will be used to represent constant values with known bytes and type. This corresponds to the terminal from current design. bool is removed from expression, now constant is used instead. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-09-13 17:03:21 +02:00
Avi Kivity	e70b9d4835	system_keyspace: convert from namespace to class All the namespace scope functions in system_keyspace have no place to store context, so they must store their context in global variables. This prevents conversion of those global variables to constructor-provided dependencies. Take the first step towards providing a place to store the context by converting system_keyspace to a class. All the functions are static, so no context is yet available, but we can de-static-ify them incrementally in the future and store the context in class members. Indentation is a mess, but can be easily fixed later.	2021-09-13 15:14:14 +03:00
Avi Kivity	115d6d8d4c	system_keyspace: prepare forward-declared members In anticipation of making system_keyspace a class instead of a namespace, rename any member that is currently forward-declared, since one can't forward-declare a class member. Each member is taken out of the system_keyspace namespace and gains a system_keyspace prefix. Aliases are added to reduce code churn. The result isn't lovely, but can be adjusted later.	2021-09-13 15:11:26 +03:00
Avi Kivity	c6ce81d6a0	system_keyspace: rearrange legacy subnamespace Merge two fragments together, in anticipation of making 'legacy' a struct instead of a namespace (when system_keyspace is a class, we can't nest a namespace inside it).	2021-09-13 15:10:15 +03:00
Avi Kivity	6d379ae6f9	system_keyspace: remove outdated java code This code has been rewritten and not removed, or is not needed. Remove it to reduce clutter.	2021-09-13 15:08:57 +03:00
Michał Chojnowski	7df9deb628	service: storage_proxy: don't compute probabilistic read repair decisions when probability is 0 On ARM, the code in libstdc++ responsible for computing random floating-point numbers is very slow, because it uses `long double` arithmetic, which is emulated in software on this architecture. The performance effect on read queries is noticeable – about 6% of the total work of a read from cache. Since probabilistic read repair is almost always disabled (and under consideration for removal) let's just optimize the case when it's disabled. Fixes #9107 Closes #9329	2021-09-13 12:31:14 +03:00
Piotr Sarna	83f46e6e6f	db: add wasm_engine pointer to database WASM engine needs to be used from two separate contexts: - when a user-defined function is created via CQL - when a user-defined function is received during schema migration The instance that these two have in common is the database object, so that's where the reference is stored.	2021-09-13 11:01:33 +02:00
Piotr Sarna	5e6fa47198	lang: add wasm_engine service WASM engine stores the wasm runtime engine for user-defined functions.	2021-09-13 11:01:33 +02:00
Piotr Sarna	4caf57f730	import wasmtime.hh Courtesy of https://github.com/bytecodealliance/wasmtime-cpp . Taken as is, with a small licensing blurb added on top.	2021-09-13 11:01:33 +02:00
Piotr Sarna	4e952df470	lua: move to lang/ directory Support for more languages is coming, so let's group them in a separate directory.	2021-09-13 11:01:33 +02:00
Piotr Sarna	46c6603fe0	cql3: generalize user-defined functions for more languages In order to support more languages than just Lua in the future, Lua-specific configuration is now extracted to a separate structure.	2021-09-13 11:01:33 +02:00
Avi Kivity	61c9df4bd2	Merge "Split sstable_conforms_to_mutation_source" from Pavel E " The test contains a single case that runs 6 different cases inside and is one of the longest tests out there. Splitting it improves parallel-cases suite run time. tests: unit(dev, debug, release) " * 'br-split-sst-conforms-to-ms' of https://github.com/xemul/scylla: tests: Fix indentation after previous patch tests: Split sstable_conforms_to_mutation_source	2021-09-13 11:27:44 +03:00
Avi Kivity	1fd701e709	test: cql-pytest: skip tests depending on timeuuid monotonicity timeuuid is not monotonic when now() is called on different connections, so when running tests that depend on that property, we get failures if using the Scylla driver (which became standard in `729d0fe`). Skip the tests for now, until we figure out what to do. We probably can't make now() globally monotonic, and there isn't much to gain by making it monotonic only per connection, since clients are allowed to switch connections (and even nodes) at will. Ref #9300 Closes #9323 [avi: committing my own patch to unblock master]	2021-09-12 19:30:40 +03:00
Nadav Har'El	1d4474d543	test/alternator/run: don't run Scylla if "--aws" option The test/alternator/run script runs Scylla and then runs pytest against it. But when passing the "--aws" option, the intention is that these tests be run against AWS DynamoDB, not a local Scylla, so there is no point in starting Scylla at all - so this is what we do in this patch. This doesn't really add a new feature - "test/alternator/run --aws" will now be nothing more than "cd test/alternator; pytest --aws". But it adds the convenience that you can run the same tests on Scylla and AWS with exactly the same "run" command, just adding the "--aws" option, and don't need to sometimes use "run" and sometimes "pytest". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210912133239.75463-1-nyh@scylladb.com>	2021-09-12 16:50:38 +03:00
Avi Kivity	c5f52f9d97	schema_tables: don't flush in tests Flushing schema tables is important for crash recovery (without a flush, we might have sstables using a new schema before the commitlog entry noting the schema change has been replayed), but not important for tests that do not test crash recovery. Avoiding those flushes reduces system, user, and real time on tests running on a consumer-level SSD. before: real 8m51.347s user 7m5.743s sys 5m11.185s after: real 7m4.249s user 5m14.085s sys 2m11.197s Note real time is higher than user+sys time divided by the number of hardware threads, indicating that there is still idle time due to the disk flushing, so more work is needed. Closes #9319	2021-09-12 11:32:13 +03:00
Raphael S. Carvalho	acba3bd3c4	sstables: give a more descriptive name to compaction_options The name compaction_options is confusing as it overlaps in meaning with compaction_descriptor. It is hard to reason about the exact difference between them without digging into the implementation. compaction_options is intended to only carry options specific to a given compaction type, like a mode for scrub, so let's rename it to compaction_type_options to make it clearer for readers. [avi: adjust for scrub changes] Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>	2021-09-12 11:21:33 +03:00
Benny Halevy	389ef9316f	compaction: scrub/validate: prevent printing non-utf8 partition keys Corrupt keys might be printed as non-utf8 strings to the log, and that, in turn, may break applications reading the logs, such as Python (3.7) For example: ``` Traceback (most recent call last): File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown self.cleanUpCluster() File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster matches = node.grep_log(expr) File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log for line in f: File "/usr/lib64/python3.7/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte ``` Test: unit(dev) DTest: scrub_with_one_node_expect_data_loss_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>	2021-09-12 10:52:18 +03:00
Tomasz Grabiec	83113d8661	Merge "raft: new schema for storing raft snapshots" from Pavel Solodovnikov Previously, the layout for storing raft snapshot descriptors contained a `config` field, which had `blob` data type. That means `raft::configuration` for the snapshot was serialized as a whole in binary form. It's convenient to implement and is the most compact form of representing the data, but: 1. Hard to debug due to the need to de-serialize the data. 2. Plants a time bomb wrt. changing data layout and also the documentation in the future. Remove the `config` field from `system.raft_snapshots` and extract it to a separate `system.raft_config` table to store the data in exploded form. Also, modify the schema of `system.raft_snapshots` table in the following way: add a `server_id` field as a part of composite partition key ((group_id, server_id)) to be able to start multiple raft servers belonging to one raft group on the same scylla node. Rename `id` field in `raft_snapshots` to `snapshot_id` so it's self-documenting. Remove `snapshot_id` from clustering key since a given server can have only one snapshot installed at a time. Note that the `raft::server_address` structure contains an opaque `info` member, which is `bytes`, but in the `raft_config` table we use `ip_addr inet` field, instead. We always know that the corresponding member field is going to contain an IP address (either v4 or v6) of a given raft server. 
So, now the snapshots schema looks like this: CREATE TABLE raft_snapshots ( group_id timeuuid, server_id uuid, snapshot_id uuid, idx int, term int, -- no `config` field here, moved to `raft_config` table PRIMARY KEY ((group_id, server_id)) ) CREATE TABLE raft_config ( group_id timeuuid, my_server_id uuid, server_id uuid, disposition text, -- can be either 'CURRENT` or `PREVIOUS' can_vote bool, ip_addr inet, PRIMARY KEY ((group_id, my_server_id), server_id, disposition) ); This way it's much easier to extend the schema with new fields, very easy to debug and inspect via CQL, and it's much more descriptive in terms of self-documentation. Tests: unit(dev) * manmanson/raft_snapshots_new_schema_v2: test: adjust `schema_change_test` to include new `system.raft_config` table raft: new schema for storing raft snapshots raft: pass server id to `raft_sys_table_storage` instance	2021-09-10 20:41:59 +02:00
Avi Kivity	16116ac631	interval: constrain comparator parameters The interval template member functions mostly accept tri-comparators but a few functions accept less-comparators. To reduce the chance of error, and to provide better error messages, constrain comparator parameters to the expected signature. In one case (db/size_estimates_virtual_reader.cc) the caller had to be adjusted. The comparator supported comparisons of the interval value type against other types, but not against itself. To simplify things, we add that signature too, even though it will never be called. Closes #9291	2021-09-10 16:43:16 +02:00
Avi Kivity	7a798b44a2	cql3: expr: replace column_value_tuple by a composition of tuple_constructor and column_value column_value_tuple overlaps both column_value and tuple_constructor (in different respects) and can be replaced by a combination: a tuple_constructor of column_value. The replacement is more expressive (we can have a tuple of column_value and other expression types), though the code (especially the grammar) does not allow it yet. So remove column_value_tuple and replace it everywhere with tuple_constructor. Visitors get the merged behavior of the existing tuple_constructor and column_value_tuple, which is usually trivial since tuple_constructor and column_value_tuple came from different hierarchies (term::raw and relation), so usually one of the types just calls on_internal_error(). The change results in awkward casts in two areas: WHERE clause filtering (equal() and related), and clustering key range evaluations (limits() and related). When equal() is replaced by recursive evaluate(), the casts will go away (to be replaced by the evaluate() visitor). Clustering key range extraction will remain limited to tuples of column_value, so the prepare phase will have to vet the expressions to ensure the casts don't fail (and use the filtering path if they will). Tests: unit (dev) Closes #9274	2021-09-10 10:43:29 +02:00
Piotr Sarna	234c2b9f6d	Merge 'Scrub compaction serialization' from Benny Halevy Currently scrub compaction filters-out sstables that are undergoing (regular) compaction. This is surprising to the user and we would like scrub (in validate mode or otherwise) to examine all sstables in the table. Scrub in VALIDATE mode is read-only, therefore it can run in parallel to regular compaction. However, this series makes sure it selects all sstables in the table, without filtering sstables undergoing compaction. For scrub in non-validation mode, we would like to ensure that it examined all sstables that were sealed when it started and it fixed any corruption (based on the scrub mode). Therefore, we stop ongoing compactions when running scrub in non-validation modes. Otherwise compaction might just copy the corrupt data onto new sstables, requiring scrub to run again. Also, acquire _compaction_locks write lock for the table to serialize with other custom compaction jobs like major compaction, reshape, and reshard. Fixes #9256 Test: unit(dev) DTest: nodetool_additional_test.py:TestNodetool.{validate_sstable_with_invalid_fragment_test, validate_ks_sstable_with_invalid_fragment_test,validate_with_one_node_expect_data_loss_test} Closes #9258 * github.com:scylladb/scylla: compaction_manager: rewrite_sstables: acquire _compaction_locks compaction_manager: perform_sstable_scrub: run_with_compaction_disabled compaction: don't rule out compacting sstables in validate-mode scrub	2021-09-09 18:33:43 +02:00
Avi Kivity	219fdcd8da	Merge 'tools: introduce scylla-sstable' from Botond Dénes A tool which can be used to examine the content of sstable(s) and execute various operations on them. The currently supported operations are: * dump - dumps the content of the sstable(s), similar to sstabledump; * dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index; * writetime-histogram - generates a histogram of all the timestamps in the sstable(s); * custom - a hackable operation for the expert user (until scripting support is implemented); * validate - validate the content of the sstable(s) with the mutation fragment stream validator, same as scrub in validate mode; The sstables to-be-examined are passed as positional command line arguments. Sstables will be processed by the selected operation one-by-one (can be changed with `--merge`). Any number of sstables can be passed but mind the open file limits. Pass the full path to the data component of the sstables (-Data.db). For now it is required that the sstable is found at a valid data path: /path/to/datadir/{keyspace_name}/{table_name}-{table_id}/ The schema to read the sstables is read from a `schema.cql` file. This should contain the keyspace and table definitions, as well as any UDTs used. Filtering the sstable(s) to process only certain partition(s) is supported via the `--partition` and `--partitions-file` command line flags. Partition keys are expected to be in the hexdump format used by scylla (hex representation of the raw buffer). Operations write their output to stdout, or file(s). The tool logs to stderr, with a logger called `scylla-sstable-crawler`. 
Examples: # dump the content of the sstable $ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db # dump the content of the two sstable(s) as a unified stream $ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db # generate a joint histogram for the specified partition $ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db # validate the specified sstables $ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db Future plans: * JSON output for dump. * A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output. * Relax sstable path checks, so sstables can be loaded from any path. * Add scripting support (Lua), allowing custom operations to be written in a scripting language. Refs: #9241 Closes #9271 * github.com:scylladb/scylla: tools: remove scylla-sstable-index tools: introduce scylla-sstable tools: extract finding selected operation (handler) into function tools: add schema_loader cql3: query_processor: add parse_statements() cql3: statements/create_type: expose create_type() cql3: statements/create_keyspace: add get_keyspace_metadata()	2021-09-09 19:24:06 +03:00
Avi Kivity	c1028de22a	Merge 'Introduce native reversed format' from Botond Dénes We define the native reverse format as a reversed mutation fragment stream that is identical to one that would be emitted by a table with the same schema but with reversed clustering order. The main difference to the current format is how range tombstones are handled: instead of looking at their start or end bound depending on the order, we always use them as-usual and the reversing reader swaps their bounds to facilitate this. This allows us to treat reversed streams completely transparently: just pass along them a reversed schema and all the reader, compacting and result building code is happily ignorant about the fact that it is a reversed stream. This series is the first step towards implementing efficient reverse reads. It allows us to remove all the special casing we have in various places for reverse reads and thus treating reverse streams transparently in all the middle layers. The only layers that have to know about the actual reversing are mutation sources proper. The plan is that when reading in reverse we create a reversed schema in the top layer then pass this down as the schema for the read. There are two layers that will need to act on this reversed schema: * The layer sitting on top of the first layer which still can't handle reversed streams, this layer will create a reversed reader to handle the transition. * The mutation source proper: which will obtain the underlying schema and will emit the data in reverse order. Once all the mutation sources are able to handle reverse reads, we can get rid of the reverse reader entirely. 
Refs: #1413 Tests: unit(dev) TODO: * v2 * more testing Also on: https://github.com/denesb/scylla.git reverse-reads/v3 Changelog v3: * Drop the entire schema transformation mechanism; * Drop reversing from `schema_builder()`; * Don't keep any information about whether the schema is reversed or not in the schema itself, instead make reversing deterministic w.r.t. schema version, such that: `s.version() == s.make_reversed().make_reversed().version()`; * Re-reverse range tombstones in `streaming_mutation_freezer`, so `reconcilable_results` sent to the coordinator during read repair still use the old reverse format; v2: * Add `data_type reversed(data_type)`; * Add `bound_kind reverse_kind(bound_kind)`; * Make new API safer to use: - `schema::underlying_type()`: return this when unengaged; - `schema::make_transformed()`: noop when applying the same transformation again; * Generalize reversed into transformation. Add support to transferring to remote nodes and shards by way of making `schema_tables` aware of the transformation; * Use reverse schema everywhere in reverse reader; Closes #9184 * github.com:scylladb/scylla: range_tombstone_accumulator: drop _reversed flag test/boost/mutation_test: add test for mutation::consume() monotonicity test/boost/flat_mutation_reader_test: more reversed reader tests flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range) flat_mutation_reader: make_reversing_reader(): take ownership of the reader test/lib/mutation_source_test: add consistent log to all methods mutation: introduce reverse() mutation_rebuilder: make it standalone mutation: make copy constructor compatible with mutation_opt treewide: switch to native reversed format for reverse reads mutation: consume(): add native reverse order mutation: consume(): don't include dummy rows query: add slice reversing functions partition_slice_builder: add range mutating methods partition_slice_builder: add constructor with slice query: specific_ranges: add 
non-const ranges accessor range_tombstone: add reverse() clustering_bounds_comparator: add reverse_kind() schema: introduce make_reversed() schema: add a transforming copy constructor utils: UUID_gen: introduce negate() types: add reversed(data_type) docs: design-notes: add reverse-reads.md	2021-09-09 15:50:22 +03:00
Botond Dénes	f02632aeb0	range_tombstone_accumulator: drop _reversed flag	2021-09-09 15:42:15 +03:00
Botond Dénes	f07805c3ef	test/boost/mutation_test: add test for mutation::consume() monotonicity In both forward and reverse modes.	2021-09-09 15:42:15 +03:00
Botond Dénes	3cc882f6a8	test/boost/flat_mutation_reader_test: more reversed reader tests Check that the reverse reader emits a stream identical to that emitted by a reader reading in native order from a table with reversed clustering order.	2021-09-09 15:42:15 +03:00
Botond Dénes	bf38e204af	flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)	2021-09-09 15:42:15 +03:00
Botond Dénes	350440b418	flat_mutation_reader: make_reversing_reader(): take ownership of the reader Makes for much simpler client code.	2021-09-09 15:42:15 +03:00
Botond Dénes	c71a281e6b	test/lib/mutation_source_test: add consistent log to all methods Most test methods log their own name either via testlog.info() or BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do however. This commit fixes this and also converts all those using BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.	2021-09-09 15:42:15 +03:00
Botond Dénes	1d6896c14f	mutation: introduce reverse() Which reverses the mutation as if it was created with a schema with reversed clustering order.	2021-09-09 15:42:15 +03:00
Botond Dénes	74a22a706b	mutation_rebuilder: make it standalone Not requiring a wrapper object to become usable.	2021-09-09 15:42:15 +03:00
Botond Dénes	16b9d19e50	mutation: make copy constructor compatible with mutation_opt Currently `_data` is assumed to be engaged by the copy constructor which is not necessarily the case with `mutation_opt` objects (which is an `optimized_optional<mutation>`). Fix this by only copying `_data` if non-null.	2021-09-09 15:42:15 +03:00
Botond Dénes	502a45ad58	treewide: switch to native reversed format for reverse reads We define the native reverse format as a reversed mutation fragment stream that is identical to one that would be emitted by a table with the same schema but with reversed clustering order. The main difference to the current format is how range tombstones are handled: instead of looking at their start or end bound depending on the order, we always use them as-usual and the reversing reader swaps their bounds to facilitate this. This allows us to treat reversed streams completely transparently: just pass along them a reversed schema and all the reader, compacting and result building code is happily ignorant about the fact that it is a reversed stream.	2021-09-09 15:42:15 +03:00
Botond Dénes	0af5a8add0	mutation: consume(): add native reverse order The existing consume_in_reverse::yes is renamed to consume_in_reverse::legacy_half_reverse and consume_in_reverse::yes now means native reverse order. This is because we expect the legacy order to die out at one point and when that happens we can just remove that ugly third option and will be left with yes and no as before.	2021-09-09 14:18:32 +03:00
Botond Dénes	38ef80d4d2	mutation: consume(): don't include dummy rows	2021-09-09 14:18:32 +03:00
Botond Dénes	5d33d76cfd	query: add slice reversing functions	2021-09-09 14:18:32 +03:00
Botond Dénes	4fc39721a2	partition_slice_builder: add range mutating methods	2021-09-09 14:16:21 +03:00
Botond Dénes	a2eb0f7d7e	partition_slice_builder: add constructor with slice Intended to be used to modify an existing slice. We want to move the slice into the direction where the schema is at: make it completely immutable, all mutations happening through the slice builder class.	2021-09-09 14:15:42 +03:00
Benny Halevy	40a6049ac2	compaction_manager: rewrite_sstables: acquire _compaction_locks Take write lock for cf to serialize cleanup/upgrade sstables/scrub with major compaction/reshape/reshard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-09 14:13:45 +03:00
Benny Halevy	44348b3080	compaction_manager: perform_sstable_scrub: run_with_compaction_disabled since we might potentially have ongoing compactions, and we must ensure that all sstables created before we run are scrubbed, we need to barrier out any previously running compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-09 14:13:40 +03:00
Raphael S. Carvalho	a145ffcf52	compaction: don't rule out compacting sstables in validate-mode scrub even sstables being compacted must be validated. otherwise scrub validate may return false negative. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-09 14:06:50 +03:00
Raphael S. Carvalho	a23057edce	Update CODEOWNERS for compaction subsystem Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210904003731.67134-1-raphaelsc@scylladb.com>	2021-09-09 12:50:06 +03:00
Botond Dénes	34abbe82fe	query: specific_ranges: add non-const ranges accessor	2021-09-09 12:09:08 +03:00
Botond Dénes	30f6f676b8	range_tombstone: add reverse() Reversing the range-tombstone, as-if it was emitted from a table with reversed clustering order.	2021-09-09 11:49:05 +03:00
Botond Dénes	d0351eaaed	clustering_bounds_comparator: add reverse_kind() Hiding the tricky reversing of a bound_kind.	2021-09-09 11:49:05 +03:00
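The two commits above (`range_tombstone: add reverse()` and `clustering_bounds_comparator: add reverse_kind()`) can be illustrated with a minimal Python sketch. The `BoundKind` enum, the `RangeTombstone` dataclass and the exact kind mapping are assumptions made for illustration, not the actual Scylla C++ definitions. The idea shown is that reversing swaps the two bounds and turns each start kind into the matching end kind (and vice versa), so the result looks as if it came from a table with reversed clustering order.

```python
from dataclasses import dataclass
from enum import Enum


class BoundKind(Enum):
    # Hypothetical stand-in for the C++ bound_kind enum.
    EXCL_START = "excl_start"
    INCL_START = "incl_start"
    INCL_END = "incl_end"
    EXCL_END = "excl_end"


# Assumed mapping: a start bound becomes an end bound of the same
# inclusiveness, and the other way around.
_REVERSED_KIND = {
    BoundKind.EXCL_START: BoundKind.EXCL_END,
    BoundKind.INCL_START: BoundKind.INCL_END,
    BoundKind.INCL_END: BoundKind.INCL_START,
    BoundKind.EXCL_END: BoundKind.EXCL_START,
}


def reverse_kind(kind: BoundKind) -> BoundKind:
    """Return the bound kind as seen under reversed clustering order."""
    return _REVERSED_KIND[kind]


@dataclass(frozen=True)
class RangeTombstone:
    # Illustrative model only: clustering prefixes are plain tuples here.
    start: tuple
    start_kind: BoundKind
    end: tuple
    end_kind: BoundKind
    timestamp: int

    def reverse(self) -> "RangeTombstone":
        """Swap the bounds and reverse their kinds; the timestamp is kept."""
        return RangeTombstone(
            start=self.end,
            start_kind=reverse_kind(self.end_kind),
            end=self.start,
            end_kind=reverse_kind(self.start_kind),
            timestamp=self.timestamp,
        )


rt = RangeTombstone((1,), BoundKind.INCL_START, (5,), BoundKind.EXCL_END, 10)
reversed_rt = rt.reverse()
```

In this sketch, reversing `[1, 5)` yields a tombstone that starts exclusively at `5` and ends inclusively at `1`, and reversing it again gives back the original value.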
Botond Dénes	f200c8104a	schema: introduce make_reversed() `make_reversed()` creates a schema identical to the schema instance it is called on, with clustering order reversed. To distinguish the reverse schema from the original one, the node-id part of its version UUID is bit-flipped. This ensures that reversing a schema twice will result in a schema identical to the original one (although a different C++ object). This reversed schema will be used in reversed reads, so intermediate layers can be ignorant of the fact that the read happens in reverse.	2021-09-09 11:49:05 +03:00
Botond Dénes	9a9b58e67b	schema: add a transforming copy constructor Taking a transform functor, which is executed after the raw schema is copied, but before the derived fields are computed (rebuild()).	2021-09-09 11:49:05 +03:00
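The version trick described in the commit above can be sketched in Python. This is an illustration under assumptions, not the actual `UUID_gen::negate()` code: it assumes the node-id is the low 48 bits of the UUID (the RFC 4122 layout), and the `negate` helper and `NODE_MASK` constant are hypothetical names. Flipping those bits is an involution, so applying it twice restores the original version.

```python
import uuid

# Assumption: the node-id occupies the low 48 bits, as in RFC 4122.
NODE_MASK = (1 << 48) - 1


def negate(version: uuid.UUID) -> uuid.UUID:
    """Return a UUID whose node-id bits are flipped; all other bits are kept."""
    return uuid.UUID(int=version.int ^ NODE_MASK)


original = uuid.UUID("12345678-1234-1234-1234-123456789abc")
reversed_version = negate(original)

# The reversed schema gets a different version ...
assert reversed_version != original
# ... and reversing twice yields the original version again.
assert negate(reversed_version) == original
```

Because XOR with a fixed mask is its own inverse, no extra "is reversed" flag has to be stored to get `s.version() == s.make_reversed().make_reversed().version()`.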
Botond Dénes	65913f4cfa	utils: UUID_gen: introduce negate()	2021-09-09 11:49:05 +03:00
Botond Dénes	183ac6981a	types: add reversed(data_type) Reversing the sort order of a type.	2021-09-09 11:49:05 +03:00
Botond Dénes	0cc00b5d17	docs: design-notes: add reverse-reads.md Explaining how reverse reads work, in particular the difference between the legacy and native formats.	2021-09-09 11:49:02 +03:00
Tzach Livyatan	eba2ea9907	scylla.yaml: remove comment for num_tokens The comment is less relevant for Scylla, and points to a non-relevant Apache Cassandra doc page. Closes #9284	2021-09-09 11:45:40 +03:00
Nadav Har'El	e4bafe7dc7	Merge 'Split view builder shutdown procedure to drain + stop' from Piotr Sarna In order to be able to avoid a deadlock when CQL server cannot be started, the view builder shutdown procedure is now split into two parts: drain and stop. Drain is performed before storage proxy shutdown, but stop() will be called even before drain is scheduled. The deadlock is as follows: - view builder creates a reader permit in order to be able to read from system tables - CQL server fails to start, shutdown procedure begins - view builder stop() is not called (because it was not scheduled yet), so it holds onto its reader permit - database shutdown procedure waits for all permits to be destroyed, and it hangs indefinitely because view builder keeps holding its permit. Fixes #9306 Closes #9308 * github.com:scylladb/scylla: main: schedule view builder stopping earlier db,view: split stopping view builder to drain+stop	2021-09-09 11:38:15 +03:00
Dejan Mircevski	6afdc6004c	cql3/modification_statement: Replace empty-range check with null check The empty-range check causes more bugs than it fixes. Replace it with an explicit check for =NULL (see #7852). Fixes #9311. Fixes #9290. Tests: unit (dev), cql-pytest on Cassandra 4.0 Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9314	2021-09-09 10:56:13 +03:00
Avi Kivity	595f1fe802	storage_proxy: digest_read_resolver: use small_vector for holding digests There are typically just 1-2 digests per query, so we can allocate space for them in digest_read_resolver using small_vector, saving an allocation. Results: (perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10) before: median 215301.75 tps ( 75.1 allocs/op, 12.1 tasks/op, 45238 insns/op) after: median 221121.37 tps ( 74.1 allocs/op, 12.1 tasks/op, 45186 insns/op) While the throughput numbers are not reliable due to frequency throttling, it's clear there are fewer allocations and instructions executed. Closes #9296	2021-09-09 10:24:39 +03:00
Piotr Sarna	e93585e66c	main: schedule view builder stopping earlier In order to avoid a deadlock described in the previous commit, view builder stopping is registered earlier, so that its destructor is called and its reader permit is released before the database starts shutting down. Note that draining the view builder is still scheduled later, because it needs to happen before storage proxy drain to keep the existing deinitialization order. Fixes #9306	2021-09-08 10:53:08 +02:00
Piotr Sarna	5d7c765422	db,view: split stopping view builder to drain+stop In order to be able to avoid a deadlock when CQL server cannot be started, the view builder shutdown procedure is now split into two parts: drain and stop. Drain is performed before storage proxy shutdown, but stop() will be called even before drain is scheduled. The deadlock is as follows: - view builder creates a reader permit in order to be able to read from system tables - CQL server fails to start, shutdown procedure begins - view builder stop() is not called (because it was not scheduled yet), so it holds onto its reader permit - database shutdown procedure waits for all permits to be destroyed, and it hangs indefinitely because view builder keeps holding its permit.	2021-09-08 10:52:40 +02:00
Dejan Mircevski	58a9a24ff0	cql3: Allow indexed query to select static columns We previously forbade selecting a static column when an index is used. But Cassandra allows it, so we should, too -- see #8869. After removing the static-column check, the existing code gets the correct result without any further changes (though it may read multiple rows from the same partition). Fixes #8869. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9307	2021-09-08 08:22:59 +02:00
Tomasz Grabiec	9a77a03ea1	Merge "Remove most uses of gms::get_gossiper(), gms::get_local_gossiper()" from Avi In the quest to have explicit dependencies and the ability to run multiple nodes in one process, remove some uses of get_gossiper() and get_local_gossiper() and replace them with parameters passed from main() or its equivalents. Some uses still remain, mostly in snitch, but this series removes a majority. * https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1 alternator: remove uses of get_local_gossiper() storage_service: remove stray get_gossiper(), get_local_gossiper() calls migration_manager: remove use of get_gossiper() from passive_announce() storage_proxy: start_hints_manager(): don't require caller to provide gossiper migration_manager: remove uses of get_local_gossiper() storage_proxy: remove uses of get_local_gossiper() gossiper: remove get_local_gossiper() from some inline helpers gossiper: remove get_gossiper() from stop_gossiping() gossiper: remove uses of get_local_gossiper for its rpc server api: remove use of get_local_gossiper() gossiper: remove calls to global get_gossiper from within the gossiper itself	2021-09-07 20:02:30 +02:00
Avi Kivity	4aaddd8609	alternator: remove uses of get_local_gossiper() Replace with a gossiper parameter passed from the controller.	2021-09-07 20:08:15 +03:00
Avi Kivity	1ece156de6	storage_service: remove stray get_gossiper(), get_local_gossiper() calls storage_service already has a reference to gossiper, so just use it.	2021-09-07 20:08:15 +03:00
Avi Kivity	1a8f4937ca	migration_manager: remove use of get_gossiper() from passive_announce() migration_manager already has a reference to _gossiper, but passive_announce is static and so can't use it. Luckily the only caller (in storage_service) uses it as if it weren't static, so we can just unstaticify it.	2021-09-07 20:08:15 +03:00
Avi Kivity	37818170d8	storage_proxy: start_hints_manager(): don't require caller to provide gossiper storage_proxy now maintains a reference to gossiper, so it can simplify its callers.	2021-09-07 20:08:15 +03:00
Avi Kivity	d8f7903f60	migration_manager: remove uses of get_local_gossiper() Pass gossiper as a constructor parameter instead. cql_test_env gains a use of get_gossiper() instead, but at least these uses are concentrated in one place.	2021-09-07 20:08:11 +03:00
Avi Kivity	71081be99c	storage_proxy: remove uses of get_local_gossiper() Pass the gossiper as a constructor parameter instead.	2021-09-07 17:14:09 +03:00
Botond Dénes	6e78e6c97f	tools: remove scylla-sstable-index It is replaced by scylla-sstable --dump-index.	2021-09-07 17:10:44 +03:00
Botond Dénes	2c600e34aa	tools: introduce scylla-sstable A tool which can be used to examine the content of sstable(s) and execute various operations on them. The currently supported operations are: * dump - dumps the content of the sstable(s), similar to sstabledump; * index-dump - dumps the content of the sstable index(es), similar to scylla-sstable-index; * writetime-histogram - generates a histogram of all the timestamps in the sstable(s); * custom - a hackable operation for the expert user (until scripting support is implemented); * validate - validate the content of the sstable(s) with the mutation fragment stream validator, same as scrub in validate mode;	2021-09-07 17:10:44 +03:00
Avi Kivity	aa68927873	gossiper: remove get_local_gossiper() from some inline helpers Some state accessors called get_local_gossiper(); this is removed and replaced with a parameter. Some callers (redis, alternators) now have the gossiper passed as a parameter during initialization so they can use the adjusted API.	2021-09-07 17:03:37 +03:00
Avi Kivity	9ce1af9fcb	gossiper: remove get_gossiper() from stop_gossiping() Have the callers pass it instead, and they all have a reference already except for cql_test_env (which will be fixed later). The checks for initialization it does are likely unnecessary, but we'll only be able to prove it when get_gossiper() is completely removed.	2021-09-07 16:20:04 +03:00
Avi Kivity	fcd5376585	gossiper: remove uses of get_local_gossiper for its rpc server Initialization happens in the gossiper itself, so we can capture 'this'. If we need to move to shard 0, use sharded::invoke_on() to get the local instance.	2021-09-07 16:06:11 +03:00
Avi Kivity	9fb9299d95	api: remove use of get_local_gossiper() Pass down gossiper from main, converting it to a shard-local instance in calls to register_api() (which is the point that broadcasts the endpoint registration across shards). This helps remove gossiper as a global variable.	2021-09-07 15:53:39 +03:00
Botond Dénes	e86073c703	tools: extract finding selected operation (handler) into function We want to use the pattern of having a command line flag for each operation in more tools, so extract the logic which finds the selected operation from the command line arguments into a function.	2021-09-07 15:47:22 +03:00
Botond Dénes	23a56beccc	tools: add schema_loader A utility which can load a schema from a schema.cql file. The file has to contain all the "dependencies" of the table: keyspace, UDTs, etc. This will be used by the scylla-sstable-crawler in the next patch.	2021-09-07 15:47:22 +03:00
Avi Kivity	61f02ece39	gossiper: remove calls to global get_gossiper from within the gossiper itself gossiper is a peering_sharded_service, so it has access to sharded<gossiper>. Remove the global call.	2021-09-07 15:15:09 +03:00
Botond Dénes	64dce2a59e	cql3: query_processor: add parse_statements()	2021-09-07 11:13:30 +03:00
Felipe Mendes	1b8dff63c3	iotune - Fix i3en.xlarge check i3en.xlarge is currently not getting tuned properly. A quick test using Scylla AMI ( ami-07a31481e4394d346 ) reveals that the storage capabilities under this instance are greatly reduced: $ grep iops /etc/scylla.d/io_properties.yaml read_iops: 257024 write_iops: 174080 This patch corrects this typo, in such a way that iotune now properly tunes this instance type. Closes #9298	2021-09-07 10:44:39 +03:00
Botond Dénes	68f5277e52	cql3: statements/create_type: expose create_type()	2021-09-07 10:37:25 +03:00
Botond Dénes	6b224b76b9	cql3: statements/create_keyspace: add get_keyspace_metadata()	2021-09-07 10:37:25 +03:00
Benny Halevy	e613c0d287	abstract_replication_strategy: remove commented out keyspace* member It is not needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210906133840.3307279-2-bhalevy@scylladb.com>	2021-09-06 16:51:22 +03:00
Benny Halevy	b7eaa22ce6	abstract_replication_strategy: create_replication_strategy: drop keyspace name parameter It is not used. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210906133840.3307279-1-bhalevy@scylladb.com>	2021-09-06 16:51:21 +03:00
Benny Halevy	56e063ce93	keyspace: get rid of set_replication_strategy It's unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210906133905.3307397-1-bhalevy@scylladb.com>	2021-09-06 16:48:35 +03:00
Avi Kivity	69275f02fd	Merge "cmake: fix sources and out of source builds" from Pavel S " This is a set of random patches trying to fix the broken cmake build: * `-fcoroutines` flag is now used only for GCC, but not for Clang * `SCYLLA-VERSION-GEN` invocation is adjusted to work correctly with out-of-source builds * Auxiliary targets are adjusted to support out-of-source builds * Removed extra source files and added missing ones to the scylla target Scylla still doesn't build successfully with CMake build. But now, at least, it passes the configuration step, which is a prerequisite to loading the solution in IDEs. " * 'cmake_improvements' of github.com:ManManson/scylla: cmake: fix out-of-source builds cmake: don't use `-fcoroutines` for clang cmake: update and sort source files and idl:s	2021-09-06 14:17:23 +03:00
Pavel Emelyanov	ad2e63aaf4	test: Don't nest seastar::async calls (2nd cont) The 2nd continuation of the previous patch fixes the places with run_with_async() nested inside explicit seastar::async calls. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-06 08:26:09 +03:00
Pavel Emelyanov	5fdc82bad7	test: Don't nest seastar::async calls (cont) The continuation of the previous patch for the cases when the sstables::test_env::run_with_async sits lower in the stack than the SEASTAR_THREAD_TEST_CASE. The patching is similar but also requires some care about reference-captured variables. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-06 08:26:09 +03:00
Pavel Emelyanov	8c786937d5	test: Don't nest seastar::async calls The SEASTAR_THREAD_TEST_CASE runs the provided lambda in async context. The sstables::test_env::run_with_async does the same. This (script-generated) patch makes all of the found cases be SEASTAR_TEST_CASE and, respectively, return the async future from the run_with_async(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-06 08:26:09 +03:00
Avi Kivity	dfc135dbd1	Merge "Keep range_tombstone apart from list linkage" from Pavel E " There's a landmine buried in range_tombstone's move constructor. Whoever tries to use it risks grabbing the tombstone from the containing list, thus leaking it and possibly invalidating an iterator pointing at it. There's a safe without_link move constructor out there, but still. To keep this place safe it's better to separate range_tombstone from its linkage into anywhere. In particular to keep the range tombstones in a range_tombstone_list here's the entry that keeps the tombstone _and_ the list hook (which is a boost set hook). The approach resembles the rows_entry::deletable_row pair. tests: unit(dev, debug, patch from #9207) fixes: #9243 " * 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla: range_tombstone: Drop without-link constructor range_tombstone: Drop move_assign() range_tombstone: Move linkage into range_tombstone_entry range_tombstone_list: Prepare to use range_tombstone_entry range_tombstone, code: Add range_tombstone& getters range_tombstone_list: Factor out tombstone construction range_tombstone_list: Simplify (maybe) pop_front_and_lock() range_tombstone_list: De-templatize pop_as<> range_tombstone_list: Conceptualize erase_where() range_tombstone(_list): Mark some bits noexcept mutation: Use range_tombstone_list's iterators mutation_partition: Shorten memory usage calculation mutation_partition: Remove unused local variable	2021-09-05 17:26:13 +03:00
Raphael S. Carvalho	6849ec46b8	compaction: Don't purge tombstones in scrub Scrub is supposed to not remove anything from input, write it as is while fixing any corruption it might have. It shouldn't have any assumption on the input. Additionally, data shadowed by a tombstone might be in another corrupted sstable, so expired tombstones should not be purged in order to prevent data resurrection from occurring. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210904165908.135044-1-raphaelsc@scylladb.com>	2021-09-05 17:10:34 +03:00
Dejan Mircevski	1fdaeca7d0	cql3: Reject updates with NULL key values We were silently ignoring INSERTs with NULL values for primary-key columns, which Cassandra rejects. Fix it by rejecting any modification_statement that would operate on empty partition or clustering range. This is the most direct fix, because range and slice are calculated in one place for all modification statements. It covers not only NULL cases, but also impossible restrictions like c>0 AND c<0. Unfortunately, Cassandra doesn't treat all modification statements consistently, so this fix cannot fully match its behavior. We err on the side of tolerance, accepting some DELETE statements that Cassandra rejects. We add a TODO for rejecting such DELETEs later. Fixes #7852. Tests: unit (dev), cql-pytest against Cassandra 4.0 Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9286	2021-09-05 10:23:28 +03:00
Pavel Emelyanov	7a0e56d7c1	range_tombstone: Drop without-link constructor The thing was used to move a range tombstone without detaching it from the containing list (well, intrusive set). Now when the linkage is gone this facility is no longer needed (and actually no longer used). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:50 +03:00
Pavel Emelyanov	f82b5f30f6	range_tombstone: Drop move_assign() The helper was in use by move-assignment operator and by the .swap() method. Since now the operator equals the helper, the code can be merged and the .swap() can be prettified. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:50 +03:00
Pavel Emelyanov	d6af441eaa	range_tombstone: Move linkage into range_tombstone_entry Now it's time to remove the boost set's hook from the range_tombstone and keep it wrapped into another class if the r._tombstone's location is the range_tombstone_list. Also the added previously .tombstone() getters and the _entry alias can be removed -- all the code can work with the new class. Two places in the code that made use of without_link{} move-constructor are patched to get the range_tombstone part from the respective _entry with the same result. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	b8c585c54d	range_tombstone_list: Prepare to use range_tombstone_entry A continuation of the previous patch. The range_tombstone_list works with the range_tombstone very actively, kicking every single line doing this to call .tombstone() seems excessive. Instead, declare the range_tombstone_entry alias. When the entry will appear for real, the alias would go away and the range_tombstone_list will be switched into new entity right at once. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	5515f7187d	range_tombstone, code: Add range_tombstone& getters Currently all the code operates on the range_tombstone class, and many of those places get the range tombstone in question from the range_tombstone_list. Next patches will make that list carry (and return) some new object called range_tombstone_entry, so all the code that expects to see the former one there will need to be patched to get the range_tombstone from the _entry one. This patch prepares the ground for that by introducing the range_tombstone& tombstone() { return *this; } getter on the range_tombstone itself and patching all future users of the _entry to call .tombstone() right now. Next patch will remove those getters together with adding the new range_tombstone_entry object thus automatically converting all the patched places into using the entry in a proper way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	ae8a5bd046	range_tombstone_list: Factor out tombstone construction Just add a helper for constructing the managed range tombstone object. This will also help further patch have less duplicating hunks in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	8f061b9b1c	range_tombstone_list: Simplify (maybe) pop_front_and_lock() The method returns a pointer to the left-most range tombstone and expects the caller to "dispose" it. This is not very nice because the callers thus need to mess with the relevant deleter. A nicer approach is the pop-like one (former pop_as<>-like) which returns the range tombstone by value. This value is move-constructed from the original object which is disposed by the container itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	2e1b21d72b	range_tombstone_list: De-templatize pop_as<> The method pops the range tombstone from the containing list and transparently "converts" it into some other type. Nowadays all callers of it need range tombstone as-is, so the template can be relaxed down to a plain call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	e4965b1662	range_tombstone_list: Conceptualize erase_where() Just while at this code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:34:45 +03:00
Pavel Emelyanov	fcc02c6bed	range_tombstone(_list): Mark some bits noexcept The range_tombstone's .empty() and .operator bool are trivially such. The swap()'s noexceptness comes from what it calls -- the without-link move constructor (noexcept) and .move_assign(). The latter is noexcept because it's already called from noexcept move-assign operator and because it calls noexcept move operators of tombstones' fields. The update_node() is noexcept for the same reason. The range_tombstone_list's clear() is noexcept because both -- set clear and disposer lambda are both such. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 19:31:43 +03:00
Pavel Solodovnikov	9dc4e35e89	cmake: fix out-of-source builds Don't use relative paths, construct absolute paths to sources wherever needed. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-03 17:51:04 +03:00
Pavel Solodovnikov	334c982697	cmake: don't use `-fcoroutines` for clang This gcc flag is not supported. `-fcoroutines-ts` also cannot be used, so just don't supply anything, similar to what `configure.py` does. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-03 17:50:57 +03:00
Pavel Emelyanov	87ce46d1c6	mutation: Use range_tombstone_list's iterators The consume_clustering_fragments declares several auxiliary symbols to work with rows' and range-tombstones' iterators. For the range tombstones it relies on what container is declared inside the range tombstone itself. Soon the container declaration will move from range_tombstone class into a new entity and this place should be prepared for that. The better place to get iterator types from is the range-tombstones container itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 12:56:13 +03:00
Pavel Emelyanov	ac473a9e67	mutation_partition: Shorten memory usage calculation The range_tombstone_list's replacer runs exactly the same loop Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 12:56:13 +03:00
Pavel Emelyanov	f173be29d9	mutation_partition: Remove unused local variable Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-03 12:56:13 +03:00
Nadav Har'El	b3f4a37a75	test/alternator: verify that nulls are valid inside string and bytes The tests in this patch verify that null characters are valid characters inside string and bytes (blob) attributes in Alternator. The tests verify this for both key attributes and non-key attributes (since those are serialized differently, it's important to check both cases). The tests pass on both DynamoDB and Alternator - confirming that we don't have a bug in this area. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210824163442.186881-1-nyh@scylladb.com>	2021-09-03 08:49:06 +02:00
Avi Kivity	a81057b2e1	Merge "sstables: introduce crawling reader" from Botond " A special-purpose reader which doesn't use the index at all, designed to be used in circumstances where the index is not reliable. The use-case is scrub and validate which often have to work with corrupt indexes and it is especially important that they don't further any existing corruption. Tests: unit(dev) " * 'crawling-sstable-reader/v2' of https://github.com/denesb/scylla: compaction: scrub/validate: use the crawling sstable reader sstables: wire in crawling reader sstables: mx/reader: add crawling reader sstables: kl/reader: add crawling reader	2021-09-02 16:26:35 +03:00
Nadav Har'El	068c4283b7	test/cql-pytest: add tests for some undocumented cases of string types This patch adds tests for two undocumented (as far as I can tell) corner cases of CQL's string types: 1. The types "text" and "varchar" are not just similar - they are in fact exactly the same type. 2. All CQL string and blob types ("ascii", "text" or "varchar", "blob") allow the null character as a valid character inside them. They are not C strings that get terminated by the first null. These tests pass on both Cassandra and Scylla, so did not expose any bug, but having such tests is useful to understand these (so-far) undocumented behaviors - so we can later document them. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210824225641.194146-1-nyh@scylladb.com>	2021-09-02 15:45:47 +03:00
Pavel Solodovnikov	ebee744590	idl-compiler: make the script work with python 3.8 Python 3.8 doesn't allow to use built-in collection types in type annotations (such as `list` or `dict`). This feature is implemented starting from 3.9. Replace `list[str]` type annotation with an old-style `List[str]`, which uses `List` from the `typing` module. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210901131436.35231-1-pa.solodovnikov@scylladb.com>	2021-09-02 15:38:44 +03:00
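The annotation change described in the commit above can be illustrated with a short sketch. The `split_namespace` function is a hypothetical example, not code from the idl-compiler script; it only shows the `typing.List` spelling that also works on Python 3.8, where evaluating `list[str]` in an annotation raises `TypeError` (subscripting built-in collections arrived in Python 3.9 with PEP 585).

```python
from typing import List


def split_namespace(qualified_name: str) -> List[str]:
    """Split a C++-style qualified name into its components.

    Annotated with typing.List so the module also imports on Python 3.8;
    writing `list[str]` here would fail at definition time on 3.8.
    """
    return qualified_name.split("::")


parts = split_namespace("cql3::expr::expression")
```

An alternative that keeps the newer spelling is `from __future__ import annotations`, which defers annotation evaluation; the commit message indicates the script went with `typing.List` instead.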
Pavel Solodovnikov	4cfec099b9	cmake: update and sort source files and idl:s Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-09-02 14:23:37 +03:00
Raphael S. Carvalho	3263c1d5f1	Make shutdown clean when stopping sstable reshard After `aa7cdc0392`, run_custom_job() propagates stop exception. The problem is that we fail to handle stop exception in the procedure which stops ongoing compactions, so the exception will be propagated all the way to init, which causes scylla to abort. To fix this, let's swallow stop_exception in stop_ongoing_compactions(), which is correct because compactions are stopped by triggering that exception if signalled to stop. When reshard is stopped, scylla init will fail as follows instead: ERROR 2021-08-16 20:13:13,770 [shard 0] init - Startup failed: std::runtime_error (Exception while populating keyspace 'keyspace5' with column family 'standard1' from file ... Fixes #9158. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210816232434.78375-1-raphaelsc@scylladb.com>	2021-09-02 13:50:24 +03:00
Benny Halevy	33f579f783	distributed_loader: distributed_loader::get_sstables_from_upload_dir: do not copy vector containing foreign shared sstables lw_shared_ptr must not be copied on a foreign shard. Copying the vector on shard 0 tries to increase the reference count of lw_shared_ptr<sstable> elements that were created on other shards, as seen in https://github.com/scylladb/scylla/issues/9278. Fixes #9278 DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>	2021-09-02 13:49:06 +03:00
Avi Kivity	9c17f75f52	cql3: reduce noise in grammar when using cql3::expr types The CQL grammar is obviously about cql3 and mostly about cql3 expressions, so add using namespace statements so we don't have to specify it over and over again. These statements are present in the headers, but only in the cql_parser namespace, so it doesn't pollute other translation units. Closes #9255	2021-09-02 13:39:42 +03:00
Michał Radwański	9a1e82bb92	.gitignore: add compile_commands.json compile_commands.json is a format of compilation database info for use with several editors, such as VSCode (with official C++ extension) and Vim (with youcompleteme). It can be generated with ninja: ``` ninja -t compdb > compile_commands.json ``` I propose this addition, so that this file won't be committed by accident. Closes #9279	2021-09-02 13:37:35 +03:00
Pavel Solodovnikov	f8fe043b94	build: allow to run `SCYLLA-VERSION-GEN` utility out of source This change allows invoking the script in out-of-source builds: `git log` now uses `-C` option with the directory containing the script. Also, the destination path can now be overridden by providing `-o\|--output-dir PATH` option. By default it's set to the `build` directory relative to the script location. Usage message is now shown, when '-h\|--help' option is specified. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210831120257.46920-1-pa.solodovnikov@scylladb.com>	2021-09-02 13:04:34 +03:00
Takuya ASADA	729d0feef0	install-dependencies.sh: add scylla-driver to relocatable python3 Pass --pip-packages option to tools/python3/reloc/build_reloc.sh, add scylla-driver to relocatable python3 which is required for fix_system_distributed_tables.py. [avi: regenerate toolchain] Ref #9040	2021-09-02 11:52:47 +03:00
Pavel Emelyanov	cfcea8fc33	storage_service: Replace is_local_dc() with vs db::is_local() Both functions do the same -- get datacenters from given endpoint and local broadcast address and compare them to match (or not). tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210902080858.16364-1-xemul@scylladb.com>	2021-09-02 11:25:48 +03:00
Avi Kivity	403645f58c	Merge "raft: miscellaneous fixes" from Gleb * 'raft-misc-v3' of github.com:scylladb/scylla-dev: raft: rename snapshot into snapshot_descriptor raft: drop snapshot if is application failed raft: fix local snapshot detection raft: replication_test: store multiple snapshots in a state machine raft: do not wait for entry to become stable before replicate it	2021-09-02 11:25:06 +03:00
Avi Kivity	8a1d99a039	Update seastar submodule * seastar 07758294ef...c04a12edbd (4): > core: add alien() getter to reactor > io_priority_class: add missing headers > Merge "require deferred action to be noexcept" from Benny > net: silence compiler warning in tls_connected_socket_impl.	2021-09-02 11:11:49 +03:00
Michael Livshin	fbb5802229	mf-stream-validator: add previous partition key to error messages Only seems to make sense in mutation fragment validation where validation level is >= `partition_key`. Fixes #9269 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210901165641.340185-1-michael.livshin@scylladb.com>	2021-09-02 11:05:33 +03:00
Botond Dénes	7a78601b5d	compaction: scrub/validate: use the crawling sstable reader Sstables that are scrubbed or validated are typically problematic ones that potentially have corrupt indexes. To avoid using the index altogether use the recently added crawling reader. Scrub and validate never skips in the sstable anyway.	2021-09-01 16:21:49 +03:00
Botond Dénes	1abf665d1d	sstables: wire in crawling reader	2021-09-01 16:21:49 +03:00
Avi Kivity	705f957425	Merge "Generalize TLS creds builder configuration" from Pavel E " There are 4 places out there that do the same steps parsing "client_|server_encryption_options" and configuring the seastar::tls::creds_builder with the values (messaging, redis, alternator and transport). Also to make redis and transport look slimmer main() cleans the client_encryption_options by ... parsing it too. This set introduces a (coroutinized) helper to configure the creds_builder with map<string, string> and removes the options beautification from main. tests: unit(dev), dtest.internode_ssl_test(dev) " * 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla: code: Generalize tls::credentials_builder configuration transport, redis: Do not assume fixed encryption options messaging: Move encryption options parsing to ms main: Open-code internode encryption misconfig warning main, config: Move options parsing helpers	2021-09-01 14:19:19 +03:00
Nadav Har'El	72bc37ddc1	README.md: update link to docker build instructions The link to the docker build instructions was outdated - from the time our docker build was based on a Redhat distribution. It no longer is, it's now based on Ubuntu, and the link changed accordingly. Fixes #9276. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210901083055.445438-1-nyh@scylladb.com>	2021-09-01 11:50:11 +03:00
Liu Lan	a5c54867f8	alternator: Exclusive start key must lie within the segment ...when using Segment/TotalSegment option. The requirement is not specified in DynamoDB documents, but found in DynamoDB Local: {"__type":"com.amazon.coral.validate#ValidationException", "message":"Exclusive start key must lie within the segment"} Fixes #9272 Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com> Closes #9270	2021-09-01 11:05:45 +03:00
Botond Dénes	9548200e85	sstables: mx/reader: add crawling reader A special-purpose reader which doesn't use the index at all and hence doesn't support skipping at all. It is designed to be used in conditions in which the index is not reliable (scrub compaction).	2021-09-01 08:44:13 +03:00
Botond Dénes	4421929b25	sstables: kl/reader: add crawling reader A special-purpose reader which doesn't use the index at all and hence doesn't support skipping at all. It is designed to be used in conditions in which the index is not reliable (scrub compaction).	2021-09-01 08:42:10 +03:00
Avi Kivity	8b59e3a0b1	Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski Return the pre- `6773563d3` behavior of demanding ALLOW FILTERING when partition slice is requested but on potentially unlimited number of partitions. Put it on a flag defaulting to "off" for now. Fixes #7608; see comments there for justification. Tests: unit (debug, dev), dtest (cql_additional_test, paging_test) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9126 * github.com:scylladb/scylla: cql3: Demand ALLOW FILTERING for unlimited, sliced partitions cql3: Track warnings in prepared_statement test: Use ALLOW FILTERING more strictly cql3: Add statement_restrictions::to_string	2021-08-31 18:05:26 +03:00
Dejan Mircevski	2f28f68e84	cql3: Demand ALLOW FILTERING for unlimited, sliced partitions When a query requests a partition slice but doesn't limit the number of partitions, require that it also says ALLOW FILTERING. Although do_filter() isn't invoked for such queries, the performance can still be unexpectedly slow, and we want to signal that to the user by demanding they explicitly say ALLOW FILTERING. Because we now reject queries that worked fine before, existing applications can break. Therefore, the behavior is controlled by a flag currently defaulting to off. We will default to "on" in the next Scylla version. Fixes #7608; see comments there for justification. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-08-31 10:45:41 -04:00
Nadav Har'El	9666921dbc	Merge 'cql3: expr: introduce search_and_replace()' from Avi Kivity Introduce a general-purpose search and replace function to manipulate expressions, and use it to simplify replace_column_def() and replace_token(). Closes #9259 * github.com:scylladb/scylla: cql3: expr: rewrite replace_token in terms of search_and_replace() cql3: expr: rewrite replace_column_def in terms of search_and_replace() cql3: expr: add general-purpose search-and-replace	2021-08-31 15:56:41 +03:00
Avi Kivity	6a0a5a17d7	Merge "Fix exception safety of btree::clone_from()" from Pavel E " When cloning throws in the middle it may leak some child nodes triggering the respective assertion in node destructor. Also there's a chance to mis-assert the linear node roll-back. tests: unit(dev) " Fixes #9248 Backport: 4.5 * 'br-btree-clone-exceptions-2' of https://github.com/xemul/scylla: btree: Add commens in .clone() and .clear() btree, test: Test exception safety and non-leakness of btree::clone_from btree, test: Test key copy constructor may throw btree: Dont leak kids on clone roll-back btree: Destroy, not drop, node on clone roll-back	2021-08-31 14:34:14 +03:00
Pavel Emelyanov	e6d568b38e	btree: Add commens in .clone() and .clear() There are two tricky places about corner leaves pointers managements. Add comments describing the magic. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-31 12:36:54 +03:00
Avi Kivity	542a8bc0f3	cql3: expr: rewrite replace_token in terms of search_and_replace() Use search_and_replace() to simplify replace_token(). Note the conversion does not have 100% fidelity - the previous implementation throws on some impossible subexpression types, and the new one passes them through. It should be the caller's responsibility anyway, not a side effect of replacing tokens, and since these subexpressions are impossible there is no real effect on execution. Note that this affects only TOKEN() calls on the partition key columns in the right order. Other uses of the token function (say with constants) won't be translated to the token subexpression type. So something like WHERE token(pk) = token(?) would only see the left-hand side replaced, not the right-hand side, even if it were an expression rather than a term.	2021-08-31 12:29:47 +03:00
Avi Kivity	10ca63128a	cql3: expr: rewrite replace_column_def in terms of search_and_replace() We won't introduce new expression types that are equivalent to column_value, and search_and_replace() takes care of all expressions that need to recurse, so we don't need std::visit() for the search/replace lambda.	2021-08-31 12:29:47 +03:00
Avi Kivity	7a594bc42f	cql3: expr: add general-purpose search-and-replace Add a recursive search-and-replace function on expressions. The caller provides a search/replace function to operate on subexpressions, returning nullopt if they want the default behavior of recursively copying, or a new expression to terminate the search (in the current subtree) and replace the current node with the returned expression. To avoid endlessly specifying the subexpression types that get the common behavior (copying) since they don't contain any subexpressions, we add a new concept LeafExpression to signify them. Existing functions such as replace_token() can be reimplemented in terms of search_and_replace, but that is left for later.	2021-08-31 12:29:37 +03:00
Pavel Emelyanov	e26a6c1acc	btree, test: Test exception safety and non-leakness of btree::clone_from Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-31 12:23:49 +03:00
Pavel Emelyanov	da38038222	btree, test: Test key copy constructor may throw It calls the tree_test_key_base copy constructor which is throwing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-31 12:23:49 +03:00
Pavel Emelyanov	d1a1a2dac2	btree: Dont leak kids on clone roll-back When failed-to-be-cloned node cleans itself it must also clear all its child nodes. Plain destroy() doesn't do it, it only frees the provided node. fixes: #9248 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-31 12:23:49 +03:00
Pavel Emelyanov	1d857d604a	btree: Destroy, not drop, node on clone roll-back The node in this place is not yet attached to its parent, so in btree::debug::yes (tests only) mode the node::drop()'s parent checks will access null parent pointer. However, in non-testing runtime there's a chance that a linear node fails to clone one of its keys and gets here. In this case it will carry both leftmost and rightmost flags and the assertion in drop will fire. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-31 12:23:49 +03:00
Dejan Mircevski	81f00d82cf	cql3: Drop more dead code This is some dead code that `44ca965ba` missed. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9267	2021-08-31 12:06:19 +03:00
Beni Peled	4fe4aa190d	dist-check: add podman support ...and use container term instead of docker Closes #9265	2021-08-31 09:10:58 +03:00
Nadav Har'El	d7474ddff3	dist/docker: fix errors in README.md The (oddly-placed) document dist/docker/debian/README.md explains how a developer can build a Scylla docker image using a self-built Scylla executable. While the document begins by saying that you can "build your own Scylla in whatever build mode you prefer, e.g., dev.", the rest of the instructions don't fit this example mode "dev" - the second command does "ninja dist-deb" which builds all modes, while the third command forgets to pass the mode at all (and therefore defaults to "release"). The fourth command doesn't work at all, and became irrelevant during a recent rewrite in commit `e96ff3d`. This patch modifies the document to fix those problems. It ends with an example of how to run the resulting docker image (this is usually the purpose of building a docker image - to run it and test it). I did this example using podman because I couldn't get it to work in docker. Later we can hopefully add the corresponding docker example. Fixes #9263. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210829182608.355748-1-nyh@scylladb.com>	2021-08-30 08:36:33 +03:00
Nadav Har'El	ed7106ebd7	docker: fix regression of docker image ignoring command-line arguments Our docker image accepts various command-line arguments and translates them into Scylla arguments. For example, Alternator's getting-started document has the following example: ``` docker run --name scylla -d -p 8000:8000 scylladb/scylla-nightly:latest --alternator-port=8000 --alternator-write-isolation=always``` Recently, this stopped working and the extra arguments at the end were just ignored. It turns out that this is a regression caused by commit `e96ff3d82d` that changed our docker image creation process from Dockerfile to buildah. While the entry point specified in Dockerfile was a string, the same string in buildah has a strange meaning (an entry point which can't take arguments) and to get the original meaning, the entry point needs to be a JSON array. This is kind-of explained in https://github.com/containers/buildah/issues/732. So changing the entry point from a string to a JSON array fixes the regression, and we can again pass arguments to Scylla's docker image. Fixes #9247. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210829180328.354109-1-nyh@scylladb.com>	2021-08-30 08:26:15 +03:00
Nadav Har'El	bd4552fd57	configure.py: fix build-mode-specific targets to not build all modes We have in our Ninja build file various targets which ask to build just a single build mode. For example, "ninja dev" builds everything in dev mode - including Scylla, tests, and distribution artifacts - but shouldn't build anything in other build modes (debug, release, etc.), even if they were previously configured by configure.py. However, we had a bug where these build-mode-specific targets nevertheless compiled all configured modes, not just the requested mode. The bug was introduced in commit `edd54a9463` - targets "dist-server-compat" and "dist-unified-compat" were introduced, but instead of having per-build-mode versions of these targets, only one of each was introduced building all modes. When these new targets were used in a couple of places in per-build-mode targets, it forced these targets to build all modes instead of just the chosen one. The solution is to split the dist-server-compat target into multiple dist-server-compat-{mode}, and similarly split dist-unified-compat. The unsplit target is also retained - for use in targets that really want all build modes. Fixes #9260. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210829123418.290333-1-nyh@scylladb.com>	2021-08-29 15:38:27 +03:00
Gleb Natapov	ce40b01b07	raft: rename snapshot into snapshot_descriptor The snapshot structure does not contain the snapshot itself but only refers to it through its id. Rename it to snapshot_descriptor for clarity.	2021-08-29 12:53:03 +03:00
Gleb Natapov	0aa2e95475	raft: drop snapshot if is application failed No need to keep a snapshot that was not applied.	2021-08-29 12:53:03 +03:00
Gleb Natapov	f9f859ac40	raft: fix local snapshot detection The code assumes that the snapshot that was taken locally is never applied. Currently logic to detect that is flawed. It relies on an id of a most recently applied snapshot (where a locally taken snapshot is considered to be applied as well). But if between snapshot creation and the check another local snapshot is taken ids will not match. The patch fixes this by propagating locality information together with the snapshot itself.	2021-08-29 12:53:03 +03:00
Gleb Natapov	80a392a444	raft: replication_test: store multiple snapshots in a state machine State machine should be able to store more than one snapshot at a time (one may be the currently used one and another is transferred from a leader but not applied yet).	2021-08-29 12:53:03 +03:00
Gleb Natapov	5e1d589872	raft: do not wait for entry to become stable before replicate it Since io_fiber persist entries before sending out messages even non stable entries will become stable before observed by other nodes. This patch also moves generation of append messages into get_outptut() call because without the change we will lose batching since each advance of last_idx will generate new append message.	2021-08-29 12:48:15 +03:00
Avi Kivity	3de5b849e0	Update tools/java submodule (JAVA8_HOME) * tools/java 0b6ecbeb90...a2fe67fd42 (1): > build_reloc.sh: set JAVA8_HOME if not already set	2021-08-29 12:27:17 +03:00
Pavel Emelyanov	60a7ca62f2	storage_service: Drop .enable_all_features() This method has nothing to do with storage service and is only needed to move feature service options from one method to another. This can be done by the only caller of it. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210827133954.29535-1-xemul@scylladb.com>	2021-08-29 11:27:05 +03:00
Pavel Solodovnikov	998dadf479	keys: remove `with_linearized` uses There is a variant of `to_hex` that works with `managed_bytes_view`, no need to linearize. Tests: unit(dev) [avi: edit out unneeded std::ref()] Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210828093252.650928-1-pa.solodovnikov@scylladb.com>	2021-08-28 12:49:10 +03:00
Pavel Emelyanov	77a8fee513	tests: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 19:17:08 +03:00
Pavel Emelyanov	9a77ff1cf4	tests: Split sstable_conforms_to_mutation_source The only case in this test effectively carries 6 of them. When run as they are now (sequentially) the total test run time in debug mode is ~35 minutes. When split each case takes ~6 minutes to complete. In dev/release mode it's ~1 minute vs ~10 seconds respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 19:17:01 +03:00
Pavel Emelyanov	0fd00d7016	cdc: Add database argument to is_log_for_some_table All callers have been patched already. This argument can now be used to replace get_local_storage_proxy().get_db().local() call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 14:07:26 +03:00
Pavel Emelyanov	2701a1ee28	client_state: Pass database into has_access() All callers of it already have it, so just pass it along Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 14:07:26 +03:00
Pavel Emelyanov	de7761985c	client_state: Add database argument to has_schema_access The only caller is thrift that has database reference on board Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 14:07:26 +03:00
Pavel Emelyanov	36a4c1ddc1	client_state: Add database argument to has_keyspace_access() Callers are cql3, that has database via proxy, and thrift that has one by reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 14:07:18 +03:00
Pavel Emelyanov	fe8bc0757b	cdc: Add database argument to check_for_attempt_to_create_nested_cdc_log The only caller of it already has database argument, just pass it a bit further Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-27 14:07:18 +03:00
Pavel Solodovnikov	b00443ab87	test: adjust `schema_change_test` to include new `system.raft_config` table Check that the new table uses null sharder. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-08-27 09:30:17 +03:00
Pavel Solodovnikov	8d3c0ee9b6	raft: new schema for storing raft snapshots Previously, the layout for storing raft snapshot descriptors contained a `config` field, which had `blob` data type. That means `raft::configuration` for the snapshot was serialized as a whole in binary form. It's convenient to implement and is the most compact form of representing the data, but: 1. Hard to debug due to the need to de-serialize the data. 2. Plants a time bomb wrt. changing data layout and also the documentation in the future. Remove the `config` field from `system.raft_snapshots` and extract it to a separate `system.raft_config` table to store the data in exploded form. Also, modify the schema of `system.raft_snapshots` table in the following way: add a `server_id` field as a part of composite partition key ((group_id, server_id)) to be able to start multiple raft servers belonging to one raft group on the same scylla node. Rename `id` field in `raft_snapshots` to `snapshot_id` so it's self-documenting. Remove `snapshot_id` from clustering key since a given server can have only one snapshot installed at a time. Note that the `raft::server_address` structure contains an opaque `info` member, which is `bytes`, but in the `raft_config` table we use `ip_addr inet` field, instead. We always know that the corresponding member field is going to contain an IP address (either v4 or v6) of a given raft server. 
So, now the snapshots schema looks like this: CREATE TABLE raft_snapshots ( group_id timeuuid, server_id uuid, snapshot_id uuid, idx int, term int, -- no `config` field here, moved to `raft_config` table PRIMARY KEY ((group_id, server_id)) ) CREATE TABLE raft_config ( group_id timeuuid, my_server_id uuid, server_id uuid, disposition text, -- can be either 'CURRENT` or `PREVIOUS' can_vote bool, ip_addr inet, PRIMARY KEY ((group_id, my_server_id), server_id, disposition) ); This way it's much easier to extend the schema with new fields, very easy to debug and inspect via CQL, and it's much more descriptive in terms of self-documentation. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-08-27 09:24:46 +03:00
Pavel Solodovnikov	0a8faee660	raft: pass server id to `raft_sys_table_storage` instance Preparations for changing raft snapshots schema. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-08-27 09:24:20 +03:00
Eliran Sinvani	7f44736939	Service Levels: do not notify stale service level removals Before this commit, the service_level_controller will notify the subscribers on stale deletes, meaning, deletes of locally non-existent service_levels. The code flow shouldn't ever get to such a state, but as long as this condition is checked instead of being asserted it is worthwhile to change the code to be safe. Closes #9253	2021-08-26 18:27:52 +03:00
Nadav Har'El	389b866d33	Merge 'cql3: convert term::raw to expressions' from Avi Kivity This series converts the `term::raw` objects the grammar produces to expressions. For each grammar production, an expression type is either reused or created. The term::raw methods are converted to free functions accepting expressions as input (but not yet generating expressions as output). There is some friction because the original code base had four different expression domains: `term`, `term::raw`, `selectable`, and `selectable::raw`. They don't match exactly, so in some cases we need to add additional state to distinguish between them. There are also many run-time checks introduced (on_internal_error) since the union of the domains is much larger than each individual domain. The method used is to erect a bi-directional bridge between term::raw and expressions, convert various term::raw subclasses one by one, and then remove the bridge and term::raw. Test: unit (dev). Closes #9170 * github.com:scylladb/scylla: cql3: expr: eliminate column_specification_or_tuple cql3: expr: hide column_specification_or_tuple cql3: term::raw: remove term::raw and scaffolding cql3: grammar: collapse conversions between term::raw and expressions cql3: relation: convert to_term() to experssions cql3: functions: avoid intermediate conversions to term::raw cql3: create_aggregate_statement: convert term::raw to expression cql3: update_statement, insert_statement: convert term::raw to expression cql3: select_statement: convert term::raw to expression cql3: token_relation: convert term::raw to expressions cql3: operation: convert term::raw to expression cql3: multi_column_relation: convert term::raw to expressions cql3: single_column_relation: convert term::raw to expressions cql3: column_condition: convert term::raw to expressions cql3: expr: don't convert subexpressions to term::raw during the prepare phase cql3: attributes: convert to expressions cql3: expr: introduce test_assignment_all() cql3: 
expr: expose prepare_term, test_assignment in the expression domain cql3: expr: provide a bridge between expressions and assignment_testable cql3: expr, user types: convert user type literals to expressions cql3: selection: make selectable.hh not include expr/expresion.hh cql3: sets, user types: move user types raw functions around cql3: expr, sets, maps: convert set and map literals to collection_constructor cql3: sets, maps, expr: move set and map raw functions around cql3: expr, lists: convert lists::literal to new collection_constructor cql3: lists, expr: move list raw functions around cql3: tuples, expr: convert tuples::literal to expr::tuple_constructor cql3: expr, tuples: deinline and move tuple raw functions cql3: expr, constants: convert constants::literal to untyped_constant cql3: constants: move constants::literal implementation around cql3: expr, abstract_marker: convert to expressions cql3: column_condition: relax types around abstact_marker::in_raw cql3: tuple markers: deinline and rearrange cql3: abstract_marker, term_expr: rearrange raw abstract marker implementation cql3: expr, constants: convert cql3::constants::null_literal to new cql3::expr::null cql3: expr, constants: deinline null_literal cql3: constants: extricate cql3::constants::null_literal::null_value from null_literal cql3: term::raw, expr: convert type casts to expressions cql3: type_cast: deinline some methods cql3: expr: prepare expr::cast for unprepared types cql3: expr, functions: move raw function calls to expressions cql3: expr, term::raw: add conversions between the two types cql3: expr, term::raw: add reverse bridge cql3: term::raw, expr: add bridge between term::raw and expressions cql3: eliminate multi_column_raw cql3: term::raw, multi_column_raw: unify prepare() signatures	2021-08-26 17:29:40 +03:00
Michał Chojnowski	126baa7850	utils: compact-radix-tree: fix accidental cache line bouncing Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is overwritten. But since nil_root is shared between shards, this causes severe cache line bouncing. (It was observed to reduce the total write throughput of Scylla by 90% on a large NUMA machine). This backreference is never read anyway, so fix this bug by not writing it. Fixes #9252 Closes #9246	2021-08-26 17:22:22 +03:00
Avi Kivity	2da7b79e16	cql3: expr: eliminate column_specification_or_tuple column_specification_or_tuple is now used internally, wrapping a column_specification or a vector and immediately unwrapping in the callee. The only exceptions are bind_variable and tuple_constructor, which handle both cases. Use the underlying types directly instead, and add dispatching to prepare_term_multi_column() for the two cases it handles.	2021-08-26 16:30:47 +03:00
Avi Kivity	ad285c3c84	cql3: expr: hide column_specification_or_tuple column_specification_or_tuple was introduced since some terms were prepared using a single receiver e.g. (receiver = <term>) and some using multiple receivers (e.g. (r1, r2) = <term>. Some term types supported both. To hide this complexity, the term->expr conversion used a single interface for both variations (column_expression_or_tuple), but now that we got rid of the term class and there are no virtual functions any more, we can just use two separate functions for the two variants. Internally we still use column_expression_or_tuple, it can be removed later.	2021-08-26 16:17:49 +03:00
Avi Kivity	158822c1a6	cql3: term::raw: remove term::raw and scaffolding Nothing now uses term::raw, remove it and the scaffolding used to migrate it to expressions.	2021-08-26 16:14:47 +03:00
Avi Kivity	78b7af415f	cql3: grammar: collapse conversions between term::raw and expressions The grammar now talks to expression API:s solely, so it can be converted internally to expressions too. Calls to as_term_raw() and as_expression() are removed, and productions return expressions instead of term::raw:s.	2021-08-26 15:56:44 +03:00
Avi Kivity	cb2560728a	cql3: relation: convert to_term() to experssions Now that the entire relation hierarchy was converted to expressions, also convert relation::to_term().	2021-08-26 15:56:44 +03:00
Avi Kivity	f652972b12	cql3: functions: avoid intermediate conversions to term::raw Instead, use conversions to assignment_testable and native expression prepare functions.	2021-08-26 15:56:44 +03:00
Avi Kivity	8cd505d191	cql3: create_aggregate_statement: convert term::raw to expression Straightforward substitution.	2021-08-26 15:53:27 +03:00
Avi Kivity	dd30b7853b	cql3: update_statement, insert_statement: convert term::raw to expression Straightforward substitution.	2021-08-26 15:42:30 +03:00
Avi Kivity	b11ec1aeda	cql3: select_statement: convert term::raw to expression Straightforward substitution; using std::optional<> since those expressions are indeed optional.	2021-08-26 15:41:14 +03:00
Avi Kivity	cf10df10f4	cql3: token_relation: convert term::raw to expressions Change term::raw in token_relation to expressions. to_term() is not converted, since it's part of the larger relation hierarchy.	2021-08-26 15:39:43 +03:00
Avi Kivity	c2d49b50f4	cql3: operation: convert term::raw to expression Straightforward substitution.	2021-08-26 15:37:52 +03:00
Avi Kivity	b6e17ed111	cql3: multi_column_relation: convert term::raw to expressions Change term::raw in multi_column_relation to expressions. Because a single raw class is used to represent multiple shapes (IN ? and IN (x, y, z)), some of the expressions are optional, corresponding to nullables before the conversion. to_term() is not converted, since it's part of the larger relation hierarchy.	2021-08-26 15:36:42 +03:00
Avi Kivity	4809cf7ff3	cql3: single_column_relation: convert term::raw to expressions Change term::raw in single_column_relation to expressions. Because a single raw class is used to represent multiple shapes (IN ? and IN (x, y, z)), some of the expressions are optional, corresponding to nullables before the conversion. to_term() is not converted, since it's part of the larger relation hierarchy.	2021-08-26 15:35:32 +03:00
Avi Kivity	793aca8e4e	cql3: column_condition: convert term::raw to expressions Change term::raw in column_condition::raw to expressions. Because a single raw class is used to represent multiple shapes (IN ? and IN (x, y, z)), some of the expressions are optional, corresponding to nullables before the conversion. to_term() is not converted, since it's part of the larger relation hierarchy.	2021-08-26 15:34:13 +03:00
Avi Kivity	8cdb6a102f	cql3: expr: don't convert subexpressions to term::raw during the prepare phase Now that we have the prepare machinery exposed as expression API:s (not just term::raw) we can avoid conversions from expressions to term::raw when preparing subexpressions.	2021-08-26 15:34:01 +03:00
Avi Kivity	c93731a6e9	cql3: attributes: convert to expressions Convert the three variables in attrbutes::raw to expressions. Since those attributes are optional, use std::optional to indicate it (since we can't rely on shared_ptr<term::raw> being null).	2021-08-26 15:32:52 +03:00
Avi Kivity	3c6914c5bf	cql3: expr: introduce test_assignment_all() The test_assignment class has a test_all() helper to test a vector of assignment_testable. But expressions are not derived from assignment_testable, so introduce a new helper that does the same for expressions.	2021-08-26 15:30:46 +03:00
Avi Kivity	55fd8e69ec	cql3: expr: expose prepare_term, test_assignment in the expression domain So far prepare (in the term domain) was called via term::raw. To be able to prepare in the expression domain, expose functions prepare_term() and test_assignment() that accept expressions as arguments. prepare_term() was chosen rather than prepare() to differentiate wrt. the other domain that can be prepared (selectables).	2021-08-26 15:29:10 +03:00
Avi Kivity	be335f4dee	cql3: expr: provide a bridge between expressions and assignment_testable While we have a bridge between expressions and term::raw, which is derived from assignment_testable, we will soon get rid of term::raw and so won't be able to interface with API:s that require an assignment_testable. So add a bridge for that. The user is function::get(), which uses assignment_testable to infer the function overload from the argument types.	2021-08-26 15:26:38 +03:00
Avi Kivity	562e68835b	cql3: expr, user types: convert user type literals to expressions Convert the user_types::literal raw to a new expression type usertype_constructor. I used "usertype" to convey that it is a ((user type) constructor), not a (user (type constructor)).	2021-08-26 15:26:35 +03:00
Avi Kivity	4d7e00d0f8	cql3: selection: make selectable.hh not include expr/expression.hh We have this dependency now: column_identifier -> selectable -> expression and want to introduce this: expression -> user types -> column_identifier This leads to a loop, since expression is not (yet) forward declarable. Fix by moving any mention of expression from selectable.hh to a new header selection-expr.hh. database.cc lost access to timeout_config, so adjust its includes to regain it.	2021-08-26 15:19:14 +03:00
Avi Kivity	9d6bc7eae6	cql3: sets, user types: move user types raw functions around Move them closer to prepare related functions for modification.	2021-08-26 15:15:59 +03:00
Avi Kivity	06bca067f8	cql3: expr, sets, maps: convert set and map literals to collection_constructor Add set and map styles to collection_constructor. Maps are implemented as collection_constructor{tuple_constructor{key, value}...}. This saves having a new expression type, and reduces the effort to implement recursive descent evaluation for this omitted expression type.	2021-08-26 15:13:37 +03:00
Avi Kivity	658cd47d21	cql3: sets, maps, expr: move set and map raw functions around Move them closer to prepare related functions for modification. Since sets and maps share some implementation details in the grammar, they are moved and converted as a unit.	2021-08-26 15:13:07 +03:00
Avi Kivity	d2ab7fc26d	cql3: expr, lists: convert lists::literal to new collection_constructor Introduce a collection_constructor (similar to C++'s std::initializer_list) to hold subexpressions being gathered into a list. Since sets, maps, and lists construction share some attributes (all elements must be of the same type) collection_constructor will be used for all of them, so it also holds an enum. I used "style" for the enum since it's a weak attribute - an empty set is also an empty map. I chose collection_constructor rather than plain 'collection' to highlight that it's not the only way to get a collection (selecting a collection column is another, as an example) and to hint at what it does - construct a collection from more primitive elements.	2021-08-26 15:10:41 +03:00
Avi Kivity	4defb42c86	cql3: lists, expr: move list raw functions around Move them closer to prepare related functions for modification.	2021-08-26 15:08:14 +03:00
Avi Kivity	5e448e4a2a	cql3: tuples, expr: convert tuples::literal to expr::tuple_constructor Introduce tuple_constructor (not a literal, since (?, ?) and (column_value, column_value) are not literals) to represent a tuple constructed from subexpressions. In the future we can replace column_value_tuple with tuple_constructor(column_value, column_value, ...), but this is not done now. I chose the name 'tuple_constructor' since other expressions can represent tuples (e.g. my_tuple_column, :bind_variable_of_tuple_type, func_returning_tuple()). It also explains what the expression does.	2021-08-26 15:07:15 +03:00
Avi Kivity	41c532f19c	cql3: expr, tuples: deinline and move tuple raw functions Move them closer to prepare functions for modification.	2021-08-26 15:04:21 +03:00
Avi Kivity	2c42a65db1	cql3: expr, constants: convert constants::literal to untyped_constant Introduce a new expression untyped_constant that corresponds to constants::literal, which is removed. untyped_constant is rather ugly in that it won't exist post-prepare. We should probably instead replace it with typed constants that use the widest possible type (decimal and varint), and select a narrower type during the prepare phase when we perform type inference. The conversion itself is straightforward.	2021-08-26 15:03:07 +03:00
Avi Kivity	4d9bde561a	cql3: constants: move constants::literal implementation around Move it closer to prepare functions for modification.	2021-08-26 15:01:06 +03:00
Avi Kivity	838bfbd3e0	cql3: expr, abstract_marker: convert to expressions Convert the four forms of abstract_marker to expr::bind_variable (the name was chosen since variable is the role of the thing, while "marker" refers more to the grammar). Having four variants is unnecessary, but this patch doesn't do anything about that.	2021-08-26 15:01:04 +03:00
Avi Kivity	218f4d87f8	cql3: column_condition: relax types around abstract_marker::in_raw We can only convert expressions to term::raw, not the subclass abstract_marker::in_raw, so relax the types. They will all be converted to expressions. Relaxing types isn't good, but the structure is enforced now by the grammar (and dynamically using variant casts), and in the future by a typecheck pass (which will allow us to remove the many variations of markers).	2021-08-26 14:55:17 +03:00
Avi Kivity	6dcc43d227	cql3: tuple markers: deinline and rearrange Move raw methods near to the other prepare-related functions.	2021-08-26 14:54:15 +03:00
Avi Kivity	35db2b34e4	cql3: abstract_marker, term_expr: rearrange raw abstract marker implementation Move raw methods near to the other prepare-related functions.	2021-08-26 14:53:58 +03:00
lauranovich	e78746e94d	docs: fix removal of master from website drop-down Closes #9251	2021-08-26 14:51:37 +03:00
Avi Kivity	aba205917d	cql3: expr, constants: convert cql3::constants::null_literal to new cql3::expr::null Introduce cql3::expr::null and use it to represent null_literal, which is removed.	2021-08-26 14:49:46 +03:00
Avi Kivity	5b42cbf9e0	cql3: expr, constants: deinline null_literal Deinline null_literal methods and place them near the other prepare-related functions.	2021-08-26 14:45:56 +03:00
Avi Kivity	51f62d5953	cql3: constants: extricate cql3::constants::null_literal::null_value from null_literal null_literal (which is in the term::raw domain) will be converted to an expression, so unnest the nested class null_value (which is in the term domain and is not converted now).	2021-08-26 14:44:21 +03:00
Avi Kivity	10e08dc87e	cql3: term::raw, expr: convert type casts to expressions We reuse the expr::cast type that was previously used for selectables. When preparing, subexpressions are converted to term::raw; this will be removed later.	2021-08-26 14:42:55 +03:00
Avi Kivity	6f8b6aef17	cql3: type_cast: deinline some methods These methods will be converted to the expression variant, and it's impossible to do this while inlined due to #include cycles. In any case, deinlining is better. Since there is no type_cast.cc, and since they'll become part of expr_term call chain soon, they're moved there, even though it seems odd for this patch. It's a waste to create type_cast.cc just for those three functions.	2021-08-26 14:41:38 +03:00
Avi Kivity	3d30c161e4	cql3: expr: prepare expr::cast for unprepared types The cast expression has two operands: the subexpression to cast and the type to cast to. Since prepared and unprepared expressions are the same type, we don't have to do anything, but prepared and unprepared types are different. So add a variant to be able to support both. The reason the selectable->expression transformation did not need to do this is that casts in a selector cannot accept a user defined type. Note those casts also have different syntax and different execution, so we'll have to choose whether to unify the two semantics, or whether to keep them separate. This patch does not force anything (but does hint at unification by not including any discriminant beyond the type's rawness). The string representation matches the part of the grammar it was derived from (or conversion back to CQL will yield wrong results).	2021-08-26 14:39:33 +03:00
Avi Kivity	b76395a410	cql3: expr, functions: move raw function calls to expressions Remove cql3::functions::function_call::raw and replace it with cql3::expr::function_call, which already existed from the selector migration to expressions. The virtual functions implementing term::raw are made free functions and remain in place, to ease migration and review. Note that preparing becomes more complicated as it needs to account for anonymous functions, which were not representable in the previous structure (and still cannot be created by the parser for the term::raw path). The parser now wraps all its arguments with the term::raw->expr bridge, since that's what expr::function_call expects, and in turn wraps the function call with an expr->term::raw bridge, since that's what the rest of the parser expects. These will disappear when the migration completes.	2021-08-26 14:38:16 +03:00
Avi Kivity	0d24af7775	cql3: expr, term::raw: add conversions between the two types Add a way to convert between the old world and the new, and back. Note that instead of blindly wrapping, we unwrap if we received a wrapped object.	2021-08-26 14:35:46 +03:00
Avi Kivity	a5031dd5bf	cql3: expr, term::raw: add reverse bridge Since expressions can nest, and since we won't convert everything at once, add a way to store a term::raw as an expression. We can now have a term::raw that is internally an expression, and an expression that is implemented as term::raw.	2021-08-26 14:32:04 +03:00
Avi Kivity	725065b066	cql3: term::raw, expr: add bridge between term::raw and expressions A term_raw_expression is a term::raw that holds an expression. It will be used to incrementally convert the source base to expressions, while still exposing the result to the common interface of shared_ptr<term::raw>.	2021-08-26 14:14:18 +03:00
Avi Kivity	9a158cd7b5	cql3: eliminate multi_column_raw Now that the signatures of term::raw::prepare and multi_column_raw::prepare are identical, we can eliminate multi_column_raw, replacing it with term::raw where needed. In some cases we delete it from the inheritance chain since we reach term::raw via a different base class. Note that a dynamic_cast<> is eliminated, so we compensate for the addition of runtime checks in the previous patch by the deletion of runtime checks in this patch.	2021-08-26 14:11:42 +03:00
Avi Kivity	660be97028	cql3: term::raw, multi_column_raw: unify prepare() signatures In order to replace the term::raw hierarchy with expressions, we need to unify the signatures of term::raw::prepare() and term::multi_column_raw::prepare(). This is because we'll only have one expression type to represent both single values and tuples (although different subexpression types may be used). The difference in the two prepare() signatures is the `receiver` parameter - which is a (type, name) pair used to perform type inference on the expression being prepared, with the name used to report errors. In a perfect world, this would just be an expression - a tuple or a singular expression as the case requires. But we don't have the needed expression infrastructure yet - general tuples or name-annotated expressions. Resolve the problem by introducing a variant for the single-value and tuple. This is more or less creating a mini-expression type used just for this. Once our expression type grows the needed capabilities, it can replace this type. Note that for some cases, this replaces compile-time checks by runtime checks (which should never trigger). In other cases the classes really needed both interfaces, so the new variant is a better fit.	2021-08-26 14:11:42 +03:00
Avi Kivity	9bf3b9f964	Merge 'Some IDL compiler cleanups' from Pavel Solodovnikov This series incorporates various refactorings aimed mostly at eliminating extra parameters to `serializer__impl` functions for `EnumDef` and `ClassDef` AST classes. Instead of carrying these parameters here and there over many places, they are calculated on a preliminary run to collect additional metadata, such as: namespaces and template parameters from parent scopes. This metadata is used later to extend AST classes. The patchset does not introduce any changes in the generation procedures, exclusively dealing with internal code structuring. NOTE: although metadata collection involves an extra run through the parse tree, the proper way should be to populate it instantly while parsing the input. This is left to be adjusted later in a follow-up series. Closes #8148 github.com:scylladb/scylla: idl: add descriptions for the top-level generation routines idl: make ns_qualified name a class method idl: cache template declarations inside enums and classes idl: cache parent template params for enums and classes idl: rename misleading `local_types` to `local_writable_types` idl: remove remaining uses of `namespaces` argument idl: remove `is_final` function and use `.final` AST class property idl: remove `parent_template_param` from `local_types` set idl: cache namespaces in AST nodes idl: remove unused variables	2021-08-26 13:18:54 +03:00
Benny Halevy	4ffdafe6dc	token_metadata: delete old java code We no longer need to keep it for reference. It's just causing confusion at this point. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210826095457.994834-1-bhalevy@scylladb.com>	2021-08-26 13:03:59 +03:00
Pekka Enberg	a53c1949cd	Update tools/jmx submodule * tools/jmx 5311e9b...70b19e6 (1): > scrub: support scrubMode and deprecate skipCorrupted	2021-08-26 12:27:13 +03:00
Pavel Solodovnikov	c0854a0f62	raft: create system tables only when `raft` experimental feature is set Also introduce a tiny function to return raft-enabled db config for cql testing. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>	2021-08-26 12:21:12 +03:00
Pekka Enberg	bd8fa47d84	Update tools/java submodule * tools/java 4ef8049e07...0b6ecbeb90 (1): > nodetool scrub: support --mode and deprecate --skip-corrupted	2021-08-26 11:07:14 +03:00
Dejan Mircevski	5a4ac002c1	cql3: Track warnings in prepared_statement Preparation should be able to record warnings that make it back to the user via the query response. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-08-25 17:51:05 -04:00
Avi Kivity	acf8da2bce	Merge "flat_mutation_reader: keep timeout in permit" from Benny " This series moves the timeout parameter, that is passed to most f_m_r methods, into the reader_permit. This eliminates the need to pass the timeout around, as it's taken from the permit when needed. The permit timeout is updated in certain cases when the permit/reader is paused and retrieved later on for reuse. Following are perf_simple_query results showing ~1% reduction in insns/op and corresponding increase in tps. $ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10 Before: 102500.38 tps ( 75.1 allocs/op, 12.1 tasks/op, 45620 insns/op) After: 103957.53 tps ( 75.1 allocs/op, 12.1 tasks/op, 45372 insns/op) Test: unit(dev) DTest: repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release) materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release) materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release) migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release) " * tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla: flat_mutation_reader: get rid of timeout parameter reader_concurrency_semaphore: use permit timeout for admission reader_concurrency_semaphore: adjust reactivated reader timeout multishard_mutation_query: create_reader: validate saved reader permit repair: row_level: read_mutation_fragment: set reader timeout flat_mutation_reader: maybe_timed_out: use permit timeout test: sstable_datafile_test: add sstable_reader_with_timeout reader_permit: add timeout member	2021-08-25 17:51:10 +03:00
Raphael S. Carvalho	a4053dbb72	repair: Postpone data segregation to off-strategy compaction With data segregation on repair, thousands of sstables are potentially added to maintenance set which causes high latency due to stalls. That's because N*M sstables are created by a repair, where N = # of ranges and M = # of segregations For TWCS, M = # of windows. Assuming N = 768 and M = 20, ~15k sstables end up in sstable set To fix this problem, let's avoid performing data segregation in repair, as offstrategy will already perform the segregation anyway. So from now on, only N non-overlapping sstables will be added to set. Read amplification isn't affected because a query will only touch one sstable in maintenance set. When offstrategy starts, it will pick all sstables from set and compact them in a single step while performing data segregation, so data is properly laid out before integrated into the main set. tests: - sstable_compaction_test.twcs_reshape_with_disjoint_set_test - mode(dev) - manual test using repair-based bootstrap Fixes #9199. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210824185043.76475-1-raphaelsc@scylladb.com>	2021-08-25 15:31:38 +03:00
Pavel Emelyanov	b012040a76	mutation: Keep range tombstone in tree when consuming Current code std::move()-s the range tombstone into consumer thus moving the tombstone's linkage to the containing list as well. As a result the original range tombstone itself leaks as it leaves the tree and cannot be reached on .clear(). Another danger is that the iterator pointing to the tombstone becomes invalid while it's then ++-ed to advance to the next entry. The immediate fix is to keep the tombstone linked to the list while moving. fixes: #9207 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210825100834.3216-1-xemul@scylladb.com>	2021-08-25 13:25:18 +03:00
Botond Dénes	6df77e350a	mutation_fragment{_v2}: MutationFragmentConsumer: allow for abstract consumer Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210825083244.436274-1-bdenes@scylladb.com>	2021-08-25 13:12:41 +03:00
Avi Kivity	993f824cfd	Merge "raft: implement linearisable reads on a follower" from Gleb and Kostja " This series implements section 6.4 of the Raft PhD. It allows to do linearisable reads on a follower bypassing raft log entirely. After this series server::read_barrier can be executed on a follower as well as leader and after it completes local user's state machine state can be accessed directly. " * 'raft-read-v9' of github.com:scylladb/scylla-dev: raft: test: add read_barrier test to replication_test raft: test: add read_barrier tests to fsm_test raft: make read_barrier work on a follower as well as on a leader raft: add a function to wait for an index to be applied raft: (server) add a helper to wait through uncertainty period raft: make fsm::current_leader() public raft: add hasher for raft::internal::tagged_uint64 serialize: add serialized for std::monostate raft: fix indentation in applier_fiber	2021-08-25 13:11:35 +03:00
Gleb Natapov	3ff6f76cef	raft: test: add read_barrier test to replication_test	2021-08-25 08:57:13 +03:00
Gleb Natapov	ad2c2abcb8	raft: test: add read_barrier tests to fsm_test	2021-08-25 08:57:13 +03:00
Gleb Natapov	03a266d73b	raft: make read_barrier work on a follower as well as on a leader This patch implements a RAFT extension that allows performing linearisable reads by accessing the local state machine. The extension is described in section 6.4 of the PhD. To sum it up, to perform a read barrier a follower needs to ask the leader for the last committed index that it knows about. The leader must make sure that it is still a leader before answering by communicating with a quorum. When the follower gets the index back it waits for it to be applied and by that completes the read_barrier invocation. The patch adds three new RPCs: read_barrier, read_barrier_reply and execute_read_barrier_on_leader. The last one is the one a follower uses to ask a leader about the safe index it can read. The first two are used by a leader to communicate with a quorum.	2021-08-25 08:57:13 +03:00
Gleb Natapov	73af7edc78	raft: add a function to wait for an index to be applied	2021-08-25 08:19:25 +03:00
Konstantin Osipov	0429196e06	raft: (server) add a helper to wait through uncertainty period Add a helper to be able to wait until a Raft cluster leader is elected. It can be used to avoid sleeps when it's necessary to forward a request to the leader, but the leader is yet unknown.	2021-08-25 08:19:25 +03:00
Gleb Natapov	376785042f	raft: make fsm::current_leader() public Later patch will call it from server class.	2021-08-25 08:19:25 +03:00
Gleb Natapov	273f753815	raft: add hasher for raft::internal::tagged_uint64 Need it to be able to use tagged_uint64 as a key in an unordered map.	2021-08-25 08:19:25 +03:00
Gleb Natapov	4851d64c68	serialize: add serialized for std::monostate	2021-08-25 08:19:25 +03:00
Gleb Natapov	bd0fd579cf	raft: fix indentation in applier_fiber	2021-08-25 08:19:25 +03:00
Nadav Har'El	cf06b7cd40	test/alternator: correct some typos in comments Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210729125317.1610573-1-nyh@scylladb.com>	2021-08-24 19:43:29 +03:00
Avi Kivity	4a42b69ba8	Merge "raft: testing: many nodes test" from Alejo " Factor out replication test, make it work with different clocks, add some features, and add a many nodes test with steady_clock. Also refactor common test helper. Many nodes test passes for release and dev and normal tick of 100ms for up to 1000 servers. For debug mode it's much fewer due to lack of optimizations so it's only tested for smaller numbers. Tests: unit ({dev}), unit ({debug}), unit ({release}) " * 'raft-many-22-v12' of https://github.com/alecco/scylla: (21 commits) raft: candidate timeout proportional to cluster size raft: testing: many nodes test raft: replication test: remove unused tick_all raft: replication test: delays raft: replication test: packet drop rpc helper raft: replication test: connectivity configuration raft: replication test: rpc network map in raft_cluster raft: replication test: use minimum granularity raft: replication test: minor: rename local to int ids raft: replication test: fix restart_tickers when partitioning raft: replication test: partition ranges raft: replication test: isolate one server raft: replication test: move objects out of header raft: replication test: make dummy command const raft: replication test: template clock type raft: replication test: tick delta inside raft_cluster raft: replication test: style - member initializer raft: replication test: move common code out raft: testing: refactor helper raft: log election stages ...	2021-08-24 17:05:05 +03:00
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	4e3dcfd7d6	reader_concurrency_semaphore: use permit timeout for admission Now that the timeout is stored in the reader permit use it for admission rather than a timeout parameter. Note that evictable_reader::next_partition currently passes db::no_timeout to resume_or_create_reader, which propagated to maybe_wait_readmission, but it seems to be an oversight of the f_m_r api that doesn't pass a timeout to next_partition(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	9b0b13c450	reader_concurrency_semaphore: adjust reactivated reader timeout Update the reader's timeout where needed after unregistering inactive_read. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	605a1e6943	multishard_mutation_query: create_reader: validate saved reader permit Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	eeab5f77d9	repair: row_level: read_mutation_fragment: set reader timeout The timeout needs to be propagated to the reader's permit. Reset it to db::no_timeout in repair_reader::pause(). Warn if set_timeout asks to change the timeout too far into the past (100ms). It is possible that it will be passed a past timeout from the rpc path, where the message timeout is applied (as duration) over the local lowres_clock time and parallel read_data messages that share the query may end up having close, but different timeout values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:40 +03:00
Benny Halevy	f25aabf1b2	flat_mutation_reader: maybe_timed_out: use permit timeout Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Benny Halevy	46fb7fe68e	test: sstable_datafile_test: add sstable_reader_with_timeout Verify that the sstable reader (for the highest supported version) times out properly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Benny Halevy	fe479aca1d	reader_permit: add timeout member To replace the timeout parameter passed to flat_mutation_reader methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Alejo Sanchez	a5c74a6442	raft: candidate timeout proportional to cluster size To avoid dueling candidates with large clusters, make the timeout proportional to the cluster size. Debug mode is too slow for a test of 1000 nodes so it's disabled, but the test passes for release and dev modes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-24 13:09:01 +02:00
Alejo Sanchez	7206eae16e	raft: testing: many nodes test Tests with many nodes and realistic timers and ticks. Network delays are kept as a fraction of ticks. (e.g. 20/100) Tests with 600 or more nodes hang in debug mode. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-24 13:09:01 +02:00
Alejo Sanchez	87a03a3485	raft: replication test: remove unused tick_all Tests now wait for normal ticks for election, remove deprecated tick_all helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-24 13:09:01 +02:00
Alejo Sanchez	14c214d73e	raft: replication test: delays Allow test supplied delays for rpc communication. Allow supplying network delay, local delay (nodes within the same server), how many nodes are local, and an extra small delay simulating local load. Modify rpc class to support delays. If delays are enabled, it no longer directly calls the other node's server code but it schedules it to be called later. This makes the test more realistic as in the previous version the first candidate was always going to get to all followers first, preventing a dueling candidates scenario. Previously, tickers were all scheduled at the same time, so there was no spread of them across the tick time. Now these tickers are scheduled with a uniform spread across this time (tick delta). Also previously, custom free elections used tick_all(), which traversed _in_configuration sequentially and ticked each. This, combined with rpc outbound directly calling methods in the other server without yielding, caused free elections to be unrealistic, with the same order determined and the first candidate always winning. This patch changes this behavior. The free election uses normal tickers (now uniformly distributed in tick delay time) and its loop waits for tick delay time (yielding) and checks if there's a new leader. Also note the order might not be the same in debug mode if more than one tick is scheduled. As rpc messages are sent delayed, network connectivity needs to be checked again before calling the function on the remote side. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-24 13:05:53 +02:00
Alejo Sanchez	db23823c77	raft: replication test: packet drop rpc helper Add a helper to check if a packet should be dropped. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	497af3167f	raft: replication test: connectivity configuration Pass packet drops within connectivity configuration struct. Default to no packet drops. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	e4d5428e8a	raft: replication test: rpc network map in raft_cluster Move rpc network map to raft cluster, no longer as static in rpc class.	2021-08-23 17:50:16 +02:00
Alejo Sanchez	192ac5be4c	raft: replication test: use minimum granularity seastar lowres_clock minimum granularity is 10ms, not 1ms. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	5cfe6c1ca2	raft: replication test: minor: rename local to int ids For clarity, name 0-based integer ids as int ids not local. This is in contrast with 1-based UUID ids.	2021-08-23 17:50:16 +02:00
Alejo Sanchez	27d90f0165	raft: replication test: fix restart_tickers when partitioning When partitioning, elect_new_leader restarts tickers, so don't re-restart them in this case. When leader is dropped and no new leader is specified, restart tickers before free election. If no change of leader, restart tickers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	e4262291f2	raft: replication test: partition ranges Allow specifying ranges within partition to handle large number of nodes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	56a110d42f	raft: replication test: isolate one server Support disconnection of one server with the rest. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	6b3327c753	raft: replication test: move objects out of header Use a separate cc file for definitions and objects. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	cea18e6830	raft: replication test: make dummy command const Make dummy command const in header. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	2db3192ac3	raft: replication test: template clock type Templatize clock type. Use a struct for run_test to work around https://bugs.llvm.org/show_bug.cgi?id=50345 With help from @kbr- Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	cb35588fb1	raft: replication test: tick delta inside raft_cluster Store tick delta inside raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	49cb040037	raft: replication test: style - member initializer Fix raft_cluster constructor member initializer list.	2021-08-23 17:50:16 +02:00
Alejo Sanchez	6e2ab657b3	raft: replication test: move common code out Common replication test code moved to header. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	a6cd35c512	raft: testing: refactor helper Move definitions to helper object file. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	466972afb0	raft: log election stages Add logging for election tracing. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Alejo Sanchez	617d6df42c	raft: log with method name Use standard log format function[id] for log entries in fsm.cc Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-08-23 17:50:16 +02:00
Takuya ASADA	2a8b48b6fa	configure.py: add --dist-only for packaging development Add --dist-only option to disable compiling code, just build packages. It will significantly speed up rebuild packages, make packaging development easier. Closes #9227	2021-08-23 18:38:35 +03:00
Avi Kivity	22d2a815c9	transport: server.hh: trim unneeded cql3 includes query_processor.hh can be replaced with a forward declaration, and result-message headers, and valuees.hh is unneeded. Closes #9238	2021-08-23 18:09:22 +03:00
Avi Kivity	115d14028b	Merge 'Allow multi-parameter user-defined aggregates' from Piotr Sarna Due to an overzealous assertion put in the code (in one of the last iterations, by the way!) it was impossible to create an aggregate which accepts multiple arguments. The behavior is now fixed, and a test case is provided for it. Tests: unit(release) Closes #9211 * github.com:scylladb/scylla: cql-pytest: add test case for UDA with multiple args cql3: fix aggregates with > 1 argument	2021-08-23 17:45:58 +03:00
Pavel Solodovnikov	22794efc22	db: add experimental option for raft Introduce `raft` experimental option. Adjust the tests accordingly to accommodate the new option. It's not enabled by default when providing `--experimental=true` config option and should be requested explicitly via `--experimental-options=raft` config option. Hide the code related to `raft_group_registry` behind the switch. The service object is still constructed but no initialization is performed (`init()` is not called) if the flag is not set. Later, other raft-related things, such as raft schema changes, will also use this flag. Also, don't introduce a corresponding gossiper feature just yet, because again, it should be done after the raft schema changes API contract is stabilized. This will be done in a separate series, probably related to implementing the feature itself. Tests: unit(dev) Ref #9239. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>	2021-08-23 17:45:58 +03:00
Nadav Har'El	49aea3b301	Merge 'database: coroutinize schema load functions' from Avi Kivity Simple coroutinization of the schema load functions, leaving the code tidier. Test: unit (dev) Closes #9217 * github.com:scylladb/scylla: database: adjust indentation after coroutinization of schema table parsing code database: convert database::parse_schema_tables() to a coroutine database: remove unneeded temporary in do_parse_schema_tables() database: convert do_parse_schema_tables() to a coroutine	2021-08-23 17:45:58 +03:00
Nadav Har'El	d598a94b43	Merge: everywhere: mark deferred actions noexcept Merged patch series by Benny Halevy: Prepare for updating seastar submodule to a change that requires deferred actions to be noexcept (and return void). Test: unit(dev, debug) * tag 'deferred_action-noexcept-v1' of github.com:bhalevy/scylla: everywhere: make deferred actions noexcept cql3: prepare_context: mark methods noexcept commitlog: segment, segment_manager: mark methods noexcept everywhere: cleanup defer.hh includes	2021-08-23 11:16:17 +03:00
Avi Kivity	1b492396c1	stream_session.cc: trim unneeded includes stream_session.cc doesn't need storage_proxy, or sstables, or the system keyspace. Remove them. Closes #9230	2021-08-23 10:57:04 +03:00
Eliran Sinvani	b33479f731	Micro Benchmark: Fix division by zero in 'perf_fast_forward' Commit `8d6e575` introduced a new stat, instructions per fragment. Computing this new stat can end with a division by zero when the number of fragments read is 0. Here we fix it by reporting 0 ins/f when no fragments were read. Fixes #9231 Closes #9232	2021-08-23 10:55:44 +03:00
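The guard described in the entry above can be sketched as follows. This is a minimal illustration, not the actual `perf_fast_forward` code: the function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from perf_fast_forward): computes the
// "instructions per fragment" stat and reports 0 when no fragments were
// read, instead of dividing by zero.
constexpr double instructions_per_fragment(std::uint64_t instructions,
                                           std::uint64_t fragments) {
    if (fragments == 0) {
        return 0.0;
    }
    return static_cast<double>(instructions) / static_cast<double>(fragments);
}

// Small self-check showing both branches.
inline void check_instructions_per_fragment() {
    assert(instructions_per_fragment(100, 0) == 0.0);
    assert(instructions_per_fragment(100, 4) == 25.0);
}
```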
Avi Kivity	6221b90b89	secondary_index_manager: stop including expression.hh Use a forward declaration of cql3::expr::oper_t to reduce the number of translation units depending on expression.hh. Before: $ find build/dev -name '.d' | xargs cat | grep -c expression.hh 272 After: $ find build/dev -name '.d' | xargs cat | grep -c expression.hh 154 Some translation units adjust their includes to restore access to required headers. Closes #9229	2021-08-22 21:21:46 +03:00
Benny Halevy	e9aff2426e	everywhere: make deferred actions noexcept Prepare for updating seastar submodule to a change that requires deferred actions to be noexcept (and return void). Test: unit(dev, debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:52 +03:00
Benny Halevy	eba4191223	cql3: prepare_context: mark methods noexcept Prepare for marking deferred actions noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:40 +03:00
Benny Halevy	ef8ec54970	commitlog: segment, segment_manager: mark methods noexcept Prepare for marking deferred actions noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:40 +03:00
Benny Halevy	4439e5c132	everywhere: cleanup defer.hh includes Get rid of unused includes of seastar/util/{defer,closeable}.hh and add a few that are missing from source files. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:39 +03:00
Vlad Zolotarov	7bd1bcd779	loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead The iterators interface of loading_shared_values/loading_cache is dangerous/fragile because an iterator doesn't "lock" the entry it points to. If there is a preemption point between acquiring a non-end() iterator and dereferencing it, the corresponding cache entry may already have been evicted (for whatever reason, e.g. cache size constraints or expiration), and then dereferencing may end up in a use-after-free. We don't have any protection against this in the value_extractor_fn today. And this is in addition to #8920. So, instead of trying to fix the iterator interface, this patch kills two birds with one stone: we are ditching the iterators interface completely and returning value_ptr from find(...) instead - the same one we are returning from the loading_cache::get_ptr(...) asynchronous APIs. A similar rework is done to loading_shared_values, which loading_cache is based on: we drop the iterators interface and return loading_shared_values::entry_ptr from find(...) instead. loading_cache::value_ptr already takes care of "lock"ing the returned value so that it remains readable even if it's evicted from the cache by the time one tries to read it. And of course it also takes care of updating the last read time stamp and moving the corresponding item to the top of the MRU list. Fixes #8920 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>	2021-08-22 16:49:40 +03:00
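The idea of returning an owning pointer from `find(...)` rather than an iterator can be illustrated with a toy cache. This is only a sketch under stated assumptions: `toy_cache` and its methods are hypothetical, `std::shared_ptr` stands in for the real `value_ptr`, and none of the MRU or timestamp bookkeeping is modelled.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical toy cache, not the loading_cache API. find() hands out a
// shared_ptr, so the value stays readable even if the entry is evicted
// between the lookup and the read.
class toy_cache {
public:
    using value_ptr = std::shared_ptr<const std::string>;

    void put(const std::string& key, std::string value) {
        _entries[key] = std::make_shared<const std::string>(std::move(value));
    }

    // Returns a null pointer when the key is absent.
    value_ptr find(const std::string& key) const {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return nullptr;
        }
        return it->second;
    }

    void evict(const std::string& key) { _entries.erase(key); }

private:
    std::unordered_map<std::string, value_ptr> _entries;
};

// Eviction after the lookup does not invalidate the pointer already held.
inline void check_toy_cache() {
    toy_cache cache;
    cache.put("k", "v");
    auto held = cache.find("k");
    cache.evict("k");
    assert(cache.find("k") == nullptr);
    assert(held && *held == "v");
}
```

With an iterator-based interface the equivalent read after `evict` would touch freed memory; here the held pointer keeps the value alive.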
Takuya ASADA	5b62bebbb6	scylla_io_setup: check root privilege on root mode This is a side effect of allowing scylla_io_setup to run in nonroot mode: the script can now be run by a non-root user even when the installation is not in nonroot mode. As a result, the script eventually fails to write io_properties.yaml with a permission denied error. Since the evaluation takes a long time, we should run the permission check before starting it. We need to add the root privilege check again, but skip it in nonroot mode. Fixes #8915 Closes #8984	2021-08-22 16:49:40 +03:00
Botond Dénes	714ff8b758	docs/guides/debugging.md: mention the debuginfo package pitfall Add a note to the "Obtaining the relocatable packages" section and a separate entry to Troubleshooting. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210819110459.159733-1-bdenes@scylladb.com>	2021-08-22 16:49:40 +03:00
Botond Dénes	13080794d6	docs/guides/debugging.md: recommend symlinking instead of installing When setting up the env. Install no longer works as it depends on systemd. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210819110419.159351-1-bdenes@scylladb.com>	2021-08-22 16:49:40 +03:00
Tomasz Grabiec	865d072756	Merge 'sstables: convert parse(high level types) to a coroutine' from Avi Kivity The parse() function of high-level sstable metadata types are trivial straight line code and can be easily simplified by conversion to coroutines. Test: unit (dev) Closes #9224 * github.com:scylladb/scylla: sstables: parse(*): adjust indentation after coroutine conversion sstables: parse(compression&): eliminate unnecessary indirection sstables: convert parse(compression&) to a coroutine sstables: convert parse(commitlog_interval&) to a coroutine sstables: parse(streaming_histogram&): eliminate unnecessary indirection sstables: convert parse(streaming_histogram&) to a coroutine sstables: convert parse(estimated_histogran&) to a coroutine sstables: convert parse(statistics&) to a coroutine sstables: convert parse(summary&) to a coroutine	2021-08-22 16:49:40 +03:00
Pavel Emelyanov	e02b39ca3d	code: Generalize tls::credentials_builder configuration All the places in code that configure the mentioned creds builder from client_|server_encryption_options now do it the same way. This patch generalizes it all in the utils:: helper. The alternator code "ignores" require_client_auth and truststore keys, but it's easy to make the generalized helper be compatible. Also make the new helper coroutinized from the beginning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 18:05:41 +03:00
Pavel Emelyanov	35209e7500	transport, redis: Do not assume fixed encryption options On start main() brushes up the client_encryption_options option so that any user of it sees it in some "clean" state and can avoid using get_or_default() to parse. This patch removes this assumption (and the cleaning code itself). Next patch will make use of it and relax the duplicated parsing complexity back. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 17:59:33 +03:00
Pavel Emelyanov	2f5941ca6f	messaging: Move encryption options parsing to ms Main collects a bunch of local variables from config and passes them as arguments to messaging service initialization helper. This patch replaces all these args with const config reference. The motivation is to facilitate next patching by providing the server encryption options k:v set right in the m.s. init code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 17:56:16 +03:00
Pavel Emelyanov	33c70e54bb	main: Open-code internode encryption misconfig warning There's a warning message printed when internode encryption is set up "incorrectly". The incorrectness "if" uses local variables that soon will be moved away. This patch makes the check rely purely on the config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 17:54:40 +03:00
Pavel Emelyanov	aa88527375	main, config: Move options parsing helpers The get_or_default and is_true are two aux bits that are used to parse the config options. The former is duplicated in the alternator code as well. Put both in utils namespace for future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 17:53:41 +03:00
Avi Kivity	4078d55961	sstables: parse(*): adjust indentation after coroutine conversion Verified with "git diff -w"	2021-08-18 19:10:26 +03:00
Avi Kivity	6f823eba5f	sstables: parse(compression&): eliminate unnecessary indirection lw_shared_ptr<> was used to keep the addresses of two integers stable, but this is now unnecessary.	2021-08-18 19:09:55 +03:00
Avi Kivity	bd05bc40f4	sstables: convert parse(compression&) to a coroutine	2021-08-18 19:09:55 +03:00
Avi Kivity	3626a76c53	sstables: convert parse(commitlog_interval&) to a coroutine	2021-08-18 19:09:55 +03:00
Avi Kivity	8699bda724	sstables: parse(streaming_histogram&): eliminate unnecessary indirection The pre-coroutine code used a unique_ptr to keep the address of a disk_array stable, but this is now unnecessary.	2021-08-18 19:09:20 +03:00
Avi Kivity	ad27824623	sstables: convert parse(streaming_histogram&) to a coroutine	2021-08-18 19:08:53 +03:00
Avi Kivity	f8b2f0449c	sstables: convert parse(estimated_histogran&) to a coroutine	2021-08-18 19:08:53 +03:00
Avi Kivity	71c69fb9e2	sstables: convert parse(statistics&) to a coroutine	2021-08-18 19:08:53 +03:00
Avi Kivity	bd6460f00a	sstables: convert parse(summary&) to a coroutine	2021-08-18 18:21:33 +03:00
Pavel Solodovnikov	f98cb96506	raft: raft_sys_table_storage_test: don't use initializer lists inside loops and coroutines Workaround for Clang bug: https://bugs.llvm.org/show_bug.cgi?id=51515 When compiled on aarch64 with ASAN support and -Og/-Oz/-Os optimization level, `raft_sys_table_storage::do_store_log_entries` crashes during the tests. ASAN incorrectly reports `stack-use-after-return` on `std::vector` list initialization after initial coroutine suspension (initializer list's data pointer starts to point to garbage). The workaround is simple: don't use initializer lists in such case and replace with a series of `emplace_back` calls. Tests: unit(debug, aarch64) Fixes #9178 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210818102038.92509-1-pa.solodovnikov@scylladb.com>	2021-08-18 13:32:55 +03:00
Takuya ASADA	462961ca51	unified: fix handling --supervisor option We need to pass --supervisor option just for scylla-server module, and also pass --packaging option to scylla-jmx module to avoid running systemctl command, since the option may run in container, and container may not have systemd. Fixes #9141 Closes #9142	2021-08-18 13:17:08 +03:00
Avi Kivity	5450af8e1b	database: coroutinize stop() Make the code tidier. The conversion is not mechanical: the finally block is converted to straight line code. stop()/close() must not fail anyway, and we cannot recover from such failures. The when_all_succeed() for stopping the semaphores is also converted to straight-line code - there is no advantage to stopping them in parallel, as we're just waiting for running tasks to complete and clean up. Test: unit (dev) Closes #9218	2021-08-18 10:57:44 +02:00
Avi Kivity	bae67dcce6	Merge "mutation_fragment: Specialize appending_hash for it" from Pavel E " The mutation_fragments hashing code sitting in row-level repair upsets clang and makes it spend 20 minutes compiling itself. This set speeds this up greatly by moving the hashing code into the mutation_fragment.cc and turning it into the appending_hash<> specialisation. A simple sanity checking test makes sure this doesn't change resulting hash values. tests: unit.hashers_test(dev, release) // hash values matched, phew dtest.repair_additional_test.repair_large_partition_existing_rows_test(release) " * 'br-row-level-comp-speedup-2.2' of https://github.com/xemul/scylla: mutation_fragment: Specialize appending_hash for it tests: Add sanity check for hashing mutation_fragments	2021-08-18 11:25:48 +03:00
Pavel Emelyanov	b5fee07527	mutation_fragment: Specialize appending_hash for it Row-level repair hashes the mutation fragment and wraps this into a private fragment_hasher class. For some reason it takes ~20 minutes for clang to compile the row_level.o with -O3 level (release mode). Putting the whole fragment_hasher into a dedicated file reduces the compilation times ~9 times. However, it seems more natural not to move the fragment_hasher around but to specialize the appending_hash<> for mutation_fragment and make row_level.cc code just call feed_hash(). Compilation times (release mode): before after row_level.o 19m34s 2m4s mutation_fragment.o 13s 17s Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-18 09:17:40 +03:00
Pavel Emelyanov	34f8f10123	tests: Add sanity check for hashing mutation_fragments Next patch is going to change the way row-level repair code hashes mutation_fragment objects. This patch prepares the sanity check for the hash values not be accidentally changed by hashing some simple fragments and comparing them against known expected values. The hash_mutation_fragment_for_test helper is added for this patch only and will be removed really soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-18 09:17:40 +03:00
Raphael S. Carvalho	3a1cf3aa88	database: document database::get_keyspace_local_ranges() Documentation was extracted from abstract_replication_strategy::get_ranges(), which says: // get_ranges() returns the list of ranges held by the given endpoint. // The list is sorted, and its elements are non overlapping and non wrap-around. That's important because users of get_keyspace_local_ranges() expect that the returned list is both sorted and non overlapping, so let's document it to prevent someone from removing any of these properties. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210805140628.537368-1-raphaelsc@scylladb.com>	2021-08-17 21:44:24 +03:00
Asias He	eaf4d2afb4	storage_service: Generate view update for load and stream Currently, the view will not be updated because the streaming reason is set to streaming::stream_reason::rebuild. On the receiver side, only streaming with the reason streaming::stream_reason::repair will trigger view update. Change the stream reason to repair to trigger view update for load and stream. This makes load_and_stream behave the same as nodetool refresh. Note: this is not very efficient, though. Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is loaded, it streams to 3 replica nodes, if we generate view updates, we will have 3 view updates for this replica (each of the peer nodes finds its peer and writes the view update to peer). After loading sst2 and sst3, we will have 9 view updates in total for a single partition. If we create the view after the load and stream process, we will only have 3 view updates for a single partition. Fixes #9205 Closes #9213	2021-08-17 21:44:24 +03:00
Avi Kivity	73d6f2798d	database: adjust indentation after coroutinization of schema table parsing code	2021-08-17 21:05:05 +03:00
Avi Kivity	4ca856157d	database: convert database::parse_schema_tables() to a coroutine In one case we have f = f.then(...), but we can just wait for the first future where it's created.	2021-08-17 21:00:15 +03:00
Avi Kivity	4f91953ebf	database: remove unneeded temporary in do_parse_schema_tables() The coroutine can keep the cf_name parameter alive, provided we pass it by value.	2021-08-17 20:45:41 +03:00
Avi Kivity	b2d5820d75	database: convert do_parse_schema_tables() to a coroutine	2021-08-17 20:44:28 +03:00
Tomasz Grabiec	9fe3e86368	db: Print more fields of read_command Message-Id: <20210810143752.420988-1-tgrabiec@scylladb.com>	2021-08-17 12:24:40 +03:00
Piotr Sarna	88238c7c2a	cql-pytest: add test case for UDA with multiple args A test case for an aggregate which works on multiple parameters is added.	2021-08-16 19:52:50 +02:00
Piotr Sarna	d83d212ee5	cql3: fix aggregates with > 1 argument It was impossible to use an aggregate with more than 1 argument due to an overzealous assert, which is now removed.	2021-08-16 19:49:03 +02:00
Pavel Emelyanov	9c7bcd1d85	bound_view: Rewrite tri_compare() tail The new implementation is shorter and allows compiler to produce nicer assembly. In particular clang-11 and -O3 flag: Was: if (d1 == d2) { return w1 - w2; } return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0)); 89 f0 mov %esi,%eax 39 d7 cmp %edx,%edi 74 13 je 403f69 <_Z7cmp_intiiii+0x19> 7d 0a jge 403f62 <_Z7cmp_intiiii+0x12> 31 c9 xor %ecx,%ecx 85 c0 test %eax,%eax 0f 9e c1 setle %cl 29 c8 sub %ecx,%eax c3 retq 31 c0 xor %eax,%eax 85 c9 test %ecx,%ecx 0f 9e c0 setle %al 29 c8 sub %ecx,%eax c3 retq 14 instructions 2 cond jumps, 2 cond sets Now: return ((d1 <= d2) ? w1 << 1 : 1) - ((d2 <= d1) ? w2 << 1 : 1); 8d 04 36 lea (%rsi,%rsi,1),%eax 39 d7 cmp %edx,%edi be 01 00 00 00 mov $0x1,%esi 0f 4f c6 cmovg %esi,%eax 01 c9 add %ecx,%ecx 39 fa cmp %edi,%edx 0f 4f ce cmovg %esi,%ecx 29 c8 sub %ecx,%eax c3 retq 9 instructions, 0 cond jumps, 2 cond movs tests: unit(dev), perf(simple_query, release) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210730092629.18940-1-xemul@scylladb.com>	2021-08-16 17:17:27 +03:00
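The two expressions quoted in the entry above can be checked against each other. The sketch below assumes that the weights `w1` and `w2` only take the values -1, 0 and 1 (an assumption not stated in the commit message) and that only the sign of the result matters to callers. The function names are hypothetical, and `w * 2` is written in place of `w << 1` to avoid shifting a negative value.

```cpp
#include <cassert>

// Old tail, as quoted in the commit message.
constexpr int cmp_old(int d1, int w1, int d2, int w2) {
    if (d1 == d2) {
        return w1 - w2;
    }
    return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}

// New tail, with `w * 2` standing in for `w << 1`.
constexpr int cmp_new(int d1, int w1, int d2, int w2) {
    return ((d1 <= d2) ? w1 * 2 : 1) - ((d2 <= d1) ? w2 * 2 : 1);
}

constexpr int sign(int x) { return (x > 0) - (x < 0); }

// Exhaustively compares the signs of both tails for weights in {-1, 0, 1}
// and a small range of d values covering d1 < d2, d1 == d2 and d1 > d2.
constexpr bool tails_agree() {
    for (int d1 = 0; d1 <= 2; ++d1) {
        for (int d2 = 0; d2 <= 2; ++d2) {
            for (int w1 = -1; w1 <= 1; ++w1) {
                for (int w2 = -1; w2 <= 1; ++w2) {
                    if (sign(cmp_old(d1, w1, d2, w2)) != sign(cmp_new(d1, w1, d2, w2))) {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}

static_assert(tails_agree(), "old and new tails must agree in sign");
```

The magnitudes differ (the new form doubles the weights), so the comparison only holds for callers that look at the sign of the result.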
Tomasz Grabiec	09b575474b	Merge "test: raft: generators infrastructure with an actual random nemesis test" from Kamil Operations and generators can be composed to create more complex operations and generators. There are certain composition patterns useful for many different test scenarios. We implement a couple of such patterns. For example: - Given multiple different operation types, we can create a new operation type - `either_of` - which is a "union" of the original operation types. Executing `either_of` operation means executing an operation of one of the original types, but the specific type can be chosen in runtime. - Given a generator `g`, `op_limit(n, g)` is a new generator which limits the number of operations produced by `g`. - Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a new generator which spreads the operations from `g` roughly every `d` ticks. (The actual definition in code is more general and complex but the idea is similar.) Some of these patterns have corresponding notions in Jepsen, e.g. our `stagger` has a corresponding `stagger` in Jepsen (although our `stagger` is more general). Finally, we implement a test that uses this new infrastructure. Two `Executable` operations are implemented: - `raft_call` is for calling to a Raft cluster with a given state machine command, - `network_majority_grudge` partitions the network in half, putting the leader in the minority. We run a workload of these operations against a cluster of 5 nodes with 6 threads for executing the operations: one "nemesis thread" for `network_majority_grudge` and 5 "client threads" for `raft_call`. Each client thread randomly chooses a contact point which it tries first when executing a `raft_call`, but it can also "bounce" - call a different server when the previous returned "not_a_leader" (we use the generic "bouncing" wrapper to do this). For now we only print the resulting history. 
In a follow-up patchset we will analyze it for consistency anomalies. * kbr/raft-test-generator-v4: test: raft: randomized_nemesis_test: a basic generator test test: raft: generator: a library of basic generators test: raft: introduce generators test: raft: introduce `future_set` test: raft: randomized_nemesis_test: handle `raft::stopped_error` in timeout futures	2021-08-16 15:55:25 +02:00
Kamil Braun	3344ac8a6c	test: raft: randomized_nemesis_test: a basic generator test The previous commits introduced the basic generator concept and a library of most common composition patterns. In this commit we implement a test that uses this new infrastructure. Two `Executable` operations are implemented: - `raft_call` is for calling to a Raft cluster with a given state machine command, - `network_majority_grudge` partitions the network in half, putting the leader in the minority. We run a workload of these operations against a cluster of 5 nodes with 6 threads for executing the operations: one "nemesis thread" for `network_majority_grudge` and 5 "client threads" for `raft_call`. Each client thread randomly chooses a contact point which it tries first when executing a `raft_call`, but it can also "bounce" - call a different server when the previous returned "not_a_leader" (we use the generic "bouncing" wrapper to do this). For now we only print the resulting history. In a follow-up patchset we will analyze it for consistency anomalies.	2021-08-16 13:07:08 +02:00
Kamil Braun	66ec484730	test: raft: generator: a library of basic generators Operations and generators can be composed to create more complex operations and generators. There are certain composition patterns useful for many different test scenarios. This commit introduces a couple of such patterns. For example: - Given multiple different operation types, we can create a new operation type - `either_of` - which is a "union" of the original operation types. Executing `either_of` operation means executing an operation of one of the original types, but the specific type can be chosen in runtime. - Given a generator `g`, `op_limit(n, g)` is a new generator which limits the number of operations produced by `g`. - Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a new generator which spreads the operations from `g` roughly every `d` ticks. (The actual definition in code is more general and complex but the idea is similar.) And so on. Some of these patterns have corresponding notions in Jepsen, e.g. our `stagger` has a corresponding `stagger` in Jepsen (although our `stagger` is more general).	2021-08-16 13:07:08 +02:00
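The `op_limit(n, g)` pattern described above can be sketched with a purely functional generator whose `op()` returns both the produced operation and the generator's next state. This is a simplified illustration, not the test library's code: operations are plain `int`s, and `counting_gen`, `op_limit_gen` and `drain` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical generator: yields 0, 1, 2, ... forever.
struct counting_gen {
    int next = 0;

    // Returns the produced operation (here just an int) together with the
    // next state of the generator; the generator itself is never mutated.
    std::pair<std::optional<int>, counting_gen> op() const {
        return {next, counting_gen{next + 1}};
    }
};

// Wraps a generator and stops after `remaining` operations.
template <typename G>
struct op_limit_gen {
    std::size_t remaining;
    G inner;

    std::pair<std::optional<int>, op_limit_gen> op() const {
        if (remaining == 0) {
            return {std::nullopt, *this};
        }
        auto [operation, next_inner] = inner.op();
        return {operation, op_limit_gen{remaining - 1, next_inner}};
    }
};

template <typename G>
op_limit_gen<G> op_limit(std::size_t n, G g) {
    return {n, std::move(g)};
}

// Fetches operations until the generator is exhausted.
template <typename G>
std::vector<int> drain(G g) {
    std::vector<int> out;
    while (true) {
        auto [operation, next_state] = g.op();
        if (!operation) {
            break;
        }
        out.push_back(*operation);
        g = next_state;
    }
    return out;
}

inline void check_op_limit() {
    assert((drain(op_limit(3, counting_gen{})) == std::vector<int>{0, 1, 2}));
}
```

Because each `op()` call returns a fresh state instead of mutating the generator, wrappers such as `op_limit_gen` compose by simply carrying the inner state along.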
Kamil Braun	d8863c5a7b	test: raft: introduce generators We introduce the concepts of "operations" and "generators", basic building blocks that will allow us to declaratively write randomized tests for torturing simulated Raft clusters. An "operation" is a data structure representing a computation which may cause side effects such as calling a Raft cluster or partitioning the network, represented in the code with the `Executable` concept. It has an `execute` function performing the computation and returns a result of type `result_type`. Different computations of the same type share state of type `state_type`. The state can, for example, contain database handles. Each execution is performed on an abstract `thread' (represented by a `thread_id`) and has a logical starting time point. The thread and start point together form the execution's `context` which is passed as a reference to `execute`. Two operations may be called in parallel only if they are on different threads. A generator, represented through the `Generator` concept, produces a sequence of operations. An operation can be fetched from a generator using the `op` function, which also returns the next state of the generator (generators are purely functional data structures). The generator concept is inspired by the generators in the Jepsen testing library for distributed systems. We also implement `interpreter` which "interprets", or "runs", a given generator, by fetching operations from the generator and executing them with concurrency controlled by the abstract threads. The algorithm used in the interpreter is also similar to the interpreter algorithm in Jepsen, although there are differences. Most notably we don't have a "worker" concept - everything runs on a single shard; but we use "abstract threads" combined with futures for concurrency. There is also no notion of "process". 
Finally, the interpreter doesn't keep an explicit history, but instead uses a callback `Recorder` to notify the user about operation invocations and completions. The user can decide to save these events in a history, or perhaps they can analyze them on the fly using constant memory.	2021-08-16 13:07:08 +02:00
Kamil Braun	421b1b9494	test: raft: introduce `future_set` A set of futures that can be polled. Polling the set (`poll` function) returns the value of one of the futures which became available or `std::nullopt` if the given logical duration passes (according to the given timer), whichever event happens first. The current implementation assumes sequential polling. New futures can be added to the set with `add`. All futures can be removed from the set with `release`.	2021-08-16 13:07:08 +02:00
Kamil Braun	a5e92e1c45	test: raft: randomized_nemesis_test: handle `raft::stopped_error` in timeout futures The timeout futures in `call` and `reconfigure` may be discarded after Raft servers were `abort()`ed which would result in `raft::stopped_error` and the test complained about discarded exceptional futures. Discard these errors explicitly.	2021-08-16 13:07:08 +02:00
Takuya ASADA	cb19048186	docker: use dist/common/supervisor script for docker The supervisor scripts for Docker and for the offline installer are almost the same, so drop the Docker one and share the same code to deduplicate them. Closes #9143 Fixes #9194	2021-08-16 13:36:14 +03:00
Avi Kivity	0ba697d515	Merge 'Add service level config change subscription API' from Eliran Sinvani In order to decouple the service level controller from the systems logic, we introduce an API for subscribing to configuration changes. The timing of the call was determined with resource creation and destruction in mind. An API subscriber can create resources that will be available from the very start of the service level's existence; it can also destroy them, since the service level is guaranteed not to exist anymore at the time of the call to the deletion notification callback. Testing: unit tests - all + a newly added one. dtests - next-gating (dev mode) Closes #9097 * github.com:scylladb/scylla: service level controller: Subscriber API unit test Service Level Controller: Add a listener API for service level config changes	2021-08-16 11:47:33 +03:00
Eliran Sinvani	403db8e943	service level controller: Subscriber API unit test Here we add a very simple unit test for the configuration change API.	2021-08-16 11:38:59 +03:00
Eliran Sinvani	47d3862b63	Service Level Controller: Add a listener API for service level config changes This change adds an API for registering a listener for service_level configuration changes. It notifies about the removal, addition and change of a service level. The hidden assumption is that some listeners are going to create and/or manage service level specific resources, and this is what guided the timing of the call to the subscriber. Addition and change of a service level are called before the actual change takes place; this guarantees that resource creation can take place before the service level or new config starts to be used. The deletion notification is called only after the deletion took place, and this guarantees that the service level can't be active and the resources created can be safely destroyed.	2021-08-16 11:38:59 +03:00
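The notification ordering described above (addition announced before the change, deletion announced after it) can be sketched as follows. All names here are hypothetical and do not reflect the actual service level controller interface; the change notification is omitted for brevity.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical subscriber interface, not the real Scylla API.
struct config_listener {
    virtual ~config_listener() = default;
    virtual void on_before_add(const std::string& name) = 0;
    virtual void on_after_remove(const std::string& name) = 0;
};

// Toy controller that notifies listeners with the ordering from the commit:
// "add" callbacks run before the level exists, "remove" callbacks after it is gone.
class toy_controller {
public:
    void subscribe(config_listener& listener) { _listeners.push_back(&listener); }

    bool has(const std::string& name) const { return _levels.count(name) != 0; }

    void add(const std::string& name, int shares) {
        for (auto* l : _listeners) {
            l->on_before_add(name);   // resources can be created here
        }
        _levels[name] = shares;
    }

    void remove(const std::string& name) {
        _levels.erase(name);
        for (auto* l : _listeners) {
            l->on_after_remove(name); // resources can be safely destroyed here
        }
    }

private:
    std::map<std::string, int> _levels;
    std::vector<config_listener*> _listeners;
};

// Listener that records whether the level existed at callback time.
struct recording_listener : config_listener {
    const toy_controller& controller;
    bool existed_on_add = true;
    bool existed_on_remove = true;

    explicit recording_listener(const toy_controller& c) : controller(c) {}

    void on_before_add(const std::string& name) override {
        existed_on_add = controller.has(name);
    }
    void on_after_remove(const std::string& name) override {
        existed_on_remove = controller.has(name);
    }
};

inline void check_listener_ordering() {
    toy_controller controller;
    recording_listener listener(controller);
    controller.subscribe(listener);
    controller.add("gold", 1000);
    controller.remove("gold");
    assert(!listener.existed_on_add);
    assert(!listener.existed_on_remove);
}
```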
Pavel Emelyanov	6dd67012bb	main: Fix internode encryption warning check It should check for dc || rack, not dc || dc. The correct behavior is described in both -- the warning message and the commit that introduced it (`a0745f94`). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210730094549.19477-1-xemul@scylladb.com>	2021-08-16 11:14:20 +03:00
Calle Wilund	3633c077be	commitlog/config: Make hard size enforcement false by default + add config opt Refs #9053 Flips default for commitlog disk footprint hard limit enforcement to off due to observed latency stalls with stress runs. Instead adds an optional flag "commitlog_use_hard_size_limit" which can be turned on to actually enforce it. Sort of a tape and string fix until we can properly tweak the balance between cl & sstable flush rate. Closes #9195	2021-08-15 15:10:27 +03:00
Asias He	97bb2e47ff	storage_service: Enable Repair Based Node Operations (RBNO) by default for replace We decided to enable repair based node operations by default for replace node operations. To do that, a new option --allowed-repair-based-node-ops is added. It lists the node operations that are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild. By default, --allowed-repair-based-node-ops is set to contain "replace". Note, the existing option --enable-repair-based-node-ops is still in play. It is the global switch to enable or disable the feature. Examples: - To enable bootstrap and replace node ops: ``` scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap ``` - To disable any repair based node ops: ``` scylla --enable-repair-based-node-ops false ``` Closes #9197	2021-08-15 13:30:46 +03:00
Nadav Har'El	b53eeb8a6c	Merge 'Enable user-defined aggregates' from Piotr Sarna It turns out that user-defined aggregates did not need any elaborate coding in order to make them exposed to the users. The whole infrastructure is already there, including system schema tables and support for running aggregate queries, so this series simply adds lots and lots of boilerplate glue code to make UDA usable. It also comes with a simple test which shows that it's possible to define and use such an aggregate. Performance not tested, since user-defined functions are still experimental, so nothing really changes in this matter. Tests: unit(release) Fixes #7201 Closes #9165 * github.com:scylladb/scylla: cql-pytest: add a test suite for user-defined aggregates cql-pytest: add context managers for functions and aggregates cql3: enable user-defined aggregates in CQL grammar cql3: add statements for user-defined aggregates cql3,functions: add checking if a function is used in UDA gms: add UDA feature migration_manager: add migrating user-defined aggregates db,schema_tables: add handling user-defined aggregates pagers: make a lambda mutable in fetch_page cql3: wrap handling paging result with with_thread_if_needed cql3: correctly mark function selectors as needing threads cql3: add user-define aggregate representation	2021-08-14 12:14:12 +03:00
Piotr Sarna	38c1fd0762	cql-pytest: add a test suite for user-defined aggregates The test suite now consists of a single user aggregate: a custom implementation for existing avg() built-in function, as well as a couple of cases for catching incorrect operations, like using wrong function signatures or dropping used functions.	2021-08-13 11:16:52 +02:00
Piotr Sarna	5f773d04d2	cql-pytest: add context managers for functions and aggregates These context managers can be used to create temporary user-defined functions and user-defined aggregates.	2021-08-13 11:16:52 +02:00
Piotr Sarna	2ebf018e74	cql3: enable user-defined aggregates in CQL grammar Statements for creating and dropping user-defined aggregates are now accepted by the grammar and can be used by the users.	2021-08-13 11:16:52 +02:00
Piotr Sarna	ec25cf965e	cql3: add statements for user-defined aggregates The following statements are added: - CREATE AGGREGATE - DROP AGGREGATE	2021-08-13 11:16:52 +02:00
Piotr Sarna	a9ae753cd6	cql3,functions: add checking if a function is used in UDA If a function is used by a user-defined aggregate, it must not be dropped or the aggregate would be left with a dangling function.	2021-08-13 11:16:47 +02:00
Piotr Sarna	da67c594c8	gms: add UDA feature UDA stands for user-defined aggregates and the feature implies that the whole cluster supports them.	2021-08-13 11:14:12 +02:00
Piotr Sarna	e1be04852b	migration_manager: add migrating user-defined aggregates User-defined aggregate creation and deletion can now be announced.	2021-08-13 11:14:12 +02:00
Piotr Sarna	84876a165b	db,schema_tables: add handling user-defined aggregates Aggregates are propagated, created and dropped very similarly to user-defined functions - a set of helper functions for aggregates are added based on the UDF implementation.	2021-08-13 11:14:11 +02:00
Piotr Sarna	ad2093539b	pagers: make a lambda mutable in fetch_page The lambda passed to with_thread_if_needed helper function relies on moving its captured parameters, so it's made mutable in order to avoid copying.	2021-08-13 11:13:43 +02:00
Piotr Sarna	260604d053	cql3: wrap handling paging result with with_thread_if_needed One of the pagers did not spawn a Seastar thread even if it was required by its underlying selectors - the behavior is now fixed.	2021-08-13 11:13:43 +02:00
Piotr Sarna	cac321cd12	cql3: correctly mark function selectors as needing threads Function call selectors correctly checked if their arguments are required to run in threaded context, but forgot to check the function itself - which is now done.	2021-08-13 11:13:43 +02:00
Piotr Sarna	ee81453596	cql3: add user-defined aggregate representation A user-defined aggregate is represented as an aggregate which calls its state function on each input row and then finalizes its execution by calling its final function on the final state, after all rows have been processed.	2021-08-13 11:13:41 +02:00
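The state-function / final-function model described in the commit above can be illustrated with a small Python sketch. It is not the cql3 code: `run_aggregate`, `avg_state` and `avg_final` are hypothetical names, and the example mirrors the custom avg() used by the test suite only in spirit.

```python
def run_aggregate(rows, state_func, final_func, init_state):
    """Fold the state function over every row, then finalize the state."""
    state = init_state
    for row in rows:
        state = state_func(state, row)
    return final_func(state)


def avg_state(state, value):
    # The state is a (count, sum) pair that is updated once per input row.
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    # Called once, after all rows were processed.
    count, total = state
    return total / count if count else None


result = run_aggregate([1, 2, 3, 4], avg_state, avg_final, (0, 0))
```

The state function sees one row at a time, so the aggregate never needs to hold all rows in memory; only the final function turns the accumulated state into the result.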
Piotr Sarna	58196e8ea6	db,view: avoid ignoring failed future in background view updates The code for handling background view updates used to propagate exceptions unconditionally, which leads to "exceptional future ignored" warnings if the update was put to background. From now on, the exception is only propagated if its future is actually waited on. Fixes #6187 Tested manually, the warning was not observed after the patch Closes #9179	2021-08-12 17:32:35 +03:00
Piotr Sarna	ea0e0c924d	configure,install-dependencies: add wasmtime dependency If the wasmtime library is available for download, it will be set up by install-dependencies and prepared for linking. Closes #9151 [avi: regenerate toolchain, which also updates clang to 12.0.1]	2021-08-12 12:33:43 +03:00
Asias He	cc44edb4e2	database: Detemplate run_async I initially tried to use a noncopyable_function to avoid the unnecessary template usage. However, since database::apply_in_memory is a hot function, it is better to use with_gate directly. The run_async function does nothing but call with_gate anyway. Closes #9160	2021-08-12 07:53:10 +03:00
Takuya ASADA	e5bb88b69a	scylla_cpuscaling_setup: change scaling_governor path On some environments, /sys/devices/system/cpu/cpufreq/policy0/scaling_governor does not exist even though CPU scaling is supported. In contrast, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is available in both kinds of environment, so we should switch to it. Fixes #9191 Closes #9193	2021-08-11 15:31:14 +03:00
Nadav Har'El	89724533f8	test/cql-pytest: CREATE INDEX IF NOT EXISTS vs. Cassandra What should the following pair of statements do? CREATE INDEX xyz ON tbl(a) CREATE INDEX IF NOT EXISTS xyz ON tbl(b) There are two reasonable choices: 1. An index with the name xyz already exists, so the second command should do nothing, because of the "IF NOT EXISTS". 2. The index on tbl(b) does not yet exist, so the command should try to create it. And when it can't (because the name xyz is already taken), it should produce an error message. Currently, Cassandra went with choice 1, and Scylla went with choice 2. After some discussions on the mailing list, we agreed that Scylla's choice is the better one and Cassandra's choice could be considered a bug: The "IF NOT EXISTS" feature is meant to allow idempotent creation of an index - and not to make it easy to make mistakes without noticing. The second command listed above is most likely a mistake by the user, not anything intentional: The command intended to ensure that an index on column b exists, but after the silent success of the command, no such index exists. So this patch doesn't change any Scylla code (it just adds a comment), and rather it adds a test which "enshrines" the current behavior. The test passes on Scylla and fails on Cassandra so we tag it "cassandra_bug", meaning that we consider this difference to be intentional and we consider Cassandra's behavior in this case to be wrong. Fixes #9182. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210811113906.2105644-1-nyh@scylladb.com>	2021-08-11 13:41:58 +02:00
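Choice 2 from the commit above can be modelled with a toy catalog. This is only a sketch of the semantics, not Scylla's code: `create_index` and `InvalidRequest` are hypothetical names, the catalog is a plain dict from index name to indexed column, and treating "exists" as "some index on that column exists" is an assumption of the model.

```python
class InvalidRequest(Exception):
    """Hypothetical error type standing in for a CQL error response."""


def create_index(indexes: dict, name: str, column: str,
                 if_not_exists: bool = False) -> bool:
    """Return True if an index was created, False if the call was a no-op."""
    if column in indexes.values():
        # The requested index already exists: IF NOT EXISTS makes this a no-op.
        if if_not_exists:
            return False
        raise InvalidRequest(f"an index on {column!r} already exists")
    if name in indexes:
        # The index does not exist yet, but its name is taken by another
        # index. Under choice 2 this is an error even with IF NOT EXISTS.
        raise InvalidRequest(f"index name {name!r} is already taken")
    indexes[name] = column
    return True
```

In this model, repeating the same statement with IF NOT EXISTS stays idempotent, while reusing the name `xyz` for a different column fails loudly instead of silently leaving column `b` unindexed.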
Asias He	ce8fd051c9	storage_service: Fix argument in send_meta_data::do_receive The extra status print is not needed in the log. Fixes the following error: ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service - service/storage_service.cc:3150 @do_receive: failed to log message: fmt='send_meta_data: got error code={}, from node={}, status={}': fmt::v7::format_error (argument not found) Fixes #9183 Closes #9189	2021-08-11 11:35:30 +02:00
Asias He	040b626235	table: Fix is_shared assert for load and stream The reader is used by load and stream to read sstables from the upload directory which are not guaranteed to belong to the local shard. Use make_range_sstable_reader instead of make_local_shard_sstable_reader. Tests: backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test Fixes #9173 Closes #9185	2021-08-11 12:18:40 +03:00
Piotr Jastrzebski	db4c9199f5	sstables: remove unused uppermost_bound from clustering_ranges_walker and mutation_fragment_filter Those methods are never used so it's better not to keep dead code around. Tests: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #9188	2021-08-11 10:54:59 +02:00
Nadav Har'El	49ca1f86b2	Merge 'hints: error injection for pausing hint replay' from Piotr Dulikowski Adds a `hinted_handoff_pause_hint_replay` error injection point. When enabled, hint replay logic behaves as if it is run, but it gets stuck in a loop and no hints are actually sent until the point is disabled again. This injection point will be useful in dtests - it will simulate infinitely slow hint replay and will make it possible to test how some operations behave while hint replay logic is running. The first intended use case of this injection point is testing the HTTP API for waiting for hints (#8728). Refs: #6649 Closes #8801 * github.com:scylladb/scylla: hints: fix indentation after previous patch hints: error injection for pausing hint replay hints: coroutinize lambda inside send_one_file	2021-08-11 11:42:29 +03:00
Piotr Dulikowski	f2e1339f38	hints: use an abort_source with sleep_abortable in flush+send loop Each hint sender runs an asynchronous loop which tries to flush and then send hints. Between each attempt, it sleeps at most 10 seconds using sleep_abortable. However, an overload of sleep_abortable is used which does not take an abort_source - it should abort the sleep in case Seastar handles a SIGINT or SIGTERM signal. However, in order for that to work, the application must not prevent default handling of those signals in Seastar - but Scylla explicitly does it by disabling the `auto_handle_sigint_sigterm` option in reactor config. As a result, those sleeps are never aborted, and - because we wait for the async loops to stop - they can delay shutdown by up to 10 seconds. To fix that, an abort_source is added to the hints sender, and the abort_source is triggered when the corresponding sender is requested to stop. Fixes: #9176 Closes #9177	2021-08-11 10:32:53 +02:00
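The idea behind the fix above, a sleep that wakes up as soon as a stop is requested, can be sketched in Python with `threading.Event`. This is an analogy rather than the Seastar API: `AbortSource` and `run_sender` are hypothetical stand-ins for `abort_source` and the flush+send loop.

```python
import threading


class AbortSource:
    """Hypothetical stand-in for an abort source owned by one sender."""

    def __init__(self):
        self._event = threading.Event()

    def request_abort(self):
        self._event.set()

    def sleep(self, seconds: float) -> bool:
        """Sleep for up to `seconds`; return True if the sleep was aborted."""
        return self._event.wait(timeout=seconds)


def run_sender(abort: AbortSource, flush_and_send, period: float = 10.0):
    """Flush and send, then sleep; stop as soon as an abort is requested."""
    while True:
        flush_and_send()
        if abort.sleep(period):
            break
```

Because the sleep is tied to the sender's own abort source, stopping the sender no longer has to wait for the remainder of the 10-second period.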
Tomasz Grabiec	e177cd382b	db: Remove superfluous } from read_command printout Message-Id: <20210810131429.407903-1-tgrabiec@scylladb.com>	2021-08-10 17:32:34 +03:00
Michał Chojnowski	2aa0a2e6a1	test: perf: perf_collection: use the optimized version of bptree Since key_compare does not conform to SimpleLessCompare, the benchmark tests the non-optimized version of bptree (without SIMD key search). We want to test the optimized version. Closes #9180	2021-08-10 17:04:34 +03:00
Nadav Har'El	65381bd155	test/alternator: add tests for expression length limits The DynamoDB documentation https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html describes several hard limits on the size of expressions (ProjectionExpression, ConditionExpression, UpdateExpression, FilterExpression) and various elements they contain. In this patch we begin testing those limits with a comprehensive test for the length of each of these four expressions: we test that lengths up to (and including) 4096 bytes are allowed but longer expressions are rejected. We also add TODOs for additional documented limits that should be tested in the future. Currently, this test passes on DynamoDB but xfails on Alternator because Alternator does not enforce any limits on the expression length. I don't think this is a real problem, and we may consider keeping it this way, but we should at least be aware that this difference exists and an xfailing test will remind us. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210810081948.2012120-2-nyh@scylladb.com>	2021-08-10 12:06:21 +02:00
Nadav Har'El	9d49a32486	test/alternator: add tests for attribute name limits DynamoDB limits attribute names in items to lengths of up to 65535 bytes, but in some cases (such as key attributes) the limit is lower - 255. This patch adds tests for many of these cases. All the new tests pass on DynamoDB, but some still xfail on Alternator because Alternator is too lenient - sometimes allowing longer attribute names than DynamoDB allows. While this may sound great, it also has downsides: The oversized attribute names perform badly, and as they grow, Alternator's internal limits will be reached as well, and result in an unsightly "internal server error" being reported instead of the expected user-friendly error. Refs #9169. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210810081948.2012120-1-nyh@scylladb.com>	2021-08-10 12:06:13 +02:00
Avi Kivity	112cee4960	Merge "make sstable::make_reader() return flat_mutation_reader_v2" from Michael " * Make `sstable::make_reader()` return `flat_mutation_reader_v2`, retain the old one as `sstable::make_reader_v1()` * Start weaning tests off `sstable::make_reader_v1()` (done all the easy ones, i.e. those not involving range tombstones) " * tag 'sstable-make-reader-v2-v1' of github.com:cmm/scylla: tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2 tests: get rid of sstable::make_reader_v1() in broken_sstable_test tests: get rid of sstable::make_reader_v1() in the trivial cases sstables: make sstable::make_reader() return flat_mutation_reader_v2	2021-08-10 12:57:10 +03:00
Avi Kivity	a7ef826c2b	Merge "Fold validation compaction into scrub" from Botond " Validation compaction -- although I still maintain that it is a good descriptive name -- was an unfortunate choice for the underlying functionality because Origin has burned the name already as it uses it for a compaction type used during repair. This opens the door for confusion for users coming from Cassandra who will associate Validation compaction with the purpose it is used for in Origin. Additionally, since Origin's validation compaction was not user initiated, it didn't have a corresponding `nodetool` command to start it. Adding such a command would create an operational difference between us and Origin. To avoid all this we fold validation compaction into scrub compaction, under a new "validation" mode. I decided against using the also suggested `--dry-mode` flag as I feel that a new mode is a more natural choice, we don't have to define how it interacts with all the other modes, unlike with a `--dry-mode` flag. Fixes: #7736 Tests: unit(dev), manual(REST API) " * 'scrub-validation-mode/v2' of https://github.com/denesb/scylla: compaction/compaction_descriptor: add comment to Validation compaction type compaction/compaction_descriptor: compaction_options: remove validate api: storage_service: validate_keyspace -> scrub_keyspace (validate mode) compaction/compaction_manager: hide perform_sstable_validation() compaction: validation compaction -> scrub compaction (validate mode) compaction/compaction_descriptor: compaction_options: add options() accessor compaction/compaction_descriptor: compaction_options::scrub::mode: add validate	2021-08-10 12:18:35 +03:00
Michael Livshin	c0ba657a86	tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test That is, anything not involving range tombstones. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	7c2854a094	tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	a4c43eda3a	tests: get rid of sstable::make_reader_v1() in broken_sstable_test Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	37c9f8f137	tests: get rid of sstable::make_reader_v1() in the trivial cases Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	f07306d75c	sstables: make sstable::make_reader() return flat_mutation_reader_v2 Rename the old version to `sstables::make_reader_v1()`, to have a nicely searchable eradication target. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Piotr Dulikowski	68cac2eab7	hints: fix indentation after previous patch	2021-08-09 16:16:14 +02:00
Piotr Dulikowski	20cbe7fa2f	hints: error injection for pausing hint replay Adds a `hinted_handoff_pause_hint_replay` error injection point. When enabled, hint replay logic behaves as if it is run, but it gets stuck in a loop and no hints are actually sent until the point is disabled again. This injection point will be useful in dtests - it will simulate infinitely slow hint replay and will make it possible to test how some operations behave while hint replay logic is running. The first intended use case of this injection point is testing the HTTP API for waiting for hints (#8728). Refs: #6649	2021-08-09 16:16:14 +02:00
Piotr Dulikowski	29993f7745	hints: coroutinize lambda inside send_one_file Converts the lambda invoked for every commitlog entry in a hints file into a coroutine.	2021-08-09 16:16:14 +02:00
Asias He	4ae6eae00a	table: Get rid of table::run_compaction helper The table::run_compaction is a trivial wrapper for table::compact_sstables. We have lots of similar {start, trigger, run}_compaction functions. Dropping the run_compaction wrapper to reduce confusion. Closes #9161	2021-08-09 14:02:54 +03:00
Tomasz Grabiec	e115fce8f7	Merge "raft: sometimes become a candidate even if outside the configuration" from Kamil There are situations where a node outside the current configuration is the only node that can become a leader. We become candidates in such cases. But there is an easy check for when we don't need to; a comment was added explaining that. * kbr/candidate-outside-config-v3: raft: sometimes become a candidate even if outside the configuration raft: fsm: update _commit_idx when applying snapshot	2021-08-09 12:29:03 +02:00
Avi Kivity	1b618921be	Merge 'hinted handoff: introduce HTTP API for waiting for hint replay (stateless version)' from Piotr Dulikowski This PR introduces a new feature to hinted handoff: the ability to wait until hints from a given node are replayed towards a chosen set of nodes. It replaces the old mechanism which waits for hints to be replayed before repair and exposes it through an HTTP API. The implementation is completely different, so this PR begins with a revert of the old functionality and then introduces the new implementation. Waiting for hints is made possible with the help of "hint sync points". A sync point is a collection of positions in some hint queues from one node - those positions are encoded into the sync point's description as a hexadecimal string. The sync point consists only of its hexadecimal description - there is no state kept on any of the nodes. Two operations are available through the HTTP API: - `/hints_manager/waiting_point` (POST) - _Create a sync point_. Given a set of `target_hosts`, creates a sync point which encodes positions currently at the end of all queues pointing to any of the `target_hosts`. - `/hints_manager/waiting_point` (GET) - _Wait or check the sync point_. Given a description of a sync point, checks if the sync point was already reached. If you provide a non-zero `timeout` parameter and the sync point is not reached yet, this endpoint will wait until the point is reached or the timeout expires. Hinted handoff uses the commitlog framework in order to store and replay hints. Each entry (here, a serialized hint) can be identified by a "replay position", which contains the ID of the segment containing the hint, and its position in the file. Replay positions are ordered with respect to segment ID and then position in the file; because segment IDs are assigned sequentially and entries are also written sequentially, this order corresponds to the chronological order in which hints were written. 
This order also corresponds to the order in which hints are replayed, provided that hint segments are processed starting with the one with the smallest ID first. The main idea is to track the positions of both the most recently written hint, and the most recently replayed hint. When creating a hint sync point, the position of the last written hint is encoded; when the sync point is waited on, the hints manager waits until the last replayed position reached the position encoded in the sync point. The description of the sync point encodes positions on a per-hint queue basis - separately for each shard, destination endpoint and hint type (regular or MV). Note: although hints manager destroys and re-creates commitlog instances, the ordering described above still works - the ID of the first segment assigned by the commitlog instance corresponds to the number of milliseconds since the epoch, so commitlog instances created by newer instances will have larger IDs. Before the hints manager is enabled, it performs segment _rebalancing_: for a given endpoint, it makes sure that each shard gets roughly the same number of hint segments. For example, if there are 3 shards and shard 1 has 7 segments, then shard 0 will get 2 segments, shard 1: 3 segments, and shard 2: 2 segments. Apart from distributing the work evenly between shards on startup, it also handles the case when the node is resharded - if the number of shards is reduced, segments from removed shards will be redistributed to lower shards. Because of the possibility of segments being moved between shards on restart, this makes accurate tracking of hint replay harder. In order to simplify the problem, this PR changes the order in which hint segments are replayed - segments from other shards (here called "foreign" segments) are replayed first, before any "local" segment from this shard. 
Foreign segments are treated as if they were placed before the 0 replay position - when waiting for a hint sync point, we will __always__ wait for foreign segments to be replayed. This behavior makes sure that hints generated before the sync point was created will be replayed - and, if segment rebalancing happened in the meantime, we will potentially replay some more segments which were moved across shards. This PR starts with a revert of the "hints: delay repair until hints are replayed" (#8452) functionality. Some infrastructure introduced in the original PR started to be used by other places in the code, so this is not a simple revert of the merge commit - instead, commits of the old PR are reverted separately and modified in order to make the code compile. The following commits from the original PR were omitted from the revert because the code introduced by them became used by other logic in repair: - `0db45d1df5` (repair: introduce abort_source for repair abort) - `3a2d09b644` (repair: introduce abort_source for shutdown) - `49f4a2f968` (repair: plug in waiting for hints to be sent before repair) Refs: #8102 Fixes: #8727 Tests: unit(dev) Closes #8982 * github.com:scylladb/scylla: api: add HTTP API for hint sync points api: register hints HTTP API outside set_server_done storage_proxy: add functions for creating and waiting for hint sync pts hints: add functions for creating and waiting for sync points hints: add hint sync point structure utils,alternator: move base64 code from alternator to utils hints: make it possible to wait until hints are replayed hints: track the RP of the last replayed position hints: track the RP of the last written hint hints: change last_attempted_rp to last_succeeded_rp hints: rearrange error handling logic for hint sending hints: sort segments by ID, divide into foreign and local Revert "db/hints: allow to forcefully update segment list on flush" Revert "db/hints: add a metric for counting processed files" Revert "db/hints: make it 
possible to wait until current hints are sent" Revert "storage_proxy: add functions for syncing with hints queue" Revert "messaging_service: add verbs for hint sync points" Revert "storage_proxy: implement verbs for hint sync points" Revert "config: add wait_for_hint_replay_before_repair option" Revert "storage_proxy: coordinate waiting for hints to be sent" Revert "repair: plug in waiting for hints to be sent before repair" Revert "hints: dismiss segment waiters when hint queue can't send" Revert "storage_proxy: stop waiting for hints replay when node goes down" Revert "storage_proxy: add abort_source to wait_for_hints_to_be_replayed"	2021-08-09 10:59:07 +03:00
Piotr Dulikowski	7e3966c03e	api: add HTTP API for hint sync points Adds HTTP endpoints for manipulating hint sync points: - /hinted_handoff/sync_point (POST) - creates a new sync point for hints towards nodes listed in the `target_hosts` parameter - /hinted_handoff/sync_point (GET) - checks the status of the sync point. If a non-zero `timeout` parameter is given, it waits until the sync point is reached or the timeout expires.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	9091ce5977	api: register hints HTTP API outside set_server_done Registration of the currently unused hinted handoff endpoints is moved out from the set_server_done function. They are now explicitly registered in main.cc by calling api::set_hinted_handoff and also uninitialized by calling api::unset_hinted_handoff. Setting/unsetting HTTP API separately will allow passing a reference to the sync_point_service without polluting the set_server_done function.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	14b00610b2	storage_proxy: add functions for creating and waiting for hint sync pts Adds functions in storage_proxy which allow creating sync points and waiting for them.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	d41d39bbcd	hints: add functions for creating and waiting for sync points Adds functions which allow creating per-shard sync points and waiting for them.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	e18b29765a	hints: add hint sync point structure Adds a sync_point structure. A sync point is a (possibly incomplete) mapping from hint queues to a replay position in each of them. Users will be able to create sync points consisting of the last written positions of some hint queues, so that they can then wait until hint replay in all of the queues reaches that point. The sync point supports serialization - first it is serialized with the help of IDL to a binary form, and then converted to a hexadecimal string. Deserialization is also possible.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	5a0942a0f8	utils,alternator: move base64 code from alternator to utils The base64 encoding/decoding functions will be used for serialization of hint sync point descriptions. Base64 format is not specific to Alternator, so it can be moved to utils.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	70df9973f3	hints: make it possible to wait until hints are replayed Adds necessary infrastructure which allows, for a given endpoint manager, to wait until hints are replayed up to a specified position. An abort source must be specified which, if triggered, cancels waiting for hint replay. If the endpoint manager is stopped, current waiters are dismissed with an exception.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	93f244426d	hints: track the RP of the last replayed position Keeps track of a position which serves as an upper bound for positions of already replayed hints - i.e. all hints with replay positions strictly lower than it are considered replayed. In order to accurately track this bound during hint replay, a std::map is introduced which contains positions of hints which are currently being sent.	2021-08-09 09:24:36 +02:00
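The bookkeeping described in the commit above can be sketched as follows. This is a simplified Python model, not the hints manager code: `ReplayedBoundTracker` and its method names are hypothetical, replay positions are modelled as `(segment_id, offset)` tuples, a plain set with `min()` replaces the `std::map`, and failed sends are ignored.

```python
class ReplayedBoundTracker:
    """Tracks an upper bound: every position strictly below it is replayed."""

    def __init__(self, start=(0, 0)):
        self._in_flight = set()        # positions of hints being sent now
        self._bound_when_idle = start  # bound to report when nothing is in flight

    def on_send_started(self, rp):
        self._in_flight.add(rp)
        # Once nothing is in flight, everything up to and including the
        # highest started position has been replayed.
        self._bound_when_idle = max(self._bound_when_idle, (rp[0], rp[1] + 1))

    def on_send_succeeded(self, rp):
        self._in_flight.discard(rp)

    def replayed_bound(self):
        # While sends are outstanding, the oldest in-flight hint is the bound.
        if self._in_flight:
            return min(self._in_flight)
        return self._bound_when_idle
```

The in-flight collection is needed because hints may complete out of order: a later hint finishing first must not advance the bound past an earlier hint that is still being sent.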
Piotr Dulikowski	03e2e671cd	hints: track the RP of the last written hint The position of the last written hint is now tracked by the endpoint hints manager. When the manager is constructed and no hints are replayed yet, the last written hint position is initialized to the beginning of a fake segment with ID corresponding to the current number of milliseconds since the epoch. This choice makes sure that, in case a new hint sync point is created before any hints are written, the position recorded for that hint queue will be larger than all replay positions in segments currently stored on disk.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	27d0d598fd	hints: change last_attempted_rp to last_succeeded_rp Instead of tracking the last position for which hint sending is attempted, the last successfully replayed position is tracked. The previous variable was used to calculate the position from which hint replay should restart in case of an error, in the following way: _last_not_complete_rp = ctx_ptr->first_failed_rp.value_or( ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp)); Now, this formula uses the last_succeeded_rp in place of last_attempted_rp. This change does not have an effect on the choice of the starting position of the next retry: - If the hint at `last_attempted_rp` has succeeded, in the new algorithm the same position will be recorded in `last_succeeded_rp`, and the formula will yield the same result. - If the hint at `last_attempted_rp` has failed, it will be accounted into `first_failed_rp`, so the formula will yield the same result. The motivation for this change is that in the next commits of this PR we will start tracking the position of the last replayed hint per hint queue, and the meaning of the new variable makes it more useful - when there are no failed hints in the hint sending attempt, last_succeeded_rp gives us information that hints _up to this position_ were replayed; the last_attempted_rp variable can only tell us that hints _before that position_ were replayed successfully.	2021-08-09 09:24:36 +02:00
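The claim in the commit above, that swapping `last_attempted_rp` for `last_succeeded_rp` does not change the restart position, can be checked with a small Python model of the formula. `value_or` and `next_restart_rp` are hypothetical helpers, `None` stands in for an empty optional, and positions are simplified to integers.

```python
def value_or(optional, default):
    """Mimic std::optional::value_or, with None as the empty optional."""
    return default if optional is None else optional


def next_restart_rp(first_failed_rp, last_rp, last_not_complete_rp):
    """The formula from the commit; `last_rp` is either last_attempted_rp
    (old code) or last_succeeded_rp (new code)."""
    return value_or(first_failed_rp, value_or(last_rp, last_not_complete_rp))


# Case 1: the hint at position 7 succeeded, so both variables hold 7.
old_ok = next_restart_rp(None, 7, 3)  # last_attempted_rp = 7
new_ok = next_restart_rp(None, 7, 3)  # last_succeeded_rp = 7

# Case 2: the hint at position 7 failed after 5 succeeded. first_failed_rp
# is set, so the inner value is never consulted.
old_fail = next_restart_rp(7, 7, 3)   # last_attempted_rp = 7
new_fail = next_restart_rp(7, 5, 3)   # last_succeeded_rp = 5
```

In both cases the old and the new variant produce the same restart position, which is the equivalence argued in the commit message.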
Piotr Dulikowski	08a7d79ffc	hints: rearrange error handling logic for hint sending Instead of calling the `on_hint_send_failure` method inside the hint sending task in places where an error occurs, we now let the exceptions be returned and handle them inside a single `then_wrapped` attached to the hint sending task. Apart from the `then_wrapped`, there is one more place which calls `on_hint_send_failure` - in the exception handler for the future which spawns the asynchronous hint sending task. It needs to be kept separate because it is a part of a separate task.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	45b04c94e0	hints: sort segments by ID, divide into foreign and local Endpoint hints manager keeps a commitlog instance which is used to write hints into new segments. This instance is re-created every 10 seconds, which causes the previous instance to leave its segments on disk. On the other hand, hints sender keeps a list of segments to replay which is updated only after it becomes empty. The list is repopulated with segments returned by the commitlog::get_segments_to_replay() method which does not specify the order of the segments returned. As a preparation for the upcoming hint sync points feature, this commit changes the order in which segments are replayed: - First, segments written by other shards are replayed. Such segments may appear in the queue because of segment rebalancing which is done at startup. The purpose of replaying "foreign" segments first is that they are problematic for hint sync points. For each hint queue, a hint sync point encodes a replay position of the last written hint on the local shard. Accounting foreign segments precisely would make the implementation more complicated. To make things simpler, waiting for sync points will always make sure that all foreign segments are replayed. This might sometimes cause more hints to be waited on than necessary if a restart occurs in the meantime. - Segments written by the local shard are replayed later, in order of their IDs. This makes sure that local hints are replayed in the order they were written to segments, and will make it possible to use replay positions to track progress of hint replay.	2021-08-09 09:24:36 +02:00
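The replay order introduced by the commit above, foreign segments first and then local segments by ID, can be sketched as follows. This is an illustrative Python model only: the `Segment` type and its `shard` field are hypothetical, and sorting the foreign segments by ID is an assumption made for determinism, since the commit only specifies the order of local segments.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    shard: int       # shard that wrote the segment (hypothetical field)
    segment_id: int  # commitlog segment ID


def replay_order(segments: List[Segment], local_shard: int) -> List[Segment]:
    """Foreign segments come first, then local segments in ID order."""
    foreign = sorted((s for s in segments if s.shard != local_shard),
                     key=lambda s: s.segment_id)
    local = sorted((s for s in segments if s.shard == local_shard),
                   key=lambda s: s.segment_id)
    return foreign + local


segments = [Segment(0, 30), Segment(1, 10), Segment(0, 20), Segment(2, 40)]
ordered = replay_order(segments, local_shard=0)
```

Replaying local segments in ID order is what lets a replay position act as a progress marker, while putting foreign segments first means a sync point never has to account for them individually.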
Piotr Dulikowski	f83699bb7c	Revert "db/hints: allow to forcefully update segment list on flush" This reverts commit `e48739a6da`. This commit removes the functionality from endpoint hints manager which allowed flushing hints immediately and forcefully updating the list of segments to replay. The new implementation of waiting for hints will be based on replay positions returned by the commitlog API and it won't be necessary to forcefully update the segment list when creating a sync point.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	9c1d4e7e6c	Revert "db/hints: add a metric for counting processed files" This reverts commit `5a49fe74bb`. This commit removes a metric which tracks how many segments were replayed during current runtime. It was necessary for the current "wait for hints" mechanism which is being replaced with a different one - therefore we can remove the metric.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	3b851a5ebd	Revert "db/hints: make it possible to wait until current hints are sent" This reverts commit `427bbf6d86`. This commit removes the infrastructure which allows waiting until current hints are replayed in a given hint queue. It will be replaced with a different mechanism in later commits.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	4a35d138f6	Revert "storage_proxy: add functions for syncing with hints queue" This reverts commit `244738b0d5`. This commit removes create_hint_queue_sync_point and check_hint_queue_sync_point functions from storage_proxy, which were used to wait until local hints are sent out to particular nodes. Similar methods will be reintroduced later in this PR, with a completely different implementation.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	0d74dee683	Revert "messaging_service: add verbs for hint sync points" This reverts commit `82c419870a`. This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK rpc verbs. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so there is no need for new verbs.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	4604bb21c3	Revert "storage_proxy: implement verbs for hint sync points" This reverts commit `485036ac33`. This commit removes the handlers for HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK verbs. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so there is no need for new verbs.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	ff453d80ff	Revert "config: add wait_for_hint_replay_before_repair option" This reverts commit `86d831b319`. This commit removes the wait_for_hints_before_repair option. Because a previous commit in this series removes the logic from repair which caused it to wait for hints to be replayed, this option is now useless. We can safely remove this option because it is not present in any release yet.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	6c5d2fe0bf	Revert "storage_proxy: coordinate waiting for hints to be sent" This reverts commit `46075af7c4`. This commit removes the logic responsible for waiting for other nodes to replay their hints. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so there is no need for coordinating multiple nodes.	2021-08-09 09:24:36 +02:00
Piotr Dulikowski	ecf854affc	Revert "repair: plug in waiting for hints to be sent before repair" This reverts commit `49f4a2f968`. The idea to wait for hints to be replayed before repair is not always a good one. For example, someone might want to repair a small token range or just one table - but hinted handoff cannot selectively replay hints like this. The fact that we are waiting for hints before repair caused a small number of regressions (#8612, #8831). This commit removes the logic in repair which caused it to wait for hints. Additionally, the `storage_proxy.hh` include, which was introduced in the commit being reverted is also removed and smaller header files are included instead (gossiper.hh and fb_utilities.hh).	2021-08-09 09:22:26 +02:00
Piotr Dulikowski	e3c32c897a	Revert "hints: dismiss segment waiters when hint queue can't send" This reverts commit `9d68824327`. First, we are reverting existing infrastructure for waiting for hints in order to replace it with a different one, therefore this commit needs to be reverted as well. Second, errors during hint replay can occur naturally and don't necessarily indicate that no progress can be made - for example, the target node is heavily loaded and some hints time out. The "waiting for hints" operation becomes a user-issued command, so it's not as vital to ensure liveness.	2021-08-09 09:06:23 +02:00
Piotr Dulikowski	afb4c85662	Revert "storage_proxy: stop waiting for hints replay when node goes down" This reverts commit `22e06ace2c`. The upcoming HTTP API for waiting for hint replay will be restricted to waiting for hints on the node handling the request, so we are removing all infrastructure related to coordinating hint waiting - therefore this commit needs to be reverted.	2021-08-09 09:06:23 +02:00
Piotr Dulikowski	035da96161	Revert "storage_proxy: add abort_source to wait_for_hints_to_be_replayed" This reverts commit `958a13577c`. The `wait_for_hints_to_be_replayed` function is going to be completely removed in this PR, so this commit needs to be reverted, too.	2021-08-09 09:06:23 +02:00
Takuya ASADA	b822c642e5	docker: fix housekeeping --repo-files to apt repository Even though we switched to an Ubuntu-based container image, housekeeping is still using the yum repository. It should be switched to the apt repository. Fixes #9144 Closes #9147	2021-08-09 07:47:03 +03:00
Avi Kivity	31dcb0d1d0	Update seastar submodule * seastar ce3cc2687f...07758294ef (12): > perftune.py: change hwloc-calc parameters order Fixes perftune on Fedora 34 based hwloc > resource: pass configuration to nr_processing_units() > semaphore: semaphore_timed_out: derive from timed_out_error > Merge "resource: use hwloc_topology_holder" from Benny > Merge "file: ioctl, fcntl and lifetime_hint interfaces in seastar::file" from Arun George > pipe: mark pipe_reader and pipe_writer ctors as noexcept > test: pipe: add simple unit test > test: source_location_test: relax function name check for gcc 11 > http: add 429 too_many_requests status code > Added [[nodiscard]] to abort-source's subscribe > io_queue: Use on_internal_error in io_queue > reactor: Remove unused epoll poller from reactor	2021-08-08 14:42:54 +03:00
Avi Kivity	3b5e312800	db: schema_tables: clean up read_schema_partition_for_keyspace() coroutine captures read_schema_partition_for_keyspace() copies some parameters to capture them in a coroutine, but the same can be achieved more cleanly by changing the reference parameters to value parameters, so do that. Test: unit (dev) Closes #9154	2021-08-08 12:55:10 +03:00
Nadav Har'El	61bcc0ad29	Merge 'compaction: Move compaction_strategy.hh and compaction_garbage_collector.hh to compaction directory ' from Asias He This trial patch set moves compaction_strategy.hh and compaction_garbage_collector.hh to compaction directory and drops two unused compact_for_mutation_query_state and compact_for_data_query_state. Closes #9156 * github.com:scylladb/scylla: compaction: Move compaction_garbage_collector.hh to compaction dir compaction: Move compaction_strategy.hh to compaction dir mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state	2021-08-08 11:58:41 +03:00
Dejan Mircevski	ba55769f80	test: Use ALLOW FILTERING more strictly Prepare for the upcoming strict ALLOW FILTERING check by modifying unit-test queries that need it. Current code allows such queries both with and without ALLOW FILTERING; future code will reject them without ALLOW FILTERING. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-08-08 08:01:19 +02:00
Dejan Mircevski	5da846a4a8	cql3: Add statement_restrictions::to_string Useful for error messages and debugging. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-08-08 07:16:55 +02:00
Asias He	4c1f8c2f83	compaction: Move compaction_garbage_collector.hh to compaction dir The top dir is a mess. Move compaction_garbage_collector.hh to the new home.	2021-08-07 08:07:09 +08:00
Asias He	6350a19f73	compaction: Move compaction_strategy.hh to compaction dir The top dir is a mess. Move compaction_strategy.hh and compaction_strategy_type.hh to the new home.	2021-08-07 08:06:37 +08:00
Asias He	47aae83185	mutation_compactor: Drop compact_for_mutation_query_state and compact_for_data_query_state They are not used.	2021-08-07 07:21:48 +08:00
Tomasz Grabiec	0af2c2b1cb	Merge "raft: store cluster configuration when taking snapshots" from Kamil The cluster would forget its configuration when taking a snapshot, making it unable to reelect a leader. We fix the problem and introduce a regression test. The last commit introduces some additional assertions for safety. * kbr/snapshot-preserve-config-v4: raft: sanity checking of apply index test: raft: regression test for storing cluster configuration when taking snapshots raft: store cluster configuration when taking snapshots	2021-08-06 18:34:53 +02:00
Kamil Braun	7533c84e62	raft: sometimes become a candidate even if outside the configuration There are situations where a node outside the current configuration is the only node that can become a leader. We become candidates in such cases. But there is an easy check for when we don't need to; a comment was added explaining that.	2021-08-06 13:18:32 +02:00
Kamil Braun	907672622f	raft: fsm: update _commit_idx when applying snapshot All entries up to snapshot.idx must obviously be committed, so why not update _commit_idx to reflect that. With this we get a useful invariant: `_log.get_snapshot().idx <= _commit_idx`. For example, when checking whether the latest active configuration is committed, it should be enough to compare the configuration index to the commit index. Without the invariant we would need a special case if the latest configuration comes from a snapshot.	2021-08-06 12:43:07 +02:00
Kamil Braun	1ca4d30cc3	raft: sanity checking of apply index Check that entries are applied in the correct order.	2021-08-06 12:21:19 +02:00
Kamil Braun	93822b0ee7	test: raft: regression test for storing cluster configuration when taking snapshots Before the fix introduced in the previous patch, the cluster would forget its configuration when taking a snapshot, making it unable to reelect a leader. This regression test catches that.	2021-08-06 12:17:22 +02:00
Kamil Braun	c6563220b0	raft: store cluster configuration when taking snapshots We add a function `log_last_conf_before(index_t)` to `fsm` which, given an index greater than the last snapshot index, returns the configuration at this index, i.e. the configuration of the last configuration entry before this index. This function is then used in `applier_fiber` to obtain the correct configuration to be stored in a snapshot. In order to ensure that the configuration can be obtained, i.e. the index we're looking at is not smaller than the last snapshot index, we strengthen the conditions required for taking a snapshot: we check that `_fsm` has not yet applied a snapshot at a larger index (which it may have due to a remote snapshot install request). This also causes fewer unnecessary snapshots to be taken in general.	2021-08-06 12:00:32 +02:00
Avi Kivity	52364b5da0	Merge 'cql3: Use expressions to calculate the local-index clustering ranges' from Jan Ciołek Calculating clustering ranges on a local index has been rewritten to use the new `expression` variant. This allows us to finally remove the old `bounds_ranges` function. Closes #9080 * github.com:scylladb/scylla: cql3: Remove unused functions like bounds_ranges cql3: Use expressions to calculate the local-index clustering ranges statement_restrictions_test: tests for extracting column restrictions expression: add a function to extract restrictions for a column	2021-08-05 18:32:11 +03:00
Tomasz Grabiec	4bfff86ba5	gdb: Print disengaged optionals as std::nullopt to reduce noise Message-Id: <20210805113409.75394-1-tgrabiec@scylladb.com>	2021-08-05 14:42:31 +02:00
Kamil Braun	f050d3682c	raft: fsm: stronger check for outdated remote snapshots We must not apply remote snapshots with commit indexes smaller than our local commit index; this could result in out-of-order command application to the local state machine replica, leading to serializability violations. Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>	2021-08-05 14:29:50 +02:00
Tomasz Grabiec	8fe06ad681	storage_proxy: Fix result reconciliation for memory-limiter induced short reads This applies to the case when pages are broken by replicas based on memory limits (not row or partition limits). If replicas stop pages in the following places: replica1 = { row 1, <end-of-page> row 2 } replica2 = { row 3 } The coordinator will reconcile the first page as: { row 1, row 3 } and row 2 will not be emitted at all in the following pages. The coordinator should notice that replica1 returned a short read and ignore everything past row 1 from other replicas, but it doesn't. There is logic to do this trimming, but it is done in got_incomplete_information_across_partitions() which is executed only for the partition for which row limits were exhausted. Fix by running the logic unconditionally. Fixes #9119 Tests: - unit (dev) - manual (2 node cluster, manual reproducer) Message-Id: <20210802231539.156350-1-tgrabiec@scylladb.com>	2021-08-05 11:28:52 +03:00
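The trimming rule described in the commit above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: rows are plain integers in clustering order, each replica reports whether its page was cut short, and the function name `reconcile_page` is hypothetical and is not part of the storage_proxy code.

```python
def reconcile_page(replica_results):
    """Merge one page of rows from several replicas.

    replica_results: list of (rows, is_short_read) tuples, where rows is a
    sorted list of integer row keys (a stand-in for real clustering keys).
    Rows past the last row of any short-read replica are dropped, because
    that replica may still hold unseen rows below them.
    """
    # Smallest "last row" among replicas that stopped early, if any.
    cutoff = min(
        (rows[-1] for rows, short in replica_results if short and rows),
        default=None,
    )
    merged = sorted({row for rows, _ in replica_results for row in rows})
    if cutoff is not None:
        merged = [row for row in merged if row <= cutoff]
    return merged


# The scenario from the commit message: replica1 stops after row 1,
# replica2 returns row 3. Only row 1 is safe to emit on this page.
page = reconcile_page([([1], True), ([3], False)])
```

With the trimming applied unconditionally, rows 2 and 3 are left for the following page instead of row 2 being skipped.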
Nadav Har'El	ae51fef57c	cql-pytest: add tests for estimated partition count In issue #9083 a user noted that whereas Cassandra's partition-count estimation is accurate, Scylla's (rewritten in commit `b93cc21`) is very inaccurate. The tests introduced here, which all xfail on Scylla, confirm this suspicion. The most important tests are the "simple" tests, involving a workload which writes N distinct partitions and then asks for the estimated partition count. Cassandra provides accurate estimates, which grow more accurate with more partitions, so it passes these tests, while Scylla provides bad estimates and fails them. Additional tests demonstrate that neither Scylla nor Cassandra can handle anything beyond the "simple" case of distinct partitions. Two tests which xfail on both Cassandra and Scylla demonstrate that if we write the same partitions to multiple sstables - or also delete partitions - the estimated partition counts will be way off. Refs #9083 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210726211315.1515856-1-nyh@scylladb.com>	2021-08-05 08:50:19 +02:00
Asias He	9903eecc0f	storage_service: Close reader in load_and_stream We forgot to call reader.close() for the reader when the close API was introduced. Fixes #9146 Closes #9148	2021-08-05 09:27:19 +03:00
Botond Dénes	76f2790c24	compaction/compaction_descriptor: add comment to Validation compaction type Add a note explaining what Origin uses this for, to deter future attempts at reusing this for something else.	2021-08-05 07:36:45 +03:00
Botond Dénes	ab7a2cabb3	compaction/compaction_descriptor: compaction_options: remove validate It is unused now.	2021-08-05 07:36:45 +03:00
Botond Dénes	c1203618eb	api: storage_service: validate_keyspace -> scrub_keyspace (validate mode) Fold validate keyspace into scrub keyspace (validate mode).	2021-08-05 07:36:45 +03:00
Botond Dénes	5f6468d7d7	compaction/compaction_manager: hide perform_sstable_validation() We are folding validation compaction into scrub (at least on the interface level), so remove the validation entry point accordingly and have users go through `perform_sstable_scrub()` instead.	2021-08-05 07:36:44 +03:00
Botond Dénes	a258f5639b	compaction: validation compaction -> scrub compaction (validate mode) Fold validation compaction into scrub compaction (validate mode). Only on the interface level though: to initiate validation compaction one now has to use `compaction_options::make_scrub(compaction_options::scrub::mode::validate)`. The implementation code stays as-is -- separate.	2021-08-05 07:32:05 +03:00
Raphael S. Carvalho	154e8959f9	compaction: Optimize partition filtering for cleanup compaction Realized that the overall complexity of partition filtering in cleanup is O(N * log(M)), where N is # of tokens M is # of ranges owned by the node Assuming N=10,000,000 for a table and M=257, Nlog(M) ~= 80,056,245 checks performed during the whole cleanup. This can be optimized by taking advantage that owned ranges are both sorted and non wrapping, so an incremental iterator-oriented checker is introduced to reduce complexity from O(N log(M)) to O(N + M) or just O(N). BEFORE 240MB to 237MB (~98% of original) in 3239ms = 73MB/s. ~950016 total partitions merged to 949943. 719MB to 719MB (~99% of original) in 9649ms = 74MB/s. ~2900608 total partitions merged to 2900576. 1GB to 1GB (~100% of original) in 15231ms = 74MB/s. ~4536960 total partitions merged to 4536852. 1GB to 1GB (~100% of original) in 15244ms = 74MB/s. ~4536960 total partitions merged to 4536840. 1GB to 1GB (~100% of original) in 15263ms = 74MB/s. ~4536832 total partitions merged to 4536783. 1GB to 1GB (~100% of original) in 15216ms = 74MB/s. ~4536832 total partitions merged to 4536812. AFTER 240MB to 237MB (~98% of original) in 3169ms = 74MB/s. ~950016 total partitions merged to 949943. 719MB to 719MB (~99% of original) in 9444ms = 76MB/s. ~2900608 total partitions merged to 2900576. 1GB to 1GB (~100% of original) in 14882ms = 76MB/s. ~4536960 total partitions merged to 4536852. 1GB to 1GB (~100% of original) in 14918ms = 76MB/s. ~4536960 total partitions merged to 4536840. 1GB to 1GB (~100% of original) in 14919ms = 76MB/s. ~4536832 total partitions merged to 4536783. 1GB to 1GB (~100% of original) in 14894ms = 76MB/s. ~4536832 total partitions merged to 4536812. Fixes #6807. test: mode(dev). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210802213159.182393-1-raphaelsc@scylladb.com>	2021-08-04 20:35:44 +03:00
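The complexity argument in the commit above can be sketched as follows. This is an illustrative model, not the code from the patch: it assumes ranges are inclusive `(start, end)` integer pairs that are sorted and non-overlapping, and that tokens are queried in non-decreasing order. The class name `IncrementalOwnedRangesChecker` is hypothetical.

```python
class IncrementalOwnedRangesChecker:
    """Check sorted tokens against sorted, non-wrapping owned ranges.

    Because both inputs are sorted, the range cursor only moves forward,
    so checking N tokens against M ranges costs O(N + M) in total
    instead of O(N log M) with a binary search per token.
    """

    def __init__(self, owned_ranges):
        self._ranges = owned_ranges  # list of inclusive (start, end) pairs
        self._pos = 0                # index of the first range not yet passed

    def belongs_to_current_node(self, token):
        # Skip ranges that end before this token; they cannot match
        # any later token either, since tokens arrive in sorted order.
        while self._pos < len(self._ranges) and token > self._ranges[self._pos][1]:
            self._pos += 1
        return self._pos < len(self._ranges) and token >= self._ranges[self._pos][0]


checker = IncrementalOwnedRangesChecker([(0, 10), (20, 30)])
owned = [checker.belongs_to_current_node(t) for t in [-5, 0, 5, 10, 15, 20, 31]]
```

The checker is stateful, so a fresh instance is needed for each sorted stream of tokens.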
Jan Ciolek	44ca965ba0	cql3: Remove unused functions like bounds_ranges Finding clustering ranges has been rewritten to use the new expression variant. Old bounds_ranges() and other similar ones are no longer needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-08-04 17:12:44 +02:00
Jan Ciolek	da54c9e2fb	cql3: Use expressions to calculate the local-index clustering ranges Removes old code used to calculate local-index clustering range and replaces it with new based on the expression variant. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-08-04 17:12:40 +02:00
Asias He	6230bd4b5a	locator: Add yield in do_get_ranges and friends Not all calculate_natural_endpoints implementations respect can_yield flag, for example, everywhere_replication_strategy. This patch adds yield at the caller site to fix stalls we saw in do_get_ranges. Fixes #8943 Closes #9139	2021-08-04 15:52:37 +03:00
Laura Novich	54f0b1556d	Update Conf.py to remove master from drop-down Still allows master to be built, but users will not be able to select it. Closes #9140	2021-08-04 15:24:47 +03:00
Laura Novich	4d7835d635	Update docs/conf.py Co-authored-by: David Garcia <hi@davidgarcia.dev>	2021-08-04 15:24:47 +03:00
Laura Novich	3533d5ec15	Update docs/conf.py Co-authored-by: David Garcia <hi@davidgarcia.dev>	2021-08-04 15:24:47 +03:00
lauranovich	79f0dc64cb	add multiversion control to scylla	2021-08-04 15:24:47 +03:00
Benny Halevy	3ad0067272	date_tiered_manifest: get_now: fix use after free of sstable_list The sstable_list is destroyed right after the temporary lw_shared_ptr<sstable_list> returned from `cf.get_sstables()` is dereferenced. Fixes #9138 Test: unit(dev) DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>	2021-08-04 15:24:47 +03:00
Nadav Har'El	d640998ca8	test/cql-pytest: add test for another ALLOW FILTERING case In this patch we add another test case for a case where ALLOW FILTERING should not be required (and Cassandra doesn't require it) but Scylla does. This problem was introduced by pull request #9122. The pull request fixed an incorrect query (see issue #9085) involving both an index and a multi-column restriction on a compound clustering key - and the fix is using filtering. However, in one specific case involving a full prefix, it shouldn't require filtering. This test reproduces this case. The new test passes on Cassandra (and also theoretically, should pass), but fails on Scylla - the check_af_optional() call fails because Scylla made the ALLOW FILTERING mandatory for that case. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210803092046.1677584-1-nyh@scylladb.com>	2021-08-04 15:24:47 +03:00
Nadav Har'El	dba184039a	test/alternator: another test for Query's ExclusiveStartKey We already have tests for Query's ExclusiveStartKey option, but we only exercised it as a way for paging linearly through all the results. Now we add a test that confirms that ExclusiveStartKey can be used not just for paging through all the results - but also for jumping directly to the middle of a partition after any clustering key (existing or non-existing clustering key). The new test also for the first time verifies that ExclusiveStartKey with a specific format works (previous tests just copied LastEvaluatedKey to ExclusiveStartKey, so any opaque cookie could have worked). The test passes on both DynamoDB and Alternator so it did not find a new bug. But it's useful to have as a regression test, in case in the future we want to improve paging performance (see #6278) - and need to keep in mind that ExclusiveStartKey is not just for paging. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210729114703.1609058-1-nyh@scylladb.com>	2021-08-04 15:24:47 +03:00
Kamil Braun	4165045356	test: raft: randomized_nemesis_test: handle timeouts in rpc::send_snapshot They were already correctly returned to the caller, but we had a leftover discarded future that would sometimes end up with a broken_promise exception. Ignore the exception explicitly. Message-Id: <20210803122207.78406-1-kbraun@scylladb.com>	2021-08-04 15:24:47 +03:00
Nadav Har'El	9662de85f5	Merge 'Azure snitch support' from Pekka Enberg This adds support for Azure snitch. The work is an adaptation of AzureSnitch for Apache Cassandra by Yoshua Wakeham: https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java Also change `production_snitch_base` to protect against a snitch implementation setting DC and rack to an empty string, which Lubos says can happen on Azure. Fixes #8593 Closes #9084 * github.com:scylladb/scylla: scylla_util: Use AzureSnitch on Azure production_snitch_base: Fallback for empty DC or rack strings azure_snitch: Azure snitch support	2021-08-03 22:52:05 +03:00
Pavel Solodovnikov	ce330d11af	cql3: create_view_statement: validate bound variables at prepare step Variables specification is already known at prepare step, so it's safe to move the check to happen as early as possible. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210802090852.253469-1-pa.solodovnikov@scylladb.com>	2021-08-03 22:52:05 +03:00
Tomasz Grabiec	cd56a4ec09	service: query_pagers: Reuse query_uuid across pages when paging locally Query pager was reusing query_uuid only when it had no local state (no _last_pkey), so querier cache was not used when paging locally. This bug affects performance of aggregate queries like count(*). Fixes #9127 Message-Id: <20210803003941.175099-1-tgrabiec@scylladb.com>	2021-08-03 22:52:05 +03:00
Botond Dénes	8b64a6caa7	compaction/compaction_descriptor: compaction_options: add options() accessor	2021-08-03 09:34:17 +03:00
Botond Dénes	f01b799a30	compaction/compaction_descriptor: compaction_options::scrub::mode: add validate To replace compaction_type::Validation.	2021-08-03 09:34:15 +03:00
Avi Kivity	885ca2158e	db: schema_tables: reindent Following conversion to coroutines in `fc91e90c59`, remove extra indents and braces left to make the change clearer. One variable had to be renamed since without the braces it duplicated another variable in the same block. Test: unit (dev) Closes #9125	2021-08-02 22:36:57 +02:00
Raphael S. Carvalho	a869d61c89	tests: Move compaction-related tests into its own unit With commit `1924e8d2b6`, compaction code was moved into a top level dir as compaction is layered on top of sstables. Let's continue this work by moving all compaction unit tests into its own test file. This also makes things much more organized. sstable_datafile_test, as its name implies, will only contain sstable data tests. Perhaps it should be renamed to only sstable_data_test, as the test also contains tests involving other components, not only the data one. BEFORE $ cat test/boost/sstable_datafile_test.cc \| grep TEST_CASE \| wc -l 105 AFTER $ cat test/boost/sstable_compaction_test.cc \| grep TEST_CASE \| wc -l 57 $ cat test/boost/sstable_datafile_test.cc \| grep TEST_CASE \| wc -l 48 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>	2021-08-02 22:26:26 +03:00
Avi Kivity	0f50f8ec5f	Merge "Allow reshape to be aborted" from Raphael " Now reshape can be aborted on either boot or refresh. The workflow is: 1) reshape starts 2) user notices it's taking too long 3) nodetool stop RESHAPE the good thing is that completed reshape work isn't lost, allowing table to enjoy the benefits of all reshaping done up to the abortion point. Fixes #7738. " * 'abort_reshape_v1' of https://github.com/raphaelsc/scylla: compaction: Allow reshape to be aborted api: make compaction manager api available earlier	2021-08-02 21:59:42 +03:00
Raphael S. Carvalho	aa7cdc0392	compaction: Allow reshape to be aborted Now reshape can be aborted on either boot or refresh. The workflow is: 1) reshape starts 2) user notices it's taking too long 3) nodetool stop RESHAPE the good thing is that completed reshape work isn't lost, allowing table to enjoy the benefits of all reshaping done up to the abortion point. Fixes #7738. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-08-02 13:54:51 -03:00
Raphael S. Carvalho	33404b9169	api: make compaction manager api available earlier That will be needed for aborting reshape on boot. Refs #7738. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-08-02 13:54:44 -03:00
Raphael S. Carvalho	f75154afca	compaction: Remove overhead of merging reader for cleanup compaction When perfing cleanup, merging reader showed up as significant. Given that cleanup is performed on a single sstable at a time, merging reader becomes an extra layer doing useless work. 1.71% 1.71% scylla scylla [.] merging_reader<mutation_reader_merger>::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}::operator() mutation compactor, to get rid of purgeable expired data and so on, still consumes the data retrieved by sstable reader, so no semantic change is done. With the overhead removed, cleanup becomes ~9% faster, see: BEFORE real 1m15.240s user 0m2.648s sys 0m0.128s 240MB to 237MB (~98% of original) in 3301ms = 71MB/s. 719MB to 719MB (~99% of original) in 9761ms = 73MB/s. 1GB to 1GB (~100% of original) in 15372ms = 73MB/s. 1GB to 1GB (~100% of original) in 15343ms = 74MB/s. 1GB to 1GB (~100% of original) in 15329ms = 74MB/s. 1GB to 1GB (~100% of original) in 15360ms = 73MB/s. AFTER real 1m9.154s user 0m2.428s sys 0m0.123s 240MB to 237MB (~98% of original) in 3010ms = 78MB/s. 719MB to 719MB (~99% of original) in 8997ms = 79MB/s. 1GB to 1GB (~100% of original) in 14114ms = 80MB/s. 1GB to 1GB (~100% of original) in 14145ms = 80MB/s. 1GB to 1GB (~100% of original) in 14106ms = 80MB/s. 1GB to 1GB (~100% of original) in 14053ms = 80MB/s. With a 1TB set, ~20m would have been saved instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210730190713.462135-1-raphaelsc@scylladb.com>	2021-08-02 19:22:41 +03:00
Michael Livshin	0eb2eb1b44	rename `coarse_clock` to `coarse_steady_clock` Also add a comment to explain why it exists. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9123	2021-08-02 17:41:21 +03:00
Jan Ciolek	a7d1dab066	statement_restrictions_test: tests for extracting column restrictions Add unit tests for the function extract_single_column_restrictions_for_column() Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-08-02 15:43:42 +02:00
Jan Ciolek	43ab3d6831	expression: add a function to extract restrictions for a column Add a function, which given an expression and a column, extracts all restrictions involving this column. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-08-02 15:43:33 +02:00
Tomasz Grabiec	3e47f28c65	Merge "raft: use the correct term when storing a snapshot" from Kamil We should not use the current term; we should use the term of the snapshot's index, which may be lower. * https://github.com/kbr-/scylla/tree/snapshot-right-term-fix: test: raft: regression test for using the correct term when taking a snapshot test: raft: randomized_nemesis_test: server configuration parameter raft: use the correct term when storing a snapshot	2021-08-02 15:33:52 +02:00
Eduardo Benzecri	f196a4131a	scylla_setup: Fix outdated message Message changed according to what 'scylla_bootparam_setup' currently does (set a clock source at boot time) instead of what it used to do in the past (setting huge pages). Closes #9116.	2021-08-02 16:04:38 +03:00
Dejan Mircevski	debf65e136	cql3: Filter regular-index results on multi-column When a WHERE clause contains a multi-column restriction and an indexed regular column, we must filter the results. It is generally not possible to craft the index-table query so it fetches only the matching rows, because that table's clustering key doesn't match up with the column tuple. Fixes #9085. Tests: unit (dev, debug) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9122	2021-08-02 14:15:43 +03:00
Nadav Har'El	fc91e90c59	Merge 'db: schema_tables: coroutinize' from Avi Kivity schema_tables is quite hairy, but can be easily simplified with coroutines. In addition to switching future-returning functions to coroutines, we also switch Seastar threads to coroutines. This is less of a clear-cut win; the motivation is to reduce the chances of someone calling a function that expects to run in a thread from a non-thread context. This sometimes works by accident, but when it doesn't, it's pretty bad. So a uniform calling convention has some benefit. I left the extra indents in, since the indent-fixing patch is hard to rebase in case a rebase is needed. I will follow up with an indent fix post merge. Test: unit (dev, debug, release) Closes #9118 * github.com:scylladb/scylla: db: schema_tables: drop now redundant #includes db: schema_tables: coroutinize drop_column_mapping() db: schema_tables: coroutinize column_mapping_exists() db: schema_tables: coroutinize get_column_mapping() db: schema_tables: coroutinize read_table_mutations() db: schema_tables: coroutinize create_views_from_schema_partition() db: schema_tables: coroutinize create_views_from_table_row() db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition() db: schema_tables: coroutinize create_tables_from_tables_partition() db: schema_tables: coroutinize create_table_from_name() db: schema_tables: coroutinize read_table_mutations() db: schema_tables: coroutinize merge_keyspaces() db: schema_tables: coroutinize do_merge_schema() db: schema_tables: futurize and coroutinize merge_functions() db: schema_tables: futurize and coroutinize user_types_to_drop::drop db: schema_tables: futurize and coroutinize merge_types() db: schema_tables: futurize and coroutinize merge_tables_and_views() db: schema_tables: coroutinize store_column_mapping() db: schema_tables: futurize and coroutinize read_tables_for_keyspaces() db: schema_tables: coroutinize read_table_names_of_keyspace() db: schema_tables: 
coroutinize recalculate_schema_version() db: schema_tables: coroutinize merge_schema() db: schema_tables: introduce and use with_merge_lock() db: schema_tables: coroutinize update_schema_version_and_announce() db: schema_tables: coroutinize read_keyspace_mutation() db: schema_tables: coroutinize read_schema_partition_for_table() db: schema_tables: coroutinize read_schema_partition_for_keyspace() db: schema_tables: coroutinize query_partition_mutation() db: schema_tables: coroutinize read_schema_for_keyspaces() db: schema_tables: coroutinize convert_schema_to_mutations() db: schema_tables: coroutinize calculate_schema_digest() db: schema_tables: coroutinize save_system_schema()	2021-08-02 13:43:53 +03:00
Kamil Braun	ac5121a016	test: raft: regression test for using the correct term when taking a snapshot	2021-08-02 11:48:35 +02:00
Kamil Braun	63fdc718d4	test: raft: randomized_nemesis_test: server configuration parameter	2021-08-02 11:47:19 +02:00
Kamil Braun	e9632ee986	raft: use the correct term when storing a snapshot We should not use the current term; we should use the term of the snapshot's index, which may be lower.	2021-08-02 11:46:04 +02:00
Avi Kivity	e4d0af808d	Merge 'repair: Log improvement and cleanup' from Asias He This series improves the repair logging by removing the unused sub_ranges_nr counter, adding peer node ip in the log, removing redundant logs in case of error. Closes #9120 * github.com:scylladb/scylla: repair: Remove redudnary error log in tracker::run repair: Do not log errors in repair_ranges repair: Move more repair single range code into repair_info::repair_range repair: Use the same uuid from the repair_info repair: Drop sub_ranges_nr counter	2021-08-02 12:04:39 +03:00
Avi Kivity	ebda2fd4db	test: cql_test_env: increase file descriptor limit It was observed that since `fce124bd90` ('Merge "Introduce flat_mutation_reader_v2" from Tomasz') database_test takes much longer. This is expected since it now runs the upgrade/downgrade reader tests on all existing tests. It was also observed that in a similar time frame database_test sometimes times out on test machines, taking much longer than usual, even with the extra work for testing reader upgrade/downgrade. In an attempt to reproduce, I noticed it failing on EMFILE (too many open file descriptors). I saw that tests usually use ~100 open file descriptors, while the default limit is 1024. I suspect we have runaway concurrency, but I was not able to pinpoint the cause. It could be compaction lagging behind, or cleanup work for deleting tables (the test test_database_with_data_in_sstables_is_a_mutation_source creates and deletes many tables). As a stopgap solution to unblock the tests, this patch raises the file descriptor limit in the way recommended by [1]. While tests shouldn't use so many descriptors, I ran out of ideas about how to plug the hole. Note that main() does something similar, though more elaborate since it needs to communicate to users. See `ec60f44b64` ("main: improve process file limit handling"). [1] http://0pointer.net/blog/file-descriptor-limits.html Closes #9121	2021-08-02 11:57:14 +03:00
Asias He	1f86d5a870	repair: Use a timeout for reading fragments We recently saw repairs blocking and not making progress for a prolonged time (days). One of the primary suspects is reads belonging to several repairs deadlocking on the streaming read concurrency semaphore. This is something that we've seen in the past during internal testing, and although theoretically we have fixed it, such deadlocks are notoriously hard to reliably reproduce so not seeing them in recent testing doesn't mean they definitely cannot happen. The main reason these deadlocks can happen in the first place is that reads belonging to repairs don't use timeouts. This means that if there happens to be a deadlock, neither of the participating repairs will give up and the only way to release the deadlock is restarting the node. This patch proposes a workaround for these recently seen repair problems by introducing a timeout for reads belonging to row-level reads, building on the premise that a failed repair is better than a stuck repair. A timeout allows one of the participants to give up, releasing the deadlock, allowing the others to proceed. The timeout value chosen by this patch is 30m. Note that this applies to reading a single mutation fragment, not to the entire read. Thirty minutes should be more than enough for producing a single mutation fragment. Refs: #5359 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Signed-off-by: Asias He <asias@scylladb.com> Closes #9098	2021-08-02 11:55:47 +03:00
Pavel Solodovnikov	b1a3b59a08	test: test_materialized_view: test_mv_select_stmt_bound_values: improve error handling Restrict expected exception message to filter only relevant exception, matching both for scylla and cassandra. For example, the former has this message: Cannot use query parameters in CREATE MATERIALIZED VIEW statements While the latter throws this: Bind variables are not allowed in CREATE MATERIALIZED VIEW statements Also, place cleanup code in try-finally clause. Tests: cql-pytest:test_materialized_view.py(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210802083912.229886-1-pa.solodovnikov@scylladb.com>	2021-08-02 11:49:50 +03:00
Nadav Har'El	e8fe1817df	cql-pytest: translate Cassandra's tests for timestamps This is a translation of Cassandra's CQL unit test source file validation/entities/TimestampTest.java into our cql-pytest framework. This test file has a few tests (8) on various features of cell timestamps. All these tests pass on Cassandra and on Scylla - i.e., with these tests no new Scylla bug was detected :-) Two of the new tests are very slow (6 seconds each) and check a trivial feature that was already checked elsewhere more efficiently (the fact that TTL expiration works), so I marked them "skip" after verifying they really pass. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210801142738.1633126-1-nyh@scylladb.com>	2021-08-02 09:25:49 +02:00
Asias He	bd9447370f	repair: Remove redudnary error log in tracker::run The caller of tracker::run will log the error. Remove the log inside tracker::run in case of error to reduce redundancy. Before: WARN 2021-07-28 15:24:32,325 [shard 0] repair - repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed: std::runtime_error ({shard 0: std::runtime_error (repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair 513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to repair 513 out of 513 ranges)}) WARN 2021-07-28 15:24:32,325 [shard 0] repair - repair_tracker run for repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] failed: std::runtime_error ({shard 0: std::runtime_error (repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 0 failed to repair 513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1, uuid=e9c63a2b-07c9-4a38-b9ad-b74cbe180366] on shard 1 failed to repair 513 out of 513 ranges)}) After: WARN 2021-07-28 15:33:17,038 [shard 0] repair - repair_tracker run for repair id [id=1, uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] failed: std::runtime_error ({shard 0: std::runtime_error (repair id [id=1, uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 0 failed to repair 513 out of 513 ranges), shard 1: std::runtime_error (repair id [id=1, uuid=497b2f14-e294-4e76-b792-e6b2d17a8cb9] on shard 1 failed to repair 513 out of 513 ranges)}) ERROR 2021-07-28 15:33:17,453 [shard 0] rpc - client 127.0.0.2:7000: fail to connect: Connection refused	2021-08-02 10:11:52 +08:00
Gleb Natapov	15d34d9f96	raft: do not let follower's commit_idx to go backwards append_reply packets can be reordered and thus reply.commit_idx may be smaller than the one in the tracker. The tracker's commit index is used to check if a follower needs to be updated with a potentially empty append message, so the bug may theoretically cause unneeded packets to be sent. Message-Id: <YQZZ/6nlNb5nQyXp@scylladb.com>	2021-08-02 01:25:55 +02:00
Tomasz Grabiec	c3ada1a145	Merge "count row (sstables/row cache/memtables) and range (memtables) tombstone reads" from Michael Fixes #7749.	2021-08-01 23:13:18 +02:00
Avi Kivity	343b98d9b5	Merge "Print memory reclamation diagnostics on stalls" from Michael Livshin " Refs #4186 but does not fix it, because I punted on the "number (and kinds) of objects migrated and evicted" part. " * tag 'gh-4186-reclamation-diagnostics-on-stalls-v6' of github.com:cmm/scylla: logalloc: add on-stall memory reclaim diagnostics utils: add a coarse clock logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked logalloc: metrics: remove unneeded captures and a pleonasm logalloc: add metrics for evicted and freed memory logalloc: count evicted memory logalloc: count freed memory	2021-08-01 22:48:55 +03:00
Michael Livshin	71d721a97e	logalloc: add on-stall memory reclaim diagnostics Reuse the existing `reclaim_timer` for stall detection. * Since a timer is now set around every reclaim and compaction, use a coarse one for speed. * Set log level according to conditions (stalls deserve a warning). * Add compaction/migration/eviction/allocation stats. Refs #4186. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 21:51:08 +03:00
Michael Livshin	68ab3948f8	utils: add a coarse clock Implement a millisecond-resolution `std::chrono`-style clock using `CLOCK_MONOTONIC_COARSE`. The use cases are those where you care about clock sampling latency more than about accuracy. Assuming non-ancient versions of the kernel & libc, all clock types recognized by `clock_gettime()` are implemented through a vDSO, so `clock_gettime()` is not an actual system call. That means that even `CLOCK_MONOTONIC` (which is what `std::chrono::steady_clock` uses) is not terribly expensive in practice. But `CLOCK_MONOTONIC_COARSE` is still 3.5 times faster than that (on my machine the latencies are 4ns versus 14ns) and is also supposed to be easier on the cache. The actual granularity of `CLOCK_MONOTONIC_COARSE` is a tick (on x86-64, anyway) -- but `clock_getres()` says it has millisecond resolution, so we use that. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 21:51:08 +03:00
Avi Kivity	ca59754e68	db: schema_tables: drop now redundant #includes	2021-08-01 20:13:15 +03:00
Avi Kivity	40fdbf9558	db: schema_tables: coroutinize drop_column_mapping()	2021-08-01 20:13:15 +03:00
Avi Kivity	7d46300af2	db: schema_tables: coroutinize column_mapping_exists()	2021-08-01 20:13:15 +03:00
Avi Kivity	74b2200f4d	db: schema_tables: coroutinize get_column_mapping()	2021-08-01 20:13:15 +03:00
Avi Kivity	f19ca7aaaa	db: schema_tables: coroutinize read_table_mutations()	2021-08-01 20:13:15 +03:00
Avi Kivity	81a2be17b6	db: schema_tables: coroutinize create_views_from_schema_partition()	2021-08-01 20:13:15 +03:00
Avi Kivity	15f2fd2a23	db: schema_tables: coroutinize create_views_from_table_row()	2021-08-01 20:13:15 +03:00
Avi Kivity	0843d441ff	db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition() The tables local is a lw_shared_ptr which is created and then dereferenced before returning. It can be unpeeled to the pointed-to type, resulting in one less allocation.	2021-08-01 20:13:15 +03:00
Avi Kivity	66054d24c4	db: schema_tables: coroutinize create_tables_from_tables_partition()	2021-08-01 20:13:15 +03:00
Avi Kivity	82ba3c5f4a	db: schema_tables: coroutinize create_table_from_name()	2021-08-01 20:13:15 +03:00
Avi Kivity	862f491605	db: schema_tables: coroutinize read_table_mutations()	2021-08-01 20:13:15 +03:00
Avi Kivity	91c1a29808	db: schema_tables: coroutinize merge_keyspaces()	2021-08-01 20:13:15 +03:00
Avi Kivity	78fc05922b	db: schema_tables: coroutinize do_merge_schema() It is now using an internal thread, so unpeel it and replace future::get() with co_await.	2021-08-01 20:13:15 +03:00
Avi Kivity	9680d9e76c	db: schema_tables: futurize and coroutinize merge_functions() Right now, merge_functions() expects to be called in a thread. Remove that requirement by converting it into a coroutine and returning a future. De-threading helps reduce errors where something expects to be called in a thread, but isn't.	2021-08-01 20:13:15 +03:00
Avi Kivity	9cbae212bf	db: schema_tables: futurize and coroutinize user_types_to_drop::drop user_types_to_drop::drop is a function object returning void, and expecting to be called in a thread. Make it return a future and convert the only value it is initialized to to a coroutine. De-threading helps reduce errors where something expects to be called in a thread, but isn't.	2021-08-01 20:13:15 +03:00
Avi Kivity	e5f28fc746	db: schema_tables: futurize and coroutinize merge_types() Right now, merge_types() expects to be called in a thread. Remove that requirement by converting it into a coroutine and returning a future. The [[nodiscard]] attribute is moved from the function to the return type, since the function now returns a future which is nodiscard anyway. The lambda returned is not coroutinized (yet) since it's part of the user_types_to_drop inner function that still returns void and expects to be called in a thread. De-threading helps reduce errors where something expects to be called in a thread, but isn't.	2021-08-01 20:13:15 +03:00
Avi Kivity	c9584d50ee	db: schema_tables: futurize and coroutinize merge_tables_and_views() Right now, merge_tables_and_views() expects to be called in a thread. Remove that requirement by converting it into a coroutine and returning a future. De-threading helps reduce errors where something expects to be called in a thread, but isn't.	2021-08-01 20:13:15 +03:00
Avi Kivity	80fe158387	db: schema_tables: coroutinize store_column_mapping()	2021-08-01 20:13:15 +03:00
Avi Kivity	ee8b02f437	db: schema_tables: futurize and coroutinize read_tables_for_keyspaces() Right now, read_tables_for_keyspaces() expects to be called in a thread. Remove that requirement by converting it into a coroutine and returning a future. De-threading helps reduce errors where something expects to be called in a thread, but isn't.	2021-08-01 20:13:15 +03:00
Avi Kivity	cd1003daad	db: schema_tables: coroutinize read_table_names_of_keyspace()	2021-08-01 20:13:15 +03:00
Avi Kivity	000f7eabd5	db: schema_tables: coroutinize recalculate_schema_version()	2021-08-01 20:13:15 +03:00
Avi Kivity	95d33e9e86	db: schema_tables: coroutinize merge_schema()	2021-08-01 20:13:15 +03:00
Avi Kivity	25548f46dd	db: schema_tables: introduce and use with_merge_lock() Rather than open-coding merge_lock()/merge_unlock() pairs, introduce and use a helper. This helps in coroutinization, since coroutines don't support RAII with destructors that wait.	2021-08-01 20:13:15 +03:00
Avi Kivity	7b731ae2c6	db: schema_tables: coroutinize update_schema_version_and_announce()	2021-08-01 20:13:15 +03:00
Avi Kivity	385e0dcc2e	db: schema_tables: coroutinize read_keyspace_mutation()	2021-08-01 20:13:15 +03:00
Avi Kivity	ef5df86b1f	db: schema_tables: coroutinize read_schema_partition_for_table()	2021-08-01 20:13:15 +03:00
Avi Kivity	8841c2ba10	db: schema_tables: coroutinize read_schema_partition_for_keyspace() Two reference parameters are copied rather than changing the signature, to avoid a compile-the-world. It can be cleaned up post-merge.	2021-08-01 20:09:00 +03:00
Michael Livshin	5f9695c1b2	sstables: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	64dca1fef9	memtables: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	f364666d4a	row_cache: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	d4a5508d47	memtables: rename `partition_snapshot_accounter` for consistency It is actually `partition_snapshot_flush_accounter`, as opposed to `partition_snapshot_read_accounter`. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	69ade155be	partition_snapshot_reader: rename MemoryAccounter to just Accounter Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	4d8f99df25	remove the newly-unused `partition_snapshot_reader_dummy_accounter` (along with the `make_partition_snapshot_flat_reader` overload that used it) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	2ee9f1b951	memtables: add metric and accounter for range tombstone reads Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Michael Livshin	20c760e638	logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked Similarly to compact_and_evict(). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	a96aed3973	logalloc: metrics: remove unneeded captures and a pleonasm Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	aa6c8ef582	logalloc: add metrics for evicted and freed memory Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	a6283b322b	logalloc: count evicted memory Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	4bcd91a09a	logalloc: count freed memory (On the individual free() request level, i.e. similarly to allocs) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Avi Kivity	d1876488f7	db: schema_tables: coroutinize query_partition_mutation()	2021-08-01 19:17:13 +03:00
Avi Kivity	35f9caf6a9	db: schema_tables: coroutinize read_schema_for_keyspaces()	2021-08-01 19:17:09 +03:00
Avi Kivity	7c0476251a	db: schema_tables: coroutinize convert_schema_to_mutations()	2021-08-01 19:16:55 +03:00
Avi Kivity	921216e8e6	db: schema_tables: coroutinize calculate_schema_digest()	2021-08-01 19:16:50 +03:00
Avi Kivity	3dab308ddf	db: schema_tables: coroutinize save_system_schema()	2021-08-01 19:16:40 +03:00
Nadav Har'El	6c27000b98	Merge 'Propagate exceptions without throwing' from Piotr Sarna NOTE: this series depends on a Seastar submodule update, currently queued in next: 0ed35c6af052ab291a69af98b5c13e023470cba3 In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. `make_exception_future<>` for futures 2. `co_return coroutine::exception(...)` for coroutines which return `future<T>` (the mechanism does not work for `future<>` without parameters, unfortunately) Tests: unit(release) Closes #9079 * github.com:scylladb/scylla: system_keyspace: pass exceptions without throwing sstables: pass exceptions without throwing storage_proxy: pass exceptions without throwing multishard_mutation_query: pass exceptions without throwing client_state: pass exceptions without throwing flat_mutation_reader: pass exceptions without throwing table: pass exceptions without throwing commitlog: pass exceptions without throwing compaction: pass exceptions without throwing database: pass exceptions without throwing	2021-08-01 16:47:47 +03:00
Pavel Solodovnikov	d07f681a95	test: test_non_deterministic_functions: add `lwt` to test cases names The tests are related to LWT so add the corresponding prefix to all the tests cases to emphasize that. Tests: cql-pytest(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210801131820.164480-1-pa.solodovnikov@scylladb.com>	2021-08-01 16:23:30 +03:00
Avi Kivity	48860b135a	Merge "cql3: fix `current*()` functions to be non-deterministic" from Pavel S " Previously, the following functions were incorrectly marked as pure, meaning that the function is executed at "prepare" step: * `currenttimestamp()` * `currenttime()` * `currentdate()` * `currenttimeuuid()` For functions that possibly depend on timing and random seed, this is clearly a bug. Cassandra doesn't have a notion of pure functions, so they are lazily evaluated. Make Scylla match Cassandra behavior for these functions. Add a unit-test for the fix (excluding `currentdate()` function, because there is no way to use synthetic clock with query processor and sleeping for a whole day to demonstrate correct behavior is clearly not an option). Also, extend the cql-pytest for #8604 since there are now more non-deterministic CQL functions, they are all subject to the test now. Fixes: #8816 " * 'timeuuid_function_pure_annotation_v3' of https://github.com/ManManson/scylla: test: test_non_deterministic_functions: test more non-pure functions cql3: change `current*()` CQL functions to be non-pure	2021-08-01 12:35:36 +03:00
Pavel Solodovnikov	a130921120	test: test_non_deterministic_functions: test more non-pure functions Check that all existing non-pure functions (except for `currentdate()`) work correctly with or without prepared statements. Tests: cql-pytest/test_non_deterministic_functions.py(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-08-01 12:18:26 +03:00
Pavel Solodovnikov	21d758020a	cql3: change `current*()` CQL functions to be non-pure These include the following: * `currenttimestamp()` * `currenttime()` * `currentdate()` * `currenttimeuuid()` Previously, they were incorrectly marked as pure, meaning that the function is executed at "prepare" step. For functions that possibly depend on timing and random seed, this is clearly a bug. Cassandra doesn't have a notion of pure functions, so they are lazily evaluated. Make Scylla match Cassandra behavior for these functions. Add a unit-test for the fix (excluding `currentdate()` function, because there is no way to use synthetic clock with query processor and sleeping for a whole day to demonstrate correct behavior is clearly not an option). Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-08-01 12:17:23 +03:00
Avi Kivity	18d9ad1d78	Merge "cql3: create_view_statement: fix wrong check for bound variables" from Pavel S " `create_view_statement::announce_migration()` has an incorrect check to verify that no bound variables were supplied to a select statement of a materialized view. This used `prepare_context::empty()` static method, which doesn't check the current instance for emptiness but constructs a new empty instance instead. The following bit of code actually checked that the pointer to the new empty instance is not null: if (!_variables.empty()) { throw exceptions::invalid_request_exception(format("Cannot use query parameters in CREATE MATERIALIZED VIEW statements")); } Use `get_variable_specifications().empty()` instead to fix the semantics of the `if` statement. This series also removes this `empty()` method, because it's not used anymore. The corresponding non-default constructor is also removed due to being unused. Tests: unit(dev), cql-pytest:test_materialized_view.py(scylla dev, cassandra trunk) " Fixes #9117 * 'create_view_stmt_check_bound_vars_v3' of https://github.com/ManManson/scylla: test: add a test checking that bind markers within MVs SELECT statement don't lead to a crash cql3: prepare_context: remove unused methods cql3: create_view_statement: fix check for bound variables cql3: make `prepare_context::get_variable_specifications()` return const-ref for lvalue overload	2021-08-01 12:04:36 +03:00
Avi Kivity	3089558f8d	tools: toolchain: update to Fedora 34 with clang 12 and libstdc++ 11.2	2021-07-31 15:25:13 +03:00
Pavel Solodovnikov	1ca7825cf6	test: add a test checking that bind markers within MVs SELECT statement don't lead to a crash The request should fail with `InvalidRequest` exception and shouldn't crash the database. Don't check for actual error messages, because they are different between Scylla and Cassandra. The former has this message: Cannot use query parameters in CREATE MATERIALIZED VIEW statements While the latter throws this: Bind variables are not allowed in CREATE MATERIALIZED VIEW statements Tests: cql-pytest/test_materialized_view.py(scylla dev, cassandra trunk) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 17:57:24 +03:00
Pavel Solodovnikov	1694f5f66f	cql3: prepare_context: remove unused methods Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 17:57:10 +03:00
Pavel Solodovnikov	9f0dc99627	cql3: create_view_statement: fix check for bound variables The code for checking that an MV's select statement doesn't have any bind markers uses the wrong method and always returns `false` even when it should not. `prepare_context::empty()` is a misleading name because it doesn't check if the current instance is empty, but creates an empty instance wrapped in a `lw_shared_ptr` instead. Thus, the code in `create_view_statement::announce_migration()` checks that the pointer is not empty, which is always false. Use `get_variable_specifications().empty()` to check that the specifications vector inside the `prepare_context` instance is not empty. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 17:55:43 +03:00
Pavel Solodovnikov	0edf975bf7	cql3: make `prepare_context::get_variable_specifications()` return const-ref for lvalue overload There's no point in copying the `_specs` vector by value in such case, just return a const reference. All existing uses create a copy either way. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 17:54:43 +03:00
Piotr Sarna	1c7af8d46f	cql-pytest: adjust a test case for Cassandra 4 One of the test cases stopped working against Cassandra 4, but that's just because it returns a slightly different error type. The test case is adjusted to work on both Scylla and new Cassandra. Message-Id: <222a7f63a3e9739c6fc646173306fcdb3da25890.1627655555.git.sarna@scylladb.com>	2021-07-30 17:36:23 +03:00
Avi Kivity	0876248c2b	Merge "cql3: cache function calls evaluation for non-deterministic functions" from Pavel S " `function_call` AST nodes are created for each function with side effects in a CQL query, i.e. non-deterministic functions (`uuid()`, `now()` and some other timeuuid-related ones). These nodes are evaluated either when a query itself is executed or query restrictions are computed (e.g. partition/clustering key ranges for LWT requests). We need to cache the calls since otherwise when handling a `bounce_to_shard` request for an LWT query, we can possibly enter an infinite bouncing loop (in case a function is used to calculate partition key ranges for a query), since the results can be different each time. Furthermore, we don't support bouncing more than one time. Returning `bounce_to_shard` message more than one time will result in a crash. Caching works only for LWT statements and only for the function calls that affect partition key range computation for the query. `variable_specifications` class is renamed to `prepare_context` and generalized to record information about each `function_call` AST node and modify them, as needed: * Check whether a given function call is a part of partition key statement restriction. * Assign ids for caching if above is true and the call is a part of an LWT statement. There is no need to include any kind of statement identifier in the cache key since `query_options` (which holds the cache) is limited to a single statement, anyway. Function calls are indexed by the order in which they appear within a statement while parsing. Note that `function_call::raw` AST nodes are not created for selection clauses of a SELECT statement hence they can accept only one of the following things as parameters: * Other function calls. * Literal values. * Parameter markers. 
In other words, only parameters that can be immediately reduced to a byte buffer are allowed and we don't need to handle database inputs to non-pure functions separately since they are not possible in this context. Anyhow, we don't even have a single non-pure function that accepts arguments, so precautions are not needed at the moment. Add a test written in `cql-pytest` framework to verify that both prepared and unprepared lwt statements handle `bounce_to_shard` messages correctly in such scenario. Fixes: #8604 Tests: unit(dev, debug) NOTE: the patchset uses `query_options` as a container for cached values. This doesn't look clean and `service::query_state` seems to be a better place to store them. But it's not forwarded to most of the CQL code and would mean that a huge number of places would have to be amended. The series presents a trade-off to avoid forwarding `query_state` everywhere (but maybe it's the thing that needs to be done, nonetheless). " * 'lwt_bounce_to_shard_cached_fn_v6' of https://github.com/ManManson/scylla: cql-pytest: add a test for non-pure CQL functions cql3: cache function calls evaluation for non-deterministic functions cql3: rename `variable_specifications` to `prepare_context`	2021-07-30 14:21:11 +03:00
Pekka Enberg	21cfd090f7	Update tools/python3 submodule * tools/python3 afe2e7f...279aae1 (1): > Drop filename start with '..' in pip modules	2021-07-30 13:58:45 +03:00
Avi Kivity	c3c82415c3	cql3: term: make term::raw, term::multi_column_raw forward declarable As preparation for converting term::raw to an expression, make it forward declarable so that we can have a term::raw that is an expression, and an expression that is a term::raw, without driving the compiler insane. Closes #9101	2021-07-30 13:50:28 +03:00
Pavel Emelyanov	4f4b863e6a	test.py: Always disable boost colored output Tests' output is always redirected to a log file. Enabling colored output makes it very hard to read. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210730083731.17813-1-xemul@scylladb.com>	2021-07-30 12:22:31 +03:00
Piotr Sarna	60072045db	Merge 'cql3: replace cql3::selection::selectable::raw ... hierarchy with expressions' from Avi Kivity Currently, the grammar has two parallel hierarchies. One hierarchy is used in the WHERE clause, and is based on a combination of `term` and expressions. The other is used in the SELECT clause, and is using the cql3::selection::selectable hierarchy. There is some overlap between the hierarchies: both can name columns. Logically, however, they overlap completely - in SQL anything you can select you can filter on, and vice versa. So merging the two hierarchies is important if we want to enrich CQL. This series does that, partially (see below), converting the SELECT clause to expressions. There is another hierarchy split: between the "raw", pre-prepare object hierarchy, and post-prepare non-raw. This series limits itself to converting the raw hierarchy and leaves the non-raw hierarchy alone. An important design choice is not to have this raw/non-raw split in expressions. Note that most of the hierarchy is completely parallel: addition is addition both before prepare and after prepare (but see [1]). The main difference is around identifiers - before preparation they are unresolved, and after preparation they become `column_definition` objects. We resolve that by having two separate types: `unresolved_identifier` for the pre-prepare phase, and the existing `column_value` for post-prepare phase. Alternative choices would be to keep a separate expression::raw variant, or to template the expression variant on whether it is raw or not. I think it would cause undue bloat and confusion. Note the series introduces many on_internal_error() calls. This is because there is not a lot of overlap in the hierarchies today; you can't have a cast in the WHERE clause, for example. These on_internal_error() calls cannot be triggered since the grammar does not yet allow such expressions to be expressed. 
As we expand the grammar, they will have to be replaced with working implementations. Lastly, field selection is expressible in both hierarchies. This series does not yet merge the two representations (`column_value.sub` vs `field_selection`), but it should be easy to do so later. [1] the `+` operator can also be translated to list concatenation, which we may choose to represent by yet another type. Test: unit(dev) Closes #9087 * github.com:scylladb/scylla: cql3: expression: update find_atom, count_if for function_call, cast, field_selection cql3: expressions: fix printing of nested expressions cql3: selection: replace selectable::raw with expression cql3: expression: convert selectable::with_field_selection::raw to expression cql3: expression: convert selectable::with_cast::raw to expression cql3: expression: convert selectable::with_anonymous_function::raw to expression cql3: expression: convert selectable::with_function_call::raw to expressions cql3: selectable: make selectable::raw forward-declarable cql3: expressions: convert writetime_or_ttl::raw to expression cql3: expression: add convenience constructor from expression element to nested expression utils: introduce variant_element.hh cql3: expression: use nested_expression in binary_operator cql3: expression: introduce nested_expression class Convert column_identifier_raw's use as selectable to expressions make column_identifier::raw forward declarable cql3: introduce selectable::with_expression::raw	2021-07-30 09:57:39 +02:00
Pavel Solodovnikov	eaf70df203	cql-pytest: add a test for non-pure CQL functions Introduce a test using the `cql-pytest` framework to assert that both prepared and unprepared LWT statements (insert with `IF NOT EXISTS`) with a non-deterministic function call work correctly in case its evaluation affects partition key range computation (hence the choice of `cas_shard()` for lwt query). Tests: cql-pytest/test_non_deterministic_functions.py Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 01:22:50 +03:00
Pavel Solodovnikov	3b6adf3a62	cql3: cache function calls evaluation for non-deterministic functions And reuse these values when handling `bounce_to_shard` messages. Otherwise such a function (e.g. `uuid()`) can yield a different value when a statement is re-executed on the other shard. It can lead to an infinite number of `bounce_to_shard` messages sent in case the function value is used to calculate partition key ranges for the query. Which, in turn, will cause crashes since we don't support bouncing more than one time and the second hop will result in a crash. Caching works only for LWT statements and only for the function calls that affect partition key range computation for the query. `variable_specifications` class is renamed to `prepare_context` and generalized to record information about each `function_call` AST node and modify them, as needed: * Check whether a given function call is a part of partition key statement restriction. * Assign ids for caching if above is true and the call is a part of an LWT statement. There is no need to include any kind of statement identifier in the cache key since `query_options` (which holds the cache) is limited to a single statement, anyway. Note that `function_call::raw` AST nodes are not created for selection clauses of a SELECT statement hence they can accept only one of the following things as parameters: * Other function calls. * Literal values. * Parameter markers. In other words, only parameters that can be immediately reduced to a byte buffer are allowed and we don't need to handle database inputs to non-pure functions separately since they are not possible in this context. Anyhow, we don't even have a single non-pure function that accepts arguments, so precautions are not needed at the moment. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-30 01:22:39 +03:00
Tomasz Grabiec	7c28f77412	Merge 'Convert all remaining int tri-compares to std::strong_ordering' from Avi Kivity Convert all known tri-compares that return an int to return std::strong_ordering. Returning an int is dangerous since the caller can treat it as a bool, and indeed this series uncovered a minor bug (#9103). Test: unit (dev) Fixes #1449 Closes #9106 * github.com:scylladb/scylla: treewide: remove redundant "x <=> 0" compares test: mutation_test: convert internal tri-compare to std::strong_ordering utils: int_range: change to std::strong_ordering test: change some internal comparators to std::strong_ordering utils: big_decimal: change to std::strong_ordering utils: fragment_range: change to std::strong_ordering atomic_cell: change compare_atomic_cell_for_merge() to std::strong_ordering types: drop scaffolding erected around lexicographical_tri_compare sstables: keys: change to std::strong_ordering internally bytes: compare_unsigned(): change to std::strong_ordering uuid: change comparators to std::strong_ordering types: convert abstract_type::compare and related to std::strong_ordering types: reduce boilerplate when comparing empty value serialized_tri_compare: change to std::strong_ordering compound_compat: change to std::strong-ordering types: change lexicographical_tri_compare, prefix_equality_tri_compare to std::strong_ordering	2021-07-29 21:43:54 +02:00
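A minimal sketch of the hazard this series describes, where an int tri-compare result is silently treated as a bool. The names `tri_compare`, `buggy_less`, `correct_less` and `check_tri_compare_hazard` are hypothetical and not taken from the Scylla tree.

```cpp
#include <cassert>

// Hypothetical int-returning tri-compare: negative, zero or positive.
inline int tri_compare(int a, int b) {
    return (a > b) - (a < b);
}

// Buggy caller: meant "a is less than b", but the int result converts
// to bool, so the function really answers "a differs from b".
inline bool buggy_less(int a, int b) {
    return tri_compare(a, b);
}

// Correct caller: compares the tri-compare result against zero.
inline bool correct_less(int a, int b) {
    return tri_compare(a, b) < 0;
}

// Small self-check showing where the two callers disagree.
inline void check_tri_compare_hazard() {
    assert(buggy_less(2, 1));     // wrong answer: 2 is not less than 1
    assert(!correct_less(2, 1));  // right answer
    assert(!buggy_less(1, 1));    // equal values happen to look fine
}
```

If the comparator returned `std::strong_ordering` (C++20) instead, the body of `buggy_less` would not compile, because `std::strong_ordering` has no conversion to `bool`.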
Takuya ASADA	3ecdd15777	dist/debian: keep sysconfdir.conf for scylla-housekeeping on 'remove' Same as `4309785`: dpkg does not re-install conffiles once they are removed by the user, so we are missing sysconfdir.conf for scylla-housekeeping on rollback. To prevent this, we need to stop removing the drop-in file directory on 'remove'. Fixes #9109 Closes #9110	2021-07-29 12:32:21 +03:00
Avi Kivity	e44d3cc0ea	Merge "Remove global storage service instance" from Pavel E " There are few places that call global storage service, but all are easily fixable without significant changes. 1. alternator -- needs token metadata, switch to using proxy 2. api -- calls methods from storage service, all handlers are registered in main and can capture storage service from there 3. thrift -- calls methods from storage service, can carry the reference via controller 4. view -- needs tokens, switch to using (global) proxy 5. storage_service -- (surprisingly) can use "this" tests: unit(dev), dtest(simple_boot_shutdown, dev) " * 'br-unglobal-storage-service' of https://github.com/xemul/scylla: storage_service: Make it local storage_service: Remove (de)?init_storage_service() storage_service: Use container() in run_with(out)_api_lock storage_service: Unmark update_topology static storage_service: Capture this when appropriate view: Use proxy to get token metadata from thrift: Use local storage service in handlers thrift: Carry sharded<storage_service>& down to handler api: Capture and use sharded<storage_service>& in handlers api: Carry sharded<storage_service>& down to some handlers alternator: Take token metadata from server's storage_proxy alternator: Keep storage_proxy on server	2021-07-29 11:47:16 +03:00
Avi Kivity	8d2255d82c	Merge "Parallelize multishard_combining_reader_as_mutation_source test" from Pavel E " This is the 3rd slowest test in the set. There are 3 cases out there that are hard-coded to be sequential. However, splitting them into boost test cases helps running this test faster in --parallel-cases mode. Timings for debug mode: Total before the patch: 25 min Sequential after the patch: 25 min Basic case: 5 min Evict-paused-readers case: 5 min Single-mutation-buffer case: 15 min tests: unit.multishard_combining_reader_as_mutation_source(debug) " * 'br-parallel-mcr-test' of https://github.com/xemul/scylla: test: Split test_multishard_combining_reader_as_mutation_source into 3 test: Fix indentation after previous patch test: Move out internals of test_multishard_combining_reader_as_mutation_source	2021-07-29 11:39:02 +03:00
Raphael S. Carvalho	c399601833	table: kill move_sstables_from_staging() not used anywhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210728175403.86867-1-raphaelsc@scylladb.com>	2021-07-29 10:42:36 +03:00
Raphael S. Carvalho	eb16268768	table: Guarantee serialization of every sstable set updates Continuing the work from `e4eb7df1a1`, let's guarantee serialization of sstable set updates by making all sites acquire the mutation permit. The table then no longer relies on the serialization mechanism of the row cache's update functions. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210728174740.78826-1-raphaelsc@scylladb.com>	2021-07-29 10:42:18 +03:00
Asias He	0b2359b45b	repair: Do not log errors in repair_ranges The exception will be logged by the caller of repair_ranges. Do not log it here to reduce redundancy.	2021-07-29 15:05:16 +08:00
Asias He	e3c4f2d54f	repair: Move more repair single range code into repair_info::repair_range The benefit is that inside repair_info::repair_range we have the peer node information. It is useful to log peer nodes. In addition, we can avoid logging the similar logs twice in case the range is skipped. For example: INFO 2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513 ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1, keyspace=keyspace1, table={standard1}, range=(5380136883876426790, 5406788998747631705] WARN 2021-07-28 14:57:15,388 [shard 1] repair - Repair 417 out of 513 ranges, id=[id=1, uuid=72344b26-1db2-48a0-bc5b-e8ac2874e154], shard=1, keyspace=keyspace1, table={standard1}, range=(5380136883876426790, 5406788998747631705], peers={127.0.0.2}, live_peers={}, status=skipped	2021-07-29 15:05:16 +08:00
Asias He	c8e5572cf0	repair: Use the same uuid from the repair_info It is a regression introduced by `d92d404629` (repair: Turn repair_range a repair_info method).	2021-07-29 15:05:16 +08:00
Asias He	c72cc3eb9d	repair: Drop sub_ranges_nr counter It was used to count the number of sub ranges divided by partition level repair. We do not use it anymore in row level repair.	2021-07-29 15:05:16 +08:00
Pavel Emelyanov	f9132b582b	storage_service: Make it local There are 3 places that can now declare local instance: - main - cql_test_env - boost gossiper test The global pointer is saved in debug namespace for debugging. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	055025eaa9	storage_service: Remove (de)?init_storage_service() One of them just re-wraps arguments in std::ref and calls for global storage service. The other one is dead code which also calls the global s._s. Remove both and fix the only caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	2ffbe894b9	storage_service: Use container() in run_with(out)_api_lock Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	cd44a808be	storage_service: Unmark update_topology static And use container() to reshard to shard 0. This removes one more call for global storage service instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	39db19191f	storage_service: Capture this when appropriate Some storage_service methods call for global storage service instance while they can enjoy "this" pointer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	689a4c1e54	view: Use proxy to get token metadata from The mutate_MV() call needs token metadata and it gets them from global storage service. Fixing it not to use globals is a huge refactoring, so for now just get the tokens from global storage proxy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	5a13031ce8	thrift: Use local storage service in handlers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	f2992f4e32	thrift: Carry sharded<storage_service>& down to handler The thrift_handler class' methods need storage service. This patch makes sure this class has sharded storage service reference on board. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	df285fca7a	api: Capture and use sharded<storage_service>& in handlers The reference in question is already there, handlers that need storage service can capture it and use. These handlers are not yet stopped, but neither is the storage service itself, so the potentially dangling reference is not being set up here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	2e50ba7079	api: Carry sharded<storage_service>& down to some handlers Both set_server_storage_service and set_server_storage_proxy set up API handlers that need storage service to work. Now they all call for global storage service instance, but it's better if they receive one from main. This patch carries the sharded storage service reference down to handlers setting function, next patch will make use of it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	a965a742fc	alternator: Take token metadata from server's storage_proxy There's a local_nodelist_handler serving API requests that calls for global storage service to get token metadata from. Now it can get storage proxy reference from server upon construction and use it for tokens. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Pavel Emelyanov	ba10e96c75	alternator: Keep storage_proxy on server It's already available on controller and will be needed by API handlers in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-29 05:12:36 +03:00
Gleb Natapov	4764028cb3	raft: Remove leader_id from append_request The field is not used anywhere. Message-Id: <YP0khmjK2JSp77AG@scylladb.com>	2021-07-28 20:30:07 +02:00
Avi Kivity	42e1f318d7	Merge "Respect "bypass cache" in sstable index caching" from Tomasz " This series changes the behavior of the system when executing reads annotated with "bypass cache" clause in CQL. Such reads will not use nor populate the sstable partition index cache and sstable index page cache. " * 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla: sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads sstables: Do not populate partition index cache for "bypass cache" reads	2021-07-28 18:45:39 +03:00
Avi Kivity	331eb57e17	Revert "compression: define 'class' attribute for compression and deprecate 'sstable_compression'" This reverts commit `5571ef0d6d`. It causes rolling upgrade failures. Fixes #9055. Reopens #8948.	2021-07-28 14:14:22 +03:00
Pekka Enberg	ef5b2934e8	scylla_util: Use AzureSnitch on Azure Fixes #8593	2021-07-28 14:07:42 +03:00
Pekka Enberg	42e32566f6	production_snitch_base: Fallback for empty DC or rack strings Lubos Kosco points out that on Microsoft Azure, for example, it is possible for the "zone metadata" (which we use as rack information) to be empty, as shown in: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=windows#instance-metadata Therefore, protect against empty DC or rack strings in `production_snitch_base` to keep the behavior consistent across different snitches.	2021-07-28 14:07:42 +03:00
Pekka Enberg	e44fa8d806	azure_snitch: Azure snitch support This adds support for Azure snitch. The work is an adaptation of AzureSnitch for Apache Cassandra by Yoshua Wakeham: https://raw.githubusercontent.com/yoshw/cassandra/9387-trunk/src/java/org/apache/cassandra/locator/AzureSnitch.java As per Lubos' suggestion, we switched to a later API version.	2021-07-28 14:07:42 +03:00
Avi Kivity	0909e3c17d	treewide: remove redundant "x <=> 0" compares If x is of type std::strong_ordering, then "x <=> 0" is equivalent to x. These no-ops were inserted during #1449 fixes, but are now unnecessary. They have potential for harm, since they can hide an accidental conversion of the type of x to an arithmetic type, so remove them. Ref #1449.	2021-07-28 13:30:32 +03:00
Avi Kivity	70f481a1f0	test: mutation_test: convert internal tri-compare to std::strong_ordering Drop the temporary merge_container() overload we had to support tri-compares that returned int.	2021-07-28 13:30:07 +03:00
Avi Kivity	14fd886c72	utils: int_range: change to std::strong_ordering Ref #1449.	2021-07-28 13:29:50 +03:00
Avi Kivity	11fa402ecc	test: change some internal comparators to std::strong_ordering Ref #1449.	2021-07-28 13:28:51 +03:00
Avi Kivity	89bd7737f3	utils: big_decimal: change to std::strong_ordering Ref #1449.	2021-07-28 13:28:21 +03:00
Avi Kivity	59941c536c	utils: fragment_range: change to std::strong_ordering Ref #1449.	2021-07-28 13:27:49 +03:00
Avi Kivity	a180cd240f	atomic_cell: change compare_atomic_cell_for_merge() to std::strong_ordering The implementation is in database.cc for some reason. Ref #1449.	2021-07-28 13:26:27 +03:00
Avi Kivity	b866c12bc5	types: drop scaffolding erected around lexicographical_tri_compare With no more users, the int-returning variant can be dropped. Ref #1449.	2021-07-28 13:25:19 +03:00
Avi Kivity	9a2f3ac288	sstables: keys: change to std::strong_ordering internally The signature already returned std::strong_ordering, but an internal comparator returned int. Switch it, so it now uses the strong_ordering overload of lexicographical_tri_compare(). Ref #1449.	2021-07-28 13:23:13 +03:00
Avi Kivity	1b64b1a628	bytes: compare_unsigned(): change to std::strong_ordering Note that the previous implementation was broken for blobs larger than 4GB. Luckily that can't happen. Ref #1449.	2021-07-28 13:21:01 +03:00
Avi Kivity	7729ff03ad	uuid: change comparators to std::strong_ordering Ref #1449.	2021-07-28 13:20:32 +03:00
Avi Kivity	e52ebe2da5	types: convert abstract_type::compare and related to std::strong_ordering Change comparators around types to std::strong_ordering. Ref #1449.	2021-07-28 13:19:24 +03:00
Avi Kivity	b7160b74ea	types: reduce boilerplate when comparing empty value Some types have boilerplate code to check if one or both values are empty. Consolidate it in a helper to reduce noise.	2021-07-28 13:19:09 +03:00
Avi Kivity	d86e529239	serialized_tri_compare: change to std::strong_ordering Also convert a user in mutation_test. Ref #1449.	2021-07-28 13:19:00 +03:00
Avi Kivity	3653518d9e	compound_compat: change to std::strong-ordering Ref #1449.	2021-07-28 13:16:05 +03:00
Avi Kivity	1bbabb5ccc	types: change lexicographical_tri_compare, prefix_equality_tri_compare to std::strong_ordering The original signatures with `int` are retained (by calling the new signatures), until the callers are converted. Constraints are used to disambiguate. Ref #1449.	2021-07-28 13:14:46 +03:00
Avi Kivity	12f9a5462d	Merge 'repair: Drop unused partition level repair related code' from Asias He This series removes unused partition level repair related code. Closes #9105 * github.com:scylladb/scylla: repair: Drop stream plan related code locator: Add missing file.hh include in production_snitch_base repair: Drop request_transfer_ranges and do_streaming repair: Drop parallelism_semaphore	2021-07-28 11:24:30 +03:00
Benny Halevy	67d5addc09	test: mutation_reader_test: clustering_order_merger_test_generator: use explicit type for num_ranges gcc 10.3.1 spews the following error: ``` _test_generator::generate_scenario(std::mt19937&) const’: test/boost/mutation_reader_test.cc:3731:28: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare] 3731 | for (auto i = 0; i < num_ranges; ++i) { | ~~^~~~~~~~~~~~ ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210728073538.2467040-1-bhalevy@scylladb.com>	2021-07-28 11:22:59 +03:00
Asias He	91bcfba3f7	repair: Drop stream plan related code We do not use stream plan to sync data in repair anymore.	2021-07-28 11:23:42 +08:00
Asias He	860109daca	locator: Add missing file.hh include in production_snitch_base ``` clang++ build/dev/locator/production_snitch_base.o locator/production_snitch_base.cc In file included from locator/production_snitch_base.cc:41: In file included from ./locator/production_snitch_base.hh:41: In file included from /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/unordered_map:38: /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/type_traits:1329:23: error: incomplete type 'seastar::file' used in type trait expression __bool_constant<__has_trivial_destructor(_Tp)>> ^ ``` This code compiles now due to indirect include from repair.hh.	2021-07-28 11:22:37 +08:00
Asias He	f274ed6512	repair: Drop request_transfer_ranges and do_streaming They are used by partition level repair. With row level repair, we do not need them anymore.	2021-07-28 10:54:18 +08:00
Asias He	f1c08f121a	repair: Drop parallelism_semaphore It is used by partition level repair. With row level repair, we do not need it anymore.	2021-07-28 10:54:15 +08:00
Avi Kivity	77a2b4b520	test: perf: perf_simple_query: add instructions_per_op to the json-result output It's in text output, but `863b49af03` forgot to add it to the machine readable results. Closes #9017	2021-07-27 20:26:19 +02:00
Calle Wilund	59555fa363	cdc: fix broken function signature in maybe_back_insert_iterator Fixes #9103 The compare overload was declared as "bool" even though it is a tri-cmp. This causes us to never use the speed-up shortcut (lessen search set), in turn meaning more overhead for collections. Closes #9104	2021-07-27 20:37:30 +03:00
Avi Kivity	0a4884e87a	Merge "Expose immutable rows and tombstones collections" from Pavel E " While working on evicting range tombstones one of the nastiest difficulties is that mutation_partition has very loose control over adding and removing of rows and range tombstones. This is because it exposes both collections via public methods, so it's pretty easy to grab a non-const reference to either of them and modify the collection. At the same time restricting the API to return only a const reference to the collection is not possible either, since finding in or iterating over a const-referenced collection would expose const-referenced elements as well, while it can be perfectly valid to modify a single row/tombstone without touching the whole collection. In other words there's a need for an access method that guarantees that no new elements are added to the collection and no existing ones are removed from it, AND doesn't impose const on the obtained elements. The solution proposed here is the immutable_collection<> template that wraps a non-const collection reference and gives the caller only reading methods (find, lower_bound, begin, etc) so that it's guaranteed that the external user of mutation_partition won't be able to modify the collections. Those that already use the const reference on the mutation_partition itself are OK, they also use const-referenced everything. The places that do need to modify the partition's collections are thus made explicit.
tests: unit(dev) " * 'br-mutation-partition-collection-view-3' of https://github.com/xemul/scylla: mutation_partition: Return immutable collection for range tombstones mutation_partition: Pin mutable access to range tombstones mutation_partition: Return immutable collection for rows mutation_partition: Pin mutable access to rows utils: Introduce immutable_collection<> btree: Generalize some iterator methods btree: Make iterators not modify the tree itself btree tests: Dont use iterator erase mutation_partition: Shuffle declarations range_tombstone_list: Mark more methods noexcept range_tombstone_list, code: Mark external_memory_usage noexcept	2021-07-27 20:34:04 +03:00
Avi Kivity	a38b1006d1	cql3: expression: update find_atom, count_if for function_call, cast, field_selection The combination of the new types and these functions cannot happen yet, but as they are generic functions it is better to implement them in case it becomes possible later.	2021-07-27 20:16:43 +03:00
Avi Kivity	2b7b9bb469	cql3: expressions: fix printing of nested expressions Now that we eliminated cql3::selectable::raw, we can print nested expressions.	2021-07-27 20:16:29 +03:00
Avi Kivity	98c4f0dfb3	cql3: selection: replace selectable::raw with expression Now that all selectable::raw subclasses have been converted to cql3::selectable::with_expression::raw, the class structure is just a wrapper around expressions. Peel it, converting the virtual member functions to free functions, and replacing object instances with expression or nested_expression as the case allows.	2021-07-27 20:16:15 +03:00
Avi Kivity	979010a1e5	cql3: expression: convert selectable::with_field_selection::raw to expression Add a field_selection variant element to expression. Like function_call and cast, the structure from which a field is selected cannot yet be an expression, since not all selectable::raw:s are converted. This will be done in a later pass. This is also why printing a field selection now does not print the selected expression; this will also be corrected later.	2021-07-27 20:16:12 +03:00
Avi Kivity	714b812212	cql3: expression: convert selectable::with_cast::raw to expression Add a cast variant element to expression. Like function_call, the argument being converted cannot yet be an expression, since not all selectable::raw:s are converted. This will be done in a later pass. This is also why printing a cast now does not print the casted expression; this will also be corrected later.	2021-07-27 20:14:52 +03:00
Avi Kivity	5adae5837e	cql3: expression: convert selectable::with_anonymous_function::raw to expression Rather than creating a new variant element in expression, we extend function_call to handle both named and anonymous functions, since most of the processing is the same.	2021-07-27 20:13:55 +03:00
Avi Kivity	3e392d2513	cql3: expression: convert selectable::with_function_call::raw to expressions Add a function_call variant element to hold function calls. Note that because not all selectables are yet converted, function call arguments are still of type selectable::raw. They will be converted to expressions later. This is also why printing a function now does not print its arguments; this will also be corrected later.	2021-07-27 20:13:51 +03:00
Avi Kivity	a56787d95e	cql3: selectable: make selectable::raw forward-declarable As temporary scaffolding while we're converting selectable::raw subclasses to expressions, we'll need expressions to refer to selectable::raw (specifically, function call arguments, which will end up as expressions as well). To avoid a #include loop, make selectable::raw forward-declarable by moving it to namespace scope.	2021-07-27 20:10:54 +03:00
Avi Kivity	ff65c54316	cql3: expressions: convert writetime_or_ttl::raw to expression Create a new element in the expression variant, column_mutation_attribute, signifying we're picking up an attribute of a column mutation (not a column value!). We use an enum rather than a bool to choose between writetime and ttl (the two mutation attributes) for increased explicitness. Although there can only be one type for the column we're operating on (it must be an unresolved_identifier), we use a nested_expression. This is because we'll later need to also support a column_value as the column type after we prepare it. This is somewhat similar to the address-of operator in C, which syntactically takes any expression but semantically operates only on lvalues.	2021-07-27 20:10:52 +03:00
Avi Kivity	294f0f35b1	cql3: expression: add convenience constructor from expression element to nested expression It is convenient to initialize a nested_expression variable from one of the types that compose the expression variant, but C++ doesn't allow it. Add a constructor that does this. Use the new variant_element concept to constrain the input to be one of the variant's elements.	2021-07-27 20:08:48 +03:00
Avi Kivity	636b133cbc	utils: introduce variant_element.hh A type trait (is_variant_element) and a concept (VariantElement) that tell if a type T is a member of a variant or not. It can be used even if the variant's elements are not yet defined (just forward-declared).	2021-07-27 20:08:47 +03:00
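A minimal sketch of such a membership trait, assuming `std::variant` and a hypothetical name `is_variant_element_sketch`; this is not the code from `variant_element.hh`, and the concept mentioned in the message is left out so that the sketch stays C++17-compatible.

```cpp
#include <type_traits>
#include <variant>

// Primary template, left undefined: only std::variant arguments are handled.
template <typename T, typename Variant>
struct is_variant_element_sketch;

// T is an element if it is the same type as any of the alternatives.
// std::is_same does not require complete types, so the alternatives
// may be merely forward-declared.
template <typename T, typename... Ts>
struct is_variant_element_sketch<T, std::variant<Ts...>>
    : std::disjunction<std::is_same<T, Ts>...> {};

struct later_defined;  // forward declaration only, never defined here

using example_variant = std::variant<int, later_defined>;

static_assert(is_variant_element_sketch<int, example_variant>::value,
              "int is an alternative");
static_assert(is_variant_element_sketch<later_defined, example_variant>::value,
              "an incomplete alternative is still recognised");
static_assert(!is_variant_element_sketch<double, example_variant>::value,
              "double is not an alternative");
```

Naming `std::variant<int, later_defined>` does not instantiate the variant, which is why the trait works before the element types are defined.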
Avi Kivity	ac3b093e3c	cql3: expression: use nested_expression in binary_operator binary_operator::lhs is implementing the pattern in nested_expression. Use nested_expression instead to reduce code size.	2021-07-27 20:08:34 +03:00
Avi Kivity	b07a0867b3	cql3: expression: introduce nested_expression class The expression type cannot be a member of a struct that is an element of the expression variant. This is because it would then be required to contain itself. So introduce a nested_expression type to indirectly hold an expression, but keep the value semantics we expect from expressions: it is copyable and a copy has separate identity and storage. In fact binary_operator had to resort to this trick, so it's converted to nested_expression in the next patch.	2021-07-27 20:08:21 +03:00
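A minimal sketch of the indirection described above, assuming a toy expression with two alternatives. The names `expr`, `nested_expr`, `constant`, `negate` and `evaluate` are hypothetical and much simpler than the cql3 types.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>

struct expr;  // incomplete here; completed below

// Holds an expr on the heap but copies deeply, so it behaves like a value.
class nested_expr {
    std::unique_ptr<expr> _e;  // null only after being moved from
public:
    nested_expr(expr e);
    nested_expr(const nested_expr& o);
    nested_expr(nested_expr&&) noexcept;
    nested_expr& operator=(const nested_expr& o);
    nested_expr& operator=(nested_expr&&) noexcept;
    ~nested_expr();
    expr& operator*();
    const expr& operator*() const;
};

struct constant { int value; };
struct negate { nested_expr operand; };  // could not hold an expr directly

struct expr {
    std::variant<constant, negate> v;
};

// Member definitions come after expr is complete.
nested_expr::nested_expr(expr e) : _e(std::make_unique<expr>(std::move(e))) {}
nested_expr::nested_expr(const nested_expr& o) : _e(std::make_unique<expr>(*o._e)) {}
nested_expr::nested_expr(nested_expr&&) noexcept = default;
nested_expr& nested_expr::operator=(const nested_expr& o) {
    if (this != &o) {
        _e = std::make_unique<expr>(*o._e);
    }
    return *this;
}
nested_expr& nested_expr::operator=(nested_expr&&) noexcept = default;
nested_expr::~nested_expr() = default;
expr& nested_expr::operator*() { return *_e; }
const expr& nested_expr::operator*() const { return *_e; }

// Recursively evaluates the toy expression.
int evaluate(const expr& e) {
    if (auto* c = std::get_if<constant>(&e.v)) {
        return c->value;
    }
    return -evaluate(*std::get<negate>(e.v).operand);
}

// Copies have separate storage: editing the copy leaves the original alone.
inline void check_nested_expr_value_semantics() {
    expr original{negate{nested_expr{expr{constant{5}}}}};
    expr copy = original;
    std::get<constant>((*std::get<negate>(copy.v).operand).v).value = 7;
    assert(evaluate(original) == -5);
    assert(evaluate(copy) == -7);
}
```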
Avi Kivity	8a518e9c78	Convert column_identifier_raw's use as selectable to expressions Introduce unresolved_identifier as an unprepared counterpart to column_value. column_identifier_raw no longer inherits from selectable::raw, but keeps its methods for now to reduce churn.	2021-07-27 20:08:15 +03:00
Pavel Emelyanov	b3c89787be	mutation_partition: Return immutable collection for range tombstones Patch the .row_tombstones() to return the range_tombstone_list wrapped into the immutable_collection<> so that callers are guaranteed not to touch the collection itself, but still can modify the tombstones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	1bf643d4fd	mutation_partition: Pin mutable access to range tombstones Some callers of mutation_partition::row_tombstones() don't want to (and shouldn't) modify the list itself, while they may want to modify the tombstones. This patch explicitly locates those that need to modify the collection, because the next patch will return an immutable collection for the others. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	05b8cdfd24	mutation_partition: Return immutable collection for rows Patch the .clustered_rows() method to return the btree of rows wrapped into the immutable_collection<> so that callers are guaranteed not to touch the collection itself, but still can modify the elements in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	ad27bf40e6	mutation_partition: Pin mutable access to rows Some callers of mutation_partition::clustered_rows() don't want to (and shouldn't) modify the tree of rows, while they may want to modify the rows themselves. This patch explicitly locates those that need to modify the collection, because the next patch will return an immutable collection for the others. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	c2a36f5668	utils: Introduce immutable_collection<> Working with collections can be done via const and non-const references. In the former case the collection can only be read from (find, iterate, etc.); in the latter it's possible to alter the collection (erase elements from it or insert them into it). Also, the const-ness of the collection reference is transparently inherited by the returned _elements_ of the collection, so with a const reference to a collection it's impossible to modify the found element. This patch introduces an immutable_collection: a wrapper over an arbitrary collection that makes sure the collection itself is not modified, while the elements obtained from it can be non-const. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
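A minimal sketch of the wrapper idea, assuming a standard container and a hypothetical name `immutable_collection_sketch`. The real template in utils may expose a different set of methods (the merge message lists find, lower_bound, begin, etc).

```cpp
#include <cassert>
#include <vector>

// Wraps a non-const collection but exposes only reading methods.
// Callers cannot insert or erase, yet the iterators are the collection's
// non-const iterators, so individual elements stay modifiable.
template <typename Collection>
class immutable_collection_sketch {
    Collection* _col;
public:
    explicit immutable_collection_sketch(Collection& c) noexcept : _col(&c) {}

    auto begin() const { return _col->begin(); }
    auto end() const { return _col->end(); }
    auto size() const { return _col->size(); }
    bool empty() const { return _col->empty(); }
};

// Elements can be changed through the view; the element count cannot.
inline void check_immutable_collection_sketch() {
    std::vector<int> rows{1, 2, 3};
    immutable_collection_sketch<std::vector<int>> view(rows);
    for (auto& r : view) {
        r *= 10;  // modifies an element, not the collection
    }
    assert(rows[0] == 10 && rows[1] == 20 && rows[2] == 30);
    assert(view.size() == 3);
    // view.push_back(4);  // would not compile: no mutating methods exposed
}
```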
Pavel Emelyanov	d1c693473a	btree: Generalize some iterator methods The non-const iterator has a constructor from a key pointer and the tree_if_singular method. There's no reason why these two are absent in the const_iterator. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	6ef27c9fa1	btree: Make iterators not modify the tree itself The const_iterator cannot modify anything, but the plain iterator has public methods to remove the key from the tree. To control how the tree is modified this method must be marked private and modification by iterator should come from somewhere else. This somewhere else is the existing key_grabber that's already used to move keys between trees. Generalize this ability to move a key out of a tree (i.e. -- erase). Once done -- mark the iterator::erase_and_dispose private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	e652b03b4e	btree tests: Dont use iterator erase Next patches will mark btree::iterator methods that modify the tree itself as private, so stop using them in tests. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	a9b4fa9db3	mutation_partition: Shuffle declarations Its methods that provide access to enclosed collections of rows and range tombstones are intermixed, so group them for smoother next patching and mark noexcept while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	9122f4129d	range_tombstone_list: Mark more methods noexcept Those returning iterators and size for the underlying collection of range tombstones are all non-throwing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	0f53e83a8e	range_tombstone_list, code: Mark external_memory_usage noexcept The range_tombstone_list's method is at the top of the stack of calls each not throwing anything, so do the deep-dive noexcept marking. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Avi Kivity	d3e8c05bed	make column_identifier::raw forward declarable Otherwise we run into a #include loop when try to have an expression with column_identifier::raw: expression.hh -> column_identifier.hh -> selectable.hh -> expression.hh.	2021-07-27 20:00:48 +03:00
Avi Kivity	0e30a78573	cql3: introduce selectable::with_expression::raw Prepare to migrate selectable::raw sub-classes to expressions by creating a bridge between the two types. with_expression::raw is a selectable::raw and implements all its methods (right now, trivially), and its content is an expression. The methods are implemented using the usual visitor pattern.	2021-07-27 20:00:48 +03:00
Benny Halevy	3a4e4f9914	compaction: to_string: handle invalid values as internal error Although the switch in `to_string(compaction_options::scrub::mode)` covers all possible cases, gcc 10.3.1 warns about: ``` sstables/compaction.cc: In function ‘std::string_view sstables::to_string(sstables::compaction_options::scrub::mode)’: sstables/compaction.cc:95:1: error: control reaches end of non-void function [-Werror=return-type] ``` Adding __builtin_unreachable(), as in `to_string(compaction_type)`, does calm the compiler down, but it might cause undefined behavior in the future in case the switch won't cover all cases, or the passed value is corrupt somehow. Instead, call on_internal_error_noexcept to report the error and abort if configured to do so; otherwise, just return an "(invalid)" string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210727130251.2283068-1-bhalevy@scylladb.com>	2021-07-27 16:04:09 +03:00
Avi Kivity	df4d77e857	table: simplify generate_and_propagate_view_updates exception handling We have both try/catch and handle_exception() to ignore exceptions. Try/catch is enough, so remove handle_exception(). Closes #9011	2021-07-27 14:08:30 +02:00
Avi Kivity	f86e65b4e7	Merge "Fix quadratic behavior in memtable/row_cache with lots of range tombstones" from Tomasz " This series fixes two issues which cause very poor efficiency of reads when there is a lot of range tombstones per live row in a partition. The first issue is in the row_cache reader. Before the patch, all range tombstones up to the next row were copied into a vector, and then put into the buffer until it's full. This would get quadratic if there is much more range tombstones than fit in a buffer. The fix is to avoid the accumulation of all tombstones in the vector and invoke the callback instead, which stops the iteration as soon as the buffer is full. Fixes #2581. The second, similar issue was in the memtable reader. Tests: - unit (dev) - perf_row_cache_update (release) " * tag 'no-quadratic-rt-in-reads-v1' of github.com:tgrabiec/scylla: test: perf_row_cache_update: Uncomment test case for lots of range tombstones row_cache: Consume range tombstones incrementally partition_snapshot_reader: Avoid quadratic behavior with lots of range tombstones tests: mvcc: Relax monotonicity check range_tombstone_stream: Introduce peek_next()	2021-07-27 14:39:13 +03:00
Avi Kivity	05d22d27a8	Merge "Cut repair->storage-service link" from Pavel E " It exists in the node-ops handler which is registered by repair code, but is handled by storage service. Probably, the whole node-ops handler should instead be moved into repair, but this looks like rather huge rework. So instead -- put the node-ops verb registration inside the storage-service. This removes some more calls for global storage service instance and allows slight optimization of node-ops cross-shards calls. tests: unit(dev), start-stop " * 'br-remove-storage-service-from-nodeops' of https://github.com/xemul/scylla: storage_service: Replace globals with locals storage_service: Remove one extra hop of node-ops handler storage_service: Fix indentation after previous patch storage_service: Move cross-shard hop up the stack repair: Drop empty verbs reg/unreg methods repair, storage_service: Move nodeops reg/unreg to storage service repair: Coroutinize row-level start/stop	2021-07-27 13:27:27 +03:00
Takuya ASADA	fdc786b451	install.sh: add supervisor support Bring supervisor support from dist/docker to install.sh, make it installable from relocatable package. This enables to use supervisor with nonroot / offline environment, and also make relocatable package able to run in Docker environment. Related #8849 Closes #8918	2021-07-27 12:51:29 +03:00
Takuya ASADA	42fd73d033	scylla_setup: add RAID5 support This supports optional RAID5 support on scylla_setup. Fixes #9076 Closes #9093	2021-07-27 12:49:29 +03:00
Avi Kivity	2cca461652	Merge 'sstables: merge row consumer interfaces with implementations' from Wojciech Mitros This patch follows #9002, further reducing the complexity of the sstable readers. The split between row consumer interfaces and implementations has been first added in 2015, and there is no reason to create new implementations anymore. By merging those classes, we achieve a sizeable reduction in sstable reader length and complexity. Refs #7952 Tests: unit(dev) Closes #9073 * github.com:scylladb/scylla: sstables: merge row_consumer into mp_row_consumer_k_l sstables: move kl row_consumer sstables: merge consumer_m into mp_row_consumer_m sstables: move mp_row_consumer_m	2021-07-27 12:23:29 +03:00
Benny Halevy	424c53d5b1	mutation_fragment_stream_validator: disambiguate schema member definition gcc 10.3.1 complains that: ``` ./mutation_fragment_stream_validator.hh:39:21: error: declaration of ‘const schema& mutation_fragment_stream_validator::schema() const’ changes meaning of ‘schema’ [-fpermissive] 39 \| const ::schema& schema() const { return _schema; } \| ^~~~~~ ``` Defining the _schema member as `::schema` rather than just `schema` calms the compiler down. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210727073941.1999909-1-bhalevy@scylladb.com>	2021-07-27 11:55:42 +03:00
Pavel Emelyanov	ca2dfac7d7	test: Split test_multishard_combining_reader_as_mutation_source into 3 There are 3 independent cases in this test that benefit from running in parallel. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 09:29:20 +03:00
Pavel Emelyanov	e184ed2b9c	test: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 09:29:12 +03:00
Pavel Emelyanov	3e979a20ea	test: Move out internals of test_multishard_combining_reader_as_mutation_source Preparation. They will be called from 3 independent cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 09:28:12 +03:00
Nadav Har'El	8030461a2c	cql-pytest: translate Cassandra's misc. type tests This is a translation of Cassandra's CQL unit test source file validation/entities/TypeTest.java into our cql-pytest framework. This is a tiny test file, with only four tests which apparently didn't find their place in other source files. All four tests pass on Cassandra, and all but one pass on Scylla - the test marked xfail discovered one previously-unknown incompatibility with Cassandra: Refs #9082: DROP TYPE IF EXISTS shouldn't fail on non-existent keyspace Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210726140934.1479443-1-nyh@scylladb.com>	2021-07-27 08:28:16 +03:00
Tomasz Grabiec	7578cef0a4	test: perf_row_cache_update: Uncomment test case for lots of range tombstones	2021-07-26 21:38:00 +02:00
Gleb Natapov	56d0f711e8	serialize: allow use non copyable types with std::variant Message-Id: <20210720120935.710549-3-gleb@scylladb.com>	2021-07-26 19:09:19 +03:00
Gleb Natapov	63025a75b2	serialize: allow use non copyable types with std::optional Message-Id: <20210720120935.710549-2-gleb@scylladb.com>	2021-07-26 19:09:19 +03:00
Avi Kivity	8a80e455fb	sstables: keys: convert trichotomic comparisons to std::strong_ordering Prevent accidental conversions to bool from yielding the wrong results. Unprepared users (that converted to bool, or assigned to int) are adjusted. Ref #1449 Test: unit (dev) Closes #9088	2021-07-26 19:09:19 +03:00
Nadav Har'El	d3a715e0ff	Update seastar submodule * seastar 93d053cd...ce3cc268 (4): > doc: update coroutine exception paragraph with make_exception > coroutine: add make_exception helper > coroutine: use std::move for forwarding exception_ptr > doc: tutorial: document direct exception propagation With the new throw-less coroutine exception support, we can modify some of Scylla's new coroutine code to generate exceptions a bit more efficiently, without actually throwing an exception.	2021-07-26 19:09:19 +03:00
Tomasz Grabiec	2d18360157	row_cache: Consume range tombstones incrementally Before the patch, all range tombstones up to the next row were copied into a vector, and then put into the buffer until it's full. This would get quadratic if there are many more range tombstones than fit in a buffer. The fix is to avoid the accumulation of all tombstones in the vector and invoke the callback instead, which stops the iteration as soon as the buffer is full. Fixes #2581.	2021-07-26 17:48:05 +02:00
Tomasz Grabiec	e74c3c885e	partition_snapshot_reader: Avoid quadratic behavior with lots of range tombstones next_range_tombstone() was populating _rt_stream on each invocation from the current iterator ranges in _range_tombstones. If there is a lot of range tombstones, all would be put into _rt_stream. One problem is that this can cause a reactor stall. Fix by more incremental approach where we populate _rt_stream with minimal amount on each invocation of next_range_tombstone(). Another problem is that this can get quadratic. The iterators in _range_tombstones are advanced, but if lsa invalidates them across calls they can revert back to the front since they go back to _last_rt, which is the last consumed range tombstone, and if the buffer fills up, not all tombstones from _rt_stream could be consumed. The new code doesn't have this problem because everything which is produced out of the iterators in _range_tombstones is produced only once. What we put into _rt_stream is consumed first before we try to feed the _rt_stream with more data.	2021-07-26 17:48:05 +02:00
Tomasz Grabiec	0d7b3f9463	tests: mvcc: Relax monotonicity check Consecutive range tombstones can have the same position. They will, in one of the test cases, after the range tombstone merger in partition_snapshot_flat_reader no longer uses range_tombstone_list to merge data from multiple versions, which deoverlaps, but rather merges the streams corresponding to each version, which interleaves range tombstones from different versions.	2021-07-26 17:27:03 +02:00
Piotr Sarna	ac7e6028a5	system_keyspace: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:05:52 +02:00
Piotr Sarna	55cd46154c	sstables: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:05:51 +02:00
Piotr Sarna	4de751c8c8	storage_proxy: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:05:15 +02:00
Piotr Sarna	776ab4bcb1	multishard_mutation_query: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:05:14 +02:00
Piotr Sarna	101eb26171	client_state: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:04:28 +02:00
Piotr Sarna	e5925d4980	flat_mutation_reader: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:04:20 +02:00
Piotr Sarna	26ae74524a	table: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:04:18 +02:00
Piotr Sarna	3b37d75956	commitlog: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:03:41 +02:00
Piotr Sarna	6e994ce7c2	compaction: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:03:06 +02:00
Piotr Sarna	66c4d58a8c	database: pass exceptions without throwing In order to avoid needless throwing, exceptions are passed directly wherever possible. Two mechanisms which help with that are: 1. make_exception_future<> for futures 2. co_return coroutine::exception(...) for coroutines which return future<T> (the mechanism does not work for future<> without parameters, unfortunately)	2021-07-26 17:02:36 +02:00
Tomasz Grabiec	91868cf0cd	range_tombstone_stream: Introduce peek_next()	2021-07-26 13:33:34 +02:00
Pavel Emelyanov	11a2709f10	storage_service: Replace globals with locals The node-ops verb handler is the lambda of storage-service and it can stop using global storage service instance for no extra charge. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:30 +03:00
Pavel Emelyanov	6e56671d9e	storage_service: Remove one extra hop of node-ops handler It's now clear that the verb handler goes to some "random" shard, then immediately switches to shard-0 and then does the handling. Avoid the extra hop and go to shard-0 right away. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:30 +03:00
Pavel Emelyanov	b6315d3af7	storage_service: Fix indentation after previous patch And, while at it, s/ss/this/g and drop the ss variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:30 +03:00
Pavel Emelyanov	f5fad311cf	storage_service: Move cross-shard hop up the stack The storage_service::node_ops_cmd_handler runs inside a huge invoke_on(0, ...) lambda. Make it be called on shard-0. This is the preparation for next two patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:30 +03:00
Pavel Emelyanov	eb55c252c9	repair: Drop empty verbs reg/unreg methods Those in repair.cc's are now noops, so remove them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:30 +03:00
Pavel Emelyanov	a09586a237	repair, storage_service: Move nodeops reg/unreg to storage service The storage service is the verb sender, so it must be the verb registrator. Another goal of this patch is to allow removal of repair -> storage_service dependency. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:21 +03:00
Pavel Emelyanov	18397a5e0a	repair: Coroutinize row-level start/stop Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-26 14:21:21 +03:00
Piotr Jastrzebski	90a607e844	api: use proper type to reduce partition count Partition count is of a type size_t but we use std::plus<int> to reduce values of partition count in various column families. This patch changes the argument of std::plus to the right type. Using std::plus<int> for size_t compiles but does not work as expected. For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code would probably want 2147483649. Fixes #9090 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #9074	2021-07-26 11:53:06 +03:00
Nadav Har'El	b503ec36c2	cql-pytest: translate Cassandra's tests for tuples This is a translation of Cassandra's CQL unit test source file validation/entities/TupleTypeTest.java into our cql-pytest framework. This test file has a few tests on various features of tuples. Unfortunately, some of the tests could not be easily translated into Python so were left commented out: Some tests try to send invalid input to the server which the Python driver "helpfully" forbids; Two tests used an external testing library "QuickTheories" and are the only two tests in the Cassandra test suite to use this library - so it's not worthwhile to translate it to Python. 11 tests remain, all of them pass on Cassandra, and just one fails on Scylla (so marked xfail for now), reproducing one known issue: Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax Actually, += is not supposed to be supported on tuple columns anyway, but should print the appropriate error - not the syntax error we get now as the "+=" feature is not supported at all. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210722201900.1442391-1-nyh@scylladb.com>	2021-07-26 08:20:12 +03:00
Benny Halevy	8674746fdd	flat_mutation_reader: detach_buffer: mark as noexcept Since detach_buffer is used before closing and destroying the reader, we want to mark it as noexcept to simplify the caller's error handling. Currently, although it does construct a new circular_buffer, none of the constructors used may throw. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210617114240.1294501-2-bhalevy@scylladb.com>	2021-07-25 12:02:27 +03:00
Benny Halevy	0e31cdf367	flat_mutation_reader: detach_buffer: clarify buffer constructor detach_buffer exchanges the current _buffer with a new buffer constructed using the circular_buffer(Alloc) constructor. The compiler implicitly constructs a tracking_allocator(reader_permit) and passes it to the circular_buffer constructor. This patch just makes that explicit so it would be clearer to the reader what's going on here. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210617114240.1294501-1-bhalevy@scylladb.com>	2021-07-25 11:59:37 +03:00
Pavel Solodovnikov	bcbcc18aa1	raft: raft_sys_table_storage: fix broken `load_snapshot` and `load_term_and_vote` Loading snapshot id and term + vote involve selecting static fields from the "system.raft" table, constrained by a given group id. The code incorrectly assumes that, for example, `SELECT snapshot_id FROM raft WHERE group_id=?` in `load_snapshot` always returns only one row. This is not true, since this will return a row for each (pk, ck) combination, which is (group_id, index) for "system.raft" table. The same applies for the `load_term_and_vote`, which selects static `vote_term` and `vote` from "system.raft". This results in a crash at node startup when there is a non-empty raft log containing more than one entry for a given `group_id`. Restrict the selection to always return one row by applying `LIMIT 1` clause. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210723183232.742083-1-pa.solodovnikov@scylladb.com>	2021-07-25 02:01:34 +02:00
Pavel Solodovnikov	49ddd269ea	cql3: rename `variable_specifications` to `prepare_context` The class is repurposed to be more generic and also be able to hold additional metadata related to function calls within a CQL statement. Rename all methods appropriately. Visitor functions in AST nodes (`collect_marker_specification`) are also renamed to a more generic `fill_prepare_context`. The name `prepare_context` designates that this metadata structure is a byproduct of `stmt::raw::prepare()` call and is needed only for "prepare" step of query execution. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-24 14:33:33 +03:00
Nadav Har'El	ec5e4c338b	cql: fix undefined behavior in timestamp verification Commit `2150c0f7a2` proposed by issue #5619 added a limitation that USING TIMESTAMP cannot be more than 3 days into the future. But the actual code used to check it, timestamp - now > MAX_DIFFERENCE only makes sense for positive timestamps. For negative timestamps, which are allowed in Cassandra, the difference "timestamp - now" might overflow the signed integer and the result is undefined - leading to the undefined-behavior sanitizer to complain as reported in issue #8895. Beyond the sanitizer, in practice, on my test setup, the timestamp -2^63+1 causes such overflow, which causes the above if() to make the nonsensical statement that the timestamp is more than 3 days into the future. This patch assumes that negative timestamps of any magnitude are still allowed (as they are in Cassandra), and fixes the above if() to only check timestamps which are in the future (timestamp > now). We also add a cql-pytest test for negative timestamps, passing on both Cassandra and Scylla (after this patch - it failed before, and also reported sanitizer errors in the debug build). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210621141255.309485-1-nyh@scylladb.com>	2021-07-24 11:01:08 +03:00
Tomasz Grabiec	b044db863f	Merge 'db/virtual_table: Streaming tables for large data + describe_ring example table' from Juliusz Stasiewicz This is the 2nd PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). This one introduces a new implementation of the virtual tables, the streaming tables, which are suitable for large amounts of data. This PR was created by @jul-stas and @StarostaGit Closes #8961 * github.com:scylladb/scylla: test/boost: run_mutation_source_tests on streaming virtual table system_keyspace: Introduce describe_ring table as virtual_table storage_service: Pass the reference down to system_keyspace endpoint_details: store `_host` as `gms::inet_address` queue_reader: implement next_partition() virtual_tables: Introduce streaming_virtual_table flat_mutation_reader: Add a new filtering reader factory method	2021-07-23 18:05:51 +02:00
Gleb Natapov	f0047bd749	raft: apply snapshots in applier_fiber We want to serialize snapshot application with command application, otherwise a command may be applied after a snapshot that already contains the result of its application (this is not necessarily a problem, since raft by itself does not guarantee apply-once semantics, but it is better to prevent it when possible). This also moves all interactions with the user's state machine into one place. Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>	2021-07-23 18:05:38 +02:00
Avi Kivity	aaf35b5ac2	Merge "Remove storage-service from transport (and a bit more)" from Pavel E " The cql-server -> storage-service dependency comes from the server's event_notifier which (un)subscribes on the lifecycle events that come from the storage service. To break this link the same trick as with migration manager notifications is used -- the notification engine is split out of the storage service and then is pushed directly into both -- the listeners (to (un)subscribe) and the storage service (to notify). tests: unit(dev), dtest(simple_boot_shutdown, dev) manual({ start/stop, with/without started transport, nodetool enable-/disablebinary } in various combinations, dev) " * 'br-remove-storage-service-from-transport' of https://github.com/xemul/scylla: transport.controller: Brushup cql_server declarations code: Remove storage-service header from irrelevant places storage_service: Remove (unlifecycle) subscribe methods transport: Use local notifier to (un)subscribe server transport: Keep lifecycle notifier sharded reference main: Use local lifecycle notifier to (un)subscribe listeners main, tests: Push notifier through storage service storage_service: Move notification core into dedicated class storage_service: Split lifecycle notification code transport, generic_server: Remove no longer used functionality transport: (Un)Subscribe cql_server::event_notifier from controller tests: Remove storage service from manual gossiper test	2021-07-22 19:27:45 +03:00
Pavel Emelyanov	b1bb00a95c	transport.controller: Brushup cql_server declarations The controller code sits in the cql_transport namespace and can omit its mentionings. Also the seastar::distributed<> is replaced with modern seastar::sharded<> while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:50:57 +03:00
Pavel Emelyanov	c39f04fa6f	code: Remove storage-service header from irrelevant places Some .cc files over the code include the storage service for no real need. Drop the header and include (in some) what's really needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:50:19 +03:00
Pavel Emelyanov	e711bfbb7e	storage_service: Remove (unlifecycle) subscribe methods All the listeners now use main-local notifier instance directly and these methods become unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:49:35 +03:00
Pavel Emelyanov	65b1bb8302	transport: Use local notifier to (un)subscribe server Now the controller has the lifecycle notifier reference and can stop using storage service to manage the subscription. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:48:58 +03:00
Pavel Emelyanov	5f99eeb35e	transport: Keep lifecycle notifier sharded reference It's needed to (un)subscribe server on it (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:48:20 +03:00
Pavel Emelyanov	2a30cb1664	main: Use local lifecycle notifier to (un)subscribe listeners The storage proxy and sl-manager get subscribed on lifecycle events with the help of storage service. Now when the notifier lives in main() they can use it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:47:15 +03:00
Pavel Emelyanov	8248bc9e33	main, tests: Push notifier through storage service Now it's time to move the lifecycle notifier from storage service to the main's scope. Next patches will remove the $lifecycle-subscriber -> storage_service dependency. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:45:51 +03:00
Pavel Emelyanov	6b3b01d9a6	storage_service: Move notification core into dedicated class Introduce the endpoint_lifecycle_notifier class that's in charge of keeping track of subscribers and notifying them. The subscribers will thus be able to set and unset their subscription without the need to mess with storage service at all. The storage_service for now keeps the notifier on board, but this is going to change in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:44:02 +03:00
Pavel Emelyanov	7e8a032013	storage_service: Split lifecycle notification code This prepares the ground for moving the notification engine into own class like it was done for migration_notifier some time ago. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:43:14 +03:00
Pavel Emelyanov	c7b0b25494	transport, generic_server: Remove no longer used functionality After subscription management was moved onto controller level a bunch of code can be dropped: - passing migration notifier beyond controller - event_notifier's _stopped bit - event_notifier .stop() method - event_notifier empty constructor and destructor - generic_server's on_stop virtual method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:41:32 +03:00
Pavel Emelyanov	1acef41626	transport: (Un)Subscribe cql_server::event_notifier from controller There's a migration notifier that's carried through cql_server _just_ to let event-notifier (un)subscribe on it. Also there's a call for global storage-service in there which will need to be replaced with yet another pass-through argument which is not great. It's easier to establish this subscription outside of cql_server like it's currently done for proxy and sl-manager. In case of cql_server the "outside" is the controller. This patch just moves the subscription management from cql_server to controller, next two patches will make more use of this change. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:37:23 +03:00
Pavel Emelyanov	b57fb0aa9a	tests: Remove storage service from manual gossiper test It's not needed there, gossiper starts and works without it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:36:28 +03:00
Yaron Kaikov	a004b1da30	scylla_util: add AWS ARM-based instances to supported list Today we have a Scylla AMI image based on the x86 architecture only. Following the work we did in https://github.com/scylladb/scylla-machine-image/pull/153 we can build an ARM-based AMI image. Let's add ARM-based instances to the supported list. Closes #9064	2021-07-22 15:48:29 +03:00
Avi Kivity	d0d42891e9	Merge 'Harden batchlog_manager stop and call from main in deferred action' from Benny Halevy This PR contains the parts relevant to batchlog_manager stop in #8998 without adding a gate to the storage_proxy for synchronization with on-going queries in storage_proxy::drain_on_shutdown. As explained in #9009, we see that the batchlog_manager isn't stopped if scylla shuts down during startup, e.g. when waiting for gossip to settle, since currently the batchlog_manager is stopped only from `storage_service::do_drain`, while `storage_service::drain_on_shutdown` deferred shutdown is installed only later on: `222ef17305/main.cc (L1419-L1421)` Fixes #9009 Test: unit(dev) DTest: compact_storage_tests.py:TestCompactStorage.wide_row_test paging_test:TestPagingDatasetChanges.test_cell_TTL_expiry_during_paging update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test (dev) Closes #9010 * github.com:scylladb/scylla: main: add deferred stop of batchlog_manager batchlog_manager: refactor drain out of stop batchlog_manager: stop: break _sem on shard 0 batchlog_manager: stop: use abort_source to abort batchlog_replay_loop batchlog_manager: do_batch_log_replay: hold _gate	2021-07-22 15:47:29 +03:00
Piotr Sarna	ea3d9baa5a	Update seastar submodule * seastar 388ee307...93d053cd (5): > doc: tutorial: document seastar::coroutine::all() > doc: tutorial: nest "exceptions in coroutines" under "coroutines" > coroutine: add a way of propagating exceptions without throwing > input_stream: Fix read_exactly(n) incorrectly skipping data > coroutines: introduce all() template for waiting for multiple futures	2021-07-22 12:29:28 +02:00
Piotr Sarna	e9d26dd7ed	utils/coroutine: wrap a helper in utils namespace The class name `coroutine` became problematic since seastar introduced it as a namespace for coroutine helpers. To avoid a clash, the class from scylla is wrapped in a separate namespace. Without this patch, Seastar submodule update fails to compile. Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>	2021-07-22 13:28:43 +03:00
Piotr Sarna	526ad2a151	Merge 'secondary_index: Fix TOKEN() restrictions in indexed SELECTs' from Jan Ciołek This is a rewrite of an old PR: #7582 `TOKEN()` restrictions don't work properly when a query uses an index. For example this returns both rows: ```cql CREATE TABLE t(pk int, ck int, v int, PRIMARY KEY(pk, ck)); CREATE INDEX ON t(v); INSERT INTO t (pk, ck, v) VALUES (0, 0, 0); INSERT INTO t (pk, ck, v) VALUES (1, 0, 0); SELECT token(pk), pk, ck, v FROM t WHERE v = 0 AND token(pk) = token(0) ALLOW FILTERING; ``` This functionality is supported on both old and new indexes. In old indexes the type of the token column was `blob`. This causes problems, because `blob` representation of tokens is ordered differently. Tokens represented as blobs are ordered like this: ``` 0, 1, 2, 3, 4, 5, ..., bigint_max, bigint_min, ...., -5, -4, -3, -2, -1 ``` Because of that clustering range for `token()` restrictions needs to be translated to two clustering ranges on the `blob` column. To create old indexes disable the feature called: `CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX` or run scylla version from branch [`cvybhu/si-token2-old-index`](https://github.com/cvybhu/scylla/commits/si-token2-old-index) I'm not sure if it's possible to create automatic tests with old indexes. I ran `dev-test` manually on the `si-token2-old-index` branch, and the only tests that failed were the ones testing row ordering. Rows should be ordered by `token`, but because in old indexes the token is represented as a `blob` this ordering breaks. This is a known issue (#7443), that has been fixed by introducing new indexes. To sum up: * `token()` restrictions are fixed on both new and old indexes. * When using old indexes, the rows are not properly ordered by token. * With new indexes the rows are properly ordered by token. 
Fixes #7043 Closes #9067 * github.com:scylladb/scylla: tests: add secondary index tests with TOKEN clause secondary_index_test: extract test data secondary_index: Fix TOKEN() restrictions in indexed SELECTs expression: Add replace_token function	2021-07-22 10:22:45 +02:00
Wojciech Mitros	7f41af0916	sstables: merge row_consumer into mp_row_consumer_k_l The row_consumer interface has only one implementation: mp_row_consumer_k_l; and we're not planning other ones, so to reduce the number of inheritances, and the number of lines in the sstable reader, these classes may be combined. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 18:19:49 +02:00
Wojciech Mitros	1ff72ca0a6	sstables: move kl row_consumer In preparation for the next patch combining row_consumer and mp_row_consumer_k_l, move row_consumer next to mp_row_consumer_k_l. Because row_consumer is going to be removed, we retire some old tests for different implementations of the row_consumer interface; as a result, we don't need to expose internal types of kl sstable reader for tests, so all classes from reader_impl.hh are moved to reader.cc, and the reader_impl.hh file is deleted, and the reader.cc file has an analogous structure to the reader.cc file in sstables/mx directory. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 18:04:22 +02:00
Wojciech Mitros	fc17c48bc9	sstables: merge consumer_m into mp_row_consumer_m The consumer_m interface has only one implementation: mp_row_consumer_m; and we're not planning other ones, so to reduce the number of inheritances, and the number of lines in the sstable reader, these classes may be combined. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:10 +02:00
Wojciech Mitros	fbb56e930c	sstables: move mp_row_consumer_m To make next patch combining consumer_m and mp_row_consumer_m more readable, move mp_row_consumer_m next to consumer_m. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:04 +02:00
Piotr Grabowski	e06102aed9	tests: add secondary index tests with TOKEN clause Add tests of SELECTs with TOKEN clauses on tables with secondary indexes (both global and local). test_select_with_token_range_cases checks all possible token range combinations (inclusive/exclusive/infinity start/end) on tables without index, with local or with global index. test_select_with_token_range_filtering checks whether TOKEN restrictions combined with column restrictions work properly. As different code paths are taken if index is created on clustering key (first or non-first) or non-primary-key column, the tests check scenarios when index is created on different columns.	2021-07-21 16:12:55 +02:00
Piotr Grabowski	e2bd1cdb9d	secondary_index_test: extract test data Extract test data to separate variables, allowing it to be easily reused by other tests. The tokens are hard-coded, because calculating their value brought too much complexity to this code.	2021-07-21 16:12:55 +02:00
Jan Ciolek	694d62a567	secondary_index: Fix TOKEN() restrictions in indexed SELECTs When using an index, restrictions like token(p) <= x were ignored. Because of this a query like this would select all rows where r = 0: SELECT * FROM tab WHERE r = 0 and token(p) > 0; Adds proper handling of token restrictions to queries that use indexes. Old indexes represented token as a blob, which complicates clustering bounds. Special code is included, which translates token clustering bounds to blob clustering bounds. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-07-21 16:12:49 +02:00
Raphael S. Carvalho	e4eb7df1a1	table: Make correctness of concurrent sstable list update robust Today, table relies on row_cache::invalidate() serialization for concurrent sstable list updates to produce correct results. That's very error prone because table is relying on an implementation detail of invalidate() to get things right. Instead, let's make table itself take care of serialization on concurrent updates. To achieve that, sstable_list_builder is introduced. Only one builder can be alive for a given table, so serialization is guaranteed as long as the builder is kept alive throughout the update procedure. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210721001716.210281-1-raphaelsc@scylladb.com>	2021-07-21 16:45:30 +03:00
Botond Dénes	84c9bf2b63	tools/scylla-sstable-index: remove global reader concurrency semaphore Use a local one instead and make sure to stop it before it is destroyed. Message-Id: <20210721133754.356229-1-bdenes@scylladb.com>	2021-07-21 16:41:01 +03:00
Raphael S. Carvalho	aad72289e2	table: Kill load_sstable() That function is dangerously used by distributed loader, as the latter was responsible for invalidating cache for new sstable. load_sstable() is an unsafe alternative to add_sstable_and_update_cache() that should never have been used by the outside world. Instead, let's kill it and make loader use the safe alternative instead. This will also make it easier to make sure that all concurrent updates to sstable set are properly serialized. Additionally, this may potentially reduce the amount of data evicted from the cache, when the sstables being imported have a narrow range, like high level sstables imported from a LCS table. Unlikely but possible. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210721131949.26899-1-raphaelsc@scylladb.com>	2021-07-21 16:21:42 +03:00
Botond Dénes	a819f013f6	compaction/compaction: create_compaction_info(): take const compaction_descriptor& Don't copy the descriptor. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210721120219.326972-1-bdenes@scylladb.com>	2021-07-21 16:19:03 +03:00
Pavel Solodovnikov	718977e2b7	idl: add descriptions for the top-level generation routines Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:39:20 +03:00
Pavel Solodovnikov	fe6b0e8bbf	idl: make ns_qualified name a class method Introduce ASTBase and move `combine_ns` and `ns_qualified_name` there. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:31:47 +03:00
Pavel Solodovnikov	c584bbf841	idl: cache template declarations inside enums and classes Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	82cb2dc3e3	idl: cache parent template params for enums and classes Also remove `parent_template_param` argument for `handle_enum` and `handle_class` functions. `setup_namespace_bindings` is renamed to `setup_additional_metadata` since it now also sets parent template arguments for each object. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	6dd3bf9d6d	idl: rename misleading `local_types` to `local_writable_types` Rename `add_to_types` function to `register_local_writable_type` and `local_types` set to `local_writable_types`. Also rename other related functions accordingly, by adding `writable` to the names. Previous names were misleading since `local_types` refers not to all local types but only to those which are marked with `[[writable]]` attribute. Nonetheless, we are going to need a mapping of all local types to resolve type references from `BasicType` AST node instances. So the `local_types` set is retained, but now it corresponds to the list of all local types. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	d17a6a5e5a	idl: remove remaining uses of `namespaces` argument Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	8aeaba5eb6	idl: remove `is_final` function and use `.final` AST class property Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	5ec8aeb74c	idl: remove `parent_template_param` from `local_types` set Previously local types set contained items, which are lists of `[cls, parent_template_param]`. The second element is never used, so remove it and move `cls` from the list. All uses are adjusted accordingly. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:40 +03:00
Pavel Solodovnikov	4a66e07cb1	idl: cache namespaces in AST nodes Do a pre-processing pass to cache namespaces info in each type declaration AST node (`ClassDef` and `EnumDef`) and store it in the `ns_context` field of a node. Switch to `ns_context` and eliminate `namespaces` parameter carried over through all writer functions. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:39 +03:00
Pavel Solodovnikov	3218a952b9	idl: remove unused variables This patch removes unused `parent_template_param` and `namespaces` variables obtained from unpacking values from the `local_types` set. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-07-21 15:17:39 +03:00
Jan Ciolek	51ee9adeec	expression: Add replace_token function Adds replace_token function which takes an expression and replaces all left hand side occurrences of token() with the given column definition. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-07-21 12:25:12 +02:00
Gleb Natapov	7261c2c93e	raft: return a correct leader when leaving leader state When a leader moves to a follower state it aborts all requests that are waiting on an admission semaphore with not_a_leader exception. But currently it specifies itself as a new leader since abortion happens before the fsm state changes to a follower. The patch fixes this by destroying leader state after fsm state already changed to be a follower. Message-Id: <YPbI++0z5ZPV9pKb@scylladb.com>	2021-07-21 00:42:39 +02:00
Nadav Har'El	c4f20f1641	Update seastar submodule * seastar ef320940...388ee307 (4): > Merge 'Add a stall analyser tool' from Benny Halevy > compat: implement coroutine_handle<void> for <experimental/coroutine> header > Merge "Make app_template::run noexcept" from Pavel E > perftune.py: make RPS CPU set to be a full CPU set The stall analyser tool was requested by the SCT team to help make sense of Scylla's stall reports and find more stall bugs!	2021-07-21 00:47:11 +03:00
Benny Halevy	c5e08eb6e7	main: add deferred stop of batchlog_manager Stop the batchlog manager using a deferred action in main to make sure it is stopped after its start() method has been called, also if we bail out of main early due to exception. Change the bm.stop() calls in storage_service to just stop the replay loop using drain(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-20 20:24:11 +03:00
Benny Halevy	5165780d81	batchlog_manager: refactor drain out of stop drain() aborts the replay loop fiber and returns its future. It's grabbing _gate so stop() will wait on it. The intention is to call stop_replay_loop from storage_service::decommission and do_drain rather than stop, so we can stop the batchlog manager once, using a deferred action in main. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-20 20:23:06 +03:00
Benny Halevy	c47fbda076	batchlog_manager: stop: break _sem on shard 0 Abort do_batch_log_replay if waiting on the semaphore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-20 19:35:23 +03:00
Benny Halevy	deef1b4f59	batchlog_manager: stop: use abort_source to abort batchlog_replay_loop Harden start/stop by using an abort_source to abort from the replay loop. Extract the loop into batchlog_replay_loop() coroutine, with the _stop abort source as a stop condition, plus use it for sleep_abortable to be able to promptly stop while sleeping. start() stores batchlog_replay_loop's future in a newly added _started member, which is waited on in stop() to synchronize with the start process at any stage. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-20 19:32:55 +03:00
Benny Halevy	976b517f55	batchlog_manager: do_batch_log_replay: hold _gate So we can wait on do_batch_log_replay on stop(). Note that do_batch_log_replay is called both from batchlog_replay_loop and from the storage_service. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-20 19:30:55 +03:00
Juliusz Stasiewicz	38b8a6ce2c	test/boost: run_mutation_source_tests on streaming virtual table Tests that require inter-partition forwarding are excluded.	2021-07-20 14:19:17 +02:00
Juliusz Stasiewicz	65c87e2c74	system_keyspace: Introduce describe_ring table as virtual_table This change adds "system.describe_ring" table using the new streaming_virtual_table infrastructure.	2021-07-20 14:19:17 +02:00
Juliusz Stasiewicz	f8067d938d	storage_service: Pass the reference down to system_keyspace According to the policy of avoiding globals.	2021-07-20 14:18:24 +02:00
Juliusz Stasiewicz	a8b741efe2	endpoint_details: store `_host` as `gms::inet_address` In an upcoming commit I will add "system.describe_ring" table which uses endpoint's inet address as a part of CK and, therefore, needs to keep them sorted with `inet_addr_type::less`.	2021-07-20 14:00:54 +02:00
Juliusz Stasiewicz	2b802711c2	queue_reader: implement next_partition()	2021-07-20 14:00:54 +02:00
Piotr Wojtczak	9a77751c6b	virtual_tables: Introduce streaming_virtual_table This change adds another implementation of the virtual_table interface, useful for cases where there are larger amounts of data.	2021-07-20 14:00:54 +02:00
Piotr Wojtczak	cb2a0ab858	flat_mutation_reader: Add a new filtering reader factory method Introduce a new function creating a filtering reader using query slice and partition range.	2021-07-20 14:00:47 +02:00
Tomasz Grabiec	dcd05f77b1	lsa: Avoid excessive eviction if region is not compactible Introduced in `d72b91053b`. If region was not compactible, for example because it has dense segments, we would keep evicting even though the target for reclaimed segments was met. In the worst case we may have to evict whole cache. Refs #9038 (unlikely to be the cause though) Message-Id: <20210720104039.463662-1-tgrabiec@scylladb.com>	2021-07-20 14:36:14 +03:00
dgarcia360	8d51482ffe	docs: moved latest_version to conf.py Related issues: scylladb/sphinx-scylladb-theme#87 All the variables related to the multiversion extension are now defined in conf.py instead of using the GitHub Actions file. How to test this PR Run make multiversionpreview on docs folder. When you open https://0.0.0.0:5500, the browser should render the documentation site. Closes #7957	2021-07-20 14:31:46 +03:00
Avi Kivity	05fcf11557	Merge 'Coroutinize commit log' from Calle Wilund No real refactoring, just move the various methods to coroutines. Because coroutines are neat. Broken down into one method per change to make review easier. And hoping I get tipped per change. Grand idea being that using coroutines will eventually make real refactoring easier. Unit tests + relevant dtest. As discussed below, simply coroutinizing the code will, at least in the fast path, cause the slightly naive compiler to generate multiple unused coroutine frames, dropping raw performance a bit. The last two patches in this series addresses this, by breaking the fast path into non-coroutine subroutines (no futures involved) and one coroutine main loop. Results, as collected by `perf_simple_query` are: Master (before changes): ``` { "parameters" : { "concurrency" : 100, "concurrency,partitions,cpus,duration" : "100,10000,1,30", "cpus" : 1, "duration" : 30, "partitions" : 10000 }, "stats" : { "allocs_per_op" : 52.237303521776113, "instructions_per_op" : 47403.34422198555, "mad tps" : 670.12528706749436, "max tps" : 140817.0800358199, "median tps" : 139391.58369995767, "min tps" : 133663.0095463676, "tasks_per_op" : 13.189605506751203 }, "test_properties" : { "type" : "write" }, "versions" : { "scylla-server" : { "commit_id" : "1f51bc67fd", "date" : "20210712", "run_date_time" : "2021-07-13 10:26:46", "version" : "4.6.dev" } } } ``` This PR (coroutines + fast path optimization patches): ``` { "parameters" : { "concurrency" : 100, "concurrency,partitions,cpus,duration" : "100,10000,1,30", "cpus" : 1, "duration" : 30, "partitions" : 10000 }, "stats" : { "allocs_per_op" : 52.208628061750559, "instructions_per_op" : 47300.501878330339, "mad tps" : 707.70233700674726, "max tps" : 139618.0661493362, "median tps" : 137891.11290420164, "min tps" : 127551.83433347062, "tasks_per_op" : 13.172121395660733 }, "test_properties" : { "type" : "write" }, "versions" : { "scylla-server" : { "commit_id" : 
"1d4b6f50bd", "date" : "20210719", "run_date_time" : "2021-07-19 09:27:09", "version" : "4.6.dev" } } } ``` I.e. both allocations/op and instruction count seem to be on par. Closes #8954 * github.com:scylladb/scylla: commitlog: Make allocate_when_possible a template commitlog: break fast path alloc into non-fut/corout + outer loop commitlog: Drop stream/subscription from replayer commitlog: coroutinize commitlog::read_log_file commitlog: coroutinize commitlog::create_commitlog commitlog: coroutinize commitlog::add_entries commitlog: coroutinize commitlog::add_entry commitlog: coroutinize commitlog::add commitlog: change entry_writer usage to reference commitlog: coroutinize segment_manager::clear commitlog: coroutinize segment_manager::do_pending_deletes commitlog: coroutinize segment_manager::delete_file commitlog: coroutinize segment_manager::shutdown commitlog: coroutinize segment_manager::shutdown_all_segments commitlog: coroutinize segment_manager::sync_all_segments commitlog: coroutinize segment_manager::clear_reserve_segments commitlog: coroutinize segment_manager::active_segment commitlog: coroutinize segment_manager::new_segment commitlog: coroutinize segment_manager::allocate_segment commitlog: coroutinize segment_manager::rename_file commitlog: coroutinize segment_manager::init commitlog: coroutinize segment_manager::list_descriptors commitlog: coroutinize segment_manager::replenish_reserve commitlog: coroutinize segment::shutdown commitlog: coroutinize segment::close commitlog: coroutinize segment::batch_cycle commitlog: coroutinize segment::do_flush commitlog: coroutinize segment::flush commitlog: coroutinize segment::cycle commitlog: coroutinize allocate_when_possible commitlog: coroutinize segment::allocate	2021-07-20 14:14:13 +03:00
Tomasz Grabiec	50ec3ea295	lsa: Fix misaccounting of used space when allocating lsa_buffers lsa_buffer allocations are aligned to 4K. If smaller size is requested, whole 4K is used. However, only requested size was used in accounting segment occupancy. This can confuse reclaimer which may think the segment is sparse while it is actually dense, and compacting it will yield no or little gain. This can cause inefficient memory reclamation or lack of progress. Refs #9038 Message-Id: <20210720104110.463812-1-tgrabiec@scylladb.com>	2021-07-20 14:08:06 +03:00
Pavel Solodovnikov	d2b53bc0ca	configure: simplify raft tests dependencies management There's no need for extended `scylla_raft_dependencies`, which includes the entire `scylla_core` target list. Revert the tests which don't need the extended list to use a minimal set of dependencies and switch to using `scylla_core` as a dependency for `raft_sys_table_storage_test` and `raft_address_map_test`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210705104712.295499-1-pa.solodovnikov@scylladb.com>	2021-07-20 12:32:10 +03:00
Botond Dénes	8fc55fa5bf	reader_concurrency_semaphore: get rid of struct permit_list struct permit_list exists so the intrusive list declaration which needs the definition of reader_permit can be hidden in the .cc. But it turns out that if the hook type is fully spelled out, the intrusive list declaration doesn't need T to be defined. Exploit this to get rid of this extra indirection. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210720073121.63027-2-bdenes@scylladb.com>	2021-07-20 10:35:12 +03:00
Botond Dénes	11b39cbc23	reader_concurrency_semaphore: merge permit_stats into stats If there was any reason to have them separate when permit_stats was conceived, it is gone now, so merge the two. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210720073121.63027-1-bdenes@scylladb.com>	2021-07-20 10:35:12 +03:00
Tomasz Grabiec	a8528cb24d	lsa: Fix uninitialized field access resulting in hangs during segment compaction _free_space may be initialized with garbage so kind() getter should only look at the bit which corresponds to the kind. Misclassification of segment as being of different kind may result in a hang during segment compaction. Surfaced in debug mode build where the field is filled with 0xbebebebe. Introduced in `b5ca0eb2a2`. Fixes #9057 Message-Id: <20210719232734.443964-1-tgrabiec@scylladb.com>	2021-07-20 02:33:21 +03:00
Tomasz Grabiec	393b90112f	gdb: segment-descs: Support debug mode builds Debug mode builds have a different implementation of segment_store in LSA. Message-Id: <20210719232125.442458-1-tgrabiec@scylladb.com>	2021-07-20 02:33:18 +03:00
Gleb Natapov	aa8c6b85fb	raft: do not apply empty command list Do not call user's state machine apply() if there is nothing to apply. Message-Id: <YO1dMitXnZhZlmra@scylladb.com>	2021-07-19 18:26:18 +02:00
Nadav Har'El	36ec1d792e	Merge 'cql-pytest: Test selecting from indexed table using only clustering key' from Jan Ciołek Add examples from issue #8991 to tests Both of these tests pass on `cassandra 4.0` but fail on `scylla 4.4.3` First test tests that selecting values from indexed table using only clustering key returns correct values. The second test tests that performing this operation requires filtering. The filtering test looks similar to [the one for #7608](`1924e8d2b6/test/cql-pytest/test_allow_filtering.py (L124)`) but there are some differences - here the table has two clustering columns and an index, so it could test different code paths. Contains a quick fix for the `needs_filtering()` function to make these tests pass. It returns `true` for this case and the one described in #7708. This implementation is a bit conservative - it might sometimes return `true` where filtering isn't actually needed, but at least it prevents scylla from returning incorrect results. Fixes #8991. Fixes #7708. Closes #8994 * github.com:scylladb/scylla: cql3: Fix need_filtering on indexed table cql-pytest: Test selecting using only clustering key requires filtering cql-pytest: Test selecting from indexed table using clustering key	2021-07-19 18:23:08 +03:00
Tomasz Grabiec	049a1ef729	Merge 'flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler' from enedil The downgrade_to_v1 didn't reset the state of range tombstone assembler in case of the calls to next_partition or fast_forward_to, which caused a situation where the closing range tombstone change is cleared from the buffer before being emitted, without notifying the assembler. This patch fixes the behaviour in fast_forward_to as well. Fixes #9022 Closes #9023 * github.com:scylladb/scylla: flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler flat_mutation_reader: introduce public method returning the default size of internal buffer.	2021-07-19 17:10:23 +02:00
Jan Ciolek	54149242b4	cql3: Fix need_filtering on indexed table There were cases where a query on an indexed table needed filtering but need_filtering returned false. This is fixed by using new conditions in cases where we are using an index. Fixes #8991. Fixes #7708. For now this is an overly conservative implementation that returns true in some cases where filtering is not needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-07-19 16:22:17 +02:00
Michał Radwański	67d99e02a7	flat_mutation_reader: downgrade_to_v1 - reset state of rt_assembler The downgrade_to_v1 didn't reset the state of range tombstone assembler in case of the calls to next_partition or fast_forward_to, which caused a situation where the closing range tombstone change is cleared from the buffer before being emitted, without notifying the assembler. This patch fixes the behaviour in fast_forward_to as well. Fixes #9022	2021-07-19 15:54:26 +02:00
Michał Radwański	c4089007a2	flat_mutation_reader: introduce public method returning the default size of internal buffer. This method is useful in tests that examine behaviour after the buffer has been filled up.	2021-07-19 15:54:13 +02:00
Nadav Har'El	4c6dc5fce2	Merge 'continuous_data_consumer: properly skip bytes at the end of a range' from Wojciech Mitros When skipping bytes at the end of a continuous_data_consumer range, the position of the consumer is moved after the skipped bytes, but the position of the underlying input_stream is not. This patch adds skipping of the underlying input_stream, to make its position consistent with the position of the consumer. Fixes #9024 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #9039 * github.com:scylladb/scylla: tests: add test for skipping bytes at end of consumer continuous_data_consumer: properly skip bytes at the end of a range	2021-07-19 15:57:26 +03:00
Botond Dénes	27fbca84f6	reader_concurrency_semaphore: remove prethrow_action The semaphore accepts a functor in its constructor which is run just before throwing on wait queue overload. This is used exclusively to bump a counter in the database::stats, which counts queue overloads. However, there is now an identical counter in reader_concurrency_semaphore::stats, so the database can just use that directly and we can retire the now unused prethrow action. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210716111105.237492-1-bdenes@scylladb.com>	2021-07-19 15:47:37 +03:00
Wojciech Mitros	507bdfc36a	tests: add test for skipping bytes at end of consumer The new test confirms that the regression issue, where we didn't correctly skip bytes at the end of a continuous_data_consumer range, is fixed. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-19 14:42:38 +02:00
Wojciech Mitros	7107e32390	continuous_data_consumer: properly skip bytes at the end of a range When skipping bytes at the end of a continuous_data_consumer range, the position of the consumer is moved after the skipped bytes, but the position of the underlying input_stream is not. This patch adds skipping of the underlying input_stream, to make its position consistent with the position of the consumer. Fixes #9024 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-19 11:43:30 +02:00
Piotr Sarna	38afef71b9	Merge 'Service Level Controller: Stop polling distributed data.. ... when decommissioned (reworked)' from Eliran Sinvani This is a rework of #8916 The polling loop of the service level controller queries a distributed table in order to detect configuration changes. If a node gets decommissioned, this loop continues to run until shutdown. If a node stays in the decommissioned mode without being shut down, the loop will fail to query the table and this will result in warnings and eventually errors in the log. This is not really harmful but it adds unnecessary noise to the log. The series below lays the infrastructure for observing storage service state changes, which is eventually used to break the loop upon preparation for decommissioning. Tests: Unit test (dev) Failing tests in jenkins. Fixes #8836 The previous merge (possibly due to conflict resolution) contained a misplaced get that caused an abort on shutdown. Closes #9035 * github.com:scylladb/scylla: Service Level Controller: Stop configuration polling loop upon leaving the cluster main: Stop using get_local_storage_service in main	2021-07-19 10:52:42 +02:00
Benny Halevy	3700702e90	cmake: update compaction source files location Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210718120906.701185-1-bhalevy@scylladb.com>	2021-07-19 11:47:35 +03:00
Botond Dénes	5aa733f933	sstables/mx/writer: initialize _range_tombstones at the end of the ctor We need a permit to initialize said object, which makes the semaphore used and hence triggers an error if an exception is thrown in the constructor. Move the initialization to the end of the constructor to prevent this. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210719040449.9202-1-bdenes@scylladb.com>	2021-07-19 11:43:00 +03:00
Calle Wilund	4990ba2769	commitlog: Make allocate_when_possible a template And call it by-value with the polymorphic writers. This eliminates outer coroutine frame and ensures we use only one for fast-case allocation.	2021-07-19 08:27:30 +00:00
Calle Wilund	69ead0e658	commitlog: break fast path alloc into non-fut/corout + outer loop Removes 2 coroutine frames in fast path (as long as segment + space is avail). Puts IPS back on track with master.	2021-07-19 08:27:30 +00:00
Calle Wilund	62acc84e58	commitlog: Drop stream/subscription from replayer Change args to values so stays on coroutine frame. Remove pointless subscription/stream usage, just iterate.	2021-07-19 08:27:30 +00:00
Calle Wilund	5e8af28da7	commitlog: coroutinize commitlog::read_log_file	2021-07-19 08:27:30 +00:00
Calle Wilund	b3c35f9ec0	commitlog: coroutinize commitlog::create_commitlog	2021-07-19 08:27:30 +00:00
Calle Wilund	ef471d0a93	commitlog: coroutinize commitlog::add_entries	2021-07-19 08:27:30 +00:00
Calle Wilund	96434b1b12	commitlog: coroutinize commitlog::add_entry	2021-07-19 08:27:30 +00:00
Calle Wilund	e16cff6952	commitlog: coroutinize commitlog::add	2021-07-19 08:27:30 +00:00
Calle Wilund	da360fb841	commitlog: change entry_writer usage to reference Calling frames keeps object alive in all paths. Use references in allocate()/allocate_when_possible()	2021-07-19 08:27:30 +00:00
Calle Wilund	42bfae513a	commitlog: coroutinize segment_manager::clear	2021-07-19 08:27:30 +00:00
Calle Wilund	554a09baab	commitlog: coroutinize segment_manager::do_pending_deletes	2021-07-19 08:27:30 +00:00
Calle Wilund	9e18cf3f5f	commitlog: coroutinize segment_manager::delete_file	2021-07-19 08:27:30 +00:00
Calle Wilund	ca65387c53	commitlog: coroutinize segment_manager::shutdown	2021-07-19 08:27:30 +00:00
Calle Wilund	4678d1fbec	commitlog: coroutinize segment_manager::shutdown_all_segments	2021-07-19 08:27:30 +00:00
Calle Wilund	2f048e658b	commitlog: coroutinize segment_manager::sync_all_segments	2021-07-19 08:27:30 +00:00
Calle Wilund	ad4e4e9ee4	commitlog: coroutinize segment_manager::clear_reserve_segments	2021-07-19 08:27:30 +00:00
Calle Wilund	ec430807fc	commitlog: coroutinize segment_manager::active_segment	2021-07-19 08:27:30 +00:00
Calle Wilund	13bba1ef39	commitlog: coroutinize segment_manager::new_segment	2021-07-19 08:27:30 +00:00
Calle Wilund	ccd34203dc	commitlog: coroutinize segment_manager::allocate_segment	2021-07-19 08:27:30 +00:00
Calle Wilund	f5de830f0c	commitlog: coroutinize segment_manager::rename_file	2021-07-19 08:27:30 +00:00
Calle Wilund	011bc68209	commitlog: coroutinize segment_manager::init	2021-07-19 08:27:30 +00:00
Calle Wilund	04c725b29c	commitlog: coroutinize segment_manager::list_descriptors	2021-07-19 08:27:30 +00:00
Calle Wilund	d514fc5822	commitlog: coroutinize segment_manager::replenish_reserve	2021-07-19 08:27:30 +00:00
Jan Ciolek	9bd62a07c9	cql-pytest: Test selecting using only clustering key requires filtering Adds test that creates a table with primary key (p, c1, c2) with a global index on c2 and then selects where c1 = 1 and c2 = 1. This should require filtering, but doesn't. Refs #8991. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-07-19 10:24:48 +02:00
Jan Ciolek	a041767aa3	cql-pytest: Test selecting from indexed table using clustering key Adds test that creates a table with primary key (p, c1, c2) with a global index on c2 and then selects where c1 = 1 and c2 = 1. This currently fails. Refs #8991. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2021-07-19 10:24:46 +02:00
Calle Wilund	d4bd17d577	commitlog: coroutinize segment::shutdown	2021-07-19 08:17:33 +00:00
Calle Wilund	e9820827e3	commitlog: coroutinize segment::close	2021-07-19 08:17:33 +00:00
Calle Wilund	999701a8ee	commitlog: coroutinize segment::batch_cycle	2021-07-19 08:17:33 +00:00
Calle Wilund	cef7ee2014	commitlog: coroutinize segment::do_flush	2021-07-19 08:17:33 +00:00
Calle Wilund	1a76d735f2	commitlog: coroutinize segment::flush	2021-07-19 08:17:33 +00:00
Calle Wilund	0b1e2084ce	commitlog: coroutinize segment::cycle	2021-07-19 08:17:33 +00:00
Calle Wilund	79b9cb1e5c	commitlog: coroutinize allocate_when_possible	2021-07-19 08:17:33 +00:00
Calle Wilund	e545b382bd	commitlog: coroutinize segment::allocate	2021-07-19 08:17:33 +00:00
Avi Kivity	2cfc517874	main, test: adjust number of networking iocbs Seastar's default limit of 10,000 iocbs per shard is too low for some workloads (it places an upper bound on the number of idle connections, above which a crash occurs). Use the new Seastar feature to raise the default to 50000. Also multiply the global reservation by 5, and round it upwards so the number is less weird. This prevents io_setup() from failing. For tests, the reservation is reduced since they don't create large numbers of connections. This reduces surprise test failures when they are run on machines that haven't been adjusted. Fixes #9051 Closes #9052	2021-07-18 14:38:44 +03:00
Avi Kivity	9c3f8028f1	Update tools/java submodule (SLES 15) * tools/java 79a441972d...4ef8049e07 (1): > dist/redhat: change PyYAML filepath to allow installing on SLES15 Fixes #9045.	2021-07-18 14:24:42 +03:00
Raphael S. Carvalho	841e9227f9	table: Document the serialization requirement on sstable set rebuild In order to avoid data loss bugs, that could come due to lack of serialization when using the preemptable build_new_sstable_list(), let's document the serialization requirement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210714201301.188622-1-raphaelsc@scylladb.com>	2021-07-17 18:09:00 +03:00
Avi Kivity	df822e09e0	Merge "Run test cases in parallel" from Pavel E " The debug-mode tests nowadays take ~1 hour to complete on a 24-core threadripper machine. This is mostly because of a bunch of individual test cases that run sequentially (since they sit in one test) each taking half-an-hour and longer. The previous attempt was to break the longest tests into pieces, and to update the list of long-running tests in the suite.yaml file, but the concern was that the linkage time and disk space would grow without limits if this continues. Also the long-running tests list needs to be revisited every so often. So the new attempt is to resurrect Avi's patch that ran test cases in parallel for boost tests. This set applies parallelism to all tests and allows blacklisting those that shouldn't be parallelized (the logalloc test needs the very first case to prime_segment_pools so that other cases run smoothly, thus it cannot be parallelized). Although this wild parallelism adds an overhead for _each_ test case this is good enough even for short dev-mode tests (saves 25% of runtime), but greatly relaxes the maintenance of the "parallelizable list of tests". For debug tests the problem is not 100% solved. There are 6 cases that run longer than 30min, while all the others complete much, much faster. So if excluding those slow 6 cases the full parallel run saves 50+% of the runtime -- 60+m now vs 25m with the patch. Those 6 slowest cases will need more incremental care. The --parallel-cases mode is not yet default, because it requires a larger max-aio-nr value to be set, which is not (yet?) automatic. Also it sometimes hits the nr-open-files limit, which also needs more work.
tests: unit(dev), unit(debug) " * 'br-parallel-testpy-3' of https://github.com/xemul/scylla: tests: Update boost long tests list test.py: Parallelize test-cases run (for boost tests) test.py: Prepare BoostTest for running individual cases test.py: Prepare TestSuite::create_test() for parallelizm test.py: Treat shortname as composite test.py: Reformat tabluar output	2021-07-17 13:57:56 +03:00
Pavel Emelyanov	1ed582304d	memtable_list: Shorten flush coalescing codeflow The memtable_list::flush() maintains a shared_promise object to coalesce the flushers until the get_flush_permit() resolves. Also it needs to keep the extraneous flushes counter bumped while doing the flush itself. All this can be coded in a shorter form and without the need to carry shared_promise<> around. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210716164237.10993-1-xemul@scylladb.com>	2021-07-17 00:42:20 +02:00
Avi Kivity	3058c42171	Update seastar submodule * seastar 8ed9771ae9...ef320940c2 (6): > reactor: reactor_backend_aio: allow tuning number of network iocbs Ref #9051. > aio_general_context: flush: handle io_submit short return > aio_general_context: prevent overflow > file: Do not assume nowait_works by default > Merge "reactor: use sched_clock consistently" from Michael > testing: Lazily create seastar::app thread	2021-07-16 18:07:10 +03:00
Pavel Emelyanov	9d59f1daf3	tests: Update boost long tests list Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:25:07 +03:00
Pavel Emelyanov	cbb4837b77	test.py: Parallelize test-cases run (for boost tests) The parallelism is achieved by listing the content of each (boost) test and by adding a test for each case found, appending the '--run_test={case_name}' option. Also a few tests (logalloc and memtable) have cases that depend on each other (the former explicitly states this in the head comment), so these are marked as "no_parallel_cases" in the suite.yaml file. In dev mode tests need 2m 5s to run by default. With parallelism (and updated long-running tests list) -- 1m 35s. In debug mode there are 6 slow _cases_ that overrun 30 minutes. They finish last and deserve some special (incremental) care. All the other tests run ~1h by default vs ~25m in parallel. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:25:07 +03:00
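The case-by-case scheme described in cbb4837b77 can be sketched roughly as follows. This is a minimal illustration, not the actual test.py code: the function names are hypothetical, and the listing format (one case name per line, with a trailing `*` marking enabled cases, as Boost.Test's `--list_content` prints for a flat test module) is an assumption.

```python
def parse_case_names(listing):
    """Extract case names from a flat test listing (assumed format:
    one name per line, optionally followed by '*')."""
    names = []
    for line in listing.splitlines():
        name = line.strip().rstrip("*")
        if name:
            names.append(name)
    return names


def per_case_commands(binary, case_names, no_parallel_cases=False):
    """Build one command line per test case, or a single command for
    tests whose cases depend on each other."""
    if no_parallel_cases:
        # Cases depend on each other: run the whole binary once.
        return [[binary]]
    return [[binary, f"--run_test={name}"] for name in case_names]


if __name__ == "__main__":
    # Hypothetical listing; a real one would come from running the binary.
    listing = "test_progress_leader*\ntest_progress_paused*\n"
    for cmd in per_case_commands("etcd_test", parse_case_names(listing)):
        print(" ".join(cmd))
```

Each resulting command can then be scheduled independently, which is what lets slow cases inside one binary overlap with each other.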
Pavel Emelyanov	3cac5173b7	test.py: Prepare BoostTest for running individual cases This means adding the casename argument to its describing class and handling it: 1. appending to the shortname 2. adding the --run_test= argument to boost args Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:25:07 +03:00
Pavel Emelyanov	0baee5d423	test.py: Prepare TestSuite::create_test() for parallelizm The method in question is in charge of creating a single entry in the list of tests to be run. The BoostTestSuite's method is about to create several entries and this patch prepares it for this: - makes it distinguish individual arguments - lets it select the test.id value itself Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:25:07 +03:00
Pavel Emelyanov	a547677502	test.py: Treat shortname as composite When running tests in parallel-cases mode the test.uname must include the case name to make different log and xml files for different runs and to show which exact case is run when shown by the tabular-output. At the same time the test shortname identifies the binary with the whole test. This patch makes class Test treat the shortname argument as a dot-separated string where the 0th component is the binary with the test and the rest is how test identifies itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:25:07 +03:00
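A small sketch of the composite shortname idea from a547677502. The helper names are hypothetical, and the `shortname.id` form of the uname is inferred from the sample output in f188dd3396 (for example `etcd_test.test_progress_leader.40`), not taken from the test.py source.

```python
def split_shortname(shortname):
    """Split a dot-separated shortname into the test binary (0th
    component) and the remaining components identifying the case."""
    binary, *rest = shortname.split(".")
    return binary, rest


def make_uname(shortname, test_id):
    """Build a unique name for log and xml files (assumed format)."""
    return f"{shortname}.{test_id}"


if __name__ == "__main__":
    binary, case = split_shortname("etcd_test.test_progress_leader")
    print(binary, case)
    print(make_uname("etcd_test.test_progress_leader", 40))
```

With a plain shortname such as `fsm_test`, the case part is simply empty, so the regular (non-parallel) mode keeps working unchanged.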
Pavel Emelyanov	f188dd3396	test.py: Reformat tabluar output This change solves several issues that would arise with the case-by-case run. First, the currently printed name is "$binary_name.$id". For a case-by-case run the binary name would coincide for many cases and it would be inconvenient to identify the test case. So the test's uname is printed instead. Second, the test's uname doesn't contain the suite name (unlike the test binary name, which does), so this patch also adds the explicit suite name back as a separate column (like MODE). Third, the testname + casename string length will be far above the allocated 50 characters, so the test name is moved to the tail of the line. Fourth, the total number of cases is 2100+, and the field of 7 characters is not enough to print it, so it's extended. Finally, the test.py output would look like this for a parallel run:
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/2108] raft dev [ PASS ] etcd_test.test_progress_leader.40 0.06s
[2/2108] raft dev [ PASS ] etcd_test.test_vote_from_any_state.45 0.03s
[3/2108] raft dev [ PASS ] etcd_test.test_progress_flow_control.43 0.04s
[4/2108] raft dev [ PASS ] etcd_test.test_progress_resume_by_append_resp.41 0.05s
[5/2108] raft dev [ PASS ] etcd_test.test_leader_election_overwrite_newer_logs.44 0.04s
[6/2108] raft dev [ PASS ] etcd_test.test_progress_paused.42 0.05s
[7/2108] raft dev [ PASS ] etcd_test.test_log_replication_2.47 0.06s
...
or like this for a regular run:
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/184] raft dev [ PASS ] fsm_test.41 0.06s
[2/184] raft dev [ PASS ] etcd_test.40 0.06s
[3/184] cql dev [ PASS ] cassandra_cql_test.2 1.87s
[4/184] unit dev [ PASS ] btree_stress_test.30 1.82s
...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-16 17:24:36 +03:00
Tomasz Grabiec	97aa335a60	Merge "test: raft: randomized_nemesis_test: refactors and improvements" from Kamil A couple of improvements to prepare for the next patchset. We move `logical_timer` and `ticker` to their own headers due to the generality of these data structures. They are not very specific to the test. `logical_timer` is extended with a `schedule` function, which allows scheduling any given function to be called at the given time point. The interface of `network` in `randomized_nemesis_test` is extended by `add_grudge` and `remove_grudge` functions for implementing network partitioning nemeses. Furthermore `network` can now be constructed with an arbitrary network delay, which was previously hardcoded. `with_env_and_ticker` is now generic w.r.t. return values (previously `future<>` was assumed). `environment` exposes a reference to the `network` through a getter. The `not_a_leader` exception now shows the leader's ID in the exception message. Useful for logging. In `logical_timer::with_timeout`, when we time out, we don't just return `timed_out_error`. The returned exception now actually contains the original future... well almost; in any case, the user can now do something different to the future other than simply discarding it. We also fix some `broken_promise` exceptions appearing in discarded futures in certain scenarios. See the corresponding commit for detailed explanation. We handle `raft::dropped_entry` in the `call` function. `persistence` is fixed to avoid creating gaps in the log when storing snapshots and to support complex state types. Waiting for leader was refactored into a separate function and generalized (we wait for a set of nodes to elect a leader instead of a single node to elect itself) to be useful in more situations. Finally, we introduce `reconfigure`, a higher-level version of `set_configuration` which performs error handling and supports timeouts.
* kbr/raft-nemesis-improvements-v4: test: raft: randomized_nemesis_test: `reconfigure` function test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots test: raft: randomized_nemesis_test: persistence: handle complex state types test: raft: randomized_nemesis_test: `call`: handle `raft::dropped_entry` test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels test: raft: randomized_nemesis_test: environment: expose the network test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold test: raft: randomized_nemesis_test: generalize `with_env_and_ticker` test: raft: randomized_nemesis_test: network: `add_grudge`, `remove_grudge` functions test: raft: randomized_nemesis_test: move `ticker` to its own header test: raft: randomized_nemesis_test: ticker: take `logger` as a constructor parameter test: raft: logical_timer: handle immediate timeout test: raft: logical_timer: on timeout, return the original future in the exception test: raft: logical_timer: add `schedule` member function test: raft: randomized_nemesis_test: move `logical_timer` to its own header test: raft: include the leader's ID in the `not_a_leader` exception's message	2021-07-16 16:12:05 +02:00
Benny Halevy	a44c06d776	storage_proxy: query: log also errors If log trace level is enabled, log also error. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210712070509.24102-1-bhalevy@scylladb.com>	2021-07-16 16:12:05 +02:00
Nadav Har'El	5183e0cbe9	Merge 'Fix artificial view update size limit' from Piotr Sarna The series which split the view update process into smaller parts accidentally put an artificial 10MB limit on the generated mutation size, which is wrong - this limit is configurable for users, and, what's more important, this data was already validated when it was inserted into the base table. Thus, the limit is lifted. The series comes with a cql-pytest which failed before the fix and succeeds now. This bug is also covered by `wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view` dtest, but it needs over a minute to run, as opposed to cql-pytest's <1 second. Fixes #9047 Tests: unit(release), dtest(wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view) Closes #9048 * github.com:scylladb/scylla: cql-pytest: add a materialized views suite with first cases db,view: drop the artificial limit on view update mutation size	2021-07-15 17:03:07 +03:00
Piotr Sarna	c05340c4bf	cql-pytest: add a materialized views suite with first cases cql-pytest did not have a suite for materialized views, so one is created. At the same time, test cases for building/updating a view on a base table with large cells is added as a regression test for #9047.	2021-07-15 15:40:38 +02:00
Piotr Sarna	697e2fc66d	db,view: drop the artificial limit on view update mutation size The series which split the view update process into smaller parts accidentally put an artificial 10MB limit on the generated mutation size, which is wrong - this limit is configurable for users, and, what's more important, this data was already validated when it was inserted into the base table. Thus, the limit is lifted. Tests: unit(release), dtest(wide_rows_test)	2021-07-15 14:09:37 +02:00
Tomasz Grabiec	1f255c420e	flat_mutation_reader_v2: Make is_end_of_stream() reflect consumer-side state of the stream Currently, flat_mutation_reader_v2::is_end_of_stream() returns flat_mutation_reader_v2::impl::_end_of_stream, which means the producer is done. The stream may still not be fully consumed even if the producer is done, due to internal buffering. So consumers need to make a more elaborate check: rd.is_end_of_stream() && rd.is_buffer_empty() It would be cleaner if flat_mutation_reader_v2::is_end_of_stream() returned the state of the consumer-side of the stream, since it belongs to the consumer-side of the API. The consumption will be as simple as: while (!rd.is_end_of_stream()) { consume_fragment(rd()); } This patch makes the change on the v2 of the reader interface. v1 is not changed to avoid problems which could happen when backporting code which assumes new semantics into a version with the old semantics. v2 is not in any old branch yet so it doesn't have this problem and it's a good time to make the API change. Note that it's always safe to use the new semantics in the context which assumes the old semantics, so v1 users can be safely converted to v2 even if they are unaware of the change. Fixes #3067 Message-Id: <20210715102833.146914-1-tgrabiec@scylladb.com>	2021-07-15 14:00:48 +03:00
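The difference between the producer-side and consumer-side notions of "end of stream" in 1f255c420e can be modelled with a toy buffered reader. This is only a Python model of the idea: the class and its methods are hypothetical and do not mirror the real C++ `flat_mutation_reader_v2` interface beyond the two checks named in the commit message.

```python
from collections import deque


class BufferedReader:
    """Toy reader: a producer fills an internal buffer, a consumer pops from it."""

    def __init__(self, source, buffer_size=2):
        self._source = iter(source)
        self._buffer = deque()
        self._buffer_size = buffer_size
        self._producer_done = False
        self._fill_buffer()

    def _fill_buffer(self):
        # Pull from the producer until the buffer is full or the source ends.
        while not self._producer_done and len(self._buffer) < self._buffer_size:
            try:
                self._buffer.append(next(self._source))
            except StopIteration:
                self._producer_done = True

    def producer_done(self):
        # Old semantics: only says that nothing more will be produced.
        return self._producer_done

    def is_buffer_empty(self):
        return not self._buffer

    def is_end_of_stream(self):
        # New semantics: nothing more to produce AND nothing left buffered.
        return self._producer_done and not self._buffer

    def __call__(self):
        # Pop the next fragment; callers check is_end_of_stream() first.
        fragment = self._buffer.popleft()
        self._fill_buffer()
        return fragment


if __name__ == "__main__":
    rd = BufferedReader(["a", "b", "c"], buffer_size=4)
    print(rd.producer_done(), rd.is_end_of_stream())
    while not rd.is_end_of_stream():
        print(rd())
```

In this model the producer can be finished while fragments are still buffered, which is exactly the case where the old check alone would make a consumer stop too early.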
Calle Wilund	b8b5f69111	messaging_service: Bind to listen address, not broadcast Refs #8418 Broadcast can (apparently) be an address not actually on the machine, but on the other side of NAT. Thus binding the local side of an outgoing connection there will fail. Bind instead to listen_address (or broadcast, if listen_to_broadcast). This will require routing + NAT to make the connection appear to come from the broadcast address, as seen from the node connected to, to allow the connection (if using partial encryption). Note: this has been verified only to a limited extent. I would suggest verifying various multi rack/dc setups before relying on it. Closes #8974	2021-07-15 13:18:10 +03:00
Tomasz Grabiec	21f1a7be8b	sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads Reads which bypass cache will use a private temporary instance of cached_file which dies together with the index cursor. The cursor still needs a cached_file with a caching layer. Binary searching needs caching for performance, since some of the pages will be reused. Another reason to still use cached_file is to work with a common interface, and reusing it requires minimal changes.	2021-07-15 12:14:28 +02:00
Tomasz Grabiec	f4227c303b	sstables: Do not populate partition index cache for "bypass cache" reads Index cursor for reads which bypass cache will use a private temporary instance of the partition index cache. Promoted index scanner (ka/la format) will not go through the page cache.	2021-07-15 12:13:20 +02:00
Avi Kivity	ed6c01a9fa	test: increase timeout to account for flat_mutation_reader_v2 tests Since `fce124bd90` ("Merge "Introduce flat_mutation_reader_v2" from Tomasz") tests involving mutation_reader are a lot slower due to the new API testing. On slower machines it's enough to time out. Work underway to improve the situation, and it will also revert back to the original timing once the flat_mutation_reader_v2 work is done, but meanwhile, increase the timeout. Closes #9046	2021-07-15 12:33:43 +03:00
Avi Kivity	1643549d08	Merge 'Coroutinize the sstable reader' from Wojciech Mitros This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one. This patch makes the main sstable reader process a coroutine, which allows simplifying it by: - using the state saved in the coroutine instead of most of the states saved in the _state variable - removing the switch statement and moving the code of former switch cases, resulting in a reduced number of jumps in code - removing repetitive ifs for read statuses, by adding them to the coroutine implementation The coroutine is saved in a new class `processing_result_generator`, which works like a generator: using its `generate()` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine (`do_process_state()`). Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines. However, usage of C++ coroutines has a non-negligible effect on the performance of the sstable reader. In the test cases from `perf_fast_forward` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this loss is observed in cases where we're reading many subsequent rows, without any skips. Thanks to finding an optimization during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement.
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt) Test: unit(dev) Refs: #7952 Closes #9002 * github.com:scylladb/scylla: mx sstable reader: reduce code blocks mx sstable reader: make ifs consistent sstable readers: make awaiter for read status mx sstable reader: don't yield if the data buffer is not empty mx sstable reader: combine FLAGS and FLAGS_2 states mx sstable reader: reduce placeholder state usage mx sstable reader: replace non_consuming states with a bool mx sstable reader: reduce placeholder state usage mx sstable reader: replace unnecessary states with a placeholder mx sstable reader: remove false if case mx sstable reader: remove row_body_missing_columns_label mx sstable reader: remove row_body_deletion_label mx sstable reader: remove column_end_label mx sstable reader: remove column_cell_path_label mx sstable reader: remove column_ttl_label mx sstable reader: remove column_deletion_time_label mx sstable reader: remove complex_column_2_label mx sstable reader: remove row_body_missing_columns_read_columns_label mx sstable reader: remove row_body_marker_label mx sstable reader: remove row_body_shadowable_deletion_label mx sstable reader: remove row_body_prev_size_label mx sstable reader: remove ck_block_label mx sstable reader: remove ck_block2_label mx sstable reader: remove clustering_row_label and complex_column_label mx sstable reader: remove labels with only one goto mx sstable reader: replace the switch cases with gotos and a new label mx sstable reader: remove states only reached consecutively or from goto mx sstable reader: remove switch breaks for consecutive states mx sstable reader: convert readers main method into a coroutine kl sstable reader: replace states for ending with one state, simplify non_consuming kl sstable reader: remove unnecessary states kl sstable reader: remove 
unnecessary yield kl sstable reader: remove unnecessary blocks kl sstable reader: fix indentation kl sstable reader: replace switch with standard flow control kl sstable reader: remove state::CELL case kl sstable reader: move states code only reachable from one place kl sstable reader: remove states only reached consecutively kl sstable reader: remove switch breaks for consecutive states kl sstable reader: remove unreachable case kl sstable reader: move testing hack for fragmented buffers outside the coroutine kl sstable reader: convert readers main method into a coroutine sstable readers: create a generator class for coroutines	2021-07-15 12:06:14 +03:00
Wojciech Mitros	45058776c2	mx sstable reader: reduce code blocks Some blocks of code were surrounded by curly braces, because a variable was declared inside a switch case. After changes, some of the variable declarations are in if/else/while cases, and no longer need to be in separate code blocks, while other blocks can be extended to entire labels for simplicity.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9b333908e4	mx sstable reader: make ifs consistent In several places we're checking the return value of our consumers' consume_* calls. Because the behaviour in all cases is the same, let us use the same notation as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	dc38605f75	sstable readers: make awaiter for read status After each read* call of the primitive_consumer we need to check if the entire primitive was in our current buffer. We can check it in the proceed_generator object by yielding the returned read status: if the yielded status is ready, the yield_value method returns a structure whose await_ready() method returns true. Otherwise it returns false. The returned structure is co_awaited by the coroutine (due to co_yield), and if await_ready() returns true, the coroutine isn't stopped; conversely, if it returns false (technical: and because its await_suspend method returns void), the coroutine stops, and a proceed::yes value is saved, indicating that we need more buffers.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	09a0cd7c05	mx sstable reader: don't yield if the data buffer is not empty The skip() method returns a skip_bytes object if we want to skip the entire buffer, otherwise it returns a proceed::yes and trims the buffer. If the buffer is only trimmed we don't need to interrupt the coroutine, we simply continue instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	5dc64532bd	mx sstable reader: combine FLAGS and FLAGS_2 states We don't differentiate between FLAGS and FLAGS_2 in verify_end_state(), so we can merge them into one state.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	ab1e6f4211	mx sstable reader: reduce placeholder state usage After the changes to non_consuming states, we can remove some state::OTHER assignments again.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	c904ab12c8	mx sstable reader: replace non_consuming states with a bool The non_consuming() method is only used after assuring that primitive_consumer::active() (in continuous_data_consumer::process()) so we don't need states where primitive_consumer::active(), which is most of them. We still need to make sure that the states change when they need to, so we replace all the concerned states with the placeholder state, and for the few states from the non_consuming() OR, where the primitive_consumer::active() returns true, we set the value of _consuming to false, changing it back when the state is no longer non_consuming.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b05d3eefed	mx sstable reader: reduce placeholder state usage We can remove state assignments that we know are changing a state to itself. Similarly, if a state is changed in the same way in an if and an else, it can be changed before the if/else instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b2e3fbffd0	mx sstable reader: replace unnecessary states with a placeholder After removing the switch, the state is only used for verify_end_state() and non_consuming(), so we can replace states that are not used there with a single one, so that the state still stops being one of the appearing states when it needs to.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9a7a8fa86c	mx sstable reader: remove false if case consume_row_marker_and_tombstone does not return proceed::no in the mp_row_consumer_m implementation, and even if it did, we would most likely want to yield proceed::no in that case as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	2262aac11a	mx sstable reader: remove row_body_missing_columns_label row_body_missing_columns_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	99b5a332db	mx sstable reader: remove row_body_deletion_label row_body_deletion_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	cbce22a88b	mx sstable reader: remove column_end_label column_end_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	925d921cb4	mx sstable reader: remove column_cell_path_label column_cell_path_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	e85987a439	mx sstable reader: remove column_ttl_label column_ttl_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	4b3607e97b	mx sstable reader: remove column_deletion_time_label column_deletion_time_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	8cf23c3b01	mx sstable reader: remove complex_column_2_label complex_column_2_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	fbe28d18f3	mx sstable reader: remove row_body_missing_columns_read_columns_label row_body_missing_columns_read_columns_label is only reached consecutively, or from a goto after the label. This is changed to a while loop starting at the label and ending at the goto. The code executed in the only case we do not reach the goto (so when exiting the loop) is moved after the while.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	3b512ea2c2	mx sstable reader: remove row_body_marker_label row_body_marker_label is only reached from one goto inside an else case, or consecutively, so the code omitted by goto can be moved inside the corresponding if case.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0bcde69319	mx sstable reader: remove row_body_shadowable_deletion_label row_body_shadowable_deletion_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	3d0fdf9f3b	mx sstable reader: remove row_body_prev_size_label row_body_prev_size_label is only reached consecutively, or from a goto not far after the label. This is changed to a while loop starting at the label and ending at the goto.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b27166c36f	mx sstable reader: remove ck_block_label ck_block_label is only reached consecutively, or from a few gotos not far after the label. This is changed to a while loop with gotos replaced with continue's.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	ec6c2f0e07	mx sstable reader: remove ck_block2_label ck_block2_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	1e59e249ec	mx sstable reader: remove clustering_row_label and complex_column_label clustering_row_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else). Also remove complex_column_label because it is next to its only goto.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	440aba61a9	mx sstable reader: remove labels with only one goto If a case is reached only after jumping with a single goto, that goto may be replaced with the target code.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	65f7eb5ada	mx sstable reader: replace the switch cases with gotos and a new label Because the number of remaining cases is moderately low, and after finishing a case we always enter another one, the switch is removed completely, and the last remaining cases are handled by 3 additional gotos and 1 new label.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0398c68797	mx sstable reader: remove states only reached consecutively or from goto If a state is never reached from the top of the switch, but only by continuing from the previous case, we don't need to have a case: for it. Similarly, if there is a label that we goto, we don't need the switch case.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	f87b27b9e4	mx sstable reader: remove switch breaks for consecutive states If _state at the end of a switch case has the same value as the next case, instead of breaking the switch, we can just fall through.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	32b996aca5	mx sstable reader: convert readers main method into a coroutine (same as in kl sstable reader) The function is converted to a coroutine simply by adding an infinite loop around the switch, and starting another iteration after yielding a value, instead of returning. Because the coroutine resume() function does not take any arguments, a new member is introduced to remember the "data" buffer, that was previously an argument to the method.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	4816e8120b	kl sstable reader: replace states for ending with one state, simplify non_consuming After removing the switch, the only uses for states in the sstable reader are the methods non_consuming() and verify_end_state(). The non_consuming() method is only called after ensuring that !primitive_consumer::active() (in continuous_data_consumer::process()), so this method does not need the states in which primitive_consumer::active() holds, and that is actually all of them. We don't differentiate between ATOM_START and ATOM_START_2 in verify_end_state(), so we can just merge them into one. While we need to remember when we enter the states used in verify_end_state(), we also need to remember when we exit them. For that reason we introduce a new state "NOT_CLOSING", which fails all comparisons in verify_end_state() and replaces all states that aren't used in verify_end_state().	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0c284a8b5e	kl sstable reader: remove unnecessary states After removing the switch, the state is only used for verify_end_state() and non_consuming(), so we can remove states that are not used there (and which do not change them).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	35c30e6178	kl sstable reader: remove unnecessary yield We don't need to yield row_consumer::proceed::yes if we are not parsing a primitive using primitive_consumer, we can just continue execution.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	97c7b5fe76	kl sstable reader: remove unnecessary blocks Some blocks of code were surrounded by curly braces, because a variable was declared inside a switch case. With standard flow control, it's no longer needed.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	914e4f27e9	kl sstable reader: fix indentation To simplify review, the code moved in previous commits didn't change its indentation. This commit fixes it.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	7a6729159f	kl sstable reader: replace switch with standard flow control We get rid of the switch by using the infinite loop around the switch for jumping to the first case, adding an infinite loop around the second case (one break from the switch with the state of the first case becomes a break of the new while), and adding an if around the first case (because we never break in the first case).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	cfe6a46a60	kl sstable reader: remove state::CELL case The CELL state is only set in the if/else block immediately before the CELL case, so we don't need to have a case for it.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	c41f49d2e5	kl sstable reader: move states code only reachable from one place If a case is reached only after exiting a certain other case (or goto) its code may as well be moved to that place.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	5f27413c1f	kl sstable reader: remove states only reached consecutively If a state is never reached from the top of the switch, but only by continuing from the previous case, we don't need to have a case: for it.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	e226fc12c9	kl sstable reader: remove switch breaks for consecutive states If _state at the end of a switch case has the same value as the next case, instead of breaking the switch, we can just fall through.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	bc7ed3f596	kl sstable reader: remove unreachable case The STOP_THEN_ATOM_START is never reached, so it can be removed altogether.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	63d1a44d12	kl sstable reader: move testing hack for fragmented buffers outside the coroutine The testing hack can't be done inside the coroutine, because we don't have the original "data" buffer	2021-07-14 20:50:30 +02:00
Wojciech Mitros	6fff9aed3c	kl sstable reader: convert readers main method into a coroutine The function is converted to a coroutine simply by adding an infinite loop around the switch, and starting another iteration after yielding a value, instead of returning. Because the coroutine resume() function does not take any arguments, a new member is introduced to remember the "data" buffer, that was previously an argument to the method.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	01c2f406df	sstable readers: create a generator class for coroutines The data_consume_rows_context and data_consume_rows_context_m are classes that use primitive_consumer read* methods to get primitives from a streamed sstable, and, using their corresponding consumers' (mp_row_consumer_k_l and mp_row_consumer_m) consume* methods, fill the buffer of the corresponding flat_mutation_reader. The main procedure where we decide which read* and consume* methods to call is do_process_state. We save the current state of the procedure in the _state variable, to remember where to continue in the next call. For each call, the do_process_state method returns information about whether we can keep filling the buffer using more buffers from the stream (proceed::yes), or not (proceed::no). The saved state can be (mostly) removed by using a generator coroutine, whose state is saved when its execution is halted, and which yields the values that do_process_state would return before. The processing_result_generator is a class for managing a generator coroutine. When the coroutine halts, the proceed_generator saves the value yielded by the coroutine, and returns it to the caller.	2021-07-14 20:50:27 +02:00
Piotr Sarna	3d816b7c16	Merge 'Move the reader concurrency semaphore in front of the cache' from Botond This patchset combines two important changes to the way reader permits are created and admitted: 1) It switches admission to be up-front. 2) It changes the admission algorithm. (1) Currently permits are created before the read is started, but they only wait for admission when going to the disk. This leaves the resource consumption of cache and memtable reads unbounded, possibly leading to OOM (rare but happens). This series changes this so that permits are admitted at the moment they are created, making admission up-front -- at least for those reads that pass admission at all (some don't). (2) Admission currently is based on availability of resources. We have a certain amount of memory available, which is derived from the memory available to the shard, as well as a hardcoded count resource. Reads are admitted when a count and a certain amount (base cost) of memory is available. This patchset adds a new aspect to this admission process beyond the existing resource availability: the number of used/blocked reads. Namely it only admits new reads if in addition to the necessary amount of resources being available, all currently used readers are blocked. In other words we only admit new reads if all currently admitted reads require something other than CPU to progress. They are either waiting on I/O, a remote shard, or attention from their consumers (not used currently). The reason for making these two changes at the same time is that up-front admission means cache reads now need to obtain a permit too. For cache reads the optimal concurrency is 1. Anything above that just increases latency (without increasing throughput). So we want to make sure that if a cache reader hits it doesn't get any competition for CPU and it can run to completion. We admit new reads only if the read misses and has to go to disk. A side effect of these changes is that the execution stages from the replica-side read path are replaced with the reader concurrency semaphore as an execution stage. This is necessary due to bad interaction between said execution stages and up-front admission. This has an important consequence: read timeouts are more strictly enforced because the execution stage doesn't have a timeout so it can execute already timed-out reads too. This is not the case with the semaphore's queue which will drop timed-out reads. Another consequence is that data and mutation reads now share the same execution stage, which increases its effectiveness; on the other hand, system and user reads no longer do. Fixes: #4758 Fixes: #5718 Tests: unit(dev, release, debug) * 'reader-concurrency-semaphore-in-front-of-the-cache/v5.3' of https://github.com/denesb/scylla: (54 commits) test/boost/reader_concurrency_semaphore_test: add used/blocked test test/boost/reader_concurrency_semaphore_test: add admission test reader_permit: add operator<< for reader_resources reader_concurrency_semaphore: add reads_{admitted,enqueued} stats table: make_sstable_reader(): fix indentation table: clean up make_sstable_reader() database: remove now unused query execution stages mutation_reader: remove now unused restricting_reader sstables: sstable_set: remove now unused make_restricted_range_sstable_reader() reader_permit: remove now unused wait_admission() reader_concurrency_semaphore: remove now unused obtain_permit_nowait() reader_concurrency_semaphore: admission: flip the switch database: increase semaphore max queue size test: index_with_paging_test: increase semaphore's queue size reader_concurrency_semaphore: add set_max_queue_size() test: mutation_reader_test: remove restricted reader tests reader_concurrency_semaphore: remove now unused make_permit() test: reader_concurrency_semaphore_test: move away from make_permit() test: move away from make_permit() treewide: use make_tracking_only_permit() ...	2021-07-14 16:22:56 +02:00
Botond Dénes	e2dfb2df71	test/boost/reader_concurrency_semaphore_test: add used/blocked test Make sure that releasing a bunch of used/blocked guards in random order doesn't break the permit state.	2021-07-14 17:19:02 +03:00
Botond Dénes	0337d3ea4a	test/boost/reader_concurrency_semaphore_test: add admission test Checking every conceivable admission scenario (hopefully).	2021-07-14 17:19:02 +03:00
Botond Dénes	b81f39cec9	reader_permit: add operator<< for reader_resources And use it in tests, it results in actually useful error messages.	2021-07-14 17:19:02 +03:00
Botond Dénes	1666ad078a	reader_concurrency_semaphore: add reads_{admitted,enqueued} stats Primarily for tests, but we could also export these, should we want to.	2021-07-14 17:19:02 +03:00
Botond Dénes	46c9106bdf	table: make_sstable_reader(): fix indentation	2021-07-14 17:19:02 +03:00
Botond Dénes	7ddde9107e	table: clean up make_sstable_reader() Remove all the now unneeded mutation sources.	2021-07-14 17:19:02 +03:00
Botond Dénes	ae4df99e6b	database: remove now unused query execution stages	2021-07-14 17:19:02 +03:00
Botond Dénes	16d3cb4777	mutation_reader: remove now unused restricting_reader Move the now orphaned new_reader_base_cost constant to database.hh/table.cc, as its main user is now `table::estimate_read_memory_cost()`.	2021-07-14 17:19:02 +03:00
Botond Dénes	2bab76c80e	sstables: sstable_set: remove now unused make_restricted_range_sstable_reader()	2021-07-14 17:19:02 +03:00
Botond Dénes	5b8d6f02eb	reader_permit: remove now unused wait_admission()	2021-07-14 17:19:02 +03:00
Botond Dénes	c86573813f	reader_concurrency_semaphore: remove now unused obtain_permit_nowait()	2021-07-14 17:19:02 +03:00
Botond Dénes	1b7eea0f52	reader_concurrency_semaphore: admission: flip the switch This patch flips two "switches": 1) It switches admission to be up-front. 2) It changes the admission algorithm. (1) by now all permits are obtained up-front, so this patch just yanks out the restricted reader from all reader stacks and simultaneously switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By doing this admission is now waited on when creating the permit. (2) we switch to an admission algorithm that adds a new aspect to the existing resource availability: the number of used/blocked reads. Namely it only admits new reads if in addition to the necessary amount of resources being available, all currently used readers are blocked. In other words we only admit new reads if all currently admitted reads require something other than CPU to progress. They are either waiting on I/O, a remote shard, or attention from their consumers (not used currently). We flip these two switches at the same time because up-front admission means cache reads now need to obtain a permit too. For cache reads the optimal concurrency is 1. Anything above that just increases latency (without increasing throughput). So we want to make sure that if a cache reader hits it doesn't get any competition for CPU and it can run to completion. We admit new reads only if the read misses and has to go to disk. Another change made to accommodate this switch is the replacement of the replica side read execution stages with the reader concurrency semaphore as an execution stage. This replacement is needed because with the introduction of up-front admission, reads are not independent of each other anymore. One read executed can influence whether later reads executed will be admitted or not, and execution stages require independent operations to work well. By moving the execution stage into the semaphore, we have an execution stage which is in control of both admission and running the operations in batches, avoiding the bad interaction between the two.	2021-07-14 17:19:02 +03:00
Botond Dénes	01a4bb33de	database: increase semaphore max queue size Queued reads haven't taken 10KB (not even 1KB) for years now. But the real motivation of this patch is that due to a soon-to-come change to admission we expect larger queues especially in tests, so be more forgiving with queue sizes.	2021-07-14 17:19:02 +03:00
Botond Dénes	dcf49dcb67	test: index_with_paging_test: increase semaphore's queue size To allow the flood of reads generated by this test to be queued up during up-front admission without failing the test.	2021-07-14 17:19:02 +03:00
Botond Dénes	79fefc490c	reader_concurrency_semaphore: add set_max_queue_size()	2021-07-14 17:19:02 +03:00
Botond Dénes	388da36bbb	test: mutation_reader_test: remove restricted reader tests Soon we will switch to up-front admission which will break these tests. No point in trying to fix them as once the switch is done we'll retire the restricted reader too. Remove these tests now so they are not in the way of progress.	2021-07-14 17:19:02 +03:00
Botond Dénes	00511100a4	reader_concurrency_semaphore: remove now unused make_permit()	2021-07-14 17:19:02 +03:00
Botond Dénes	bacfaf9582	test: reader_concurrency_semaphore_test: move away from make_permit() Migrate to the appropriate up-front admission variants.	2021-07-14 17:19:02 +03:00
Botond Dénes	c07db00b70	test: move away from make_permit() Use the most appropriate up-front admission variant.	2021-07-14 17:19:02 +03:00
Botond Dénes	7bfa40a2f1	treewide: use make_tracking_only_permit() For all those reads that don't (won't or can't) pass through admission currently.	2021-07-14 17:19:02 +03:00
Nadav Har'El	8bdff97d8d	Merge 'Fix propagating view update generation failures' from Piotr Sarna When the generate-and-propagate-view-updates routine was rewritten to allow partial results, one important validation got lost: previously, an error which occurred during update generation was propagated to the user - as an example, the indexed column value must be smaller than 64kB, otherwise it cannot act as primary key part in the underlying view. Errors on view update propagation are however ignored in this layer, because it becomes a background process. During the rewrite these two got mixed up and so it was possible to ignore an error that should have been propagated. This behavior is now fixed. Fixes #9013 Closes #9021 * github.com:scylladb/scylla: cql-pytest: add a case for too large value in SI table: stop ignoring view generation errors on write path	2021-07-14 15:49:48 +02:00
Piotr Sarna	91b4e24db5	Merge 'Tests for Alternator's TTL feature' from Nadav Har'El This series includes a comprehensive test suite for the DynamoDB API's TTL (item expiration) feature described in issue #5060. Because we have not yet implemented the TTL feature in Alternator, all of the tests still xfail, but they all pass on DynamoDB and demonstrate exactly how the TTL feature works and how it interacts with other features such as LSI, GSI and Streams. The patch which introduces these tests is heavily commented to explain exactly what it tests, and why. Because DynamoDB only expires items some 10-30 minutes after their expiration time (the documentation even suggests it can be delayed by 24 hours!), some of these tests are extremely long (up to 30 minutes!), so we also introduce in this series a new marker for "verylong" tests. verylong tests are skipped by default, unless the "--runverylong" option is given. In the future, when we implement the TTL feature in Alternator and start testing it, we may be able to configure it with a much shorter expiration timeout and then we might be able to run these tests in a reasonable time and make them run by default. Closes #8564 * github.com:scylladb/scylla: test/alternator: add tests for the Alternator TTL feature test/alternator: add marker for "veryslow" tests test/alternator: add new_test_table() utility function	2021-07-14 15:49:48 +02:00
Botond Dénes	0ced9c83b7	mutation_reader: evictable_reader: futurize resume_or_recreate_reader() In preparation for waiting for readmission after eviction in a later patch.	2021-07-14 16:48:43 +03:00
Botond Dénes	f37e26c73d	querier: remove now unused cache_context	2021-07-14 16:48:43 +03:00
Botond Dénes	7f2813e3fa	database: mutation_query(): handle querier lookup/save on the database level Instead of passing down the querier_cache_ctx to table::mutation_query(), handle the querier lookup/save on the level where the cache exists. The real motivation behind this change however is that we need to move the lookup outside the execution stage, because the current execution stage will soon be replaced by the one provided by the semaphore and to use that properly we need to know if we have a saved permit or not.	2021-07-14 16:48:43 +03:00
Botond Dénes	f9d302bf49	database: mutation_query(): convert into coroutine To facilitate further patching (and reading).	2021-07-14 16:48:43 +03:00
Botond Dénes	d2f5393a43	database: query(): handle querier lookup/save on the database level Instead of passing down the querier_cache_ctx to table::query(), handle the querier lookup/save on the level where the cache exists. The real motivation behind this change however is that we need to move the lookup outside the execution stage, because the current execution stage will soon be replaced by the one provided by the semaphore and to use that properly we need to know if we have a saved permit or not.	2021-07-14 16:48:43 +03:00
Botond Dénes	c28a6e8537	database: query(): convert into coroutine To facilitate further patching (and reading).	2021-07-14 16:48:43 +03:00
Botond Dénes	6efb278ea3	querier_cache: insert(): close refused queriers The querier cache refuses to cache queriers that read in reverse. These queriers are also not closed, with the caller having no way to determine whether the querier it just moved into `insert()` needs a close afterwards or not, requiring a `close()` on the moved-from querier just to be sure. Avoid this by consistently closing all passed-in queriers, including those the cache refuses to save. For this, the internal `insert_querier()` method has to be made a member to be able to use the closing gate.	2021-07-14 16:48:43 +03:00
Botond Dénes	5291494a50	mutation_reader: shard reader: use reader_lifecycle_policy::obtain_reader_permit() Co-routinize the reader creation lambda in the process.	2021-07-14 16:48:43 +03:00
Botond Dénes	426b46c4ed	mutation_reader: reader_lifecycle_policy: add obtain_reader_permit() This method is both a convenience method to obtain the permit, as well as an abstraction to allow different implementations to get creative. For example, the main implementation, the one in multishard mutation query, returns the permit of the saved reader if one was successfully looked up. This ensures that on a multi-paged read the same permit is used across as many pages as possible. Much more importantly it ensures the evictable reader wrapping the actual reader both use the same permit.	2021-07-14 16:48:43 +03:00
Botond Dénes	7fcf4a63c5	multishard_mutation_query: use the passed-in permit to create new reader Ensure that when the reader has to be created anew the passed-in permit is used to create it, instead of the one left over in remote-parts, which is that of the already evicted reader. This lays the groundwork to ensure the same permit is used across all pages of a read, by a future patch which creates the wrapping reader with the existing permit.	2021-07-14 16:48:43 +03:00
Botond Dénes	97a03f9027	database: make_multishard_streaming_reader: use external permit As a preparation for up-front admission, add a permit parameter to `make_multishard_streaming_reader()`, which will be the admitted permit once we switch to up-front admission. For now it has to be a non-admitted permit. A nice side-effect of this patch is that now permits will have a use-case specific description, instead of the generic "multishard-streaming-reader" one	2021-07-14 16:48:43 +03:00
Botond Dénes	5293bd21cf	streaming/stream_session: use database::obtain_reader_permit()	2021-07-14 16:48:43 +03:00
Botond Dénes	292a8819ec	repair/row_level: use database::obtain_reader_permit()	2021-07-14 16:48:43 +03:00
Botond Dénes	f28b5018f2	view/view_update_generator: use obtain_reader_permit()	2021-07-14 16:48:43 +03:00
Botond Dénes	999169e535	database: make_streaming_reader(): require permit As a preparation for up-front admission, add a permit parameter to `make_streaming_reader()`, which will be the admitted permit once we switch to up-front admission. For now it has to be a non-admitted permit. A nice side-effect of this patch is that now permits will have a use-case specific description, instead of the generic "streaming" one.	2021-07-14 16:48:43 +03:00
Botond Dénes	3ec149222d	database: add obtain_reader_permit() A convenience method for obtaining an admitted permit for a read on a given table. For now it uses the nowait semaphore obtaining method, as all normal reads still use the old admission method. Migrating reads to this method will make the switch easier, as there will be one central place to replace the nowait method with the proper one.	2021-07-14 16:48:43 +03:00
Botond Dénes	a6b59f0d89	table: add estimate_read_memory_cost() To be used for determining the base cost of reads used in admission. For now it just returns the already used constant. This is a forward looking change, to when this will be a real estimation, not just a hardcoded number.	2021-07-14 16:48:43 +03:00
Botond Dénes	af8f39a775	reader_concurrency_semaphore: make it an execution stage The execution stage functionality is exposed via two new member functions, `with_permit()` and `with_ready_permit()`. Both accept a function to be run. The former obtains a permit then runs the passed in function through the execution stage. The latter allows an already obtained permit to be passed in.	2021-07-14 16:48:43 +03:00
Botond Dénes	5d3ddba2c7	reader_concurrency_semaphore: make_permit(): add up-front admission variants Three new methods are added for creating permits: 1) obtain_permit() 2) obtain_permit_nowait() 3) make_tracking_only_permit() (1) is meant to replace `make_permit()` + `wait_admission()`, by integrating the waiting for admission into the process of creating the permit. This is the method meant to be used to create permits from here on, ensuring that each read passes admission before even being started. (2) is a bridge between the old and new world. Up-front admission cannot coexist with the restricted reader in the same read, so those reads that have a restricted reader in their stack can use this method to create a non-admitted permit to be admitted by the restricted reader later. Once we have migrated all reads to (1) or (2), we can get rid of the restricted reader and just replace (2) with (1) in the codebase. (2) returns a future to make this a simple rename, the churn of dealing with a future<reader_permit> return type already having been dealt with by then. (3) is for reads that bypass admission, yet their resource usage does participate in the admission of other reads. This is the equivalent of reads that don't pass admission at all. The following patches will gradually transition the codebase away from the old permit API, and once the transition is complete, we can switch over to do the admission up-front at once.	2021-07-14 16:48:43 +03:00
Botond Dénes	844a99a91a	reader_concurrency_semaphore: prepare for up-front admission We want to make permits be admitted up-front, before even being created. As part of this change, we will get rid of the `wait_admission()` method on the permit, instead, the permit will be created as a result of waiting for admission (just like back some time ago). To allow evicted readers to wait for re-admission, a new method `maybe_wait_readmission()` is created, which waits for readmission if the permit is in evicted state. Also refactor the internals of the semaphore to support and favor up-front admission code. As up-front admission is the future we want the permit code to be organized in such a way that it is natural to use with it. This means that the "old-style" admission code might suffer but we tolerate this as it is on its way out. To this end the following changes were done: * Add a _base_resources field to reader_permit which tracks the base cost of said permit. This is passed in the constructor and is used in the first and subsequent admissions. * The base cost is now managed internally by the permit, instead of relying on an external `resource_units` instance, though the old way is still supported temporarily. * Change the admission pipeline to favor the new permit-internally managed base cost variant. * Compatibility with old-style admission: permits are created with 0 base resources, base resources are set with the compatibility method `set_base_resources()` right before admission, then externalized again after admission with `base_resource_as_resource_units()`. These methods will be gone when the old style admission is retired (together with `wait_admission()`).	2021-07-14 16:48:43 +03:00
Botond Dénes	05e6881c73	reader_permit: allow constructing reader_permit from impl& By enabling shared from this for impl and adding a reader permit constructor which takes a shared pointer to an impl. This allows impl members to invoke functions requiring a `reader_permit` instance as a parameter.	2021-07-14 16:48:43 +03:00
Botond Dénes	ea2345c944	db/size_estimates_virtual_reader: mark as blocked when obtaining local ranges	2021-07-14 16:48:43 +03:00
Botond Dénes	b5cbd19383	mutation_reader: shard_reader: mark permit as blocked when waiting on remote shard	2021-07-14 16:48:43 +03:00
Botond Dénes	6f6a8f5cf8	mutation_reader: shard_reader: coroutinize fill_buffer() and fast_forward_to() To facilitate further patching (and make the code look nicer too).	2021-07-14 16:48:43 +03:00
Botond Dénes	26e83bdde8	mutation_reader: foreign_reader: mark permit as blocked when waiting on remote shard	2021-07-14 16:48:43 +03:00
Botond Dénes	434f2efde5	sstables: continuous_data_consumer: mark permit as blocked when doing IO	2021-07-14 16:48:43 +03:00
Botond Dénes	aa480fa3f9	reader_permit: allow marking blocked Distinguish between permits that are blocked and those that are not. Conceptually a blocked permit is one that needs to wait on either I/O or a remote shard to proceed. This information will be used by admission, which will only admit new reads when all currently used ones are blocked. More on that in the commit introducing this new admission type. This patch only adds the infrastructure, block sites are not marked yet.	2021-07-14 16:48:43 +03:00
Botond Dénes	9cb36cc516	test: continuous_data_consumer_test: mark permit as used	2021-07-14 16:48:43 +03:00
Botond Dénes	47342ae8a8	mutation_reader: shard_reader: mark underlying permit as used	2021-07-14 16:48:43 +03:00
Botond Dénes	a5dc48b4b1	reader_permit: allow marking it as used Distinguish between permits that are used and those that are not. These are two subtypes of the current 'active' state (and replace it). Conceptually a permit is used when any readers associated with it have a pending call to any of their async methods, i.e. the consumer is actively consuming from them. This information will be used for admission, together with a new blocked state introduced by a future patch. This patch only adds the infrastructure, use sites are not marked yet.	2021-07-14 16:48:43 +03:00
Botond Dénes	5a20861a1d	reader_permit: add reader_permit_opt	2021-07-14 16:48:43 +03:00
Botond Dénes	a251cc2368	reader_permit: introduce evicted state We want to introduce more fine-grained states for permits than what we have currently, splitting the current 'active' state into multiple sub-states. As a preparatory step, introduce an evicted state too, to keep track of permits that were evicted while being inactive. This will be important in determining what permits need to re-wait admission, once we keep permits across pages. Having an evicted state also aids validating internal state transitions.	2021-07-14 16:48:43 +03:00
Botond Dénes	5416fc6d1b	reader_concurrency_semaphore: add current_permits to permit_stats	2021-07-14 16:48:43 +03:00
Botond Dénes	c97fc16105	reader_concurrency_semaphore: extract waiter admission into separate function Because soon we will have more than one place to trigger waiter admission from.	2021-07-14 16:48:43 +03:00
Nadav Har'El	2acfee8118	test/alternator: add tests for the Alternator TTL feature This patch adds a comprehensive test suite for the DynamoDB API's TTL (item expiration) feature. The tests check the two new API commands added by this feature (UpdateTimeToLive and DescribeTimeToLive), and also how items are expired in practice, and how item expiration interacts with other features such as GSI, LSI and DynamoDB Streams. Because DynamoDB has extremely long delays until items are expired, or until expiration configuration may be changed, several of these tests take up to 30 minutes to complete. We mark these tests with the "verylong" marker, so they are skipped in ordinary test runs - use the "--runverylong" option to run them. All these tests currently pass on DynamoDB, but xfail on Alternator because the two commands UpdateTimeToLive and DescribeTimeToLive are currently rejected by Alternator. Refs #5060 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-07-14 14:08:55 +03:00
Avi Kivity	6df3139455	install-dependencies.sh: add gdb gdb is used for testing scylla-gdb.py (since `3c2e852dd`), so it needs to be listed as a dependency. Add it there. It was listed as a courtesy dependency in the frozen toolchain (which is why it still worked), so it's removed from there. Closes #9034	2021-07-14 10:15:54 +03:00
Eliran Sinvani	ccdef39d21	Service Level Controller: Stop configuration polling loop upon leaving the cluster This change subscribes service_level_controller for node life cycle notifications and uses the notification of leaving the cluster for the current node to stop the configuration polling loop. If the loop continues to run, its queries will fail consistently since the nodes will not answer queries. It is worth mentioning that the queries failing in the current state of the code is harmless but noisy: after 90 seconds, if the scylla process is not shut down, the failures will start to generate failure logs every 90 seconds, which is confusing for users. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-07-14 09:31:40 +03:00
Eliran Sinvani	55e2aabbae	main: Stop using get_local_storage_service in main This change removes the use of service::get_local_storage_service, instead, the approach taken is similar to other modules, for example storage_proxy where a reference to the `sharded` module is obtained once and then the local() method in combination with capturing is used.	2021-07-14 09:31:36 +03:00
Avi Kivity	64ad31c26f	build: enable -Wc++1z-extensions It was disabled for the move to clang, but now apparently no longer needed. So re-enable that warning. Closes #9026	2021-07-14 08:28:26 +03:00
Nadav Har'El	9d07ce3cb6	test/alternator: add marker for "veryslow" tests Until now, Alternator tests have all been very fast, taking milliseconds or at worst seconds each - or a bit longer on DynamoDB. However, sometimes we need to write tests which take a huge amount of time - for example, tests for the TTL feature may take 10 minutes because the item expiration is delayed by that much. Because a 10 minute test is ridiculous (all 500 Alternator tests together take just one minute today!), we would normally run such a test once, and then mark it "skip" so it will never run again. One annoying thing about skipped tests is that there is no way to temporarily "unskip" them when we want to run such a test anyway. So in this patch, we introduce a better option for these very slow tests instead of the simple "skip": The patch introduces a marker "@pytest.mark.veryslow". By default, a test with this marker is skipped. However, a command-line option "--runveryslow" is introduced which causes tests with the veryslow mark to be run anyway, and not skipped. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-07-14 00:26:21 +03:00
Nadav Har'El	2fb379bb94	test/alternator: add new_test_table() utility function This patch adds a convenient function new_test_table() that Alternator tests can use to safely create a temporary table, and be sure it is deleted in any case. This function is used in a "with", as follows: with new_test_table(dynamodb, ...) as table: do_something(table) # at this point table has already been deleted. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-07-14 00:26:21 +03:00
Piotr Sarna	11a054c1fc	cql-pytest: add a case for too large value in SI The test case tries to insert a too-large value into an indexed column and expects to see a write failure. Refs #9013	2021-07-13 17:26:18 +02:00
Piotr Sarna	73f7702a69	table: stop ignoring view generation errors on write path When the generate-and-propagate-view-updates routine was rewritten to allow partial results, one important validation got lost: previously, an error which occurred during update generation was propagated to the user - as an example, the indexed column value must be smaller than 64kB, otherwise it cannot act as primary key part in the underlying view. Errors on view update propagation are however ignored in this layer, because it becomes a background process. During the rewrite these two got mixed up and so it was possible to ignore an error that should have been propagated. This behavior is now fixed. Fixes #9013	2021-07-13 17:20:38 +02:00
Benny Halevy	c8e7bd9a26	storage_proxy: abstract_read_resolver: catch semaphore_timed_out before timed_out_error Prepare for making semaphore_timed_out derived from timed_out_error in seastar. When this happens in seastar, we would need to catch the derived, more-specific exception first to avoid the following warning: ``` service/storage_proxy.cc:2818:18: error: exception of type 'seastar::semaphore_timed_out &' will be caught by earlier handler [-Werror,-Wexceptions] } catch (semaphore_timed_out&) { ^ service/storage_proxy.cc:2815:18: note: for type 'seastar::timed_out_error &' } catch (timed_out_error&) { ^ ``` Later on, after the seastar change is applied to the scylla repo, we can eliminate the duplication and catch only timed_out_error. Test: unit(dev) (w/ the seastar changes to semaphore_timed_out and rpc::timeout_error to inherit from timed_out_error). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210713132858.294504-1-bhalevy@scylladb.com>	2021-07-13 16:39:17 +03:00
Nadav Har'El	c174eeae06	alternator: do not allow LSI on base table with no sort key The purpose of an LSI (local secondary index) in Alternator is to allow a different sort key for the existing partitions, keeping the same division into partitions. So it doesn't make sense to create an LSI on a table that did not originally have a sort key (i.e., single-item partitions). DynamoDB indeed doesn't allow this case, and Alternator forgot to forbid it - so this patch adds the missing check to the CreateTable operation. This patch also adds a test case for this, test_lsi_wrong_no_sort_key, which failed before the patch and passes after it (and also passes on DynamoDB). Also, the existing test_lsi_wrong tests for bad LSI creation attempts by mistake used a base table without a sort key - so while they encountered an error as expected, it was not the right error! So we fix that test (and split it into two tests), adding the missing sort key and exposing the actual errors that the tests were meant to expose. That test passed before this patch and also afterwards - but at least after the patch it is actually testing what it was meant to be testing. Fixes #9018. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210713123747.1012954-1-nyh@scylladb.com>	2021-07-13 15:12:01 +02:00
Nadav Har'El	1ff1c3735b	Merge 'Remove the mutation-based restriction checks' from Piotr Sarna This series unifies the interface for checking if CQL restrictions are satisfied. Previously, an additional mutation-based approach was added in the materialized views layer, but the decision was reached that it's better to have a single API based on partition slices. With that, the regular selection path gets simplified at the cost of more complicated view generation path, which is a good tradeoff. Note that in order to unify the interface, the view layer performs ugly transformations in order to adjust the input for `is_satisfied_by`. Reviewers, please take a close look at this code (`matches_view_filter`, `clustering_prefix_matches`, `partition_key_matches`), because it looks error-prone and relies on dirty internals of our serialization layer. If somebody has a better suggestion on how to do the transformation, I'm all ears. Tests: unit(release), manual(playing with materialized views with custom filters) Fixes #7215 Closes #8979 * github.com:scylladb/scylla: db,view,table: drop unneeded time point parameter cql3,expr: unify get_value cql3,expr: purge mutation-based is_satisfied_by db,view: migrate key checks from the deprecated is_satisfied_by db,view: migrate checking view filter to new is_satisfied_by db,view: add a helper result builder class db,view: move make_partition_slice helper function up	2021-07-13 12:42:13 +03:00
Kamil Braun	b5a7220da4	test: raft: randomized_nemesis_test: `reconfigure` function Instead of calling `set_configuration` directly on a `raft::server`, the caller will use the higher-level `reconfigure`. Similarly to `call`, the function converts exceptions into return values (inside a `variant`) and allows passing in a timeout parameter.	2021-07-13 11:15:26 +02:00
Kamil Braun	eb4a8d48aa	test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function	2021-07-13 11:15:26 +02:00
Kamil Braun	69c59ec801	test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots When storing a snapshot `snap`, if `snap.idx > e.idx` where `e` is the last entry in the log (if any), we need to clear all previous entries so that we don't create a gap in the log. The log must remain contiguous. One case is controversial: what to do if `snap.idx == e.idx + 1`. Technically no gap would be created between the entry and the snapshot. However, if we now want to store a new entry with index `e.idx + 2`, that would create a gap between two entries which is illegal.	2021-07-13 11:15:26 +02:00
Kamil Braun	f381a97f6f	test: raft: randomized_nemesis_test: persistence: handle complex state types The usage of `template <..., State init_state>` in `persistence` permitted using only a very restricted class of types (so called "structural types"). Pass the initial state through `persistence`'s constructor instead. Also modify the member functions so the State type doesn't need to have a default constructor.	2021-07-13 11:15:25 +02:00
Kamil Braun	59e04b2b2e	test: raft: randomized_nemesis_test: `call`: handle `raft::dropped_entry` This exception happens when the leader stops being a leader in the middle of a call. Expect it to happen and return it in the result variant.	2021-07-13 11:15:25 +02:00
Kamil Braun	d97cf1a254	test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels Inside `call`, if `add_entry` failed or the operation timed out, the output channel promise would be dropped without setting a value, causing a `broken_promise` exception. Furthermore the output future would be dropped, so we get a discarded `broken_promise` future. The fix: 1. When we drop a channel without a result (inside `impure_state_machine::with_output_channel`), set an explicit exception with a dedicated type. 2. Discard the channel future in a controlled way, explicitly handling the `output_channel_dropped` exception.	2021-07-13 11:15:25 +02:00
Kamil Braun	f51ff786bd	test: raft: randomized_nemesis_test: environment: expose the network Let the user of `environment` access the `network` directly for e. g. introducing network partitions.	2021-07-13 11:15:25 +02:00
Kamil Braun	26d2f99cad	test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold The following are now passed to `environment` as parameters: - network delay, - failure detector convict threshold. Environment passes them further down when constructing the underlying objects.	2021-07-13 11:15:25 +02:00
Kamil Braun	035ae2eb1b	test: raft: randomized_nemesis_test: generalize `with_env_and_ticker` Generalize the type of the callback: use a template parameter instead of `noncopyable_function` and don't assume the return type of the callback. This allows returning a result from `with_env_and_ticker`, e.g. for performing analysis or logging the results after a part of the test that used the environment and ticker has finished.	2021-07-13 11:15:25 +02:00
Kamil Braun	25fb195bc7	test: raft: randomized_nemesis_test: network: `add_grudge`, `remove_grudge` functions Extend the interface of `network` to allow introducing and removing "grudges" which prevent the delivery of messages from one given server to another (when the time comes to deliver a message but there's a grudge, the message is dropped).	2021-07-13 11:15:25 +02:00
Kamil Braun	774ef653b1	test: raft: randomized_nemesis_test: move `ticker` to its own header	2021-07-13 11:15:25 +02:00
Kamil Braun	a45e8e0db0	test: raft: randomized_nemesis_test: ticker: take `logger` as a constructor parameter Remove the global dependency on `tlogger`.	2021-07-13 11:15:25 +02:00
Kamil Braun	21b5a6d9f7	test: raft: logical_timer: handle immediate timeout If the user calls `with_timeout` with a time point that's already been reached, we return `timed_out_error` immediately.	2021-07-13 11:15:25 +02:00
Kamil Braun	ed8e9a564a	test: raft: logical_timer: on timeout, return the original future in the exception More specifically, return a future which is equivalent to the original future (when the original future resolves, this future will contain its result). Thus we don't discard the future, the user gets it back. Let them decide what to do with it.	2021-07-13 11:15:25 +02:00
Kamil Braun	c86ff1eb7c	test: raft: logical_timer: add `schedule` member function It allows scheduling the given function to be called at the given logical time point.	2021-07-13 11:15:25 +02:00
Kamil Braun	cf0d503a92	test: raft: randomized_nemesis_test: move `logical_timer` to its own header	2021-07-13 11:15:25 +02:00
Kamil Braun	9f5eeec56a	test: raft: include the leader's ID in the `not_a_leader` exception's message	2021-07-13 11:15:25 +02:00
Piotr Sarna	a1813c9b34	db,view,table: drop unneeded time point parameter Now that restriction checking is translated to the partition-slice-style interface, checking the partition/clustering key restrictions for views can be performed without the time point parameter. The parameter is dropped from all relevant call sites.	2021-07-13 10:40:08 +02:00
Piotr Sarna	1e0880e345	cql3,expr: unify get_value Now that there's only one helper function for getting values, the call can be inlined instead.	2021-07-13 10:40:08 +02:00
Piotr Sarna	95002bb8d4	cql3,expr: purge mutation-based is_satisfied_by The interface is now unified, and all callers use the original CQL3-backed API.	2021-07-13 10:40:08 +02:00
Piotr Sarna	37fc3f4b5b	db,view: migrate key checks from the deprecated is_satisfied_by Last two users of the mutation-based is_satisfied_by function were in the partition/clustering key checks. These functions are now translated to use the original API.	2021-07-13 10:40:07 +02:00
Piotr Sarna	d6b0a8338a	db,view: migrate checking view filter to new is_satisfied_by In order to unify the interfaces, the is_satisfied_by flavor for mutations is getting deprecated. In order to be able to remove it, one of its biggest users, the matches_view_filter() function, is translated to the other variant.	2021-07-13 10:04:03 +02:00
Piotr Sarna	786db7e9a8	db,view: add a helper result builder class In order to migrate from mutation-based restriction checks, code in view.cc needs to have a way of translating results to partition-slice-based representation. A slightly simplified builder from multishard_mutation_query.cc is injected into the view code.	2021-07-13 10:04:03 +02:00
Piotr Sarna	32d87837b1	db,view: move make_partition_slice helper function up No functional changes, it will be needed for a future patch.	2021-07-13 10:04:02 +02:00
Botond Dénes	2bbfb76cc5	compaction/leveled_compaction_strategy.cc: remove unused <ranges> include Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210713063506.419658-1-bdenes@scylladb.com>	2021-07-13 10:34:22 +03:00
Benny Halevy	1db0612a06	cql3: query_processor: delete service_level_controller param The query_processor internal_state doesn't use the service_level_controller as it only needs service::client_state::for_internal_calls() Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210713055703.131099-1-bhalevy@scylladb.com>	2021-07-13 10:34:05 +03:00
Benny Halevy	90e5181192	main: defer-stop sl_controller after start and drain before storage_proxy drain_on_shutdown As per `32bcbe59ad`, the sl_controller is stopped after set_distributed_data_accessor is called. However if scylla shuts down before that happens, the sl_controller still needs to be stopped. We need to drain the service level controller before storage_proxy::drain_on_shutdown is called to prevent queries by the update loop from starting after the storage_proxy has been drained - leading to issues similar to #9009. Fixes #9014 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210713055606.130901-1-bhalevy@scylladb.com>	2021-07-13 10:33:29 +03:00
Avi Kivity	f0e2f31839	Merge "Implement validation compaction" from Botond " Currently, when sstables are suspected to be corrupt, one has a few bad choices on how to verify that they are indeed correct: * Obtain suspect sstable files and manually inspect them. This is problematic because it requires a scylla engineer to have direct access to data, which is not always simple or even possible due to privacy protection rules. * Run sstable scrub in abort mode. This is enough to confirm whether there is any corruption or not, but only in a binary manner. It is not possible to explore the full scope of the corruption, as the scrub will abort on the first corruption. * Run sstable scrub in non-abort mode. Although this allows for exploring the full scope of the corruption and it even gets rid of it, it is a very intrusive and potentially destructive process that some users might not be willing to even risk. This patchset offers an alternative: validation compaction. This is a completely non-intrusive compaction that reads all sstables in turn and validates their contents, logging any discrepancies it can find. It does not mutate their content, it doesn't even rewrite them. It is akin to a dry-run mode for sstable scrub. The reason it was not implemented as such is that the current compaction infrastructure assumes that input sstables are replaced by output sstables as part of the compaction process. Lifting this assumption seemed error-prone and risky, so instead I snatched the unused "Validation" compaction type for this purpose. This compaction type completely bypasses the regular compaction infrastructure but only at the low-level. It still integrates fully into compaction-manager. 
Fixes: #7736 Refs: https://github.com/scylladb/scylla-tools-java/issues/263 Tests: unit(dev) " * 'validation-compaction/v5' of https://github.com/denesb/scylla: test/boost/sstable_datafile_test: add test for validation compaction test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function api: storage_service: expose validation compaction sstables/compaction_manager: add perform_sstable_validation() sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME sstables/compaction_manager: add maintenance scheduling group sstables/compaction_manager: drop _scheduling_group field sstables/compaction_manager: run_custom_job(): replace parameter name with compaction type sstables/compaction_manager: run_custom_job(): keep job function alive sstables/compaction_descriptor: compaction_options: add validation compaction type sstables/compaction: compaction_options::type(): add static assert for size of index_to_type sstables/compaction: implement validation compaction type sstables/compaction: extract compaction info creation into static method sstables/compaction: extract sstable list formatting to a class sstables/compaction: scrub_compaction: extract reporting code into static methods position_in_paritition{_view}: add has_key() mutation_fragment_stream_validator: add schema() accessor	2021-07-13 10:29:40 +03:00
Tomasz Grabiec	e947fac74c	database: Fix cache metrics not being registered Introduced in `6a6403d`. The default constructor with dummy_app_stats is also used by production code. Fixes #9012 Message-Id: <20210712221447.71902-1-tgrabiec@scylladb.com>	2021-07-13 07:50:44 +03:00
Avi Kivity	058afbcee8	build: re-enable -Wmisleading-indentation This can catch mismatches between visual indication about control flow and what the compiler actually does. Looks like boost cleaned up its indentation since it was disabled in `7f38634080` ("dist/debian: Switch to g++-7/boost-1.63 on Ubuntu 14.04/16.04"). It's unlikely to pop back since modern compilers enable it by default. Closes #9015	2021-07-12 22:29:19 +03:00
Avi Kivity	8fb4fe2f24	Merge "reader_concurrency_semaphore: relax on destroy stop checks" from Botond " Currently we `assert(_stopped)` in the destructor, but this is too harsh, especially on freshly created semaphore instances that weren't even used yet. This basically mandates semaphores to be initialized at the end of the constructor body, which is very cumbersome. Further to that, this series relaxes the checks on destroying an unstopped previously (but not currently) used semaphore. As destroying such a semaphore without stop is risky an error is still logged. Tests: unit(dev) " * 'reader-concurrency-semaphore-relax-stop-check/v1' of https://github.com/denesb/scylla: reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore reader_concurrency_semaphore: allow destroying without stop() when not used yet reader_concurrency_semaphore: add permit-stats	2021-07-12 20:07:01 +02:00
Nadav Har'El	f540a69a82	Update tools/java submodule * tools/java 5013321823...79a441972d (2): > Add Zstd compressor > Settings Schema: fix typo in settings printing Adding the Zstd compressor fixes #8887.	2021-07-12 20:07:00 +02:00
Avi Kivity	4d48e1e9e1	build: avoid sanitize/coverage builds in multi-mode targets The default target (i.e. what gets executed under "ninja") excludes sanitize and coverage modes (since they're useful in special cases only), but the other multi-mode targets such as "ninja build" do not. This means that two extra modes are built. Make things consistent by always using default_modes (which default to release,dev,debug). This can be overridden using the --mode switch to configure.py. Closes #8775	2021-07-12 20:07:00 +02:00
Botond Dénes	f8004c652b	reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore Further relax the conditions under which we abort on destroying an unstopped semaphore. We already allow destroying completely unused semaphores, this patch further relaxes this to allow destroying formerly used but presently not used semaphores without stopping. We still call `on_internal_error_noexcept()` even if destroying the semaphore is safe, because without calling `stop()`, destroying the semaphore depends on luck, which we shouldn't rely on.	2021-07-12 15:53:00 +03:00
Botond Dénes	750b20fd85	reader_concurrency_semaphore: allow destroying without stop() when not used yet To make it easier to construct objects with semaphore members. When the construction of such an object fails, it can now just destroy its semaphore member as usual, without having to employ tricks to make sure it is stopped beforehand.	2021-07-12 15:53:00 +03:00
Botond Dénes	03959a332b	reader_concurrency_semaphore: add permit-stats Which stores permit related stats. For now only total number of permits is maintained which is useful to determine whether the semaphore was used already or not.	2021-07-12 15:53:00 +03:00
Nadav Har'El	3fda13e20e	cql-pytest: fix sporadic failure in over-zealous TTL test We have been seeing rare failures of the cql-pytest (translated from Cassandra's unit tests) for testing TTL in secondary indexes: cassandra_tests/validation/entities/secondary_index_test.py::testIndexOnRegularColumnInsertExpiringColumn The problem is that the test writes an item with 1 second TTL, and then sleeps exactly 1.0 seconds, and expects to see the item disappear by that time. Which doesn't always happen: The problem with that assumption stems from Scylla's TTL clock ("gc_clock") being based on Seastar's lowres clock. lowres_clock only has a 10ms "granularity": The time Scylla sees when deciding whether an item expires may be up to 10ms in the past - the arbitrary point when the lowres timer happened to last run. In rare overload cases, the inaccuracy may be even greater than 10ms (if the timer got delayed by other things running). So when Scylla is asked to expire an item in 1 second - we cannot be sure it will be expired in exactly 1 second or less - the expiration can be also around 10ms later. So in this patch we change the test to sleep with more than enough margin - 1.1 seconds (i.e., 100ms more than 1 second). By that time we're sure the item must have expired. Before this patch, I saw the test failing once every few hundred runs, after this patch I ran it 2,000 times without a single failure. Fixes #9008 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210712100655.953969-1-nyh@scylladb.com>	2021-07-12 13:48:21 +03:00
Yaron Kaikov	aa7c40ba50	dist: build docker based on ubuntu 20.04 OS Today our docker image is based on CentOS 7. Since CentOS will be EOL in 2024 and no longer has a stable release stream, let's move our docker image to be based on ubuntu 20.04. Based on the work done in https://github.com/scylladb/scylla/pull/8730, let's build our docker image based on local packages using buildah Closes #8849	2021-07-12 13:32:03 +03:00
Piotr Jastrzebski	c010cefc4d	cdc: Handle compact storage tables correctly When a table with compact storage has no regular column (only primary key columns), an artificial column of type empty is added. Such a column type can't be returned via CQL so CDC Log shouldn't contain a column that reflects this artificial column. This patch does two things: 1. Make sure that CDC Log schema does not contain columns that reflect the artificial column from a base table. 2. When composing mutation to CDC Log, omit the artificial column. Fixes #8410 Test: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8988	2021-07-12 12:17:35 +03:00
Nadav Har'El	2cc8c40c07	Merge 'Fix some issues found by gcc 11' from Avi Kivity This series fixes some issues that gcc 11 complains about. I believe all are correct errors from the standard's view. Clang accepts the changed code. Note that this is not enough to build with gcc 11, but it's a start. Closes #9007 * github.com:scylladb/scylla: utils: compact-radix-tree: detemplate array_of<> utils: compact-radix-tree: don't redefine type as member raft: avoid changing meaning of a symbol inside a class cql3: lists: catch polymorphic exceptions by reference	2021-07-12 11:17:57 +03:00
Avi Kivity	4433178ccb	utils: exceptions: convert sprint() to format() sprint() is type-unsafe, and we are converging on format(). Convert exceptions.hh to format(). Closes #9006	2021-07-12 11:17:57 +03:00
Botond Dénes	5c39a2921e	test/boost/sstable_datafile_test: add test for validation compaction Add two new unit tests, one which covers the whole stack, and one which stresses the validation part.	2021-07-12 10:25:15 +03:00
Botond Dénes	8296759cce	test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function So the two tests having this almost identical code can share it, and so that it can be used by new tests.	2021-07-12 10:25:15 +03:00
Botond Dénes	b0ef57c833	api: storage_service: expose validation compaction	2021-07-12 10:25:15 +03:00
Botond Dénes	47283ed151	sstables/compaction_manager: add perform_sstable_validation() Exposing validation compaction on the compaction manager level. To keep things simple, validation compaction uses the custom job infrastructure.	2021-07-12 10:25:15 +03:00
Botond Dénes	4c05e5f966	sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME Run this compaction in the maintenance group which is now available, resolving the FIXME asking for this.	2021-07-12 10:25:15 +03:00
Botond Dénes	c8f8e9232c	sstables/compaction_manager: add maintenance scheduling group rewrite_sstables() wants to be run in the maintenance group and soon we will add another compaction type which also wants to be run in the said group. To enable this propagate the maintenance scheduling group (both CPU and IO) to the compaction manager.	2021-07-12 10:25:15 +03:00
Botond Dénes	12b8b650b7	sstables/compaction_manager: drop _scheduling_group field Use the equivalent _compaction_controller.sg() instead.	2021-07-12 10:25:15 +03:00
Botond Dénes	75bad71f0e	sstables/compaction_manager: run_custom_job(): replace parameter name with compaction type All callers use it to do operations that are closely associated with one of the standard compaction types, so no reason to pass in a custom string instead of the compaction type enum.	2021-07-12 10:25:15 +03:00
Botond Dénes	ddf2700b2e	sstables/compaction_manager: run_custom_job(): keep job function alive For the duration of the job, allowing coroutine lambdas to be used as well.	2021-07-12 10:25:15 +03:00
Botond Dénes	891921377d	sstables/compaction_descriptor: compaction_options: add validation compaction type This enables starting validation compaction via `compact_sstables()`.	2021-07-12 10:25:15 +03:00
Botond Dénes	349a3ed4e8	sstables/compaction: compaction_options::type(): add static assert for size of index_to_type To remind those adding a member to the variant, to also add the corresponding entry here.	2021-07-12 10:25:15 +03:00
Botond Dénes	a57caf5229	sstables/compaction: implement validation compaction type Validation just reads all the passed-in sstables and runs the mutation stream through a mutation fragment stream validator, logging all errors found, and finally also logging whether all the sstables are valid or not. Validation is not really a compaction as it doesn't write any output. As such it bypasses most of the usual compaction machinery, so the latter doesn't have to be adapted to this outlier. This patch only adds the implementation, but it still cannot be started via `compact_sstables()`, that will be implemented by the next patches.	2021-07-12 10:25:15 +03:00
Botond Dénes	cae8624edb	sstables/compaction: extract compaction info creation into static method To make this snippet reusable by the soon-to-be-added validation compaction as well.	2021-07-12 07:53:11 +03:00
Botond Dénes	3b5ae0b894	sstables/compaction: extract sstable list formatting to a class To make it reusable both inside compaction class itself (between compaction start and end messages) and for outside code as well.	2021-07-12 07:11:29 +03:00
Botond Dénes	35f49a5baa	sstables/compaction: scrub_compaction: extract reporting code into static methods All the error messages reporting about invalid bits found in the stream. This allows reusing these messages in the soon-to-be-added validation compaction. In the process, the error messages are made more comprehensive and more uniform as well.	2021-07-12 07:11:29 +03:00
Botond Dénes	5e77f07263	position_in_paritition{_view}: add has_key()	2021-07-12 07:11:29 +03:00
Botond Dénes	7cf5b43bbc	mutation_fragment_stream_validator: add schema() accessor	2021-07-12 07:11:29 +03:00
Avi Kivity	29c9570556	utils: compact-radix-tree: detemplate array_of<> The radix tree template defines a nested class template array_of; both a generic template and a fully specialized version. However, gcc (I believe correctly) rejects the fully specialized template that happens to be a member of another class template. As it happens, we don't really need a template here at all. Define a non-template class for each of the cases we need, and use std::conditional_t to select the type we need.	2021-07-11 18:16:21 +03:00
Avi Kivity	f576ecb7cc	utils: compact-radix-tree: don't redefine type as member The `direct_layout` and `indirect_layout` template classes accept a template parameter named `Layout` of type `layout`, and re-export `Layout` as a static data member named `layout`. This redefinition of `layout` is disliked by gcc. Fix by renaming the static data member to `this_layout` and adjust all references.	2021-07-11 18:16:21 +03:00
Avi Kivity	332b5c395f	raft: avoid changing meaning of a symbol inside a class The construct `struct q { a a; };` changes the meaning of `a` from a type to a data member. gcc dislikes it and I agree. Fully qualify the type name to avoid an error.	2021-07-11 18:16:21 +03:00
Avi Kivity	cb8ef1489b	cql3: lists: catch polymorphic exceptions by reference gcc 11 notes that catching polymorphic exceptions by value is a bad idea; the resulting copy can slice the exception object. Fix by catching by reference.	2021-07-11 17:34:43 +03:00
Avi Kivity	222ef17305	build, treewide: enable -Wredundant-move Returning a function parameter by name is already treated as a move (implicit move) and does not require a std::move(). Enable -Wredundant-move to warn us that the move is unneeded, and gain slightly more readable code. A few violations are trivially adjusted. Closes #9004	2021-07-11 12:53:02 +03:00
Dejan Mircevski	7119730f2d	cql3: Don't look for indexed column in CK prefix When creating an index-table query, we form its clustering-key restrictions by picking the right restrictions from the WHERE clause. But we skip the indexed column, which isn't in the index-table clustering key. This is, however, both incorrect and unnecessary: It is incorrect because we compare the column IDs from different schemas (indexed table vs. base table). We should instead be comparing column names. It is unnecessary because this code is only executed when the whole partition key plus a clustering prefix is specified in the WHERE clause. In such cases, the index cannot possibly be on a member of the clustering prefix, as such a query would be satisfied out of the base table. Therefore, it is redundant to check for the indexed column among the CK prefix elements. A careful reader will note that this check was first introduced to fix the issue #7888 in commit `0bd201d`. But it now seems to me that that fix was misguided. The root problem was the old code miscalculating the clustering prefix by including too many columns in it; it should have stopped before reaching the indexed column. The new code, introduced by commit `845e36e76`, calculates the clustering prefix correctly, never reaching the indexed column. (Details, for the curious: the old code invoked clustering_key_restrictions::prefix_size(), which is buggy -- it doesn't check the restriction operator. It will, for instance, calculate the prefix of `c1=0 AND c2 CONTAINS 0 AND c3=0` as 3, because it restricts c1, c2, and c3. But the correct prefix is clearly 1, because c2 is not restricted by equality.) Tests: unit (dev, debug) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8993	2021-07-08 21:39:38 +03:00
Avi Kivity	9059514335	build, treewide: enable -Wpessimizing-move warning This warning prevents using std::move() where it can hurt - on an unnamed temporary or a named automatic variable being returned from a function. In both cases the value could be constructed directly in its final destination, but std::move() prevents it. Fix the handful of cases (all trivial), and enable the warning. Closes #8992	2021-07-08 17:52:34 +03:00
Avi Kivity	fe4002c6c4	Update seastar submodule * seastar eaa00e761f...8ed9771ae9 (5): > *: drop prometheus protobuf support > reactor: Fix calculations of bandwidth in legacy mode > gate: add gate::holder > Revert "gate: add gate::holder", does not build. > gate: add gate::holder	2021-07-08 17:42:39 +03:00
Avi Kivity	f756f34392	Merge "Add scylla-bench datasets to perf_fast_forward" from Tomasz " After this series one can use perf_fast_forward to generate the data set. It takes a lot less time this way than to use scylla-bench. " * 'perf-fast-forward-scylla-bench-dataset' of github.com:tgrabiec/scylla: tests: perf_fast_forward: Use data_source::make_ck() tests: perf_fast_forward: Move declaration of clustered_ds up tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default tests: perf_fast_forward: Add data sets which conform to scylla-bench schema	2021-07-08 17:33:30 +03:00
Nadav Har'El	d0546a9bb5	cql-pytest: improve README This patch adds to cql-pytest/README.md a paragraph on where run / run-cassandra expect to find Scylla or Cassandra, and how to override that choice. Also make a couple of trivial formatting changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708142730.813660-1-nyh@scylladb.com>	2021-07-08 17:29:20 +03:00
Avi Kivity	4f1e21ceac	Merge "reader_concurrency_semaphore: get rid of global semaphores" from Botond " When obtaining a valid permit was made mandatory, code which now had to create reader permits but didn't have a semaphore handy suddenly found itself in a difficult situation. Many places and most prominently tests solved the problem by creating a thread-local semaphore to source permits from. This was fine at the time but as usual, globals came back to haunt us when `reader_concurrency_semaphore::stop()` was introduced, as these global semaphores had no easy way to be stopped before being destroyed. This patch-set cleans up this wart, by getting rid of all global semaphores, replacing them with appropriately scoped local semaphores, that are stopped after being used. With that, the FIXME in `~reader_concurrency_semaphore()` can be resolved and we can finally `assert()` that the semaphore was stopped before being destroyed. This series is another preparatory one for the series which moves the semaphore in front of the cache.
tests: unit(dev) " * 'reader-concurrency-semaphore-mandatory-stop/v2' of https://github.com/denesb/scylla: (26 commits) reader_concurrency_semaphore: assert(_stopped) in the destructor test/lib: remove now unused reader_permit.{hh,cc} test/boost: migrate off the global test reader semaphore test/manual: migrate off the global test reader semaphore test/unit: migrate off the global test reader semaphore test/perf: migrate off the global test reader semaphore test/perf: perf.hh: add reader_concurrency_semaphore_wrapper test/lib: migrate off the global test reader semaphore test/lib/simple_schema: migrate off the global test reader semaphore test/lib/sstable_utils: migrate off the global test reader semaphore test/lib/test_services: migrate off the global test reader semaphore test/lib/sstable_test_env: add reader_concurrency_semaphore member test/lib/cql_test_env: add make_reader_permit() test/lib: add reader_concurrency_semaphore.hh test/boost/sstable_test: migrate row counting tests to seastar thread test/boost/sstable_test: test_using_reusable_sst(): pass env to func test/lib/reader_lifecycle_policy: add permit parameter to factory function test/boost/mutation_reader_test: share permit between readers in a read memtable: migrate off the global reader concurrency semaphore mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore ...	2021-07-08 17:28:13 +03:00
Botond Dénes	42bd5c980f	reader_concurrency_semaphore: assert(_stopped) in the destructor Now that there are no more global semaphores which are impossible to stop properly we can resolve the related FIXME and arm the assert in the semaphore destructor. We can also remove all the other cleanup code from the destructor as they are taken care of by stop(), which we now assert to have been run.	2021-07-08 16:53:38 +03:00
Botond Dénes	6b941c4d34	test/lib: remove now unused reader_permit.{hh,cc} Finally getting rid of the global test reader concurrency semaphore.	2021-07-08 16:53:38 +03:00
Botond Dénes	2d2b9e7b36	test/boost: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	0bf07cde7b	test/manual: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	18e0c40c5d	test/unit: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	37a1e506b1	test/perf: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	2454811dd6	test/perf: perf.hh: add reader_concurrency_semaphore_wrapper A convenience, self-closing wrapper for those perf tests that have no way to stop the semaphore and wait for it too.	2021-07-08 16:53:38 +03:00
Nadav Har'El	e22a52e69c	cql-pytest: fix tests on Cassandra 3 After commit `76227fa` ("cql-pytest: use NetworkTopologyStrategy, not SimpleStrategy"), the cql-pytest tests now use NetworkTopologyStrategy instead of SimpleStrategy in the test keyspaces. The tests continued to use the "replication_factor" option. Support for this option is relatively recent, and was only added to Cassandra in the 4.0 release series (see https://issues.apache.org/jira/browse/CASSANDRA-14303). So users who happen to have Cassandra 3 installed and want to run a cql-pytest against it will see the test failing when it can't create a keyspace. This patch trivially fixes the problem by using the name of the current DC (automatically determined) instead of the word 'replication_factor'. Almost all tests are fixed by a single fix to the test_keyspace fixture which creates one keyspace used by most tests. Additional changes were needed in test_keyspace.py, for tests which explicitly create keyspaces. I tested the result on Cassandra 3.11.10, Cassandra 4 (git master) and Scylla. Fixes #8990 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708123428.811184-1-nyh@scylladb.com>	2021-07-08 15:35:21 +02:00
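The difference between the two replication option spellings can be sketched as follows. This is a hedged illustration only: `create_keyspace_cql` is a hypothetical helper, not the fixture from the patch, and the data center name `datacenter1` is just an example value; how the real tests determine the current DC is not shown here.

```python
def create_keyspace_cql(keyspace, dc_name, rf=1):
    """Build a CREATE KEYSPACE statement keyed by data center name.

    Hypothetical helper: naming the DC explicitly avoids the
    'replication_factor' shorthand, which the commit message says
    Cassandra only accepts from the 4.0 series onward.
    """
    return (
        f"CREATE KEYSPACE {keyspace} WITH REPLICATION = "
        f"{{ 'class' : 'NetworkTopologyStrategy', '{dc_name}' : {rf} }}"
    )


stmt = create_keyspace_cql("test_ks", "datacenter1")
print(stmt)
```

The resulting statement names the data center as the option key, so it does not rely on the newer shorthand.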
Nadav Har'El	eb11ce046c	cql-pytest: add reproducer for concurrent DROP KEYSPACE bug We know that today in Scylla concurrent schema changes done on different coordinators are not safe - and we plan to address this problem with Raft. However, the test in this patch - reproducing issue #8968 - demonstrates that even on a single node concurrent schema changes are not safe: The test involves one thread which constantly creates a keyspace and then a table in it - and a second thread which constantly deletes this keyspace. After doing this for a while, the schema reaches an inconsistent state: The keyspace is at a state of limbo where it cannot be dropped (dropping it succeeds, but doesn't actually drop it), and a new keyspace cannot be created under the same name. Note that to reproduce this bug, it was important that the test create both a keyspace and a table. Were the test to just create an empty keyspace, without a table in it, the bug would not be reproduced. Refs #8968. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210704121049.662169-1-nyh@scylladb.com>	2021-07-08 15:35:03 +02:00
Botond Dénes	0e78399051	test/lib: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	5fff314739	test/lib/simple_schema: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	d520655730	test/lib/sstable_utils: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	3679418e62	test/lib/test_services: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	0acc4d63da	test/lib/sstable_test_env: add reader_concurrency_semaphore member To enable tests using the test env to conveniently create permits for themselves, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	7174d1beee	test/lib/cql_test_env: add make_reader_permit() A convenience method, allowing tests using the cql test env to conveniently create a permit, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	b739525fb6	test/lib: add reader_concurrency_semaphore.hh Supplying a convenience semaphore wrapper, which stops the contained semaphore when destroyed. It also provides a more convenient `make_permit()`. This class is intended to make the migration to local semaphores less painful.	2021-07-08 15:28:36 +03:00
Benny Halevy	fa5d70da32	storage_proxy: abstract_read_resolver: handle semaphore_timed_out error semaphore_timed_out errors should be ignored, similar to rpc::timeout_error or seastar::timed_out_error, so that they are eventually converted to `read_timeout_exception` via the data/digest read resolver on_timeout() method. Otherwise, the semaphore timeout is mistranslated to read_failure_exception, via on_error(). Note that originally the intention was to change the exception thrown by the reader_concurrency_semaphore expiry_handler, but there are already several places in the code that catch and handle the semaphore_timed_out exception that would need to be changed, increasing the risk in this change. Fixes #8958 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210708083252.1934651-2-bhalevy@scylladb.com>	2021-07-08 15:23:30 +03:00
Benny Halevy	023d103fee	utils: exceptions: is_timeout_exception: add timed_out_error Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210708083252.1934651-1-bhalevy@scylladb.com>	2021-07-08 15:23:29 +03:00
Nadav Har'El	814c4ad4ce	cql-pytest: fix run-cassandra for older versions of Cassandra In older versions of Cassandra (such as 3.11.10 which I tried), the CQL server is not turned on by default, unless the configuration file explicitly has "start_native_transport: true" - without it only the Thrift server is started. So fix the cql-pytest/run-cassandra to pass this option. It also works correctly in Cassandra 4. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708113423.804980-1-nyh@scylladb.com>	2021-07-08 14:59:09 +03:00
Avi Kivity	7d214800d0	Merge 'Generate view updates in smaller parts' from Piotr Sarna In order to avoid large allocations and too large mutations generated from large view updates, granularity of the process is broken down from per-partition to smaller chunks. The view update builder now produces partial updates, no more than 100 view rows at a time. The series was tested manually with a particular scenario in mind - deleting a large base partition, which results in creating a view update per each deleted row - which, with sufficiently large partitions, can reach millions. Before the series, Scylla experienced an out-of-memory condition after the view update generation mechanism tried to load too much data into a contiguous buffer. Multiple large allocation warnings and reactor stalls were observed as well. After the series, the operation is still rather slow, but does not induce reactor stalls nor allocator problems. A reduced version of the above test is added as a unit test - it does not check for huge partitions, but instead uses a number just large enough to cause the update generation process to be split into multiple chunks. Fixes #8852 Closes #8906 * github.com:scylladb/scylla: cql-pytest: add a test case for base range deletion cql-pytest: add a test case for base partition deletion table: elaborate on why exceptions are ignored for view updates view: generate view updates in smaller parts table: coroutinize generating view updates db,view: move view_update_builder to the header	2021-07-08 12:57:05 +03:00
Piotr Sarna	bc0038913c	cql-pytest: add a test case for base range deletion The test case checks that deleting a base table clustering range works fine. This operation is potentially heavy, as it involves generating a view update for every row. With large enough ranges, the number can reach millions and beyond.	2021-07-08 11:43:08 +02:00
Piotr Sarna	ef47b4565c	cql-pytest: add a test case for base partition deletion The test case checks that deleting a whole base table partition works fine. This operation is potentially heavy, as it involves generating a view update for every row. With large enough partitions, the number can reach millions and beyond.	2021-07-08 11:42:54 +02:00
Botond Dénes	b9a5fd57bf	test/boost/sstable_test: migrate row counting tests to seastar thread To facilitate further patching.	2021-07-08 12:38:21 +03:00
Botond Dénes	fb310ec6e7	test/boost/sstable_test: test_using_reusable_sst(): pass env to func To facilitate further patching.	2021-07-08 12:38:19 +03:00
Botond Dénes	46d21e842d	test/lib/reader_lifecycle_policy: add permit parameter to factory function The factory method doesn't match the signature of `reader_lifecycle_policy::make_reader()`, notably the permit is missing. Add it as it is important that the wrapping evictable reader and underlying reader share the permits.	2021-07-08 12:31:36 +03:00
Botond Dénes	2a45d643b6	test/boost/mutation_reader_test: share permit between readers in a read Permits were designed such that there is one permit per read, being shared by all readers in that read. Make sure readers created by tests adhere to this.	2021-07-08 12:31:36 +03:00
Botond Dénes	0f36e5c498	memtable: migrate off the global reader concurrency semaphore Require the caller of `create_flush_reader()` to pass a permit instead.	2021-07-08 12:31:36 +03:00
Botond Dénes	7a4381b491	mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore Use a local one instead, and stop it when the writer is destroyed.	2021-07-08 12:31:36 +03:00
Botond Dénes	17a0e22cb1	sstables: mx/writer: migrate off the global reader concurrency_semaphore And use a local one instead, stopping it when the writer is destroyed.	2021-07-08 12:31:36 +03:00
Botond Dénes	f1c1e05a05	sstables: stop semaphores	2021-07-08 12:31:36 +03:00
Botond Dénes	c51892f02e	sstables: sstable::has_partition_key(): convert to coroutine	2021-07-08 12:31:36 +03:00
Botond Dénes	c0a8068c16	sstables: generate_summary(): fix indentation	2021-07-08 12:31:36 +03:00
Botond Dénes	fec137f3f6	sstables: generate_summary(): make it a coroutine Indentation is left broken.	2021-07-08 12:31:36 +03:00
Botond Dénes	c4e71fb9b8	reader_concurrency_semaphore: remove default name parameter Naming the concurrency semaphore is currently optional, unnamed semaphores defaulting to "Unnamed semaphore". Although the most important semaphores are named, many still aren't, which makes for a poor debugging experience when one of these times out. To prevent this, remove the name parameter defaults from those constructors that have it and require a unique name to be passed in. Also update all sites creating a semaphore and make sure they use a unique name.	2021-07-08 12:31:36 +03:00
Piotr Sarna	6a461d00c6	table: elaborate on why exceptions are ignored for view updates The generate_and_propagate_view_updates() function explicitly ignores exceptions reported from the underlying view update propagation layer. This decision is now explained in the comment.	2021-07-08 11:21:55 +02:00
Piotr Sarna	bf0777e97a	view: generate view updates in smaller parts In order to avoid large allocations and too large mutations generated from large view updates, granularity of the process is broken down from per-partition to smaller chunks. The view update builder now produces partial updates, no more than 100 view rows at a time.	2021-07-08 11:17:27 +02:00
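The chunking idea in the commit above can be sketched in a few lines. This is an illustration in Python, not the view update builder itself: `build_partial_updates` and `MAX_ROWS_PER_UPDATE` are hypothetical names, and the only detail taken from the commit message is the limit of 100 view rows per partial update.

```python
from itertools import islice

# Limit quoted in the commit message: "no more than 100 view rows at a time".
MAX_ROWS_PER_UPDATE = 100


def build_partial_updates(view_rows, batch_size=MAX_ROWS_PER_UPDATE):
    """Yield view rows in batches of at most batch_size.

    Hypothetical sketch: each batch is materialized on its own, so the
    rows of a large partition never have to sit in one contiguous buffer.
    """
    rows = iter(view_rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# Deleting a base partition with 250 rows produces three partial updates.
sizes = [len(batch) for batch in build_partial_updates(range(250))]
print(sizes)  # [100, 100, 50]
```

Because the function is a generator, only one batch is held in memory at a time, which is the property the commit is after when it replaces per-partition granularity with smaller chunks.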
Piotr Sarna	1000d52cfa	table: coroutinize generating view updates ... which will make the incoming changes easier to review.	2021-07-08 11:17:27 +02:00
Piotr Sarna	679dc4d824	db,view: move view_update_builder to the header The builder is going to be used directly by the callers, which requires making its definition public. No semantic changes were intended.	2021-07-08 11:17:27 +02:00
Raphael S. Carvalho	1924e8d2b6	treewide: Move compaction code into a new top-level compaction dir Since compaction is layered on top of sstables, let's move all compaction code into a new top-level directory. This change will give me extra motivation to remove all layer violations, like sstable calling compaction-specific code, and compaction entanglement with other components like table and storage service. Next steps: - remove all layer violations - move compaction code in sstables namespace into a new one for compaction. - move compaction unit tests into its own file Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>	2021-07-07 23:21:51 +03:00
Tomasz Grabiec	33cba08735	tests: perf_fast_forward: Use data_source::make_ck() Data sources differ in clustering key type. Make sure to use the right data_value instance to produce correct keys.	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	fa481e92c1	tests: perf_fast_forward: Move declaration of clustered_ds up	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	407e42f5d8	tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default This dataset exists for convenience, to be able to run scylla-bench against the data set generated by perf_fast_forward. It doesn't increase coverage. So do not include it by default to not waste resources on it.	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	d7250a12fd	tests: perf_fast_forward: Add data sets which conform to scylla-bench schema Useful for fast generation of test data.	2021-07-07 20:27:44 +02:00
Avi Kivity	5571ef0d6d	compression: define 'class' attribute for compression and deprecate 'sstable_compression' Cassandra 3.0 deprecated the 'sstable_compression' attribute and added 'class' as a replacement. Follow by supporting both. The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED to detect all uses and prevent future misuse. To prevent old-version nodes from seeing the new name, the compression_parameters class preserves the key name when it is constructed from an options map, and emits the same key name when asked to generate an options map. Existing unit tests are modified to use the new name, and a test is added to ensure the old name is still supported. Fixes #8948. Closes #8949	2021-07-07 19:15:20 +02:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads Populated with: perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \ -c1 -m1G --populate --value-size=1024 --rows=10000000 Single partition, 9G data file, 4MB index file Test execution: build/release/scylla -c1 -m4G scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \ -clustering-row-count 10000000 -duration 60m TL;DR: after: 2x throughput, 0.5 median latency Before (`c1daf2bb24`): Results Time (avg): 5m21.033180213s Total ops: 966951 Total rows: 966951 Operations/s: 3011.997048812112 Rows/s: 3011.997048812112 Latency: max: 74.055679ms 99.9th: 63.569919ms 99th: 41.320447ms 95th: 38.076415ms 90th: 37.158911ms median: 34.537471ms mean: 33.195994ms After: Results Time (avg): 5m14.706669345s Total ops: 2042831 Total rows: 2042831 Operations/s: 6491.22243800942 Rows/s: 6491.22243800942 Latency: max: 60.096511ms 99.9th: 35.520511ms 99th: 27.000831ms 95th: 23.986175ms 90th: 21.659647ms median: 15.040511ms mean: 15.402076ms 2) scylla-bench small partitions I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory. 
scylla -c1 -m4G scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \ -clustering-row-count 1 -duration 60m -no-lower-bound Before: Time (avg): 51.072411913s Total ops: 3165885 Total rows: 3165885 Operations/s: 61988.164024260645 Rows/s: 61988.164024260645 Latency: max: 34.045951ms 99.9th: 25.985023ms 99th: 23.298047ms 95th: 19.070975ms 90th: 17.530879ms median: 3.899391ms mean: 6.450616ms After: Time (avg): 50.232410679s Total ops: 3778863 Total rows: 3778863 Operations/s: 75227.58014424688 Rows/s: 75227.58014424688 Latency: max: 37.027839ms 99.9th: 24.805375ms 99th: 18.219007ms 95th: 14.090239ms 90th: 12.124159ms median: 4.030463ms mean: 5.315111ms The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower. 3) perf_fast_forward --run-tests=large-partition-skips Caching is not used here, included to show there are no regressions for the cold cache case. 
TL;DR: No significant change perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G Config: rows: 10000000, value size: 2000 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5% 1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1% 1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5% 1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6% 1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3% 1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1% 1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3% 1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6% 1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8% 64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2% 64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5% 64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8% 64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8% 64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7% 64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7% 64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8% 64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3% After: read skip time (s) 
iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4% 1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0% 1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0% 1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2% 1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8% 1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0% 1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9% 1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5% 1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3% 64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9% 64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1% 64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3% 64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2% 64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1% 64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2% 64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8% 64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2% 4) perf_fast_forward --run-tests=large-partition-slicing Caching enabled, each line shows the median run from many iterations TL;DR: We can observe reduction in IO which translates to reduction in execution time, 
especially for slicing in the middle of partition. perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases Config: rows: 10000000, value size: 2000 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0% 0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5% 0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6% 0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4% 5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2% 5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9% 5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8% 5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0% After: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2% 0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1% 0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4% 0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7% 5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8% 5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6% 5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 
19.6% 5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4% Future work: - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio. - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size. - Disable cache population for "bypass cache" reads - Add a switch to disable sstable index caching, per-node, maybe per-table - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot. - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest. In V2: - Fixed perf_fast_forward regression in the number of IOs used to read partition index page The reader uses 32K reads, which were split by page cache into 4K reads Fix by propagating IO size hints to page cache and using single IO to populate it. New patch: "cached_file: Issue single I/O for the whole read range on miss" - Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector. - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks - Simplify region_impl::free_buf() according to Avi's suggestions - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind - Workaround sigsegv which was most likely due to coroutine miscompilation. 
Worked around by manipulating local object scope. - Wire up system/drop_sstable_caches RESTful API - Fix use-after-move on permit for the old scanning ka/la index reader - Fixed more cases of double open_data() in tests leading to assert failure - Adjusted cached_file class doc to account for changes in behavior. - Rebased Fixes #7079. Refs #363. " * tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits) api: Drop sstable index caches on system/drop_sstable_caches cached_file: Issue single I/O for the whole read range on miss row_cache: cache_tracker: Do not register metrics when constructed for tests sstables, cached_file: Evict cache gently when sstable is destroyed sstables: Hide partition_index_cache implementation away from sstables.hh sstables: Drop shared_index_lists alias sstables: Destroy partition index cache gently sstables: Cache partition index pages in LSA and link to LRU utils: Introduce lsa::weak_ptr<> sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache sstables, cached_file: Avoid copying buffers from cache when parsing promoted index cached_file: Introduce get_page_units() sstables: read: Document that primitive_consumer::read_32() is alloc-free sstables: read: Count partition index page evictions sstables: Drop the _use_binary_search flag from index entries sstables: index_reader: Keep index objects under LSA lsa: chunked_managed_vector: Adapt more to managed_vector utils: lsa: chunked_managed_vector: Make LSA-aware test: chunked_managed_vector_test: Make exception_safe_class standard layout lsa: Copy chunked_vector to chunked_managed_vector ...	2021-07-07 18:17:10 +03:00
Takuya ASADA	def81807aa	scylla-fstrim.timer: drop BindsTo=scylla-server.service To avoid restarting scylla-server.service unexpectedly, drop BindsTo= from scylla-fstrim.timer. Fixes #8921 Closes #8973	2021-07-07 17:36:24 +03:00
Dejan Mircevski	7d6ef0de8d	cql3: Drop more dead code After `845e36e76` "cql3: Use expr for global-index partition slice", there is actually more dead code than was initially dropped. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8981	2021-07-07 13:59:58 +02:00
Calle Wilund	ce45ffdffb	commitlog: Use defensive copies of segment list in iterations Fixes #8952 In `5ebf5835b0` we added a segment prune after flushing, to deal with deadlocks in shutdown. This means that calls that issue sync/flush-like ops "for-all", need to operate on a defensive copy of the list. Closes #8980	2021-07-07 13:30:37 +02:00
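The hazard described in this commit can be sketched in Python, assuming a toy segment manager whose flush prunes the shared segment list as a side effect. The names `SegmentManager`, `flush`, and `sync_all` are hypothetical and are not taken from the commitlog code.

```python
class SegmentManager:
    """Toy model: flushing a segment prunes it from the shared list."""

    def __init__(self, names):
        self.segments = list(names)
        self.flushed = []

    def flush(self, name):
        # Side effect modelled on the commit message: a flush removes the
        # segment from the very list a "for-all" caller may be iterating.
        self.flushed.append(name)
        self.segments.remove(name)

    def sync_all(self):
        # Iterate over a defensive copy so that pruning inside flush()
        # cannot skip elements or disturb the iteration.
        for name in list(self.segments):
            self.flush(name)


mgr = SegmentManager(["seg-1", "seg-2", "seg-3"])
mgr.sync_all()
print(mgr.flushed)
```

Iterating over `self.segments` directly would skip every other element here, because removing an item shifts the remaining ones under the iterator.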
Pavel Emelyanov	63a2fed585	hasher: More picky noexcept marking of feed_hash() Commit `5adb8e555c` marked the ::feed_hash() and a visitor lambda of digester::feed_hash() as noexcept. This was quite reckless as the appending_hash<>::operator()s called by ::feed_hash() are not all marked noexcept. In particular, the appending_hash<row>() is not such and seems to throw. The original intent of the mentioned commit was to facilitate the partition_hasher in repair/ code. The hasher itself had been removed by the `0af7a22c21`, so it no longer needs the feed_hash-s to be noexcept. The fix is to inherit noexcept from the called hashers, but for the digester::feed_hash part the noexcept is just removed until clang compilation bug #50994 is fixed. fixes: #8983 tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210706153608.4299-1-xemul@scylladb.com>	2021-07-07 12:00:16 +03:00
Pavel Solodovnikov	b959f5d394	test: lib: copy `query_options` in `single_node_cql_env::execute_cql()` `query_processor::execute_direct()` takes a non-const ref to query options, meaning it's not safe to pass the same instance to subsequent invocations of `execute_direct()` in the tests. Copy default query options at each invocation of `execute_cql()` so no possible side-effects can occur. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>	2021-07-07 11:46:50 +03:00
Nadav Har'El	775a64b003	test/alternator: test for change in CDC preimage In pull request #8568, the CDC API changed slightly, with preimage data gaining extra "delete$k" values for columns whose preimage was missing. In this new test, we verify that this change did not break Alternator. We didn't expect it to break Alternator, because it just outputs the known base-table columns and ignores the columns which weren't a real base-table column - like this "delete$k". In the test we set up a stream with preimages, ensure that a real column (note that an LSI key is a real column instead of a map element) has a null preimage - and see that the preimage is returned as expected, without fake columns like "delete$k". The test passes, showing that PR #8568 was ok. The test also passes, as expected, on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210504120121.915829-1-nyh@scylladb.com>	2021-07-06 14:53:42 +02:00
Nadav Har'El	76227fafad	cql-pytest: use NetworkTopologyStrategy, not SimpleStrategy All tests in cql-pytest use a test keyspace created with the SimpleStrategy replication strategy. This was never intentional. We are recommending to users that they should use NetworkTopologyStrategy instead, and even want to deprecate SimpleStrategy (this is #8586), so tests should stop using SimpleStrategy and should start using the same strategy users would use in their applications - NetworkTopologyStrategy. Almost all tests are fixed by a single change in conftest.py which changes how "test_keyspace" is created. But additionally, tests in test_keyspace.py which explicitly create keyspaces (that's the point of that test file...) also had to be modified to use NetworkTopologyStrategy. Note that none of the tests relied on any special features or implementation details of SimpleStrategy. This patch is part of the bigger effort to remove reliance on SimpleStrategy from all tests, of all types - which we will need to do if we ever want to forbid SimpleStrategy by default. The issue of that effort: Refs #8638 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210620102341.195533-1-nyh@scylladb.com>	2021-07-06 14:52:46 +02:00
Benny Halevy	ac7db8a043	repair: row_level: coroutinize clear_gently Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210706090252.1776563-1-bhalevy@scylladb.com>	2021-07-06 13:42:45 +03:00
Benny Halevy	612793c2d4	locator: token_metadata: reuse utils::stall_free clear_gently helpers Use the generic clear_gently functions that were added in `eca9f45c59`. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210706090243.1776466-1-bhalevy@scylladb.com>	2021-07-06 12:06:43 +03:00
Avi Kivity	4c01a88c9d	logalloc: do not capture backtraces by default in debug mode logalloc has a nice leak/double-free sanitizer, with the nice feature of capturing backtraces to make error reports easy to track down. But capturing backtraces is itself very expensive. This patch makes backtrace capture optional, reducing database_test runtime from 30 minutes to 20 minutes on my machine. Closes #8978	2021-07-06 00:18:22 +02:00
Avi Kivity	2c8d84b864	Merge "Make logging for sstable data corruption useful" from Raphael " When a corrupted sstable fails to be read either on regular read or in regular compaction, our logging is not useful as it can't pinpoint the SSTable that was being read from, also it may not print useful details about the corruption. For example, when a compaction fails on data corruption, a cryptic message like the following will be dumped: compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying There are two problems with the log above: 1) it doesn't tell us which sstable is corrupted 2) it doesn't tell us detailed info about the checksum failure on compressed chunk With those problems fixed, we'll now get a much more useful message: compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584): retrying tests: mode(dev). " * 'log_data_corruption_v2.1' of github.com:raphaelsc/scylla: sstables: Attach sstable name to exception triggered in sstable mutation reader test/broken_sstable_test: Make test more robust sstables: Make log more useful when compressed chunk fails checksum sstables: Use correct exception when compressed chunk fails checksum	2021-07-05 20:37:19 +03:00
Piotr Jastrzebski	27fe3c3aa0	partition_snapshot_flat_reader: Fix typo in next_range_rombstone Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2fd48c08092d6ed1b452b6fe2e43e2273a78c8c2.1625500334.git.piotr@scylladb.com>	2021-07-05 19:39:08 +03:00
Takuya ASADA	f19ebe5709	dist/redhat: fix systemd unit name of scylla-node-exporter systemd unit name of scylla-node-exporter is scylla-node-exporter.service, not node-exporter.service. Fixes #8966 Closes #8967	2021-07-05 18:06:51 +03:00
Avi Kivity	e2f865c739	Merge 'Use expressions to calculate the global-index partition slice' from Dejan Mircevski Another step towards dropping the `restrictions` class. When calculating the partition slice of a global-index table, use `expression` objects instead of a `restrictions` subclass. Refs #7217. Tests: unit (all dev, some debug) Closes #8950 * github.com:scylladb/scylla: cql3: Use expr for global-index partition slice cql3: Fully explain statement_restrictions members cql3: Pass schema reference not pointer cql3: Replace count_if with find_atom cql3: Fix _partition_range_is_simple calculation cql3: Add test cases for indexed partition column	2021-07-05 18:04:54 +03:00
Takuya ASADA	f71f9786c7	dist: stop removing /etc/systemd/system/.mount on package uninstall Listing /etc/systemd/system/.mount as ghost file seems incorrect, since user may want to keep using RAID volume / coredump directory after uninstalling Scylla, or user may want to upgrade enterprise version. Also, we mixed two types of files as ghost files, which should be handled differently: 1. automatically generated by postinst scriptlet 2. generated by user invoked scylla_setup The package should remove only 1, since 2 is generated by user decision. However, just dropping .mount from %files section causes another problem, rpm will remove these files during upgrade, instead of uninstall (#8924). To fix both problems, specify .mount files as "%ghost %config". This keeps the files on both package upgrade and package removal. See scylladb/scylla-enterprise#1780 Closes #8810 Closes #8924 Closes #8959	2021-07-05 18:03:51 +03:00
Nadav Har'El	12b058abdf	Merge 'repair: row_level: clear_gently on close' from Benny Halevy To prevent a reactor stall as seen in #8926 Fixes #8926 Test: unit(dev) DTest: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_diff_shard_count_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8928 * github.com:scylladb/scylla: repair: row_level: clear_gently: clear_gently each repair_row repair: row_level: repair_meta: clear_gently on stop repair: row_level_repair: run: stop master repair_meta utils: stall_free: implemnt clear_gently of froeign_ptr utils: stall_free: define generic clear_gently methods	2021-07-04 23:00:37 +03:00
Piotr Grabowski	64e93ca38c	main: fix max-io-requests spelling in warning text An incorrect spelling of max-io-requests was used in creating the warning message text. Due to conversion to unsigned, the code would crash due to bad_lexical_cast exception. The spelling of this configuration name was fixed in the past (`44f3ad836b`), but only in the 'if' condition. Fix by using the correct spelling. Closes #8963	2021-07-04 18:37:43 +03:00
Tomasz Grabiec	2c727f37fb	api: Drop sstable index caches on system/drop_sstable_caches	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	f553db69f7	cached_file: Issue single I/O for the whole read range on miss Currently, reading a page range would issue I/O for each missing page. This is inefficient, better to issue a single I/O for the whole range and populate cache from that. As an optimization, issue a single I/O if the first page is missing. This is important for index reads which optimistically try to read 32KB of index file to read the partition index page.	2021-07-02 19:02:14 +02:00
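A rough Python model of the read path described in this commit, assuming a hypothetical in-memory `CachedFile` class with 4K pages. Only the "single I/O when the first page is missing" rule comes from the commit message; the per-page fallback in the `else` branch is an assumption made for the sketch.

```python
PAGE_SIZE = 4096


class CachedFile:
    """Toy read-through page cache over an in-memory 'file' (hypothetical)."""

    def __init__(self, data):
        self._data = data
        self._pages = {}      # page number -> bytes
        self.io_count = 0     # number of simulated disk reads

    def _read_from_disk(self, first_page, last_page):
        # One simulated I/O that covers [first_page, last_page] and
        # populates the cache with every page it returns.
        self.io_count += 1
        blob = self._data[first_page * PAGE_SIZE:(last_page + 1) * PAGE_SIZE]
        for i, page_no in enumerate(range(first_page, last_page + 1)):
            self._pages[page_no] = blob[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]

    def read(self, offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        if first not in self._pages:
            # First page missing: issue a single I/O for the whole range.
            self._read_from_disk(first, last)
        else:
            # Assumption of this sketch: fill remaining holes page by page.
            for page_no in range(first, last + 1):
                if page_no not in self._pages:
                    self._read_from_disk(page_no, page_no)
        buf = b"".join(self._pages[p] for p in range(first, last + 1))
        skip = offset - first * PAGE_SIZE
        return buf[skip:skip + length]


data = bytes(range(256)) * 512            # 128 KiB of sample data
cf = CachedFile(data)
head = cf.read(0, 32 * 1024)              # 32K optimistic index read
again = cf.read(4096, 100)                # served entirely from cache
print(cf.io_count)
```

The 32K read spans eight 4K pages but costs one simulated I/O, and the follow-up read inside that range costs none.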
Tomasz Grabiec	6a6403d19d	row_cache: cache_tracker: Do not register metrics when constructed for tests Some tests will create two cache_tracker instances because of one being embedded in the sstable test env. This would lead to double registration of metrics, which raises run time error. Avoid by not registering metrics in prometheus in tests at all.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	1f74863bf8	sstables, cached_file: Evict cache gently when sstable is destroyed We must evict before the _cached_index_file associated with the sstable goes away. Better to do it gently to avoid stalls.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	f14576f4be	sstables: Hide partition_index_cache implementation away from sstables.hh Reduces scope of the header to index_reader.hh which reduces recompilation time.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	7d34799f3f	sstables: Drop shared_index_lists alias	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	af4cc233c3	sstables: Destroy partition index cache gently There could be a lot of them so we should clear it gently to avoid reactor stalls.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	9f957f1cf9	sstables: Cache partition index pages in LSA and link to LRU As part of this change, the container for partition index pages was changed from utils::loading_shared_values to intrusive_btree. This is to avoid reactor stalls which the former induces with a large number of elements (pages) due to its use of a hashtable under the hood, which reallocates contiguous storage.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	b3728f7d9b	utils: Introduce lsa::weak_ptr<> Simplifies managing non-owning references to LSA-managed objects. The lsa::weak_ptr is a smart pointer which is not invalidated by LSA and can be used safely in any allocator context. Dereferenced will always give a valid reference. This can be used as a building block for implementing cursors into LSA-based caches. Example simple use: // LSA-managed struct X : public lsa::weakly_referencable<X> { int value; }; lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] { X* x = current_allocator().construct<X>(); return x->weak_from_this(); }); std::cout << x_ptr->value;	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	2a852cd0c9	sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache The new names are less confusing.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	934824394a	sstables, cached_file: Avoid copying buffers from cache when parsing promoted index	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	7b6f18b4ed	cached_file: Introduce get_page_units() Will be needed later for reading a page view which cannot use make_tracked_temporary_buffer(). Standardize on get_page_units(), converting existing code to wrap the units in a deleter.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	23bc19643f	sstables: read: Document that primitive_consumer::read_32() is alloc-free Callers will rely on it to assume that it does not invalidate references to LSA objects.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	b98e660a4a	sstables: read: Count partition index page evictions	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	8360a64f73	sstables: Drop the _use_binary_search flag from index entries It doesn't have to be set by the parser now that the cursors are created lazily, pass it to the cursor when it's created.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	06e373e272	sstables: index_reader: Keep index objects under LSA In preparation for caching index objects, manage them under LSA. Implementation notes: key_view was changed to be a view on managed_bytes_view instead of bytes, so it now can be fragmented. Old users of key_view now have to linearize it. Actual linearization should be rare since partition keys are typically small. Index parser is now not constructing the index_entry directly, but produces value objects which live in the standard allocator space: class parsed_promoted_index_entry; class parsed_partition_index_entry; This change was needed to support consumers which don't populate the partition index cache and don't use LSA, e.g. sstable::generate_summary(). It's now consumer's responsibility to allocate index_entry out of parsed_partition_index_entry.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	20ef54e9ed	lsa: chunked_managed_vector: Adapt more to managed_vector For seamless transition.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	78e5b9fd85	utils: lsa: chunked_managed_vector: Make LSA-aware The max chunk size is set to be 10% of segment size.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	856e4a539d	test: chunked_managed_vector_test: Make exception_safe_class standard layout Required by managed_vector<> due to its use of offsetof() In preparation for switching chunked_managed_vector storage to managed_vector<>.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	c87ea09535	lsa: Copy chunked_vector to chunked_managed_vector In preparation for adapting it to LSA. Split into two steps to make review easier.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	1523a7d367	utils: managed_vector: Make clear_and_release() public Will be needed by index reader to ensure that destructor doesn't invoke the allocator so that all is destroyed in the desired allocation context before the object is destroyed.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	2b673478aa	sstables: index_reader: Do not expose index_entry references index_entry will be an LSA-managed object. Those have to be accessed with care, with the LSA region locked. This patch hides most of direct index_entry accesses inside the index_reader so that users are safe.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	a955e7971d	sstables: index_reader: Don't store schema reference inside index_entry To save space.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	9e7bf066a9	sstables: index_reader: Don't store file object inside promoted_index The file object which is currently stored there has per-request tracing wrappers (permit, trace_state) attached to it. It doesn't make sense once the entry is cached and shared. Annotate when the cursor is created instead.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	86b135056c	sstables: index_reader: Don't store front buffer inside promoted_index Index reads and promoted index reads are both using the same cached_file now, so there's no need to pass the buffers between the index parser and promoted index reader. Makes the promoted_index structure easier to move to LSA.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	484e06d69b	cached_file: Always start at offset 0 All current uses start at offset 0, so simplify the code by assuming it.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	078a6e422b	sstables: Cache all index file reads After this patch, there is a single index file page cache per sstable, shared by index readers. The cache survives reads, which reduces the amount of I/O on subsequent reads. As part of this, cached_file needed to be adjusted in the following ways. The page cache may occupy a significant portion of memory. Keeping the pages in the standard allocator could cause memory fragmentation problems. To avoid them, the cached_file is changed to keep buffers in LSA using lsa_buffer allocation method. When a page is needed by the seastar I/O layer, it needs to be copied to a temporary_buffer which is stable, so must be allocated in the standard allocator space. We copy the page on-demand. Concurrent requests for the same page will share the temporary_buffer. When page is not used, it only lives in the LSA space. In the subsequent patches cached_file::stream will be adjusted to also support access via cached_page::ptr_type directly, to avoid materializing a temporary_buffer. While a page is used, it is not linked in the LRU so that it is not freed. This ensures that the storage which is actively consumed remains stable, either via temporary_buffer (kept alive by its deleter), or by cached_page::ptr_type directly.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	b5ca0eb2a2	lsa: Introduce lsa_buffer lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns buffers allocated inside LSA segments. It uses an alternative allocation method which differs from regular LSA allocations in the following ways: 1) LSA segments only hold buffers, they don't hold metadata. They also don't mix with standard allocations. So a 128K segment can hold 32 4K buffers. 2) objects' life time is managed by lsa_buffer, an owning smart pointer, which is automatically updated when buffers are migrated to another segment. This makes LSA allocations easier to use and off-loads metadata management to the client (which can keep the lsa_buffer wherever it wants). The metadata is kept inside segment_descriptor, in a vector. Each allocated buffer will have an entangled object there (8 bytes), which is paired with an entangled object inside lsa_buffer. The reason to have an alternative allocation method is to efficiently pack buffers inside LSA segments.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	a23f27034f	lsa: Introduce entangled helper Will be useful in building higher-level LSA tools.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	056f14063e	lsa: Encapsulate segment_descriptor::_free_space access Prepares for reusing some of its bits for storing segment kind.	2021-07-02 19:02:13 +02:00
Dejan Mircevski	845e36e761	cql3: Use expr for global-index partition slice Instead of creating a single_column_clustering_key_restrictions object, create an equivalent vector of expr::expressions and calculate from it the clustering ranges just like we do for base-table queries. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	75f4325ee4	cql3: Fully explain statement_restrictions members Nail down the assumptions before making further use of these variables. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	28e92dfa4c	cql3: Pass schema reference not pointer ... to get_single_column_clustering_bounds(). No need for the pointer; a reference is simpler and cleaner. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	de17b5449b	cql3: Replace count_if with find_atom count_if finds all matching atoms, which is redundant when we only want to find one. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	3a149daab5	cql3: Fix _partition_range_is_simple calculation Was updated for every restriction instead of only for partition ones. The only impact is on performance. The bug was introduced in `4661aa0` "cql3: Track IN partition-key restrictions". Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	53f376b83f	cql3: Add test cases for indexed partition column We didn't have a case when a global index exists on a partition column and the SELECT statement specifies the full partition key. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:28:56 +02:00
Tomasz Grabiec	019956739d	cached_file: Switch to bplus::tree In order to be able to move it to LSA later.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	f537d1a7e5	tests: sstables: Do not call open_data() twice make_sstable_containing() already calls open_data(), so does load(). This will trigger assertion failure added in a later patch: assert(!_cached_index_file); There is no need to call load() here.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	627a2ef087	test: cached_file: Add test for eof_error	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	8fbea0b5b7	utils: cached_file: Introduce file wrapper It's an adaptor between seastar::file and cached_file. It gives a seastar::file which will serve reads using a given cached_file as a read-through cache.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	8e2118069b	sstables: cached_file: Account buffers returned by cached_file under read_permit We want buffers to be accounted only when they are used outside cached_file. Cached pages should not be accounted because they will stay around for longer than the read after subsequent commits.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	a5c72ed899	sstables, database: Keep cache_tracker reference inside sstables_manager So that sstable code can pick it up for caching (lru and region).	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	4b51e0bf30	row_cache: Move cache_tracker to a separate header It will be needed by the sstable layer to get to the LRU and the LSA region. Split to avoid inclusion of whole row_cache.hh	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	7fa4e10aa0	row_cache: Use generic LRU for eviction In preparation for tracking different kinds of objects, not just rows_entry, in the LRU, switch to the LRU implementation from utils/lru.hh which can hold arbitrary element type.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	6b59c8cfb1	utils: Introduce general-purpose LRU The LRU can link objects of different types, which is achieved by having a virtual base class called "evictable" from which the linked objects should inherit. When the object is removed from the LRU, evictable::on_evicted() is called. The container is non-owning.	2021-07-02 10:25:58 +02:00
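The design described in this commit can be approximated in Python as follows. This is only a sketch: `Evictable` and `on_evicted` mirror the names in the commit message, while `LRU`, `add`, `touch`, `evict`, `RowEntry`, and `IndexPage` are hypothetical. Unlike the non-owning C++ container, a Python dictionary keeps its elements alive.

```python
from collections import OrderedDict


class Evictable:
    """Base class for anything that can be linked into the LRU."""

    def on_evicted(self):
        raise NotImplementedError


class LRU:
    def __init__(self):
        self._items = OrderedDict()   # id(obj) -> obj, least recent first

    def add(self, obj):
        self._items[id(obj)] = obj

    def touch(self, obj):
        # Mark obj as most recently used.
        self._items.move_to_end(id(obj))

    def remove(self, obj):
        self._items.pop(id(obj), None)

    def evict(self):
        # Unlink the least recently used object and notify it.
        if not self._items:
            return False
        _, obj = self._items.popitem(last=False)
        obj.on_evicted()
        return True


log = []


class RowEntry(Evictable):
    def __init__(self, key):
        self.key = key

    def on_evicted(self):
        log.append(("row", self.key))


class IndexPage(Evictable):
    def __init__(self, page_no):
        self.page_no = page_no

    def on_evicted(self):
        log.append(("page", self.page_no))


lru = LRU()
row, page = RowEntry("k1"), IndexPage(7)
lru.add(row)
lru.add(page)
lru.touch(row)           # the row is now more recent than the index page
lru.evict()              # evicts the index page first
lru.evict()              # then the row
print(log)
```

Because eviction goes through the shared base class, one list can order rows and index pages against each other, which is what lets the sstable index caches compete with the row cache for memory.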
Benny Halevy	68bd748af2	repair: row_level: clear_gently: clear_gently each repair_row Rows might be large so free them gently by: - add bytes_ostream.clear_gently that may yield in the chunk freeing loop. - use that in frozen_mutation_fragment, contained in repair_row. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-01 19:16:11 +03:00
Benny Halevy	defe789d46	repair: row_level: repair_meta: clear_gently on stop To prevent a reactor stall as seen in #8926 Note: this patch doesn't use coroutines, to facilitate backporting. Coroutinization will be done in a follow-up patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-01 19:16:11 +03:00
Benny Halevy	0bf46751fa	repair: row_level_repair: run: stop master repair_meta Not only close it. Next patch will use clear_gently on stop to prevent reactor stalls. double-stop prevention code was added to stop() since, in the error-free case, repair_meta master is already stopped by `repair_row_level_stop` when auto_stop master deferred action is called. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-01 19:16:11 +03:00
Benny Halevy	9963d15613	utils: stall_free: implemnt clear_gently of froeign_ptr clear_gently of the foreign_ptr needs to run on the owning shard, so provide a specialization from the SmartPointer implementation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-01 19:16:11 +03:00
Benny Halevy	eca9f45c59	utils: stall_free: define generic clear_gently methods Define a bunch of clear_gently methods that asynchronously clear the contents of containers and allow yielding. This replaces clear_gently(std::list<T>&) used by row level repair by a more generic template implementation. Note that we do not use coroutines in this patch to facilitate backporting to releases that do not support coroutines and since a miscompilation bug was hit with clang++ 11 when attempting to coroutinize this patch (see https://bugs.llvm.org/show_bug.cgi?id=50345). Test: stall_free_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-07-01 19:00:49 +03:00
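The idea of clearing a container while giving other tasks a chance to run can be illustrated with Python's `asyncio`. This is a loose analogy rather than the Seastar implementation: the original patch deliberately avoids C++ coroutines, whereas this sketch uses Python coroutines for brevity, and the recursive handling of nested lists and dicts is an assumption.

```python
import asyncio


async def clear_gently(obj):
    """Empty a list or dict element by element, yielding between elements."""
    if isinstance(obj, dict):
        while obj:
            _, value = obj.popitem()
            await clear_gently(value)
            await asyncio.sleep(0)    # give other tasks a turn
    elif isinstance(obj, list):
        while obj:
            await clear_gently(obj.pop())
            await asyncio.sleep(0)
    # Any other object has nothing to clear.


async def main():
    data = [[i] * 3 for i in range(5)]
    ticks = 0

    async def ticker():
        # Stands in for unrelated work that must not be starved.
        nonlocal ticks
        while data:
            ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    await clear_gently(data)
    await task
    return data, ticks


remaining, ticks = asyncio.run(main())
print(remaining, ticks > 0)
```

The `ticker` task makes progress while the container is being emptied, which is the property that prevents a long, uninterrupted destruction from stalling the event loop.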
Avi Kivity	4209dfd753	Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond " This mini-series fixes two loosely related bugs around reader recreation in the evictable reader (related by both being around reader recreation). A unit test is also added which reproduces both of them and checks that the fixes indeed work. More details in the patches themselves. This series replaces the two independent patches sent before: * [PATCH v1] evictable_reader: always reset static row drop flag * [PATCH v1] evictable_reader: relax partition key check on reader recreation As they depend on each other, it is easier to add a test if they are in a series. Fixes: #8923 Fixes: #8893 Tests: unit(dev, mutation_reader_test:debug) " * 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla: test: mutation_reader_test: add more test for reader recreation evictable_reader: relax partition key check on reader recreation evictable_reader: always reset static row drop flag	2021-07-01 14:15:46 +03:00
Asias He	e95f7f4af0	storage_service: Make heartbeat response log debug level It is too noisy. It is supposed to be debug level. I forgot to move back to debug level after testing during development. Refs #8825 Closes #8960	2021-07-01 11:32:43 +03:00
Avi Kivity	0d87744ba0	Revert "dist: stop removing /etc/systemd/system/*.mount on package uninstall" This reverts commit `a677c46672`. It causes upgrade from a version that did not have a commit to a version that does have the commit to lose the .mount files, since they change from being owned by the package (via %ghost) to not being owned. Fixes #8924.	2021-07-01 08:55:54 +03:00
Nadav Har'El	7a5111c580	Merge 'messaging_service: do not listen on port 0' from Benny Halevy We never want to listen on port 0, even if configured so. When the listen port is set to 0, the OS will choose the port randomly, which makes it useless for communicating with other nodes in the cluster, since we don't support that. Also, it causes the listen_ports_conf_test internode_ssl_test to fail since it expects to disable listening on storage_port or ssl_storage_port when set to 0, as seen in https://github.com/scylladb/scylla-dtest/issues/2174. Fixes #8957 Test: unit(dev) DTest: listen_ports_conf_test (modified) Closes #8956 * github.com:scylladb/scylla: messaging_service: do_start_listen: improve info log accuracy messaging_service: never listen on port 0	2021-06-30 18:41:58 +03:00
Nadav Har'El	7ab48b405f	CQL: always validate NetworkTopologyStrategy replication factor The replication factor passed to NetworkTopologyStrategy (which we call by the confusing name "auto expand") may or may not be used (see explanation why in #8881), but regardless, we should validate that it's a legal number and not some non-numeric junk, and we should report the error. Before this patch, the two commands CREATE KEYSPACE name WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 } ALTER KEYSPACE name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 'foo' } succeed despite the invalid replication factor "foo". After this patch, the second command fails. The problem fixed here is reproduced by the existing test test_keyspace.py::test_alter_keyspace_invalid when switching it to use NetworkTopologyStrategy, as suggested by issue #8638. Fixes #8880 Refs #8881 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210620100442.194610-1-nyh@scylladb.com>	2021-06-30 16:49:46 +03:00
Benny Halevy	51bc6c8b5a	messaging_service: do_start_listen: improve info log accuracy Make sure to log the info message when we actually start listening. Also, print a log message when listening on the broadcast address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-30 16:25:21 +03:00
Benny Halevy	df442d4d24	messaging_service: never listen on port 0 We never want to listen on port 0, even if configured so. When the listen port is set to 0, the OS will choose the port randomly, which makes it useless for communicating with other nodes in the cluster, since we don't support that. Also, it causes the listen_ports_conf_test internode_ssl_test to fail since it expects to disable listening on storage_port or ssl_storage_port when set to 0, as seen in https://github.com/scylladb/scylla-dtest/issues/2174. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-30 16:24:54 +03:00
Botond Dénes	75e8d2d04a	test: mutation_reader_test: add more test for reader recreation	2021-06-30 11:21:58 +03:00
Botond Dénes	852bf6befd	evictable_reader: relax partition key check on reader recreation When recreating the underlying reader, the evictable reader validates that the first partition key it emits is what it expects to be. If the read stopped at the end of a partition, it expects the first partition to be a larger one. If the read stopped in the middle of a certain partition it expects the first partition to be the same it stopped in the middle of. This latter assumption doesn't hold in all circumstances however. Namely, the partition it stopped in the middle of might get compacted away in the time the read was paused, in which case the read will resume from a greater partition. This perfectly valid case however currently triggers the evictable reader's self validation, leading to the abortion of the read and a scary error to be logged. Relax this check to accept any partition that is >= compared to the one the read stopped in the middle of.	2021-06-30 11:21:53 +03:00
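The relaxed check amounts to a single comparison. A minimal Python sketch, using integers as stand-ins for ordered partition keys and a hypothetical function name:

```python
def first_partition_is_valid(first_key, last_key, stopped_mid_partition):
    """Validate the first partition emitted by a recreated reader.

    Keys are plain integers here; real partition keys are ordered by
    token and key, which this sketch does not model.
    """
    if stopped_mid_partition:
        # Relaxed rule: the same partition, or a greater one if the
        # original was compacted away while the read was paused.
        return first_key >= last_key
    # The read stopped at a partition end: only a greater one is valid.
    return first_key > last_key


print(first_partition_is_valid(5, 5, True))    # resumes in the same partition
print(first_partition_is_valid(8, 5, True))    # partition 5 was compacted away
print(first_partition_is_valid(3, 5, True))    # going backwards is still an error
```

Before the fix, the mid-partition case effectively required equality, so the second call above would have been rejected.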
Botond Dénes	2740dd2ae4	evictable_reader: always reset static row drop flag When the evictable reader recreates its underlying reader, it does it such that it continues from the exact mutation fragment the read was left off from. There are however two special mutation fragments, the partition start and static row that are unconditionally re-emitted at the start of a new read. To work around this, when stopping at either of these fragments the evictable reader sets two flags _drop_partition_start and _drop_static_row to drop the unneeded fragments (that were already emitted by it) from the underlying reader. These flags are then reset when the respective fragment is dropped. _drop_static_row has a vulnerability though: the partition doesn't necessarily have a static row and if it doesn't the flag is currently not cleared and can stay set until the next fill buffer call causing the static row to be dropped from another partition. To fix, always reset the _drop_static_row flag, even if no static row was dropped (because it doesn't exist).	2021-06-30 10:05:35 +03:00
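A simplified Python model of the bug and the fix, assuming fragments are `(kind, partition)` tuples and that resuming mid-partition sets both drop flags. The function name `resume` and the `always_reset` switch are hypothetical; the real reader works across fill-buffer calls, which this sketch flattens into one loop.

```python
def resume(fragments, drop_partition_start, drop_static_row, always_reset):
    """Filter the stream of a recreated reader, dropping re-emitted fragments."""
    out = []
    for kind, partition in fragments:
        if kind == "partition_start" and drop_partition_start:
            drop_partition_start = False
            continue
        if kind == "static_row" and drop_static_row:
            drop_static_row = False
            continue
        if always_reset and kind != "partition_start":
            # Fixed behaviour: once past the slot where the static row
            # would have been, the flag must not outlive the partition.
            drop_static_row = False
        out.append((kind, partition))
    return out


# Partition A has no static row; partition B has one.
stream = [
    ("partition_start", "A"), ("row", "A"), ("partition_end", "A"),
    ("partition_start", "B"), ("static_row", "B"), ("row", "B"),
    ("partition_end", "B"),
]

buggy = resume(stream, True, True, always_reset=False)
fixed = resume(stream, True, True, always_reset=True)
print(("static_row", "B") in buggy, ("static_row", "B") in fixed)
```

With `always_reset=False` the stale flag survives partition A and swallows the static row of partition B; with the reset in place only the re-emitted partition start is dropped.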
Nadav Har'El	029991bfc2	test/cql-pytest: test that SSL CQL port doesn't accept unencrypted connections Scylla doesn't allow unencrypted connections over encrypted CQL ports (Cassandra does allow this, by setting "optional: true", but it's not secure and not recommended). Here we add a test that indeed, we can't connect to an SSL port using an unencrypted connection. The test passes on Scylla, and also on Cassandra (run it on Cassandra with "test/cql-pytest/run-cassandra --ssl" - for which we added support in a recent patch). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210629121514.541042-1-nyh@scylladb.com>	2021-06-29 16:42:22 +03:00
Nadav Har'El	dc4c05b2e3	test/cql-pytest: switch some fixture scopes from "session" to "module" Fixtures in conftest.py (e.g., the test_keyspace fixture) can be shared by all tests in all source files, so they are marked with the "session" scope: All the tests in the testing session may share the same instance. This is fine. Some of the test files have additional fixtures for creating special tables needed only in those files. Those were, unnecessarily, marked with the "session" scope as well. This means that these temporary tables are only deleted at the very end of the test suite, even though they can be deleted at the end of the test file which needed them - other test source files don't have access to them anyway. This is exactly what the "module" fixture scope is, so this patch changes all the fixtures that are private to one test file to use the "module" scope. After this patch, the teardown of the last test in the suite goes down from 0.26 seconds to just 0.06 seconds. Another benefit is that the peak disk usage of the test suite is lower, because some of the temporary tables are deleted sooner. This patch does not change any test functionality, and also does not make any test faster - it just changes the order of the fixture teardowns. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #8932	2021-06-29 16:10:47 +03:00
Calle Wilund	a40b6a2f54	commitlog: Use disk file alignment info (with lower value if possible) Previously, the disk block alignment of segments was hardcoded (due to really old code). Now we use the value as declared in the actual file opened. If we are using a previously written file (i.e. o_dsync), we can even use the sometimes smaller "read" alignment. Also allow config to completely override this with a disk alignment config option (not exposed to global config yet, but can be). v2: * Use overwrite alignment if doing only overwrite * Ensure to adjust actual alignment if/when doing file wrapping v3: * Kill alignment config param. Useless and unsafe. Closes #8935	2021-06-29 16:00:49 +03:00
Nadav Har'El	7e4bef96af	test/cql-pytest: support "--ssl" option in run-cassandra This patch adds support for the "--ssl" option in run-cassandra, which will now be able, like run (which runs Scylla), to run Cassandra listening for SSL-encrypted CQL connections. The "--ssl" option is also passed to the tests, so they know to encrypt their CQL connections. We already had support for this feature in the test/cql-pytest/run script - which runs Scylla. Adding this also to the run-cassandra script can help verify that a behavior we notice in Scylla's SSL support, and want to add to a test, is also shared by Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210629082532.535229-1-nyh@scylladb.com>	2021-06-29 12:05:40 +03:00
Raphael S. Carvalho	ef76cdb2c7	sstables: Attach sstable name to exception triggered in sstable mutation reader When compaction fails due to a failure that comes from a specific sstable, like on data corruption, the log isn't telling which sstable contributed to that. Let's always attach the sstable name to the exception triggered in sstable mutation reader. Exceptions in la and mx consumer attached sst name, but now only sst mutation reader will do it so as to avoid duplicating the sst name. Now: ERROR 2021-06-11 16:07:34,489 [shard 0] compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584): retrying Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-06-28 12:54:24 -03:00
Raphael S. Carvalho	95cc48508c	test/broken_sstable_test: Make test more robust Test breaks very easily whenever there's a change in the message formatted for malformed_sstable_exception. Make test more robust by not checking exact message, but that the message contains both the expected exception and the sstable filename. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-06-28 12:27:29 -03:00
Raphael S. Carvalho	c7424ca6e6	sstables: Make log more useful when compressed chunk fails checksum The current log is useless when checksum fails, you don't know which compressed chunk failed checksum, the expected and the actual checksum, the size of chunk and so on. Before: compressed chunk failed checksum Now: compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-06-28 12:25:51 -03:00
Raphael S. Carvalho	9daf5d1ab8	sstables: Use correct exception when compressed chunk fails checksum Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-06-28 12:25:44 -03:00
Takuya ASADA	edd54a9463	reloc: add arch to relocatable package filename Add architecture name for relocatable packages, to support distributing both x86_64 version and aarch64 version. Also create symlink from new filename to old filename to keep compatibility with older scripts. Fixes #8675 Closes #8709 [update tools/python3 submodule: * tools/python3 ad04e8e...afe2e7f (1): > reloc: add arch to relocatable package filename ]	2021-06-28 15:01:09 +03:00
Avi Kivity	f660726773	Update seastar submodule * seastar 0e48ba883...eaa00e761 (3): > memory: reduce statistics TLS initialization even more > Merge "Sanitize io-topology creation on start" from Pavel E > doc/prometheus: note that metric family is passed by query name	2021-06-28 11:52:36 +03:00
Botond Dénes	09309f5dbf	reader_concurrency_semaphore: on_permit_created(): remove noexcept The permit creation path enters the semaphore's permit gate in on_permit_created(). Entering this gate can throw so this method is not noexcept. Remove the noexcept specifier accordingly. Also enter the gate before adding the permit to the permit list, to save some work when this fails. Fixes: #8933 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210628074941.32878-1-bdenes@scylladb.com>	2021-06-28 11:04:38 +03:00
Avi Kivity	c0c1e26014	Merge 'Remove code writing LA/KA sstables' from Piotr Jastrzębski Now that all supported versions write mc/md sstables, we can deprecate the MC_SSTABLE feature bit and consider it implicitly true, and with it the ability to write la/ka sstables. We still need to support reading them, e.g. from restoring old snapshots or migrating data from legacy clusters. Test: unit(dev, debug) Fixes #8352 Closes #8884 * github.com:scylladb/scylla: compress: Remove unused make_compressed_file_k_l_format_output_stream sstables: move sstable_writer to separate header sstable_writer: remove get_metadata_collector sstables: stop including metadata_collector.hh in sstables.hh sstables: Remove duplicated friend declaration sstables: remove unused KL writer sstables: Always use MC/MD writer sstable_datafile_test: switch tests to use latest sstables format sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version test_offstrategy_sstable_compaction: test all writable sstables compaction_with_fully_expired_table: Remove some LA specific code sstable_mutation_test: test latest sstable format instead of LA sstable_test: Test MX sstables instead of KA/LA sstable_datafile_test: Fix schema used by check_compacted_sstables sstables: Remove LA/KA sstable writting tests that check exact format sstables: define writable_sstable_versions features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled	2021-06-27 20:50:51 +03:00
Avi Kivity	121971ec0f	Merge "storage_proxy: specialize query_singular() for non-IN queries" from Gleb " query_singular() accepts a partition_range_vector, corresponding to an IN query. But such queries are rare compared to single-partition queries. Co-routinise the code and special case non-IN queries by avoiding the call to map_reduce. Also replace executers array with small_vector to avoid an allocation in the common case. perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10: before: median 204545.04 tps ( 81.1 allocs/op, 15.1 tasks/op, 48828 insns/op) after: median 219769.97 tps ( 74.1 allocs/op, 12.1 tasks/op, 46495 insns/op) So, a ~7% improvement in tps and 5% improvement in instructions per op. Also large reduction in tasks and allocations. This is an alternative proposal to https://github.com/scylladb/scylla/pull/8909. The benefit of this one is that it does not duplicate any code (almost). " * 'query_singular-coroutine' of github.com:scylladb/scylla-dev: storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried storage_proxy: use small_vector in storage_proxy::query_singular to store executors storage_proxy: co-routinize storage_proxy::query_singular()	2021-06-27 16:30:19 +03:00
Piotr Jastrzebski	10228b35c5	compress: Remove unused make_compressed_file_k_l_format_output_stream Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	430fd5cfa9	sstables: move sstable_writer to separate header This class is used in only few places and does not have to be included everywhere sstable class is needed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	9e7144f719	sstable_writer: remove get_metadata_collector This function is only called internally so it does not have to be exposed and can be inlined instead. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	2d6608bb88	sstables: stop including metadata_collector.hh in sstables.hh metadata collector is rarely used so it's better to include it only in those few places. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	39851f76fc	sstables: Remove duplicated friend declaration Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	c7096470bf	sstables: remove unused KL writer Previous two patches removed the usage of KL writer so the code is now dead and can be safely removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	8293384189	sstables: Always use MC/MD writer Previous patch made MC the lowest sstables format in use so the removed check is always true now and we can remove the else part. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	314bc0e8a5	sstable_datafile_test: switch tests to use latest sstables format instead of LA. Ability to write LA and KA sstables will be removed by the following patches so we need to switch all the tests to write newer sstables. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	f03ed9b9a7	sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	1ed298b08b	test_offstrategy_sstable_compaction: test all writable sstables Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:12 +02:00
Gleb Natapov	e5154e77d3	storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried Not using IN queries is a common case, so avoid map_reduce overhead for them.	2021-06-27 14:49:44 +03:00
Gleb Natapov	8018eb4612	storage_proxy: use small_vector in storage_proxy::query_singular to store executors Having only one pk to query is the common case, so avoid allocating executor vector for that case.	2021-06-27 14:48:15 +03:00
Gleb Natapov	d908912dbd	storage_proxy: co-routinize storage_proxy::query_singular() The code becomes much easier to understand and allocation for used_replicas can be dropped.	2021-06-27 14:47:19 +03:00
Avi Kivity	b7cb687d36	Merge "Harden storage_service::stop_transport" from Pavel E " Stopping transport (cql, thrift, messaging, etc.) can happen from several places -- drain, decommission, stop, isolation. Some of them can even run in parallel. This patch makes transport stopping bulletproof. tests: unit(dev) start-stop (dev) start-drain-stop (dev) fixes: #8911 " * 'br-stop-transport-races' of https://github.com/xemul/scylla: storage_service: Indentation fix storage_service: Make stop_transport re-entrable storage_service: Stop transport on decommission	2021-06-27 14:46:46 +03:00
Pavel Emelyanov	7014da9404	storage_service: Unregister disk error handlers on stop Storage service installs disk error handlers in its constructor and these connections are not unregistered. It's not a problem in real life, because storage service is not stopped, but in some tests this can lead to use-after-frees. The sstables_datafile_test runs some of the testcases in cql_test_env which starts and (!) stops the storage service. Other testcases are run in a lightweight sstables_test_env which does not mess with the storage service at all. Now, if a case of the 2nd kind is run after one of the 1st kind and (for whatever reason) generates a disk error it will trigger use-after-free -- after the 1st testcase the storage service disk error registration would remain, but the storage service itself would already be stopped, thus triggering the disk error will try to access the stopped sharded storage service inside the .isolate(). The fix is to keep the scoped connection on the storage service list of various listeners. On stop it will go away automagically. tests: unit(dev), sstables_datafile_test with forced disk error Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210625062648.27812-1-xemul@scylladb.com>	2021-06-27 14:41:55 +03:00
Avi Kivity	6676ceabde	Merge 'Prevent reactor stall in utils::merge_to_gently' from Benny Halevy std::copy_if runs without yielding. See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480 Also, eliminate extraneous loop on merge first1 will point to the inserted value which is a copy of first2. Since list2 is sorted in ascending order, the next item from list2 will never be less than the one we've just inserted, so we waste an iteration to merely increment first1 again. Fixes #8897 Test: unit(dev), stall_free_test(debug) DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev) Closes #8925 github.com:scylladb/scylla: utils: merge_to_gently: eliminate extraneous loop on merge utils: merge_to_gently: prevent stall in std::copy_if	2021-06-27 13:56:32 +03:00
Raphael S. Carvalho	29c93ae592	compaction: Reduce backlog of compacting SSTable properly It was observed that as compaction progresses the backlog of compacting SSTable is being reduced very slowly, which causes shares to be higher than needed, and consequently compaction acts much more aggressively than it has to. https://user-images.githubusercontent.com/1409139/120237819-93dfc080-c232-11eb-9042-68114e285ea0.png The graph above shows the amount of backlog that is reduced from a SSTable being compacted. The red line denotes the total backlog of the SSTable, before it's selected for compaction. The expectation is that the more a SSTable is compacted the more backlog will be reduced from it. However, in the current implementation, it can be seen that the backlog to be reduced, from the SSTable being compacted, starts being inversely proportional to the amount of data already compacted. Turns out that this problem happens because the implementation of backlog formula becomes incorrect when the SSTable is being compacted. Backlog for a SSTable is currently defined as: Bi = Ei * log (T / Ei) where Ei = Si - Ci (bytes left to be compacted) and Si = size of SSTable and Ci = total bytes compacted and T = total size of table The formula above can also be rewritten as follows: Bi = Ei * log (T) - Ei * log (Ei) The second term `Ei * log (Ei)` can be rewritten as: = (Si - Ci) * log (Ei) = Si * log (Ei) - Ci * log (Ei) However, digging into the backlog implementation, it turns out that we're incorrectly implementing that second term as: = Si * log (Si) - Ci * log (Ei) Given that Si > Ei, for a SSTable being compacted, the backlog will be higher than it should be. The following table shows how the backlog of a SSTable being compacted behaves now versus how it's supposed to behave: https://gist.github.com/raphaelsc/42e14be0d7d4ed264e538c2d217c8f95 Turns out that this is not the only problem. 
It was a mistake to change the formula from `Ei * log(T / Si)` to `Ei * log(T / Ei)`, when fixing the shrinking table issue, because that also causes the backlog of a compacting SSTable to be incorrectly reduced. With the formula rewritten as follows: Bi = Ei * log (T) - Ei * log (Ei) It becomes clear that the more a SSTable is compacted, the slower it becomes for backlog to be reduced, as T / Ei can increase considerably over time. So we're reverting the formula back to `Ei * log(T / Si)`. The graph below shows a better backlog behavior when table is shrinking: https://user-images.githubusercontent.com/1409139/123495186-06a54700-d5f9-11eb-9386-3fcf4dd8e4d3.png While analyzing the problem when table is shrinking, realized that it's because T in the formula is implemented as the effective size (total + partial - compacted). With the new formula rewritten as follows: Bi = Ei * log (T) - Ei * log (Si) It becomes clearer that T cannot be lower than Si whatsoever, otherwise the backlog becomes negative. Also, while table is shrinking, it can happen that the backlog will be so low that compaction will barely make any progress. To fix both issues, let's implement T as total size (sum of all Si) rather than effective size (sum of all Ei). The graph below shows that this change prevents the backlog from going negative while still providing similar and expected behavior as before, see: https://user-images.githubusercontent.com/1409139/123495185-060cb080-d5f9-11eb-89f7-ed445729702a.png Fixes #8768. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210626003133.3011007-1-raphaelsc@scylladb.com>	2021-06-27 11:43:48 +03:00
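The two per-SSTable formulas discussed in the commit above can be compared numerically. The sketch below is illustrative only: the function names are made up, the natural logarithm is an assumption (the message does not state the base), and T is taken as the sum of all Si as the message proposes.

```python
import math


def backlog_reverted(si, ci, total):
    """Bi = Ei * log(T / Si), with Ei = Si - Ci (formula the commit reverts to)."""
    ei = si - ci
    return ei * math.log(total / si)


def backlog_previous(si, ci, total):
    """Bi = Ei * log(T / Ei), with Ei = Si - Ci (formula being replaced)."""
    ei = si - ci
    if ei == 0:
        return 0.0  # nothing left to compact
    return ei * math.log(total / ei)


# Hypothetical table made of three SSTables; T is the sum of all Si.
sizes = [100, 300, 600]
total = sum(sizes)

for compacted in (0, 25, 50, 75):
    print(compacted,
          round(backlog_reverted(100, compacted, total), 2),
          round(backlog_previous(100, compacted, total), 2))
```

With `log(T / Si)` the backlog of the 100-byte SSTable shrinks linearly with the bytes already compacted, whereas with `log(T / Ei)` the logarithm grows as Ei shrinks, so the backlog drops more slowly. Because T is at least as large as any single Si, the reverted formula never goes negative in this model.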
Pavel Emelyanov	a89ae9a8e7	storage_service: Indentation fix Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:21:10 +03:00
Pavel Emelyanov	bd2a58060e	storage_service: Make stop_transport re-entrable It may happen that a disk error occurs and the subsequent isolation runs in parallel with drain or decommission or shutdown. In this case the stop_transport method would be running two times in parallel. Also the drain after decommission is not disabled, so it may happen that stop_transport will be called two times in a row (however -- not in parallel). Using shared_promise solves all the possible reentrances. (the indentation is deliberately left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:18:43 +03:00
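The idea of letting every caller wait on one shared result can be sketched with Python's asyncio. This is an analogy only: the commit uses Seastar's shared_promise in C++, while the `TransportStopper` class, its `stop_count` counter and the `demo` helper below are hypothetical names chosen for illustration.

```python
import asyncio


class TransportStopper:
    """Runs the stop sequence at most once; all callers await the same task."""

    def __init__(self):
        self._stop_task = None
        self.stop_count = 0  # how many times the real stop sequence ran

    async def _do_stop(self):
        # Stand-in for stopping client services, gossip, messaging, streaming.
        await asyncio.sleep(0)
        self.stop_count += 1

    async def stop_transport(self):
        if self._stop_task is None:
            # The first caller starts the stop sequence.
            self._stop_task = asyncio.create_task(self._do_stop())
        # Concurrent and later callers simply wait for the same task.
        await self._stop_task


async def demo():
    stopper = TransportStopper()
    # Two parallel callers, e.g. isolation racing with drain.
    await asyncio.gather(stopper.stop_transport(), stopper.stop_transport())
    # A later sequential caller, e.g. drain after decommission.
    await stopper.stop_transport()
    return stopper.stop_count
```

Running `asyncio.run(demo())` executes the stop sequence a single time no matter how many callers arrive, whether in parallel or one after another.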
Pavel Emelyanov	b0199b005d	storage_service: Stop transport on decommission The stop_transport sequence is: - stop client services (cql, thrift, alternator) - stop gossiping - stop messaging - stop stream manager The decommissioning goes very similarly - stop client services - stop batchlog manager - stop gossiping - stop messaging So this change makes decommission stop all networking _before_ batchlog, like it's already done on drain, and additionally stop the streaming manager. This change is prerequisite for fixing race between transport stop and isolation (on disk error). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:15:38 +03:00
Piotr Jastrzebski	995eb8c274	compaction_with_fully_expired_table: Remove some LA specific code Following patches will switch all sstable writing tests to use the latest sstables format. compaction_with_fully_expired_table contains some test for a LA specific behaviour so let's remove it to make the switch possible. For more context see https://github.com/scylladb/scylla/issues/2620 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	8ff37bec17	sstable_mutation_test: test latest sstable format instead of LA Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	80f8f970e9	sstable_test: Test MX sstables instead of KA/LA Replace calls to make_compressed_file_k_l_format_input_stream with calls to make_compressed_file_m_format_input_stream. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	131a0babc0	sstable_datafile_test: Fix schema used by check_compacted_sstables check_compacted_sstables is used in compact_02 test which uses sstables created by compact_sstables. The problem is that schema used in check_compacted_sstables and compact_sstables is not the same. The type of r1 column is different. This was not a problem when the test was running on LA sstables but following patches will switch all the tests to use MC and then sstable schema becomes validated when reading the sstable and the test will fail such validation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	680e341f54	sstables: Remove LA/KA sstable writting tests that check exact format Those tests check that created sstables have exactly the expected bytes inside. This won't work with other sstable formats and writing LA/KA sstables will be removed by the following patches so there's nothing we can do with those tests but to remove them. Otherwise they will be failing after LA/KA writing capability is removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	2bd6ad1e2f	sstables: define writable_sstable_versions and use it instead of all_sstable_versions in tests that check writing of sstables. Following patches remove LA/KA writer so we want tests to be ready for that and not break by trying to write LA/KA sstables. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	1bdcef6890	features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled These features have been around for over 2 years and every reasonable deployment should have them enabled. The only case when those features could be not enabled is when the user has used enable_sstables_mc_format config flag to disable MC sstable format. This case has been eliminated by removing enable_sstables_mc_format config flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Pavel Emelyanov	3552e99ce7	scylla-gdb: Bring scylla netw back to work The netw command tries to access the netw::_the_messaging_service that was removed long ago. The correct place for the messaging service is in debug:: namespace. The scylla-gdb test checks that, but the netw command sees that the ptr in question is not initialized, thinks it's not yet sharded::start()-ed and exits without errors. tests: unit(gdb) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210624135107.12375-1-xemul@scylladb.com>	2021-06-24 20:59:27 +03:00
Nadav Har'El	4d7f55a29f	cql: add configurable restriction of DateTieredCompactionStrategy DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time (users should use TimeWindowCompactionStrategy, TWCS, instead). This patch adds a new configuration option - restrict_dtcs - which can be used to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE statements. This is part of a "safe mode" effort to allow an installation to restrict operations which are un-recommended or dangerous. The new restrict_dtcs option has three values: "true", "false", and "warn": For the time being, "false" is still the default, and means DTCS is not restricted and can still be used freely. We can easily change this default in a followup patch. Setting a value of "true" means that DTCS is restricted - trying to create a table or alter a table with it will fail with an error. Setting a value of "warn" will allow the create or alter operation, but will warn the user - both with a warning message which will immediately appear in cqlsh (for example), and with a log message. Fixes #8914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210624122411.435361-1-nyh@scylladb.com>	2021-06-24 20:59:27 +03:00
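The three-valued behaviour of restrict_dtcs described above can be summarised in a small Python sketch. The function `check_dtcs`, the `RestrictedOperation` exception and the warning text are hypothetical; only the option name and its three values come from the commit message.

```python
class RestrictedOperation(Exception):
    """Raised when a restricted operation is rejected (hypothetical type)."""


def check_dtcs(strategy_class, restrict_dtcs="false"):
    """Return None if allowed, a warning string for "warn", raise for "true"."""
    if restrict_dtcs not in ("true", "false", "warn"):
        raise ValueError("restrict_dtcs must be 'true', 'false' or 'warn'")
    if strategy_class != "DateTieredCompactionStrategy" or restrict_dtcs == "false":
        # Other strategies are unaffected; "false" leaves DTCS unrestricted.
        return None
    if restrict_dtcs == "true":
        # Restricted: the CREATE TABLE or ALTER TABLE statement fails.
        raise RestrictedOperation("DateTieredCompactionStrategy is restricted")
    # "warn": the statement is allowed, but the caller gets a warning to show and log.
    return ("DateTieredCompactionStrategy is not recommended, "
            "consider TimeWindowCompactionStrategy instead")
```

A caller would surface the returned string both to the client and to the log, matching the "warn" behaviour described in the message.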
Benny Halevy	b96eeaefe4	utils: merge_to_gently: eliminate extraneous loop on merge first1 will point to the inserted value which is a copy of first2. Since list2 is sorted in ascending order, the next item from list2 will never be less than the one we've just inserted, so we waste an iteration to merely increment first1 again. Note that the standard states that no iterators or references are invalidated on insert so we can safely keep looking at `first1` after inserting a copy of `first2` before it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-24 14:58:12 +03:00
Benny Halevy	453e7c8795	utils: merge_to_gently: prevent stall in std::copy_if std::copy_if runs without yielding. See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480 Note that the standard states that no iterators or references are invalidated on insert so we can keep inserting before last1 when merging the remainder of list2 at the tail of list1. Fixes #8897 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-24 14:47:25 +03:00
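The merge loop described in the two merge_to_gently commits can be approximated in Python with asyncio, where `await asyncio.sleep(0)` stands in for yielding to the reactor. This is a sketch, not the C++ implementation: it uses index arithmetic on Python lists (whose insert is O(n), unlike std::list), and the function body is an assumption based only on the commit messages.

```python
import asyncio


async def merge_to_gently(list1, list2):
    """Merge sorted list2 into sorted list1 in place, yielding between steps."""
    i = 0
    for item in list2:
        # Skip elements of list1 that are not greater than item, so that on
        # ties the elements already in list1 stay first.
        while i < len(list1) and not (item < list1[i]):
            i += 1
            await asyncio.sleep(0)  # give other tasks a chance to run
        list1.insert(i, item)
        # Step over the copy just inserted: list2 is sorted, so its next item
        # is never less than this one and comparing against it would be a
        # wasted iteration.
        i += 1
        await asyncio.sleep(0)
    return list1


merged = asyncio.run(merge_to_gently([1, 3, 5], [2, 3, 6]))
```

Once list1 is exhausted the remaining items of list2 are appended one by one, still yielding after each insertion, which corresponds to replacing the non-yielding std::copy_if for the tail.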
Benny Halevy	9bbe7b1482	sstables: mx_sstable_mutation_reader: enforce timeout Check if the timeout has expired before issuing I/O. Note that the sstable reader input_stream is not closed when the timeout is detected. The reader must be closed anyhow after the error bubbles up the chain of readers and before the reader is destroyed. This might already happen if the reader times out while waiting for reader_concurrency_semaphore admission. Test: unit(dev), auth_test.test_alter_with_timeouts(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>	2021-06-24 12:26:57 +02:00
Kamil Braun	a3f3563828	storage_service: check for existing normal token owners before bootstrapping The bootstrap procedure starts by "waiting for range setup", which means waiting for a time interval specified by the `ring_delay` parameter (30s by default) so the node can receive the tokens of other nodes before introducing its own tokens. However it may sometimes happen that the node doesn't receive the tokens. There are no explicit checks for this. But the code may crash in weird ways if the tokens-received assumption is false, and we are lucky if it does crash (instead of, for example, allowing the node to incorrectly bootstrap, causing data loss in the process). Introduce an explicit check-and-throw-if-false: a bootstrapping node now checks that there's at least one NORMAL token in the token ring, which means that it had to have contacted at least one existing node in the cluster, which means that it received the gossip application states of all nodes from that node; in particular the tokens of all nodes. Also add an assert in CDC code which relies on that assumption (and would cause weird division-by-zero errors if the assumption was false; better to crash on assert than this). Ref #8889. Closes #8896	2021-06-24 13:19:08 +03:00
Asias He	2ad8fb756e	gossip: Promote gossip quarantine over log to info level 1) Start node n1, n2, n3 2) Bootstrap n4 and kill n4 in the middle of bootstrap 3) Wipe data on n4 and start n4 again After step 2, n1, n2 and n3 will remove n4 from gossip after fat_client_timeout and put n4 in quarantine for quarantine_delay(). If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2 and n3 will ignore gossip updates from n4, and n4 will not learn gossip updates from the cluster. After PR #8896, the bootstrap will be rejected. This patch promotes the gossip quarantine over log to info level, so that dtest can wait for the log to bootstrap the node again. Refs #8889 Refs #8890 Closes #8905	2021-06-24 12:51:32 +03:00
Michael Livshin	9b9efb2b42	disable caching of the system.large_* tables The cache of system.large_{partition,rows,cells} accumulates range tombstones (https://github.com/scylladb/scylla/issues/7750), and those range tombstones can be evicted only together with their partition (https://github.com/scylladb/scylla/issues/3288). Making the system.large_* tables uncached should work around the problem until #3288 is fixed. Fixes #8874 Refs #7750 Refs #3288 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>	2021-06-24 12:26:45 +03:00
Piotr Sarna	ae9e52a774	Merge 'Cleanup and improvements for docs/alternator/alternator.md' from Nadav Har'El Make some improvements to docs/alternator.md as suggested by a user who had trouble understanding the previous version, and also a few other random cleanups. Closes #8910 * github.com:scylladb/scylla: docs/alternator/alternator.md: improve "Running Alternator" section docs/alternator/alternator.md: correct minor typos docs/alternator/alternator.md: fix link format	2021-06-24 12:03:26 +03:00
Avi Kivity	14252c8b71	Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695 ) (v3)' from Calle Wilund Fixes #8270 If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall. We need to take a few things into account: 1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here. 2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task. 3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider starting to flush once we've used up enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit. 4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to issue segment prunes from end_flush() so we wake up actual file deletion/recycling 5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate 6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above. 7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way. Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything). New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test. 
Closes #8764 * github.com:scylladb/scylla: commitlog_test: Add test case for usage/disk size threshold mismatch commitlog_test: Improve test assertion commitlog: Add waitable future for background sync/flush commitlog: abort queues on shutdown commitlog: break out "abort" calls into member functions commitlog: Do explicit discard+delete in shutdown commitlog: Recycle or not should not depend on shutdown state commitlog: Issue discard_unused_segments on segment::flush end IFF deletable commitlog: Flush all segments if we only have one. commitlog: Always force flush if segment allocation is waiting commitlog: Include segment wasted (slack) size in footprint check commitlog: Adjust (lower) usage threshold	2021-06-24 12:03:26 +03:00
Pavel Emelyanov	a61afe8421	btree: Improve unlink_leftmost_without_rebalance() The helper is used to walk the tree key-by-key destroying it in the mean time. Current implementation of this method just uses the "regular" erasing code which actually rebalances the tree despite the name. The biggest problem with removing the rebalancing is that at some point non-balanced tree may have the left-most key on an inner node, so to make 100% rebalance-less unlink every other method of the tree would have to be prepared for that. However, there's an option to make "light rebalance" (as it's called in this patch) that only maintains this crucial property of the tree -- the left-most key is on the leaf. Some more tech details. Current rebalancer starts when the node population falls below 1/2 of its capacity and tries to - grab a key from one of the siblings if it's balanced - merge two siblings together if they are small enough The light rebalance is lighter in two ways. First, it leaves the node unbalanced until it becomes empty. And then it goes ahead and replaces it with the next sibling. This change removes ~60% of the keys movements on random test. Keys still move when the sibling replace happens because in this case the separation key needs to be placed at the right sibling 0 position which means shifting all its keys right. tests: unit(debug) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210623083836.27491-1-xemul@scylladb.com>	2021-06-24 12:03:26 +03:00
Raphael S. Carvalho	ab9d08d80e	sstables: Remove unused filtering reader from sstable_set::make_local_shard_sstable_reader() It's been a long time since table no longer accepts shared sstables, so this code which creates a filtering reader, if sstable is shared, is never used. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210618200026.1002621-2-raphaelsc@scylladb.com>	2021-06-24 12:03:26 +03:00
Raphael S. Carvalho	88119a5c81	distributed_loader: Kill table's _sstables_opened_but_not_loaded _sstables_opened_but_not_loaded was needed because the old loader would open sstables from all shards before loading them. In the new loader, introduced with reshape, make_sstables_available() is called on each shard after resharding and reshape finished, so there's no need whatsoever for that mess. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>	2021-06-24 12:03:26 +03:00
Tomasz Grabiec	ee28eb4100	Merge "test: raft: move some tests to `raft` folder" from Pavel Solodovnikov Move `raft_sys_table_storage_test` and `raft_address_map_test` to `test/raft` folder since they naturally belong here, not in `test/boost` folder. Tests: unit(dev) * manmanson/move_some_raft_tests_to_raft_folder: test: raft: move `raft_address_map_test` to `raft` folder test: raft: move `raft_sys_table_storage_test` to `raft` folder configure: add extended raft testing dependencies	2021-06-24 12:03:26 +03:00
Pavel Emelyanov	e031e7b0a7	scylla-gdb: Do not leave random offset in smp-queues known vptrs The process of getting a queue pointer is quite tricky here. First, it checks if the vptr resolves into 'vtable for async_work_item' and puts a None mark into known_vptrs dict. Then, if the entry is present there are two options. First, if it's NOT None, it's translated directly into the queue object. But if it's None, then a loop over an offset starts that tries to check whether the vptr + offset maps to a queue. And here's the problem -- if no offsets were mapped to any specific queues the last checked offset is put into the known vptrs dict, so the next vptrs will miss the 2nd offset checking, but will go ahead and use the "random" offset that had failed previously. tests: unit(gdb) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210624085723.7156-1-xemul@scylladb.com>	2021-06-24 12:03:22 +03:00
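The caching rule described in the commit above can be sketched roughly as follows. This is not the scylla-gdb code: `resolve_queue`, `queue_at` and the candidate offsets are hypothetical names chosen for illustration. It only shows the idea of remembering an offset when a probe actually succeeds.

```python
def resolve_queue(vptr, known_vptrs, queue_at, candidate_offsets=(0, 8, 16)):
    """Return the queue for `vptr`, or None if no candidate offset matches.

    `known_vptrs` maps a vptr to a cached offset, or to None when the vptr
    is recognised but no working offset has been found yet.
    `queue_at` is a hypothetical lookup: address -> queue object or None.
    """
    if vptr not in known_vptrs:
        return None  # not a recognised work item at all
    cached = known_vptrs[vptr]
    if cached is not None:
        return queue_at(vptr + cached)
    for offset in candidate_offsets:
        queue = queue_at(vptr + offset)
        if queue is not None:
            known_vptrs[vptr] = offset  # cache successful probes only
            return queue
    # Nothing matched: keep the None mark so the next lookup probes again
    # instead of reusing an offset that has already failed.
    return None


# Hypothetical address space: exactly one address maps to a queue.
queues = {0x1010: "smp-queue-0"}
known = {0x1000: None, 0x2000: None}
found = resolve_queue(0x1000, known, queues.get)
missing = resolve_queue(0x2000, known, queues.get)
```

After these two calls, `known[0x1000]` holds the working offset while `known[0x2000]` is still `None`, so a later lookup for that vptr probes all offsets again.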
Nadav Har'El	b965bc76e0	docs/alternator/alternator.md: improve "Running Alternator" section A user complained that the "Running Alternator" section was confusing. It didn't say outright which two configurations are necessary and you had to read a few paragraphs to reach it, and it mixed the YAML names of options and the command-line names, which are subtly different. This patch tries to improve this.	2021-06-23 19:41:52 +03:00
Tomasz Grabiec	a60e73fe14	Merge "raft: allow to initiate leader stepdown process explicitly" from Gleb Sometimes an ability to force a leader change is needed. For instance if a node that is currently serving as a leader needs to be brought down for maintenance. If it will be shutdown without leadership transfer the cluster will be unavailable for leader election timeout at least. * scylla-dev/raft-stepdown-v4: raft: test: test leadership transfer timeout raft: allow to initiate leader stepdown process	2021-06-23 00:14:46 +02:00
Pavel Solodovnikov	a96ddbec35	test: raft: move `raft_address_map_test` to `raft` folder Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-22 23:33:22 +03:00
Pavel Solodovnikov	cf5025c44e	test: raft: move `raft_sys_table_storage_test` to `raft` folder Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-22 23:31:41 +03:00
Pavel Solodovnikov	6912f76e45	configure: add extended raft testing dependencies Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies` and introduce `scylla_raft_dependencies` that contains `scylla_core` (i.e., all scylla source files). The new `scylla_raft_dependencies` variable will be used for `raft_address_map_test` and `raft_sys_table_storage_test`, which use a lot of machinery from scylla. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-22 23:26:18 +03:00
Nadav Har'El	3895d4bb99	docs/alternator/alternator.md: correct minor typos Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-22 20:03:48 +03:00
Benny Halevy	4ab4f63efe	sstables: mx/writer: flush_tmp_bufs: maybe_yield in loop This loop may cause pretty long reactor stalls as seen in https://github.com/scylladb/scylla/issues/8900 Apparently output_stream<CharType>::slow_write returns a ready future and no yielding is considered, so add a check in the top level loop (that must already be called from a seastar thread). Fixes #8900 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210622152206.156302-1-bhalevy@scylladb.com>	2021-06-22 18:56:12 +03:00
Avi Kivity	d27e88e785	Merge "compaction: prevent broken_promise or dangling reader errors" from Benny " This series prevents broken_promise or dangling reader errors when (resharding) compaction is stopped, e.g. during shutdown. At the moment compaction just closes the reader unilaterally and this yanks the reader from under the queue_reader_handle feet, causing dangling queue reader and broken_promise errors as seen in #8755. Instead, fix queue_reader::close to set value on the _full/_not_full promises and detach from the handle, and return _consume_fut from bucket_writer::consume if handle is terminated. Fixes #8755 Test: unit(dev) DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug) " * tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla: mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated queue_reader_handle: add get_exception method queue_reader: close: set value on promises on detach from handle	2021-06-22 18:52:11 +03:00
Nadav Har'El	5bb4966cac	docs/alternator/alternator.md: fix link format Unfortunately the scylla.docs.scylladb.com formatter which generates https://scylla.docs.scylladb.com/master/alternator/alternator.html doesn't know how to recognize HTTP URLs and convert them into proper HTML links (something which github's formatter does). So convert the two URLs we had in alternator.md into markdown links which both github and our formatter recognize. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-22 18:43:27 +03:00
Calle Wilund	373fa3fa07	table: ensure memtable is actually in memtable list before erasing Fixes #8749 If a table::clear() was issued while we were flushing a memtable, the memtable is already gone from the list. We need to check this before erase. Otherwise we get random memory corruption via std::vector::erase v2: * Make interface more set-like (tolerate non-existence in erase). Closes #8904	2021-06-22 15:58:56 +02:00
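A "set-like" erase that tolerates a missing element, as the commit above describes, might look like the sketch below. It is written in Python purely for illustration; the `erase` helper and the stand-in memtable objects are assumptions, not the actual `table` interface.

```python
def erase(memtables, target):
    """Remove `target` from the list if it is present.

    Returns True when something was removed and False when the element
    was already gone (for example because the list was cleared while a
    flush was still in progress).
    """
    for index, memtable in enumerate(memtables):
        if memtable is target:
            del memtables[index]
            return True
    return False


# Hypothetical stand-ins for memtables.
a, b = object(), object()
active = [a, b]
first = erase(active, a)   # element is present, so it is removed
active.clear()             # simulates a clear() racing with a flush
second = erase(active, b)  # element already gone, tolerated
```

Checking membership before erasing is what avoids handing an invalid position to the underlying container.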
Asias He	ffa211a8c7	repair: Avoid copy rows in apply_rows_on_master_in_thread The rows are not used after the call to to_repair_rows_list. Use std::move() to avoid copying. Fixes #8902 Closes #8903	2021-06-22 15:58:56 +02:00
Benny Halevy	02917c79b6	logalloc: get rid of unused _descendant_blocked_requests Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>	2021-06-22 15:58:56 +02:00
Piotr Dulikowski	de1679b1b9	hints: make hints concurrency configurable and reduce the default Previously, hinted handoff had a hardcoded concurrency limit - at most 128 hints could be sent from a single shard at once. This commit makes this limit configurable by adding a new configuration option: `max_hinted_handoff_concurrency_per_shard`. This option can be updated in runtime. Additionally, the default concurrency per shard is made lower and is now 8. The motivation for reducing the concurrency was to mitigate the negative impact hints may have on performance of the receiving node due to them not being properly isolated with respect to I/O. Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) Refs: #8624 Closes #8646	2021-06-22 15:58:56 +02:00
Gleb Natapov	09528b8671	raft: test: test leadership transfer timeout Test that if leadership transfer cannot be done in configured time frame fsm cancels the leadership transfer process. Also check that timeout_now message is resent on each tick while leadership transfer is in progress.	2021-06-22 14:42:50 +03:00
Gleb Natapov	ed49d29473	raft: allow to initiate leader stepdown process Sometimes an ability to force a leader change is needed. For instance if a node that is currently serving as a leader needs to be brought down for maintenance. If it will be shutdown without leadership transfer the cluster will be unavailable for leader election timeout at least. We already have a mechanism to transfer the leadership in case an active leader is removed. The patch exposes it as an external interface with a timeout period. If a node is still a leader after the timeout the operation will fail.	2021-06-22 14:36:42 +03:00
Konstantin Osipov	bd410da77a	raft: (service) rename raft_services service to raft_group_registry This is a more informative name. Helps see that, say, group0 is a separate service and not bundle all raft services together. Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Konstantin Osipov	025f18325e	raft: (service) move raft service to namespace service Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Calle Wilund	fdb5801704	table: Always use explicit commitlog discard + clear out rp_set Fixes #8733 If a memtable flush is still pending when we call table::clear(), we can end up doing a "discard-all" call to commitlog, followed by a per-segment-count (using rp_set) _later_. This will foobar our internal usage counts and quite probably cause assertion failures. Fixed by always doing per-memtable explicit discard call. But to ensure this works, since a memtable being flushed remains on memtable list for a while (why?), we must also ensure we clear out the rp_set on discard. v3: * Fix table::clear to discard rp_sets before memtables Closes #8894	2021-06-21 14:53:54 +03:00
Takuya ASADA	a677c46672	dist: stop removing /etc/systemd/system/.mount on package uninstall Listing /etc/systemd/system/.mount as ghost file seems incorrect, since user may want to keep using RAID volume / coredump directory after uninstalling Scylla, or user may want to upgrade enterprise version. Also, we mixed two types of files as ghost file, it should handle differently: 1. automatically generated by postinst scriptlet 2. generated by user invoked scylla_setup The package should remove only 1, since 2 is generated by user decision. See scylladb/scylla-enterprise#1780 Closes #8810	2021-06-21 14:53:54 +03:00
Calle Wilund	0a7823e683	commitlog_test: Add test case for usage/disk size threshold mismatch Refs #8270 Tries to simulate case where we mismatch segments usage with actual disk footprint and fail to flush enough to allow segment recycling	2021-06-21 06:01:19 +00:00
Calle Wilund	954da1f0a9	commitlog_test: Improve test assertion Changes it so actual data is printed, not just error.	2021-06-21 06:01:19 +00:00
Calle Wilund	d6113912cd	commitlog: Add waitable future for background sync/flush Commitlog timer issues un-waited syncs on all segments. If such a sync takes too long we can end up keeping a segment alive across a shutdown, causing the file to be left on disk, even if actually clean. This adds a future in segment_manager that is "chained" with all active syncs (hopefully just one), and ensures we wait for this to complete in shutdown, before pruning and deleting segments	2021-06-21 06:01:19 +00:00
Benny Halevy	499357fb43	row_cache: autoupdating_underlying_reader: fast_forward_to: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210613104232.634621-2-bhalevy@scylladb.com>	2021-06-20 14:46:35 +03:00
Benny Halevy	3db7db5743	row_cache: autoupdating_underlying_reader: fast_forward_to: capture snapshot by value when updating reader Currently we capture the snapshot mutation_source by reference for calling create_underlying_reader after closing the reader. However, if close_reader yields, the snapshot reference passed may be gone, so capture it by value instead. Fixes #8848 Test: unit(dev) DTest: restore_snapshot_using_old_token_ownership_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>	2021-06-20 14:46:35 +03:00
Avi Kivity	5b3fb83ebe	Merge "Remove unused code here and there" from Pavel E " Few randomly spotted dead code locations over past time. Compile-test only. " * 'br-remove-unused-stuff' of https://github.com/xemul/scylla: database: Remove unused forward declarations feature: Remove unused friendship with gossiper schema_tables: Remove unused sharded<proxy> argument database: Remove few unused sharded<proxy> captures view_update_generator: Remove unused struct sstable_with_table storage_service: Remove write-only _force_remove_completion distributed_loader: Remove unused load-prio manipulations	2021-06-20 12:01:40 +03:00
Pavel Emelyanov	ab4fc41f25	database: Remove unused forward declarations Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	d606321575	feature: Remove unused friendship with gossiper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	96131349e8	schema_tables: Remove unused sharded<proxy> argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	0f36f00682	database: Remove few unused sharded<proxy> captures Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	64bb16af8a	view_update_generator: Remove unused struct sstable_with_table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	cbcbf648b6	storage_service: Remove write-only _force_remove_completion This boolean became effectively unused after `829b4c14` (repair: Make removenode safe by default) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	7396de72b1	distributed_loader: Remove unused load-prio manipulations Mostly this was removed by `6dfeb107` (distributed_loader: remove unused code). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pekka Enberg	055bc33f0f	Update tools/java submodule * tools/java 599b2368d6...5013321823 (4): > cassandra-stress: fix failure due to the assert exception on disconnect when test is completed > node_probe: toppartitions: Fix wrong class in getMethod > Fix NullPointerException in SettingsMode > cassandra-stress: Remove maxPendingPerConnection default	2021-06-18 14:19:34 +03:00
Pekka Enberg	2a9443a753	Update tools/jmx submodule * tools/jmx a7c4c39...5311e9b (2): > storage_service: takeSnapshot: support the skipFlush option > build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient	2021-06-18 14:19:29 +03:00
Avi Kivity	b099e7c254	Merge "Untie hints managers and storage service" from Pavel E " The storage service is carried along storage proxy, hints resource manager and hints managers (two of them) just to subscribe the hints managers on lifecycle events (and stop the subscription on shutdown) emitted from storage service. This dependency chain can be greatly simplified, since the storage proxy is already subscribed on lifecycle events and can kick managers directly from its hooks. tests: unit(dev), dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev) " * 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla: hints: Drop storage service from managers hints: Do not subscribe managers on lifecycle events directly	2021-06-17 17:12:31 +03:00
Nadav Har'El	a9b383f423	cql-pytest: improve test for SSL/TLS versions The existing test_ssl.py which tests for Scylla's support of various TLS and SSL versions, used a deprecated and misleading Python API for choosing the protocol version. In particular, the protocol version ssl.PROTOCOL_SSLv23 is not, despite its name, SSL versions 2 or 3, or SSL at all - it is in fact an alias for the latest TLS version :-( This misunderstanding led us to open the incorrect issue #8837. So in this patch, we avoid the old Python APIs for choosing protocols, which were gradually deprecated, and switch to the new API introduced in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum desired protocol version. With this new API, we can correctly connect with various versions of the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the fixed test, we confirm that Scylla does not allow SSLv3 - as desired - so issue #8837 is a non-issue. Moreover, after issue #8827 was already fixed, this test now passes, so the "xfail" mark is removed. Refs #8837. Refs #8827. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210617134305.173034-1-nyh@scylladb.com>	2021-06-17 17:06:31 +03:00
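The newer Python API mentioned in the commit above selects a protocol range through `SSLContext.minimum_version` and `SSLContext.maximum_version`. The sketch below is not the code from test_ssl.py; the `make_context` helper is a hypothetical name, and disabling certificate verification is an assumption that only suits a local test setup.

```python
import ssl


def make_context(min_version, max_version):
    """Build a client context restricted to [min_version, max_version]."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Test-only assumption: the server uses a self-signed certificate.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.minimum_version = min_version
    ctx.maximum_version = max_version
    return ctx


# Pin the connection to exactly TLSv1.2.
ctx = make_context(ssl.TLSVersion.TLSv1_2, ssl.TLSVersion.TLSv1_2)
```

A test can then loop over the versions it wants to probe, building one pinned context per version and checking whether the handshake is accepted or refused.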
Nadav Har'El	8f107ece9f	Update seastar submodule * seastar 813eee3e...0e48ba88 (5): > net/tls: on TLS handshake failure, send error to client > net/dns: fix build on gcc 11 > core: fix docstring for max_concurrent_for_each > test: alien_test: replace deprecated call to alien::submit_to() with new variant > alien: prepare for multi-instance use The fix "net/tls: on TLS handshake failure, send error to client" fixes #8827. The test test/cql-pytest/run --ssl test_ssl.py now xpasses, so I'll remove the "xfail" mark in a followup patch.	2021-06-17 16:24:57 +03:00
Pavel Emelyanov	92a4278cd1	hints: Drop storage service from managers The storage service pointer is only used to (un)subscribe to (from) lifecycle events. Now that the subscription is gone, the storage service pointer can go too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:09:36 +03:00
Pavel Emelyanov	acdc568ecf	hints: Do not subscribe managers on lifecycle events directly Managers sit on storage proxy which is already subscribed on lifecycle events, so it can "notify" hints managers directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:06:26 +03:00
Tomasz Grabiec	6d8440fe70	Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov The patch set introduces a few leadership transfer tests, some of them are adaptations of corresponding etcd tests (e.g. `test_leader_transfer_ignore_proposal` and `test_transfer_non_member`). Others test different scenarios ensuring that pending leadership transfer doesn't disrupt the rest of the cluster from progressing: Lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and `test_leader_transferee_dies_upon_receiving_timeout_now`) as well as lost `vote_request(force)` from the new candidate (`test_leader_transfer_lost_force_vote_request`) don't impact the election process following that and the leader is elected as normal. * manmanson/leadership_transfer_tests_v3: raft: etcd_test: test_transfer_non_member raft: etcd_test: test_leader_transfer_ignore_proposal raft: fsm_test: test_leader_transfer_lost_force_vote_request raft: fsm_test: test_leader_transfer_lost_timeout_now raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now	2021-06-17 13:58:31 +02:00
Piotr Sarna	8cca68de75	cql3: add USING TIMEOUT support for deletes Turns out the DELETE statement already supports attributes like timestamp, so it's ridiculously easy to add USING TIMEOUT support - it's just the matter of accepting it in the grammar. Fixes #8855 Closes #8876	2021-06-17 14:21:01 +03:00
Nadav Har'El	45c2442f49	Merge 'Avoid large allocs in mv update code' from Piotr Sarna This series addresses #8852 by: * migrating to chunked_vector in view update generation code to avoid large allocations * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust. Tests: unit(release) Refs #8852 Closes #8856 * github.com:scylladb/scylla: db,view: limit the number of simultaneous view update futures db,view: use chunked_vector for view updates	2021-06-17 14:01:38 +03:00
Avi Kivity	4d70f3baee	storage_proxy: change unordered_set<inet_address> to small_vector in write path The write paths in storage_proxy pass replica sets as std::unordered_set<gms::inet_address>. This is a complex type, with N+1 allocations for N members, so we change it to a small_vector (via inet_address_vector_replica_set) which requires just one allocation, and even zero when up to three replicas are used. This change is more nuanced than the corresponding change to the read path `abe3d7d7` ("Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity"), for two reasons: - there is a quadratic algorithm in abstract_write_response_handler::response(): it searches for a replica and erases it. Since this happens for every replica, it happens N^2/2 times. - replica sets for writes always include all datacenters, while reads usually involve just one datacenter. So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2 =105 compares. We could remove this by sending the index of the replica in the replica set to the replica and ask it to include the index in the response, but I think that this is unnecessary. Those 105 compares need to be only 105/15 = 7 times cheaper than the corresponding unordered_set operation, which they surely will. Handling a response after a cross-datacenter round trip surely involves L3 cache misses, and a small_vector reduces these to a minimum compared to an unordered_set with its bucket table, linked list walking and management, and table rehashing. Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000 --task-quota-ms show two allocations removed (as expected) and a nice reduction in instructions executed. before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op) after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op) Closes #8847	2021-06-17 13:46:40 +03:00
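The 105-compare figure in the commit above can be reproduced with a small simulation. This is a sketch, not the storage_proxy code: it models the replica set as a plain list, assumes responses arrive in the least favourable order, and counts only the compares that fail before the matching replica is found. The function name is hypothetical.

```python
def worst_case_compares(replica_count):
    """Count failed compares when each response triggers a linear
    search-and-erase and responses arrive in the worst possible order."""
    pending = list(range(replica_count))
    compares = 0
    # Worst case: the responding replica is always last in the list.
    for replica in reversed(range(replica_count)):
        index = 0
        while pending[index] != replica:
            compares += 1
            index += 1
        del pending[index]
    return compares


total = worst_case_compares(15)  # 15 replicas, as in the commit's example
per_response = total / 15
```

For 15 replicas this gives 15*(15-1)/2 = 105 compares in total, or 7 per response, which is the budget the commit message weighs against a single `unordered_set` lookup.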
Avi Kivity	98cdeaf0f2	schema_tables: make the_merge_lock thread_local the_merge_lock is global, which is fine now because it is only used in shard 0. However, if we run multiple nodes in a single process, there will be multiple shard 0:s, and the_merge_lock will be accessed from multiple threads. This won't work. To fix, make it thread_local. It would be better to make it a member of some controlling object, but there isn't one. Closes #8858	2021-06-17 13:41:11 +03:00
Avi Kivity	00ff3c1366	Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy The option is provided by nodetool snapshot https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/ ``` nodetool [(-h <host> \| --host <host>)] [(-p <port> \| --port <port>)] [(-pp \| --print-port)] [(-pw <password> \| --password <password>)] [(-pwf <passwordFilePath> \| --password-file <passwordFilePath>)] [(-u <username> \| --username <username>)] snapshot [(-cf <table> \| --column-family <table> \| --table <table>)] [(-kc <kclist> \| --kc.list <kclist>)] [(-sf \| --skip-flush)] [(-t <tag> \| --tag <tag>)] [--] [<keyspaces...>] -sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data) ``` But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167) and not supported at the api level. This patch adds support for the option in advance from the api service level down via snapshot_ctl to the table class and snapshot implementation. In addition, a corresponding unit test was added to verify that taking a snapshot with `skip_flush` does not flush the memtable (at the table::snapshot level). Refs #8725 Closes #8726 * github.com:scylladb/scylla: test: database_test: add snapshot_skip_flush_works api: storage_service/snapshots: support skip-flush option snapshot: support skip_flush option table: snapshot: add skip_flush option api: storage_service/snapshots: add sf (skip_flush) option	2021-06-17 13:32:23 +03:00
Nadav Har'El	7fd7e90213	cql-pytest: translate Cassandra's tests for static columns This is a translation of Cassandra's CQL unit test source file validation/entities/StaticColumnsTest.java into our cql-pytest framework. This test file checks various features of static columns. All these tests pass on Cassandra, and all but one pass on Scylla. The xfailing test, testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra allows but we don't. The new issue about that is: Refs #8869. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616141633.114325-1-nyh@scylladb.com>	2021-06-17 11:08:28 +02:00
Nadav Har'El	b6b4df9a47	heat-weighted load balancing: improve handling of near-perfect cache Consider two nodes with almost-100% cache hit ratio, but not exactly 100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we want to equalize the miss rate in both nodes. So we send the first node twice the number of requests we send to the second. But unless the disks are extremely limited, this doesn't make sense: As a numeric example, consider that we send 2000 requests to the first node and 1000 to the second, just so the number of misses will be the same - 2 (0.1% and 0.2% misses, respectively). At such low miss numbers, the assumption that the disk reads are the slowest part of the operation is wrong, so trying to equalize only this part is wrong. So above some threshold hit rate, we should treat all hit rates as equivalent. In the code we already had such a threshold - max_hit_rate, but it was set to the incredibly high 0.999. We saw in actual user runs (see issue #8815) that this threshold was too high - one node received twice the amount of requests that another did - although both had near-100% cache hit rates. So in this patch we lower the max_hit_rate to 0.95. This will have two consequences: 1. Two nodes with hit rates above 0.95 will be considered to have the same hit rate, so they will get equal amount of work - even if one has hit rate 0.98 and the other 0.99. 2. A cold node with hit rate 0.0 will get 5% of the work of a node with the perfect hit rate limited to 0.95. This will allow the cold node to slowly warm up its cache. Before this patch, if the hot node happened to have a hit rate of 0.999 (the previous maximum), the cold node would get just 0.1% of the work and remain almost idle and fill its cache extremely slowly - which is a waste. Fixes #8815. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616180732.125295-1-nyh@scylladb.com>	2021-06-17 11:02:08 +02:00
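The numbers in the commit above follow from a simple model in which each node's share of requests is inversely proportional to its miss rate, with the hit rate first capped at `max_hit_rate`. The sketch below illustrates only that arithmetic. It is a simplification, not the actual load balancer, and `request_shares` is a hypothetical name.

```python
import math


def request_shares(hit_rates, max_hit_rate):
    """Split requests so that the expected number of misses per node is
    equal, after capping every hit rate at `max_hit_rate` (must be < 1)."""
    weights = [1.0 / (1.0 - min(rate, max_hit_rate)) for rate in hit_rates]
    total = sum(weights)
    return [weight / total for weight in weights]


# Old cap (0.999): 99.9% vs 99.8% hit rates give a 2:1 split.
old = request_shares([0.999, 0.998], max_hit_rate=0.999)
# New cap (0.95): both nodes are treated as equally hot.
new = request_shares([0.999, 0.998], max_hit_rate=0.95)
# Cold node vs. perfectly hot node under the new and the old cap.
cold_new = request_shares([0.0, 1.0], max_hit_rate=0.95)
cold_old = request_shares([0.0, 1.0], max_hit_rate=0.999)
```

Under this model the cold node receives about 5% of the hot node's work with the 0.95 cap and about 0.1% with the old 0.999 cap, matching the figures quoted in the message.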
Piotr Sarna	1fb831c8c1	db,view: limit the number of simultaneous view update futures Previously the view update code generated a continuation for each view update and stored them all in a vector. In certain cases the number of updates can grow really large (to millions and beyond), so it's better to only store a limited amount of these futures at a time.	2021-06-17 10:20:52 +02:00
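The idea of keeping only a bounded number of outstanding futures, as in the commit above, can be sketched with asyncio. This is an illustration under stated assumptions rather than the `mutate_MV` implementation: `send_all`, `max_in_flight` and the `demo` driver are hypothetical names.

```python
import asyncio


async def send_all(updates, send, max_in_flight):
    """Send every update while holding at most `max_in_flight` tasks."""
    in_flight = set()
    for update in updates:
        if len(in_flight) >= max_in_flight:
            # Wait for at least one task to finish before adding another.
            done, in_flight = await asyncio.wait(
                in_flight, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                task.result()  # re-raise any failure
        in_flight.add(asyncio.ensure_future(send(update)))
    if in_flight:
        await asyncio.gather(*in_flight)


async def demo(update_count=50, limit=4):
    state = {"current": 0, "peak": 0, "sent": 0}

    async def send(update):
        state["current"] += 1
        state["peak"] = max(state["peak"], state["current"])
        await asyncio.sleep(0)  # stand-in for the network round trip
        state["current"] -= 1
        state["sent"] += 1

    await send_all(range(update_count), send, limit)
    return state


result = asyncio.run(demo())
```

Memory use is then bounded by the limit rather than by the total number of updates, which is the property the commit is after when the update count reaches the millions.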
Piotr Sarna	a7f7716ecf	db,view: use chunked_vector for view updates The number of view updates can grow large, especially in corner cases like removing large base partitions. Chunked vector prevents large allocations.	2021-06-17 10:15:17 +02:00
Avi Kivity	3c21833aac	cql3: expr: make column_value (and similar) a first-class expression Currently, column names can only appear in a boolean binary expression, but not on their own. This means that in the statement SELECT a FROM tab WHERE a > 3; We can represent the WHERE clause as an expression, but not the selector. To pave the way for using expressions in selector contexts, we promote the elements of binary_operator::lhs (column_value, column_value_tuple, token) to be expressions in their own right. binary_operator::lhs becomes an expression (wrapped in unique_ptr, because variants can't contain themselves). Note that all three new possibilities make sense in a selector: SELECT column FROM tab SELECT token(pk) FROM tab SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab There is some fallout from this: - because binary_operator contains a unique_ptr, it is no longer copyable. We add a copy constructor and assignment operator to compensate. - often, the new elements don't make sense when evaluating a boolean expression, which is the only context we had before. We call on_internal_error in these cases. The parser right now prevents such cases from being constructed in the first place (this is equivalent to if (some_struct_value) in C). - in statement_restrictions.cc, we need to evaluate the lhs in the context of the full binary operator. I introduced with_current_binary_operator() for this; an alternative approach is to create a new sub-visitor. Closes #8797	2021-06-17 10:08:58 +03:00
Tomasz Grabiec	6bdf8c4c46	Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja Miscellaneous preparatory patches for group 0 discovery. * scylla-dev/raft-group-0-part-2-v4: raft: (service) servers map is gid -> server, not sid -> server system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID raft: (server) wait for configuration transition to complete raft: (server) implement raft::server::get_configuration() raft: (service) don't throw from schema state machine raft: (service) permit some scylla.raft cells to be empty raft: (service) properly handle failure to add a server raft: implement is_transient_error()	2021-06-17 00:15:40 +02:00
Asias He	7a32cab524	gossip: Fix use-after-free in real_mark_alive and mark_dead In commit `11a8912093` (gossiper: get_gossip_status: return string_view and make noexcept) get_gossip_status returns a pointer to an endpoint_state in endpoint_state_map. After commit `425e3b1182` (gossip: Introduce direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive can yield in the middle of the function. It is possible that endpoint_state can be removed, causing use-after-free to access it. To fix, make a copy before we yield. Fixes #8859 Closes #8862	2021-06-16 21:16:26 +02:00
Konstantin Osipov	18e3fcdbf1	raft: (service) servers map is gid -> server, not sid -> server Raft Group registry should map Raft Group Id to Raft Server, not Raft Server ID (which is identical for all groups) to Raft server. Raft Group 0 ID works as a cluster identifier, so is generated when a new cluster is created and is shared by all nodes of the same cluster. Implement a helper to get raft::server by group id. Consistently throw a new raft_group_not_found exception if there is no server or rpc for the specified group id.	2021-06-16 19:05:50 +03:00
Calle Wilund	14559b5a86	commitlog: abort queues on shutdown In case we only have a single segment active when shutting down, the replenisher can be blocked even though we manually flush-deleted. Add a signal type and abort queues using this to wake up waiter and force them to check shutdown status.	2021-06-16 15:35:56 +00:00
Calle Wilund	227b573cdf	commitlog: break out "abort" calls into member functions	2021-06-16 15:35:56 +00:00
Calle Wilund	5cd9691f00	commitlog: Do explicit discard+delete in shutdown When we are shutting down, before trying to close the gate, we should issue a discard to ensure waking up the replenish task	2021-06-16 15:35:56 +00:00
Calle Wilund	03b8baaa8d	commitlog: Recycle or not should not depend on shutdown state If we are using recycling, we should always use recycle in delete_segments, otherwise we can cause deadlock with replenish task, since it will be waiting for segment, then shutdown is set, and we are called, and can't fulfil the alloc -> deadlock	2021-06-16 15:35:56 +00:00
Calle Wilund	5ebf5835b0	commitlog: Issue discard_unused_segments on segment::flush end IFF deletable If a segment, when finishing a flush call, is deletable, we should issue a manual call to discard function (which moves deletable segments off segment list) asap, since we otherwise are dependent on more calls from flush handlers (memtable flush). And since we could have blocked segment allocation, this can cause deadlocks, at least in tests.	2021-06-16 15:35:56 +00:00
Calle Wilund	cbddcf46aa	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-06-16 15:35:56 +00:00
Calle Wilund	a0f559a44c	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-06-16 15:35:56 +00:00
Calle Wilund	bcf4d07f0b	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-06-16 15:35:56 +00:00
Calle Wilund	1187f5c181	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-06-16 15:35:56 +00:00
Avi Kivity	f05ddf0967	Merge "Improve LSA descriptor encoding" from Pavel " The LSA small objects allocation latency is greatly affected by the way this allocator encodes the object descriptor in front of each allocated slot. Nowadays it's one of VLE variants implemented with the help of a loop. Re-implementing this piece with fewer instructions and without a loop allows greatly reducing the allocation latency. The speed-up mostly comes from loop-less code that doesn't confuse branch predictor. Also the express encoder seems to benefit from writing 8 bytes of the encoded value in one go, rather than byte-by-byte. Perf measurements: 1. (new) logallog test shows ~40% smaller times 2. perf_mutation in release mode shows ~2% increase in tps 3. the encoder itself is 2 - 4 times faster on x86_64 and 1.05 - 3 times faster on aarch64. The speed-up depends on the 'encoded length', old encoder has linear time, the new one is constant tests: unit(dev), perf(release), just encoder on Aarch64 " * 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla: lsa: Use express encoder uleb64: Add express encoding lsa: Extract uleb64 code into header test: LSA allocation perf test	2021-06-16 18:07:13 +03:00
Pavel Emelyanov	8d0780fb92	lsa: Use express encoder To make it possible to use the express encoder, lsa needs to make sure that the value is below express supreme value and provide the size of the gap after the encoded value. Both requirements can be satisfied when encoding the migrator index on object allocation. On free the encoded value can be larger, so the extended express encoder will need more instructions and will not be that efficient, so the old encoder is used there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:47:12 +03:00
Pavel Emelyanov	1782b0c6b9	uleb64: Add express encoding Standard encoding is compiled into a loop that puts values into memory byte-by-byte. This works slowly, but reliably. When allocating an object LSA uses the uleb64 encoder with 2 features that allow optimizing the encoder: 1. the value is migrator.index() which is small enough to fit 2 bytes when encoded 2. After the descriptor there usually comes an object which is of 8+ bytes in size Feature #1 makes it possible to encode the value with just a few instructions. In O3 level clang makes it like mov %esi,%ecx and $0xfc0,%ecx and $0x3f,%esi lea (%rsi,%rcx,4),%ecx add $0x40,%ecx Next, the encoder needs to put the value into a gap whose size depends on the alignment of previous and current objects, so the classical algo loops through this size. Feature #2 makes it possible to put the encoded value and the needed amount of zeros by using 2 64-bit movs. In this case the encoded value gets off the needed size and overwrites some memory after. That's OK, as this overwritten memory is where the allocated object _will_ be, so the contents there are not of any interest. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:47:10 +03:00
Pavel Emelyanov	d8dea48248	lsa: Extract uleb64 code into header The LSA code encodes an object descriptor before the object itself. The descriptor is 32-bit value and to put it in an efficient manner it's encoded into unsigned little-endian base-64 sequence. The encoding code is going to be optimized, so put it into a dedicated header in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:46:44 +03:00
Avi Kivity	0948908502	Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond " The close path of the multishard combining reader is riddled with workarounds for the fact that the flat mutation reader couldn't wait on futures when destroyed. Now that we have a close() method that can do just that, all these workarounds can be removed. Even more workarounds can be found in tests, where resources like the reader concurrency semaphore are created separately for each tested multishard reader and then destroyed after it doesn't need it, so we had to come up with all sorts of creative and ugly workarounds to keep these alive until background cleanup is finished. This series fixes all this. Now, after calling close on the multishard reader, all resources it used, including the life-cycle policy, the semaphores created by it can be safely destroyed. This greatly simplifies the handling of the multishard reader, and makes it much easier to reason about life-cycle dependencies. 
Tests: unit(dev, release:v2, debug:v2, mutation_reader_test:debug -t test_multishard, multishard_mutation_query_test:debug, multishard_combining_reader_as_mutation_source:debug) " * 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla: mutation_reader: reader_lifecycle_policy: remove convenience methods mutation_reader: multishard_combining_reader: store shard_reader via unique ptr test/lib/reader_lifecycle_policy: destroy_reader: cleanup context test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore test/lib/reader_lifecycle_policy: use a more robust eviction mechanism reader_concurrency_semaphore: wait for all permits to be destroyed in stop() test/lib/reader_lifcecycle_policy: fix indentation mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard reader_lifecycle_policy implementations: fix indentation mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter mutation_reader: shard_reader::close(): wait on the remote reader multishard_mutation_query: destroy remote parts in the foreground mutation_reader: shard_reader::close(): close _reader mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment	2021-06-16 17:25:50 +03:00
Benny Halevy	693d5d9e6b	mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated When the queue_reader_handle is terminated it was either explicitly aborted or the reader was closed prematurely. In this case _consume_fut should hold the root-cause error (e.g. when compaction is stopped). Return it instead of trying to push the mutation fragment. If no error is returned from _consume_fut, make sure to return either the queue_reader_handle error, if available, or a generic error since the writer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-16 17:25:16 +03:00
Benny Halevy	b5efc3ceac	queue_reader_handle: add get_exception method To be used by the mutation_writer in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-16 17:25:16 +03:00
Benny Halevy	4830b6647c	queue_reader: close: set value on promises on detach from handle To prevent broken_promise exception. Since close() is mandatory the queue_reader destructor, that just detaches the reader from the handle, is not needed anymore, so remove it. Adjust the test_queue_reader unit test accordingly. Test: test_queue_reader(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-16 17:25:14 +03:00
Konstantin Osipov	9c93d77e74	system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID Fix a bug in definitions of system.raft, system.raft_snapshots, group_id is TIMEUUID, not long.	2021-06-16 16:52:43 +03:00
Konstantin Osipov	c67c77ed03	raft: (server) wait for configuration transition to complete By default, wait for the server to leave the joint configuration when making a configuration change. When assembling a fresh cluster Scylla may run a series of configuration changes. These changes would all go through the same leader and serialize in the critical section around server::cas(). Unless this critical section protects the complete transition from C_old configuration to C_new, after the first configuration is committed, the second may fail with exception that a configuration change is in progress. The topology changes layer should handle this exception, however, this may introduce either unpleasant delays into cluster assembly (i.e. if we sleep before retry), or a busy-wait/thundering herd situation, when all nodes are retrying their configuration changes. So let's be nice and wait for a full transition in server::set_configuration().	2021-06-16 16:52:43 +03:00
Konstantin Osipov	631c89e1a6	raft: (server) implement raft::server::get_configuration() raft::server::set_configuration() is useless on application level if we can't query the previous configuration.	2021-06-16 16:52:43 +03:00
Konstantin Osipov	867440f080	raft: (service) don't throw from schema state machine It's now started as Scylla starts, and state machine failure leads to panic at start.	2021-06-16 16:52:43 +03:00
Konstantin Osipov	845ff9f344	raft: (service) permit some scylla.raft cells to be empty When loading raft state from scylla.raft, permit some cells to be empty. Indeed, the server is not obliged to persist all vote, term, snapshot once it starts. And the log can be empty.	2021-06-16 16:52:43 +03:00
Konstantin Osipov	b8fa6c6e9c	raft: (service) properly handle failure to add a server future.get() is not available outside thread context and co_await is not available inside catch (...) block.	2021-06-16 16:47:11 +03:00
Konstantin Osipov	73c59865f7	raft: implement is_transient_error() Add a helper to classify Raft exceptions as transient.	2021-06-16 16:26:31 +03:00
Pavel Emelyanov	1e67361267	test: LSA allocation perf test The test measures the time it takes to allocate a bunch of small objects on LSA inside single segment. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 13:40:44 +03:00
Botond Dénes	b4e69cf63d	test/lib/test_utils: require(): also log failed conditions Currently `require()` throws an exception when the condition fails. The problem with this is that the error is only printed at the end of the test, with no trace in the logs on where exactly it happened, compared to other logged events. This patch also adds an error-level log line to address this. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>	2021-06-16 12:05:25 +03:00
Botond Dénes	28c2b54875	mutation_reader: reader_lifecycle_policy: remove convenience methods These convenience methods are not used as much anymore and they are not even really necessary as the register/unregister inactive read API got streamlined a lot to the point where all of these "convenience methods" are just one-liners, which we can just inline into their few callers without losing readability.	2021-06-16 11:29:37 +03:00
Botond Dénes	63f0839164	mutation_reader: multishard_combining_reader: store shard_reader via unique ptr No need for a shared pointer anymore, as we don't have to potentially keep the shard reader alive after the multishard reader is destroyed, we now do proper cleanup in close(). We still need a pointer as the shard reader is un-movable but is stored in a vector which requires movable values.	2021-06-16 11:29:37 +03:00
Botond Dénes	a69db31b5c	test/lib/reader_lifecycle_policy: destroy_reader: cleanup context Now that we don't rely on any external machinery to keep the relevant parts of the context alive until needed as its life-cycle is effectively enclosed in that of the life-cycle policy itself, we can cleanup the context in `destroy_reader()` itself, avoiding a background trip back to this shard.	2021-06-16 11:29:36 +03:00
Botond Dénes	d2ddaced4e	test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds The lifecycle of the reader lifecycle policy and all the resources the reads use is now enclosed in that of the multishard reader thanks to its close() method. We can now remove all the workarounds we had in place to keep different resources alive until background reader cleanup finishes.	2021-06-16 11:29:36 +03:00
Botond Dénes	5a271e42a5	test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore So that when this method returns the semaphore is safe to destroy. This in turn will enable us to get rid of all the machinery we have in place to deal with the semaphore having to out-live the lifecycle policy without a clear time as to when it can be safe to destroy.	2021-06-16 11:29:36 +03:00
Botond Dénes	c09c62a0fb	test/lib/reader_lifecycle_policy: use a more robust eviction mechanism The test reader lifecycle policy has a mode in which it wants to ensure all inactive readers are evicted, so tests can stress reader recreation logic. For this it currently employs a trick of creating a waiter on the semaphore. I don't even know how this even works (or if it even does) but it sure complicates the lifecycle policy code a lot. So switch to the much more reliable and simple method of creating the semaphore with a single count and no memory. This ensures that all inactive reads are immediately evicted, while still allowing a single read to be admitted at all times.	2021-06-16 11:29:36 +03:00
Botond Dénes	578a092e4a	reader_concurrency_semaphore: wait for all permits to be destroyed in stop() To prevent use-after-free resulting from any permit out-living the semaphore.	2021-06-16 11:29:36 +03:00
Botond Dénes	a10a6e253e	test/lib/reader_lifcecycle_policy: fix indentation Left broken from the previous patch.	2021-06-16 11:29:36 +03:00
Botond Dénes	8c7447effd	mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard Currently shard_reader::close() (its caller) goes to the remote shard, copies back all fragments left there to the local shard, then calls `destroy_reader()`, which in the case of the multishard mutation query copies it all back to the native shard. This was required before because `shard_reader::stop()` (`close()`'s predecessor) couldn't wait on `smp::submit_to()`. But close can, so we can get rid of all this back-and-forth and just call `destroy_reader()` on the shard the reader lives on, just like we do with `create_reader()`.	2021-06-16 11:29:35 +03:00
Avi Kivity	c3838cbc3b	Merge 'Make calculating affected ranges yieldable' from Piotr Sarna This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism. This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function which can potentially involve large computations and makes it use non-contiguous storage instead of std::vector to avoid large allocations. Tests: unit(release) Closes #8853 * github.com:scylladb/scylla: db,view,table: futurize calculating affected ranges table: coroutinize do_push_view_replica_updates db,view: use chunked vector for view affected ranges interval: generalize deoverlap()	2021-06-16 11:26:49 +03:00
Botond Dénes	4ecf061c90	reader_lifecycle_policy implementations: fix indentation Left broken from the previous patch.	2021-06-16 11:21:38 +03:00
Botond Dénes	a7e59d3e2c	mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter The shard reader is now able to wait on the stopped reader and pass the already stopped reader to `destroy_reader()`, so we can de-futurize the reader parameter of said method. The shard reader was already patched to pass a ready future so adjusting the call-site is trivial. The most prominent implementation, the multishard mutation query, can now also drop its `_dismantling_gate` which was put in place so it can wait on the background stopping of readers. A consequence of this move is that handling errors that might happen during the stopping of the reader is now handled in the shard reader, not all lifecycle policy implementations.	2021-06-16 11:21:38 +03:00
Botond Dénes	13d7806b62	mutation_reader: shard_reader::close(): wait on the remote reader We now have a future<> returning close() method so we don't need to do the cleanup of the remote reader in the background, detaching it from the shard-reader under destruction. We can now wait for the cleanup properly before the shard reader is destroyed and just pass the stopped reader to reader_lifecycle_policy::destroy_reader(). This patch does the first part -- moving the cleanup to the foreground, the API change of said method will come in the next patch.	2021-06-16 11:21:38 +03:00
Botond Dénes	ab8d2a04a5	multishard_mutation_query: destroy remote parts in the foreground Currently the foreign fields of the reader meta are destroyed in the background via the foreign pointer's destructor (with one exception). This makes the already complicated life-cycle of these parts and their dependencies even harder to reason about, especially in tests, where even things like semaphores live only within the test. This patch makes sure to destroy all these remote fields in the foreground in either `save_reader()` or `stop()`, ensuring that once `stop()` returns, everything is cleaned up.	2021-06-16 11:21:38 +03:00
Botond Dénes	7552cc73cf	mutation_reader: shard_reader::close(): close _reader The reason we got away without closing _reader so far is that it is an `std::unique_ptr<evictable_reader>` which is a `flat_mutation_reader::impl` instance, without the `flat_mutation_reader` wrapper, which contains the validations for close.	2021-06-16 11:21:33 +03:00
Avi Kivity	fce124bd90	Merge "Introduce flat_mutation_reader_v2" from Tomasz " This series introduces a new version of the mutation fragment stream (called v2) which aims at improving range tombstone handling in the system. When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair), the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range. See range_tombstone_accumulator. One problem is that it has unbounded memory footprint because the accumulator needs to keep track of all the tombstoned ranges which are still active. Another, although more benign, problem is computational complexity needed to maintain that data structure. The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given partition are represented with range_tombstone_change fragments. At any point in the stream there is a single active clustered tombstone. It is initially equal to the neutral tombstone when the stream of each partition starts. The range_tombstone_change fragment type signifies changes of the active clustered tombstone. All fragments emitted while a given clustered tombstone is active are affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone is independent from the partition tombstone carried in partition_start. The memory needed to compact a stream is now constant, because the compactor needs to only track the current tombstone. Also, there is no need to expire ranges on each fragment because the stream emits a fragment when the range ends. This series doesn't convert all readers to v2. It introduces adaptors which can convert between v1 and v2 streams. Each mutation source can be constructed with either v1 or v2 stream factory, but it can be asked any version, performing conversion under the hood if necessary. 
In order to guarantee that v1 to v2 conversion produces a well-formed stream, this series needs to impose a constraint on v1 streams to trim range tombstones to clustering restrictions. Otherwise, the v1->v2 converter could produce range tombstone changes which lie outside query restrictions, making the stream non-canonical. The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect range tombstones trimmed to query restrictions, and fast-forwarding ranges. This makes the stream more canonical, meaning that for a given set of writes, querying the database should produce the same stream of fragments for given restrictions. There is less ambiguity in how the writes are represented in the fragment stream. It wasn't the case with v1. For example, a given set of deletions could be produced either as one range_tombstone, or many, split and/or deoverlapped with other fragments. Making a stream canonical is easier for diff-calculating. The mc sstable reader was converted to v2 because it seemed like a comparable effort to do that versus implementing range tombstone trimming in v1. The classes related to mutation fragment streams were cloned: flat_mutation_reader_v2, mutation_fragment_v2, related concepts. Refs #8625. To fully fix #8625 we need to finish the transition and get rid of the converters. Converters accumulate range tombstones. 
Tests: - unit [dev] " * tag 'flat_mutation_reader_range_tombstone_split-v3.2' of github.com:tgrabiec/scylla: (26 commits) tests: mutation_source_test: Run tests with conversions inserted in the middle tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests() tests: Add tests for flat_mutation_reader_v2 flat_mutation_reader: Update the doc to reflect range tombstone trimming sstables: Switch the mx reader to flat_mutation_reader_v2 row_cache: Emit range tombstone adjacent to upper bound of population range tests: sstables: Fix test assertions to not expect more than they should flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments() clustering_ranges_walker: Emit range tombstone changes while walking tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream Clone flat_reader_assertions into flat_reader_assertions_v2 test: lib: simple_schema: Reuse new_tombstone() test: lib: simple_schema: Accept tombstone in delete_range() mutation_source: Introduce make_reader_v2() partition_snapshot_flat_reader: Trim range tombstones to query ranges mutation_partition: Trim range tombstones to query ranges sstables: reader: Inline specialization of sstable_mutation_reader sstables: k_l: reader: Trim range tombstones to query ranges clustering_ranges_walker: Introduce split_tombstone() position_range: Introduce contains() check for ranges ...	2021-06-16 11:10:54 +03:00
Piotr Sarna	f832a30388	db,view,table: futurize calculating affected ranges In order to avoid stalls on large inputs, calculating affected ranges is now able to yield.	2021-06-16 09:51:31 +02:00
Piotr Sarna	e3fa0246a1	table: coroutinize do_push_view_replica_updates Makes the code cleaner, but more importantly it will make it easier to futurize calculate_affected_clustering_ranges in the near future.	2021-06-16 09:51:30 +02:00
Avi Kivity	44f3ad836b	main: use correct max-io-requests option spelling We check for the existence of the option using one spelling, then read it using another, so we crash with bad_lexical_cast if it's present when casting the empty string to unsigned. Fix by using the correct spelling. Closes #8866	2021-06-16 09:35:05 +02:00
Tomasz Grabiec	605a6e0166	Merge "Remove int_or_strong_ordering concept" from Pavel The one was added to smoothly switch tri-comparing stuff from int to strong-ordering. As for today only tests still need it and the conversion is pretty simple, plus operator<<(ostream&) for the std::strong_ordering type. * xemul/br-remove-int-or-strong-ordering-2: util: Drop int_or_strong_ordering concept tests: Switch total-order-check onto strong_ordering to_string: Add formatter for strong_ordering tests: Return strong-ordering from tri-comparators	2021-06-16 09:34:49 +02:00
Botond Dénes	114459684b	mutation_reader: foreign_reader::close() use on_internal_error_noexcept() Instead of the throwing on_internal_error(). `close()` is noexcept so we can't throw exceptions here. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210615133130.786048-1-bdenes@scylladb.com>	2021-06-16 09:34:49 +02:00
Asias He	11959173a4	storage_service: Add node_ops_cmd_heartbeat_updater helper Multiple node operations use a similar heart beat update logic. Add a helper to reduce the code duplication. Fixes #8825 Closes #8826	2021-06-16 09:34:49 +02:00
Gleb Natapov	580edcef27	raft: register metrics only after fsm is created Metrics access _fsm pointer, so we should register them only after the pointer is populated. Fixes: #8824 Message-Id: <YMilsCslLAeEnbaw@scylladb.com>	2021-06-16 09:34:49 +02:00
Asias He	c2cfdcd345	gossiper: Set minimum value for quarantine_delay When a new node bootstraps to join the cluster, it will be set in bootstrap gossip status. If the node is gone in the middle, the node will be removed by gossip after the new node fails to update gossip after fat_client_timeout, which reverts the new node as pending node. However, if the new node is slow to update gossip and it finishes bootstrapping after existing nodes have removed the new node after fat_client_timeout, then in the handle_state_normal handler the existing nodes will fail to find the host id for the new node and throw and in turn terminate the scylla process. To mitigate the problem, we set fat_client_timeout which is half of quarantine_delay to a minimum value if users set a small ring_delay value. Refs #8702 Refs #8859 Closes #8860	2021-06-16 09:34:49 +02:00
Tomasz Grabiec	3fcd1f43ba	tests: mutation_source_test: Run tests with conversions inserted in the middle	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	cddcba27de	tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests() All readers are now flat so there is no need for this grouping. Will be needed for the next patch, which needs a single function with all test cases.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	ffb616fef6	tests: Add tests for flat_mutation_reader_v2	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	3deaa15751	flat_mutation_reader: Update the doc to reflect range tombstone trimming	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	a4275cf8bc	sstables: Switch the mx reader to flat_mutation_reader_v2 The main difficulty was in making sure that emitted range tombstone changes reflect range tombstones trimmed to clustering restrictions. This is handled by mutation_fragment_filter and clustering_ranges_walker. They return a list of range_tombstone_change fragments to emit for each hop as the reader walks over the clustering domain. Tests which were using a normalizing reader expected range tombstones to be split around rows. Drop this and adjust the tests accordingly. No reader splits range tombstones around rows now.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	cf958b0ad0	row_cache: Emit range tombstone adjacent to upper bound of population range Cache populating reader was emitting the row entry which stands for the upper bound of the population range, but did not emit range tombstones for the clustering range corresponding to: [ before(key), after(key) ). This surfaces after sstable readers are changed to trim emitted range tombstones to the fast-forwarding range. Before, it didn't cause problems, because that range tombstone part would be emitted as part of the sstable read. The fix is to drop the optimization which pushes the row after population is done, and let the regular handling for copy_from_cache_to_buffer() take care of emitting the row and tombstones for the remaining range. A unit test is added which covers population from all sstable versions.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	5b182ff29a	tests: sstables: Fix test assertions to not expect more than they should Before this patch, the tests expected readers to emit range tombstones which are outside clustering restrictions. Readers do not have to emit range tombstones outside clustering restrictions, so fix tests to only expect the part which overlaps with query ranges. This is a preparatory patch before changing readers to trim range tombstones to clustering ranges.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	558d88ea17	flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments() This is needed to change the guarantees of flat_mutation_reader v1 to produce only range tombstones trimmed to clustering restrictions. The reason for this is so that v2 has a canonical representation in which all fragments have position inside clustering restrictions. Conversion from v1 to v2 can guarantee that only if v1 trims range tombstones.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	77c618f46e	clustering_ranges_walker: Emit range tombstone changes while walking The walker will now emit range tombstone change fragments while walking. This is in order to support the guarantee of flat_mutation_reader_v2 saying that clustering range tombstone information must be trimmed to clustering key restrictions. For example, for ranges: [1, 3) [5, 9) [10, 11) advancing generates the following changes: using rtc = range_tombstone_change; advance_to(0, {}) -> [] advance_to(2, t1) -> [ rtc(2, t1) ] advance_to(4, t2) -> [ rtc(3, {}) ] advance_to(15, t3) -> [ rtc(5, t2), rtc(9, {}), rtc(10, t2), rtc(11, {}) ]	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	ed055db63e	tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	276c68c867	Clone flat_reader_assertions into flat_reader_assertions_v2	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	a13e7b30b7	test: lib: simple_schema: Reuse new_tombstone()	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	7e01679c99	test: lib: simple_schema: Accept tombstone in delete_range()	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	79795a1a61	mutation_source: Introduce make_reader_v2() Mutation sources can now produce natively either v1 or v2 streams. We still have both v1 and v2 make_reader() variants, which wrap in appropriate converters under the hood.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	4046dda844	partition_snapshot_flat_reader: Trim range tombstones to query ranges This is needed to change the guarantees of flat_mutation_reader v1 to produce only range tombstones trimmed to clustering restrictions. The reason for this is so that v2 has a canonical representation in which all fragments have position inside clustering restrictions. Conversion from v1 to v2 can guarantee that only if v1 trims range tombstones.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	655bc9fba5	mutation_partition: Trim range tombstones to query ranges Current code was only selecting overlapping range tombstones. We will need range tombstones to be trimmed. This is needed to change the semantics of flat_mutation_reader v1 to produce only range tombstones trimmed to clustering restrictions. This constructor is used in unit tests which verify what reader produces.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	8784ffe07f	sstables: reader: Inline specialization of sstable_mutation_reader Needed before converting the mx reader to flat_mutation_reader_v2 because now it and the k_l reader cannot share the reader implementation. They derive from different reader impl bases and push different fragment types.	2021-06-16 00:23:49 +02:00
Pavel Solodovnikov	e9258f43cd	raft: etcd_test: test_transfer_non_member Test that a node outside configuration, that receives `timeout_now` message, doesn't disrupt operation of the rest of the cluster. That is, `timeout_now` has no effect and the outsider stays in the follower state without promoting to the candidate. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-15 19:44:21 +03:00
Pavel Solodovnikov	2b6d73de98	raft: etcd_test: test_leader_transfer_ignore_proposal Test that a leader which has entered leader stepdown mode rejects new append requests. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-15 19:44:21 +03:00
Pavel Solodovnikov	ab6b0e3d62	raft: fsm_test: test_leader_transfer_lost_force_vote_request 3-node cluster (A, B, C). A is initially elected a leader. The leader adds a new configuration entry, that removes it from the cluster (B, C). Wait up until the former leader commits the new configuration and starts leader transfer procedure, sending out the `timeout_now` message to one of the remaining nodes. But at that point it hasn't received it yet. Deliver the `timeout_now` message to the target but lose all the `vote_request(force)` messages it attempts to send. This should halt the election process. Then wait for election timeout so that candidate node starts another normal election (without `force` flag for vote requests). Check that this candidate further makes progress and is elected a leader. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-15 19:44:21 +03:00
Pavel Solodovnikov	97fe6f9d49	raft: fsm_test: test_leader_transfer_lost_timeout_now 3-node cluster (A, B, C). A is initially elected a leader. The leader adds a new configuration entry, that removes it from the cluster (B, C). Wait up until the former leader commits the new configuration and starts leader transfer procedure, sending out the `timeout_now` message to one of the remaining nodes. But at that point it hasn't received it yet. Lose this message and verify that the rest of the cluster (B, C) can make progress and elect a new leader. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-15 19:44:21 +03:00
Pavel Solodovnikov	c32497b798	raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now 4-node cluster (A, B, C, D). A is initially elected a leader. The leader adds a new configuration entry, that removes it from the cluster (B, C, D). Communicate the cluster up to the point where A starts to resign its leadership (calls `transfer_leadership()`). At this point, A should send a `timeout_now` message to one of the remaining nodes (B, C or D) and the new configuration should be committed. But no nodes actually have received the `timeout_now` message yet. Determine on which node the message should arrive, accept the `timeout_now` message and disconnect the target from the rest of the group. Check that after that the cluster, which has only two live members, could progress and elect a new leader through a normal election process. tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-15 19:44:19 +03:00
Botond Dénes	98e5f0429b	mutation_reader: reader_lifecycle_policy::destroy_reader(): remove out-of-date comment The comment claimed that the multishard reader is not able to wait on the returned future. It can now, via the `close()` method.	2021-06-15 15:23:32 +03:00
Tomasz Grabiec	53568f6939	sstables: k_l: reader: Trim range tombstones to query ranges This is needed to change the guarantees of flat_mutation_reader v1 to produce only range tombstones trimmed to clustering restrictions. The reason for this is so that v2 has a canonical representation in which all fragments have position inside clustering restrictions. Conversion from v1 to v2 can guarantee that only if v1 trims range tombstones.	2021-06-15 13:14:45 +02:00
Tomasz Grabiec	f339eb3e9c	clustering_ranges_walker: Introduce split_tombstone()	2021-06-15 13:14:45 +02:00
Tomasz Grabiec	08f471043e	position_range: Introduce contains() check for ranges	2021-06-15 13:14:45 +02:00
Tomasz Grabiec	c9f2daaa8e	range_tombstone: Introduce trim()	2021-06-15 13:14:45 +02:00
Tomasz Grabiec	1ef73abd82	flat_mutation_reader_v2: Implement read_mutation_from_flat_mutation_reader()	2021-06-15 13:14:45 +02:00
Tomasz Grabiec	9996b7ca18	flat_mutation_reader: Introduce adaptors between v1 and v2 of mutation fragment stream The transition to v2 will be incremental. To support that, we need adaptors between v1 and v2 which will be inserted at places which are boundaries of conversion. The v1 -> v2 converter needs to accumulate range tombstones, so has unbounded worst case memory footprint. The v2 -> v1 converter trims range tombstones around clustering rows, so generates more fragments than necessary. Because of that, adaptors are a temporary solution and we should not release with them on the production code paths.	2021-06-15 13:10:47 +02:00
Tomasz Grabiec	08b5773c12	Adapt flat_mutation_reader_v2 to the new version of the API When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair), the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range. See range_tombstone_accumulator. One problem is that it has unbounded memory footprint because the accumulator needs to keep track of all the tombstoned ranges which are still active. Another, although more benign, problem is computational complexity needed to maintain that data structure. The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given partition are represented with range_tombstone_change fragments. At any point in the stream there is a single active clustered tombstone. It is initially equal to the neutral tombstone when the stream of each partition starts. The range_tombstone_change fragment type signifies changes of the active clustered tombstone. All fragments emitted while a given clustered tombstone is active are affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone is independent from the partition tombstone carried in partition_start. The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect range tombstones trimmed to query restrictions, and fast-forwarding ranges. This makes the stream more canonical, meaning that for a given set of writes, querying the database should produce the same stream of fragments for given restrictions. There is less ambiguity in how the writes are represented in the fragment stream. It wasn't the case with v1. For example, a given set of deletions could be produced either as one range_tombstone, or as many, split and/or deoverlapped with other fragments. Making the stream canonical makes diff calculation easier. 
The classes related to mutation fragment streams were cloned: flat_mutation_reader_v2, mutation_fragment_v2, and related concepts. Refs #8625.	2021-06-15 13:10:47 +02:00
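To illustrate the "single active clustered tombstone" idea from the commit above, here is a rough sketch that turns a set of possibly overlapping range tombstones into a flat sequence of changes. It is not the actual v1 -> v2 converter. It assumes integer positions, half-open ranges, positive integer timestamps where the highest timestamp wins, and `0` as the neutral tombstone; the `toy_` names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy v1-style fragment: deletes [start, end) with the given timestamp (> 0).
struct toy_range_tombstone { int start; int end; int timestamp; };

// Toy v2-style fragment: from `pos` on, `timestamp` is the active tombstone
// (0 means that no tombstone is active).
struct toy_change {
    int pos;
    int timestamp;
    bool operator==(const toy_change& o) const {
        return pos == o.pos && timestamp == o.timestamp;
    }
};

std::vector<toy_change> to_changes(const std::vector<toy_range_tombstone>& rts) {
    struct event { std::vector<int> starting; std::vector<int> ending; };
    std::map<int, event> events; // ordered by position
    for (const auto& rt : rts) {
        if (rt.start >= rt.end) continue; // ignore empty ranges
        events[rt.start].starting.push_back(rt.timestamp);
        events[rt.end].ending.push_back(rt.timestamp);
    }
    std::multiset<int> covering; // timestamps of tombstones covering the current position
    int current = 0;             // tombstone last emitted
    std::vector<toy_change> out;
    for (const auto& [pos, ev] : events) {
        for (int ts : ev.ending) covering.erase(covering.find(ts));
        for (int ts : ev.starting) covering.insert(ts);
        const int now = covering.empty() ? 0 : *covering.rbegin();
        if (now != current) { out.push_back({pos, now}); current = now; }
    }
    return out;
}
```

For the tombstones `[1, 10)` at timestamp 1 and `[3, 5)` at timestamp 2, the sketch produces changes at positions 1, 3, 5 and 10. A consumer only has to remember the current tombstone, whereas the v1 representation requires accumulating every tombstone that is still open.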
Tomasz Grabiec	e3309322c3	Clone flat_mutation_reader related classes into v2 variants To make review easier, first clone the classes without changing the logic. Logic and API will change in subsequent commits.	2021-06-15 13:10:09 +02:00
Tomasz Grabiec	eb0078d670	flat_mutation_reader: Document current guarantees about mutation fragment stream	2021-06-15 12:37:09 +02:00
Alejo Sanchez	9a22a30554	raft: replication test: split elect_new_leader for prevote Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-02-v3-01 Tests: unit ({dev}), unit ({debug}), unit ({release}) This fixes current election hangs in next. Message-Id: <20210610143558.131685-1-alejo.sanchez@scylladb.com>	2021-06-15 11:53:24 +02:00
Avi Kivity	10e75bc363	storage_proxy: remove excess continuations around abstract_read_executor::make_requests() abstract_read_executor::make_requests() calls make_{data,digest}_request(), which loop over endpoints in a parallel_for_each(), then collects the result of the parallel_for_each()es with when_all_succeed(), then a handle_exception() (or discard_result() in related callers). The caller of make_requests then attaches a finally() block to keep `this` alive, and discards the remaining future. So, a lot of continuations are generated to merge the results, all in order to keep a reference count alive. Remove those excess continuations by having individual make_*_request() variants elevate the reference count themselves. They all already have a continuation to incorporate the result into the executor, all they need is an extra shared_from_this() call. The parallel_for_each() loops are converted to regular for loops. Note even a local request that hits cache ends up with a non-ready future due to an execution_stage for replica access, so these continuations generate reactor tasks. perf_simple_query reports: before: median 203905.19 tps ( 87.1 allocs/op, 20.1 tasks/op, 50860 insns/op) after: median 214646.89 tps ( 81.1 allocs/op, 15.1 tasks/op, 48604 insns/op)	2021-06-15 10:49:57 +02:00
Piotr Sarna	3592d9b36e	db,view: use chunked vector for view affected ranges There were large allocation reports from vectors used for calculating affected ranges. In order to reduce the pressure on the allocator, chunked vector is used for storing intermediate results.	2021-06-15 10:30:27 +02:00
Piotr Sarna	fbc83d5ac6	interval: generalize deoverlap() Instead of working only for std::vector, deoverlap is now capable of using other structures - including chunked_vector, which will help split large allocations into smaller ones.	2021-06-15 10:30:27 +02:00
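A container-generic `deoverlap()` could look roughly like the sketch below. It is not the code from `interval`: intervals are simplified to half-open integer pairs, touching intervals are merged, and `toy_deoverlap` is a hypothetical name. It only assumes a container with random-access iterators, `push_back()`, `back()` and `empty()`, so `std::deque` is used here as a stand-in for a chunked container.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

using toy_interval = std::pair<int, int>; // half-open [first, second)

// Sorts the intervals and merges the ones that overlap or touch. Works for
// any container type that std::sort() accepts, not only std::vector.
template <typename Container>
Container toy_deoverlap(Container intervals) {
    std::sort(intervals.begin(), intervals.end());
    Container out;
    for (const auto& iv : intervals) {
        if (!out.empty() && iv.first <= out.back().second) {
            out.back().second = std::max(out.back().second, iv.second);
        } else {
            out.push_back(iv);
        }
    }
    return out;
}
```

Because the result has the same type as the input, a caller that switches its storage from a contiguous vector to a chunked one does not need to change the call site.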
Tomasz Grabiec	9d49a26e79	Merge "raft: randomized_nemesis_test: tick servers less often than the network in basic_test" from Kamil Previously `ticker` would use a single function, `on_tick`, which it called in a loop with yields in-between. In `basic_test` we would use this to tick every object in synchrony. However, to closely simulate a production environment, we want the tick ratios to be different. For example Raft servers should be ticked rarely compared to the network. We may also want to give the Seastar reactor more space between the function calls (e.g. if they cause a bunch of work to be created for the reactor that needs more than one tick to complete). To support these use cases we first generalize `ticker` to take a set of functions with associated numbers. These numbers are the call periods of their corresponding functions: given {n, f}, `f` will be called each `n`th tick. We use this new functionality to tick Raft servers less often than the network in basic_test. This patchset effectively reverts `01b6a2eb38` which caused the ticker to call `on_tick` only when the Seastar reactor had no work to do. This approach is unfortunately incompatible with the approach taken there. We do want the ticker to race with other work, potentially producing more work while already scheduled work is executing, and we want to see in tests what happens when we adjust the ticking ratios of different subsystems. The previous approach also had a problem where if there was an infinite task loop executing, the ticker wouldn't ever tick. The previous fix was introduced since the ticker caused too much work to be produced (so the reactor couldn't keep up) due to ticking the Raft servers too often (after each yield). This commit deals with the problem in a different way, by ticking the servers rarely, which also resembles "real-life" scenarios better. 
* kbr/tick-network-often-v4: raft: randomized_nemesis_test: generalize `ticker` to take a set of functions raft: randomized_nemesis_test: split `environment::tick` into two functions raft: randomized_nemesis_test: fix potential use-after-free in basic_test	2021-06-15 01:54:57 +02:00
Kamil Braun	8f1caa6a90	raft: randomized_nemesis_test: generalize `ticker` to take a set of functions ... with associated calling periods and use the new API in `basic_test`. Previously `ticker` would use a single function, `on_tick`, which it called in a loop with yields in-between. In `basic_test` we would use this to tick every object in synchrony. However, to closely simulate a production environment, we may want the tick ratios to be different. For example Raft servers should be ticked rarely compared to the network. We may also want to give the Seastar reactor more space between the function calls (e.g. if they cause a bunch of work to be created for the reactor that needs more than one tick to complete). To support these use cases we generalize `ticker` to take a set of functions with associated numbers. These numbers are the call periods of their corresponding functions: given {n, f}, `f` will be called each `n`th tick. We also modify `basic_test` to use this new approach: we tick Raft servers once per 10 network ticks (in particular, once per 10 reactor yields). This commit effectively reverts `01b6a2eb38` which caused the ticker to call `on_tick` only when the Seastar reactor had no work to do. This approach is unfortunately incompatible with the approach taken there. We do want the ticker to race with other work, potentially producing more work while already scheduled work is executing, and we want to see in tests what happens when we adjust the ticking ratios of different subsystems. The previous approach also had a problem where if there was an infinite task loop executing, the ticker wouldn't ever tick. The previous fix was introduced since the ticker caused too much work to be produced (so the reactor couldn't keep up) due to ticking the Raft servers too often (after each yield). This commit deals with the problem in a different way, by ticking the servers rarely, which also resembles "real-life" scenarios better. 
With this change we must also wait a bit longer for the first node to elect itself as a leader at the beginning of the test.	2021-06-14 16:54:38 +02:00
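The `{n, f}` scheme described above can be sketched without Seastar as follows. The real `ticker` yields to the reactor between ticks; this synchronous version only shows the period logic, and `toy_ticker` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Calls each registered function on every n-th tick, where n is the period
// the function was registered with.
class toy_ticker {
    std::vector<std::pair<std::uint64_t, std::function<void()>>> _fns;
    std::uint64_t _tick = 0;
public:
    explicit toy_ticker(std::vector<std::pair<std::uint64_t, std::function<void()>>> fns)
        : _fns(std::move(fns)) {}

    void tick() {
        ++_tick;
        for (auto& [period, fn] : _fns) {
            // A period of 0 is treated as "never call" in this sketch.
            if (period != 0 && _tick % period == 0) fn();
        }
    }
};
```

Registering a network function with period 1 and a server function with period 10 gives the ratio used in `basic_test`: after 100 ticks the network function has run 100 times and the server function 10 times.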
Kamil Braun	c0b80f1f8a	raft: randomized_nemesis_test: split `environment::tick` into two functions One for ticking the network and one for ticking the servers.	2021-06-14 16:54:38 +02:00
Kamil Braun	f42776aded	raft: randomized_nemesis_test: fix potential use-after-free in basic_test The test starts by waiting a certain number of ticks for the first node to elect itself as a leader. If this wait times out - i.e. the number of ticks passes before the node manages to elect itself - the future associated with the task which checks for the leader condition becomes discarded (it is passed to `with_timeout`) and the task may keep using the `environment` (which it has a reference to) even after the `environment` is destroyed. Furthermore, the aforementioned task is a coroutine which uses lambda captures in its body. Leaving `with_timeout` destroys the lambda object, causing the coroutine to refer to no-longer-existing captures. We fix the problems by: - making `environment` `weakly_referencable` and checking if it's alive before it's used inside the task, - not capturing anything in the lambda but passing whatever's needed as function arguments (so these things get allocated inside the coroutine frame).	2021-06-14 16:54:38 +02:00
Nadav Har'El	3645c7104b	Merge: Wrap alternator start-stop into controller Merged patch series by Pavel Emelyanov: Alternator start and stop code is sitting inside the main() and it's a big piece of code out there. Having it all in main complicates rework of start-stop sequences, it's much more handy to have it in alternator/. This set puts the mentioned code into transport- and thrift- like controller model. While doing it one more call for global storage service goes away. * 'br-alternator-clientize' of https://github.com/xemul/scylla: alternator: Move start-stop code into controller alternator: Move the whole starting code into a sched group alternator: Dont capture db, use cfg alternator: Controller skeleton alternator: Controller basement alternator: Drop storage service from executor	2021-06-14 15:44:10 +03:00
Michael Livshin	15b0e5c4d2	sstables: count read range tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210602152210.17948-2-michael.livshin@scylladb.com>	2021-06-14 14:37:33 +02:00
Michael Livshin	9ef2317248	row_cache: count range tombstones processed during read Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210602152210.17948-1-michael.livshin@scylladb.com>	2021-06-14 14:29:05 +02:00
Nadav Har'El	6726fe79b6	Merge 'view: fix use-after-move when handling view update failures' from Piotr Sarna The code was susceptible to use-after-move if both local and remote updates were going to be sent. The whole routine for sending view updates is now rewritten to avoid use-after-move. Fixes #8830 Tests: unit(release), dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build) Closes #8834 * github.com:scylladb/scylla: view: fix use-after-move when handling view update failures db,view: explicitly move the mutation to its helper function db,view: pass base token by value to mutate_MV	2021-06-14 13:15:35 +03:00
Alejo Sanchez	5c8092cf42	raft: fix election with disruptive candidate This patch also fixes rare hangs in debug mode for drops_04 without prevote. Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling Tests: unit ({dev}), unit ({debug}), unit ({release}) Changes in v2: - Fixed commit message @kostja Without prevote, a node disconnected for long enough becomes candidate. While disconnected (A) it keeps increasing its term. When it rejoins it disrupts the current leader (C) which steps down due to the higher term in (A)'s append_entries_reply and (C) also increases its term. Meanwhile followers (B) and (D) don't know (C) stepped down but see it alive according to the current failure detector implementation, and also (A) has shorter log than them. So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers). Then (C) rejects voting for (A) because it has shorter log. And (C) becomes candidate but even though (A) votes for (C), the previous followers (B) and (D) ignore a vote request while leader (C) is still alive and election timeout has not passed. (A) and (C) alone can't reach quorum 2/4. So elections never succeed. This patch addresses this problem by making followers not ignore vote requests from who they think is the current leader even though election timeout was not reached. As @kostja noted, if failure detector would consider a leader alive only as long as it sends heartbeats (append requests) this patch is no longer needed. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>	2021-06-14 11:07:38 +02:00
Piotr Jastrzebski	1ed92e37f8	database: Fix warning about deprecated update_shares_for_class usage This patch fixes the following compilation warning: database.cc:430:33: warning: 'update_shares_for_class' is deprecated: Use io_priority_class.update_shares [-Wdeprecated-declarations] _inflight_update = engine().update_shares_for_class(_io_priority, uint32_t(shares)); Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8751	2021-06-14 10:42:22 +03:00
Piotr Sarna	8a049c9116	view: fix use-after-move when handling view update failures The code was susceptible to use-after-move if both local and remote updates were going to be sent. The whole routine for sending view updates is now rewritten to avoid use-after-move. Refs #8830 Tests: unit(release), dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)	2021-06-14 09:36:10 +02:00
Piotr Sarna	7cdbb7951a	db,view: explicitly move the mutation to its helper function The `apply_to_remote_endpoints` helper function used to take its `mut` parameter by reference, but then moved the value from it, which is confusing and prone to errors. Since the value is moved-from, let's pass it to the helper function as rvalue ref explicitly.	2021-06-14 09:34:40 +02:00
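The signature change can be illustrated with a minimal sketch. The names `toy_mutation` and `apply_to_remote` are hypothetical and the body is reduced to a single move; the point is only that an rvalue-reference parameter forces the caller to spell out the transfer of ownership.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct toy_mutation { std::string payload; };

// Taking `mut` as an rvalue reference documents that the helper consumes it.
// With a plain `toy_mutation&` parameter the same move would compile too,
// but the caller would have no hint that its object is emptied.
void apply_to_remote(toy_mutation&& mut, std::vector<toy_mutation>& sent) {
    sent.push_back(std::move(mut));
}
```

A call such as `apply_to_remote(m, sent)` with an lvalue `m` no longer compiles; the caller has to write `apply_to_remote(std::move(m), sent)`, which makes any later use of `m` stand out in review.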
Piotr Sarna	88d4a66e90	db,view: pass base token by value to mutate_MV The base token is passed cross-continuations, so the current way of passing it by const reference probably only works because the token copying is cheap enough to optimize the reference out. Fix by explicitly taking the token by value.	2021-06-14 09:30:38 +02:00
Nadav Har'El	6a8441ef03	Update seastar submodule * seastar 4506b878...813eee3e (12): > reactor: fix race with boost::barrier destructor during smp initialization > Merge "Merge io-group and io-queue configs" from Pavel E > tests: add test for skipping data from a socket > tests: transform socket_test into a test suite > .gitignore: Add tags > tls: retain handshake error and return original problem on repeated failures > iostream: fix skipping from closed sockets > gitignore .cooking_memory > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy > io_priority_class: Make update_shares const > Remove <seastar/core/apply.hh> > smp: allow having multiple instances of the smp class The fix to make io_priority::update_shares() const will allow getting rid of one of the compilation warnings.	2021-06-14 10:27:14 +03:00
Nadav Har'El	061e43e9d4	Merge 'Fix some compilation warnings' from Piotr Jastrzębski Closes #8850 * github.com:scylladb/scylla: priority_manager: Fix warnings about deprecated register_one_priority_class usage main: Fix warning about deprecated usage of io_queue::capacity	2021-06-14 10:05:27 +03:00
Piotr Jastrzebski	831a60a6cd	priority_manager: Fix warnings about deprecated register_one_priority_class usage This patch fixes the following warnings: service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] : _commitlog_priority(engine().register_one_priority_class("commitlog", 1000)) service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000)) service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _streaming_priority(engine().register_one_priority_class("streaming", 200)) service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _sstable_query_read(engine().register_one_priority_class("query", 1000)) service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _compaction_priority(engine().register_one_priority_class("compaction", 1000)) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-14 08:49:46 +02:00
Piotr Jastrzebski	3ec04433f7	main: Fix warning about deprecated usage of io_queue::capacity This patch fixes the following warning: main.cc:307:53: warning: 'capacity' is deprecated: modern I/O queues should use a property file [-Wdeprecated-declarations] auto capacity = engine().get_io_queue().capacity(); It's fine to just check --max-io-requests directly because seastar sets io_queue::capacity to the value of this parameter anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-14 08:49:42 +02:00
Raphael S. Carvalho	846f0bd16e	sstables: Fix incremental selection with compound sstable set Incremental selection may not work properly for LCS and ICS due to a use-after-free bug in partitioned set which came into existence after compound set was introduced. The use-after-free happens because partitioned set wasn't taking into account that the next position can become the current position in the next iteration, which will be used by all selectors managed by compound set. So if next position is freed while it is still being used as current position, subsequent selectors would find the current position freed, making them produce incorrect results. Fix this by moving ownership of next pos from incremental_selector_impl to incremental_selector, which makes it more robust as the latter knows better when the selection is done with the next pos. incremental_selector will still return ring_position_view to avoid copies. Fixes #8802. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>	2021-06-13 16:45:07 +03:00
Kamil Braun	9e85921006	storage_proxy: remove a feedback loop from the speculative retry latency metric To handle a read request from a client, the coordinator node must send data and digest requests to replicas, reconcile the obtained results (by merging the obtained mutations and comparing digests), and possibly send more requests to replicas if the digests turned out to be different in order to perform read repair and preserve consistency of observed reads. In contrast to writes, where coordinators send their mutation write requests to all replicas in the replica set, for reads the coordinators send their requests only to as many replicas as is required to achieve the desired CL. For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends its request to a subset of 2 nodes out of the 3 possible replicas. The choice of the 2-node subset is random; the distribution used for the random roll is affected by certain things such as the "cache hitrate" metric. The details are not that relevant for this discussion. If not all of the initially chosen replicas answer within a certain time period, the coordinator may send an additional request to one more replica, hoping that this replica helps achieve the desired CL so the entire client request succeeds. This mechanism is called "speculative retry" and is enabled by default. This time period - call it `T` - is chosen based on keyspace configuration. The default value is "99.0PERCENTILE", which means that `T` is roughly equal to the 99th percentile of the latency distribution of previous requests (or at least the most recent requests; the algorithm uses an exponential decay strategy to make old requests less relevant for the metric). 
The latencies used are the durations of whole coordinator read requests: each such duration measurement starts before the first replica request is sent and ends after the last replica request is answered, among the replica requests whose results were used for the reconciled result returned to the client (there may be more requests sent later "in the background" - they don't affect the client result and are not taken into account for the latency measurement). This strategy, however, gives an undesired effect which appears when a significant part of all requests require a speculative retry to succeed. To explain this effect it's best to consider a scenario which takes this to the extreme - where all requests require a speculative retry. Consider RF=3 and CL=QUORUM so each read request initially uses 2 replicas. Let {A, B, C} be the set of replicas. We run a uniformly distributed read workload. Initially the cluster operates normally. Roughly 1/3 of all requests go to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th percentile of read request latencies is 50ms. Suppose that the average round-trip latency between a coordinator and any replica is 10ms. Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power outage. This means that other nodes are initially not aware that C is down, they must wait for the failure detector to convict C as unavailable which happens after a configurable amount of time. The current default is 20s, meaning that by default coordinators will still attempt to send requests to C for 20s after it is hard-killed. During this period the following happens: - About 2/3 of all requests - the ones which were routed to {A, C} and {B, C} - do not finish within 50ms because C does not answer. For these requests to finish, the coordinator performs a speculative retry to the third replica which finishes after ~10ms (the average round-trip latency). Thus the entire request, from the coordinator's POV, takes ~60ms. 
- Eventually (very quickly in fact - assuming there are many concurrent requests) the P99 latency rises to 60ms. - Furthermore, the requests which initially use {A, C} and {B, C} start taking more than 2/3 of all requests because they are stuck in the foreground longer than the {A, B} requests (since their latencies are higher). - These requests do not finish within 60ms. Thus coordinators perform speculative retries. Thus they finish after ~70ms. - Eventually the P99 latency rises to 70ms. - These bad requests take an even longer portion of all requests. - These requests do not finish within 70ms. They finish after ~80ms. - Eventually the P99 latency rises to 80ms. - And so on. In metrics, we observe the following: - Latencies rise roughly linearly. They rise until they hit a certain limit; this limit comes from the fact that `T` is upper-bounded by the read request timeout parameter divided by 2. Thus if the read request timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`. Thus eventually all requests will take about `2.5s + 10ms` to finish (`2.5s` until speculative retry happens, `10ms` for the last round-trip), unless the node is marked as DOWN before we reach that limit. - Throughput decreases roughly proportionally to the y = 1/x function, as expected from Little's law. Everything goes back to normal when nodes mark C as DOWN, which happens after ~20s by default as explained above. Then coordinators start routing all requests to {A, B} only. This does not happen for graceful shutdowns, where C announces to the cluster that it's shutting down before shutting down, causing other nodes to mark it as DOWN almost immediately. The root cause of the issue is a feedback loop in the metric used to calculate `T`: we perform a speculative retry after `T` -> P99 request latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc. We fix the problem by changing the measurements used for calculating `T`. 
Instead of measuring the entire coordinator read latency, we measure each replica request separately and take the maximum over these measurements. We only take into account the measurements for requests that actually contributed to the request's result. The previous statistic would also measure failed requests latencies. Now we measure only latencies of successful replica requests. Indeed this makes sense for the speculative retry use case; the idea behind speculative retry is that we assume that requests usually succeed within a certain time period, and we should perform the retry if they take longer than that. To measure this time period, taking failed requests into account doesn't make much sense. In the scenario above, for a request that initially goes to {A, C}, the following would happen after applying the fix: - We send the requests to A and C. - After ~10ms A responds. We record the ~10ms measurement. - After ~50ms we perform speculative retry, sending a request to B. - After ~10ms B responds. We record the ~10ms measurement. The maximum over recorded measurements is ~10ms, not ~60ms. The feedback loop is removed. Experiments show that the solution is effective: in scenarios like above, after C is killed, latencies only rise slightly by a constant amount and then maintain their level, as expected. Throughput also drops by a constant amount and maintains its level instead of continuously dropping with an asymptote at 0. Fixes #3746. Fixes #7342. Closes #8783	2021-06-13 16:19:11 +03:00
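The difference between the two measurements can be worked through with a small sketch of the `{A, C}` scenario above. This is not the storage_proxy code: times are integer milliseconds relative to the start of the read, and the `toy_` and helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <optional>
#include <vector>

// One replica request issued by the coordinator; times are in milliseconds
// since the start of the client read.
struct toy_replica_request {
    int sent_ms;
    std::optional<int> answered_ms; // empty if the replica never answered
};

// Old measurement: the whole coordinator read, from the first request sent
// to the last answer received.
int whole_read_latency_ms(const std::vector<toy_replica_request>& reqs) {
    int first_sent = std::numeric_limits<int>::max();
    int last_answered = std::numeric_limits<int>::min();
    for (const auto& r : reqs) {
        first_sent = std::min(first_sent, r.sent_ms);
        if (r.answered_ms) last_answered = std::max(last_answered, *r.answered_ms);
    }
    return last_answered >= first_sent ? last_answered - first_sent : 0;
}

// New measurement: each answered replica request is timed on its own and
// the maximum is taken; unanswered requests are ignored.
int max_replica_latency_ms(const std::vector<toy_replica_request>& reqs) {
    int worst = 0;
    for (const auto& r : reqs) {
        if (r.answered_ms) worst = std::max(worst, *r.answered_ms - r.sent_ms);
    }
    return worst;
}

// T is capped at half of the read request timeout, as described above.
int speculative_retry_threshold_ms(int p99_ms, int read_timeout_ms) {
    return std::min(p99_ms, read_timeout_ms / 2);
}
```

For requests to A (sent at 0, answered at 10), C (sent at 0, never answered) and the speculative retry to B (sent at 50, answered at 60), the old measurement records 60 ms, which already includes the wait of `T`, while the new one records 10 ms. With a P99 of 3 s and a 5 s read timeout, the cap gives `T` = 2.5 s.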
Avi Kivity	d6f3a62c13	Merge 'Add option to forbid SimpleStrategy in CREATE/ALTER KEYSPACE' from Nadav Har'El This series adds a new configuration option - restrict_replication_simplestrategy - which can be used to restrict the ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE statement. This is part of a new effort (dubbed "safe mode") to allow an installation to restrict operations which are un-recommended or dangerous (see issue #8586 for why SimpleStrategy is bad). The new restrict_replication_simplestrategy option has three values: "true", "false", and "warn": For the time being, the default is still "false", which means SimpleStrategy is not restricted, and can still be used freely. Setting a value of "true" means that SimpleStrategy is restricted - trying to create a keyspace with it will fail: cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': 1 }; ConfigurationException: SimpleStrategy replication class is not recommended, and forbidden by the current configuration. Please use NetworkToplogyStrategy instead. You may also override this restriction with the restrict_replication_simplestrategy=false configuration option. Trying to ALTER an existing keyspace to use SimpleStrategy will similarly fail. The value "warn" allows - like "false" - SimpleStrategy to be used, but produces a warning when used to create a keyspace. This warning appears in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh user will see this warning), and also in Scylla's logs. For example: cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': 1 }; Warnings : SimpleStrategy replication class is not recommended, but was used for keyspace try1. The restrict_replication_simplestrategy configuration option can be changed to silence this warning or make it into an error.
Fixes #8586 Closes #8765 * github.com:scylladb/scylla: cql: create_keyspace_statement: move logger out of header file cql: allow restricting SimpleStrategy in ALTER KEYSPACE cql: allow restricting SimpleStrategy in CREATE KEYSPACE config: add configuration option restrict_replication_simplestrategy config: add "tri_mode_restriction" type of configurable value utils/enum_option.hh: add implicit converter to the underlying enum	2021-06-13 15:39:18 +03:00
Nadav Har'El	6f813bd3a1	cql: create_keyspace_statement: move logger out of header file Move the logger declaration from the header file into the only source file that uses it. This is just a small cleanup similar to what the previous patch did in alter_keyspace_statement.{cc,hh}. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:40 +03:00
Nadav Har'El	dea075c038	cql: allow restricting SimpleStrategy in ALTER KEYSPACE In the previous patch we made CREATE KEYSPACE honor the "restrict_replication_simplestrategy" option. In this patch we do the same for ALTER KEYSPACE. We use the same function check_restricted_replication_strategy() used in CREATE KEYSPACE for the logic of what to allow depending on the configuration, and what errors or warnings to generate. One of the non-self-explanatory changes in this patch is to execute(): Previously, alter_keyspace_statement inherited its execute() from schema_altering_statement. Now we need to override it to check if the operation is forbidden before running schema_altering_statement's execute() or to warn after it is run. In the previous patch we didn't need to add a new execute() for create_keyspace_statement because we already had one. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:40 +03:00
Nadav Har'El	b9539d7135	cql: allow restricting SimpleStrategy in CREATE KEYSPACE This patch uses the configuration option which we added in the previous patch, "restrict_replication_simplestrategy", to control whether a user can use the SimpleStrategy replication strategy in a CREATE KEYSPACE operation. The next patch will do the same for ALTER KEYSPACE. As a tri_mode_restriction, the restrict_replication_simplestrategy option has three values - "true", "false", and "warn": The value "false", which today is still the default, means that SimpleStrategy is not restricted, and can still be used freely. The value "true" means that SimpleStrategy is restricted - trying to create a keyspace with it will fail: cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': 1 }; ConfigurationException: SimpleStrategy replication class is not recommended, and forbidden by the current configuration. Please use NetworkToplogyStrategy instead. You may also override this restriction with the restrict_replication_simplestrategy=false configuration option. The value "warn" allows - like "false" - SimpleStrategy to be used, but produces a warning when used to create a keyspace. This warning appears in the CREATE KEYSPACE statement's response (an interactive cqlsh user will see this warning), and also in Scylla's logs. For example: cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': 1 }; Warnings : SimpleStrategy replication class is not recommended, but was used for keyspace try1. The restrict_replication_simplestrategy configuration option can be changed to silence this warning or make it into an error. Because we plan to use the same checks and the same error messages also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in a function check_restricted_replication_strategy() which we will use for ALTER KEYSPACE as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:25 +03:00
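The three-valued behaviour described above can be sketched roughly as follows. This is only an illustration in Python: the function name is borrowed from the commit message, but its signature, the `TriMode` enum, and the `ConfigurationError` class are assumptions made for this sketch and do not reflect the actual C++ implementation.

```python
from enum import Enum


class TriMode(Enum):
    """Hypothetical stand-in for the tri_mode_restriction value type."""
    FALSE = "false"
    WARN = "warn"
    TRUE = "true"


class ConfigurationError(Exception):
    """Hypothetical stand-in for the ConfigurationException seen in cqlsh."""


def check_restricted_replication_strategy(strategy_class, mode, keyspace):
    """Return a warning string, or None; raise if the strategy is forbidden."""
    if strategy_class != "SimpleStrategy" or mode is TriMode.FALSE:
        return None  # nothing to restrict
    if mode is TriMode.TRUE:
        raise ConfigurationError(
            "SimpleStrategy replication class is not recommended, "
            "and forbidden by the current configuration."
        )
    # TriMode.WARN: the statement is allowed, but the caller should surface this.
    return (
        "SimpleStrategy replication class is not recommended, "
        f"but was used for keyspace {keyspace}."
    )


print(check_restricted_replication_strategy("SimpleStrategy", TriMode.WARN, "try1"))
```

Keeping the decision in one function is what lets CREATE KEYSPACE and ALTER KEYSPACE share the same errors and warnings.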
Nadav Har'El	8a4ac6914a	config: add configuration option restrict_replication_simplestrategy This patch adds a configuration option to choose whether the SimpleStrategy replication strategy is restricted. It is a tri_mode_restriction, allowing to restrict this strategy (true), to allow it (false), or to just warn when it is used (warn). After this patch, the option exists but doesn't yet do anything. It will be used in the following two patches to restrict the CREATE KEYSPACE and ALTER KEYSPACE operations, respectively. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:16 +03:00
Nadav Har'El	a3d6f502ad	config: add "tri_mode_restriction" type of configurable value This patch adds a new type of configurable value for our command-line and YAML parsers - a "tri_mode_restriction" - which can be set to three values: "true", "false", or "warn". We will use this value type for many (but not all) of the restriction options that we plan to start adding in the following patches. Restriction options will allow users to ask Scylla to restrict (true), to not restrict (false) or to warn about (warn) certain dangerous or undesirable operations. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:44:20 +03:00
Nadav Har'El	afacffc556	utils/enum_option.hh: add implicit converter to the underlying enum Add an implicit converter of the enum_option to the underlying enum it is holding. This is needed for using switch() on an enum_option. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 13:18:49 +03:00
Avi Kivity	ec60f44b64	main: improve process file limit handling We check that the number of open files is sufficient for normal work (with lots of connections and sstables), but we can improve it a little. Systemd sets up a low file soft limit by default (so that select() doesn't break on file descriptors larger than 1023) and recommends[1] raising the soft limit to the more generous hard limit if the application doesn't use select(), as ours does not. Follow the recommendation and bump the limit. Note that this applies only to scylla started from the command line, as systemd integration already raises the soft limit. [1] http://0pointer.net/blog/file-descriptor-limits.html Closes #8756	2021-06-13 09:19:35 +03:00
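The recommendation followed by this commit (raise the soft open-file limit to the hard limit at startup) can be sketched with Python's standard `resource` module. This is an illustration of the technique only, not the Scylla code; the helper name is hypothetical.

```python
import resource


def raise_nofile_soft_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit.

    Returns the (soft, hard) pair in effect afterwards. If the platform
    refuses the change, the limits are left as they were.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the existing soft limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_nofile_soft_limit())
```

An unprivileged process may raise its soft limit up to the hard limit, which is why no special permissions are needed for this step.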
Tomasz Grabiec	7521301b72	Merge "raft: add tests for non-voters and fix related bugs" from Kostja Add test coverage inspired by etcd for non-voter servers, and fix issues discovered when testing. * scylla-dev/raft-learner-test-v4: raft: (testing) test non-voter can vote raft: (testing) test receiving a confchange in a snapshot raft: (testing) test voter-non-voter config change loop raft: (testing) test non-voter doesn't start election on election timeout raft: (testing) test what happens when a learner gets TimeoutNow raft: (testing) implement a test for a leader becoming non-voter raft: style fix raft: step down as a leader if converted to a non-voter raft: improve configuration consistency checks raft: (testing) test that non-voter stays in PIPELINE mode raft: (testing) always return fsm_debug in create_follower()	2021-06-12 21:36:47 +03:00
Botond Dénes	cb208a56f2	docs/guides/debugging.md: expand section on libthread-db Fix a typo in enabling libthread-db debugging. Add command line snippet which can enable libthread-db debugging on startup. Split the long wall of text about likely problems into separate per-problem subsections. Add sub-section about recently found Fedora bug(?) https://bugzilla.redhat.com/show_bug.cgi?id=1960867. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210603150607.378277-1-bdenes@scylladb.com>	2021-06-12 21:36:47 +03:00
Nadav Har'El	9774c146cc	cql-pytest: add test for connecting with different SSL/TLS versions This is a reproducer for issue #8827, that checks that a client which tries to connect to Scylla with an unsupported version of SSL or TLS gets the expected error alert - not some sort of unexpected EOF. Issue #8827 is still open, so this test is still xfailing. However, I verified that with a fix for this issue, the test passes. The test also prints which protocol versions worked - so it also helps checking issue #8837 (about the ancient SSL protocol being allowed). Refs #8837 Refs #8827 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>	2021-06-12 21:36:47 +03:00
Pavel Emelyanov	7b1f2d91a5	scylla-gdb: Remove maximum-request-size report The recent seastar update moved the variable again, so to have proper support for it we'd need to have 2 try-catch attempts and a default. Or 1 try-catch, but make sure the maintainer commits this patch AND the seastar update in one go, so that the intermediate variable doesn't creep into an intermediate commit. Or accept that the scylla-gdb test is briefly not bisect-safe. Instead of making this complex choice I suggest just dropping the volatile variable from the script altogether. This thing is actually a constant derived from the latency goal and the io-properties.yaml file, so it can be calculated without gdb's help (unlike run-time bits like group rovers or numbers of queued/executing resources). To free developers from doing all this math by hand there's an "ioinfo" tool that (when run with correct options) prints the results of this math on the screen. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210610120151.1135-1-xemul@scylladb.com>	2021-06-11 19:06:43 +02:00
Michael Livshin	2bbc293e22	tests: improve error reporting of test_env::reusable_sst() Distinguish the "no such sstable" case from any reading errors. While at it, coroutinize the function. Refs #8785. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>	2021-06-11 19:06:43 +02:00
Pavel Emelyanov	fbd98e6292	alternator: Move start-stop code into controller This move is not "just move", but also includes: - putting the whole thing into seastar::async() - switch from locally captured dependencies into controller's class members - making smp_service_groups optional because it doesn't have a default constructor and should somehow survive on the constructed controller until its start() Also copy a few bits from main that can be generalized later: - get_or_default() helper from main - sharded_parameter lambda for cdc - net family and preferred thing from main ( this also fixed the indentation broken by previous patch ) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:17:27 +03:00
Pavel Emelyanov	9e2ad77436	alternator: Move the whole starting code into a sched group The controller won't have the database_config at hands to get the sched group from. All other client services run the whole controller start in the needed sched group, so prepare the alternator controller for that. To make it compile (and while-at-it) also move up the sharded server and executor instances and the smp_service_group. All of these will migrate onto the controller in the next patch. ( the indentation is deliberately left broken ) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:11:02 +03:00
Pavel Emelyanov	f918a75572	alternator: Dont capture db, use cfg When .init()ing the server one needs to provide the max_concurrent_requests_per_shard value from config. Instead of carrying the database around for it -- use the db::config itself which is at hand. All the shards share its instance anyway. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:09:16 +03:00
Pavel Emelyanov	4aad618409	alternator: Controller skeleton Add the controller class with all the needed dependencies. For now completely unused (thus a bunch of (void)-s here and there). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:08:37 +03:00
Pavel Emelyanov	316e9af234	alternator: Controller basement Add header and source file for transport- (and thrift-) like controller that'll do all the bookkeeping needed to start and stop this client service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:06:10 +03:00
Pavel Emelyanov	773d2fe2a4	alternator: Drop storage service from executor It's completely unused in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:05:11 +03:00
Konstantin Osipov	2be8a73c34	raft: (testing) test non-voter can vote When a non-voter is asked for a vote, it must vote to preserve liveness. In Raft, servers respond to messages without consulting their current configuration, and the non-voter may not have the latest configuration when it is asked to vote.	2021-06-11 17:16:57 +03:00
Konstantin Osipov	eaf32f2c3c	raft: (testing) test receiving a confchange in a snapshot	2021-06-11 17:16:56 +03:00
Konstantin Osipov	d08ad76c24	raft: (testing) test voter-non-voter config change loop	2021-06-11 17:16:55 +03:00
Konstantin Osipov	6e4619fe87	raft: (testing) test non-voter doesn't start election on election timeout	2021-06-11 17:16:55 +03:00
Konstantin Osipov	c8ae13a392	raft: (testing) test what happens when a learner gets TimeoutNow Once learner receives TimeoutNow it becomes a candidate, discovers it can't vote, doesn't increase its term and converts back to a follower. Once entries arrive from a new leader it updates its term.	2021-06-11 17:16:55 +03:00
Konstantin Osipov	a972269630	raft: (testing) implement a test for a leader becoming non-voter	2021-06-11 17:16:55 +03:00
Konstantin Osipov	ba046ed1ab	raft: style fix	2021-06-11 17:16:54 +03:00
Konstantin Osipov	b0a1ebc635	raft: step down as a leader if converted to a non-voter If the leader becomes a non-voter after a configuration change, step down and become a follower. Non-voting members are an extension to Raft, so the protocol spec does not define whether they can be leaders. I can not think of a reason why they can't, yet I also can not think of a reason why it's useful, so let's forbid this. We already do not allow non-voters to become candidates, and they ignore timeout_now RPC (leadership transfer), so they already can not be elected.	2021-06-11 17:16:50 +03:00
Konstantin Osipov	684e0d2a8c	raft: improve configuration consistency checks Isolate the checks for configuration transitions in a static function, to be able to unit test outside class server. Split the condition of transitioning to an empty configuration from the condition of transitioning into a configuration with no voters, to produce more user-friendly error messages. Allow to transfer leadership in a configuration when the only voter is the leader itself. This would be equivalent to syncing the leader log with the learner and converting the leader to the follower itself. This is safe, since the leader will re-elect itself quickly after an election timeout, and may be used to do a rolling restart of a cluster with only one voter. A test case follows.	2021-06-11 17:16:47 +03:00
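The split between the two error conditions described above can be sketched as a standalone check. The function name, the dictionary representation of a configuration, and the error messages are all hypothetical; the real check is a static C++ function whose signature is not shown in this log.

```python
def check_configuration(servers):
    """Validate a target configuration.

    servers maps a server id to True (voter) or False (non-voter).
    Raises ValueError with a distinct message for each failure mode.
    """
    if not servers:
        raise ValueError("Attempt to transition to an empty configuration")
    if not any(servers.values()):
        raise ValueError("Attempt to transition to a configuration with no voters")


# A single voter plus a learner is accepted.
check_configuration({"A": True, "B": False})

for bad in ({}, {"A": False, "B": False}):
    try:
        check_configuration(bad)
    except ValueError as e:
        print(e)
```

Because the check does not depend on server state, it can be unit tested in isolation, which is the motivation the commit message gives for making it a static function.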
Konstantin Osipov	3e6fd5705b	raft: (testing) test that non-voter stays in PIPELINE mode Test that configuration changes preserve PIPELINE mode.	2021-06-11 17:07:39 +03:00
Konstantin Osipov	1dfe946c91	raft: (testing) always return fsm_debug in create_follower() create_follower() is a test helper, so it's OK to return a test-enabled FSM from it. This will be used in a subsequent patch/test case.	2021-06-11 12:24:43 +03:00
Alejo Sanchez	ff34a6515d	raft: replication test: fix elect_new_leader Recently, the logic of elect_new_leader was changed to allow the old leader to vote for the new candidate. But the implementation is wrong as it re-connects the old leader in all cases, regardless of whether the nodes were already disconnected. Check if both the old leader and the requested new leader are connected first, and only if that is the case can the old leader participate in the election. There were occasional hangs in the loop of elect_new_leader because other nodes besides the candidate were ticked. This patch fixes the loop by removing ticks inside of it. The loop is needed to handle prevote corner cases (e.g. 2 nodes). While there, also wait for the log on all followers, to prevent a previously dropped leader from becoming a dueling candidate. And update _leader only if it was changed. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210609193945.910592-3-alejo.sanchez@scylladb.com>	2021-06-10 12:36:25 +02:00
Alejo Sanchez	add12d801d	raft: log ignored prevote Add a log line for ignored prevote. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210609193945.910592-2-alejo.sanchez@scylladb.com>	2021-06-10 12:33:34 +02:00
Benny Halevy	e0622ef461	compaction_manager: stop_ongoing_compactions: print reason for stopping Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210610084704.388215-1-bhalevy@scylladb.com>	2021-06-10 11:52:57 +03:00
Piotr Sarna	7506f44c77	cql3: use existing constant for max result in indexed statements Original code which introduced enforcing page limits for indexed statements created a new constant for max result size in bytes. Botond reported that we already have such a constant, so it's now used instead of reinventing it from scratch. Closes #8839	2021-06-10 11:08:54 +03:00
Nadav Har'El	b26fcf5567	test/alternator: increase timeouts in test_tracing.py The query tracing tests in test/alternator's test_tracing.py had one timeout of 30 seconds to find the trace, and one unclearly-coded timeout for finding the right content for the trace. We recently saw both timeouts exceeded in tests, but only rarely and only in debug mode, in a run 100 times slower than normal. This patch increases both timeouts to 100 seconds. Whatever happens then, we win: If the test stops failing, we know the new timeout was enough. If the test continues to fail, we will be able to conclude that we have a real bug - e.g., perhaps one of the LWT operations has a bug causing it to hang indefinitely. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210608205026.1600037-1-nyh@scylladb.com>	2021-06-10 09:19:01 +03:00
Benny Halevy	8ecc626c15	queue_reader_handle: mark copy constructor noexcept It is trivially so, as std::exception_ptr is nothrow default constructible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210609135925.270883-2-bhalevy@scylladb.com>	2021-06-09 20:09:01 +03:00
Benny Halevy	3100cdcc65	queue_reader_handle: move-construct also _ex We're only moving the other reader without the other's exception (as it may already be abandoned or aborted). While at it, mark the constructor noexcept. Fixes #8833 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210609135925.270883-1-bhalevy@scylladb.com>	2021-06-09 20:09:01 +03:00
Pavel Emelyanov	990db016e9	transport: Untie transport and database Both controller and server only need database to get config from. Since controller creation only happens in main() code which has the config itself, we may remove database mentioning from transport. Previous attempt was not to carry the config down to the server level, but it stepped on an updateable_value landmine -- the u._v. isn't copyable cross-shard (despite the docs) and to properly initialize server's max_concurrent_requests we need the config's named_value member itself. The db::config that flies through the stack is const reference, but its named_values do not get copied along the way -- the updateable value accepts both references and const references to subscribe on. tests: start-stop in debug mode Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210607135656.18522-1-xemul@scylladb.com>	2021-06-09 20:04:12 +03:00
Eliran Sinvani	9bfb2754eb	dist: rpm: Add specific versioning and python3 dependency The Red Hat packages were missing two things: first, the metapackage didn't depend on the python3 package at all, and second, the scylla-server package dependencies didn't contain a version as part of the dependency, which can cause problems during upgrade. Doing both of the things listed here is a bit of an overkill, as either one of them separately would solve the problem described in #XXXX, but both should be applied in order to express the correct concept. Fixes #8829 Closes #8832	2021-06-09 20:02:43 +03:00
Asias He	0665d9c346	gossip: Handle nodes removed from live endpoints directly When a node is removed from the _live_endpoints list directly, e.g., a node being decommissioned, it is possible the node might not be marked as down in the gossiper::failure_detector_loop_for_node loop before the loop exits. When the gossiper::failure_detector_loop loop starts again, the node will not be considered because it is not present in the _live_endpoints list any more. As a result, the node will not be marked as down through the gossiper::failure_detector_loop_for_node loop. To fix, we mark the nodes that are removed from the _live_endpoints list as down in the gossiper::failure_detector_loop loop. Fixes #8712 Closes #8770	2021-06-09 15:02:25 +02:00
Tomasz Grabiec	419ee84d86	Merge "sstable: validate first and last keys ordering" from Benny In #8772, an assert validating first token <= last token failed in leveled_manifest::overlapping. It is unclear how we got to that state, so add validation in sstable::set_first_and_last_keys() that the to-be-set first and last keys are well ordered. Otherwise, throw malformed_sstable_exception. set_first_and_last_keys is called both on the write path from the sstable writer before the sstable is sealed, and on the open/load path via update_info_for_opened_data(). This series also fixes issues with unit tests with regards to first/last keys so they won't fail the validation. Refs #8772 Test: unit(dev) DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug) * tag 'validate-first-and-last-keys-ordering-v1': sstable: validate first and last keys ordering test: lib: reusable_sst: save unexpected errors test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard test: sstable_test: define primary key in schema for compressed sstable	2021-06-09 14:43:02 +02:00
Avi Kivity	a57d8eef49	Merge 'streaming: make_streaming_consumer: close reader on errors' from Benny Halevy Currently, if e.g. find_column_family throws an error, as seen in #8776 when the table was dropped during repair, the reader is not closed. Use a coroutine to simplify error handling and close the reader if an exception is caught. Also, catch an error inside the lambda passed to make_interposer_consumer when making the shared_sstable for streaming, and close the reader there and return an exceptional future early, since the reader will not be moved to sst->write_components, which assumes ownership over it and closes it in all cases. Fixes #8776 Test: unit(dev) DTest: repair_additional_test.py:RepairAdditionalTest.repair_while_table_is_dropped_test (dev, debug) w/ https://github.com/scylladb/scylla/pull/8635#issuecomment-856661138 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8782 * github.com:scylladb/scylla: streaming: make_streaming_consumer: close reader on errors streaming: make_streaming_consumer: coroutinize returned function	2021-06-09 15:02:36 +03:00
Tomasz Grabiec	ce7a404f17	Merge "Cleanups/refactoring for Raft Group 0" from Kostja * scylla-dev/raft-group-0-part-1-rebase: raft: (service) pass Raft service into storage_service raft: (service) add comments for boot steps raft: add ordering for raft::server_address based on id raft: (internal) simplify construction of tagged_id raft: (internal) tagged_id minor improvements	2021-06-09 10:48:05 +02:00
Avi Kivity	d2157dfea7	Merge 'locator: token_metadata: simplify `tokens_iterator`' from Michał Chojnowski `ring_range()`/`tokens_iterator` are more complicated than they need to be. The `include_min` parameter is not used anywhere, and `tokens_iterator` is pimplified without a good reason. Simplify that. Closes #8805 * github.com:scylladb/scylla: locator: token_metadata: depimplify tokens_iterator locator: token_metadata: remove _ring_pos from tokens_iterator_impl locator: token_metadata: remove tokens_end() locator: token_metadata: remove `include_min` from tokens_iterator_impl locator: token_metadata: remove the `include_min` parameter from `ring_range()`	2021-06-08 15:42:41 +03:00
Konstantin Osipov	267a8e99ad	raft: (service) pass Raft service into storage_service Raft group 0 initialization and configuration changes should be integrated with Scylla cluster assembly, happening when starting the storage service and joining the cluster. Prepare for this. Since the Raft service depends on the query processor, and the query processor depends on the storage service, to break a dependency loop split Raft initialization into two steps: first start an under-constructed instance of the "sharded" Raft service, which accepts an under-constructed instance of the "sharded" query_processor and is then passed into the storage service start function; then load the local state of Raft groups from system tables once the query processor starts. Consistently abbreviate the raft_services instance as raft_svcs, as is the convention at Scylla. Update the tests.	2021-06-08 14:52:32 +03:00
Konstantin Osipov	959bd21cdb	raft: (service) add comments for boot steps	2021-06-08 14:52:32 +03:00
Konstantin Osipov	b81580f3c6	raft: add ordering for raft::server_address based on id	2021-06-08 14:52:32 +03:00
Konstantin Osipov	d42d5aee8c	raft: (internal) simplify construction of tagged_id Make it easy to construct tagged_id from UUID.	2021-06-08 14:52:32 +03:00
Konstantin Osipov	c9a23e9b8a	raft: (internal) tagged_id minor improvements Introduce a syntax helper tagged_id::create_random_id(), used to create a new Raft server or group id. Provide a default ordering for tagged ids, for use in Raft leader discovery, which selects the smallest id for leader.	2021-06-08 14:52:32 +03:00
Benny Halevy	5a8531c4c8	repair: get_sharder_for_tables: throw no_such_column_family Instead of std::runtime_error with a message that resembles no_such_column_family, throw a no_such_column_family given the keyspace and table uuid. The latter can be explicitly caught and handled if needed. Refs #8612 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210608113605.91292-1-bhalevy@scylladb.com>	2021-06-08 14:45:44 +03:00
Nadav Har'El	355dbf2140	test/cql-pytest: option for running the tests over SSL This patch adds a "--ssl" option to test/cql-pytest's pytest, as well as to the run script test/cql-pytest/run. When "test/cql-pytest/run --ssl" is used, Scylla is started listening for encrypted connections on its standard port (9042) - using a temporary unsigned certificate. Then, the individual tests connect to this encrypted port using TLSv1.2 (Scylla doesn't support earlier versions of SSL) instead of TCP. This "--ssl" feature allows writing tests which stress various aspects of the connection (e.g., oversized requests - see PR #8800), and then running those tests in both TCP and SSL modes. Fixes #8811 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210607200329.1536234-1-nyh@scylladb.com>	2021-06-08 11:43:20 +02:00
Kamil Braun	3778a816c1	storage_proxy: abstract_read_executor: make certain methods private The methods `make_mutation_data_request`, `make_data_request` and `make_digest_request` were marked as protected, but weren't used by deriving classes. The "API" for deriving classes is encapsulated through plural versions of these functions, such as `make_mutation_data_requests` (note the "s" at the end), which send a request to a set of replicas (rather than a single replica) but also do other important things - like gathering statistics - hence we don't want the deriving classes to use them directly. Marking these singular methods as private communicates the intent more clearly.	2021-06-08 12:32:47 +03:00
Asias He	5c9816615f	streaming: Enable off-strategy compaction for bootstrap and replace The off-strategy compaction is now enabled for repair based node operations. It is not bound to repair based node operations though. It makes sense to enable it for streaming based node operations too. Fixes #8820 Closes #8821	2021-06-08 12:13:20 +03:00
Pavel Emelyanov	4ad4208426	util: Drop int_or_strong_ordering concept Nobody uses it now. All tri-comparing stuff is strong_ordering now. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-08 11:40:55 +03:00
Pavel Emelyanov	3f34878708	tests: Switch total-order-check onto strong_ordering This helper uses int_or_strong_ordering to facilitate vector ordering checks. The rework is straightforward. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-08 11:40:55 +03:00
Pavel Emelyanov	133692477d	to_string: Add formatter for strong_ordering There's no default one (yet), but total_order_check.hh wants to print and format these values. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-08 11:33:04 +03:00
Raphael S. Carvalho	f8b2a6c923	sstables: Optimize incremental selection when only primary set contains sstables Compound set's incremental selector isn't needed when only one set contains sstables, which is the common case because secondary set will only contain data during maintenance operations. From now on, if only primary set contains data, its selector will be built directly without compound set's selector acting as an interposer. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210607193651.126937-1-raphaelsc@scylladb.com>	2021-06-08 10:25:19 +03:00
Benny Halevy	2e93996473	streaming: make_streaming_consumer: close reader on errors Currently, if e.g. find_column_family throws an error, as seen in #8776 when the table was dropped during repair, the reader is not closed. Use a coroutine to simplify error handling and close the reader if an exception is caught. Also, catch an error inside the lambda passed to make_interposer_consumer when making the shared_sstable for streaming, and close the reader there and return an exceptional future early, since the reader will not be moved to sst->write_components, which assumes ownership over it and closes it in all cases. Fixes #8776 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-08 08:50:46 +03:00
Benny Halevy	42028c324c	streaming: make_streaming_consumer: coroutinize returned function To simplify error handling in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-08 08:48:33 +03:00
Pavel Emelyanov	cd166fa942	tests: Return strong-ordering from tri-comparators Some collection tests still use int-s for it. The conversion is straightforward. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-07 21:41:08 +03:00
Avi Kivity	3e3003fcc1	Merge 'cql3: limit the concurrency of indexed statements' from Piotr Sarna Indexed select statements fetch primary key information from their internal materialized views and then use it to query the base table. Unfortunately, the current mechanism for retrieving base table rows makes it easy to overwhelm the replicas with unbounded concurrency - the number of concurrent ops is increased exponentially until a short read is encountered, but it's not enough to cap the concurrency - if data is fetched row-by-row, then short reads usually don't occur and as a result it's easy to see concurrency of 1M or higher. In order to avoid overloading the replicas, the concurrency of indexed queries is now capped at 4096 and additionally throttled if enough results are already fetched. For paged queries it means that the query returns as soon as 1MB of data is ready, and for unpaged ones the concurrency will no longer be doubled as soon as the previous iteration fetched 1MB of results. The fixed 4096 value can be subject to debate, its reasoning is as follows: for 2KiB rows, so moderately large but not huge, they result in fetching 10MB of data, which is the granularity used by replicas. For 200B rows, which is rather small, the result would still be around 1MB. At the same time, 4096 separate tasks also means 4096 allocations, so increasing the number also strains the allocator. Fixes #8799 Tests: unit(release), manual: observing metrics of modified index_paging_test Closes #8814 * github.com:scylladb/scylla: cql3: limit the transitional result size for indexed queries cql3: return indexed pages after 1MB worth of data cql3: limit the concurrency of indexed statements	2021-06-07 18:00:51 +03:00
Gleb Natapov	5d15ecb7e5	raft: do not block io_fiber just because of a slow follower Currently if append_message cannot be sent to one of the followers the entire io_fiber will block, which eventually stops the replication. The patch changes the message sending part of io_fiber to be non-blocking. The code adds a hash table that is used to keep track of append_request sending status per destination. All the remaining futures are waited for during abort. Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>	2021-06-07 16:55:14 +02:00
Gleb Natapov	01b6a2eb38	raft: randomized_nemesis_test: tick virtual clock less aggressively Currently each tick of the virtual clock immediately schedules the next one at the end of the task queue, but this is too aggressive. If a tick generates work that needs two tasks to be scheduled one after another, such an implementation will make the task queue grow to infinity. Considering that in the debug mode even a ready future causes preemption, and task queue shuffling may cause two or more ticks to be executed without any other work done in the middle, it is very easy to get into such a situation. The patch changes the virtual clock to tick only when a shard is idle. Message-Id: <20210606140305.2930189-1-gleb@scylladb.com>	2021-06-07 16:54:56 +02:00
Piotr Sarna	df0d44486a	cql3: limit the transitional result size for indexed queries Unpaged indexed queries already have a concurrency limit of 4096, but now the concurrency is further limited by the number of bytes fetched previously. Once this number reaches 1MB, the concurrency will not be increased in consecutive queries to avoid overload.	2021-06-07 16:29:18 +02:00
Piotr Sarna	60e55b6c7f	cql3: return indexed pages after 1MB worth of data Currently there's no practical limit of the resulting page size for an indexed query, because it simply translates a page worth of base primary keys into base rows. In order to avoid sending too large pages, the result is returned after hitting a 1MB limit.	2021-06-07 16:05:50 +02:00
Piotr Sarna	8eeac10ded	cql3: limit the concurrency of indexed statements Indexed select statements fetch primary key information from their internal materialized views and then use it to query the base table. Unfortunately, the current mechanism for retrieving base table rows makes it easy to overwhelm the replicas with unbounded concurrency - the number of concurrent ops is increased exponentially until a short read is encountered, but it's not enough to cap the concurrency - if data is fetched row-by-row, then short reads usually don't occur and as a result it's easy to see concurrency of 1M or higher. In order to avoid overloading the replicas, the concurrency of indexed queries is now capped at 4096. The number can be subject to debate, its reasoning is as follows: for 2KiB rows, so moderately large but not huge, they result in fetching 10MB of data, which is the granularity used by replicas. For 200B rows, which is rather small, the result would still be around 1MB. At the same time, 4096 separate tasks also means 4096 allocations, so increasing the number also strains the allocator. Fixes #8799 Tests: unit(release), manual: observing metrics of modified index_paging_test	2021-06-07 15:56:15 +02:00
Benny Halevy	5f31beaf97	flat_mutation_reader: unify reader_consumer declarations Put the reader_consumer declaration in flat_mutation_reader.hh and include it instead of declaring the same `using reader_consumer` declaration in several places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210607075020.31671-1-bhalevy@scylladb.com>	2021-06-07 16:11:18 +03:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	0048c404d2	Merge 'dht: token: make some cosmetic changes' from Michał Chojnowski This is a set of a few cosmetic changes in dht/token. Mostly some comments and a simplification of `midpoint()`. Closes #8803 * github.com:scylladb/scylla: dht: token: add a comment excusing the `const bytes&` constructor dht: token: simplify midpoint() dht: token: add a comment to normalize() dht: token: use {read,write}_unaligned instead of std::copy_n dht: token-sharding: fix a typo in a comment	2021-06-07 15:41:15 +03:00
Piotr Sarna	fa29b79c20	transport: close connections when too large requests arrive Too large requests are currently handled by the CQL server by skipping them and sending back an error response. That's however wasteful and dangerous: bogus request sizes will force Scylla to potentially skip gigabytes of data - and skipping is done by simply reading from the socket, so it may result in gigabytes of bandwidth wasted. Even if the request size is not bogus, closing the connection forces users to adjust their request sizes, which should be done anyway. Originally, there was a bug in handling too large requests which only read their headers and then left the connection in a broken, undefined state, trying to interpret the rest of the large request as a next CQL header. It was later fixed to skip the request, but closing the connection is a safer thing to do. Fixes #8798 Closes #8800	2021-06-07 12:23:55 +03:00
Avi Kivity	e6c5a63581	Merge "Fix several issues on transport stop" from Pavel E " There's a bunch of issues with starting and stopping of cql_server with the help of cql_controller. fixes: #8796 tests: manual(start + stop, start + exception on cql_set_state() ) unit not run, they don't mess with transport controller " * 'br-transport-stop-fixes' of https://github.com/xemul/scylla: transport/controller: Stop server on state change failure too transport/controller: Rollback server start on state change failure too transport/controller: Do not leave _server uninitialized transport/controller: Rework try-catch into defers	2021-06-07 11:41:36 +03:00
Michał Chojnowski	3ea97e7a11	locator: token_metadata: depimplify tokens_iterator This class has no meaningful dependencies, so pimpl is unreasonable here.	2021-06-07 10:41:23 +02:00
Michał Chojnowski	baaac5bb7c	locator: token_metadata: remove _ring_pos from tokens_iterator_impl _ring_pos is slightly confusing. I thought at first that it doesn't do anything since operator== doesn't use it. This cosmetic patch tries to improve the readability, and also removes operator!= which is generated automatically in C++20.	2021-06-07 10:41:22 +02:00
Michał Chojnowski	30e5290cea	locator: token_metadata: remove tokens_end() It's an internal method of token_metadata_impl and doesn't have to exist.	2021-06-07 10:41:11 +02:00
Alejo Sanchez	bd168d57ff	raft: fix vote reply handling in prevote Do not register a reply to prevote as a real vote Found and authored by @kostja. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210604122530.1975388-1-alejo.sanchez@scylladb.com>	2021-06-06 19:18:49 +03:00
Tomasz Grabiec	50d64646cd	Merge "raft: replication test fixes and OOP refactor" from Alejo Feature requests, fixes, and OOP refactor of replication_test. Note: all known bugs and hangs are now fixed. A new helper class "raft_cluster" is created. Each move of a helper function to the class has its own commit. New helpers are provided. To simplify code, for now only a single apply function can be set per raft_cluster. No tests were using it in any other way. In the future, there could be custom apply functions per server dynamically assigned, if this becomes needed. * alejo/raft-tests-replication-02-v3-30: (66 commits) raft: replication test: wait for log for both index and term raft: replication test: reset network at construction raft: replication test: use lambda visitor for updates raft: replication test: move structs into class raft: replication test: move data structures to cluster class raft: replication test: remove shared pointers raft: replication test: move get_states() to raft_cluster raft: replication test: test_server inside raft_cluster raft: replication test: rpc declarative tests raft: replication test: add wait_log raft: replication test: add stop and reset server raft: replication test: disconnect 2 support raft: replication test: explicit node_id naming raft: replication test: move definitions up raft: replication test: no append entries support raft: replication test: fix helper parameter raft: replication test: stop servers out of config raft: replication test: wait log when removing leader from configuration raft: replication test: only manipulate servers in configuration raft: replication test: only cancel rearm ticker for removed server ...	2021-06-06 19:18:49 +03:00
Piotr Sarna	cb17aa1e53	Merge 'test/alternator: rewrite run script to share code with cql-pytest's run script' from Nadav Har'El In this small series, I rewrite test/alternator/run to Python using the utility functions developed for test/cql-pytest. In the future, we should do the same to test/redis/run and test/scylla-gdb/run. The benefit of this rewrite is less code duplication (all run scripts start with the same duplicate code to deal with temporary directories, to run Scylla IP addresses, etc.), but most importantly - in the future fixes we do to cql-pytest (e.g., parameters needed to start Scylla efficiently, how to shut down Scylla, etc.) will appear automatically in alternator test without needing to remember to change both. Another benefit is that test/alternator/run will now be Python, not a shell script. This should make it easier to integrate it into test.py (refs #6212) in the future - if we want to. Closes #8792 * github.com:scylladb/scylla: test/alternator: rewrite test/alternator/run script in Python test/cql-pytest: make test run code more general	2021-06-06 19:18:49 +03:00
Nadav Har'El	fe1fa9d72b	docs: update Alternator's compatibility.md In the last year, four new features were added to DynamoDB which we don't yet support - Kinesis Streams, PartiQL, Contributor Insights and Export to S3. Let's document them as missing Alternator features, and point to the four newly-created issues about these features. Refs #8786 Refs #8787 Refs #8788 Refs #8789 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210603125825.1179171-1-nyh@scylladb.com>	2021-06-06 19:18:49 +03:00
Avi Kivity	872cd8f692	test: adjust copyright statement to use ScyllaDB rather than old name	2021-06-06 19:18:49 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Avi Kivity	b3ce1c8b40	gdb: prepare for Seastar's "smp: allow having multiple instances of the smp class" scylladb/seastar@e6463df8a0 ("smp: allow having multiple instances of the smp class") changes the type of seastar::smp::_qs from a unique_ptr to a regular pointer. Adjust for that change, with a fallback to support older versions. Closes #8784	2021-06-06 19:18:49 +03:00
Nadav Har'El	48ff641f67	Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy Shutdown must never fail, otherwise it may cause hangs as seen in https://github.com/scylladb/scylla/issues/8577. This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files. In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging. Fixes #8577 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8578 * github.com:scylladb/scylla: commitlog: segment_manager::shutdown: abort on errors commitlog: allocate_segment_ex: make_checked_file	2021-06-06 19:18:49 +03:00
Avi Kivity	8a4abe9895	cql3: expression: don't copy expression in has_supporting_index() std::bind() copies the bound parameters for safekeeping. Here this includes expr, which can be quite heavyweight. Use std::ref() to prevent copying. This is safe since the bound expression is executed and discarded before has_supporting_index() returns. Closes #8791	2021-06-06 19:18:49 +03:00
Nadav Har'El	cee0340c89	scripts/pull_github_pr.sh: do not hard-code project name The current pull_github_pr.sh hard-codes the project name "scylladb/scylla". Let's determine it automatically, from the git origin url. This will allow using exactly the same script in other Scylla subprojects, e.g., Seastar. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210318142624.1794419-1-nyh@scylladb.com>	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	2187a59089	treewide: move `service::cas_request` out from `storage_proxy.hh` And remove all remaining inclusions of `storage_proxy.hh` in the headers. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	e0749d6264	treewide: some random header cleanups Eliminate not used includes and replace some more includes with forward declarations where appropriate. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	142d3b5ad9	cdc: self-sufficient headers fixup Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Gleb Natapov	bb822c92ab	raft: change raft::rpc api to return void for most sending functions Most RAFT packets are sent very rarely during special phases of the protocol (like election or leader stepdown). The protocol itself does not care if a packet is sent or dropped, so returning futures from their send function does not serve any purpose. Change the raft's rpc interface to return void for all packet types but append_request. We still want to get a future from sending append_request for backpressure purposes since replication protocol is more efficient if there is no packet loss, so it is better to pause a sender than dropping packets inside the rpc. Rpc is still allowed to drop append_requests if overloaded.	2021-06-06 19:18:49 +03:00
Gleb Natapov	f5a54d6c05	raft: move ELECTION_TIMEOUT definition to a public header Move ELECTION_TIMEOUT definition to be visible to outside modules.	2021-06-06 19:18:49 +03:00
Gleb Natapov	87844c0ce1	raft: remove unused clock type definition RAFT uses logical clock now and this define is from older times.	2021-06-06 19:18:49 +03:00
Gleb Natapov	90ea71da54	raft: wait for io and applier fiber to stop before aborting snapshots and waiters IO and applier fibers may update waiters and start new snapshot transfers, so abort() needs to wait for them to stop before proceeding to abort waiters and snapshot transfers.	2021-06-06 19:18:49 +03:00
Yaron Kaikov	6a447db8a8	scylla_util.py: Fix Azure support for machine-image In https://github.com/scylladb/scylla/pull/7807 we added support for Azure instances in Scylla. The following changes are required in order for machine-image to work: 1) fix the wrong metadata URL and update metadata path values (introduced in `f627fcbb0c`) 2) fix the naming of functions which are used by machine-image 3) add missing functions which are required by machine-image 4) clean up unused functions Closes #8596	2021-06-06 09:21:23 +03:00
Asias He	2a7b855255	repair: Init repair metrics during startup The _node_ops_metrics is thread local, it is constructed when it is first accessed. If there are no node operations, the metrics will not be shown. To make the metrics more consistent, init during startup. Refs #8311 Closes #8780	2021-06-06 09:21:23 +03:00
Benny Halevy	3f9bad0f0a	test: compound_test: use tests::random For reproducibility. Test: compound_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210602061910.286893-2-bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	40e032ff8b	test: compound_test: use the seastar test framework Prepare for using tests::random instead of std::rand for reproducibility. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210602061910.286893-1-bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Asias He	9b902fad79	gossiper: Update timestamp for nodes in ack and ack2 msg handler In commit `425e3b1182` (gossip: Introduce direct failure detector), the call to notify_failure_detector inside ack and ack2 msg handler was removed since there is no need to update the old failure detector anymore. However, the timestamp for endpoint_state is also updated inside notify_failure_detector. With the new failure detector we still need the timestamp for endpoint_state. Otherwise, nodes might be removed from gossip wrongly. For example, as we saw in issue #8702: INFO 2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2 has been silent for 5000ms, removing from gossip To fix, update the timestamp as we did before in ack and ack2 msg handler. Fixes #8702 Closes #8777	2021-06-06 09:21:23 +03:00
Benny Halevy	f081e651b3	memtable_list: rename request_flush to just flush Now that it returns a future that always waits on pending flushes there is no point in calling it `request_flush`. `flush()` is simpler and better describes its function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	4f20cd3bea	memtable_list: rename seal_active_memtable_immediate to seal_active_memtable Now that there's no more seal_active_memtable_delayed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	ba65b90b34	memtable_list: get rid of seal_active_memtable_delayed This path is unused since `e5be3352cf`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Calle Wilund	3b55ef36d1	cf_prop_defs: Fix extensions merge to handle removal Fixes #8773 When refactored for cdc, properties -> extensions merge was modified so it did not handle _removal_ (i.e. an extension function returning null -> no entry in new map). This causes certain enterprise extensions to not be able to disable themselves. Fixed by filtering existing extensions by property keywords. Unit test added. Closes #8774	2021-06-06 09:21:23 +03:00
Nadav Har'El	f22ed3ff5c	test/alternator: reduce very high timeout in one tracing test In test_tracing.py::test_slow_query_log, there was what looked like an innocent 30-second timeout, but this was in fact an 8-minute timeout - because it started with sleeping 1 second, then 2 seconds, then 3, ... until 30 seconds. Such a high timeout is frustrating when trying to debug failures in the test - which is only expected to take 2 seconds (and all of it because of an artificial timeout). So fix the loop to stop iterating after 60 seconds (a compromise between 30 seconds and 8 minutes...), sleeping a constant amount between iterations. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>	2021-06-06 09:21:23 +03:00
Avi Kivity	100d6f4094	build: enable -Wunused-function Also drop a single violation in transport/server.cc. This helps prevent dead code from piling up. Three functions in row_cache_test that are not used in debug mode are moved near their user, and under the same ifdef, to avoid triggering the error. Closes #8767	2021-06-06 09:21:23 +03:00
Benny Halevy	6ce826206a	sstables: use vector empty method rather than size Testing std::vector::empty() is slightly more efficient than testing for `size() > 0`. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210601115552.155148-2-bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	0565ba31a1	compaction_info: is_stop_requested: use sstring::empty rather than size `!empty()` is slightly more efficient than `size() > 0`. While at it, mark the function noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210601115552.155148-1-bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	948a9da832	table: do_apply: verify that _async_gate is open Applying changes to the memtable after table::stop is prohibited. Verify that by making sure that the _async_gate is still open in `do_apply`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210601055042.41380-1-bhalevy@scylladb.com>	2021-06-06 09:21:23 +03:00
Benny Halevy	82a263f672	database: apply_in_memory: run_when_memory_available under table::run_async Make sure to apply the mutation under the table's _async_gate. Fixes #8790 Test: unit(dev), view_build_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8794	2021-06-06 09:21:23 +03:00
Calle Wilund	131da30856	table: Always use explicit commitlog discard + clear out rp_set Fixes #8733 If a memtable flush is still pending when we call table::clear(), we can end up doing a "discard-all" call to commitlog, followed by a per-segment-count (using rp_set) _later_. This will foobar our internal usage counts and quite probably cause assertion failures. Fixed by always doing per-memtable explicit discard call. But to ensure this works, since a memtable being flushed remains on memtable list for a while (why?), we must also ensure we clear out the rp_set on discard. Closes #8766	2021-06-06 09:21:23 +03:00
Pavel Emelyanov	0944d69475	repair, streaming: Generalize consumer lambdas Both streaming and repair call the distributed sstables writing with equal lambdas, each being ~30 lines of code. The only difference between them is that repair might request offstrategy compaction for the new sstable. Generalization of these two pieces saves lines of code and speeds up the release/repair/row_level.o compilation by half a minute (out of twelve). tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210531133113.23003-1-xemul@scylladb.com>	2021-06-06 09:21:23 +03:00
Lubos Kosco	777771df34	scylla_util.py: Relax GCE setup NVMe device checks We don't want to fail I/O setup if there is more than one NVMe device mounted as root nor if there are no NVMe devices. Fixes #8032 Closes #8444	2021-06-06 09:21:23 +03:00
Botond Dénes	b0056f88dc	test.py: revamp coverage support Instead of attempting to universally set the proper environment necessary for tests to generate profiling data such that coverage.py can process it, allow each Test subclass to set up the environment as needed by the specific Test variant. With this we now have support for all current test types, including cql, cql-pytest and alternator tests.	2021-06-06 09:21:23 +03:00
Botond Dénes	438391b4cc	scripts/coverage.py: check that --path is a directory To detect a bad --path that would fail coverage generation early.	2021-06-06 09:21:23 +03:00
Botond Dénes	ca91fd0e34	scripts/coverage.py: update main()'s docstring with new --run modifiers And fix a typo while there.	2021-06-06 09:21:23 +03:00
Botond Dénes	2ba3fc2e11	scripts/coverage.py: add --distinct-id parameter Yet another modifier for `--run`, allowing running the same executable multiple times and then generating a coverage report across all runs. This will also be used by test.py for those test suites (cql test) which run the same executable multiple times, with different inputs.	2021-06-06 09:21:23 +03:00
Botond Dénes	b1f46b3693	scripts/coverage.py: add --executable parameter Another modifier for `--run`, allowing to override the test executable path. This is useful when the real test is run through a run-script, like in the case of cql-pytest.	2021-06-06 09:21:23 +03:00
Michał Chojnowski	81c1a7f7e9	locator: token_metadata: remove `include_min` from tokens_iterator_impl `include_min` is always set to the default value. Get rid of it.	2021-06-05 17:40:35 +02:00
Michał Chojnowski	2a3bd2babe	locator: token_metadata: remove the `include_min` parameter from `ring_range()` `include_min` is always set to the default value. Remove it.	2021-06-05 17:40:35 +02:00
Michał Chojnowski	23b7178f0d	dht: token: add a comment excusing the `const bytes&` constructor	2021-06-05 15:22:42 +02:00
Michał Chojnowski	31aad81dc9	dht: token: simplify midpoint() midpoint doesn't have to be so complicated.	2021-06-05 15:22:35 +02:00
Pavel Emelyanov	76947c829e	transport/controller: Stop server on state change failure too If on stop the set_cql_state() throws the local sharded<cql_server> will be left not stopped and will fail the respective assertion on its destruction. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-04 16:53:21 +03:00
Pavel Emelyanov	f6ef148c76	transport/controller: Rollback server start on state change failure too If set_cql_state() throws the cserver remains started. If this happens on start before the controller stop defer action is scheduled, the destruction of the controller will fail on the assertion that checks the _server must be stopped. Effectively this is the fix of #8796 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-04 16:50:51 +03:00
Pavel Emelyanov	6995e41e64	transport/controller: Do not leave _server uninitialized If an exception happens after sharded<cql_server>.start() the controller's _server pointer is left pointing to stopped sharded server. This makes it impossible to start the server again (via API) since the check for if (_server) will always be true. This is the continuation of the `ae4d5a60` fix. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-04 16:48:26 +03:00
Pavel Emelyanov	12220b74e8	transport/controller: Rework try-catch into defers This is to make further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-04 16:48:12 +03:00
Alejo Sanchez	3e91a8ca0d	raft: replication test: wait for log for both index and term Waiting on index alone does not guarantee correct leader log propagation. This patch also checks the term of the leader's last log entry. This was exposed by occasional problems with packet drops. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-04 08:38:19 -04:00
Alejo Sanchez	545893145e	raft: replication test: reset network at construction Reset network in constructor, not in unrelated function. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-04 08:18:32 -04:00
Alejo Sanchez	294dcfb204	raft: replication test: use lambda visitor for updates Process updates with a lambda visitor. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-04 08:18:31 -04:00
Nadav Har'El	0bb2e010f5	test/alternator: rewrite test/alternator/run script in Python We already wrote the test/cql-pytest/run script in Python in a way it can be reusable for the other test/*/run scripts. So this patch replaces the test/alternator/run shell script with Python code which does the same thing (safely runs Scylla with Alternator and pytest on it in a temporary directory and IP address), but sharing most of the code that cql-pytest uses. The benefit of reusing the test/cql-pytest/run.py library goes beyond shorter code - the main benefit will be that we can't forget to fix one of the test/*/run scripts (e.g., add more command line options or fix a bug) when fixing another one. To make the test/cql-pytest/run.py library reusable for running Alternator, I needed to generalize a few things in this patch (e.g., the way we check and wait for Scylla to boot with the different APIs we intend to check). There is also one bug-fix on how interrupts are handled (they are now better guaranteed to kill pytest) - and now fixing this bug benefits all runners using run.py (cql-pytest/run, cql-pytest/run-cassandra and alternator/run). In the future, we can port the runners which are still duplicate shell scripts - test/redis/run and test/scylla-gdb/run - to Python in a similar manner to what we did here for test/alternator/run. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-03 11:23:00 +03:00
Nadav Har'El	ef45fccdae	test/cql-pytest: make test run code more general Change the cql-pytest-specific run_cql_pytest() function to a more general function to run pytest in any directory. Will be useful for reusing the same code for other test runners (e.g., Alternator), and is also clearer. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-03 11:22:36 +03:00
Benny Halevy	8f054edec7	test: database_test: add snapshot_skip_flush_works Test that taking a snapshot with the skip_flush option does not flush the memtable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 20:39:29 +03:00
Benny Halevy	0c80d9d7a7	api: storage_service/snapshots: support skip-flush option The option is provided by nodetool snapshot https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/ ``` nodetool [(-h <host> \| --host <host>)] [(-p <port> \| --port <port>)] [(-pp \| --print-port)] [(-pw <password> \| --password <password>)] [(-pwf <passwordFilePath> \| --password-file <passwordFilePath>)] [(-u <username> \| --username <username>)] snapshot [(-cf <table> \| --column-family <table> \| --table <table>)] [(-kc <kclist> \| --kc.list <kclist>)] [(-sf \| --skip-flush)] [(-t <tag> \| --tag <tag>)] [--] [<keyspaces...>] -sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data) ``` But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167) and not supported at the api level. This patch wires the skip_flush option support to the REST API. Fixes #8725 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 17:20:21 +03:00
Benny Halevy	9cf858b5fc	snapshot: support skip_flush option skip_flush is disabled by default. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 17:20:21 +03:00
Benny Halevy	52fd2b71b7	table: snapshot: add skip_flush option skip_flush is false by default. Also, log a debug message when starting the snapshot. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 17:20:21 +03:00
Benny Halevy	4169f56407	api: storage_service/snapshots: add sf (skip_flush) option Note: I tried adding the option and calling it "skip_flush" but I couldn't make it work with scylla-jmx, hence it's called by the abbreviated name - "sf". Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 17:20:19 +03:00
Benny Halevy	4274cf6351	sstable: validate first and last keys ordering In #8772, an assert validating first token <= last token failed in leveled_manifest::overlapping. It is unclear how we got to that state, so add validation in sstable::set_first_and_last_keys() that the to-be-set first and last keys are well ordered. Otherwise, throw malformed_sstable_exception. set_first_and_last_keys is called both on the write path from the sstable writer before the sstable is sealed, and on the open/load path via update_info_for_opened_data(). While at it, change the exception type thrown when the key in the summary is empty from runtime_error to malformed_sstable_exception, since the function is called from the read path and the corruption may already be present in the sstable. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 12:28:48 +03:00
Benny Halevy	7a4591119b	test: lib: reusable_sst: save unexpected errors reusable_sst tries opening an sstable using all sstable format versions in descending order. It is expected to see "file not found" if the actual sstable version is not the latest one. That said, we may hit other errors if the sstable is malformed in any way, so do not override this kind of error if "file not found" errors are hit after it, and return the unexpected error instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 12:25:29 +03:00
Benny Halevy	9452b99b40	test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard Currently the test is using "first_key", "last_key" literals for the first and last keys and expects them to sort properly with the murmur3 partitioner. Also it does that for all generated sstables which is less interesting for reshape. Use token_generation_for_current_shard to generate random, properly ordered keys. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 12:25:29 +03:00
Benny Halevy	d5405dade7	test: sstable_test: define primary key in schema for compressed sstable Otherwise, the primary_key will be considered as composite, as its length does not equal 1. That hampers token calculation when decorating the first and last keys in the summary file. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 12:25:29 +03:00
Alejo Sanchez	a3fc974de9	raft: replication test: move structs into class Move auxiliary classes connection and hash_connection out of raft_cluster and into connected class. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	5b688d42d7	raft: replication test: move data structures to cluster class Move state_machine, persistence, connection, hash_connection, connected, failure_detector, and rpc inside raft_cluster. This commit moves declaration of class raft_cluster up. (Minimize changed lines) Moves apply_fn definition from state_machine to raft_cluster. Fixes namespace in declarations Keeps static rpc::net outside for now to keep this commit simple. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	1250d910ee	raft: replication test: remove shared pointers Following gleb, tomek, and kamil's suggestion, remove unnecessary use of lw_shared_ptr. This also solves the problem of constructing a lw_shared_ptr from a forward declaration (connected) in a subsequent patch. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	aa1200ee50	raft: replication test: move get_states() to raft_cluster Move get_states() helper inside raft cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	740545cdc5	raft: replication test: test_server inside raft_cluster Since there are no more external users of test_server, move it to raft_cluster and remove member access operator. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	1ee4408869	raft: replication test: rpc declarative tests Convert rpc replication tests to declarative form. This will enable moving remaining parts inside raft_cluster. For test stability, add support for checking rpc config of a node eventually changes to the expected configuration. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	f11ae18158	raft: replication test: add wait_log Allow test cases to specify waiting for log for one or more servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	fa84b15909	raft: replication test: add stop and reset server Add stop and reset server support. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	19d28e7e0f	raft: replication test: disconnect 2 support Support custom disconnection of 2 servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	e2612e5327	raft: replication test: explicit node_id naming Use node_id{x} for more expressive naming in tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	bdfdd2da0b	raft: replication test: move definitions up Move definitions up for next patch. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	14bd29f974	raft: replication test: no append entries support Handle test cases not appending entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	a73db881cb	raft: replication test: fix helper parameter Use vector instead of initializer_list for function helper parameter. This is not a constructor and it complicates usage. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	8468059d0e	raft: replication test: stop servers out of config As requested by @gleb-cloudious, stop servers taken out of configuration. Adjust other parts of code relying on all servers being active. Remove temporary stop on rpc server. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	51343d4de7	raft: replication test: wait log when removing leader from configuration If leader is removed from configuration wait log first. Remove wait_log_all for every case as it was too broad a fix. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	e032d8446f	raft: replication test: only manipulate servers in configuration Only start/stop, init/start/rearm tickers, wait log, elapse_election, run free election, check for leader, and verify servers in current configuration. This is necessary for having servers out of configuration not present/stopped. Temporarily stop a server in rpc test until we truly stop servers out of configuration in next commit. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	ec078ca55f	raft: replication test: only cancel rearm ticker for removed server When changing configuration, don't pause and restart all tickers. Only do it for the specific server(s) being removed. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	802f68317e	raft: replication test: only pause restart tickers in config Only pause and restart tickers for servers in configuration. Currently when a server is taken out it's reset and a new one is set up, but out of configuration. @gleb-cloudious requested to have fully stopped servers when out of configuration, until they are re-added. This change is needed to allow that or else restart would arm tickers on servers no longer present. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	85f299e39b	raft: replication test: simplify calls to helpers Pass test update directly to helpers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	27f50b3589	raft: replication test: persisted snapshots in raft_cluster Move persisted snapshots inside raft_cluster, de-cluttering code. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	5b601c133b	raft: replication test: verify in raft_cluster Do verifications in raft_cluster::verify(). This will enable having persisted snapshots inside the class and de-clutter caller code. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:03 -04:00
Alejo Sanchez	ce6746b888	raft: replication test: connected inside raft_cluster Keep connected inside raft_cluster. Helpers are already provided to handle connectivity. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:02 -04:00
Alejo Sanchez	b41ce7084b	raft: replication test: snapshots inside raft_cluster Keep snapshots inside raft_cluster, removing this need outside. If this is needed later, a const getter can be implemented. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:02 -04:00
Alejo Sanchez	4c2f8d84c5	raft: replication test: remove obsolete param Since create_server() is in raft_cluster, there's no need for change_configuration() to pass total values anymore. Remove it. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:02 -04:00
Alejo Sanchez	e9df914692	raft: replication test: elect_new_leader wait log and pause Do wait_log() for the next leader always in elect_new_leader. Only wait log for new leader if it's connected to the old leader. Pause and restart tickers when creating a candidate to avoid another node to become a dueling candidate. Remove pause and restart tickers around calls to elect_new_leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:02 -04:00
Alejo Sanchez	52188016af	raft: replication test: create_server in raft_cluster Remove the global create_raft_server() and replace with a create_server() helper in replication_test(). This will allow not requiring the user of raft_cluster to create special objects. Note this does not move(apply) anymore as it's kept in raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:47:02 -04:00
Alejo Sanchez	1edcb6e647	raft: replication test: reset snapshots When stopping a server also delete snapshots and persisted snapshots. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 23:46:11 -04:00
Alejo Sanchez	453f19cf0e	raft: replication test: reset server helper Add a helper to reset a server in raft_cluster. Besides simplifying code and preventing errors, this will help move create_raft_server logic to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	d3b7f21b88	raft: replication test: pause tickers before stopping Pause tickers before stopping servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	30c9daafd2	raft: replication test: tick helper Move test tick handling to raft_cluster as helper method. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	2e61c507d2	raft: replication test: tickers on raft_cluster Move tickers to raft_cluster helper class. Ticker initialization and pause is done automatically at start_all() and stop_all(). Add temporary helpers to manage specific tickers. These might be removed later once proper node abort and reset are implemented. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	aea77871c4	raft: replication test: cluster tracking leader Track current leader inside helper class. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	ca8e55613e	raft: replication test: elect first leader in raft_cluster Run first leader election inside raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	322802308c	raft: replication test: use id 0 for rpc tests raft_cluster at the moment only allows sequential 0 based ids. The code was generating ids over this and causing problems for code changes. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	c1a6e81002	raft: replication test: fix partition wait log When partitioning, don't wait_log on servers outside configuration. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:20 -04:00
Alejo Sanchez	6db730c500	raft: replication test: partition helper Add a partition handling helper to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	848c244932	raft: replication test: track in_configuration in raft_cluster Keep track of servers in configuration inside raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	16728b8966	raft: replication test: use cluster saved apply function Use apply function saved in cluster at creation time. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	3daed889b8	raft: replication test: change_configuration in raft_cluster Move change_configuration to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	102b8e71bb	raft: replication test: free_election in raft_cluster Move free_election to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	60d4d06861	raft: replication test: wait_log_all in raft_cluster Move wait_log_all to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	d1ba0fe719	raft: replication test: wait_log in raft_cluster Move wait_log to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	3e4871b884	raft: replication test: elect_new_leader in raft_cluster Move elect_new_leader to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	59b9642be5	raft: replication test: elapse_election in raft_cluster Move elapse_election to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	b3e2b54913	raft: replication test: move add_entry up Style. Move definition of add_entry and add_remaining_entries with the rest of raft_cluster definitions. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	8cd2abe72b	raft: replication test: remove spurious check Going forward the leader is always in configuration and up to date. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	2d51d1bbc5	raft: replication test: raft_cluster add_entries Move add_entries() to raft_cluster and provide a helper to add remaining entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	2a1e7a15a6	raft: replication test: calculate first value helper Helper to calculate what's the value number to be added after snapshot and leader initial log. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	e2f425e210	raft: replication test: initial state helper Move initial_state preparation to its own helper function. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	d2c0308a85	raft: replication test: move declarations up Move declarations near the top of the file for following refactors to raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	a3700a6d0a	raft: replication test: move up set_config Move set_config above raft_cluster for a subsequent commit. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	57da05c986	raft: replication test: use disconnect() helper For rpc tests, use raft_cluster::disconnect() instead of the local connected reference. This removes connected object use outside raft_cluster. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	54c919b726	raft: replication test: add connectivity helpers Add connectivity helpers disconnect(server, except) and connect_all() so users of raft_cluster don't need to keep a connectivity object pointer. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	5e324f3438	raft: replication test: rpc with raft_cluster Use raft_cluster for rpc tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	752d53a909	raft: replication test: use parallel start/stop Start and stop servers in parallel. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	bcf5181697	raft: replication test: cluster class Use raft_cluster class to handle servers. First part of this change. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	5fc0a1251d	raft: replication test: helper uuid to local id Add a helper to convert from UUID to size_t id. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	7e93501d4c	raft: replication test: use optional Instead of tracking with a boolean use an optional for partition leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	ccb85bce02	raft: replication test: wait log on next leader only When there's a defined next leader, only wait for log propagation for this follower. Splits wait_log() into waiting for one follower with wait_log() and waiting for all followers with wait_log_all(). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	2aa1646e35	raft: replication test: remove wait after adding entries Remove log wait after adding entries. It was added to handle some debug hangs but it is not good for testing. There are already wait logs at proper code locations. (e.g. elect_new_leader, partition) Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	0216d0a7b0	raft: replication test: remove unused param elect_new_leader doesn't need to know configuration anymore. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	effcb7c5f6	raft: tests: move conversion helpers to header Move replication test helpers to header. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Alejo Sanchez	7327cbd871	raft: replication test: use structs to avoid alias Use structs for test commands. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-01 21:50:19 -04:00
Avi Kivity	e9e5663731	build, utils/bptree.hh: drop -Wno-gnu-designator warning Drop the warning about old-style GNU designated initializers and convert two violations in bptree.hh to the standard C++20 syntax. Closes #8743	2021-05-31 18:51:49 +03:00
Nadav Har'El	ff81072f64	cql-pytest: port Cassandra's unit test validation/entities/secondary_index_test In this patch, we port validation/entities/secondary_index_test.java, resulting in 41 tests for various aspects of secondary indexes. Some of the original Java tests required direct access to the Cassandra internals not available through CQL, so those tests were omitted. In porting these tests, I uncovered 9 previously-unknown bugs in Scylla: Refs #8600: IndexInfo system table lists MV name instead of index name Refs #8627: Cleanly reject updates with indexed values where value > 64k Refs #8708: Secondary index is missing partitions with only a static row Refs #8711: Finding or filtering with an empty string with a secondary index seems to be broken Refs #8714: Improve error message on unsupported restriction on partition key Refs #8717: Recent fix accidentally broke CREATE INDEX IF NOT EXISTS Refs #8724: Wrong error message when attempting index of UDT column with a duration Refs #8744: Index-creation error message wrongly refers to "map" - it can be any collection Refs #8745: Secondary index CREATE INDEX syntax is missing the "values" option These tests also provide additional reproducers for already known issues: Refs #2203: Add support for SASI Refs #2962: Collection column indexing Refs #2963: Static column indexing Refs #4244: Add support for mixing token, multi- and single-column restrictions Due to these bugs, 15 out of the 41 tests here currently xfail. We actually had more failing tests, but we fixed a few of the above issues before this patch went in, so their tests are passing at the time of this submission. All 41 tests pass when running against Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210531112354.970028-1-nyh@scylladb.com>	2021-05-31 18:31:13 +03:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
Piotr Dulikowski	b0c22f2e39	repair: trigger repair abort_source only from shard 0 When user requests repair to be forcefully aborted, the `_abort_all_as` abort source could be modified from multiple shards in parallel by the `tracker::abort_all_repairs()` function, which can lead to undefined behavior and to a crash. This commit makes sure that `_abort_all_as` is used only from shard 0 when repair is aborted. Fixes #8693 Closes #8734	2021-05-31 15:57:31 +03:00
Michał Chojnowski	a2352ea332	dht: token: add a comment to normalize() The purpose and name of normalize are not obvious and deserve an explanatory comment.	2021-05-31 11:54:58 +02:00
Michał Chojnowski	3d9b8c9eff	dht: token: use {read,write}_unaligned instead of std::copy_n A cosmetic change.	2021-05-31 11:54:58 +02:00
Michał Chojnowski	3c88a9ccb6	dht: token-sharding: fix a typo in a comment	2021-05-31 11:54:45 +02:00
Avi Kivity	e96ff3d82d	dist: add new docker building process The new process has the following differences from the Dockerfile based image: - Using buildah commands instead of a Dockerfile. This is more flexible since we don't need to pack everything into a "build context" and transfer it to the container; instead we interact with the container as we build it. - Using packages instead of a remote yum repository. This makes it easy to create an image in one step (no need to create a repository, promote, then download the packages back via yum). It means that the image cannot be upgraded via yum, but container images are usually just replaced with a new version. - Build output is an OCI archive (e.g. a tarball), not a docker image in a local repository. This means the build process can later be integrated into ninja, since the artifact is just a file. The file can be uploaded into a repository or made available locally with skopeo. - any build mode is supported, not just release. This can be used for quick(er) testing with dev mode. I plan to integrate it further into the build system, but currently this is blocked on a buildah bug [1]. [1] https://github.com/containers/buildah/issues/3262 Closes #8730	2021-05-31 10:05:22 +03:00
Nadav Har'El	2440569984	secondary index: fix error message which erroneously referred to "map" The value of a frozen collection may only be indexed (using a secondary index) in full - it is not allowed to index only the keys for example - "CREATE INDEX idx ON table (keys(v))" is not allowed. The error message referred to a frozen<map>, but the problem can happen on any frozen collection (e.g., a frozen set), not just a frozen map, so it can be confusing to a user who used a frozen set and gets an error about a frozen map. So this patch fixes the error message to refer to a "frozen collection". Note that the Cassandra error message in this case is different - it reads: "Frozen collections are immutable and must be fully indexed". Fixes #8744. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210529094056.825117-1-nyh@scylladb.com>	2021-05-30 23:23:20 +03:00
Botond Dénes	cd6bbd37a4	utils/utf8.c: move includes outside of namespaces Including in the middle of a namespace is not a good practice. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210528142502.962947-1-bdenes@scylladb.com>	2021-05-30 23:23:20 +03:00
Raphael S. Carvalho	a7cdd846da	compaction: Prevent tons of compaction of fully expired sstable from happening in parallel Compaction manager can start tons of compaction of fully expired sstable in parallel, which may consume a significant amount of resources. This problem is caused by weight being released too early in compaction, after data is all compacted but before table is called to update its state, like replacing sstables and so on. Fully expired sstables aren't actually compacted, so the following can happen: - compaction 1 starts for expired sst A with weight W, but there's nothing to be compacted, so weight W is released, then calls table to update state. - compaction 2 starts for expired sst B with weight W, but there's nothing to be compacted, so weight W is released, then calls table to update state. - compaction 3 starts for expired sst C with weight W, but there's nothing to be compacted, so weight W is released, then calls table to update state. - compaction 1 is done updating table state, so it finally completes and releases all the resources. - compaction 2 is done updating table state, so it finally completes and releases all the resources. - compaction 3 is done updating table state, so it finally completes and releases all the resources. This happens because, with expired sstable, compaction will release weight faster than it will update table state, as there's nothing to be compacted. With my reproducer, it's very easy to reach 50 parallel compactions on a single shard, but that number can be easily worse depending on the amount of sstables with fully expired data, across all tables. This high parallelism can happen only with a couple of tables, if there are many time windows with expired data, as they can be compacted in parallel. Prior to `55a8b6e3c9`, weight was released earlier in compaction, before last sstable was sealed, but right now, there's no need to release weight earlier. 
Weight can be released in a much simpler way, after the compaction is actually done. So such compactions will be serialized from now on. Fixes #8710. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com> [avi: drop now unneeded storage_service_for_tests]	2021-05-30 23:22:51 +03:00
Benny Halevy	1c0769d789	table: clear: make exception safe It is currently possible that _memtables->add_memtable() will throw after _memtables->clear(), leaving the memtables list completely empty. However, we do rely on always having at least one allocated in the memtables list as active_memtable() references a lw_shared_ptr<memtable> at the back of the memtables vector, and it is expected to always be allocated via add_memtable() upon construction and after clear(). This change moves the implementation of this convention to memtable_list::clear() and makes the latter exception safe by first allocating the to-be-added empty memtable and only then clearing the vector. Refs #8749 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>	2021-05-30 13:22:52 +03:00
Avi Kivity	791412b046	test: user_defined_function_test: raise Lua timeout user_defined_function_test fails sporadically in debug mode due to Lua timeout. Raise the timeout to avoid the failure, but not so much that the test that expects timeout becomes too slow. Fixes #8746. Closes #8747	2021-05-30 13:10:57 +03:00
Piotr Jastrzebski	76d7c761d1	schema: Stop using deprecated constructor This is another boring patch. One of schema constructors has been deprecated for many years now but was used in several places anyway. Usage of this constructor could lead to data corruption when using MX sstables because this constructor does not set schema version. MX reading/writing code depends on schema version. This patch replaces all the places the deprecated constructor is used with schema_builder equivalent. The schema_builder sets the schema version correctly. Fixes #8507 Test: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>	2021-05-30 11:58:27 +03:00
Nadav Har'El	1507bbb35a	cql-pytest: increase default server-side timeouts Sometimes the cql-pytest tests run extremely slowly. This can be a combination of running the debug build (which is naturally slow) and a test machine which is overcommitted, or experiencing some transient swap storm or some similar event. We don't want tests, which we run on 100% reliable setups, to fail just because they run into timeouts in Scylla when they run very slowly. We already noticed this problem in the past, and increased the CQL client timeout in conftest.py from the default of 10 seconds to 120 seconds - the old default of 10 seconds was not enough for some long operations (such as creating a table with multiple views) when the test ran very slowly. However, this only fixed the client-side timeout. We also have a bunch of server-side timeouts, configured to all sorts of arbitrary (and fairly small) numbers. For example, the server has a "write request timeout" option, which defaults to just 2 seconds. We recently saw this timeout exceeded in a slow run which tried to do a very large write. So this patch configures all the configurable server-side timeouts we have to default to 300 seconds. This should be more than enough for even the slowest runs (famous last words...). This default is not a good idea on real multi-node clusters which are expected to deal with node loss, but this is not the case in cql-pytest. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210529213648.856503-1-nyh@scylladb.com>	2021-05-30 01:20:14 +03:00
Avi Kivity	d23bebf5c2	Merge "Unexport storage service dependencies" from Pavel E " Right now storage service is used as "provider" of other services -- database, feature service and tokens. This set unexports the first pair. This drops a bunch of calls for global storage service instances from the places that don't really need it. tests: unit(dev), start-stop " * 'br-pupate-storage-service' of https://github.com/xemul/scylla: storage-service: Don't export features api: Get features from proxy storage-service: Don't export database storage-service: Turn some global helpers into methods storage-service: Open-code simple config getters view: Get database from storage_proxy main: Use local database instance api: Use database from http_ctx	2021-05-29 20:52:47 +03:00
Pavel Emelyanov	598bbfab15	storage-service: Don't export features Now storage service uses the feature service instance internally and doesn't need to provide getter for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:16:12 +03:00
Pavel Emelyanov	651568318d	api: Get features from proxy The reset_local_schema call needs proxy and feature service to do its job. Right now the features are retrieved from global storage service, but they are present on the proxy as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:15:15 +03:00
Pavel Emelyanov	b990b764ca	storage-service: Don't export database Now storage service uses the database instance internally and doesn't need to provide getter for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:13:27 +03:00
Pavel Emelyanov	0651038f29	storage-service: Turn some global helpers into methods There are two static helpers used by storage service that grab global storage service. To simplify these two turn both into storage service methods and use 'this' inside. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:12:25 +03:00
Pavel Emelyanov	5ae8accfed	storage-service: Open-code simple config getters There are two db::config getters in storage_service.cc that are used only once. Both call for global storage service, but since they are called from storage service it's simpler to break this loop and make storage service get needed config options directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:11:24 +03:00
Pavel Emelyanov	1ce0682821	view: Get database from stprage_proxy The db::view code already uses proxy rather actively, so instead of depending on the storage service being at hand it's better to make db::view require the proxy. For now -- via global instance. There's one dependency on storage service left after this patch -- to get the tokens. This piece is to be fixed later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:09:32 +03:00
Pavel Emelyanov	6d53ddaa5f	main: Use local database instance All start-stop code in main has the sharded<database> at hand, so there's no need to get it from global storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:08:57 +03:00
Pavel Emelyanov	e476247763	api: Use database from http_ctx Instead of getting database from global storage service it's simpler and better to grab it from the http context at hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:08:25 +03:00
Asias He	e86d39faf0	storage_service: Update peer table only if the peer is part of the ring Consider the following procedure: - n1, n2, n3 - n3 is network partitioned from the cluster - n4 replaces n3 - n3 has the network partition fixed - n1 learns n3 as NORMAL status and calls storage_service::handle_state_normal which in turn calls update_peer_info, all columns except the tokens column in system.peers are written - n1 restarts before it figures out that n4 is the new owner and deletes the entry for n3 in system.peers - n3 is removed from gossip by all the nodes in the cluster automatically because they detect the collision and remove n3 - n1 restarts, leaving the entry in system.peers for n3 forever To fix, we can update peer tables only if the node is part of the ring. Fixes #8729 Closes #8742	2021-05-28 15:03:26 +02:00
Avi Kivity	b6c49fd320	Update seastar submodule > Merge "memory: optimize thread-local initialization" from Avi > Merge "Move priority classes manipulations from io-queue" from Pavel E > gate: add default move assignment operator	2021-05-28 11:47:54 +03:00
Pavel Emelyanov	526d31734c	scylla-gdb: scylla_io_queues: Support new registered classes layout Starting from seastar commit 5dae0cf3c48159990f51e5d38495af5ae224c2f8 all the registered classes info was moved into io_priority_class::_infos array. tests: scylla-gdb(release, old and new seastars) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210528083941.27990-1-xemul@scylladb.com>	2021-05-28 11:47:38 +03:00
Avi Kivity	0acf5bfca6	build: enable -Wreturn-std-move Clang warns when "return std::move(x)" is needed to elide a copy, but the call to std::move() is missing. We disabled the warning during the migration to clang. This patch re-enables the warning and fixes the places it points out, usually by adding std::move() and in one place by converting the returned variable from a reference to a local, so normal copy elision can take place. Closes #8739	2021-05-27 21:16:26 +03:00
Avi Kivity	d3e5b37059	Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund" This reverts commit `e9c940dbbc`, reversing changes made to `6144656b25`. Since it was merged commitlog_test consistently times out in debug mode.	2021-05-27 21:16:26 +03:00
Wojciech Mitros	725c6aac81	test/perf: close test_env to pass an assert in sstables_manager destructor When destroying a perf_sstable_test_env, an assert in sstables_manager destructor fails, because it hasn't been closed. Fix by removing all references to sstables from perf_sstable_test_env, and then closing the test_env (as well as the sstables_manager). Fixes #8736 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #8737	2021-05-27 17:41:17 +03:00
Michał Chojnowski	5e9f741bb4	repair: remove range_split.hh Dead code since `80ebedd242`. Closes #8698	2021-05-27 17:21:37 +03:00
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961. 
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
Avi Kivity	e8e4456ec7	Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding. The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish. It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first. Closes #8680 * github.com:scylladb/scylla: test: add a case for conflicting workload types cql-pytest: add basic tests for service level workload types docs: describe workload types for service levels sys_dist_ks: fix redundant parsing in get_service_level sys_dist_ks: make get_service_level exception-safe transport: start shedding requests during potential overload client_state: hook workload type from service levels cql3: add listing service level workload type cql3: add persisting service level workload type qos: add workload_type service level parameter	2021-05-27 17:01:56 +03:00
Avi Kivity	f3e8e625c0	Update tools/java submodule (toppartitions single jmx call) * tools/java fd92603b99...599b2368d6 (1): > toppartitions: Fix toppartitions to only jmx once Ref #8459.	2021-05-27 16:57:57 +03:00
Konstantin Osipov	52f7ff4ee4	raft: (testing) update copyright Incorrect copyright information was copy-pasted from another test file. Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>	2021-05-27 15:47:49 +03:00
Nadav Har'El	92b7a84e90	secondary index: in error message, call UDT as UDT It is forbidden to create a secondary index of a column which includes in any way the "duration" type. This includes a UDT which includes a duration. Our code attempted to print in this case the message "Secondary indexes are not supported on UDTs containing durations" - but because we tested for tuples first, and UDTs are also tuples - we got the message about tuples. By changing the order of the tests, we get the most specific (and useful) error message. Fixes #8724. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210526201042.642550-1-nyh@scylladb.com>	2021-05-27 15:46:30 +03:00
Piotr Sarna	99f356d764	test: add a case for conflicting workload types The test case verifies that if several workload types are effective for a single role, the conflict resolution is well defined.	2021-05-27 14:31:36 +02:00
Piotr Sarna	01b7e445f9	cql-pytest: add basic tests for service level workload types The test cases check whether it's possible to declare workload type for a service level and if its input is validated.	2021-05-27 14:31:36 +02:00
Piotr Sarna	54a5d4516c	docs: describe workload types for service levels A paragraph about workload types is added to docs/service_levels.md	2021-05-27 14:31:36 +02:00
Piotr Sarna	d45574ed28	sys_dist_ks: fix redundant parsing in get_service_level The routine used for getting service level information already operates on the service level name, but the same information is also parsed once more from a row from an internal table. This parsing is redundant, so it's hereby removed.	2021-05-27 14:31:26 +02:00
Piotr Sarna	7faba19605	sys_dist_ks: make get_service_level exception-safe In order to avoid killing the node if a parsing error occurs, the routine which fetches service level information is made exception-safe.	2021-05-27 14:31:25 +02:00
Pavel Emelyanov	d2442a1bb3	tests: Ditch storage_service_for_tests The purpose of the class in question is to start sharded storage service to make its global instance alive. I don't know when exactly it happened but no code that instantiates this wrapper really needs the global storage service. Ref: #2795 tests: unit(dev), perf_sstable(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210526170454.15795-1-xemul@scylladb.com>	2021-05-27 14:39:13 +03:00
Piotr Sarna	cb27ebe61d	transport: start shedding requests during potential overload This commit implements the following overload prevention heuristic: if the admission queue becomes full, a timer is armed for 50ms. If any of the ongoing requests finishes, the timer is disarmed, but if that doesn't happen, the server goes into shedding mode, which means that it reads new requests from the socket and immediately drops them until one of the ongoing requests finishes. This heuristic is not recommended for OLAP workloads, so it is applied only if the session declared itself as interactive (via service level's workload_type parameter).	2021-05-27 13:02:22 +02:00
Piotr Sarna	409c67b1b4	client_state: hook workload type from service levels The client state is now aware of its workload type derived from its attached service level.	2021-05-27 13:02:22 +02:00
Piotr Sarna	762e2f48f2	cql3: add listing service level workload type The workload type information is now presented in the output of LIST SERVICE LEVEL and LIST ALL SERVICE LEVELS statements.	2021-05-27 13:02:22 +02:00
Piotr Sarna	4816678eb6	cql3: add persisting service level workload type The workload type information can now be set via CQL and it's persisted in the distributed system table.	2021-05-27 13:02:22 +02:00
Piotr Sarna	578543603d	qos: add workload_type service level parameter The workload type is currently one of three values: - unspecified - interactive - batch By defining the workload type, the service level makes it easier for other components to decide what to do in overload scenarios. E.g. if the workload is interactive, requests can be shed earlier, while if it's batched (or unspecified), shedding does not take place. Conversely, batch workloads could accept long full scan operations.	2021-05-27 13:02:22 +02:00
Dejan Mircevski	b54872fd95	auth: Remove `const` from role_manager methods Some subclasses want to maintain state, which constness needlessly precludes. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8721	2021-05-27 11:27:38 +03:00
Nadav Har'El	97e827e3e1	secondary index: fix regression in CREATE INDEX IF NOT EXISTS The recent commit `0ef0a4c78d` added helpful error messages in case an index cannot be created because the intended name of its materialized view is already taken - but accidentally broke the "CREATE INDEX IF NOT EXISTS" feature. The checking code was correct, but in the wrong place: we need to first check whether the index already exists and "IF NOT EXISTS" was chosen - and only do this new error checking if this is not the case. This patch also includes a cql-pytest test for reproducing this bug. The bug is also reproduced by the translated Cassandra unit tests cassandra_tests/validation/entities/secondary_index_test.py:: testCreateAndDropIndex and this is how I found this bug. After this patch, all these tests pass. Fixes #8717. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210526143635.624398-1-nyh@scylladb.com>	2021-05-27 09:10:41 +02:00
Asias He	72cc596842	repair: Wire off-strategy compaction for regular repair We have enabled off-strategy compaction for bootstrap, replace, decommission and removenode operations when repair based node operation is enabled. Unlike node operations like replace or decommission, it is harder to know when the repair of a table is finished because users can send multiple repair requests one after another, each request repairing a few token ranges. This patch wires off-strategy compaction for regular repair by adding a timeout based automatic off-strategy compaction trigger mechanism. If there is no repair activity for some time, off-strategy compaction will be triggered for that table automatically. Fixes #8677 Closes #8678	2021-05-26 11:41:27 +03:00
Konstantin Osipov	ac43941f17	rpc: don't include an unused header (raft_services.hh) Message-Id: <20210525183919.1395607-7-kostja@scylladb.com>	2021-05-26 11:07:44 +03:00
Konstantin Osipov	7ca4ffc309	system_keyspace: coroutinize db::system_keyspace::setup() Message-Id: <20210525183919.1395607-19-kostja@scylladb.com>	2021-05-26 11:06:21 +03:00
Avi Kivity	e2e723cc4c	build: enable -Wrange-loop-construct warning This warning triggers when a range for ("for (auto x : range)") causes non-trivial copies, prompting the developer to replace with a capture by reference. A few minor violations in the test suite are corrected. Closes #8699	2021-05-26 10:32:56 +03:00
Avi Kivity	3896e35897	Merge 'storage_service: Respect --enable-repair-based-node-ops flag during removenode' from Asias He In commit `829b4c1` (repair: Make removenode safe by default), removenode was changed to use repair based node operations unconditionally. Since repair based node operations are not enabled by default, we should respect the flag and use streaming to sync data if the flag is false. Fixes #8700 Closes #8701 * github.com:scylladb/scylla: storage_service: Add removenode_add_ranges helper storage_service: Respect --enable-repair-based-node-ops flag during removenode	2021-05-26 10:32:56 +03:00
Avi Kivity	e9c940dbbc	Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund Fixes #8270 If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall. We need to take a few things into account: 1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here. 2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task. 3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider starting to flush once we've used up enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit. Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything). Closes #8695 * github.com:scylladb/scylla: commitlog_test: Add test case for usage/disk size threshold mismatch commitlog: Flush all segments if we only have one. commitlog: Always force flush if segment allocation is waiting commitlog: Include segment wasted (slack) size in footprint check commitlog: Adjust (lower) usage threshold	2021-05-25 18:34:29 +03:00
Kamil Braun	739c24b020	docs: update cdc.md with info about the new internal table	2021-05-25 16:07:23 +02:00
Kamil Braun	c948573398	sys_dist_ks: don't create old CDC generations table on service initialization The old table won't be created in clusters that are bootstrapped after this commit. It will stay in clusters that were upgraded from a version before this commit. Note that a fully upgraded cluster doesn't automatically create a new generation in the new format. Even if the last generation was created before the upgrade, the cluster will keep using it. A new generation will be created in the new format when either: 1. a new node bootstraps (in the new version), 2. or the user runs checkAndRepairCdcStreams, which has a new check: if the current generation uses the old format, the command will decide that repair is needed, even if the generation is completely fine otherwise (also in the new version). During upgrade, while the CDC_GENERATIONS_V2 feature is still not enabled, the user may still bootstrap a node in the old version of Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In that case a new generation will be created in the old format, using the old table definitions.	2021-05-25 16:07:23 +02:00
Kamil Braun	2835697ac1	sys_dist_ks: rename all_tables() to ensured_tables() The static function `all_tables` in system_distributed_keyspace.cc was used by the `system_distributed_keyspace` service initialization function (`start()`) to ensure that a certain set of tables - which the service provides accessors to - exist in the cluster. For each table in the vector returned by `all_tables()` the function would try to create the table, ignoring the "table already exists" error if it is thrown. The commit renames `all_tables` to `ensured_tables` to better convey the intention of this function and documents its purpose in a comment. We do this because in the future the service may provide accessors to tables which it does not actually create. The example - coming in a later commit - is a table which was created in a previous version of Scylla, and for which we still have to provide accessors for backward compatibility / correct handling of the upgrade procedure, but which we do not want to create in clusters that were freshly created using the new version of Scylla, since in that case these tables would be just unnecessary garbage. We mention this use case in the comment.	2021-05-25 16:07:23 +02:00
Kamil Braun	337a4ef8ad	cdc: when creating new generations, use format v2 if possible A node with this commit, when creating a new CDC generation (during bootstrap, upgrade, or when running checkAndRepairCdcStreams command) will check for the CDC_GENERATIONS_V2 feature and: - If the feature is enabled create the generation in the v2 format and insert it into the new internal table. This is safe because a node joins the feature only if it understands the new format. - Otherwise create it in the v1 format, limiting its size as before, and insert it into the old table. The second case should only happen if we perform bootstrap or run checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully upgraded clusters the feature shall be enabled, causing all new generations to use the new format.	2021-05-25 16:07:23 +02:00
Kamil Braun	4d3870b24b	main: pass feature_service to cdc::generation_service	2021-05-25 16:07:23 +02:00
Kamil Braun	2ac9239f6a	gms: introduce CDC_GENERATIONS_V2 feature When a node joins this feature (which it does immediately when upgrading to a version that has this commit), it says: "I understand the new generation storage format and the new identifier format". Thus, when the feature becomes enabled - after all nodes have joined it - it means that it's safe to create new generations using these new storage/ID formats.	2021-05-25 16:07:23 +02:00
Kamil Braun	9c1a3180bb	cdc: introduce retrieve_generation_data This function given a generation ID retrieves its data from the internal table in which the data resides. This depends on the version of the ID: for _v1 we're using system_distributed.cdc_generation_descriptions, for _v2 we're using the better system_distributed_v2.cdc_generation_descriptions_v2 (see the previous commit for detailed explanation of the superiority of the new table).	2021-05-25 16:07:23 +02:00
Kamil Braun	f25e77c202	test: cdc: include new generations table in permissions test	2021-05-25 16:07:23 +02:00
Kamil Braun	1c25b9df56	sys_dist_ks: increase timeout for create_cdc_desc If we want to allow larger generations, we may want to give this operation a bit more time.	2021-05-25 16:07:23 +02:00
Kamil Braun	3155cde9c8	sys_dist_ks: new table for exchanging CDC generations Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in the `system_distributed_everywhere` keyspace so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. The timestamp and the UUID form the "generation identifier" of this new generation; this should explain why we introduced the generation_id_v2 type in previous commits. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. Note that the node is still using the old method - the actual switch will be done in a later commit.	2021-05-25 16:07:23 +02:00
Calle Wilund	a96433c684	commitlog_test: Add test case for usage/disk size threshold mismatch Refs #8270 Tries to simulate case where we mismatch segments usage with actual disk footprint and fail to flush enough to allow segment recycling	2021-05-25 12:43:12 +00:00
Calle Wilund	bf0a91b566	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-05-25 12:43:12 +00:00
Calle Wilund	8ce836209b	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-05-25 12:43:12 +00:00
Calle Wilund	e34ed30178	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-05-25 12:43:12 +00:00
Calle Wilund	ec40207e7f	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-05-25 12:43:12 +00:00
Benny Halevy	6144656b25	table: seal_active_memtable: update stats also on the error path Currently the pending (memtables) flushes stats are adjusted back only on success, therefore they will "leak" on error, so use a .then_wrapped clause to always update the stats. Note that _commitlog->discard_completed_segments is still called only on success, and so is returning the previous_flush future. Test: unit(dev) DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>	2021-05-25 12:51:54 +02:00
Benny Halevy	d46958d3ce	phased_barrier: advance_and_await: abort on allocation failure Currently, advance_and_await() allocates a new gate which might fail. Rather than returning this failure as an exceptional future - which will require its callers to handle that failure, keep the function as noexcept and let an exception from make_lw_shared<gate>() terminate the program. This makes the function "fail-free" to its callers, in particular, when called from the table::stop() path where we can't do much about these failures and we require close/stop functions to always succeed. The alternative of making the allocation of a new gate optional and recovering from it in start() is possible but was deemed not worth it as it will add complexity and cost to start() that's called on the common, hot, path. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>	2021-05-25 12:50:59 +02:00
Avi Kivity	e391e4a398	test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment test_phased_barrier_reassignment has a timeout to prevent the test from hanging on failure, but it occasionally triggers in debug mode since the timeout is quite low (1ms). Increase the timeout to prevent false positives. Since the timeout only expires if the test fails, it will have no impact on execution time. Ref #8613 Closes #8692	2021-05-25 11:20:18 +02:00
Benny Halevy	3ad0f156b9	memtable_list: request_flush: wait on pending flushes also when empty() In https://github.com/scylladb/scylla/issues/8609, table::stop() that is called from database::drop_column_family is expected to wait on outstanding flushes by calling _memtable->request_flush(), but the memtable_list is considered empty() at this point as it has a single empty memtable, so request_flush() returns a ready future, without waiting on outstanding flushes. This change replaces the call to request_flush with flush(). Fix that by either returning _flush_coalescing future that resolves when the memtable is sealed, if available, or go through the get_flush_permit and _dirty_memory_manager->flush_one song and dance, even though the memtable is empty(), as the latter waits on pending flushes. Fixes #8609 Test: unit(dev) DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>	2021-05-25 11:19:51 +02:00
Kamil Braun	d71513d814	abstract_replication_strategy: avoid reactor stalls in `get_address_ranges` and friends The algorithm used in `get_address_ranges` and `get_range_addresses` calls `calculate_natural_endpoints` in a loop; the loop iterates over all tokens in the token ring. If the complexity of a particular implementation of `calculate_natural_endpoints` is large - say `θ(n)`, where `n` is the number of tokens - this results in a `θ(n^2)` algorithm (or worse). This case happens for `Everywhere` replication strategy. For small clusters this doesn't matter that much, but if `n` is, say, `20*255`, this may result in huge reactor stalls, as observed in practice. We avoid these stalls by inserting tactical yields. We hope that some day someone actually implements a subquadratic algorithm here. The commit also adds a comment on `abstract_replication_strategy::calculate_natural_endpoints` explaining that the interface does not give a complexity guarantee (at this point); the different implementations have different complexities. For example, `Everywhere` implementation always iterates over all tokens in the token ring, so it has `θ(n)` worst and best case complexity. On the other hand, `NetworkTopologyStrategy` implementation usually finishes after visiting a small part of the token ring (specifically, as soon as it finds a token for each node in the ring) and performs a constant number of operations for each visited token on average, but theoretically its worst case complexity is actually `O(n + k^2)`, where `n` is the number of all tokens and `k` is the number of endpoints (the `k^2` appears since for each endpoint we must perform finds and inserts on `unordered_set` of size `O(k)`; `unordered_set` operations have `O(1)` average complexity but `O(size of the set)` worst case complexity). Therefore it's not easy to put any complexity guarantee in the interface at this point. 
Instead, we say that: - some implementations may yield - if their complexities force us to do so - but in general, there is no guarantee that the implementation may yield - e.g. the `Everywhere` implementation does not yield. Fixes #8555. Closes #8647	2021-05-25 11:53:28 +03:00
Raphael S. Carvalho	ee39eb9042	sstables: Fix slow off-strategy compaction on STCS tables Off-strategy compaction on a table using STCS is slow because of the needless write amplification of 2. That's because STCS reshape isn't taking advantage of the fact that sstables produced by a repair-based operation are disjoint. So the ~256 input sstables were compacted (in batches of 32) into larger sstables, which in turn were compacted into even larger ones. That write amp is very significant on large data sets, making the whole operation 2x slower. Fixes #8449. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>	2021-05-25 11:24:42 +03:00
Asias He	70147dcb5a	storage_service: Add removenode_add_ranges helper Share the code between restore_replica_count and removenode_with_stream to reduce duplication. Refs #8700	2021-05-25 10:44:31 +08:00
Asias He	a285bd28e2	storage_service: Respect --enable-repair-based-node-ops flag during removenode In commit `829b4c1` (repair: Make removenode safe by default), removenode was changed to use repair based node operations unconditionally. Since repair based node operations are not enabled by default, we should respect the flag and use streaming to sync data if the flag is false. Fixes #8700	2021-05-25 10:42:58 +08:00
Kamil Braun	4658adbe18	tree-wide: introduce cdc::generation_id_v2 This is a new type of CDC generation identifiers. Compared to old IDs, in addition to the timestamp it contains a UUID. These new identifiers will allow a safer and more efficient algorithm of introducing new generations into a cluster (introduced in a later commit). For now, nodes keep using the old identifier format when creating new generations and whenever they learn about a new CDC generation from gossip they assume that it also is stored in the v1 format. But they do know how to (de)serialize the second format and how to persist new identifiers in local tables.	2021-05-24 17:50:21 +02:00
Avi Kivity	948e2c0b36	utils: config_file: delete unneeded template instantiation of operator<<() config_file.cc instantiates std::istream& std::operator>>(std::istream&, std::unordered_map<seastar::sstring, seastar::sstring>&), but that instantiation is ignored since config_file_impl.hh specializes that signature. -Winstantiation-after-specialization warns about it, so re-enable it now that the code base is clean. Also remove the matching "extern template" declaration, which has no definition any more. Closes #8696	2021-05-24 18:34:45 +03:00
Avi Kivity	60fb224171	Update seastar submodule * seastar 28dddd2683...f0f28d07e1 (4): > httpd: allow handler to not read an empty content Fixes #8691. > compat: source_location: implement if no std or experimental are available > compat: source_location: declare using in seastar::compat namespace > perftune.py: fix a bug in mlx4 IRQs names matching pattern	2021-05-24 17:44:08 +03:00
Piotr Sarna	95c6ec1528	Merge 'test/cql-pytest: clean up tests to run on Cassandra' from Nadav Har'El To keep our cql-pytest tests "correct", we should strive for them to pass on Cassandra - unless they are testing a Scylla-only feature or a deliberate difference between Scylla and Cassandra - in which case they should be marked "scylla-only" and cause such tests to be skipped when running on Cassandra. The following few small patches fix a few cases where our tests were failing on Cassandra. In one case this even found a bug in the test (a trivial Python mistake, but still). Closes #8694 * github.com:scylladb/scylla: test/cql-pytest: fix python mistake in an xfailing test test/cql-pytest: mark some tests with scylla-only test/cql-pytest: clean up test_create_large_static_cells_and_rows	2021-05-24 16:42:01 +02:00
Avi Kivity	789757a692	Merge 'cql3: represent lists as chunked_vector instead of std::vector' from Michał Chojnowski The cql3 layer manipulates lists as `std::vector`s (of `managed_bytes_opt`). Since lists can be arbitrarily large, let's use chunked vectors there to prevent potentially large contiguous allocations. Closes #8668 * github.com:scylladb/scylla: cql3: change the internal type of tuples::in_value from std::vector to chunked_vector cql3: change the internal type of lists::value from std::vector to chunked_vector cql3: in multi_item_terminal, return the vector of items by value	2021-05-24 17:19:45 +03:00
Nadav Har'El	edc2c65552	Merge 'Fix service level negative timeouts' from Piotr Sarna This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed. More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts. The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels. Closes #8645 * github.com:scylladb/scylla: cql-pytest: introduce service level test suite cql-pytest: add enabling authentication by default qos: fix validating service level timeouts for negative values	2021-05-24 16:30:13 +03:00
Tomasz Grabiec	b1821c773f	Merge "raft: basic RPC module testing" from Pavel Solodovnikov Now RPC module has some basic testing coverage to make sure RPC configuration is updated appropriately on configuration changes (i.e. `add_server` and `remove_server` are called when appropriate). The test suite currently consists of the following test-cases: * Loading server instance with configuration from a snapshot. * Loading server instance with configuration from a log. * Configuration changes (remove + add node). * Leader elections don't lead to RPC configuration changes. * Voter <-> learner node transitions also don't change RPC configuration. * Reverting uncommitted configuration changes updates RPC configuration accordingly (two cases: revert to snapshot config or committed state from the log). A few more refactorings are made along the way to be able to reuse some existing functions from `replication_test` in `rpc_test` implementation. Please note, though, that there are still some functions that are borrowed from `replication_test` but not yet extracted to common helpers. This is mostly because RPC tests don't need all the complexity that `replication_test` has, thus, some helpers are copied in a reduced form. It would take some effort to refactor these bits to fit both `replication_test` and `rpc_test` without sacrificing convenience. This will probably be addressed in another series later. 
* manmanson/raft-rpc-tests-v9-alt3: raft: add tests for RPC module test: add CHECK_EVENTUALLY_EQUAL utility macro raft: replication_test: reset test rpc network between test runs raft: replication_test: extract tickers initialization into a separate func raft: replication_test: support passing custom `apply_fn` to `change_configuration()` raft: replication_test: introduce `test_server` aggregate struct raft: replication_test: support voter<->learner configuration changes raft: remove duplicate `create_command` function from `replication_test` raft: avoid 'using' statements in raft testing helpers header	2021-05-24 14:44:37 +02:00
Benny Halevy	56d3cb514a	sstables: parse statistics: improve error handling Properly return malformed_sstable_exception if the statistics file fails to parse. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210524113808.973951-1-bhalevy@scylladb.com>	2021-05-24 15:12:48 +03:00
Nadav Har'El	5da0ad2ebc	Merge branch 'coverage-py-missing-features/v1' of https://github.com/denesb/scylla into next This patchset adds the missing features noted by the patchset introducing it, namely: * The ability to run a test through `coverage.py`, automating the entire process of setting up the environment, running the test and generating the report. This is possible with the new `--run` command line argument. It supports either generating a report immediately after running the provided test or just doing the running part, allowing the user to generate the report after having run all the tests they wanted to. * A tweakable verbosity level. It is also possible to specify a subset of the profiling data as input for the report. The documentation was also completed, with examples for all the intended uses-cases. With these changes, `coverage.py` is considered mature, the remaining rough edges being located in other scripts (`tests.py` and `configure.py`). It is now possible to generate a coverage report for any test desired. Also on: https://github.com/denesb/scylla.git coverage-py-missing-features/v1 Botond Dénes (5): scripts/coverage.py: allow specifying the input files to generate the report from scripts/coverage.py: add capability of running a test directly scripts/coverage.py: add --verbose parameter scripts/coverage.py: document intended uses-cases HACKING.md: redirect to ./coverage.py for more details scripts/coverage.py \| 143 +++++++++++++++++++++++++++++++++++++++----- HACKING.md \| 19 +----- 2 files changed, 129 insertions(+), 33 deletions(-)	2021-05-24 14:54:28 +03:00
Avi Kivity	50f3bbc359	Merge "treewide: various header cleanups" from Pavel S " The patch set is an assorted collection of header cleanups, e.g: * Reduce number of boost includes in header files * Switch to forward declarations in some places A quick measurement was performed to see if these changes provide any improvement in build times (ccache cleaned and existing build products wiped out). The results are posted below (`/usr/bin/time -v ninja dev-build`) for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX). Before: Command being timed: "ninja dev-build" User time (seconds): 28262.47 System time (seconds): 824.85 Percent of CPU this job got: 3979% Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2129888 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1402838 Minor (reclaiming a frame) page faults: 124265412 Voluntary context switches: 1879279 Involuntary context switches: 1159999 Swaps: 0 File system inputs: 0 File system outputs: 11806272 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 After: Command being timed: "ninja dev-build" User time (seconds): 26270.81 System time (seconds): 767.01 Percent of CPU this job got: 3905% Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2117608 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1400189 Minor (reclaiming a frame) page faults: 117570335 Voluntary context switches: 1870631 Involuntary context switches: 1154535 Swaps: 0 File system inputs: 0 File system outputs: 11777280 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page 
size (bytes): 4096 Exit status: 0 The observed improvement is about 5% of total wall clock time for `dev-build` target. Also, all commits make sure that headers stay self-sufficient, which would help to further improve the situation in the future. " * 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla: transport: remove extraneous `qos/service_level_controller` includes from headers treewide: remove evidently unneeded storage_proxy includes from some places service_level_controller: remove extraneous `service/storage_service.hh` include sstables/writer: remove extraneous `service/storage_service.hh` include treewide: remove extraneous database.hh includes from headers treewide: reduce boost headers usage in scylla header files cql3: remove extraneous includes from some headers cql3: various forward declaration cleanups utils: add missing <limits> header in `extremum_tracking.hh`	2021-05-24 14:24:20 +03:00
Yaron Kaikov	dd453ffe6a	install.sh: Setup aio-max-nr upon installation This is a follow up change to #8512. Let's add the aio conf file during the Scylla installation process and make sure we also remove this file when uninstalling Scylla. As per Avi Kivity's suggestion, let's set the aio value as static configuration, and make it large enough to work with 500 cpus. Closes #8650	2021-05-24 14:24:20 +03:00
Takuya ASADA	3d307919c3	scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem Currently, var-lib-scylla.mount may fail because it can start before the MDRAID volume is initialized. We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for the device to become available, but the systemd manual says it automatically configures the dependency for a mount unit when we specify the filesystem path by "absolute path of a device node". So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>. Fixes #8279 Closes #8681	2021-05-24 14:24:08 +03:00
Nadav Har'El	5206665b15	test/cql-pytest: fix python mistake in an xfailing test The xfailing test cassandra_tests/validation/entities/collections_test.py:: testSelectionOfEmptyCollections had a Python mistake (using {} instead of set() for an empty set), which resulted in its failure when run against Cassandra. After this patch it passes on Cassandra and fails on Scylla - as expected (this is why it is marked xfail). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-24 13:14:54 +03:00
Nadav Har'El	f26b31e950	test/cql-pytest: mark some tests with scylla-only Tests which are known to test a Scylla-only feature (such as CDC) or to rely on a known difference between Scylla and Cassandra should be marked "scylla-only", so they are skipped when running the tests against Cassandra (test/cql-pytest/run-cassandra) instead of reporting errors. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-24 13:03:48 +03:00
Nadav Har'El	c8117584e3	test/cql-pytest: clean up test_create_large_static_cells_and_rows The test test_create_large_static_cells_and_rows had its own implementation of "nodetool flush" using Scylla's REST API. Now that we have a nodetool.flush() function for general use in cql-pytest, let's use it and save a bit of duplication. Another benefit is that now this test can be run (and pass) against Cassandra. To allow this test to run on Cassandra, I had to remove a "USING TIMEOUT" which wasn't necessary for this test, and is not a feature supported by Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-24 12:31:51 +03:00
Eliran Sinvani	f2091bb227	workload prioritization: Reduce the logging sensitivity to "glitches" in availability Before this patch every failure to pull the configuration was reported as a warning. However this is confusing for users for two reasons: 1. It pollutes the logs if the configuration is polled, which is Scylla's mode of operation. Such a line is logged every failed iteration. 2. It confuses users because even though this level is warning, it logs out an exception and the log message contains the word failed. We see it a lot during QA runs and in customer questions from the field. Point 2 is only solvable by reducing the verbosity of the logged information, which will make debugging harder. Point 1 is addressed here in the following manner. First, the one shot configuration pull function does not handle the exception itself; this is OK because it is harmless to fail once or twice in a row in configuration pulling, like in every other query, and the caller is the one responsible for handling the exception and logging the information. Second, the polling loop captures the exceptions thrown from the configuration pulling function and only reports an error with the latest exception if the polling has failed in consecutive iterations over the last 90 seconds. This value was chosen because this is about the empirical worst case time that it takes a node to notice that one of the other nodes in the cluster is down (hence not querying it). It is not important for the user or for us to be notified of temporary glitches in availability (through this error at least), and since we are eventually consistent it is OK that some nodes will catch up with the configuration later than others. We also set a threshold after which, if the configuration still couldn't be retrieved, the logging level is bumped to ERROR. Closes #8574	2021-05-24 10:51:47 +02:00
Piotr Sarna	17f4a55664	qos: remove unused with_user_service_level helper This helper function is an artifact of forward-porting service levels, and it wouldn't even compile when used because of mismatched function declarations. It's not used anywhere in the open-source code, so it's removed to avoid future merge conflicts. Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>	2021-05-24 11:42:51 +03:00
Michał Chojnowski	4b60e69e7c	keys, compound: take the argument to from_single_value() by reference Since serialize_value needs to copy the values to a bigger buffer anyway, there is no point in copying the argument higher in the call chain. This patch eliminates some pointless copies, for example in alternator/executor.cc Closes #8688	2021-05-24 11:20:24 +03:00
Asias He	425e3b1182	gossip: Introduce direct failure detector Currently, gossip uses the updates of the gossip heartbeat from gossip messages to decide if a node is up or down. This means if a node is actually down but the gossip messages are delayed in the network, the marking of node down can be delayed. For example, a node sends 20 gossip messages in 20 seconds before it is dead. Each message is delayed 15 seconds by the network for some reason. A node receives those delayed messages one after another. Those delayed messages will prevent this node from being marked as down, because a heartbeat update is received just before the threshold to mark a node down is triggered, which is around 20 seconds by default. As a result, this node will not be marked as down in 20 * 15 seconds = 300 seconds, much longer than the ~20 seconds node down detection time in normal cases. In this patch, a new failure detector is implemented. - Direct detection The existing failure detector can get gossip heartbeat updates indirectly. For example: Node A can talk to Node B Node B can talk to Node C Node A can not talk to Node C, due to network issues Node A will not mark Node C to be down because Node A can get the heartbeat of Node C from Node B indirectly. This indirect detection is not very useful because when Node A decides if it should send requests to Node C, the requests from Node A to C will fail while Node A thinks it can communicate with Node C. This patch changes the failure detection to be direct. It uses the existing gossip echo message to detect directly. Gossip echo messages will be sent to peer nodes periodically. A peer node will be marked as down if a timeout threshold has been met. Since the failure detection is peer to peer, it avoids the delayed message issue mentioned above. - Parallel detection The old failure detector uses shard zero only. This new failure detector utilizes all the shards to perform the failure detection, each shard handling a subset of live nodes. 
For example, if the cluster has 32 nodes and each node has 16 shards, each shard will handle only 2 nodes. In a 16-node cluster where each node has 16 shards, each shard will handle only one peer node. A gossip message will be sent to peer nodes every 2 seconds. The extra echo message traffic produced compared to the old failure detector is negligible. - Deterministic detection Users can configure the failure_detector_timeout_in_ms to set the threshold to mark a node down. It is the maximum time between two successful echo messages before gossip marks a node down. It is easier to understand than the old phi_convict_threshold. - Compatible This patch only uses the existing gossip echo message. Nodes with or without this patch can work together. Fixes #8488 Closes #8036	2021-05-24 10:47:06 +03:00
Piotr Sarna	890ed201fd	Merge 'Enable -Wunused-private-field warning' from Avi Kivity The -Wunused-private-field was squelched when we switched to clang to make the change easier. But it is a useful warning, so re-enable it. It found a serious bug (#8682) and a few minor instances of waste. Closes #8683 * github.com:scylladb/scylla: build: enable -Wunused-private-field warning test: drop unused fields table: drop unused field database_sstable_write_monitor::_compaction_manager streaming: drop unused fields sstables: mx reader: drop unused _column_value_length field sstables: index_consumer: drop unused max_quantity field compaction: resharding_compaction: drop unused _shard field compaction: compaction_read_monitor: drop unused _compaction_manager field raft: raft_services: drop unused _gossiper field repair: drop unused _nr_peer_nodes field redis: drop unused fields _storage_proxy and _requests_blocked_memory mutation_rebuilder: drop unused field _remaining_limit db: data_listeners: remove unused field _db cql3: insert_json_statement: note bug with unused _if_not_exists cql3: authorized_prepared_statement_cache: drop unused field _logger auth: service_level_resource_view: drop unused field _resource	2021-05-24 09:21:10 +02:00
Michał Chojnowski	03faf139c8	collection_mutation: don't linearize collection values Yet another patch preventing potentially large allocations. Currently, collection_mutation{_view,}_description linearize each collection value during deserialization. It's not unthinkable that a user adds a large element to a list or a map, so let's avoid that. This patch removes the dependency on linearizing_input_stream, which does not provide a way to read fragmented subbuffers, and replaces it with a new helper, which does. (Extending linearizing_input_stream is not viable without rewriting it completely). Only linearization of collection values is corrected in this patch. Collection keys are still linearized. Storing them in managed_bytes is likely to be more harmful than helpful, because large map keys are extremely unlikely, and UUIDs, which are used as keys in lists, do not fit into managed_bytes's small value optimization, so this would incur an extra allocation for every list element. Note: this patch leaves utils/linearizing_input_stream.hh unused. Refs: #8120 Closes #8690	2021-05-23 12:16:56 +03:00
Michał Chojnowski	65be64d0fe	types: don't linearize values in abstract_type::hash Yet another patch aiming to prevent potentially large allocations. abstract_type::hash somehow evaded the anti-linearization patches until now. Fix that. Note that decimals and varints are still linearized, but we leave it be, under the assumption that nobody inserts 128KiB-large varints into a database. Refs: #8120 Closes #8689	2021-05-23 12:11:53 +03:00
Michał Chojnowski	ffdb706984	keys, compound: eliminate some careless copies of shared pointers Using `auto` copies the shared pointers. We don't want that, so let's use `const auto&`. Closes #8686	2021-05-23 12:11:46 +03:00
Michał Chojnowski	ebe485953a	types: fix a case of type punning via union Type punning via unions is legal in C, but illegal (undefined behaviour) in C++. Use the legal bit_cast instead. Closes #8685	2021-05-23 10:12:56 +03:00
Michał Chojnowski	e4405692ae	types: remove some dead code Closes #8684	2021-05-23 09:57:30 +03:00
Michał Chojnowski	23909e91a4	alternator: executor: eliminate some pointless reserializations There are places where abstract_type::deserialize is called just to pass the result to compound_wrapper::from_singular, which immediately serializes it again. Get rid of this ritual by adding a version of from_singular which takes a serialized argument. As a bonus, along the way we eliminate some pointless copies of lw_shared_ptr and std::shared_ptr caused by two careless uses of `auto`. Closes #8687	2021-05-23 09:42:09 +03:00
Gleb Natapov	b4d6bdb16e	raft: test: check that a leader does not send probes to a follower in the snapshot mode Message-Id: <YKTNN7vNGkQwTDX7@scylladb.com>	2021-05-23 01:06:12 +02:00
Michał Chojnowski	d72b91053b	logalloc: fix quadratic behaviour of reclaim_from_evictable As an optimization for optimistic cases, reclaim_from_evictable first evicts the requested amount of memory before attempting to reclaim segments through compactions. However, due to an oversight, it does this before every compaction instead of once before all compactions. Usually reclaim_from_evictable is called with small targets, or is preemptible, and in those cases this issue is not visible. However, when the target is bigger than one segment and the reclaim is not preemptible, which is the case when it's called from allocating_section, this results in a quadratic explosion of evictions, which can evict several hundred MiB to reclaim a few MiB. Fix that by calculating the target of memory eviction only once, instead of recalculating it after every compaction. Fixes #8542. Closes #8611	2021-05-22 20:49:00 +02:00
Avi Kivity	78e392c01d	build: enable -Wunused-private-field warning The -Wunused-private-field was squelched when we switched to clang to make the change easier. But it is a useful warning, so re-enable it. It found a serious bug (#8682) and a few minor instances of waste.	2021-05-21 21:05:16 +03:00
Avi Kivity	7e5a0b6fd0	test: drop unused fields Drop unused fields in various tests and test libraries.	2021-05-21 21:04:49 +03:00
Avi Kivity	1d508106be	table: drop unused field database_sstable_write_monitor::_compaction_manager	2021-05-21 21:04:20 +03:00
Avi Kivity	84be89eb3b	streaming: drop unused fields	2021-05-21 21:03:23 +03:00
Avi Kivity	047b3f85d3	sstables: mx reader: drop unused _column_value_length field	2021-05-21 21:02:55 +03:00
Avi Kivity	32d9ba2fbb	sstables: index_consumer: drop unused max_quantity field	2021-05-21 21:02:16 +03:00
Avi Kivity	cb587aaa5c	compaction: resharding_compaction: drop unused _shard field	2021-05-21 21:01:54 +03:00
Avi Kivity	f62469b7c5	compaction: compaction_read_monitor: drop unused _compaction_manager field A constructor that now takes one argument is made explicit.	2021-05-21 21:00:47 +03:00
Avi Kivity	b8137986e6	raft: raft_services: drop unused _gossiper field	2021-05-21 21:00:04 +03:00
Avi Kivity	0b8b9f0cbf	repair: drop unused _nr_peer_nodes field	2021-05-21 20:59:23 +03:00
Avi Kivity	195c969304	redis: drop unused fields _storage_proxy and _requests_blocked_memory Probably carried over by copy-paste. Also drop storage_proxy include.	2021-05-21 20:58:32 +03:00
Avi Kivity	a0257d95c2	mutation_rebuilder: drop unused field _remaining_limit And its initializer.	2021-05-21 20:57:33 +03:00
Avi Kivity	924f93028a	db: data_listeners: remove unused field _db Remove the unused field and the constructor that populated it.	2021-05-21 20:56:42 +03:00
Avi Kivity	539948760f	cql3: insert_json_statement: note bug with unused _if_not_exists The parser accepts INSERT JSON ... IF NOT EXISTS but we later ignore it. This is a bug (#8682). Note it down and shut down a compiler error that will result when we enable -Wunused-private-field.	2021-05-21 20:54:44 +03:00
Avi Kivity	adbe7ad919	cql3: authorized_prepared_statement_cache: drop unused field _logger	2021-05-21 20:54:01 +03:00
Avi Kivity	ed160df0f9	auth: service_level_resource_view: drop unused field _resource It should probably be used in operator<<(std::ostream&), but whoever implements that gets to re-add the field.	2021-05-21 20:53:01 +03:00
Botond Dénes	61c43b8983	HACKING.md: redirect to ./coverage.py for more details scripts/coverage.py now has a detailed help, don't repeat its content in HACKING.md, instead redirect users who want to learn more to the script's help.	2021-05-21 11:50:39 +03:00
Botond Dénes	cd17932b96	scripts/coverage.py: document intended uses-cases	2021-05-21 11:50:39 +03:00
Botond Dénes	4647472fee	scripts/coverage.py: add --verbose parameter	2021-05-21 11:50:39 +03:00
Botond Dénes	2d7fe702e1	scripts/coverage.py: add capability of running a test directly Through `coverage.py`. This saves the user from all the env setup required for `coverage.py` to successfully generate the report afterwards. Instead all of this is taken care of automatically, by just running: ./scripts/coverage.py --run ./build/coverage/.../mytest arg1 ... argN `coverage.py` takes care of running the test and generating a coverage report from it. As a side effect, also fix `main()` ignoring its `argv` parameter.	2021-05-21 11:50:39 +03:00
Botond Dénes	64c4557ba8	scripts/coverage.py: allow specifying the input files to generate the report from Currently `coverage.py` includes all raw profiling data found at PATH automatically. This patch gives an option to override this, instead including only the given input files in the report.	2021-05-21 11:50:39 +03:00
Nadav Har'El	a2379b96b1	alternator test: test for large BatchGetItem This patch adds an Alternator test, test_batch_get_item_large, which checks a BatchGetItem with a moderately large (1.5 MB) response. The test passes - we do not have a bug in BatchGetItem - but it does reproduce issue #8522 - the long response is stored in memory as one long contiguous string and causes a warning about an over-sized allocation: WARN ... seastar_memory - oversized allocation: 2281472 bytes. Incidentally, this test also reproduces a second contiguous allocation problem - issue #8183 (in BatchWriteItem which we use in this test to set up the item to read). Refs #8522 Refs #8183 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210520161619.110941-1-nyh@scylladb.com>	2021-05-21 08:38:53 +02:00
Avi Kivity	4383674760	cql3: result_set: switch rows to chunked_vector _rows uses a deque, but doesn't need any special functionality. Switch to chunked_vector, which uses one less allocation in the common case (std::deque has an extra allocation for managing its chunks). Closes #8679	2021-05-20 20:14:15 +03:00
Avi Kivity	eac6fb8d79	gdb: bypass unit test on non-x86 The gdb self-tests fail on aarch64 due to a failure to use thread-local variables. I filed [1] so it can get fixed. Meanwhile, disable the test so the build passes. It is sad, but the aarch64 build is not impacted by these failures. [1] https://sourceware.org/bugzilla/show_bug.cgi?id=27886 Closes #8672	2021-05-20 20:14:15 +03:00
Asias He	2ec1f719de	repair: Always use run_replace_ops Currently, the new NODE_OPS_CMD for replace operation is used only when repair based node operation is enabled. However, we can use the NODE_OPS_CMD to run the replace operation and use streaming instead of repair to sync data as well. After this patch, we will use streaming inside run_replace_ops if repair based node ops is not enabled, so that we get the benefits that NODE_OPS_CMD brings in commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation). Fixes #8013	2021-05-20 20:14:15 +03:00
Avi Kivity	bb51f7d928	Update seastar submodule * seastar 847fccaf5e...28dddd2683 (13): > reactor: disable xfs extent size hints if using the kernel page cache > smp: replace _reactors global with a local > Merge "Add test for IO-scheduler (fails now)" from Pavel E > weak_ptr: lift restriction on copying > core: expose hidden method from parent class > perftune.py: __get_feature_file(): verify that parameters are not None > gate: assert no outstanding requests when destroyed > httpd: add status_types > cmake: use -O2 for CMAKE_CXX_FLAGS_DEV with clang > compat: source_location: use std::source_location only if available > iotune: disambiguate "this" lambda capture in C++20 mode > Merge "Consider disk saturation request lengths" from Pavel E > Merge 'seastar-addr2line: support oneline backtrace in resolve call' from Benny Halevy	2021-05-20 20:14:15 +03:00
Benny Halevy	5724233609	scylla-gdb: scylla_io_queues: support io_group._max_bytes_count _maximum_request_size is renamed to _max_bytes_count in `40a29d5590` This patch adds support for ioq io_group._max_bytes_count if io_group._maximum_request_size isn't found. Test: scylla-gdb(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210520151537.710123-1-bhalevy@scylladb.com>	2021-05-20 20:14:15 +03:00
Avi Kivity	30034371e7	Merge "Remove most of global pointers from repair" from Pavel " There is a lot of global stuff in repair -- a bunch of pointers to sharded services, tracker, map of metas (maybe more). This set removes the first group, all those services had become main-local recently. Along the way a call to global storage proxy is dropped. To get there the repair_service is turned into a "classical" sharded<> service, gets all the needed dependencies by references from main and spreads them internally where needed. Tracker and other stuff is left global, but tracker is now the candidate for merging with the now sharded repair_service, since it emulates the sharded concept internally. Overall the change is - make repair_service sharded and put all dependencies on it at start - have sharded<repair_service> in API and storage service - carry the service reference down to repair_info and repair_meta constructions to give them the dependencies - use needed services in _info and _meta methods tests: unit(dev), dtest.repair(dev) " * 'br-repair-service' of https://github.com/xemul/scylla: (29 commits) repair: Drop most of globals from repair repair: Use local references in messaging handler checks repair: Use local references in create_writer() repair: Construct repair_meta with local references repair: Keep more stuff on repair_info repair: Kill bunch of global usages from insert_repair_meta repair: Pass repair service down to meta insertion repair: Keep local migration manager on repair_info repair: Move unused db captures repair: Remove unused ms captures repair: Construct repair_info with service repair: Loop over repair sharded container repair: Make sync_data_using_repair a method repair: Use repair from storage service repair: Keep repair on storage service repair: Make do_repair_start a method repair: Pass repair_service through the API until do_repair_start repair: Fix indentation after previous patch repair: Split sync_data_using_repair repair: Turn 
repair_range a repair_info method ...	2021-05-20 10:57:48 +03:00
Pavel Solodovnikov	b51b11f226	transport: remove extraneous `qos/service_level_controller` includes from headers Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:32:15 +03:00
Pavel Solodovnikov	238273d237	treewide: remove evidently unneeded storage_proxy includes from some places Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:19:32 +03:00
Pavel Solodovnikov	0663aa6ca1	service_level_controller: remove extraneous `service/storage_service.hh` include Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:18:41 +03:00
Pavel Solodovnikov	d7a77a993f	sstables/writer: remove extraneous `service/storage_service.hh` include Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:03:24 +03:00
Pavel Solodovnikov	c3a7b55507	treewide: remove extraneous database.hh includes from headers Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:59:14 +03:00
Pavel Solodovnikov	fff7ef1fc2	treewide: reduce boost headers usage in scylla header files `dev-headers` target is also ensured to build successfully. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:33:18 +03:00
Pavel Solodovnikov	9352a08468	cql3: remove extraneous includes from some headers Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:32:57 +03:00
Piotr Sarna	223a59c09c	test: make rjson allocator test work in sanitize mode Following Nadav's advice, instead of ignoring the test in sanitize/debug modes, the allocator simply has a special path of failing sufficiently large allocation requests. With that, a problem with the address sanitizer is bypassed and other debug mode sanitizers can inspect and check if there are no more problems related to wrapping the original rapidjson allocator. Closes #8539	2021-05-20 00:42:47 +03:00
Pavel Solodovnikov	ae213e1e25	cql3: various forward declaration cleanups Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 00:18:00 +03:00
Pavel Solodovnikov	94b5c6333f	utils: add missing <limits> header in `extremum_tracking.hh` This makes all headers in scylla to be self-sufficient up to the moment. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 00:04:51 +03:00
Pavel Solodovnikov	a66de8658b	raft: add tests for RPC module Now RPC module has some basic testing coverage to make sure RPC configuration is updated appropriately on configuration changes (i.e. `add_server` and `remove_server` are called when appropriate). The test suite currently consists of the following test cases: * Loading server instance with configuration from a snapshot. * Loading server instance with configuration from a log. * Configuration changes (remove + add node). * Leader elections don't lead to RPC configuration changes. * Voter <-> learner node transitions also don't change RPC configuration. * Reverting uncommitted configuration changes updates RPC configuration accordingly (two cases: revert to snapshot config or committed state from the log). Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-19 23:14:04 +03:00
Pavel Solodovnikov	e030e291a8	test: add CHECK_EVENTUALLY_EQUAL utility macro It would be good to have a `CHECK` variant in addition to an existing `REQUIRE_EVENTUALLY_EQUAL` macro. Will be used in raft RPC tests. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-19 23:12:55 +03:00
Pavel Solodovnikov	2067cc75c6	raft: replication_test: reset test rpc network between test runs Currently, emulated rpc network is shared between all test cases in `replication_test.cc` (see static `rpc::net` map). Though, its value is not reset when executing a subsequent test case, which opens a possibility for heap-use-after-free bugs. Also, make all `send_*` functions in test rpc class to throw an error if a node being contacted is not in the network instead of past-the-end access. This allows to safely contact a non-existent node, which will be used in RPC tests later. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-19 23:06:29 +03:00
Avi Kivity	c71d007797	consistency_level: deinline assure_sufficient_live_nodes() assure_sufficient_live_nodes() is a huge template calling other huge templates, and requires "network_topology_strategy.hh". It is inlined in consistency_level.hh. This increases compile time and recompiles. Move the template out-of-line and use "extern template" to instantiate it. This is not ideal as new callers would require updates to the instantiated signatures, but I think our goal should be to de-template it completely instead. Meanwhile, this reduces some pain. Ref #1. Closes #8637	2021-05-19 15:03:51 +03:00
Avi Kivity	d8121961fa	Merge 'cql-pytest: add nodetool flush feature and use it in a test' from Nadav Har'El The first patch adds a nodetool-like capability to the cql-pytest framework. It is not meant to be used to test nodetool itself, but rather to give CQL tests the ability to use nodetool operations - currently only one operation - "nodetool flush". We try to use Scylla's REST API, if possible, and only fall back to using an external "nodetool" command when the REST API is not available - i.e., when testing Cassandra. The benefit of using the REST API is that we don't need to run the jmx server to test Scylla. The second patch is an example of using the new nodetool flush feature in a test that needs to flush data to reproduce a bug (which has already been fixed). Closes #8622 * github.com:scylladb/scylla: cql-pytest: reproducer for issue #8138 cql-pytest: add nodetool flush feature	2021-05-19 14:40:18 +03:00
Nadav Har'El	fd8d15a1a6	cql-pytest: reproducer for issue #8138 We add a reproducing test for issue #8138, where, if we write to a TWCS table, scanning it would yield no rows - and worse - crash the debug build. This test requires "nodetool flush" to force the read to happen from sstables, hence the nodetool feature was implemented in the previous patch (on Scylla, it uses the REST API - not actually running nodetool or requiring JMX). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-19 13:58:14 +03:00
Nadav Har'El	49580a4701	cql-pytest: add nodetool flush feature This patch adds a nodetool-compatible capability to the cql-pytest framework. It is not meant to be used to test nodetool itself, but rather to give CQL tests the ability to use nodetool operations - currently one operation - "nodetool flush". Use it in a test as: import nodetool nodetool.flush(cql, table) I chose a functional API with parameters ("cql") instead of a fixture with an implied connection so that in the future we may allow multiple nodes and this API will allow sending nodetool requests to different nodes. However, multi-node support is not implemented yet, nor used in any of the existing tests. The implementation uses Scylla's REST API if available, or if not, falls back to using an external "nodetool" command (which can be overridden using the NODETOOL environment variable). This way, both cql-pytest/run (Scylla) and cql-pytest/run-cassandra (Cassandra) now correctly support these nodetool operations, and we still don't need to run JMX to test Scylla. The reason we want to support nodetool.flush() is to reproduce bugs that depend on data reaching disk. We already had such a reproducer in test_large_cells_rows.py - it too did something similar - but it was Scylla-only (using only the REST API). Instead of copying such code to multiple places, we better have a common nodetool.flush() function, as done in this patch. The test in test_large_cells_rows.py can later be changed to use the new function. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-05-19 13:55:25 +03:00
Avi Kivity	794d272e35	Merge "Refine allocation strategy" from Pavel E " This set does two things: - hides migrate-fn machinery in allocation_strategy header - conceptualizes dynamic objects The former is possible after IMR rework -- nowadays users of LSA don't need to do anything special with "migrators" so they can be turned to be internal allocation-strategy helpers. The latter is to make sure dynamic objects do not forget to overload the size_for_allocation_strategy(). If this happens the whole thing compiles fine and sometimes works, but generates memory corruptions, so it's worth adding more confidence here. tests: unit(dev) " * 'br-lsa-hide-migrators' of https://github.com/xemul/scylla: bptree: Require dynamic object for nodes reconstruct allocation_strategy, code: Conceptualize dynamic objects allocation_strategy: Hide migrators allocation_strategy, code: Simplify alloc() allocation_strategy: Mark size_for_allocation_strategy noexcept	2021-05-19 10:14:51 +03:00
Pavel Emelyanov	0c4ba56594	bptree: Require dynamic object for nodes reconstruct The B+ tree is not intrusive and supports both kinds of objects -- dynamic (in sense of previous patch) and fixed-size. Respectively, the nodes provide .storage_size() method and get the embedded object storage size themselves. Thus, if a dynamic object is used with the tree but it misses the .storage_size() itself this would go unnoticed. Fortunately, dynamic objects use the .reconstruct() method, so the method should be equipped with the DynamicObject concept. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-19 09:23:49 +03:00
Pavel Emelyanov	9216a5bc08	allocation_strategy, code: Conceptualize dynamic objects Usually lsa allocation is performed with the construct() helper that allocates a sizeof(T) slot and constructs it in place. Some rare objects have dynamic size, so they are created by alloc()ating a slot of some specific size and (!) must provide the correct overload of size_for_allocation_strategy that reports back the relevant storage size. This "must provide" is not enforced: if missed, a default sizer would be instantiated, but won't work properly. This patch makes all users of alloc() conform to DynamicObject concept which requires the presence of .storage_size() method to tell that size. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-19 09:23:49 +03:00
Pavel Emelyanov	b8a4f32b48	allocation_strategy: Hide migrators After IMR rework the only lsa-migrating functionality is standard one that calls move constructors on lsa slots. Hide the whole thing inside allocation-strategy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-19 09:23:49 +03:00
Pavel Emelyanov	28f01aadc9	allocation_strategy, code: Simplify alloc() Today's alloc() accepts migrate-fn, size and alignment. All the callers don't really need to provide anything special for the migrate-fn and are just happy with default alignof() for alignment. The simplification is in providing alloc() that only accepts size arg and does the rest itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-19 09:23:49 +03:00
Pavel Emelyanov	fdfcda97d7	allocation_strategy: Mark size_for_allocation_strategy noexcept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-19 09:23:49 +03:00
Botond Dénes	dbb6851d4d	test/manual/sstable_scan_footprint: don't double close the semaphore The semaphore `stats_collector` references is the one obtained from the database object, which is already stopped by `database::stop()`, making the stop in `~stats_collector()` redundant, and even worse, as it triggers an assert failure. Remove it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>	2021-05-18 17:55:52 +03:00
Avi Kivity	16ff92745f	Merge 'perf: add alternator frontend to perf_simple_query' from Piotr Sarna The perf_simple_query tool is extended with another protocol aside from CQL - alternator. The alternative (pun intended) benchmark can be executed by using the `--alternator X` parameter, where X specifies one of the alternator's mandatory write isolation options: - "forbid_rmw" - forbids RMW (read-modify-write) requests - "unsafe" - never uses LWT (lightweight transactions), even for RMW - "always_use_lwt" - uses LWT even for non-RMW requests - "only_rmw_uses_lwt" - that one's rather self-explanatory Alternator cooperates with existing `--write` and `--delete` parameters. Aside from being able to check for improvements/regressions in the alternator module, it's also possible to check how different isolation levels influence the number of allocations and overall performance, or to compare alternator against CQL. Example output showing the difference in isolation levels: ```bash $ ./build/release/test/perf/perf_simple_query_g --smp 1 \ --write --alternator only_rmw_uses_lwt --default-log-level error random-seed=1235000092 Started alternator executor 10873.76 tps (202.9 allocs/op, 12.4 tasks/op, 369921 insns/op) 11096.09 tps (202.7 allocs/op, 12.1 tasks/op, 374792 insns/op) 11100.09 tps (203.0 allocs/op, 12.1 tasks/op, 376469 insns/op) 11068.98 tps (203.1 allocs/op, 12.1 tasks/op, 377132 insns/op) 11081.24 tps (203.2 allocs/op, 12.1 tasks/op, 377290 insns/op) median 11081.24 tps (203.2 allocs/op, 12.1 tasks/op, 377290 insns/op) median absolute deviation: 14.85 maximum: 11100.09 minimum: 10873.76 $ ./build/release/test/perf/perf_simple_query_g --smp 1 \ --random-seed 1235000092 --write --alternator always_use_lwt \ --default-log-level error random-seed=1235000092 Started alternator executor 3605.35 tps (877.4 allocs/op, 174.6 tasks/op, 986666 insns/op) 3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op) 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 
insns/op) 3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op) 3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op) median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op) median absolute deviation: 75.15 maximum: 3605.35 minimum: 3409.88 ``` Closes #8656 * github.com:scylladb/scylla: perf: add alternator frontend to perf_simple_query cdc: make metadata.hh self-sufficient test: add minimal alternator_test_env	2021-05-18 16:17:54 +03:00
Piotr Sarna	6c6ccda8a0	perf: add alternator frontend to perf_simple_query The perf_simple_query tool is extended with another protocol aside from CQL - alternator. The alternative (pun intended) benchmark can be executed by using the `--alternator X` parameter, where X specifies one of the alternator's mandatory write isolation options: - "forbid_rmw" - forbids RMW (read-modify-write) requests - "unsafe" - never uses LWT (lightweight transactions), even for RMW - "always_use_lwt" - uses LWT even for non-RMW requests - "only_rmw_uses_lwt" - that one's rather self-explanatory Alternator cooperates with existing --write and --delete parameters. Aside from being able to check for improvements/regressions in the alternator module, it's also possible to check how different isolation levels influence the number of allocations and overall performance, or to compare alternator against CQL. $ ./build/release/test/perf/perf_simple_query_g --smp 1 \ --write --alternator only_rmw_uses_lwt --default-log-level error random-seed=1235000092 Started alternator executor 10873.76 tps (202.9 allocs/op, 12.4 tasks/op, 369921 insns/op) 11096.09 tps (202.7 allocs/op, 12.1 tasks/op, 374792 insns/op) 11100.09 tps (203.0 allocs/op, 12.1 tasks/op, 376469 insns/op) 11068.98 tps (203.1 allocs/op, 12.1 tasks/op, 377132 insns/op) 11081.24 tps (203.2 allocs/op, 12.1 tasks/op, 377290 insns/op) median 11081.24 tps (203.2 allocs/op, 12.1 tasks/op, 377290 insns/op) median absolute deviation: 14.85 maximum: 11100.09 minimum: 10873.76 $ ./build/release/test/perf/perf_simple_query_g --smp 1 \ --random-seed 1235000092 --write --alternator always_use_lwt \ --default-log-level error random-seed=1235000092 Started alternator executor 3605.35 tps (877.4 allocs/op, 174.6 tasks/op, 986666 insns/op) 3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op) 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op) 3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op) 3409.88 tps (913.2 
allocs/op, 174.4 tasks/op, 1041240 insns/op) median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op) median absolute deviation: 75.15 maximum: 3605.35 minimum: 3409.88	2021-05-18 15:10:31 +02:00
Piotr Sarna	6e28c01c53	cdc: make metadata.hh self-sufficient The header relies on topology_description class definition, which is part of cdc/generation.hh.	2021-05-18 15:10:31 +02:00
Piotr Sarna	b6d6247a74	test: add minimal alternator_test_env A minimal implementation of alternator test env, a younger cousin of cql_test_env, is implemented. Note that using this environment for unit tests is strongly discouraged in favor of the official test/alternator pytest suite. Still, alternator_test_env has its uses for microbenchmarks.	2021-05-18 15:10:31 +02:00
Takuya ASADA	a3b25e3d29	unified/uninstall.sh: simplify uninstall.sh, delete all files correctly The current uninstall.sh tries to mirror the logic of install.sh, but this makes the script needlessly large, and it also fails to remove a few files under /opt/scylladb. Let's just do rm -rf /opt/scylladb, and drop the few other files located outside of /opt/scylladb. Closes #8662	2021-05-18 14:55:18 +02:00
Asias He	0858619cba	storage_service: Abort restore_replica_count when node is removed from the cluster Consider the following procedure: - n1, n2, n3 - n3 is down - n1 runs nodetool removenode uuid_of_n3 to remove n3 from the cluster - n1 is down in the middle of removenode operation Node n1 will set n3 to removing gossip status during removenode operation. Whenever existing nodes learn a node is in removing gossip status, they will call restore_replica_count to stream data from other nodes for the ranges n3 loses if n3 was removed from the cluster. If the streaming fails, the streaming will sleep and retry. The current max number of retry attempts is 5. The sleep interval starts at 60 seconds and increases 1.5 times per sleep. This can leave the cluster in a bad state. For example, nodes can run out of disk space if the streaming continues. We need a way to abort such streaming attempts. To abort the removenode operation and forcibly remove the node, users can run `nodetool removenode force` on any existing nodes to move the node from removing gossip status to removed gossip status. However, the restore_replica_count will not be aborted. In this patch, a status checker is added in restore_replica_count, so that once a node is in removed gossip status, restore_replica_count will be aborted. This patch is for older releases without the new NODE_OPS_CMD infrastructure where such abort will happen automatically in case of error. Fixes #8651 Closes #8655	2021-05-18 14:55:18 +02:00
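The retry schedule described in the commit above (at most 5 retry attempts, a 60-second initial sleep, each sleep 1.5 times the previous one) can be worked through with a small sketch. The function name and the assumption that one sleep follows each failed attempt are illustrative only and are not taken from the Scylla source.

```python
def retry_sleep_schedule(max_retries=5, initial_sleep=60.0, factor=1.5):
    """Return the sleep intervals in seconds, one per retry attempt.

    Assumption (not from the commit): a sleep follows every failed
    attempt, so the worst case contains max_retries sleeps.
    """
    sleeps = []
    interval = initial_sleep
    for _ in range(max_retries):
        sleeps.append(interval)
        interval *= factor  # each sleep is 1.5x the previous one
    return sleeps


schedule = retry_sleep_schedule()
total = sum(schedule)
print(schedule)  # per-attempt sleep intervals
print(total)     # total time spent sleeping in the worst case
```

Under these assumptions the worst case spends roughly 13 minutes sleeping, on top of the time each failed streaming attempt itself takes, which is the window the new status checker lets a node cut short.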
Botond Dénes	82bff1bcc6	test: cql_test_env: use proper scheduling groups Currently `cql_test_env` runs its `func` in the default (main) group and also leaves all scheduling groups in `dbcfg` default initialized to the same scheduling group. This results in every part of the system, normally isolated from each other, running in the same (default) scheduling group. Not a big problem on its own, as we are talking about tests, but this creates an artificial difference between the test and the real environment, which is ever more pronounced since certain query parameters are selected based on the current scheduling group. To bring cql test env just that little bit closer to the real thing, this patch creates all the scheduling groups main does (well almost) and configures `dbcfg` with them. Creating and destroying the scheduling group on each setup-teardown of cql test env breaks some internal seastar components which don't like seeing the same scheduling group with the same name but different id. So create the scheduling groups once on first access and keep them around until the test executable is running. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>	2021-05-18 13:44:54 +03:00
Botond Dénes	300ee974f7	test: use with_cql_test_env_thread where needed Currently `with_cql_test_env()` is equivalent to `with_cql_test_env_thread()`, which resulted in many tests using the former while really needing the latter and getting away with it. This equivalence is incidental and will go away soon, so make sure all cql test env using tests that expect to be run in a thread use the appropriate variant. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>	2021-05-18 13:44:52 +03:00
Avi Kivity	6db826475d	Merge "Introduce segregate scrub mode" from Botond " The current scrub compaction has a serious drawback: while it is very effective at removing any corruptions it recognizes, it is very heavy-handed in its way of repairing such corruptions: it simply drops all data that is suspected to be corrupt. While this is the safest way to cleanse data, it might not be the best way from the point of view of a user who doesn't want to lose data, even at the risk of retaining some business-logic level corruption. Mind you, no database-level scrub can ever fully repair data from the business-logic point of view, they can only do so on the database-level. So in certain cases it might be desirable to have a less heavy-handed approach of cleansing the data, that tries as hard as it can to not lose any data. This series introduces a new scrub mode, with the goal of addressing this use-case: when the user doesn't want to lose any data. The new mode is called "segregate" and it works by segregating its input into multiple outputs such that each output contains a valid stream. This approach can fix any out-of-order data, be that on the partition or fragment level. Out-of-order partitions are simply written into a separate output. Out of order fragments are handled by injecting a partition-end/partition-start pair right before them, so that they are now in a separate (duplicate) partition, that will just be written into a separate output, just like a regular out-of-order partition. The reason this series is posted as an RFC is that although I consider the code stable and tested, there are some questions related to the UX. 
* First and foremost, every scrub that does more than just discard data that is suspected to be corrupt (but even these to a certain degree) has to consider the possibility that it is rehabilitating corruptions, leaving them in the system without a warning, in the sense that the user won't see any more problems due to low-level corruptions and hence might think everything is alright, while data is still corrupt from the business logic point of view. It is very hard to draw a line between what scrub should and shouldn't do, yet there is a demand from users for scrub that can restore data without losing any of it. Note that anybody executing such a scrub is already in a bad shape, even if they can read their data (they often can't) it is already corrupt, scrub is not making anything worse here. * This series converts the previous `skip_corrupted` boolean into an enum, which now selects the scrub mode. This means that `skip_corrupted` cannot be combined with segregate to throw out what the former can't fix. This was chosen for simplicity: a bunch of flags, all interacting with each other, is very hard to see through in my opinion, while a linear mode selector is much easier. * The new segregate mode goes all-in, by trying to fix even fragment-level disorder. Maybe it should only do it on the partition level, or maybe this should be made configurable, allowing the user to select what happens to data that cannot be fixed. 
Tests: unit(dev), unit(sstable_datafile_test:debug) " * 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla: test: boost/sstable_datafile_test: add tests for segregate mode scrub api: storage_service/keyspace_scrub: expose new segregate mode sstables: compaction/scrub: add segregate mode mutation_fragment_stream_validator: add reset methods mutation_writer: add segregate_by_partition api: /storage_service/keyspace_scrub: add scrub mode param sstables: compaction/scrub: replace skip_corrupted with mode enum sstables: compaction/scrub: prevent infinite loop when last partition end is missing tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests	2021-05-18 13:43:01 +03:00
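The segregation idea described in the series above, splitting an out-of-order input into multiple outputs so that each output is a valid (ordered) stream, can be illustrated with a minimal sketch. This is not Scylla's implementation: integer keys stand in for partitions, the greedy "first output that fits" strategy is an assumption, and `segregate` is a hypothetical name.

```python
def segregate(keys):
    """Split keys into outputs that are each strictly increasing.

    A key that would break the ordering of every existing output
    (including a duplicate key) opens a new output, so no input
    is ever dropped.
    """
    outputs = []
    for key in keys:
        for out in outputs:
            if out[-1] < key:       # key keeps this output ordered
                out.append(key)
                break
        else:
            outputs.append([key])   # out-of-order: start a new output
    return outputs


result = segregate([1, 3, 2, 4, 2, 5])
print(result)  # [[1, 3, 4, 5], [2], [2]]
```

Every input key ends up in exactly one output, which reflects the stated goal of the mode: nothing is discarded, at the cost of producing more outputs when the input is badly disordered.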
Botond Dénes	5eb4517f56	read_context: move_to_next_partition(): make reader creation atomic Otherwise an interleaving cache update can clear the `_prev_snapshot` before the reader is created, leading to the reader being created via a null mutation source. Tests: unit(dev, release, debug:row_cache_test) Fixes #8671. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210518092317.227433-1-bdenes@scylladb.com>	2021-05-18 13:41:48 +03:00
Piotr Sarna	c8653d1321	cql3: enhance the fix for index paging type check The original fix stripped the reversed type only from the base table column, but it's better to be safe than sorry, so the reverse is also stripped from the view column. Refs #8667 Message-Id: <cb5dedb0b8b6b5eea3a69863ae50a0e906482665.1621330463.git.sarna@scylladb.com>	2021-05-18 12:47:35 +03:00
Takuya ASADA	60c0b37a4c	install.sh: apply correct file security context when copying files Currently, unified installer does not apply correct file security context while copying files, it causes permission error on scylla-server.service. We should apply default file security context while copying files, using '-Z' option on /usr/bin/install. Also, because install -Z requires normalized path to apply correct security context, use 'realpath -m <PATH>' on path variables on the script. Fixes #8589 Closes #8602	2021-05-18 12:09:51 +03:00
Takuya ASADA	6faa8b97ec	install.sh: fix no such file or directory on nonroot Since we have added scylla-node-exporter, we needed to do 'install -d' for systemd directory and sysconfig directory before copying files. Fixes #8663 Closes #8664	2021-05-18 12:03:45 +03:00
Avi Kivity	593ad4de1e	Merge 'Fix type checking in index paging' from Piotr Sarna When recreating the paging state from an indexed query, a bunch of panic checks were introduced to make sure that the code is correct. However, one of the checks is too eager - namely, it throws an error if the base column type is not equal to the view column type. It usually works correctly, unless the base column type is a clustering key with DESC clustering order, in which case the type is actually "reversed". From the point of view of the paging state generation it's not important, because both types deserialize in the same way, so the check should be less strict and allow the base type to be reversed. Tests: unit(release), along with the additional test case introduced in this series; the test also passes on Cassandra Fixes #8666 Closes #8667 * github.com:scylladb/scylla: test: add a test case for paging with desc clustering order cql3: relax a type check for index paging	2021-05-18 11:34:59 +03:00
Kamil Braun	03ad111beb	tree-wide: comments on deprecated functions to access global variables Closes #8665	2021-05-18 11:31:10 +03:00
Botond Dénes	ae366868fb	multishard_mutation_query: save_reader(): avoid round-trip for destroying rparts Force its destruction when saving the reader. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514140844.119362-1-bdenes@scylladb.com>	2021-05-18 10:07:13 +03:00
Botond Dénes	c98b0d0de8	test: cql_test_env: add trace logs to execute_cql() In tests executing tons of these, it is useful to be able to enable a trace logging of each one, to see which is the last successful one. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514140531.118390-1-bdenes@scylladb.com>	2021-05-18 10:06:22 +03:00
Michał Chojnowski	49793a4919	cql3: change the internal type of tuples::in_value from std::vector to chunked_vector While having a large list in the IN clause is unlikely, it's still an arbitrarily large piece of user-provided data. On principle, let's use a chunked container here to prevent large contiguous allocations.	2021-05-17 17:12:07 +02:00
Michał Chojnowski	dcbc053ecd	cql3: change the internal type of lists::value from std::vector to chunked_vector Lists can grow very big. Let's use a chunked vector to prevent large contiguous allocations.	2021-05-17 17:09:55 +02:00
Piotr Sarna	c36f432423	test: add a test case for paging with desc clustering order Issue #8666 revealed an issue with validating types for paged indexed queries - namely, the type checking mechanism is too strict in comparing types and fails on mismatched clustering order - e.g. an `int` column type is different from `int` with DESC clustering order. As a result, users see a very confusing message (because reversed types are printed as their underlying type): > Mismatched types for base and view columns c: int and int This test case fails before the fix for #8666 and thus acts as a regression test.	2021-05-17 17:06:50 +02:00
Piotr Sarna	544ef2caf3	cql3: relax a type check for index paging When recreating the paging state from an indexed query, a bunch of panic checks were introduced to make sure that the code is correct. However, one of the checks is too eager - namely, it throws an error if the base column type is not equal to the view column type. It usually works correctly, unless the base column type is a clustering key with DESC clustering order, in which case the type is actually "reversed". From the point of view of the paging state generation it's not important, because both types deserialize in the same way, so the check should be less strict and allow the base type to be reversed. Tests: unit(release), along with the additional test case introduced in this series; the test also passes on Cassandra Fixes #8666	2021-05-17 17:06:50 +02:00
Michał Chojnowski	4baeea0199	cql3: in multi_item_terminal, return the vector of items by value Returning by reference requires that the elements are internally stored in the multi_item_terminal as a std::vector, but in the next patch we will change the internal type of lists::value from std::vector to utils::chunked_vector. The copy is not a problem because all users of multi_item_terminal were copying the returned vector.	2021-05-17 16:46:28 +02:00
Botond Dénes	dca808dd51	perf/perf_simple_query: add --enable-cache option Allowing for testing performance with/out cache. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210517045402.16153-1-bdenes@scylladb.com>	2021-05-17 14:06:18 +02:00
Raphael S. Carvalho	10ae77966c	compaction_manager: Don't swallow exception in procedure used by reshape and resharding run_custom_job() was swallowing all exceptions, which is definitely wrong because failure in a resharding or reshape would be incorrectly interpreted as success, which means upper layer will continue as if everything is ok. For example, ignoring a failure in resharding could result in a shared sstable being left unresharded, so when that sstable reaches a table, scylla would abort as shared ssts are no longer accepted in the main sstable set. Let's allow the exception to be propagated, so failure will be communicated, and resharding and reshape will be all or nothing, as originally intended. Fixes #8657. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>	2021-05-17 13:57:05 +02:00
Pavel Solodovnikov	f38c5b5359	raft: replication_test: extract tickers initialization into a separate func Extract raft tickers initialization into the `init_raft_tickers()` function. This will be later used in rpc tests. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Pavel Solodovnikov	e0f8ded9bf	raft: replication_test: support passing custom `apply_fn` to `change_configuration()` This will be used later in rpc tests to support passing a dummy apply function, which does not need to update state machine at all. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Pavel Solodovnikov	3d669df2cb	raft: replication_test: introduce `test_server` aggregate struct Use a somewhat neater structure to represent the `create_raft_server()` return value instead of the cumbersome `std::pair<std::unique_ptr<raft::server>, state_machine*>`. Not only is `test_server` much shorter, it is also much more descriptive. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Pavel Solodovnikov	a29db1deda	raft: replication_test: support voter<->learner configuration changes Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Pavel Solodovnikov	def97cd730	raft: remove duplicate `create_command` function from `replication_test` Include the version from `helpers.hh`. This also makes it possible to use additional utilities from this header file, like `id()` and `address_set()`, which come in handy in simple tests and will be used in rpc testing. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Pavel Solodovnikov	0389001496	raft: avoid 'using' statements in raft testing helpers header It is generally considered a bad practice to use the `using` directives at global scope in header files. Also, many parts of `test/raft/helpers.hh` were already using `raft::` prefixes explicitly, so definitely not much to lose there. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-17 13:36:09 +03:00
Avi Kivity	8d6e575f59	perf_fast_forward: report instructions per fragment Use a hardware counter to report instructions per fragment. Results vary from ~4k insns/f when reading sequentially to more than 1M insns/f. Instructions per fragment can be a more stable metric than frags/sec. It would probably be even more stable with a fake file implementation that works in-memory to eliminate seastar polling instruction variation. Closes #8660	2021-05-17 11:33:24 +02:00
Tomasz Grabiec	8dddfab5db	Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak This is the 1st PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so current solution is for small tables only. As an example system.status was added. It was checked that DISTINCT and reverse ORDER BY do work. This PR was created by @jul-stas and @StarostaGit Fixes #8343 This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader) Closes #8634 * github.com:scylladb/scylla: boost/tests: Add virtual_table_test for basic infrastructure boost/tests: Test memtable_filling_virtual_table as mutation_source db/system_keyspace: Add system.status virtual table db/virtual_table: Add a way to specify a range of partitions for virtual table queries. db/virtual_table: Introduce memtable_filling_virtual_table db: Add virtual tables interface db: Introduce chained_delegating_reader	2021-05-17 11:29:37 +02:00
Piotr Sarna	1a625806d8	cql-pytest: introduce service level test suite The test suite leverages the fact that authentication is now enabled in cql-pytest to perform validations on service level statements.	2021-05-17 10:49:45 +02:00
Piotr Sarna	588a0dfd38	cql-pytest: add enabling authentication by default Following alternator unit tests, cql-pytest now also boots Scylla/Cassandra with authentication enabled. Unconditionally enabling authentication does not ruin any existing test case, while it enables testing more scenarios. For instance, Scylla-specific service levels can only be created and attached to roles, which depends on authentication being enabled. A sad side-effect is that Scylla boots slower with PasswordAuthenticator than without it - it takes 15 seconds to set up the default superuser account due to a hardcoded sleep duration [1] :( That should be solved by a separate fix though. [1]: auth/common.hh: inline future<> delay_until_system_ready(seastar::abort_source& as) { return sleep_abortable(15s, as); }	2021-05-17 10:49:45 +02:00
Piotr Sarna	7d10213567	qos: fix validating service level timeouts for negative values Commit message of `6e8305449` claimed to validate against negative timeout values, while it turned out not to be the case. The check is now added.	2021-05-17 10:49:45 +02:00
Botond Dénes	5e39cedbe3	evictable_reader: remove _reader_created flag This flag is not really needed, because we can just attempt a resume on first use which will fail with the default constructed inactive read handle and the reader will be created via the recreate-after-evicted path. This allows the same path to be used for all reader creation cases, simplifying the logic and more importantly making further patching easier without the special case. To make the recreate path (almost) as cheap for the first reader creation as it was with the special path, `_trim_range_tombstones` and `_validate_partition_key` are only set when really needed. Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>	2021-05-16 14:45:46 +03:00
Botond Dénes	3b57106627	evictable_reader: remove destructor We now have close() which is expected to clean up, no need for cleanup in the destructor and consequently a destructor at all. Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>	2021-05-16 12:19:41 +03:00
Benny Halevy	f4cfa530cc	perf: enable instructions_retired_counter only once per executor::run Enabling it for each run_worker call will invoke ioctl PERF_EVENT_IOC_ENABLE in parallel to other workers running and this may skew the results. Test: perf_simple_query Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>	2021-05-16 12:13:27 +03:00
Pavel Emelyanov	0068988e81	repair: Drop most of globals from repair No code left that uses these globals, so rip them altogether. Also drop the former messaging init/uninit methods that are now only setting up those globals. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	315698c683	repair: Use local references in messaging handler checks Some time ago checks for sys-dist-ks and view-update-generator to be locally initialized were moved inside the repair service message handlers. Now everything is ready to use service's reference instead of global pointers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	e748e16352	repair: Use local references in create_writer() The repair_writer::create_writer() method needs sys-dist-ks and view-update-generator. It's only called from repair_meta which already has both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	394acdc139	repair: Construct repair_meta with local references The repair_meta needs sys-dist-ks and view-update-generator. Now when it's created both are available. Once from the repair-service and another time from the repair_info. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	548c694e8c	repair: Keep more stuff on repair_info The repair-meta is once created from the repair_info. It will need the sys-dist-ks and view-update-generator. Put them into the info to have them at meta creation. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	2012ea26cd	repair: Kill bunch of global usages from insert_repair_meta The insert_repair_meta needs to peek at the global proxy to get the db from it, at the migration_manager to call get_schema_for_write(), and at the global messaging to pass it as an argument to the mentioned call and to construct the repair_meta. All three can be obtained from the repair-service, which is now passed there as an argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	abc40cccfd	repair: Pass repair service down to meta insertion The repair_meta will need to get local references to sys_dist_ks and view_update_generator. One of the places where it's created is insert_repair_meta that's called (almost) directly from the repair messaging handler which already has the repair service. One thing to take care of is that the handler reshards on entry, so do the container().invoke_on() and get the local repair from the lambda's argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	c47e1f9776	repair: Keep local migration manager on repair_info The repair_range routine needs to mess with migration manager. Fortunately the routine had been patched to be repair_info's method and the repair_info itself can get the migration manager from repair_service. ATTN -- the obtained reference is local, not sharded<>, but the repair_info doesn't reshard and can carry local reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	60cbb700ef	repair: Move unused db captures Similarly to previous patch -- the db captures can also be relaxed in some places. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	96b546797e	repair: Remove unused ms captures Now when the repair service is passed around repair-info creation a lot of local messaging captures become unused and can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	c63990e4a7	repair: Construct repair_info with service The repair_info object will need to carry more services references on board. This now can be easily achieved by passing the repair service into the repair_info constructor. The info can then get all it needs from the service. This patch is the step #1 here -- replace db and messaging args with repair service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	9bc122c99f	repair: Loop over repair sharded container Previous patches made sync_data_using_repair and do_repair_start methods of repair service. This was done to have local repair reference near the creation of repair_info. The last step left is to make the local repair service available inside the .invoke_on lambda. This patch makes this invoke_on on the repair service itself thus automagically getting the local repair service in lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	a49df10b42	repair: Make sync_data_using_repair a method The do_..._with_repair()-s all call sync_data_using_repair, the latter was previously prepared to receive local repair service reference via "this", so it's finally time to make it happen. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	5c020880f9	repair: Use repair from storage service This is the continuation of the previous patch -- the do_..._with_repair functions become repair_service methods and will get local repair service reference as "this". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	23e8e60ec0	repair: Keep repair on storage service Storage service calls a bunch of do_something_with_repair() methods. All of them need the local repair_service and the only way to get it is by keeping it on storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	4cbcc81167	repair: Make do_repair_start a method The do_repair_start() creates repair_info which will need the repair_service reference, so turn this function into a method to have repair_service as "this". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	4f9623fd87	repair: Pass repair_service through the API until do_repair_start The do_repair_start() will need the repair_service reference in the next patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	a2baabedad	repair: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	d47b7e5387	repair: Split sync_data_using_repair The routine in question creates repair_info inside. The repair_info will need to receive local repair_service reference somehow, this split prepares ground for this change. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	d92d404629	repair: Turn repair_range a repair_info method This routine uses global migration_manager pointer. Next patches will keep the reference on a manager on repair_info and it will be possible to use this->migration_manager reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	5c24b0750e	repair: Move local_is_initialized checks up the stack The sys-dist-ks and view-update-generator checks use global pointers and are performed inside a static repair_meta method. At the same time the caller of this method is the repair_service class, which already has both on board, so move the checks up. Later they will use the service-local references, not global pointers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	6a0d0bb093	repair: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	735a076a63	repair: Do init/uninit of messaging in start/stop Right now repair messaging handlers are set-up on all shards by doing messaging.invoke_on_all() calls. Since now repair service is sharded and its .start() and .stop() are invoke-on-all-ed, it's better to move messaging init/deinit into them. The indentation is deliberately left broken until next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	ea3b0877a4	repair: Add dependencies to repair service The repair service needs database, migration manager, messaging and sys-dist-ks + view-update-generator pair. Put all these guys on it in advance and equip the service with getters for future use. Some dependencies are sharded because they are used in cross-shard context and need more care to be cleaned out later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	ebc6a81700	repair: Add repair_service::start method To be stuffed later. There's no deferred ::stop call because sharded<>::stop calls it by itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	bbb92882de	repair: Make repair_service sharded<> It will pop up on all shards, but the existing initialization will only happen on shard 0. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	715c4d5a47	repair: Remove unused service arg from messaging init Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	ad16d0a2b5	repair: Make repair_service::_tracker be unique-ptr Now the repair service exists in a single instance, but it's becoming a sharded<> service. Tracker expects to be constructed once, so make it a pointer and next patch will instantiate it on shard 0 only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	aa2b4f7821	repair: Turn repair_service into class Now it's a struct with everything being public Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Tomasz Grabiec	28ac8d0f2b	Merge "raft: randomized_nemesis_test framework" from Kamil We introduce `PureStateMachine`, which is the most direct translation of the mathematical definition of a state machine to C++ that I could come up with. Represented by a C++ concept, it consists of: a set of inputs (represented by the `input_t` type), outputs (`output_t` type), states (`state_t`), an initial state (`init`) and a transition function (`delta`) which given a state and an input returns a new state and an output. The rest of the testing infrastructure is going to be generic w.r.t. `PureStateMachine`. This will allow easily implementing tests using both simple and complex state machines by substituting the proper definition for this concept. Next comes `logical_timer`: it is a wrapper around `raft::logical_clock` that allows scheduling events to happen after a certain number of logical clock ticks. For example, `logical_timer::sleep(20_t)` returns a future that resolves after 20 calls to `logical_timer::tick()`. It will be used to introduce timeouts in the tests, among other things. To replicate a state machine, our Raft implementation requires it to be represented with the `raft::state_machine` interface. `impure_state_machine` is an implementation of `raft::state_machine` that wraps a `PureStateMachine`. It keeps a variable of type `state_t` representing the current state. In `apply` it deserializes the given command into `input_t`, uses the transition (`delta`) function to produce the next state and output, replaces its current state with the obtained state and returns the output (more on that below); it does so sequentially for every given command. We can think of `PureStateMachine` as the actual state machine - the business logic, and `impure_state_machine` as the ``boilerplate'' that allows the pure machine to be replicated by Raft and communicate with the external world. The interface also requires maintenance of snapshots. 
We introduce the `snapshots_t` type representing a set of snapshots known by a state machine. `impure_state_machine` keeps a reference to `snapshots_t` because it will share it with an implementation of `persistence`. Returning outputs is a bit tricky because apply is ``write-only'' - it returns `future<>`. We use the following technique: 1. Before sending a command to a Raft leader through `server::add_entry`, one must first directly contact the instance of `impure_state_machine` replicated by the leader, asking it to allocate an ``output channel''. 2. On such a request, `impure_state_machine` creates a channel (represented by a promise-future pair) and a unique ID; it stores the input side of the channel (the promise) with this ID internally and returns the ID and the output side of the channel (the future) to the requester. 3. After obtaining the ID, one serializes the ID together with the input and sends it as a command to Raft. Thus commands are (ID, machine input) pairs. 4. When `impure_state_machine` applies a command, it looks for a promise with the given ID. If it finds one, it sends the output through this channel. 5. The command sender waits for the output on the obtained future. The allocation and deallocation of channels is done using the `impure_state_machine::with_output_channel` function. The `call` function is an implementation of the above technique. Note that only the leader will attempt to send the output - other replicas won't find the ID in their internal data structure. The set of IDs and channels is not a part of the replicated state. A failure may cause the output to never arrive (or even the command to never be applied) so `call` waits for a limited time. It may also mistakenly `call` a server which is not currently the leader, but it is prepared to handle this error. We implement the `raft::rpc` interface, allowing Raft servers to communicate with other Raft servers. The implementation is mostly boilerplate. 
It assumes that there exists a method of message passing, given by a `send_message_t` function passed in the constructor. It also handles the receival of messages in the `receive` function. It defines the message type (`message_t`) that will be used by the message-passing method. The actual message passing is implemented with `network` and `delivery_queue`. The only slightly complex thing in `rpc` is the implementation of `send_snapshot` which is the only function in the `raft::rpc` interface that actually expects a response. To implement this, before sending the snapshot message we allocate a promise-future pair and assign to it a unique ID; we store the promise and the ID in a data structure. We then send the snapshot together with the ID and wait on the future. The message receival function on the other side, when it receives the snapshot message, applies the snapshot and sends back a snapshot reply message that contains the same ID. When we receive a snapshot reply message we look up the ID in the data structure and if we find a promise, we push the reply through that promise. `rpc` also keeps a reference to `snapshots_t` - it will refer to the same set of snapshots as the `impure_state_machine` on the same server. It accesses the set when it receives or sends a snapshot message. `persistence` represents the data that does not get lost between server crashes and restarts. We store a log of commands in `_stored_entries`. It is invariably ``contiguous'', meaning that the index of each entry except the first is equal to the index of the previous entry plus one at all times (i.e. after each yield). We assume that the caller provides log entries in strictly increasing index order and without gaps. Additionally to storing log entries, `persistence` can be asked to store or load a snapshot. To implement this it takes a reference to a set of snapshots (`snapshots_t&`) which it will share with `impure_state_machine` and an implementation of `rpc`. 
We ensure that the stored log either ``touches'' the stored snapshot on the right side or intersects it. In order to simulate a production environment as closely as possible, we implement a failure detector which uses heartbeats for deciding whether to convict a server as failed. We convict a server if we don't receive a heartbeat for a long enough time. Similarly to `rpc`, `failure_detector` assumes a message passing method given by a `send_heartbeat_t` function through the constructor. `failure_detector` uses the knowledge about existing servers to decide who to send heartbeats to. Updating this knowledge happens through `add_server` and `remove_server` functions. `network` is a simple priority queue of "events", where an event is a message associated with delivery time. Each message contains a source, a destination, and payload. The queue uses a logical clock to decide when to deliver messages; it delivers all messages whose associated times are smaller than the current time. The exact delivery method is unknown to `network` but passed as a `deliver_t` function in the constructor. The type of payload is generic. The fact that `network` has delivered a message does not mean the message was processed by the receiver. In fact, `network` assumes that delivery is instantaneous, while processing a message may be a long, complex computation, or even require IO. Thus, after a message is delivered, something else must ensure that it is processed by the destination server. That something in our framework is `delivery_queue`. It will be the bridge between `network` and `rpc`. While `network` is shared by all servers - it represents the ``environment'' in which the servers live - each server has its own private `delivery_queue`. When `network` delivers an RPC message it will end up inside `delivery_queue`. 
A separate fiber, `delivery_queue::receive_fiber()`, will process those messages by calling `rpc::receive` (which is a potentially long operation, thus returns a `future<>`) on the `rpc` of the destination server. `raft_server` is a package that contains `raft::server` and other facilities needed for the server to communicate with its environment: the delivery queue, the set of snapshots (shared by `impure_state_machine`, `rpc` and `persistence`) and references to the `impure_state_machine` and `rpc` instances of this server. `environment` represents a set of `raft_server`s connected by a `network`. The `network` inside is initialized with a message delivery function which notifies the destination server's failure detector on each message and if the message contains an RPC payload, pushes it into the destination's `delivery_queue`. Needs to be periodically `tick()`ed which ticks the network and underlying servers. `ticker` calls the given function as fast as the Seastar reactor allows and yields between each call. It may be provided a limit for the number of calls; it crashes the test if the limit is reached before the ticker is `abort()`ed. Finally, we add a simple test that serves as an example of using the implemented framework. We introduce `ExRegister`, an implementation of `PureStateMachine` that stores an `int32_t` and handles ``exchange'' and ``read'' inputs; an exchange replaces the state with the given value and returns the previous state, a read does not modify the state and returns the current state. In order to pass the inputs to Raft we must serialize them into commands so we implement instances of `ser::serializer` for `ExReg`'s input types. 
* kbr/randomized-nemesis-test-v5: raft: randomized_nemesis_test: basic test raft: randomized_nemesis_test: ticker raft: randomized_nemesis_test: environment raft: randomized_nemesis_test: server raft: randomized_nemesis_test: delivery queue raft: randomized_nemesis_test: network raft: randomized_nemesis_test: heartbeat-based failure detector raft: randomized_nemesis_test: memory backed persistence raft: randomized_nemesis_test: rpc raft: randomized_nemesis_test: impure_state_machine raft: randomized_nemesis_test: introduce logical_timer raft: randomized_nemesis_test: `PureStateMachine` concept	2021-05-14 17:33:40 +02:00
Tomasz Grabiec	0fdd2f8217	Merge "raft: fsm cleanups" from Gleb * scylla-dev/raft-cleanup-v1: raft: drop _leader_progress tracking from the tracker raft: move current_leader into the follower state raft: add some precondition checks	2021-05-14 17:24:59 +02:00
Asias He	e4872a78b5	storage_service: Delay update pending ranges for replacing node In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node was changed to postpone responding to the gossip echo message, to avoid other nodes sending read requests to the replacing node. It works as follows: 1) the replacing node does not respond to the echo message, so that other nodes do not mark the replacing node as alive 2) the replacing node advertises hibernate state so other nodes know the replacing node is replacing 3) the replacing node responds to the echo message so other nodes can mark the replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem for older releases without the patch "repair: Switch to use NODE_OPS_CMD for replace operation", a minimum fix is implemented in this patch. Once existing nodes learn the replacing node is in HIBERNATE state, they add the replacing node as replacing, but add it to the pending list only after the replacing node is marked as alive. 
With this patch, when the existing nodes start to write to the replacing node, the replacing node is already alive. Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test Fixes: #8013 Closes #8614	2021-05-14 17:24:28 +02:00
Tomasz Grabiec	102dcfc1fd	Merge "scylla-gdb.py: introduce scylla read-stats" from Botond Too many or too resource-hungry reads often lie at the heart of issues that require an investigation with gdb. Therefore it is very useful to have a way to summarize all reads found on a shard with their states and resource consumptions. This is exactly what this new command does. For this it uses the reader concurrency semaphores and their permits respectively, which are now arranged in an intrusive list and therefore are enumerable. Example output: (gdb) scylla read-stats Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1 permits count memory table/description/state 1 1 14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active 16 0 53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active 1 0 1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive 1 0 0 ./view_builder/active 1 0 0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active 20 1 14334414 Total * botond/scylla-gdb.py-scylla-reads/v5: scylla-gdb.py: introduce scylla read-stats scylla-gdb.py: add pretty printer for std::string_view scylla-gdb.py: std_map() add __len__() scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()	2021-05-14 16:07:14 +02:00
Takuya ASADA	838acb44d0	scylla-fstrim.timer: fix wrong description from 'daily' to 'weekly' It is scheduled weekly, not daily. Fixes #8633 Closes #8644	2021-05-14 16:02:12 +02:00
Asias He	b8749f51cb	repair: Consider memory bloat when calculating repair parallelism The repair parallelism is calculated from the amount of memory allocated to repair and the memory usage per repair instance. Currently, it does not consider memory bloat issues (e.g., issue #8640) which cause repair to use more memory and cause std::bad_alloc. Be more conservative when calculating the parallelism to avoid repair using too much memory. Fixes #8641 Closes #8652	2021-05-14 16:02:08 +02:00
Piotr Sarna	c1cb7d87e1	auth: remove the fixed 15s delay during auth setup The auth initialization path contains a fixed 15s delay, which used to work around a couple of issues (#3320, #3850), but is right now quite useless, because a retry mechanism is already in place anyway. This patch speeds up the boot process if authentication is enabled. In particular, for single-node clusters, common for test setups, auth initialization now takes a couple of milliseconds instead of the whole 15 seconds. Fixes #8648 Closes #8649	2021-05-14 16:01:59 +02:00
Kamil Braun	c21311ecca	raft: randomized_nemesis_test: basic test This is a simple test that serves as an example of using the framework implemented in the previous commits. We introduce `ExRegister`, an implementation of `PureStateMachine` that stores an `int32_t` and handles ``exchange'' and ``read'' inputs; an exchange replaces the state with the given value and returns the previous state, a read does not modify the state and returns the current state. In order to pass the inputs to Raft we must serialize them into commands so we implement instances of `ser::serializer` for `ExReg`'s input types.	2021-05-14 15:11:01 +02:00
Kamil Braun	66b9bc6fe1	raft: randomized_nemesis_test: ticker `ticker` calls the given function as fast as the Seastar reactor allows and yields between each call. It may be provided a limit for the number of calls; it crashes the test if the limit is reached before the ticker is `abort()`ed. The commit also introduces a `with_env_and_ticker` helper function which creates an `environment`, a `ticker`, and passes references to them to the given function. It destroys them after the function finishes by calling `abort()`.	2021-05-14 15:11:01 +02:00
Kamil Braun	c7cef58797	raft: randomized_nemesis_test: environment `environment` represents a set of `raft_server`s connected by a `network`. The `network` inside is initialized with a message delivery function which notifies the destination server's failure detector on each message and if the message contains an RPC payload, pushes it into the destination's `delivery_queue`. Needs to be periodically `tick()`ed which ticks the network and underlying servers. New servers can be created in the environment by calling `new_server`.	2021-05-14 15:11:01 +02:00
Kamil Braun	5095a4158e	raft: randomized_nemesis_test: server `raft_server` is a package that contains `raft::server` and other facilities needed for the server to communicate with its environment: the delivery queue, the set of snapshots (shared by `impure_state_machine`, `rpc` and `persistence`) and references to the `impure_state_machine` and `rpc` instances of this server.	2021-05-14 15:11:01 +02:00
Kamil Braun	f139fd4c28	raft: randomized_nemesis_test: delivery queue The fact that `network` has delivered a message does not mean the message was processed by the receiver. In fact, `network` assumes that delivery is instantaneous, while processing a message may be a long, complex computation, or even require IO. Thus, after a message is delivered, something else must ensure that it is processed by the destination server. That something in our framework is `delivery_queue`. It will be the bridge between `network` and `rpc`. While `network` is shared by all servers - it represents the ``environment'' in which the servers live - each server has its own private `delivery_queue`. When `network` delivers an RPC message it will end up inside `delivery_queue`. A separate fiber, `delivery_queue::receive_fiber()`, will process those messages by calling `rpc::receive` (which is a potentially long operation, thus returns a `future<>`) on the `rpc` of the destination server.	2021-05-14 15:11:01 +02:00
Kamil Braun	2956f5f76c	raft: randomized_nemesis_test: network `network` is a simple priority queue of "events", where an event is a message associated with delivery time. Each message contains a source, a destination, and payload. The queue uses a logical clock to decide when to deliver messages; it delivers all messages whose associated times are smaller than the current time. The exact delivery method is unknown to `network` but passed as a `deliver_t` function in the constructor. The type of payload is generic.	2021-05-14 15:11:01 +02:00
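The queue described in the `network` commit can be illustrated with a minimal Python sketch. This is not the test's C++ code: the names `Network`, `Event`, `send` and `tick`, and the `delay` parameter, are assumptions made for illustration. Only the ideas of a priority queue keyed by delivery time, a logical clock, and a delivery function passed to the constructor come from the commit message.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass(order=True)
class Event:
    # Events are ordered by delivery time; seq breaks ties in send order.
    time: int
    seq: int
    src: Any = field(compare=False)
    dst: Any = field(compare=False)
    payload: Any = field(compare=False)


class Network:
    """Hypothetical stand-in for the `network` class described above."""

    def __init__(self, deliver: Callable[[Any, Any, Any], None]):
        self._deliver = deliver          # delivery method is supplied by the caller
        self._now = 0                    # logical clock
        self._seq = 0
        self._events: List[Event] = []   # min-heap ordered by delivery time

    def send(self, src: Any, dst: Any, payload: Any, delay: int) -> None:
        heapq.heappush(self._events, Event(self._now + delay, self._seq, src, dst, payload))
        self._seq += 1

    def tick(self) -> None:
        # Advance the clock, then deliver every message whose delivery time
        # is smaller than the current time.
        self._now += 1
        while self._events and self._events[0].time < self._now:
            ev = heapq.heappop(self._events)
            self._deliver(ev.src, ev.dst, ev.payload)
```

Because the comparison is strict, a message sent with `delay=2` at time 0 is handed to the delivery function on the third call to `tick()`.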
Kamil Braun	3068a0aa70	raft: randomized_nemesis_test: heartbeat-based failure detector In order to simulate a production environment as closely as possible, we implement a failure detector which uses heartbeats for deciding whether to convict a server as failed. We convict a server if we don't receive a heartbeat for a long enough time. Similarly to `rpc`, `failure_detector` assumes a message passing method given by a `send_heartbeat_t` function through the constructor. `failure_detector` uses the knowledge about existing servers to decide who to send heartbeats to. Updating this knowledge happens through `add_server` and `remove_server` functions.	2021-05-14 15:11:01 +02:00
Kamil Braun	51df600478	raft: randomized_nemesis_test: memory backed persistence `persistence` represents the data that does not get lost between server crashes and restarts. We store a log of commands in `_stored_entries`. It is invariably ``contiguous'', meaning that the index of each entry except the first is equal to the index of the previous entry plus one at all times (i.e. after each yield). We assume that the caller provides log entries in strictly increasing index order and without gaps. Additionally to storing log entries, `persistence` can be asked to store or load a snapshot. To implement this it takes a reference to a set of snapshots (`snapshots_t&`) which it will share with `impure_state_machine` and an implementation of `rpc` coming in a later commit. We ensure that the stored log either ``touches'' the stored snapshot on the right side or intersects it.	2021-05-14 15:11:01 +02:00
Kamil Braun	7a1f6e6d7b	raft: randomized_nemesis_test: rpc We implement the `raft::rpc` interface, allowing Raft servers to communicate with other Raft servers. The implementation is mostly boilerplate. It assumes that there exists a method of message passing, given by a `send_message_t` function passed in the constructor. It also handles the receival of messages in the `receive` function. It defines the message type (`message_t`) that will be used by the message-passing method. The actual message passing is implemented with `network` and `delivery_queue` which are introduced in later commits. The only slightly complex thing in `rpc` is the implementation of `send_snapshot` which is the only function in the `raft::rpc` interface that actually expects a response. To implement this, before sending the snapshot message we allocate a promise-future pair and assign to it a unique ID; we store the promise and the ID in a data structure. We then send the snapshot together with the ID and wait on the future. The message receival function on the other side, when it receives the snapshot message, applies the snapshot and sends back a snapshot reply message that contains the same ID. When we receive a snapshot reply message we look up the ID in the data structure and if we find a promise, we push the reply through that promise. `rpc` also keeps a reference to `snapshots_t` - it will refer to the same set of snapshots as the `impure_state_machine` on the same server. It accesses the set when it receives or sends a snapshot message.	2021-05-14 15:11:01 +02:00
Kamil Braun	905126acc3	raft: randomized_nemesis_test: impure_state_machine To replicate a state machine, our Raft implementation requires it to be represented with the `raft::state_machine` interface. `impure_state_machine` is an implementation of `raft::state_machine` that wraps a `PureStateMachine`. It keeps a variable of type `state_t` representing the current state. In `apply` it deserializes the given command into `input_t`, uses the transition (`delta`) function to produce the next state and output, replaces its current state with the obtained state and returns the output (more on that below); it does so sequentially for every given command. We can think of `PureStateMachine` as the actual state machine - the business logic, and `impure_state_machine` as the ``boilerplate'' that allows the pure machine to be replicated by Raft and communicate with the external world. The interface also requires maintenance of snapshots. We introduce the `snapshots_t` type representing a set of snapshots known by a state machine. `impure_state_machine` keeps a reference to `snapshots_t` because it will share it with an implementation of `raft::persistence` coming with a later commit. Returning outputs is a bit tricky because apply is ``write-only'' - it returns `future<>`. We use the following technique: 1. Before sending a command to a Raft leader through `server::add_entry`, one must first directly contact the instance of `impure_state_machine` replicated by the leader, asking it to allocate an ``output channel''. 2. On such a request, `impure_state_machine` creates a channel (represented by a promise-future pair) and a unique ID; it stores the input side of the channel (the promise) with this ID internally and returns the ID and the output side of the channel (the future) to the requester. 3. After obtaining the ID, one serializes the ID together with the input and sends it as a command to Raft. Thus commands are (ID, machine input) pairs. 4. 
When `impure_state_machine` applies a command, it looks for a promise with the given ID. If it finds one, it sends the output through this channel. 5. The command sender waits for the output on the obtained future. The allocation and deallocation of channels is done using the `impure_state_machine::with_output_channel` function. The `call` function is an implementation of the above technique. Note that only the leader will attempt to send the output - other replicas won't find the ID in their internal data structure. The set of IDs and channels is not a part of the replicated state. A failure may cause the output to never arrive (or even the command to never be applied) so `call` waits for a limited time. It may also mistakenly `call` a server which is not currently the leader, but it is prepared to handle this error.	2021-05-14 15:11:01 +02:00
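The output-channel technique from the `impure_state_machine` commit can be sketched in Python as below. This is a simplified model under stated assumptions, not the C++ implementation: `concurrent.futures.Future` stands in for the promise-future pair, commands are plain `(ID, input)` tuples rather than serialized Raft commands, and the names `allocate_channel` and `release_channel` are hypothetical (the commit itself names `with_output_channel` and `call`).

```python
import itertools
from concurrent.futures import Future


class ImpureStateMachine:
    """Hypothetical model: wraps a pure transition function `delta`."""

    def __init__(self, init, delta):
        self._state = init
        self._delta = delta              # delta(state, input) -> (new_state, output)
        self._channels = {}              # ID -> future; NOT part of the replicated state
        self._ids = itertools.count()

    def allocate_channel(self):
        # Steps 1-2: create a channel and a unique ID, return both to the requester.
        cid = next(self._ids)
        fut = Future()
        self._channels[cid] = fut
        return cid, fut

    def release_channel(self, cid):
        self._channels.pop(cid, None)

    def apply(self, commands):
        # Step 4: apply each (ID, input) pair; push the output through the
        # channel only if this replica knows the ID.
        for cid, inp in commands:
            self._state, out = self._delta(self._state, inp)
            fut = self._channels.get(cid)
            if fut is not None and not fut.done():
                fut.set_result(out)


def exchange(state, new_value):
    # Replace the state with the given value and output the previous state.
    return new_value, state


leader = ImpureStateMachine(0, exchange)
follower = ImpureStateMachine(0, exchange)
cid, fut = leader.allocate_channel()
command = [(cid, 42)]        # step 3: the command carries the ID with the input
leader.apply(command)        # the leader finds the ID and completes the future
follower.apply(command)      # the follower applies the input but has no channel
previous = fut.result(timeout=0)   # step 5: the sender reads the output
leader.release_channel(cid)
```

Both replicas end up in the same state, but only the replica that allocated the channel delivers the output. This matches the note that the set of IDs and channels is not replicated.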
Kamil Braun	3e02befccd	raft: randomized_nemesis_test: introduce logical_timer This is a wrapper around `raft::logical_clock` that allows scheduling events to happen after a certain number of logical clock ticks. For example, `logical_timer::sleep(20_t)` returns a future that resolves after 20 calls to `logical_timer::tick()`.	2021-05-13 11:34:00 +02:00
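The behaviour of `logical_timer::sleep` described above can be modelled with a short Python sketch. The class below is illustrative only: `LogicalTimer` and its internals are assumed names, and `concurrent.futures.Future` replaces the Seastar future returned by the real wrapper.

```python
from concurrent.futures import Future


class LogicalTimer:
    """Hypothetical model of a timer driven by explicit tick() calls."""

    def __init__(self):
        self._now = 0
        self._pending = []   # list of (deadline, future) pairs

    def sleep(self, ticks: int) -> Future:
        # Return a future that resolves after `ticks` further calls to tick().
        fut = Future()
        if ticks <= 0:
            fut.set_result(None)
        else:
            self._pending.append((self._now + ticks, fut))
        return fut

    def tick(self) -> None:
        # Advance the logical clock and resolve every sleeper whose deadline passed.
        self._now += 1
        remaining = []
        for deadline, fut in self._pending:
            if deadline <= self._now:
                fut.set_result(None)
            else:
                remaining.append((deadline, fut))
        self._pending = remaining
```

With this model, a future obtained from `sleep(20)` is still pending after 19 ticks and is done after the 20th.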
Kamil Braun	15e3bd2620	raft: randomized_nemesis_test: `PureStateMachine` concept The commit introduces `PureStateMachine`, which is the most direct translation of the mathematical definition of a state machine to C++ that I could come up with. Represented by a C++ concept, it consists of: a set of inputs (represented by the `input_t` type), outputs (`output_t` type), states (`state_t`), an initial state (`init`) and a transition function (`delta`) which given a state and an input returns a new state and an output. The rest of the testing infrastructure is going to be generic w.r.t. `PureStateMachine`. This will allow easily implementing tests using both simple and complex state machines by substituting the proper definition for this concept. One possibility of modifying this definition would be to have `delta` return `future<pair<state_t, output_t>>` instead of `pair<state_t, output_t>`. This would lose some ``purity'' but allow long computations without reactor stalls in the tests. Such modification, if we decide to do it, is trivial.	2021-05-13 11:34:00 +02:00
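As a rough illustration of the `PureStateMachine` definition, here is a Python sketch of a machine with an initial state and a `delta` function, modelled on the `ExRegister` described in commit `c21311ecca` of this log. The real code is a C++ concept. The input classes `Exchange` and `Read`, the helper `run`, and the initial state of 0 are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Exchange:
    value: int


@dataclass(frozen=True)
class Read:
    pass


Input = Union[Exchange, Read]


class ExRegister:
    # Initial state; 0 is an assumption made for this sketch.
    init: int = 0

    @staticmethod
    def delta(state: int, inp: Input) -> Tuple[int, int]:
        # Transition function: (state, input) -> (new state, output).
        if isinstance(inp, Exchange):
            return inp.value, state   # replace the state, output the previous one
        return state, state           # a read leaves the state unchanged


def run(machine, inputs: List[Input]) -> Tuple[int, List[int]]:
    # Feed inputs through delta one by one, collecting the outputs.
    state = machine.init
    outputs = []
    for inp in inputs:
        state, out = machine.delta(state, inp)
        outputs.append(out)
    return state, outputs
```

Because `delta` has no side effects, the same input sequence always yields the same final state and outputs, which is what lets the rest of the test infrastructure stay generic over the machine.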
Alejo Sanchez	68f69671b5	raft: style: test optionals directly Avoid using has_value() and test optional directly Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>	2021-05-12 20:39:52 +02:00
Piotr Wojtczak	e6254acfd3	boost/tests: Add virtual_table_test for basic infrastructure	2021-05-12 17:05:35 +02:00
Piotr Wojtczak	8825ae128d	boost/tests: Test memtable_filling_virtual_table as mutation_source Uses the infrastructure for testing mutation_sources, but only a subset of it which does not do fast forwarding (since virtual_table does not support it).	2021-05-12 17:05:35 +02:00
Juliusz Stasiewicz	874f4de60c	db/system_keyspace: Add system.status virtual table This change uses the previously introduced memtable_filling_virtual_table to expose nodetool status as a virtual table.	2021-05-12 17:05:35 +02:00
Tomasz Grabiec	57ed93bf44	db/virtual_table: Add a way to specify a range of partitions for virtual table queries. This change introduces a query_restrictions object into the virtual table infrastructure, for now only holding a restriction on partition ranges. That partition range is then implemented into memtable_filling_virtual_table.	2021-05-12 17:05:35 +02:00
Piotr Wojtczak	38720847f2	db/virtual_table: Introduce memtable_filling_virtual_table This change adds a more specific implementation of the virtual table called memtable_filling_virtual_table. It produces results by filling a memtable on each read.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	61a0314952	db: Add virtual tables interface This change introduces the basic interface we expect each virtual table to implement. More specific implementations will then expand upon it if needed.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	8333d66d4e	db: Introduce chained_delegating_reader This change adds a new type of mutation reader whose purpose is to allow inserting operations before an invocation of the proper reader. It takes a future to wait on and only after it resolves will it forward the execution to the underlying flat_mutation_reader implementation.	2021-05-12 17:05:34 +02:00
Eliran Sinvani	5eb84f110e	gossiper: remove excess error logging from gossiper We remove a log of severity error that is later thrown as an exception, being caught a few lines below and then printed out as a warning. Fixes #8616 Closes #8617	2021-05-12 15:02:35 +02:00
Tomasz Grabiec	f8d7374400	Merge 'Add additional sstable stats' from Michael Livshin Refs #251. Closes #8630 * github.com:scylladb/scylla: statistics: add global bloom filter memory gauge statistics: add some sstable management metrics sstables: make the `_open` field more useful sstables: stats: noexcept all accessors	2021-05-12 14:35:13 +02:00
Avi Kivity	c3f17ea0a3	Merge "Fix query performance for range tombstone covering many rows" from Tomasz " Row cache reader can produce overlapping range tombstones in the mutation fragment stream even if there is only a single range tombstone in sstables, due to #2581. For every range between two rows, the row cache reader queries for tombstones relevant for that range. The result of the query is trimmed to the current position of the reader (=position of the previous row) to satisfy key monotonicity. The end position of range tombstones is left unchanged. So cache reader will split a single range tombstone around rows. Those range tombstones are transient, they will be only materialized in the reader's stream, they are not persisted anywhere. That is not a problem in itself, but it interacts badly with mutation compactor due to #8625. The range_tombstone_accumulator which is used to compact the mutation fragment stream needs to accumulate all tombstones which are relevant for the current clustering position in the stream. Adding a new range tombstone is O(N) in the number of currently active tombstones. This means that producing N rows will be O(N^2). In a unit test introduced in this series, I saw reading 137'248 rows which overlap with a range tombstone take 245 seconds. Almost all of CPU time is in drop_unneeded_tombstones(). The solution is to make the cache reader trim range tombstone end to the currently emitted sub-range, so that it emits non-overlapping range tombstones. Fixes #8626. Tests: - row_cache_test (release) - perf_row_cache_reads (release) " * tag 'fix-perf-many-rows-covered-by-range-tombstone-v2' of github.com:tgrabiec/scylla: tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone row_cache: Avoid generating overlapping range tombstones range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed	2021-05-12 14:07:48 +03:00
Tomasz Grabiec	a9dd7a295d	tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone Reproduces #8626. Output: test_scan_with_range_delete_over_rows Populating with rows Rows: 702710 Scanning... read: 540.007324 [ms], preemption: {count: 2356, 99%: 1.131752 [ms], max: 1.148589 [ms]}, cache: 251/252 [MB] read: 651.942688 [ms], preemption: {count: 1176, 99%: 1.131752 [ms], max: 1.009652 [ms]}, cache: 251/252 [MB]	2021-05-12 11:58:36 +02:00
Michael Livshin	357ab759ee	statistics: add global bloom filter memory gauge Refs #251. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-12 03:48:07 +03:00
Michael Livshin	5abeadde4d	statistics: add some sstable management metrics Add the following metrics, as part of #251: - open for writing (a.k.a. "created", unless I'm missing something?) - open for reading - deleted - currently open for reading/writing (gauges) Refs #251. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-12 03:48:07 +03:00
Michael Livshin	9a2b54fcf6	sstables: make the `_open` field more useful The field is hitherto only used in scylla-gdb.py. Let it store the open mode (if any). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-12 03:48:07 +03:00
Michael Livshin	1f83251b2b	sstables: stats: noexcept all accessors Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-12 03:48:07 +03:00
Benny Halevy	c0dafa75d9	utils: phased_barrier: advance_and_await: make noexcept As a function returning a future, simplify its interface by handling any exceptions and returning an exceptional future instead of propagating the exception. In this specific case, throwing from advance_and_await() will propagate through table::await_pending_* calls short-circuiting a .finally clause in table::stop(). Also, mark as noexcept methods of class table calling advance_and_await and table::await_pending_ops that depends on them. Fixes #8636 A followup patch will convert advance_and_await to a coroutine. This is done separately to facilitate backporting of this patch. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>	2021-05-12 01:36:11 +02:00
Benny Halevy	b4cbd46adb	row_cache: create_underlying_reader: call read_context on_underlying_created only on success ctx.on_underlying_created() mustn't be called if src.make_reader failed and a reader isn't created. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210511054525.35090-1-bhalevy@scylladb.com>	2021-05-12 01:34:48 +02:00
Tomasz Grabiec	6863a5e43b	row_cache: Avoid generating overlapping range tombstones Row cache reader can produce overlapping range tombstones in the mutation fragment stream even if there is only a single range tombstone in sstables, due to #2581. For every range between two rows, the row cache reader queries for tombstones relevant for that range. The result of the query is trimmed to the current position of the reader (=position of the previous row) to satisfy key monotonicity. The end position of range tombstones is left unchanged. So cache reader will split a single range tombstone around rows. Those range tombstones are transient, they will be only materialized in the reader's stream, they are not persisted anywhere. That is not a problem in itself, but it interacts badly with mutation compactor due to #8625. The range_tombstone_accumulator which is used to compact the mutation fragment stream needs to accumulate all tombstones which are relevant for the current clustering position in the stream. Adding a new range tombstone is O(N) in the number of currently active tombstones. This means that producing N rows will be O(N^2). In a unit test, I saw reading 137'248 rows which overlap with a range tombstone take 245 seconds. Almost all of CPU time is in drop_unneeded_tombstones(). The solution is to make the cache reader trim range tombstone end to the currently emitted sub-range, so that it emits non-overlapping range tombstones. Fixes #8626.	2021-05-12 00:10:24 +02:00
Tomasz Grabiec	80cd829139	range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed Recalculation of the current tombstone is O(N) in the number of active range tombstones. This can be a significant overhead, so better avoid it. Solves the problem of quadratic complexity when producing lots of overlapping range tombstones with a common end bound. Refs #8625 Refs #8626	2021-05-12 00:10:24 +02:00
Nadav Har'El	cee4c075d2	Merge 'Fix index name conflicts with regular tables' from Piotr Sarna When an index is created without an explicit name, a default name is chosen. However, there was no check if a table with conflicting name already exists. The check is now in place and if any conflicts are found, a new index name is chosen instead. When an index is created with an explicit name and a conflicting regular table is found, index creation should simply fail. This series comes with a test. Fixes #8620 Tests: unit(release) Closes #8632 * github.com:scylladb/scylla: cql-pytest: add regression tests for index creation cql3: fail to create an index if there is a name conflict database: check for conflicting table names for indexes	2021-05-11 18:40:15 +03:00
Nadav Har'El	c7a814fd5c	utils/enum_option.hh: make it easier to compare the value The operator== of enum_option<> (which we use to hold multi-valued Scylla options) makes it easy to compare to another enum_option wrapper, but ugly to compare the actual value held. So this patch adds a nicer way to compare the value held. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210511120222.1167686-1-nyh@scylladb.com>	2021-05-11 18:39:10 +03:00
Benny Halevy	9ba960a388	utils: phased_barrier::operation do not leak gate entry when reassigned utils::phased_barrier holds a `lw_shared_ptr<gate>` that is typically `enter()`ed in `phased_barrier::start()`, and left when the operation is destroyed in `~operation`. Currently, the operation move-assign implementation is the default one that just moves the lw_shared gate ptr from the other operation into this one, without calling `_gate->leave()` first. This change first destroys *this when move-assigned (if not self) to call _gate->leave() if engaged, before reassigning the gate with the other operation::_gate. A unit test that reproduces the issue before this change and passes with the fix was added to serialized_action_test. Fixes #8613 Test: unit(dev), serialized_action_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>	2021-05-11 18:39:10 +03:00
Avi Kivity	1d8234f52d	Merge "reader_concurrency_semaphore: improve diagnostics printout" from Botond " The current printout has multiple problems: * It is segregated by state, each having its own sorting criteria; * Number of permits and count resources is collapsed into a single column, not clear which is the one printed. * Number of available/initial units of the semaphore are not printed; This series solves all these problems: * It merges all states into a single table, sorted by memory consumption, in descending order. * It separates number of permits and count resources into separate columns. * Prints a summary of the semaphore units. * Provides a cap on the maximum amount of printable lines, to not blow up the logs. The goal of all this is to make it easy to find the culprit of a semaphore problem: easily spot the big memory consumers, then unpack the name column to determine which table and code path is responsible. This brings the printout close to the recently added `scylla reads` scylla-gdb.py command, providing a uniform report format across the two tools. 
Example report: INFO 2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics: permits count memory table/description/state 7 2 77M ks.tbl1/op1/active 6 3 59M ks.tbl1/op0/active 4 0 36M ks.tbl1/op2/active 3 1 36M ks.tbl0/op2/active 11 2 43M permits omitted for brevity 31 8 251M total " * 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla: test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources	2021-05-11 18:39:10 +03:00
Avi Kivity	eed89a9b56	Update tools/jmx submodule (toppartitions multi-sampler query) * tools/jmx 440313e...a7c4c39 (1): > storage_service: Fix getToppartitions to always return both reads and writes	2021-05-11 18:39:10 +03:00
Nadav Har'El	af485f5226	secondary index: fix index name in IndexInfo system table In commit `3e39985c7a` we added the Cassandra-compatible system table system."IndexInfo" (note the capitalized table name) which lists built indexes. Because we already had a table of built materialized views, and indexes are implemented as materialized views, the index list was implemented as a virtual table based on the view list. However, the name of each materialized view listed in the list of views looks like something_index, with the suffix "_index", while the name of the table we need to print is "something". We forgot to do this transformation in the virtual table - and this is what this patch does. This bug can confuse applications which use this system table to wait for an index to be built. Several tests translated from Cassandra's unit tests, in cassandra_tests/validation/entities/secondary_index_test.py fail in wait_for_index() because of this incompatibility, and pass after this patch. This patch also changes the unit test that enshrined the previous, wrong, behavior, to test for the correct behavior. This problem is typical of C++ unit tests which cannot be run against Cassandra. Fixes #8600 Unfortunately, although this patch fixes "typical" applications (including all tests which I tried) - applications which read from IndexInfo in a "typical" method to look for a specific index being ready, the implementation is technically NOT correct: The problem is that index names are not sorted in the right order, because they are sorted with the "_index" suffix. To give an example, the index name "a" should be listed before "a1", but the view name "a1_index" comes before "a_index" (because in ASCII, 1 comes before underscore). I can't think of any way to fix this bug without completely reimplementing IndexInfo in a different way - probably based on a temporary memtable (which is fine as this is not a performance-critical operation). 
We'll need to do this rewrite eventually, and I'll open a new issue. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>	2021-05-11 18:39:10 +03:00
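The ordering problem described in the IndexInfo commit can be reproduced with plain string sorting. The snippet below is only an illustration in Python of the "a" versus "a1" example from the message; it does not use any Scylla code, and the variable names are made up for the example.

```python
SUFFIX = "_index"

index_names = ["a", "a1"]
view_names = [name + SUFFIX for name in index_names]

# The view list is ordered by view name, i.e. with the suffix still attached.
sorted_views = sorted(view_names)

# Stripping the suffix afterwards keeps the view order, not the index-name order.
listed = [view[: -len(SUFFIX)] for view in sorted_views]

# The order a client would expect: sorted by the index name itself.
expected = sorted(index_names)
```

Here `sorted_views` is `['a1_index', 'a_index']` because `'1'` (0x31) sorts before `'_'` (0x5F), so `listed` comes out as `['a1', 'a']` while `expected` is `['a', 'a1']`.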
Avi Kivity	61c7f874cc	Merge 'Add per-service-level timeouts' from Piotr Sarna Ref: #7617 This series adds timeout parameters to service levels. Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this: ```cql CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s; ATTACH SERVICE LEVEL sl2 TO cassandra; ALTER SERVICE LEVEL sl2 WITH write_timeout = null; ``` Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`). Closes #7913 * github.com:scylladb/scylla: docs: add a paragraph describing service level timeouts test: add per-service-level timeout tests test: add refreshing client state transport: add updating per-service-level params client_state: allow updating per service level params qos: allow returning combined service level options qos: add a way of merging service level options cql3: add preserving default values for per-sl timeouts qos: make getting service level public qos: make finding service level public treewide: remove service level controller from query state treewide: propagate service level to client state sstables: disambiguate boost::find cql3: add a timeout column to LIST SERVICE LEVEL statement db: add extracting service level info via CQL types: add a missing translation for cql_duration cql3: allow unsetting service level timeouts cql3: add validating service level timeout values db: add setting service level params via system_distributed cql3: add fetching service level attrs in ALTER and CREATE cql3: add timeout to service level params qos: add timeout to service level info db,sys_dist_ks: add timeout to the service level table migration_manager: allow table updates with timestamp cql3: allow a null keyword for CQL 
properties	2021-05-11 18:39:10 +03:00
Nadav Har'El	3c2e852dd9	Merge 'scylla-gdb unit test' from Michael Livshin This patchset adds a basic scylla-gdb.py test to the test suite. First two patches add the test itself (disabled), subsequent ones are fixes for scylla-gdb.py to make the test pass, and the last one enables the test. Closes #8618 * github.com:scylladb/scylla: test: enable scylla-gdb/run scylla-gdb.py: "this" -> "self" scylla-gdb.py: wrap std::unordered_{set,map} and flat_hash_map scylla-gdb.py: robustify execution_strategy traversal scylla-gdb.py: recognize new sstable reader types scylla-gdb.py: make list_unordered_map more resilient scylla-gdb.py: robustify netw & gms scylla-gdb.py: redo find_db() in terms of sharded() scylla-gdb.py: debug::logalloc_alignment may not exist scylla-gdb.py: handle changed container type of keyspaces scylla-gdb.py: walk intrusive containers using provided link fields test: add a basic test for scylla-gdb.py test.py: refine test mode control	2021-05-11 18:39:10 +03:00
Avi Kivity	b1f9df279a	Merge "Untie cdc, storage service and migration notifier knot" from Pavel E " Storage service needs migration notifier reference to pass it to cdc service via get_local_storage_service(). This set removes - get_local_storage_service from cdc - migration notifier from storage service - db_context::builder from cdc (released nuclear binding energy) tests: unit(dev) " * 'br-cdc-no-storage-service' of https://github.com/xemul/scylla: storage_service: Remove migration notifier dependency cdc: Remove db_context::builder cdc: Provide migration notifier right at once cdc: Remove db_context::builder::with_migration_notifier	2021-05-11 18:39:10 +03:00
Michael Livshin	ff7d781988	test: enable scylla-gdb/run It should pass now. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Avi Kivity	6548436db3	Merge "Improve coverage support" from Botond " This patch-set builds on the existing very basic coverage generation support and greatly improves it, adding an almost fully automated way of generating reports, as well as a more manual way. At the heart of this is a new build mode, coverage, that is dedicated to coverage report generation, containing all the required build flags, without interfering with that of the "host" build mode, like currently (with the --coverage flag). Additionally a new script, scripts/coverage.py, is added which automates the magic behind the scenes needed to get from raw profile files to a nice html report, as long as the raw files are at the expected place. There are still some rough edges: * There is no direct ninja support for coverage generation, one has to build the tests, then run them via test.py. * Building and running just a few tests is a miserable experience (#8608). * Only boost unit tests are supported at the moment when using test.py. * A --verbose flag for coverage.py would be nice. * coverage.py could have a way to run a test itself, automatically adding the required ENV variable(s). I plan on addressing all these in the future, in the meanwhile, with this series, the coverage report generation is made available for non-masochists as well. " * 'coverage-improvements/v1' of https://github.com/denesb/scylla: HACKING.md: update the coverage guide test.py: add basic coverage generation support scripts: introduce coverage.py configure.py: replace --coverage with a coverage build mode configure.py: make the --help output more readable configure.py: add build mode descriptions configure.py: fix fallback mode selection for checkheaders target configure.py: centralize the declaration of build modes	2021-05-11 18:39:10 +03:00
Michael Livshin	ee80c81593	scylla-gdb.py: "this" -> "self" Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Asias He	4f0a1cbca3	repair: Wire off-strategy compaction for decommission When decommission is done, all nodes that receive data from the decommission node will run node_ops_cmd::decommission_done handler. Trigger off-strategy compaction inside the handler to wire off-strategy for decommission. Refs #5226 Closes #8607	2021-05-11 18:39:10 +03:00
Michael Livshin	b711fc5762	scylla-gdb.py: wrap std::unordered_{set,map} and flat_hash_map Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Nadav Har'El	df9faba652	Merge 'storage_proxy: place unique_response_handler:s in small_vector instead of std::vector' from Avi Kivity This cuts an allocation in the write path. Instruction count reduction isn't large, but performance does improve (results are consistent): before: 196369.48 tps ( 55.2 allocs/op, 13.2 tasks/op, 51658 insns/op) after: 199290.32 tps ( 54.2 allocs/op, 13.2 tasks/op, 51600 insns/op) (this is perf_simple_query --write --smp 1 --operations-per-shard 1000000) Since small_vector requires noexcept move constructor and assignment, the corresponding unique_response_handler members are adjusted/added respectively. Closes #8606 * github.com:scylladb/scylla: storage_proxy: place unique_response_handler:s in small_vector instead of std::vector storage_proxy: make unique_response_handler friendly to small_vector storage_proxy: give a name to a vector of unique_response_handlers	2021-05-11 18:39:10 +03:00
Michael Livshin	b0fbd0062e	scylla-gdb.py: robustify execution_strategy traversal Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Yaron Kaikov	588a065304	scylla_io_setup: configure "aio-max-nr" before iotune On several instance types in AWS and Azure, we get the following failure during the scylla_io_setup process: ``` ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application ``` We have scylla_prepare:configure_io_slots() running before the scylla-server.service start, but scylla_io_setup takes place before that. 1) Let's move configure_io_slots() to scylla_util.py since both scylla_io_setup and scylla_prepare import functions from it 2) cleanup scylla_prepare since we don't need the same function twice 3) Let's use configure_io_slots() during scylla_io_setup to avoid such failure Fixes: #8587 Closes #8512	2021-05-11 18:39:10 +03:00
Michael Livshin	4ea6c7cd49	scylla-gdb.py: recognize new sstable reader types Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Nadav Har'El	fb0c4e469a	Merge 'token_metadata: Fix get_all_endpoints to return nodes in the ring' from Asias He The get_all_endpoints() should return the nodes that are part of the ring. A node inside _endpoint_to_host_id_map does not guarantee that the node is part of the ring. To fix, return from _token_to_endpoint_map. Fixes #8534 Closes #8536 * github.com:scylladb/scylla: token_metadata: Get rid of get_all_endpoints_count range_streamer: Handle everywhere_topology range_streamer: Adjust use_strict_sources_for_ranges token_metadata: Fix get_all_endpoints to return nodes in the ring	2021-05-11 18:39:10 +03:00
Michael Livshin	513695c5ba	scylla-gdb.py: make list_unordered_map more resilient Some unordered_map instantiations have cache=true, some cache=false, but we don't need to care. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	2a386c06d9	scylla-gdb.py: robustify netw & gms Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	76c2d792c9	scylla-gdb.py: redo find_db() in terms of sharded() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	ed2d471e79	scylla-gdb.py: debug::logalloc_alignment may not exist I haven't found a way to make it stay -- __attribute__((used)) is not enough and apparently lld is going to ignore __attribute__((retain)) until at least LLVM 13. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	77d8272cca	scylla-gdb.py: handle changed container type of keyspaces Used to be std::unordered_map, but is a flat_hash_map now. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	69a5aef620	scylla-gdb.py: walk intrusive containers using provided link fields clang & gdb apparently conspire to not reveal template argument types beyond the first one -- at least for some templates, and definitely for Boost's intrusive container ones. This severely restricts our ability to find the right intrusive list link by examining the container type. Allow the caller to simply provide the relevant field name, so we don't have to guess. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	73f9f08df6	test: add a basic test for scylla-gdb.py (And disable it initially, because it won't pass without subsequent commits) Runs only in release mode, to keep things more realistic. Doesn't exercise Scylla much at present -- just stops it after several compactions and tries (almost) all "scylla *" commands in order. Refs #6952. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Michael Livshin	3bff94cd29	test.py: refine test mode control * Add ability to skip tests in individual modes using "skip_in_<mode>". * Add ability to allow tests in specific modes using "run_in_<mode>". * Rename "skip_in_debug_mode" to "skip_in_debug_modes", because there is an actual mode named "debug" and this is confusing. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-05-11 18:39:10 +03:00
Piotr Sarna	1cb804f024	cql-pytest: add regression tests for index creation This commit adds unit tests for an issue with index creation after a table with malicious name is previously created as well. The cases cover both indexes with a default name and the ones with explicit name set.	2021-05-11 17:34:37 +02:00
Piotr Sarna	0ef0a4c78d	cql3: fail to create an index if there is a name conflict When an index with an explicit name is created, its underlying materialized view's name is set to <index-name>_index. If there already exists a regular table with such a name, the creation should fail with a proper error message.	2021-05-11 15:21:00 +02:00
Piotr Sarna	fa53bf5c1e	database: check for conflicting table names for indexes When an index is created without an explicit name, a default name is chosen. However, there was no check if a table with conflicting name already exists. The check is now in place and if any conflicts are found, a new index name is chosen instead.	2021-05-11 15:20:59 +02:00
Botond Dénes	69d04d161e	test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics Not really testing anything, at least not automatically. It just provides coverage for the diagnostics dump code, as well as allows for developers to inspect the printout visually when making changes.	2021-05-10 18:06:30 +03:00
Botond Dénes	542be8d208	scylla-gdb.py: introduce scylla read-stats Too many or too resource-hungry reads often lie at the heart of issues that require an investigation with gdb. Therefore it is very useful to have a way to summarize all reads found on a shard with their states and resource consumptions. This is exactly what this new command does. For this it uses the reader concurrency semaphores and their permits respectively, which are now arranged in an intrusive list and therefore are enumerable. Example output: (gdb) scylla read-stats Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1 permits count memory table/description/state 1 1 14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active 16 0 53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active 1 0 1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive 1 0 0 ./view_builder/active 1 0 0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active 20 1 14334414 Total The command accepts a list of semaphores to dump reads from as its arguments, or if none are provided, it will dump reads from the semaphores of the local database instance.	2021-05-10 16:38:26 +03:00
Piotr Sarna	7f086d8f73	docs: add a paragraph describing service level timeouts Along with examples.	2021-05-10 12:39:41 +02:00
Piotr Sarna	570c63d39b	test: add per-service-level timeout tests The test suite checks if per-service-level timeouts work and validate their input.	2021-05-10 12:39:41 +02:00
Piotr Sarna	43f1f9e445	test: add refreshing client state With a helper client state refresher, some attributes which are usually only refreshed after a client disconnects and then reconnects, can be verified in the test suite.	2021-05-10 12:39:41 +02:00
Piotr Sarna	6da59b8a38	transport: add updating per-service-level params Per-service-level parameters (currently timeouts) are now updated when a new connection is established. The other connections which have the changed role are currently not immediately reloaded.	2021-05-10 12:39:41 +02:00
Piotr Sarna	7ee5686d6c	client_state: allow updating per service level params Per-service-level params can now be updated with a helper function.	2021-05-10 12:39:41 +02:00
Piotr Sarna	368a6976ff	qos: allow returning combined service level options Originally, the API for finding a service level controller returned its name, which also implied that only a single service level may be active for a user and provide its options. After adding timeout parameters it makes more sense to return a result which combines multiple service level parameters - e.g. a user can be attached to one level for read timeouts and a separate one for write timeouts.	2021-05-10 12:39:41 +02:00
Piotr Sarna	cbedefb0f9	qos: add a way of merging service level options In order to combine multiple service level options coming from multiple roles, a helper function is provided to merge two of them. The semantics depend on each parameter, but for timeouts, which are the only parameters at the time of writing this message, the minimum value of the two is taken. That in particular means that when service level A has timeout = 50ms and service level B has timeout = 1s, the resulting service level options would set the timeout to 50ms.	2021-05-10 12:39:41 +02:00
Piotr Sarna	4ba1ac57a1	cql3: add preserving default values for per-sl timeouts In order for per-service-level timeouts to work as expected, a special value is reserved for internally marking the timeouts as deleted.	2021-05-10 11:48:14 +02:00
Piotr Sarna	fb4e8951f5	qos: make getting service level public	2021-05-10 11:48:14 +02:00
Piotr Sarna	06d0e1853d	qos: make finding service level public	2021-05-10 11:48:14 +02:00
Piotr Sarna	e257ec11c0	treewide: remove service level controller from query state ... since it's accessible through its member, client state.	2021-05-10 11:48:14 +02:00
Piotr Sarna	d1f2e8b469	treewide: propagate service level to client state ... since it's going to be used to set up per-service-level timeouts.	2021-05-10 11:48:14 +02:00
Piotr Sarna	00e59a9823	sstables: disambiguate boost::find There are multiple functions named `find` in boost, so to avoid future clashes, this one is explicitly marked as belonging to boost::range.	2021-05-10 11:48:14 +02:00
Piotr Sarna	04880f4e44	cql3: add a timeout column to LIST SERVICE LEVEL statement Listing service levels now includes the timeout parameter.	2021-05-10 11:48:14 +02:00
Piotr Sarna	e8d271fea9	db: add extracting service level info via CQL	2021-05-10 11:45:09 +02:00
Piotr Sarna	b7a8aecb39	types: add a missing translation for cql_duration Its data type is duration_type.	2021-05-10 11:04:39 +02:00
Piotr Sarna	e225e01449	cql3: allow unsetting service level timeouts via using 'null' as a value.	2021-05-10 11:04:36 +02:00
Piotr Sarna	6e83054497	cql3: add validating service level timeout values The checks cover proper granularity (1ms) and not using negative values.	2021-05-10 11:00:51 +02:00
Piotr Sarna	7bb34fdede	db: add setting service level params via system_distributed Service level params (various timeout values) are now properly stored in system_distributed.service_levels table.	2021-05-10 10:43:23 +02:00
Piotr Sarna	4ce83b9a93	cql3: add fetching service level attrs in ALTER and CREATE ALTER SERVICE LEVEL and CREATE SERVICE LEVEL statements now extract service level attrs and pass them to the service level controller.	2021-05-10 10:43:23 +02:00
Piotr Sarna	aa37974192	cql3: add timeout to service level params Timeout value can now be properly parsed from CQL.	2021-05-10 10:43:21 +02:00
Piotr Sarna	3339ea1d0d	qos: add timeout to service level info Service level information now consists of the timeout config, which stores the timeout value for all operations.	2021-05-10 10:22:11 +02:00
Piotr Sarna	ef8da7930f	db,sys_dist_ks: add timeout to the service level table In order to be able to store timeouts in the service level table, an appropriate column is added.	2021-05-10 10:10:38 +02:00
Piotr Sarna	7e6beabf27	migration_manager: allow table updates with timestamp In order to avoid needless schema disagreements, a way of announcing a schema change with fixed timestamp is added. That way, when nodes update schemas of their internal tables (e.g. during updates), it's possible for all nodes to use an identical timestamp for this operation, which in turn makes their digests identical.	2021-05-10 10:10:38 +02:00
Piotr Sarna	774d7546d9	cql3: allow a null keyword for CQL properties This keyword is going to be useful for resetting service level parameters.	2021-05-10 10:10:38 +02:00
Botond Dénes	a6166671ef	reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header Provide a quick summary in the first line of the printout, about the available/initial resources, number of queued reads and number of inactive reads.	2021-05-10 10:15:47 +03:00
Botond Dénes	0a908a47d6	reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines This report is logged, so we don't want huge printouts, cap the table at 20 lines, and print only a summary for the rest. For manual dumps, allow the limit to be set to a custom value, including no limit at all.	2021-05-10 10:15:47 +03:00
Botond Dénes	f0fc3eaefc	reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order So the largest memory consumers are at the top.	2021-05-10 10:15:47 +03:00
Botond Dénes	06e17c48e5	reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table The goal of the printout is to allow finding the culprit for semaphore related problems and this usually involves finding the table/op/state eating the most memory. This is much easier when all the permit summaries are in a single table.	2021-05-10 10:15:47 +03:00
Botond Dénes	595a44bee2	reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources Currently we have a single "count" column and it is not at all clear what it refers to: the number of permits or count resources used by them. Whichever it is, it only represents one of them, so in this commit we add a "permits" column, which in addition to clearing things up, supplies further information to the printout.	2021-05-10 10:15:47 +03:00
Avi Kivity	dd904f7776	storage_proxy: place unique_response_handler:s in small_vector instead of std::vector This cuts an allocation in the write path. Instruction count reduction isn't large, but performance does improve (results are consistent): before: 196369.48 tps ( 55.2 allocs/op, 13.2 tasks/op, 51658 insns/op) after: 199290.32 tps ( 54.2 allocs/op, 13.2 tasks/op, 51600 insns/op) (this is perf_simple_query --write --smp 1 --operations-per-shard 1000000)	2021-05-09 23:53:10 +03:00
Avi Kivity	b015dedd96	storage_proxy: make unique_response_handler friendly to small_vector small_vector wants the move constructor to be noexcept, and move assignment to exist (and be noexcept). These are easy to achieve.	2021-05-09 19:22:20 +03:00
Avi Kivity	8398510382	storage_proxy: give a name to a vector of unique_response_handlers We'd like to change the vector to a more involved type, so to avoid repeating it everywhere, give it a name. The actual type isn't changed in this patch.	2021-05-09 19:17:50 +03:00
Benny Halevy	e041411947	storage_service: do_isolate_on_error: promote warning to error This is a serious condition that warrants a log error. Fixes #8610 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210509110221.146275-1-bhalevy@scylladb.com>	2021-05-09 14:15:50 +03:00
Gleb Natapov	78c5a72b32	raft: drop _leader_progress tracking from the tracker The tracker maintains a separate pointer to current leader progress, but all this complexity is not needed because the tracker already has a find() function that can either find a leader's progress by id or return null. Removing the tracking simplifies the code and makes going out of sync (which is always a possibility if a state is maintained in two different places) impossible.	2021-05-09 13:55:55 +03:00
Gleb Natapov	1245736776	raft: move current_leader into the follower state Only when fsm is in the follower state current_leader has any meaning. In the leader state a node is always its own follower and in a candidate state there is no leader. To make sure that the current_leader value cannot be out of sync with fsm state move it into the follower state.	2021-05-09 13:55:55 +03:00
Raphael S. Carvalho	8480839932	LCS/reshape: Don't reshape single sstable in level 0 with strict mode With strict mode, it could happen that a sstable alone in level 0 is selected for offstrategy compaction, which means that we could run into an infinite reshape process. This is fixed by respecting the offstrategy threshold. Unit test is added. Fixes #8573. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>	2021-05-09 11:09:54 +03:00
Benny Halevy	2a168c3224	atomic_cell: get rid of is_value_fragments It isn't used. Along with it, get rid also of: managed_bytes::is_fragmented and managed_bytes_basic_view::is_fragmented Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210506174115.171048-1-bhalevy@scylladb.com>	2021-05-09 11:08:53 +03:00
Botond Dénes	04c5e42f80	HACKING.md: Core dump debugging: link to debugging.md Instead of some slides from an internal summit. debugging.md has much more detail than said slides (which lack the associated voice recording). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210507125956.381763-1-bdenes@scylladb.com>	2021-05-07 15:32:12 +02:00
Botond Dénes	93cff4925d	HACKING.md: update the coverage guide To include the automated way via test.py as well as the manual way, via coverage.py.	2021-05-07 15:54:49 +03:00
Botond Dénes	435d699393	test.py: add basic coverage generation support Add support for the newly added coverage mode. When --mode=coverage, also invoke the coverage generation report script to produce a coverage report after having run the tests. There are still some rough edges, alternator and cql tests don't work.	2021-05-07 15:54:49 +03:00
Botond Dénes	bc3a424b0e	scripts: introduce coverage.py This script finds all the profiling data generated by tests at the specified path, it merges them and generates a html report combining all their results. It can be used both as a standalone python script, or imported and invoked from python directly.	2021-05-07 15:54:49 +03:00
Botond Dénes	c2808fcd0d	configure.py: replace --coverage with a coverage build mode A separate build mode is a much better fit for coverage generation. Generating coverage requires certain flags and optimization modes, which is much better expressed with a separate build mode, than by bolting it on top of an existing one, possibly conflicting with its own requirements. This patch therefore converts the current `--coverage` flag to a build mode of its own. The build mode is based on debug mode, in fact seastar is built in plain debug mode, with some extra cflags. The new build mode is called "coverage" and it is a non-default mode (by default configure.py doesn't generate build files for it).	2021-05-07 15:23:31 +03:00
Botond Dénes	62cc0fcb78	configure.py: make the --help output more readable The huge amount of choices for the --with argument obscures the help output, making it hard to read. This patch removes the choices list and instead manually checks the passed in artifacts. Unknown artifacts are removed from the list and if it remains empty the script is aborted. Available artifacts can be listed by a new --list-artifacts flag.	2021-05-07 15:23:29 +03:00
Botond Dénes	7f3228a197	configure.py: add build mode descriptions A short description of each build mode in the help text of the option which chooses them.	2021-05-07 14:49:22 +03:00
Botond Dénes	693c2cc20a	configure.py: fix fallback mode selection for checkheaders target Currently modes[0] is used as the fallback when 'dev' is not available. But modes is a dict with mode names as keys, so this won't work. Replace it with modes.keys()[0] to select the first key instead.	2021-05-07 11:35:01 +03:00
Botond Dénes	240ee1070c	configure.py: centralize the declaration of build modes Currently the declaration of build modes is scattered throughout the script. We have several places where build modes are mentioned hardcoded, and related configuration is also scattered in several data structures. This commit centralizes all this into a single data structure, all other code uses this to iterate over modes and to mutate their configuration. This patch was motivated by the wish to make it easier to add a new build mode, which is what the next patch does. This is not something we do often, but I believe these changes also serve to make the code easier to understand and modify.	2021-05-07 11:31:48 +03:00
Botond Dénes	365951f7f7	scylla-gdb.py: add pretty printer for std::string_view This should really be provided by the C++ standard library and indeed I do recall pretty-printing for std:: types working sometimes. They don't work most of the time however, which is not a disaster if you just want to use them in the gdb shell, but is not fine if we want to rely on them in internal scripts, which is what the next patch does. So provide it.	2021-05-07 09:09:21 +03:00
Gleb Natapov	0634674aef	raft: add some precondition checks Check that fsm does not process messages from itself and that it does not try to become its own follower.	2021-05-07 08:04:16 +03:00
Tomasz Grabiec	abe3d7d7d3	Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly. This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case. Test results (perf_simple_query --smp 1 --task-quota-ms 10): ``` before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op) after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op) ``` 4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes). The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile. I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles. Closes #8592 * github.com:scylladb/scylla: storage_proxy, treewide: use utils::small_vector inet_address_vector:s storage_proxy, treewide: introduce names for vectors of inet_address utils: small_vector: add print operator for std::ostream hints: messages.hh: add missing #include	2021-05-06 18:00:54 +02:00
Tomasz Grabiec	6aec8cc447	Merge "raft: fixes and improvements for snapshot transfer" from Gleb * scylla-dev/raft-snapshot-fixes-v4: raft: document that add entry my throw commit_status_unknown raft: test: add test of a leadership change during ongoing snapshot transfer raft: test: retry submitting an entry if it was dropped raft: test: wait for the log to be fully replicated on new leader only raft: drop waiters with outdated terms raft: make snapshot transfer abortable raft: accept snapshots transfer from multiple nodes simultaneously raft: do not send probes while transferring snapshot raft: handle messages sending errors raft: test: return error from rpc module if nodes are disconnected raft: fix a typo in a variable name	2021-05-06 17:44:22 +02:00
Avi Kivity	d6d6758857	Merge 'Switch to use NODE_OPS_CMD for decommission and bootstrap operation' from Asias He In commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation), we switched replace operation to use the new NODE_OPS_CMD infrastructure. In this patch set, we continue the work to switch decommission and bootstrap operation to use NODE_OPS_CMD. Fixes #8472 Fixes #8471 Closes #8481 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for bootstrap operation repair: Switch to use NODE_OPS_CMD for decommission operation	2021-05-06 17:28:19 +03:00
Avi Kivity	f2132150c4	Merge "Extract reader concurrency semaphore tests into separate file" from Botond " The current home of these tests -- mutation_reader_test -- is already one of the largest test files we have. To reduce the size of the former and to make finding these tests easier they are moved to a separate file. " * 'reader-concurrency-semaphore-test/v2' of https://github.com/denesb/scylla: test: move reader_concurrency_semaphore related tests into separate file test: mutation_reader_test: convert restricted reader tests to semaphore tests	2021-05-06 17:13:45 +03:00
Gleb Natapov	aa7ea333da	raft: document that add entry my throw commit_status_unknown	2021-05-06 11:59:36 +03:00
Gleb Natapov	3a1bff26dd	raft: test: add test of a leadership change during ongoing snapshot transfer	2021-05-06 11:34:31 +03:00
Gleb Natapov	612e0f08c4	raft: test: retry submitting an entry if it was dropped	2021-05-06 11:34:31 +03:00
Gleb Natapov	0b2c9c549a	raft: test: wait for the log to be fully replicated on new leader only When forcing new leader it should be enough to wait for log to be fully replicated to that particular leader.	2021-05-06 11:34:31 +03:00
Gleb Natapov	d2f58d8656	raft: drop waiters with outdated terms Currently an entry is declared to be dropped only when an entry with a different term is committed with the same index, but that may create a situation where, if no new entries are submitted for a long time, an already dropped entry will not be noticed for a long time as well. Consider the case where a client submits 10 entries on a leader A, but before they get replicated the leadership moves to a node B. B will commit a dummy entry which will be committed eventually and will release one of the waiters on A, but if nothing else is submitted to B the 9 other waiters will wait forever. The way to solve that is to drop all waiters that wait for a term smaller than the one being committed. There is no chance they will be committed any longer since terms in the log may only grow.	2021-05-06 11:34:31 +03:00
Gleb Natapov	6abe2772dc	raft: make snapshot transfer abortable A snapshot transfer may take a lot of time and meanwhile a leader doing it may lose the leadership. If that happens the ongoing snapshot transfer becomes obsolete since the snapshot will be rejected by the receiving node as coming from an old leader. Make snapshot transfer abortable and abort them when leader changes.	2021-05-06 11:34:31 +03:00
Gleb Natapov	50d545a138	raft: accept snapshots transfer from multiple nodes simultaneously A leader may change while one of its followers is in snapshot transfer mode and that node may get an additional request for snapshot transfer from a new leader while the previous transfer is still not aborted. Currently such a situation will trigger an assert. This patch allows having active snapshot transfers from multiple nodes, but only one of them will succeed in the end, all others will be replied to with 'fail'.	2021-05-06 11:34:31 +03:00
Gleb Natapov	073a9be4c7	raft: do not send probes while transferring snapshot If a follower is in snapshot transfer mode there is no need to send probe append messages to it.	2021-05-06 11:34:31 +03:00
Gleb Natapov	08077a21b7	raft: handle messages sending errors Failure to send a message should not abort the raft server.	2021-05-06 11:34:31 +03:00
Gleb Natapov	d0ebd79deb	raft: test: return error from rpc module if nodes are disconnected Returning an error when nodes are disconnected more closely resembles what will happen in real networking.	2021-05-06 11:34:31 +03:00
Gleb Natapov	c4d87d7a23	raft: fix a typo in a variable name	2021-05-06 11:33:47 +03:00
Asias He	5a410cb6e3	token_metadata: Get rid of get_all_endpoints_count It is now only a wrapper for count_normal_token_owners. Refs #8534	2021-05-06 15:36:20 +08:00
Botond Dénes	c872a963b6	test: move reader_concurrency_semaphore related tests into separate file The mutation_reader_test is already one of our largest test files. Move the reader concurrency semaphore related tests to a new file, making them easier to find and making the mutation reader test a little bit smaller too.	2021-05-06 08:59:47 +03:00
Botond Dénes	5f217b6dee	test: mutation_reader_test: convert restricted reader tests to semaphore tests These two tests (restricted_reader_timeout and restricted_reader_max_queue_length) are testing the semaphore in reality, but through the restricted reader, which is distracting as it needlessly brings in an additional layer into the picture. Rewrite them to test the semaphore directly, getting much lighter in the process.	2021-05-06 08:57:12 +03:00
Botond Dénes	77da141604	scylla-gdb.py: std_map() add __len__()	2021-05-06 08:41:49 +03:00
Botond Dénes	0b6705c253	scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__() Apparently, if a container is passed to `list()` it tries to obtain its size first, which in this case leads to infinite recursion. To prevent this, coerce `self` to an iterator.	2021-05-06 08:41:49 +03:00
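The recursion fixed in 0b6705c253 can be reproduced outside gdb. The sketch below is a toy Python class, not the actual scylla-gdb.py code (the class body and its `_items` field are assumptions); it shows why coercing `self` to an iterator breaks the cycle: CPython's `list()` asks its argument for a size hint, which invokes `__len__` when the argument defines one, whereas a plain generator has no `__len__`.

```python
class IntrusiveList:
    """Toy stand-in for a container wrapper that only knows how to iterate."""

    def __init__(self, items):
        self._items = items

    def __iter__(self):
        # A generator function: iter(self) returns a plain generator object,
        # which does not define __len__.
        for item in self._items:
            yield item

    def __len__(self):
        # Writing len(list(self)) here would recurse: list() asks its
        # argument for a size hint, which calls this __len__ again.
        # Passing an iterator instead avoids the size query on self.
        return len(list(iter(self)))


print(len(IntrusiveList(["a", "b", "c"])))
```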
Asias He	4793894fac	range_streamer: Handle everywhere_topology The everywhere_topology returns the number of nodes in the cluster as RF. This makes only streaming from the node losing the range impossible since no node is losing the range after bootstrap. Shortcut it not to use strict source in case the keyspace is everywhere_topology. Refs #8503	2021-05-06 10:02:11 +08:00
Asias He	1b7414860b	range_streamer: Adjust use_strict_sources_for_ranges Now get_all_endpoints() returns the number of nodes in the ring. We need to adjust the check for whether to use strict sources. Use strict when the number of nodes in the ring is equal to or greater than RF. Refs #8534	2021-05-06 10:02:11 +08:00
Asias He	ddeabba6aa	token_metadata: Fix get_all_endpoints to return nodes in the ring The get_all_endpoints() should return the nodes that are part of the ring. A node inside _endpoint_to_host_id_map does not guarantee that the node is part of the ring. To fix, return from _token_to_endpoint_map. Fixes #8534	2021-05-06 10:02:11 +08:00
Avi Kivity	e9802348b5	storage_proxy, treewide: use utils::small_vector inet_address_vector:s Replace std::vector<inet_address> with a small_vector of size 3 for replica sets (reflecting the common case of local reads, and the somewhat less common case of single-datacenter writes). Vectors used to describe topology changes are of size 1, reflecting that up to one node is usually involved with topology changes. At those counts and below we save an allocation; above those counts everything still works, but small_vector allocates like std::vector. In a few places we need to convert between std::vector and the new types, but these are all out of the hot paths (or are in a hot path, but behind a cache).	2021-05-05 18:36:54 +03:00
Avi Kivity	cea5493cb7	storage_proxy, treewide: introduce names for vectors of inet_address storage_proxy works with vectors of inet_addresses for replica sets and for topology changes (pending endpoints, dead nodes). This patch introduces new names for these (without changing the underlying type - it's still std::vector<gms::inet_address>). This is so that the following patch, that changes those types to utils::small_vector, will be less noisy and highlight the real changes that take place.	2021-05-05 18:36:48 +03:00
Gleb Natapov	745f63991f	raft: test: fix c&p error in a test Message-Id: <YJKBOwBX8hqHLxsB@scylladb.com>	2021-05-05 17:18:49 +02:00
Avi Kivity	ddb1f0e6ca	Merge "Choose the user max-result-size for service levels" from Botond " Choosing the max-result-size for unlimited queries is broken for unknown scheduling groups. In this case the system limit (unlimited) will be chosen. A prime example of this break-down is when service levels are used. This series fixes this in the same spirit as the similar semaphore selection issue (#8508) was fixed: use the user limit as the fall-back in case of unknown scheduling groups. To ensure future fixes automatically apply to both query-classification related configurations, selecting the max result size for unlimited queries is now delegated to the database, sharing the query classification logic with the semaphore selection. Fixes: #8591 Tests: unit(dev) " * 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla: service/storage_proxy: get_max_result_size() defer to db for unlimited queries database: add get_unlimited_query_max_result_size() query_class_config: add operator== for max_result_size database: get_reader_concurrency_semaphore(): extract query classification logic	2021-05-05 18:11:10 +03:00
Lauro Ramos Venancio	15f72f7c9e	TWCS: initialize _highest_window_seen The timestamp_type is an int64_t, so it has to be explicitly initialized before it is used. The missing initialization prevented major compaction from happening when a time window finishes, as described in #8569. Fixes #8569 Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com> Closes #8590	2021-05-05 17:31:05 +03:00
Avi Kivity	1ed3f54f4a	Merge "size_tiered_compaction_strategy: get_buckets improvements" from Benny " This patchset contains 3 main improvements to STCS get_buckets implementation and algorithm: 1. Consider only current bucket for each sstable. No need to scan all buckets using a map since the inserted sstables are sorted by size. 2. Use double precision for keeping bucket average size. Prevent rounding error accumulation. 3. Don't let the bucket average drift too high. As we insert increasingly larger sstables into a bucket, its average size drifts up and eventually this may break the bucket invariant that all sstables in the bucket should be within the (bucket_low, bucket_high) range relative to the bucket average. Test: unit(dev) DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy, compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy Fixes #8584 " * tag 'stcs-buckets-v3' of github.com:bhalevy/scylla: compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable	2021-05-05 16:25:12 +03:00
Avi Kivity	6977064693	dist: scylla_raid_setup: reduce xfs block size to 1k Since Linux 5.12 [1], XFS is able to asynchronously overwrite sub-block ranges without stalling. However, we want good performance on older Linux versions, so this patch reduces the block size to the minimum possible. That turns out to be 1024 for crc-protected filesystems (which we want) and it can also not be smaller than the sector size. So we fetch the sector size and set the block size to that if it is larger than 512. Most SSDs have a sector size of 512, so this isn't a problem. Tested on AWS i3.large. Fixes #8156. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b Closes #8585	2021-05-05 16:07:50 +03:00
Nadav Har'El	64a4e5e059	cross-tree: reduce dependency on db/config.hh and database.hh Every time db/config.hh is modified (e.g., to add a new configuration option), 110 source files need to be recompiled. Many of those 110 didn't really care about configuration options, and just got the dependency accidentally by including some other header file. In this patch, I remove the include of "db/config.hh" from all header files. It is only needed in source files - and header files only need forward declarations. In some cases, source files were missing certain includes which they got incidentally from db/config.hh, so I had to add these includes explicitly. After this patch, the number of source files that get recompiled after a change to db/config.hh goes down from 110 to 45. It also means that 65 source files now compile faster because they don't include db/config.hh and whatever it included. Additionally, this patch also eliminates a few unnecessary inclusions of database.hh in other header files, which can use a forward declaration or database_fwd.hh. Some of the source files including one of those header files relied on one of the many header files brought in by database.hh, so we need to include those explicitly. In view_update_generator.hh something interesting happened - it needs database.hh because of code in the header file, but only included database_fwd.hh, and the only reason this worked was that the files including view_update_generator.hh already happened to unnecessarily include database.hh. So we fix that too. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210505121830.964529-1-nyh@scylladb.com>	2021-05-05 15:24:25 +03:00
Nadav Har'El	5fbd78ed96	CONTRIBUTING.md: add the requirement for self-contained headers As far as I can tell, we never documented requirement for self-contained headers in our coding style. So let's do it now, and explain the "ninja dev-headers" command and how to use it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210505120908.963388-1-nyh@scylladb.com>	2021-05-05 15:10:46 +03:00
Botond Dénes	9a32889ac0	test: boost/sstable_datafile_test: add tests for segregate mode scrub Add two new unit tests dedicated to the new segregate scrub mode.	2021-05-05 14:35:04 +03:00
Botond Dénes	550a1cd036	api: storage_service/keyspace_scrub: expose new segregate mode Allow invoking scrub with the newly added segregate mode as well.	2021-05-05 14:35:04 +03:00
Botond Dénes	674a77ead0	sstables: compaction/scrub: add segregate mode In segregate mode scrub will segregate the content of input sstables into potentially multiple output sstables such that they respect partition level and fragment level monotonicity requirements. This can be used to fix data where partitions or even fragments are out-of-order or duplicated. In this case no data is lost and after the scrub each sstable contains valid data. Out-of-order partitions are fixed by simply being written into a separate output, compared to the last one compaction was writing into. Out-of-order fragments are fixed by injecting a partition-end/partition-start pair right before them, effectively moving them into a separate (duplicate) partition which is then treated in the above mentioned way. This mode can fix corruptions where partitions are out-of-order or duplicated. This mode cannot fix corruptions where partitions were merged: although the data will be made valid at the database level, it won't be at the business-logic level.	2021-05-05 14:33:49 +03:00
Benny Halevy	ead96e21c3	compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-05 14:26:37 +03:00
Benny Halevy	c1681cb9ea	compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high SSTables are added in increasing size order so the bucket's average might drift upwards. Don't let it drift too high, to a point where the smallest SSTable might fall out of range. For example, here's a simulation run of the algorithm for these sstable sizes: [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774] the simulated compaction strategy options are: min_sstable_size = 4 bucket_low = 0.66667 bucket_high = 1.5 For each bucket, the following is printed: (avg * bucket_low) avg (avg * bucket_high) UNCHANGED: buckets={ ( 14.0) 21.0 ( 31.5): [21] ( 82.0) 123.0 ( 184.5): [123] ( 276.4) 414.6 ( 621.9): [252, 363, 379, 394, 407, 428, 463, 467, 470, 523] ( 508.7) 763.0 (1144.5): [752, 774] } IMPROVED: buckets={ ( 14.0) 21.0 ( 31.5): [21] ( 82.0) 123.0 ( 184.5): [123] ( 247.0) 370.5 ( 555.8): [252, 363, 379, 394, 407, 428] ( 320.5) 480.8 ( 721.1): [463, 467, 470, 523] ( 508.7) 763.0 (1144.5): [752, 774] } Fixes #8584 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-05 14:26:28 +03:00
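The simulation quoted in c1681cb9ea can be reproduced with a short Python model. This is a sketch under stated assumptions, not the C++ `get_buckets` code: the function signature and the `limit_drift` switch are hypothetical, sstables are represented by their sizes only, and the `min_sstable_size` option is ignored because every size in the example exceeds it.

```python
def get_buckets(sizes, bucket_low=0.66667, bucket_high=1.5, limit_drift=True):
    """Group sizes into buckets, looking only at the most recent bucket."""
    buckets = []  # each entry is [average, [member sizes...]]
    for size in sorted(sizes):
        if buckets:
            bucket = buckets[-1]
            avg, members = bucket
            if avg * bucket_low < size < avg * bucket_high:
                new_avg = (avg * len(members) + size) / (len(members) + 1)
                # Drift check: accept the new member only if the smallest
                # member still lies above the new lower bound.
                if not limit_drift or members[0] > new_avg * bucket_low:
                    members.append(size)
                    bucket[0] = new_avg
                    continue
        buckets.append([float(size), [size]])
    return [members for _, members in buckets]


sizes = [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774]
print(get_buckets(sizes, limit_drift=False))  # the UNCHANGED grouping
print(get_buckets(sizes, limit_drift=True))   # the IMPROVED grouping
```

In the run without the drift check, the third bucket ends with an average of 414.6 and a lower bound of 276.4, which its smallest member (252) no longer satisfies; with the check, 463 starts a new bucket instead.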
Benny Halevy	d3aa5265ab	compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number Using integer division loses accuracy by rounding down the result. Each time we calculate: ``` auto total_size = bucket.size() * old_average_size; auto new_average_size = (total_size + size) / (bucket.size() + 1); ``` we accumulate the rounding error. total_size might be too small since old_average_size was previously rounded down, and then new_average_size is rounded down again. Rather than trying to compensate for the rounding errors by e.g. adding size / 2 to the dividend, simply keep the average as a double precision number. Note that we multiply old_average_size by options.bucket_{low,high}, which are double precision too, so the size comparisons are already using FP instructions implicitly.	Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-05 14:26:25 +03:00
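The accumulation described in d3aa5265ab is easy to see on a tiny input. The Python sketch below mirrors the two-line update from the commit message; the `running_average` helper and its `exact` flag are hypothetical names, with `//` standing in for C++ integer division.

```python
def running_average(sizes, exact):
    """Update a bucket average one sstable at a time.

    exact=False mimics integer division (rounds down at every step);
    exact=True keeps the average as a floating point number.
    """
    average = float(sizes[0]) if exact else sizes[0]
    for count, size in enumerate(sizes[1:], start=1):
        total_size = count * average  # reconstructed from the stored average
        if exact:
            average = (total_size + size) / (count + 1)
        else:
            average = (total_size + size) // (count + 1)
    return average


sizes = [1, 2, 2, 2]  # true mean is 1.75
print(running_average(sizes, exact=False))
print(running_average(sizes, exact=True))
```

With integer arithmetic the stored average never leaves 1 for this input, because each step rebuilds the total from an already rounded-down average.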
Benny Halevy	44b094f9a5	compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size Since now it became a reference used to update the bucket's average size after a new sstable is inserted into it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-05 14:26:20 +03:00
Benny Halevy	336a4dc0fd	compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable Since the sstables are sorted in increasing size order there is no need to consider all buckets to find a matching one. Instead, just consider the most recently inserted bucket. Once we see a sstable size outside the allowed range for this bucket, create a new bucket and consider this one for the next sstable. Note, `old_average_size` should be renamed since this change turns it into a reference and it's assigned with the new average_size. This patch keeps the old name to reduce the churn. The following patch will do only the rename. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-05 14:26:05 +03:00
Botond Dénes	9d5e958331	service/storage_proxy: get_max_result_size() defer to db for unlimited queries Defer picking the appropriate max result size for unlimited queries to the database, which is already the place where we make query classifying decisions. This move means that all these decisions are now centralized in the database, not scattered in different places and fixing one fixes all users.	2021-05-05 13:30:50 +03:00
Botond Dénes	992819b188	database: add get_unlimited_query_max_result_size() Similar to the already existing get_reader_concurrency_semaphore(), this method determines the appropriate max result size for the query class, which is deduced from the current scheduling group. This method shares its scheduling group -> query class association mechanism with the above mentioned semaphore getter.	2021-05-05 13:30:42 +03:00
Nadav Har'El	58e275e362	cross-tree: reduce dependency on db/config.hh and database.hh Every time db/config.hh is modified (e.g., to add a new configuration option), 110 source files need to be recompiled. Many of those 110 didn't really care about configuration options, and just got the dependency accidentally by including some other header file. In this patch, I remove the include of "db/config.hh" from all header files. It is only needed in source files - and header files only need forward declarations. In some cases, source files were missing certain includes which they got incidentally from db/config.hh, so I had to add these includes explicitly. After this patch, the number of source files that get recompiled after a change to db/config.hh goes down from 110 to 45. It also means that 65 source files now compile faster because they don't include db/config.hh and whatever it included. Additionally, this patch also eliminates a few unnecessary inclusions of database.hh in other header files, which can use a forward declaration or database_fwd.hh. Some of the source files including one of those header files relied on one of the many header files brought in by database.hh, so we need to include those explicitly. In view_update_generator.hh something interesting happened - it needs database.hh because of code in the header file, but only included database_fwd.hh, and the only reason this worked was that the files including view_update_generator.hh already happened to unnecessarily include database.hh. So we fix that too. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210505102111.955470-1-nyh@scylladb.com>	2021-05-05 13:23:00 +03:00
Avi Kivity	83a826a4de	Merge 'Azure Ls v2 local disk setup' from Lubos Kosco fixes #8325 The iotune tests happened on Centos 8.2 both with stock and elrepo kernel, using Scylla 4.3 rc3 results are in https://docs.google.com/spreadsheets/d/1_uYq8UxY47XF5jreetrpleykLPqNGjfPXIirvTPh6rk/edit#gid=1101336711 Closes #7807 * github.com:scylladb/scylla: scylla_io_setup: add disk properties for L Azure instances scylla_util.py: add new class for Azure cloud support	2021-05-05 12:39:00 +03:00
Avi Kivity	3114f09d76	utils: small_vector: add print operator for std::ostream In order to replace std::vector with utils::small_vector, it needs to support this feature too.	2021-05-05 12:10:59 +03:00
Avi Kivity	84ea06f15b	hints: messages.hh: add missing #include Make the header self-contained.	2021-05-05 12:10:17 +03:00
Botond Dénes	104a47699c	mutation_fragment_stream_validator: add reset methods Allow resetting the validator to a given partition or mutation fragment. This allows a user which is able to fix corrupt streams to reset the validator to a partition or row which the validator normally wouldn't accept and hence it wouldn't advance its internal state to it.	2021-05-05 12:03:42 +03:00
Botond Dénes	a53e6bc6e8	mutation_writer: add segregate_by_partition Add a new segregator which segregates a stream, potentially containing duplicate or even out-of-order partitions, into multiple output streams, such that each output stream has strictly monotonic partitions. This segregator will be used by a new scrub compaction mode which is meant to fix sstables containing duplicate or out-of-order data.	2021-05-05 12:03:42 +03:00
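The property that a53e6bc6e8 asks of the segregator (every output stream has strictly monotonic partitions) can be illustrated with a greedy first-fit sketch in Python. This is an assumption-laden model, not the `segregate_by_partition` implementation: partitions are reduced to comparable keys, and the first-fit placement policy is one simple choice that satisfies the stated property.

```python
def segregate_by_partition(keys):
    """Split a key stream into outputs that are each strictly increasing."""
    outputs = []
    for key in keys:
        for out in outputs:
            if out[-1] < key:
                # Appending keeps this output strictly monotonic.
                out.append(key)
                break
        else:
            # Duplicate or out-of-order relative to every existing output:
            # open a new output stream for it.
            outputs.append([key])
    return outputs


print(segregate_by_partition([1, 3, 2, 2, 5]))
```

No key is discarded: an already ordered, duplicate-free stream ends up in a single output, and each duplicate or out-of-order key costs at most one extra output.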
Botond Dénes	34643ac997	api: /storage_service/keyspace_scrub: add scrub mode param Add direct support to the newly added scrub mode enum. Instead of the legacy `skip_corrupted` flag, one can now select the desired mode from the mode enum. `skip_corrupted` is still supported for backwards compatibility but it is ignored when the mode enum is set.	2021-05-05 12:03:42 +03:00
Botond Dénes	03728f5c26	sstables: compaction/scrub: replace skip_corrupted with mode enum We want to add more modes than the current two, so replace the current boolean mode selector with an enum which allows for easy extensions.	2021-05-05 12:03:42 +03:00
Botond Dénes	ba75115e20	sstables: compaction/scrub: prevent infinite loop when last partition end is missing Scrub compaction will add the missing last partition-end in a stream when allowed to modify the stream. This however can cause an infinite loop: 1) user calls fill_buffer() 2) process fragments until underlying is at EOS 3) add missing partition end 4) set EOS 5) user sees that last buffer wasn't empty 6) calls fill_buffer() again 7) goto (3) To prevent this cycle, break out of `fill_buffer()` early when both the scrub reader and the underlying is at EOS.	2021-05-05 12:03:42 +03:00
Botond Dénes	41181a5c2f	tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests No point in creating a permit for every mutation fragment.	2021-05-05 12:03:42 +03:00
Botond Dénes	e84c31fab8	query_class_config: add operator== for max_result_size	2021-05-05 11:20:22 +03:00
Botond Dénes	9313acb304	database: get_reader_concurrency_semaphore(): extract query classification logic Into a local function. In the next patch we want to add another method which needs to classify queries based on the current scheduling group, so prepare for sharing this logic.	2021-05-05 10:41:04 +03:00
Tomasz Grabiec	121eb32679	Merge 'test: perf: report instructions retired per operations' from Avi Kivity Instructions retired per op is much more stable than time per op (inverse throughput), since it isn't much affected by changes in CPU frequency or other load on the test system (it's still somewhat affected, since a slower system will run more reactor polls per op). It's also less indicative of real performance, since it's possible for fewer instructions to execute in more time than more instructions, but that isn't an issue for comparative tests. This allows incremental changes to the code base to be compared with more confidence. Current results are around 55k instructions per read, and 52k for writes. Closes #8563 * github.com:scylladb/scylla: test: perf: tidy up executor_stats snapshot computation test: perf: report instructions retired per operations test: perf: add RAII wrapper around Linux perf_event_open() test: perf: make executor_stats_snapshot() a member function of executor	2021-05-05 00:54:08 +02:00
Tomasz Grabiec	b8665c459d	Merge "raft: replication test updates" from Alejo Cleanups, fixes, and configuration change support for replication tests. * alejo/raft-tests-replication-01-fixes-v13: raft: replication test: remove obsolete helper raft: replication test: add_entry with retries raft: replication test: support config change raft: replication test: add dummy command support raft: replication test: test both with and without prevote raft: replication test: make initial leader just default raft: replication test: create command helper raft: replication test: free elections as helper raft: replication test: fix election connectivity raft: replication test: fix custom election raft: replication test: add helpers for threshold and election raft: replication test: connectivity improvement raft: replication test: helper for server_address raft: replication test: use wait_log() raft: replication test: cycle leader more raft: replication test: fix a test description raft: replication test: remove multiple state machines raft: replication test: remove checksum raft: replication test: remove unused class param	2021-05-04 18:52:47 +02:00
Alejo Sanchez	27ad2a0f28	raft: replication test: remove obsolete helper As we are now serially adding commands with consecutive integers there is no need to build vectors of commands. Remove helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-04 11:01:07 -04:00
Alejo Sanchez	0a54fd848b	raft: replication test: add_entry with retries The current leader might have stepped down. Try again and learn if there's a new leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-04 11:00:46 -04:00
Nadav Har'El	df65d09e08	Merge ' cdc: log: fill cdc$deleted_ columns in pre-images ' from Piotr Grabowski Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example: ``` INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1); INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1); ``` For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode): ``` pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 ``` `v=NULL` has two meanings: 1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2). 2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image. Therefore, to properly decode pre-images you would need to know in which mode pre-image was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check a current mode of pre-image). A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row: If in pre-image 'true' mode: ``` pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 ``` If in pre-image 'full' mode: ``` pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1 ``` A client library now can properly decode a pre-image row. If it sees a `NULL` value, it can now check the `cdc$deleted_` column to determine if this `NULL` value was a part of pre-image or it was omitted due to not being an affected column in the delta operation. No such change is necessary for the post-image rows, as those images are always generated in the `full` mode. Additional example: Additional example of trouble decoding pre-images before this change. 
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode: ``` INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1); INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1); ``` ``` INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1); ``` generated pre-image: ``` pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 ``` ``` INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1); ``` generated pre-image: ``` pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 ``` Both pre-images look the same, but: 1. `v=NULL` in tbl2 describes v being omitted from the pre-image. 2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image. Closes #8568 * github.com:scylladb/scylla: cdc: log: assert post_image is always in full mode cdc: tests: check cdc$deleted_ columns in images cdc: log: fill cdc$deleted_ columns in pre-images	2021-05-04 14:45:27 +03:00
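The decoding rule that df65d09e08 enables for client libraries can be sketched in Python. The function name, the dict-based row representation and the returned labels are assumptions made for illustration; only the `cdc$deleted_` column prefix and the meaning of its values come from the commit message.

```python
def decode_preimage_column(row, column):
    """Classify one column of a CDC pre-image row.

    `row` maps column names to values, with None standing for NULL.
    """
    value = row.get(column)
    if value is not None:
        return ("value", value)
    if row.get("cdc$deleted_" + column):
        # NULL in the row and flagged: the column really was NULL
        # in the pre-image.
        return ("null", None)
    # NULL in the row and not flagged: the column was left out of the
    # pre-image because the delta operation did not affect it.
    return ("absent", None)


true_mode = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": None, "v2": 1}
full_mode = {"pk": 1, "ck": 1, "v": None, "cdc$deleted_v": True, "v2": 1}
print(decode_preimage_column(true_mode, "v"))
print(decode_preimage_column(full_mode, "v"))
```

The two rows correspond to the 'true' and 'full' pre-image examples in the merge description; without the `cdc$deleted_v` value they would be indistinguishable.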
Lubos Kosco	c26bcf29f9	scylla_io_setup: add disk properties for L Azure instances	2021-05-04 13:13:05 +02:00
Lubos Kosco	f627fcbb0c	scylla_util.py: add new class for Azure cloud support	2021-05-04 13:12:42 +02:00
Piotr Grabowski	cd6154e8bf	cdc: log: assert post_image is always in full mode Add an assertion that checks that post_image can never be in non-full mode.	2021-05-04 12:33:15 +02:00
Piotr Grabowski	778fbb144f	cdc: tests: check cdc$deleted_ columns in images Add a test that checks whether the cdc$deleted_ columns are properly filled in the pre/post-image rows. This test checks tables with only atomic columns, tables with frozen collections and non-frozen collections. The test is performed with both 'true' pre-image mode and 'full' pre-image mode.	2021-05-04 12:33:15 +02:00
Calle Wilund	7e345e37e8	cql/cdc_batch_delete_postimage_test - rename test files + fix result The tests, when added, were not named kosher (_test), which the runner apparently, quaintly, requires in order to pick them up (instead of the more sensible .cql). Thus, the test was never run beyond initial creation, and also bit-rotted slightly during behaviour changes. Renamed and re-resulted. Closes #8581	2021-05-04 12:47:33 +03:00
Avi Kivity	ef2313325b	Merge "Teach sstables streams new streams API" from Pavel E " Recent changes in seastar added the ability for data sinks to advertise the buffer size up to the stream level. This change was needed to make the output stack honor the io-queue's max request length. There are two more places left to patch. The first is the sstables checksumming writer. This is the sink implementation that has another sink inside. So this one is patched to report up (to the output stream) the buffer size from the lower sink (which is a file data sink that already "knows" the maximum IO lengths). The second one is the compress sink, but this sink embeds an output stream inside, so even if it's working with larger buffers, that inner stream will split them properly. So this place is patched just to stop using the deprecated output stream constructor. tests: unit(dev) " * 'br-streams-napi' of https://github.com/xemul/scylla: sstables: Make checksum sink report buffer size from lower sink sstables: Report buffer size from compressed file sink	2021-05-04 12:22:38 +03:00
Pavel Emelyanov	13b07a3c58	sstables: Make checksum sink report buffer size from lower sink The checksum sink carries another sink on board and forwards the put buffers lower, so there's no point in making these two have different buffer sizes. This is what really happens now, but this change makes this more explicit and makes the checksumming code conform to the new output stream API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-04 12:01:30 +03:00
Pavel Emelyanov	01b979beca	sstables: Report buffer size from compressed file sink This change just moves the place from which the output_stream knows the compression::uncompressed_chunk_length() value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-04 12:01:27 +03:00
Benny Halevy	946f9d9c83	commitlog: segment_manager::shutdown: abort on errors Currently, if sync_all_segments fails during shutdown, _shutdown is never set, causing replenish_reserve to hang, as possibly seen in #8577. It is better if scylla aborts on critical errors during shutdown rather than just hang. Refs #8577 Test: unit(dev) DTest: commitlog_test.py Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-04 10:00:03 +03:00
Pekka Enberg	6583a04e5d	Update seastar submodule * seastar f1b6b95b...847fccaf (1): > perftune.py: fix parsing of 'write_back_cache' YAML option	2021-05-04 09:12:49 +03:00
Benny Halevy	f01307d816	commitlog: allocate_segment_ex: make_checked_file To make sure no errors writing to commitlog are tolerated. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-05-04 09:00:58 +03:00
Avi Kivity	6ffd813b7b	Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency. When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter. This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout is reached. This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value. 
Fixes #8102 Tests: - unit(dev) - some manual tests: - shutting down repair coordinator during hints replay, - shutting down node participating in repair during hints replay, Closes #8452 * github.com:scylladb/scylla: repair: introduce abort_source for repair abort repair: introduce abort_source for shutdown storage_proxy: add abort_source to wait_for_hints_to_be_replayed storage_proxy: stop waiting for hints replay when node goes down hints: dismiss segment waiters when hint queue can't send repair: plug in waiting for hints to be sent before repair repair: add get_hosts_participating_in_repair storage_proxy: coordinate waiting for hints to be sent config: add wait_for_hint_replay_before_repair option storage_proxy: implement verbs for hint sync points messaging_service: add verbs for hint sync points storage_proxy: add functions for syncing with hints queue db/hints: make it possible to wait until current hints are sent db/hints: add a metric for counting processed files db/hints: allow to forcefully update segment list on flush	2021-05-03 18:47:27 +03:00
Alejo Sanchez	56e977ae69	raft: replication test: support config change Add support for configuration change on leader. Keep track of servers in config in test. Add a dummy entry to confirm configuration changed. If the add fails, because the old leader was not in the new config and stepped down, the config is considered changed, too. Add a test with some configuration changes. Add a test cycling every scenario for 1 of 4 nodes removed. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	8d8af92cbb	raft: replication test: add dummy command support Use a special value as dummy entry to be ignored when seen in state machine input. Ignore dummy entries for count. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	4aa52be7e5	raft: replication test: test both with and without prevote Before this change the default was prevote enabled. With this change each test is run with and without prevote. This doubles the number of test cases. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	e759e492c7	raft: replication test: make initial leader just default The test suite requires an initial leader and at the moment it's always just 0. Make it default and simplify code. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	eb5bbcdec7	raft: replication test: create command helper Factor out repeated code and make it available for other uses. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	eb94dd26dc	raft: replication test: free elections as helper Add a helper to run free elections and use it in partitioning. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	cb297a57df	raft: replication test: fix election connectivity If a leader was already disconnected the election of a new leader could re-connect. Save original connectivity and restore it when done electing new leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	0a5c605713	raft: replication test: fix custom election Use the new specific connectivity to manage old leader disconnection more specifically. This fixes having elections where the vote of the old leader is required for quorum. For example, take {A,B} where we want to switch leader. For B to become a candidate it has to see A as down. Then A has to see B's request for vote, and vote for B. So in the general case the old leader first needs to be disconnected from all nodes, the desired node made a candidate, and then the old leader connected only to the desired candidate (else, other nodes would see the new candidate as disrupting a live leader). Also, there might be stray messages from the former leader. These could revert the candidate to follower. To handle this, this patch retries the process until the desired node becomes leader. The helper function elect_me_leader() is split and renamed to wait_until_candidate() and wait_election_done(). The former ticks until the node is a candidate and the latter waits until a candidate either becomes a leader or reverts to follower. The existing etcd test workaround of incrementing from n=2 to n=3 nodes is corrected back to the original n=2. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	9909983e38	raft: replication test: add helpers for threshold and election Add 2 helper functions for making nodes reach timeout threshold and to elect a specific node. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	38526d7a2f	raft: replication test: connectivity improvement Replace simple full disconnect of a node with specific from -> to disconnection tracking. This will help electing new leaders. Say there are {A,B,C} with A leader and we want to elect B. Before this patch, we would disconnect A, run an election with just {B,C}, and then re-connect A. If we have {A,B} and want to elect B, this won't work as B needs 2/2+1 votes and A is disconnected, even if we made A step down. This patch corrects this shortcoming. (@gleb-cloudius) With this patch, we can specify other followers (not the previous or next leader) to not see the old leader, but the new and old leaders see each other just fine. In the example {A,B,C} above we can cut A<->B specifically. Also, this is closer to etcd testing and should help porting cases. NOTE: in the current test implementation failure_detector reports node.is_alive(other_node) if there is a connection both ways. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	f53dea432c	raft: replication test: helper for server_address A helper function to convert from local 0-based id to raft 1-based server_address. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	294e16cf8b	raft: replication test: use wait_log() Use wait_log() helper in leftover election code. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	355c8a052f	raft: replication test: cycle leader more For ported etcd test cycle leader, cycle some more. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	5b2c9a6c94	raft: replication test: fix a test description Fix replace_log_leaders_log_empty description comment. Reported by @kbraun Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	bbb56e2265	raft: replication test: remove multiple state machines Checksum was removed so undo support for multiple versions added in: test: add support for different state machines `43dc5e7dc2` NOTE: as there is a test with custom total_values, expected value cannot be static const anymore. (line 630) Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	e77af8573b	raft: replication test: remove checksum Previously, entries were added in parallel and we needed to check if order was broken. Using a simple checksum was better than a hash as you could easily find the position it broke (we add consecutive numbers). Now order of entries is forced so it's not useful. This patch removes it. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Alejo Sanchez	9335941b49	raft: replication test: remove unused class param persisted_snapshots is not used in state_machine class. Remove it. Reported by @kbraun Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Benny Halevy	9cc45fe5c9	flat_mutation_reader: consume_mutation_fragments_until: maybe yield after each popped mutation_fragment Address the following reactor stall seen with 4.6.dev-0.20210421.2ad09d0bf: ``` 2021-04-29T17:19:11+00:00 perf-latency-nemesis-fix-late-db-node-afb25d9a-5 !INFO \| scylla[9515]: Reactor stalled for 19 ms on shard 2. Backtrace: 0x4044de4 0x4044121 0x4044caf 0x7f537c6601df 0x13792ff 0x137fc18 0x11d89ec 0x1444424 0x13edd69 0x12bdc57 0x12bc1fa 0x12bb6f3 0x12ba304 0x12b94ce 0x1282525 0x12812ec 0x1524fda 0x12aa3ec 0x12aa228 0x4057d3f 0x4058fc7 0x407797b 0x40234ba 0x93f8 0x101902 void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791 seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1223 seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1104 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1118 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1206 ?? 
??:0 logalloc::region_impl::object_descriptor::encode(char&, unsigned long) const at ./utils/logalloc.cc:1184 (inlined by) logalloc::region_impl::alloc_small(logalloc::region_impl::object_descriptor const&, unsigned int, unsigned long) at ./utils/logalloc.cc:1293 logalloc::region_impl::alloc(migrate_fn_type const, unsigned long, unsigned long) at ./utils/logalloc.cc:1515 managed_bytes at ././utils/managed_bytes.hh:149 (inlined by) managed_bytes at ././utils/managed_bytes.hh:198 (inlined by) atomic_cell_or_collection::copy(abstract_type const&) const at ./atomic_cell.cc:86 operator() at ./mutation_partition.cc:1462 (inlined by) std::__exception_ptr::exception_ptr compact_radix_tree::tree<cell_and_hash, unsigned int>::copy_slots<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const::{lambda(unsigned int)#1}, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > >(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, cell_and_hash const, unsigned int, unsigned int, compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const::{lambda(unsigned int)#1}&&, row::row(schema const&, column_kind, row const&)::$_14&) at ././utils/compact-radix-tree.hh:1406 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head const&, 
row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:1293 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned 
int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >(compact_radix_tree::variadic_union<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:829 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, 
(compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >(compact_radix_tree::variadic_union<compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned 
int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> > const&, row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:832 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_base<cell_and_hash, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u>, 
compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)1, 4u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)2, 8u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::indirect_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)3, 16u>, compact_radix_tree::tree<cell_and_hash, unsigned int>::direct_layout<cell_and_hash, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)6, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)0, 0u, (compact_radix_tree::tree<cell_and_hash, unsigned int>::layout)4, 32u> >::clone<compact_radix_tree::tree<cell_and_hash, unsigned int>::leaf_node, row::row(schema const&, column_kind, row const&)::$_14&>(row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:837 (inlined by) std::pair<compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head, std::__exception_ptr::exception_ptr> compact_radix_tree::tree<cell_and_hash, unsigned int>::node_head::clone<row::row(schema const&, column_kind, row const&)::$_14&>(row::row(schema const&, column_kind, row const&)::$_14&, unsigned int) const at ././utils/compact-radix-tree.hh:499 void compact_radix_tree::tree<cell_and_hash, unsigned int>::clone_from<row::row(schema const&, column_kind, row const&)::$_14&>(compact_radix_tree::tree<cell_and_hash, unsigned int> const&, 
row::row(schema const&, column_kind, row const&)::$_14&) at ././utils/compact-radix-tree.hh:1866 (inlined by) row at ./mutation_partition.cc:1465 deletable_row at ././mutation_partition.hh:831 (inlined by) rows_entry at ././mutation_partition.hh:940 (inlined by) rows_entry* allocation_strategy::construct<rows_entry, schema const&, clustering_key_prefix const&, deletable_row const&>(schema const&, clustering_key_prefix const&, deletable_row const&) at ././utils/allocation_strategy.hh:153 (inlined by) operator() at ././cache_flat_mutation_reader.hh:467 operator() at ././row_cache.hh:601 (inlined by) decltype(auto) with_allocator<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}::operator()() const::{lambda()#1}>(allocation_strategy&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}::operator()() const::{lambda()#1}) at ././utils/allocation_strategy.hh:311 (inlined by) operator() at ././row_cache.hh:600 (inlined by) decltype(auto) logalloc::allocating_section::with_reclaiming_disabled<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}&>(logalloc::region&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}&) at ././utils/logalloc.hh:757 (inlined by) 
operator() at ././utils/logalloc.hh:779 (inlined by) _ZN8logalloc18allocating_section12with_reserveIZNS0_clIZN5cache11lsa_manager36run_in_update_section_with_allocatorIZNS3_26cache_flat_mutation_reader18maybe_add_to_cacheERK14clustering_rowEUlvE_EEvOT_EUlvE_EEDcRNS_6regionESC_EUlvE_EEDcSC_ at ././utils/logalloc.hh:728 (inlined by) decltype(auto) logalloc::allocating_section::operator()<cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}>(logalloc::region&, cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&)::{lambda()#1}) at ././utils/logalloc.hh:778 (inlined by) void cache::lsa_manager::run_in_update_section_with_allocator<cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}>(cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&)::{lambda()#1}&&) at ././row_cache.hh:599 (inlined by) cache::cache_flat_mutation_reader::maybe_add_to_cache(clustering_row const&) at ././cache_flat_mutation_reader.hh:459 (inlined by) cache::cache_flat_mutation_reader::maybe_add_to_cache(mutation_fragment const&) at ././cache_flat_mutation_reader.hh:446 operator() at ././cache_flat_mutation_reader.hh:321 (inlined by) operator() at ././flat_mutation_reader.hh:549 seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, 
std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&>(flat_mutation_reader) at ././seastar/include/seastar/core/future.hh:2135 (inlined by) auto seastar::futurize_invoke<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > 
>)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&>(flat_mutation_reader) at ././seastar/include/seastar/core/future.hh:2166 (inlined by) {lambda()#2} seastar::do_until<consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, consume_mutation_fragments_until<{lambda()#1}, mutation_fragment, {lambda(mutation_fragment)#1}>(seastar::future, flat_mutation_reader, {lambda()#1}, mutation_fragment, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > 
>)::{lambda()#1}>(flat_mutation_reader&, seastar::future<void>) at ././seastar/include/seastar/core/loop.hh:341 (inlined by) seastar::future<void> consume_mutation_fragments_until<cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(flat_mutation_reader&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda(mutation_fragment)#1}&&, cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:547 (inlined by) cache::cache_flat_mutation_reader::read_from_underlying(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:317 (inlined by) operator() at ././cache_flat_mutation_reader.hh:277 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > 
>)::{lambda()#2}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:2135 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&, seastar::internal::monostate) at ././seastar/include/seastar/core/future.hh:1979 (inlined by) seastar::future<void> seastar::future<void>::then_impl<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:1601 (inlined by) seastar::internal::future_result<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, void>::future_type seastar::internal::call_then_impl<seastar::future<void> >::run<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(seastar::future<void>&, seastar::internal::future_result&&) at ././seastar/include/seastar/core/future.hh:1234 (inlined by) seastar::future<void> seastar::future<void>::then<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 
1000l> > >)::{lambda()#2}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}&&) at ././seastar/include/seastar/core/future.hh:1520 (inlined by) cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:276 operator() at ././cache_flat_mutation_reader.hh:266 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:2135 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&, seastar::internal::monostate) at ././seastar/include/seastar/core/future.hh:1979 (inlined by) seastar::future<void> seastar::future<void>::then_impl<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:1601 (inlined by) 
seastar::internal::future_result<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, void>::future_type seastar::internal::call_then_impl<seastar::future<void> >::run<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>(seastar::future<void>&, seastar::internal::future_result&&) at ././seastar/include/seastar/core/future.hh:1234 (inlined by) seastar::future<void> seastar::future<void>::then<cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void> >(cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&&) at ././seastar/include/seastar/core/future.hh:1520 (inlined by) cache::cache_flat_mutation_reader::do_fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:265 operator() at ././cache_flat_mutation_reader.hh:240 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2135 (inlined by) auto seastar::futurize_invoke<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > 
>)::{lambda()#3}&>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2166 (inlined by) seastar::future<void> seastar::do_until<cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}, cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}) at ././seastar/include/seastar/core/loop.hh:341 (inlined by) cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:239 (inlined by) operator() at ././cache_flat_mutation_reader.hh:230 cache::cache_flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././cache_flat_mutation_reader.hh:235 flat_mutation_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:405 (inlined by) seastar::future<bool> flat_mutation_reader::impl::fill_buffer_from<flat_mutation_reader>(flat_mutation_reader&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ./flat_mutation_reader.cc:203 operator() at ./row_cache.cc:406 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> 
>::invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2135 (inlined by) auto seastar::futurize_invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}&) at ././seastar/include/seastar/core/future.hh:2166 (inlined by) seastar::future<void> seastar::do_until<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#2}, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#3}) at ././seastar/include/seastar/core/loop.hh:341 (inlined by) single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ./row_cache.cc:405 (inlined by) operator() at ./row_cache.cc:402 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, 
std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::__invoke_other, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60 (inlined by) std::__invoke_result<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>::type std::__invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::__invoke_result&&, (single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&)...) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:95 (inlined by) std::invoke_result<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>::type std::invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&>(std::invoke_result&&, (single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&)...) 
at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/functional:88 (inlined by) auto seastar::internal::future_invoke<single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&, seastar::internal::monostate>(single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}&, seastar::internal::monostate&&) at ././seastar/include/seastar/core/future.hh:1209 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1582 (inlined by) _ZN7seastar8futurizeINS_6futureIvEEE22satisfy_with_result_ofIZZNS2_14then_impl_nrvoIZN34single_partition_populating_reader11fill_bufferENSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEEUlvE_S2_EET0_OT_ENKUlONS_8internal22promise_base_with_typeIvEERSF_ONS_12future_stateINSJ_9monostateEEEE_clESM_SN_SR_EUlvE_EEvSM_SI_ at ././seastar/include/seastar/core/future.hh:2120 operator() at ././seastar/include/seastar/core/future.hh:1575 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, single_partition_populating_reader::fill_buffer(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, seastar::future>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2228 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2637 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2796 operator() 
at ./build/release/seastar/./seastar/src/core/reactor.cc:3987 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(std::__invoke_other, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:60 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>, void>::type std::__invoke_r<void, seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/invoke.h:110 (inlined by) std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_97>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:291 std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/10/../../../../include/c++/10/bits/std_function.h:622 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60 ?? ??:0 ?? ??:0 ``` Fixes #8579 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8580	2021-05-03 14:06:26 +03:00
Avi Kivity	9d018b5f40	Update seastar submodule * seastar 0b2c25d133...f1b6b95b69 (10): > Merge "Cap streams buffer sizes with IO limits" from Pavel > io_queue: fix mismatched class/struct tag for priority_class_data > perftune.py: instrument bonding tuning flow with 'nic' parameter > perftune.py: strip a newline off the 'scheduler' file content > perftune.py: add support for virtual non-raid disk devices > doc: fix typos in doc/tutorial.md > Merge "Add IO in-disk stats" from Pavel E > iotune: Perform fs-check on all directories > file: Keep reference on io-queue > Merge "Assorted set of improvements over io-queue" from Pavel E	2021-05-03 13:21:25 +03:00
Eliran Sinvani	fc93133cbe	Service level controller: fix wrong default service level removal log An out of block log print resulted in repeated prints about removal of the default service level. The period of this print is every time the configuration is scanned for changes. It happens when the default service level is one of the last on the map (sorted as in the map). Fixes #8567 Closes #8576	2021-05-03 09:08:41 +03:00
Pavel Solodovnikov	4c351ff260	raft: switch `group_id` type from `uint64_t` to `utils::UUID` Introduce a tagged id struct for `group_id`. Raft code would want to generate quite a lot of unique raft groups in the future (e.g. tablets). UUID is designed exactly for that (e.g. larger capacity than `uint64_t`, obviously, and also has built-in procedures to generate random ids). Also, this is a preparation to make "raft group 0" use a random ID instead of a literal fixed `0` as a group id. The purpose is that every scylla cluster must have a unique ID for "raft group 0" since we don't want the nodes from some other cluster to disrupt the current cluster. This can happen if, for some reason, a foreign node happens to contact a node in our cluster. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>	2021-05-02 16:39:54 +03:00
Pavel Solodovnikov	a7bd7dd122	utils: make basic UUID constructors constexpr Mark default and `UUID(most_sig_bits, least_sig_bits)` ctors as constexpr. This allows to construct constexpr constants using UUID type. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210429170630.533596-2-pa.solodovnikov@scylladb.com>	2021-05-02 16:39:52 +03:00
Avi Kivity	ae660eeec4	logalloc: reduce minimum lsa reserve in allocating_section to 1 Many workloads have fairly constant and small request sizes, so we don't need large reserves for them. These workloads suffer needlessly from the current large reserve of 10 segments (1.2MB) when they really need a few hundred bytes. Reduce the reserve to a minimum of 1 segment. Note that due to #8542 this can make a large difference. Consider a workload that has a 1000-byte footprint in cache. If we've just consumed some free memory and reduced the reserve to zero, then we'll evict about 50,000 objects before proceeding to compact. With the reserve reduced to 1, we'll evict 128 objects. All this for 1000 bytes of memory. Of course, #8542 should be fixed, but reducing the reserve provides some quick relief and makes sense even with the larger fix. The reserve will quickly grow for workloads that handle bigger requests, so they won't see an impact from the reduction. Closes #8572	2021-05-02 15:22:04 +02:00
Pavel Emelyanov	6de8bb663b	storage_service: Remove migration notifier dependency The only reason why storage service keeps a reference to the migration notifier is that the latter was needed by cdc before the previous patch. Now cdc gets the notifier directly from main, so storage service is a bit more off the hook. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-29 22:47:13 +03:00
Pavel Emelyanov	cc813ef0dd	cdc: Remove db_context::builder Right now the builder is just an opaque transfer between cdc_service constructor args and cdc_service's db_context constructor args. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-29 22:46:57 +03:00
Pavel Emelyanov	3a7ca647af	cdc: Provide migration notifier right at once The only way db_context's migration notifier reference is set up is via cdc_service->db_context::builder->.build chain of calls. Since the builder's notifier optional reference is always disengaged (the .with_migration_notifier is removed by previous patch) the only possible notifier reference there is from the storage service which, in turn, is the same as in main.cc. Said that -- push the notifier reference onto db_context directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-29 22:40:24 +03:00
Pavel Emelyanov	421a514c30	cdc: Remove db_context::builder::with_migration_notifier It's unused and removing it makes next patch's life simpler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-29 22:39:12 +03:00
Piotr Grabowski	b1650114eb	cdc: log: fill cdc$deleted_ columns in pre-images Before this change, cdc$deleted_ columns were all NULL in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example: INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1); INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1); For this example, pre-image generated for the second operation would look like this (in both 'true' and 'full' pre-image mode): pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 v=NULL has two meanings: 1. If pre-image was in 'true' mode, v=NULL describes that v was not affected (affected columns: pk, ck, v2). 2. If pre-image was in 'full' mode, v=NULL describes that v was equal to NULL in the pre-image. Therefore, to properly decode pre-images you would need to know in which mode pre-image was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check a current mode of pre-image). A solution to this problem is to fill in the cdc$deleted_ columns for pre-images. After this change, for the INSERT described above, CDC now generates the following log row: If in pre-image 'true' mode: pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 If in pre-image 'full' mode: pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1 A client library now can properly decode a pre-image row. If it sees a NULL value, it can now check the cdc$deleted_ column to determine if this NULL value was a part of pre-image or it was omitted due to not being an affected column in the delta operation. No such change is necessary for the post-image rows, as those images are always generated in the 'full' mode. Additional example of trouble decoding pre-images before this change. 
tbl2 - 'true' pre-image mode, tbl3 - 'full' pre-image mode: INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1); INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1); INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1); generated pre-image: pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1); generated pre-image: pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1 Both pre-images look the same, but: 1. v=NULL in tbl2 describes v being omitted from the pre-image. 2. v=NULL in tbl3 describes v being NULL in the pre-image.	2021-04-29 18:04:07 +02:00
Botond Dénes	0e8818f6ac	scylla-gdb.py: scylla apply: don't change current shard Scylla apply iterates over all shards in undetermined order and leaves the last shard as the current one. This is counter-intuitive and can lead to surprises as the user might not expect the current shard to be changed by a command that just executes a command on each shard. This patch ensures that both in case of the happy and error paths the current shard is unchanged. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210429104937.61315-1-bdenes@scylladb.com>	2021-04-29 14:48:29 +03:00
Botond Dénes	9fc3cba055	sstables: improve error message for invalid sstable paths The error message currently complains about "invalid version" and later says the reason is that the path is not recognized. This is confusing so change the error message to start with "invalid path" instead. It is the path that is invalid not the version after all. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210429092749.52659-1-bdenes@scylladb.com>	2021-04-29 12:50:48 +03:00
Botond Dénes	824b49aeb4	tools/scylla-sstable-index: use defer() to close sstables manager So it is closed when loading the sstable throws an exception too. Failing to close the manager will mask the real error as the user will only see the assert failure due to failing to close the manager. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210429092248.50968-1-bdenes@scylladb.com>	2021-04-29 12:50:25 +03:00
Eliran Sinvani	0320110b04	messaging service: be more verbose when shutting down servers and clients We encountered a phenomenon where shutting down the messaging service doesn't complete, leaving the shutdown process stuck. Since we couldn't pinpoint where exactly the shutdown went wrong, here we add some verbosity to the shutdown stages so we can more accurately pinpoint the culprit. Closes #8560	2021-04-29 12:28:17 +03:00
Botond Dénes	26ae9555d1	test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front The fuzzy test consumes a large chunk of resource from the semaphore up-front to simulate a contested semaphore. This isn't an accurate simulation, because no permit will have more than 1 unit in reality. Furthermore this can even cause a deadlock since `8aaa3a7` as now we rely on all count units being available to make forward progress when memory is scarce. This patch just cuts out this part of the test. We now have a dedicated unit test for checking a heavily contested semaphore that does it properly, so there is no need to try to fix this clumsy attempt that is just making trouble at this point. Refs: #8493 Tests: release(multishard_mutation_query_test:fuzzy_test) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>	2021-04-29 11:45:53 +03:00
Benny Halevy	ad8d93dd1a	repair: repair_meta::stop: demote log message to debug level This log message was added in `77cc694a08`. info log level was erroneously left over from development and it's too noisy. Demote it to debug level. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210429071539.1264244-1-bhalevy@scylladb.com>	2021-04-29 11:07:59 +03:00
Avi Kivity	706428a8a3	Merge 'cql3: Check if partition-key restrictions are all EQ at preparation time' from Dejan Mircevski Previously, we checked if all partition-key restrictions were EQ at runtime. This is, however, known already at prep time; no need to redo it on every query execution. Move the check to prep time. Tests: unit (dev, debug), perf_simple_query Closes #8565 * github.com:scylladb/scylla: cql3: Replace runtime check with a prepared flag cql3: Track IN partition-key restrictions cql3: Inline add_single_column_restriction cql3: Inline statement_restrictions::add_restriction	2021-04-29 08:41:16 +03:00
Dejan Mircevski	57fa66a0a7	cql3: Replace runtime check with a prepared flag Checking that every PK restriction is an EQ was happening at runtime. This is wasteful, as the result is always the same. Replace that check with a flag computed once at preparation time. Separate the simple-case processing into its own function rather than pass the flag as an extra parameter. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-28 16:44:48 -04:00
Dejan Mircevski	4661aa0269	cql3: Track IN partition-key restrictions Add a bool member to statement_restrictions indicating whether any of the partition columns are restricted by IN, which requires more complex processing. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-28 15:47:32 -04:00
Tomasz Grabiec	6c168ee0eb	row_cache: Always touch the partition on entry This fixes a potential cause for reactor stalls during memory reclamation. Applies only to schemas without clustering columns. Every partition in cache has a dummy row at the end of the clustering range (last dummy). That dummy must be evicted last, because MVCC logic needs it to be there at all times. If LRU picks it for eviction and it's not the last row, eviction does nothing and moves on. Eventually, all other rows in this partition will be evicted too and then the partition will go away with it. Mutation reader updates the position of rows in the LRU (aka touching) as it walks over them. However, it was not always touching the last dummy row. If the partition was fully cached, and schema had no clustering key, it would exit early before reaching the last dummy row, here: inline void cache_flat_mutation_reader::move_to_next_entry() { clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position()); if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) { move_to_next_range(); That's because no_clustering_row_between() is always true for any key in tables with no clustering columns, and the reader advances to end-of-stream without advancing _next_row to the last dummy. This is expected and desired, it means that the query range ends at the current row and there is no need to move further. We would not take this exit for tables with a non-singular clustering key domain and open-ended query range, since there would be a key gap before the last dummy row. Refs #2972. The effect of leaving the last dummy row not touched will be that such scans will segregate rows in the LRU, bring all regular rows to the front, and dummy rows at the tail. When eviction reaches the band of dummy rows, it will have to walk over it, because evicting them releases no memory. This can cause a reactor stall. 
An easy fix for the scenario would be to always touch the dummy entry when entering the partition. It's unlikely that the read will not proceed to the regular rows. It would be best to avoid linking such dummies in the LRU, but that's a much more complex change. Discovered in perf_row_cache_update, test_small_partitions(). I saw 200ms stalls with -m8G. Refs #8541. Tests: - row_cache_test (release) - perf_simple_query [no change] Message-Id: <20210427111619.296609-1-tgrabiec@scylladb.com>	2021-04-28 21:59:28 +03:00
Dejan Mircevski	35e733ee88	cql3: Inline add_single_column_restriction Invoking statement_restrictions::add_single_column_restriction() outside the constructor would leave some data members out-of-date. Prevent it by deleting the method and inlining its body into the only call site. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-28 13:34:53 -04:00
Avi Kivity	2b252ef9b7	test: perf: tidy up executor_stats snapshot computation Now that executor_stats_snapshot() is a member function, we can move the capture of _count into invocations into it, capturing all the stats in one place.	2021-04-28 19:02:35 +03:00
Dejan Mircevski	fc1c9b4289	cql3: Inline statement_restrictions::add_restriction Invoking this method outside the constructor would leave some data members out-of-date. Prevent it by deleting the method and inlining its body into the only call site. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-28 11:57:02 -04:00
Avi Kivity	863b49af03	test: perf: report instructions retired per operations Instructions retired per op is a much more stable metric than time per op (inverse throughput) since it isn't much affected by changes in CPU frequency or other load on the test system (it's still somewhat affected since a slower system will run more reactor polls per op). It's also less indicative of real performance, since it's possible for fewer instructions to execute in more time than more instructions (but that isn't an issue for comparative tests). This allows incremental changes to the code base to be compared with more confidence.	2021-04-28 18:46:55 +03:00
Avi Kivity	0bc98caf3e	test: perf: add RAII wrapper around Linux perf_event_open() Make it easy to embed in other classes. A helper function is provided for the instructions retired counter.	2021-04-28 18:41:02 +03:00
Avi Kivity	498e6b9a64	test: perf: make executor_stats_snapshot() a member function of executor I'd like to add an instructions counter which isn't accessible via a global, so make the snapshot function a member. Out of respect to #1, define functions for getting the number of allocations and tasks processed, as they need heavy header files.	2021-04-28 18:38:35 +03:00
Benny Halevy	96ef204676	dht/token: shard_of: reuse shard_of_minimum_token Returning shard 0 for the minimum token better be hardcoded in one place. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210428113339.1092555-1-bhalevy@scylladb.com>	2021-04-28 15:08:36 +03:00
Benny Halevy	662355519d	dht/i_partitioner: split_range_to_single_shard: drop unused lambda capture of start_shard Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210428113440.1099877-1-bhalevy@scylladb.com>	2021-04-28 15:07:57 +03:00
Benny Halevy	31b80b5752	scylla-gdb: scylla shard: print current shard with no arg Currently `scylla shard` with no args results in the following error: ``` (gdb) scylla shard Traceback (most recent call last): File "master-scylla-gdb.py", line 2384, in invoke id = int(arg) ValueError: invalid literal for int() with base 10: '' Error occurred in Python: invalid literal for int() with base 10: '' ``` Instead, let it just print the current shard, similar to `(gdb) thread`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210428093913.1070051-1-bhalevy@scylladb.com>	2021-04-28 13:04:17 +02:00
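The traceback quoted in the commit above comes from calling `int()` on an empty argument string. A minimal sketch of the described behavior follows, assuming a hypothetical helper `resolve_shard_command`; it is not the code from `scylla-gdb.py`, only an illustration of treating an empty argument as "print the current shard".

```python
def resolve_shard_command(arg, current_shard):
    """Decide what a `scylla shard`-like command should do (hypothetical helper).

    An empty argument means "report the current shard"; anything else is
    parsed as the shard id to switch to.
    """
    text = arg.strip()
    if not text:
        # int('') would raise ValueError, so handle the no-argument case first.
        return ("print", current_shard)
    return ("switch", int(text))
```

A non-numeric argument still raises `ValueError`, which keeps genuinely invalid input visible.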
Avi Kivity	a43e896396	test: perf: don't truncate allocation/req and tasks/req report I used {:.0} to truncate to integer, but apparently that resulted in only one significant digit in the report, so 93.1 was reported as 90. Use the {:5.1f} to avoid truncation, and even get an extra digit (we can have fractional tasks/op due to batching). Current result is 93.1 allocs/op, 20.1 tasks/op (which suggests batch size of around 10). Closes #8550	2021-04-28 12:50:13 +02:00
Avi Kivity	3e6232bb92	Merge "Wire offstrategy compaction to repair-based removenode" from Raphael " From now on, offstrategy compaction is triggered on completion of repair-based removenode. So compaction will no longer act aggressively while removenode is going on, which helps reducing both latency and operation time. Refs #5226. " * 'offstrategy_removenode' of github.com:raphaelsc/scylla: repair: Wire offstrategy compaction to repair-based removenode table: introduce trigger_offstrategy_compaction() repair/row_level: make operations_supported static const	2021-04-28 12:02:07 +03:00
Nadav Har'El	7d2df8a9bc	test/alternator,cql-pytest: fix resource leak on failure In the alternator and cql-pytest test frameworks, we have some convenient contextmanager-based functions that allow us to create a temporary resource (e.g., a table) that will be automatically deleted, for example: with create_stream_test_table(...) as table: test_something(table) However, our implementation of these functions wasn't safe. We had code looking like: table = ... yield table table.delete() The thinking was that the cleanup part (the table.delete()) will be called after the user's code. However, if the user's code threw (e.g., a failed assertion), the cleanup wasn't called... When the user's code throws, it looks as if the "yield" throws. So the correct code should look like: table = ... try: yield table finally: table.delete() Python's contextmanager documentation indeed gives this idiom in its example. This patch fixes all contextmanager implementations in our tests to do the cleanup even if the user's "with" block throws. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210428083748.552203-1-nyh@scylladb.com>	2021-04-28 10:51:02 +02:00
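The difference between the two idioms in the commit above can be seen in a small self-contained sketch. `FakeTable`, `leaky_table`, `safe_table` and `run_failing_block` are hypothetical names introduced only for this illustration; the real test helpers create actual tables.

```python
from contextlib import contextmanager


class FakeTable:
    """Hypothetical stand-in for a test table; records whether delete() ran."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


@contextmanager
def leaky_table(table):
    # Unsafe idiom: if the with-block raises, the exception surfaces at the
    # yield and the cleanup line below is never reached.
    yield table
    table.delete()


@contextmanager
def safe_table(table):
    # Safe idiom: the finally clause runs whether or not the with-block raises.
    try:
        yield table
    finally:
        table.delete()


def run_failing_block(manager, table):
    """Run a with-block that fails, then report whether cleanup happened."""
    try:
        with manager(table):
            raise AssertionError("simulated test failure")
    except AssertionError:
        pass
    return table.deleted
```

With a failing block, `run_failing_block(leaky_table, FakeTable())` reports that the table was not deleted, while the `safe_table` variant reports that it was.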
Takuya ASADA	c9324634ca	scylla_raid_setup: enabling mdmonitor.service on Debian variants On Debian variants, mdmonitor.service cannot be enabled because it is missing an [Install] section, so 'systemctl enable mdmonitor.service' fails and mdmonitor does not run after the system restarts. To force the service to run, add Wants=mdmonitor.service to var-lib-scylla.mount. Fixes #8494 Closes #8530	2021-04-28 11:32:27 +03:00
Asias He	60ba8eb9b8	sstables: Add debug info when create_sharding_metadata generates zero ranges The range passed to create_sharding_metadata is supposed to be owned or at least partially owned by the shard. Log keys, range and split ranges for debugging if the range does not belong to the shard. This is helpful for debugging "Failed to generate sharding metadata for foo.db" issues reported. Refs #7056 Closes #8557	2021-04-28 11:22:06 +03:00
Avi Kivity	abb111297a	Merge 'Calculate partition ranges from expr::expression' from Dejan Mircevski In an ongoing effort to drop the `restrictions` class hierarchy, rewrite the partition-range calculation code to use the new `expression` objects. Refs: #7217 #3815 Tests: unit (dev, debug) Closes #8525 * github.com:scylladb/scylla: cql3: Specialize partition-range computation for EQ cql3: Replace some bounds_ranges calls cql3: Get partition range from expr::expression cql3: Track partition-range expressions	2021-04-28 10:26:00 +03:00
Asias He	84a78f4558	repair: Switch to use NODE_OPS_CMD for bootstrap operation In commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation), we switched replace operation to use the new NODE_OPS_CMD infrastructure. In this patch, we continue the work to switch bootstrap operation to use NODE_OPS_CMD. The benefits: - It is more reliable to detect pending node operations, to avoid multiple topology changes at the same time, than using gossip status. - The cluster reverts to a state before the bootstrap operation automatically in case of error much faster than gossip. - Allows users to pass a list of dead nodes to ignore during bootstrap explicitly. - The BOOTSTRAP gossip status is not needed any more. This is one step closer to achieve gossip-less topology change. Fixes #8472	2021-04-28 09:53:04 +08:00
Dejan Mircevski	84fa370415	cql3: Specialize partition-range computation for EQ Save a couple of allocations per request by treating all-EQ cases specially during the computation of partition ranges. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-27 20:06:57 -04:00
Raphael S. Carvalho	1d5cf2cc5d	repair: Wire offstrategy compaction to repair-based removenode removenode_with_repair() runs on all the nodes that need to sync data from other nodes, so offstrategy compaction can be easily wired by notifying tables when removenode completes. From now on, when the user runs removenode, new sstables produced on the receiving nodes will be added to the table's maintenance set, and when the operation completes, offstrategy compaction will be started to reshape those new sstables before integrating them into the main set. Refs #5226. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-27 12:26:53 -03:00
Piotr Dulikowski	0db45d1df5	repair: introduce abort_source for repair abort Adds an abort_source for repair_tracker which is triggered when all repairs are asked to be stopped. It is currently used by the "wait for all hints to be sent" operation - now, it aborts when repairs are requested to be aborted.	2021-04-27 16:16:57 +02:00
Piotr Dulikowski	3a2d09b644	repair: introduce abort_source for shutdown Adds an abort_source to repair_tracker. The abort source is triggered when the repair subsystem is shut down. Its purpose is to allow operations such as waiting for hints to be sent to abort themselves.	2021-04-27 16:16:57 +02:00
Piotr Dulikowski	958a13577c	storage_proxy: add abort_source to wait_for_hints_to_be_replayed Now, the function wait_for_hints_to_be_replayed will take an abort source and will stop when abort is requested.	2021-04-27 16:16:54 +02:00
Piotr Dulikowski	22e06ace2c	storage_proxy: stop waiting for hints replay when node goes down Now, when repair coordinator is waiting for hints to be replayed on some remote node and that node goes down, it stops waiting for it.	2021-04-27 16:11:47 +02:00
Piotr Dulikowski	9d68824327	hints: dismiss segment waiters when hint queue can't send When a hint queue becomes stuck due to not being able to send to its destination (e.g. destination node is no longer UP, or we failed to send some hints from a file), then it's better to immediately dismiss anybody who waits for hint replay instead of letting them wait until timeout.	2021-04-27 15:58:15 +02:00
Piotr Dulikowski	49f4a2f968	repair: plug in waiting for hints to be sent before repair Now, the repair coordinator will wait until hints are sent between participating nodes before continuing with repair.	2021-04-27 15:58:11 +02:00
Piotr Dulikowski	d9ba743ba1	repair: add get_hosts_participating_in_repair Adds the `get_hosts_participating_in_repair` function which returns a list of hosts participating in repair. This list will be used by repair coordinator to tell other nodes to wait until they replay their hints towards the nodes from the list.	2021-04-27 15:32:03 +02:00
Piotr Dulikowski	46075af7c4	storage_proxy: coordinate waiting for hints to be sent Adds a `wait_for_hints_to_be_replayed` function which waits until all hints between specified endpoints are replayed. For each node, a hint sync point is created. Then, the repair coordinator waits until the hint sync point is reached on every node, or a timeout occurs. This is done by querying each host participating in repair every second to check whether the sync point is still there.	2021-04-27 15:31:42 +02:00
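The polling scheme described in the commit above (create a sync point per node, then query each node every second until the sync point is gone or a timeout expires) can be sketched roughly as follows. This is an illustration under stated assumptions, not the storage_proxy code: `wait_for_sync_points`, `sync_point_exists`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for the example.

```python
import time


def wait_for_sync_points(hosts, sync_point_exists, timeout_s,
                         poll_interval_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll hosts until no sync point remains or the timeout expires.

    sync_point_exists(host) is an assumed callback returning True while the
    host's hint sync point is still present (hints not yet fully replayed).
    Returns True if every sync point disappeared in time, False on timeout.
    """
    deadline = clock() + timeout_s
    pending = set(hosts)
    while pending:
        # Keep only the hosts whose sync point is still there.
        pending = {h for h in pending if sync_point_exists(h)}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
    return True
```

Passing the clock and sleep functions as parameters is a design choice for the sketch only; it lets the loop be exercised without real waiting.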
Piotr Dulikowski	86d831b319	config: add wait_for_hint_replay_before_repair option Adds the `wait_for_hint_replay_before_repair` configuration option. If set to true, the repair coordinator will first wait until the cluster replays its hints towards the nodes participating in repair. It is set to true by default, and is live-updateable. It will be used in subsequent commits from the same PR.	2021-04-27 15:16:26 +02:00
Piotr Dulikowski	485036ac33	storage_proxy: implement verbs for hint sync points Implements HINT_QUEUE_MARK and HINT_QUEUE_SYNC verb handlers in `storage_proxy`.	2021-04-27 15:06:39 +02:00
Piotr Dulikowski	82c419870a	messaging_service: add verbs for hint sync points Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK. Those will make it possible to create a sync point and regularly poll to check its existence.	2021-04-27 15:06:39 +02:00
Piotr Dulikowski	244738b0d5	storage_proxy: add functions for syncing with hints queue Adds two methods to `storage_proxy`: - `create_hint_queue_sync_point` - creates a "hint sync point" which is kept present in storage_proxy until all hint queues on the local node reach their current end. It will also disappear if the given deadline is reached first. - `check_hint_queue_sync_point` - checks if the given hint sync point still exists. The created sync point waits for hint queues in all hint managers, on all shards.	2021-04-27 15:06:39 +02:00
Peter Veentjer	c255903fb0	dist: Added r5b to ena instance_class. The r5b instances also have ena support. For a confirmation that all r5b instances have ena, go to the following page: https://instances.vantage.sh/ Select the r5b and add the 'enhanced networking' column. Then it will show that for every r5b type there is ena support Closes #8546	2021-04-27 15:39:24 +03:00
Nadav Har'El	6de04bbed5	Merge 'Forward-port service level fixes' from Piotr Sarna The original series which forward-ported the service levels into open-source omitted important fixes to their infrastructure. The fixes are hereby ported. Tests: unit(release) Closes #8540 * github.com:scylladb/scylla: workload prioritization: Fix configuration change detection workload prioritization: add exception protection in configuration polling	2021-04-27 13:40:21 +03:00
Eliran Sinvani	02d37cb133	workload prioritization: Fix configuration change detection The configuration change detection is based on a loop that advances two iterators and compares the two collections to deduce the configuration change. To deduce the changes correctly, both collections have to be iterated in key (service level name) order. If they are not, the results are undefined and in some cases can lead to a crash of the system. The bug is that the _service_level_db field was implemented using an unordered_map, which does not guarantee the ordering that the change detection assumes. The fix was simply to change the field type to a map instead of unordered_map. Another problem is that when a static service level (i.e. default) is at the end of the keys list, it is repeatedly being deleted. This doesn't really do anything, since deleting a static service level just retains its default values, but it is still wrong.	2021-04-27 12:29:31 +02:00
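To illustrate why the two-iterator comparison in the commit above depends on key order, here is a minimal sketch of such a diff. `diff_sorted` is a hypothetical function written for this example, not the service level code; it assumes both inputs are lists of `(name, value)` pairs sorted by name.

```python
def diff_sorted(old_items, new_items):
    """Two-iterator diff over (key, value) lists that must be sorted by key.

    Returns (added, removed, changed) lists of keys. If the inputs are not
    sorted, the same key can be misreported as both added and removed.
    """
    added, removed, changed = [], [], []
    i = j = 0
    while i < len(old_items) and j < len(new_items):
        old_key, old_val = old_items[i]
        new_key, new_val = new_items[j]
        if old_key == new_key:
            if old_val != new_val:
                changed.append(old_key)
            i += 1
            j += 1
        elif old_key < new_key:
            # Key exists only in the old collection.
            removed.append(old_key)
            i += 1
        else:
            # Key exists only in the new collection.
            added.append(new_key)
            j += 1
    removed.extend(k for k, _ in old_items[i:])
    added.extend(k for k, _ in new_items[j:])
    return added, removed, changed
```

Calling it with `sorted(old.items())` and `sorted(new.items())` gives a consistent result; feeding it items in arbitrary order breaks the comparison, which is the failure mode an unordered container invites.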
Eliran Sinvani	946fc6af08	workload prioritization: add exception protection in configuration polling Exceptions around the polling loop were not handled properly. This is an issue because an unhandled exception that slips out to the configuration polling loop itself will break it. When the configuration polling loop is broken, any further change to the configuration will not be acted upon in the nodes where the loop is broken until the node is restarted. The chances for exceptions are now greater than before, since in one of the previous commits we started querying the workload prioritization configuration table with a sensible, shorter timeout. This change also adds a logger for the workload prioritization module and some logging, mainly around the configuration polling loop. Most logs are added at the info level since they are not expected to happen frequently, but when they do we would like to have some information by default regarding what broke the loop.	2021-04-27 12:29:31 +02:00
Avi Kivity	7a6b678044	Update tools/java submodule for EveryWhere compaction strategy * tools/java 57eb143119...fd92603b99 (1): > Add EverywhereStrategy class	2021-04-27 12:23:23 +03:00
Nadav Har'El	f50db50d10	test/cql-pytest: test for "WHERE v=NULL" in restrictions Issues #4476 and #8489, and also Cassandra's CASSANDRA-10715, all request that filtering with "WHERE v=NULL" should return the rows where the column v is unset. However, we made a deliberate decision to do something else: That "WHERE v=NULL" should match no row. Exactly like it does in SQL. This is what this test verifies - that "WHERE v=NULL" never matches any row - not even rows where "v" is unset. This test is expected to fail on Cassandra (so marked cassandra_bug), because in Cassandra the "WHERE v=NULL" restriction is forbidden, instead of succeeding and returning nothing. Although we differ here from Cassandra, after a lot of deliberation we decided that Scylla's behavior is the correct one, so this test verifies it. Refs #4776. Refs #8489. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210426183145.323301-1-nyh@scylladb.com>	2021-04-27 09:26:33 +03:00
Dejan Mircevski	e2c8ff6bf2	gdb: Fix heapprof() dereferencing of backtrace Seastar seems to have added another layer of indirection to alloc_site_list_head/backtrace, so scylla_heapprof() can't find the members it's looking for, resulting in errors. Fix it by dereferencing the added layer. Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8551	2021-04-27 01:49:26 +02:00
Kamil Braun	4c95277619	raft: fsm: fix assertion failure on stray rejects When probes are sent over a slow network, the leader would send multiple probes to a lagging follower before it would get a reject response to the first probe back. After getting a reject, the leader will be able to correctly position `next_idx` for that follower and switch to pipeline mode. Then, an out of order reject to a now irrelevant probe could crash the leader, since it would effectively request it to "rewind" its `match_idx` for that follower, and the code asserts this never happens. We fix the problem by strengthening `is_stray_reject`. The check that was previously only made in `PIPELINE` case (`rejected.non_matching_idx <= match_idx`) is now always performed and we add a new check: `rejected.last_idx < match_idx`. We also strengthen the assert. The commit improves the documentation by explaining that `is_stray_reject` may return false negatives. We also precisely state the preconditions and postconditions of `is_stray_reject`, give a more precise definition of `progress.match_idx`, argue how the postconditions of `is_stray_reject` follow from its preconditions and Raft invariants, and argue why the (strengthened) assert must always pass. Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>	2021-04-27 01:07:22 +02:00
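The two conditions named in the commit above can be written out as a small predicate. This is only a sketch of those two checks, assuming plain integer indexes; the function name `is_stray_reject_sketch` is hypothetical, and the real `is_stray_reject` may perform further state-dependent checks that the commit message does not spell out.

```python
def is_stray_reject_sketch(non_matching_idx, last_idx, match_idx):
    """Return True if a reject is certainly stale, per the two quoted checks.

    non_matching_idx and last_idx stand for the fields of the reject reply;
    match_idx is the leader's recorded match index for that follower.
    A False result does not prove the reject is fresh (false negatives are
    possible, as the commit message notes).
    """
    # Check previously made only in PIPELINE mode, now always performed.
    if non_matching_idx <= match_idx:
        return True
    # New check added by the commit.
    if last_idx < match_idx:
        return True
    return False
```

Both checks reject replies that would ask the leader to move `match_idx` backwards, which is the situation the strengthened assert guards against.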
Pavel Solodovnikov	fba1910770	raft: fix incorrect rpc setup in `server_impl::start()` RPC configuration was updated only when an instance was started with an initial snapshot. In case we don't have an initial snapshot, but do have a non-empty log with a configuration entry, the RPC instance isn't set up correctly. Fix that by moving RPC setup code outside the check for snapshot id and look at `_log.get_configuration()` instead. Also, set up RPC mappings both for `current` and `previous` components, since in case the last configuration index points to an entry from the log, it can happen to be a joint configuration entry. For example, this can happen if a leader made an attempt to change configuration, but failed shortly afterwards without being able to commit the new configuration. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com> Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>	2021-04-26 20:46:50 +02:00
Nadav Har'El	f17de6ca45	test/cql-pytest: test that "!=" not supported in WHERE Our documentation of SELECT https://docs.scylladb.com//getting-started/dml suggests that like a "=" operator exists, there is also a "!=" operator. However, this is not true: The != operator (which is recognized by the parser) is not allowed in WHERE clauses. This test verifies that this is indeed the case - neither Cassandra nor Scylla allow this operator in WHERE clauses. Refs https://github.com/scylladb/scylla-doc-issues/issues/732 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210426165511.318066-1-nyh@scylladb.com>	2021-04-26 20:23:21 +03:00
Piotr Sarna	8aaa3a7bb8	Merge 'reader_permit: always forward resources to the semaphore' from Botond This series is a conceptual revert of `4c8ab10`, which turned out to be a misguided defense mechanism that proved to be a hotbed for bugs. This protection was superseded by `0fe75571d9` which guarantees forward progress at all times without all the gotchas and bad interactions introduced by `4c8ab10`. The latest instance of bad interaction that triggered this series is a case of resource units being leaked when a previously evicted reader is re-admitted, leaking already owned resources on each re-admission. To prove that neither the resource leak, nor the deadlock `4c8ab10` was supposed to guard against exists after this series, it includes two unit tests stressing the respective areas: readmission and admission on a highly contested semaphore. Fixes: #8493 Also on: https://github.com/denesb/scylla.git reader-permit-resource-leak-v2 Changelog v2: * Rebase over the recently merged reader close series. Fix merge conflicts and an exposed bug. * 'reader-permit-resource-leak-v2' of https://github.com/denesb/scylla: test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units reader_concurrency_semaphore: add dump_diagnostics() reader_permit: always forward resources reader_concurrency_semaphore: inactive_read_handle: abandon(): close reader	2021-04-26 16:30:18 +02:00
Nadav Har'El	732fc9ba00	Merge 'Add username to alternator tracing' from Piotr Sarna This series adds filling the `username` column in alternator tracing info, if the username is available. When alternator is enforcing authorization, each request contains a username in its headers. The difference is as follows. A tracing entry excerpt before the series: ``` { (...) 'source_ip': '::', 'table_names': 'alternator_Pets.Pets', 'username': '<unauthenticated request>' } ``` and after the series: ``` { (...) 'source_ip': '::', 'table_names': 'alternator_Pets.Pets', 'username': 'alternator' } ``` This series also modifies one of the tests to check the username column. Fixes #8547 Closes #8548 * github.com:scylladb/scylla: test: add username verification to alternator tracing tests alternator: add user context to tracing alternator: return username when verifying signature	2021-04-26 16:30:15 +02:00
Botond Dénes	45d580f056	test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress This unit test checks that the semaphore doesn't get into a deadlock when contended, in the presence of many memory-only reads (that don't wait for admission). This is tested by simulating the 3 kinds of reads we currently have in the system: * memory-only: reads that don't pass admission and only own memory. * admitted: reads that pass admission. * evictable: admitted reads that are furthermore evictable. The test creates and runs a large number of these reads in parallel, read kinds being selected randomly, then creates a watchdog which kills the test if no progress is being made.	2021-04-26 15:57:17 +03:00
Botond Dénes	cadc26de38	test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units This unit test passes a read through admission again and again, just as an evictable reader would be during its lifetime. When readmitted, the read sometimes has to wait and sometimes not. This is to check that readmitting a previously admitted reader doesn't leak any units.	2021-04-26 15:57:17 +03:00
Botond Dénes	d246e2df0a	reader_concurrency_semaphore: add dump_diagnostics() Allow semaphore related tests to include a diagnostics printout in error messages to help determine why the test failed.	2021-04-26 15:56:56 +03:00
Botond Dénes	caaa8ef59a	reader_permit: always forward resources This commit conceptually reverts `4c8ab10`. Said commit was meant to prevent the scenario where memory-only permits -- those that don't pass admission but still consume memory -- completely prevent the admission of reads, possibly even causing a deadlock because a permit might even block its own admission. The protection introduced by said commit however proved to be very problematic. It made the status of resources on the permit very hard to reason about and created loopholes via which permits could accumulate without tracking or they could even leak resources. Instead of continuing to patch this broken system, this commit does away with this "protection" based on the observation that deadlocks are now prevented anyway by the admission criteria introduced by `0fe75571d9`, which admits a read anyway when all the initial count resources are available (meaning no admitted reader is alive), regardless of availability of memory. The benefit of this revert is that the semaphore now knows about all the resources and is able to do its job better as it is not "lied to" about resources by the permits. Furthermore, the status of a permit's resources is much simpler to reason about; there are no more loopholes in unexpected state transitions to swallow/leak resources. To prove that this revert is indeed safe, in the next commit we add robust tests that stress test admission on a highly contested semaphore. This patch also does away with the registered/admitted differentiation of permits, as this doesn't make much sense anymore; instead these two are unified into a single "active" state. One can always tell whether a permit was admitted or not from whether it owns count resources anyway.	2021-04-26 15:56:56 +03:00
Botond Dénes	2b66f7222e	reader_concurrency_semaphore: inactive_read_handle: abandon(): close reader `fa43d7680` recently introduced mandatory closing of readers before they are destroyed. One reader destroy path that was left not closing the reader before destruction is `inactive_reader_handle::abandon()`. This path is executed when the handle is destroyed while still referring to a non-evicted inactive read. This patch fixes it up to close the reader and adds a small unit test which checks that this happens.	2021-04-26 15:56:54 +03:00
Piotr Dulikowski	427bbf6d86	db/hints: make it possible to wait until current hints are sent Implements `wait_until_hints_are_replayed` method returning a future which blocks until either all current hint segments are replayed (returns success in this case), or when provided timeout is reached (returns a timeout exception in this case).	2021-04-26 13:57:03 +02:00
Piotr Sarna	0779fa8428	test: add username verification to alternator tracing tests The test case now additionally checks if the username entry from found tracing events matches the username used by the test suite.	2021-04-26 11:54:02 +02:00
Piotr Sarna	1b400b07b9	alternator: add user context to tracing Before this patch, each entry in alternator tracing included an "<unauthenticated request>" field. That is not really true, because most alternator requests are actually performed by authenticated users (unless auth is disabled).	2021-04-26 11:54:01 +02:00
Piotr Sarna	ddd9c2f2d7	alternator: return username when verifying signature The username will be used later for tracing purposes. It will also very likely be useful later when we decide to add ACL support.	2021-04-26 11:53:19 +02:00
Avi Kivity	5801c93715	utils: rjson: convert enable_if to concept Simpler and easier to understand. Vague comment about enable_if removed. Closes #8405	2021-04-25 21:53:46 +03:00
Botond Dénes	f7f5fca5a8	Add very basic coverage report generation support This patch introduces the most basic bare infrastructure to generate coverage reports as well as a guide on how to manually generate them. Although this barely qualifies as "support", it already allows one to generate a coverage report with the help of this guide. One immediate limitation of this patch is that it only supports clang, which is not a terrible problem, given that it's our main compiler currently. Future patches will build on this to incrementally improve and automate this: * Provide script to automatically merge profraw files and generate html report from it. * Integrate into test.py, adding a flag which causes it to generate a coverage report after a run. * Support GCC too, but at least auto-detect whether clang is used or not. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210423140100.659452-1-bdenes@scylladb.com>	2021-04-25 15:59:20 +03:00
Avi Kivity	fa43d7680c	Merge "Close flat mutation readers" from Benny " This patchset adds future-returning close methods to all flat_mutation_reader-s and makes sure that all readers are explicitly closed and waited for. The main motivation for doing so is for providing a path for cancelling outstanding i/o requests via a the input_stream close (See https://github.com/scylladb/seastar/issues/859) and wait until they complete. Also, this series also introduces a stop method to reader_concurrency_semaphore to be used when shutting down the database, instead of calling clear_inactive_readers in the database destructor. The series does not change microbenchmarks performance in a significant way. It looks like the results are within the tests' jitter. - perf_simple_query: (in transactions per second, more is better) before: median 184701.83 tps (90 allocs/op, 20 tasks/op) after: median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%) - perf_mutation_readers: (in time per iteration, less is better) combined.one_row 65.042ns -> 57.961ns (-10.9%) combined.single_active 46.634us -> 46.216us ( -0.9%) combined.many_overlapping 364.752us -> 371.507us ( +1.9%) combined.disjoint_interleaved 43.634us -> 43.448us ( -0.4%) combined.disjoint_ranges 43.011us -> 42.991us ( -0.0%) combined.overlapping_partitions_disjoint_rows 57.609us -> 58.820us ( +2.1%) clustering_combined.ranges_generic 93.464ns -> 96.236ns ( +3.0%) clustering_combined.ranges_specialized 86.537ns -> 87.645ns ( +1.3%) memtable.one_partition_one_row 903.546ns -> 957.639ns ( +6.0%) memtable.one_partition_many_rows 6.474us -> 6.444us ( -0.5%) memtable.one_large_partition 905.593us -> 878.271us ( -3.0%) memtable.many_partitions_one_row 13.815us -> 14.718us ( +6.5%) memtable.many_partitions_many_rows 161.250us -> 158.590us ( -1.6%) memtable.many_large_partitions 24.237ms -> 23.348ms ( -3.7%) average -0.02% Fixes #1076 Refs #2927 Test: unit(release, debug) Perf: perf_mutation_readers, perf_simple_query (release) Dtest: 
next-gating(release), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug) " * tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits) mutation_reader: shard_reader: get rid of stop mutation_reader: multishard_combining_reader: get rid of destructor flat_mutation_reader: abort if not closed before destroyed flat_mutation_reader: require close repair: row_level_repair: run: close repair_meta when done repair: repair_reader: close underlying reader on_end_of_stream perf: everywhere: close flat_mutation_reader when done test: everywhere: close flat_mutation_reader when done mutation_partition: counter_write_query: close reader when done index: built_indexes_reader: implement close mutation_writer: multishard_writer: close readers when done mutation_writer: feed_writer: close reader when done table: for_all_partitions_slow: close iteration_step reader when done view_builder: stop: close all build_step readers stream_transfer_task: execute: close send_info reader when done view_update_generator: start: close staging_sstable_reader when done view: build_progress_virtual_reader: implement close method view: generate_view_updates: close builder readers when done view_builder: initialize_reader_at_current_token: close reader before reassigning it view_builder: do_build_step: close build_step reader when done ...	2021-04-25 13:53:11 +03:00
Avi Kivity	54b76e82bc	Merge "Make migration manager main-local" from Pavel " There are a few places left that call for migration manager by global reference. This set patches all those places and makes the migration manager a service that locally lives in main(). Surprisingly, the largest changes are to get rid of global migration manager calls from ... the migration manager itself. Two tricks here. First, repair code gets its private global migration manager pointer. That's not nice, but it is aligned with current repair design -- all its references are now "global". Some day they all will be moved into sharded repair service, for now these globals just describe the real dependencies of the repair code. Second is storage proxy that needs to call migration manager to get schema. Proper layering makes migration manager sit on top of storage proxy, so the direct back-reference is not nice. To overcome this the proxy gets migration manager's shared_from_this() pointer and drops all of them on stop. This makes sure that by the time migration manager stops no references from proxy exist. tests: unit(dev), start-stop, start-drain-stop " * 'br-turn-migration-manager-local' of https://github.com/xemul/scylla: (21 commits) migration_manager: Make it main-local tests: Have own migration manager instances tests: Use migration_manager from cql_test_env migration_manager: Call maybe_sync from this migration_manager: Make get_schema_for_... 
methods migration_manager: Hide get_schema_definition streaming: Keep migration_manager ptr in rpc lambdas storage_proxy: Keep migration_manager ptr in rpc lambdas streaming: Get migration_manager shared_ptr in messaging storage_proxy: Get migration_manager shared_ptr in messaging migration_manager: Make maybe_sync a method migration_manager: Open-code merge lambda migration_manager: Turn do_announce_new_type non-static migration_manager: Make announce() non-static method storage_servive: Use local migration manager storage_service: Keep migration manager on board migration_manager: Use 'this' where appropriate repair: Use private migration manager pointer repair: Keep private sharded migration manager pointer redis: Carry sharded migration manager over init ...	2021-04-25 13:29:16 +03:00
Nadav Har'El	de938eba8c	Reduce dependency on header utils/rjson.hh If utils/rjson.hh is modified, 300 (!) source files get recompiled. This is frustrating for anyone working on this header file (like me). Moreover - utils/rjson.hh includes the large rapidjson header files (rapidjson is a header-only library!), slowing the compilation of all these 300 files. It turns out most includers of utils/rjson.hh get it because column_computation.hh includes it. But the fact that column computations are serialized as JSON is an internal implementation detail that the users of this header don't need to know - and they care even less that this JSON implementation uses utils/rjson.hh. So in this patch column_computation.hh no longer includes rjson.hh, and no longer exposes a method taking a rjson::value that was never used outside the implementation. After this patch, touching utils/rjson.hh only recompiles 21 files. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210422183526.114366-1-nyh@scylladb.com>	2021-04-25 13:20:51 +03:00
Benny Halevy	5ca8f28297	storage_service: load_new_sstables: log success message as info, not warning Success is important, but nothing to be warned about. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210425070909.476226-1-bhalevy@scylladb.com>	2021-04-25 12:39:47 +03:00
Benny Halevy	6e62ec8c24	mutation_reader: shard_reader: get rid of stop Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	fc5e4688db	mutation_reader: multishard_combining_reader: get rid of destructor Now that the multishard_combining_reader is guaranteed to be closed, there is no longer a need to stop the shard readers in the destructor. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	b134640829	flat_mutation_reader: abort if not closed before destroyed The motivation to abort if the reader is not closed before it's destroyed is mainly driven by: 1. Aborting will force us to find and fix missing closes. Otherwise, log warnings can easily be lost in the noise. (ERRORs however are caught by dtests but won't necessarily be caught in SCT / production environments) 2. Following patches remove existing cleanup code in destructors that is not needed once close() is mandated. If we don't abort on missing close we'll have to keep maintaining both cleanup paths forever. 3. Not enforcing close exposes us to leaks and potential use-after-free from background tasks that are left behind. We want to stop guaranteeing the safety of the background tasks post close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	5b22731f9a	flat_mutation_reader: require close Make flat_mutation_reader::impl::close pure virtual so that all implementations are required to implement it. With that, provide a trivial implementation to all implementations that currently use the default, trivial close implementation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	77cc694a08	repair: row_level_repair: run: close repair_meta when done Always close repair_meta (that closes its reader). Proper closing is done via the repair_meta::stop path. Ignore any errors when auto-closing in a deferred action since there is nothing else we can do at this point. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	0c19f788e5	repair: repair_reader: close underlying reader on_end_of_stream Need to close the reader before reassigning it with an empty f_m_r. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	391f942b2a	perf: everywhere: close flat_mutation_reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	aa5289f255	test: everywhere: close flat_mutation_reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	7427b60caf	mutation_partition: counter_write_query: close reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2fa8b3b84e	index: built_indexes_reader: implement close Close underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e453f890f2	mutation_writer: multishard_writer: close readers when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	64c5b7fda6	mutation_writer: feed_writer: close reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	825acd4031	table: for_all_partitions_slow: close iteration_step reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	dad6c94476	view_builder: stop: close all build_step readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	6082d854f9	stream_transfer_task: execute: close send_info reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	02d74e1530	view_update_generator: start: close staging_sstable_reader when done The staging_sstable_reader has to be closed before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	1e1c8ea824	view: build_progress_virtual_reader: implement close method Close underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2d8b00f2d8	view: generate_view_updates: close builder readers when done Make sure to close the builder's _updates and optional _existings readers before they are destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	652ba714fe	view_builder: initialize_reader_at_current_token: close reader before reassigning it Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	7093610931	view_builder: do_build_step: close build_step reader when done Make sure to close the build_step reader before destroying it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	51c96d405d	mutation_reader: evictable_reader: fill_buffer: make sure to close the reader If reader.fill_buffer() fails, we will not call `maybe_pause` and the reader will be destroyed, so make sure to close it. Otherwise, the reader is std::move'ed to `maybe_pause`, which either pauses it using register_inactive_read or further std::move's it to _reader; in both cases it doesn't need to be closed. `with_closeable` can safely try to close the moved-from reader and do nothing in this case, as the f_m_r::impl was already moved away. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	7c7569f0ad	querier_cache: implement stop Close the _closing_gate to wait on background close of dropped queries, and close all remaining queriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	87c62b5f59	test: querier_cache_test: close looked up querier Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	3e7075a739	compaction: setup: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	90a7a8ff0e	compaction: close reader when done consuming Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	07f34b4a32	querier_cache: lookup_querier: close the querier before dropping it Make sure to close the dropped querier before it's destroyed. The operation is moved to the background so as not to penalize the common path. A following patch will add a querier_cache::close() method that will close _closing_gate to wait on the querier close (among other things it needs to wait on :)). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	4a0abc7b9c	querier_cache: lookup_querier: define as a private method In preparation to closing the querier in the background before dropping it. With that, stats need not be passed as a parameter, but rather the _stats member can be used directly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	fa6d6c17f2	mutation_partition: mark query_result_builder constructor noexcept It is trivially so. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	320cb67b08	table: query, mutation_query: close querier when done Make sure to close the querier and subsequently its reader before destroying it (unless it was moved). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	8b8c721431	querier: add close method Depending on the contents of the _reader variant, either close the reader or unregister the inactive reader and close it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	3f00c21481	querier_cache: evict: close evicted reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	0d8d56c36f	querier: coroutinize evict methods Instead of calling a lambda function for each index simply iterate over all indices and use co_await / co_return in the inner loop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2f9cf01aa7	querier_cache: futurize evict api Prepare for futurizing the lower-level inactive reads api. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	57f921de4f	database: streaming_reader_lifecycle_policy: destroy_reader: close inactive reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	29b2b1f8dd	reader_lifecycle_policy: close inactive_read Make sure to close the unregistered inactive_read before it's destroyed, if the unregistered reader_opt is engaged. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	a144819683	reader_concurrency_semaphore: unregister_inactive_read: close reader also on internal error "forward" the unregister to the other semaphore in case on_internal_error throws rather than aborting. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	c8e30db5db	reader_concurrency_semaphore: close evicted reader Close readers in the background: - evicted based on ttl, or - those that weren't admitted by register_inactive_read - those that are destroyed in clear_inactive_reads. Use a gate for waiting on these background closes in stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	be1cafc1a5	reader_concurrency_semaphore: do_wait_admission: close evicted readers enqueue_waiter before evicting readers and start a loop in the background to dequeue and close inactive_readers until either the _wait_list is empty or there are no more inactive_readers to evict. We admit the read synchronously only if the wait_list is empty and the semaphore has_available_units to satisfy admission. We need to enqueue the reader before starting to evict readers to make sure any evicted resources are assigned to the waiter at the head of the queue and not "stolen" in case we yield and some other caller grabs them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	43bf0f9356	reader_concurrency_semaphore: add stop method In addition to clear_inactive_reads, that's currently called when the database object is destroyed, introduce a stop() method that will: 1. wait on all background closes of inactive_reads. 2. close all present inactive_reads and wait on their close. 3. signal waiters on the wait_list via broken() with a proper exception indicating that the semaphore was closed. In addition, assert in the semaphore's destructor that it has no remaining inactive reads. Stop must be called from whoever owns the r_c_s. Mainly, from database::stop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2f4134e1cc	reader_concurrency_semaphore: broken: make broken_semaphore the default exception Rather than explicitly generating it by all callers and then not using the argument at all. Prepare for providing a different exception_ptr from a stop() path to be introduced in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	cd0991f28d	multishard_mutation_query: read_context::stop: properly close unregistered inactive_reads Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	93d6dcdbcf	multishard_mutation_query: read_context: stop: wait on unregistering inactive reads Currently unregister_inactive_read for other shards is moved to the background with nothing to keep the respective reader_concurrency_semaphore around. This change runs the loop in parallel_for_each so that we don't have to serially wait on all of them but rather they can run in parallel on all shards, but all are waited on via the returned future<>. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	8421e1f61e	mutlishard_mutation_query: read_context: close: unregister all inactive reads Currently only if the reader_meta is in the saved state we unregister its inactive_read, yet it is possible that it will hold an inactive_read also in the lookup state. To cover all cases, rather than testing the reader_state, unregister if the inactive_read_handle is engaged. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	53889ef9b0	multishard_mutation_query: read_page: close reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	afa2fe0b76	multishard_mutation_query: read_page: make compaction_state first To simplify error handling for always closing the reader in this function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e2a767bef7	multishard_mutation_query: page_consume_result: mark constructor noexcept As it can't throw. This is needed to simplify the following patch that will always close the reader in read_page. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	a3f9dc6e0b	mutation_reader: multishard_combining_reader: implement close Close all underlying shard readers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	58b1da8cf5	mutation_reader: shard_reader: implement close return reader lifecycle policy's destroy_reader future so it can be waited on by caller (multishard_combining_reader::close). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2c1edb1a94	mutation_reader: reader_lifecycle_policy: return future from destroy_reader So we can wait on it from to-be-introduced shard_reader::close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	bfe56fd99c	mutation_reader: shard_reader: get rid of _stopped It's unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e1ec401bb6	mutation_reader: evictable_reader: implement close If there's an active reader then close it, else, try to resume the paused reader, and close it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	84206501ae	mutation_reader: foreign_reader: wait for readahead and close underlying reader Move the logic in ~foreign_reader to close() to wait on the read_ahead future and close the underlying reader on the remote shard. Still call close in the background in ~foreign_reader if destroyed without closing to keep the current behavior, but warn about it, until it's proved to be unneeded. Also, added on_internal_error in close if _read_ahead_future is engaged but _reader is not, since this must never happen and we wait on the _read_ahead_future without the _reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	ea3f2a6536	mutation_reader: restricting_mutation_reader: close underlying reader If a reader was admitted, close it in close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	f9daceda87	test: mutation_reader_test: multi_partition_reader: close underlying readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	db66a39b3e	test: row_cache_test: close readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	f9ae50483f	mutation_reader: merging_reader: close underlying merger Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	dccdbdff95	mutation_reader: mutation_fragment_merger: close underlying producer This will be needed by the merging_reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	761a38ce21	mutation_reader: mutation_reader_merger: make sure to close underlying readers These will be called by merging_reader::close via mutation_fragment_merger::close in the following patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	7d42a71310	mutation_reader: position_reader_queue: add close method Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	b140ea6df2	mutation_reader: compacting_reader: implement close Close underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	32ab957f82	mutation_reader: filtering_reader: implement close method Close underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	38e48bb462	size_estimates_reader: close partition_reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	13dfc41d8c	row_cache: cache_flat_mutation_reader: close underlying readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	0a2670c9ec	row_cache: hold read_context as unique_ptr Such that the holder, that is responsible for closing the read_context before destroying it, holds it uniquely. cache_flat_mutation_reader may be constructed either with a read_context&, where it knows that the read_context is owned externally, by the caller, or it could be constructed with a std::unique_ptr<read_context> in which case it assumes ownership of the read_context and it is now responsible for closing it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	8531eaaacf	row_cache: make_reader: make read_context only when needed So we can have better control on who's responsible to close it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	9944586480	row_cache: make_reader: use range directly Not via ctx, so we can delay the making of the read_context, as needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	4c969756ac	row_cache: scanning_and_populating_reader: make sure to close underlying readers Note that scanning_and_populating_reader::read_next_partition now closes the current reader unconditionally and before assigning a new reader. This should be an improvement since we want to release the reader resources as early as possible, certainly before allocating new resources. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e34ed3d3e4	row_cache: range_populating_reader: add close method To close the underlying _reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	c707ff27a4	row_cache: single_partition_populating_reader: add close method To close the optional underlying _reader and _read_context. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	63522361f2	row_cache: read_context: add close method To close the underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	4b0fcc7d99	row_cache: autoupdating_underlying_reader: add close method To close the underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	3853d7a376	row_cache: autoupdating_underlying_reader: close reader before updating it use the newly introduced reassign method to first close the flat_mutation_reader_opt before assigning it with a new reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	593bc9806d	memtable: memtable_snapshot_source: make sure to close readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	574759bf95	memtable: flush_reader: make sure to close partition reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	b13f6e817c	test: row_cache_stress_test: close reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	93b5d7d4c2	memtable: scanning_reader: make sure to close underlying reader Close _delegate if it's engaged both in the close() method and whenever it is currently reset by _delegate = {}. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	efe938cf1f	flat_mutation_reader: make sure to close reader passed to read_mutation_from_flat_mutation_reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	4b8dc7ac7e	flat_mutation_reader: make sure to close flat_mutation_reader_from_mutations Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:25:47 +03:00
Benny Halevy	0da2eea211	flat_mutation_reader: flat_multi_range_mutation_reader: close underlying reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	18268ab474	flat_mutation_reader: forwardable_empty_mutation_reader: close optional underlying reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	e2e642b1b1	flat_mutation_reader: make_forwardable, make_nonforwardable: close underlying reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	978501c336	flat_mutation_reader: partition_reversing_mutation_reader: implement no-op close We don't own _source therefore do not close it. That said, we still need to make sure that the reversing reader itself is closed to calm down the check when it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	f4dfaaa6c9	flat_mutation_reader: delegating_reader: close reader when moved to it The underlying reader is owned by the caller if it is moved to it, but not if it was constructed with a reference to the underlying reader. Close the underlying reader on close() only in the former case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	0e0edef8d8	flat_mutation_reader: transforming_reader: close underlying reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	3c05529329	sstables: scrub_compaction: reader: close underlying reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	75eed563bc	sstables: write_components: close reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	8c585ccb5c	sstables: sstable_mutation_reader: implement close Close both the _index_reader and _context, if they are engaged. Warn and ignore any errors from close as it may be called either from the destructor or from f_m_r close. Call close() for closing in the background if needed when destroyed and warn about it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	6a82e9f4be	sstables: index_reader: mark close noexcept We'd like that to simplify the soon-to-be-introduced sstable_mutation_reader::close error handling path. close_index_list can be marked noexcept since parallel_for_each is, with that index_reader::close can be marked noexcept too. Note that since reader close can not fail both lower and upper bounds are closed (since closing lower_bound cannot fail). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	5dce9997ff	test/lib: mutation_source_test: close readers Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	266d060aef	test/lib: flat_reader_assertions: close reader in destructor Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	844bc40060	everywhere: use with_closeable to close flat_mutation_reader `with_closeable` simplifies scoped use of flat_mutation_reader, making sure to always close the reader after use. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	ca06d3c92a	flat_mutation_reader: log a warning if destroyed without closing We cannot close in the background since there are use cases that require the impl to be destroyed synchronously. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	81391b845f	reader_permit: expose description method Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Benny Halevy	a471579bd7	flat_mutation_reader: introduce close Allow closing readers before destroying them. This way, outstanding background operations such as read-aheads can be gently canceled and be waited upon. Note that similar to destructors, close must not fail. There is nothing to do about errors after the f_m_r is done. Enforce that in flat_mutation_reader::close() so if the f_m_r implementation did return a failure, report it and abort as internal error. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Dejan Mircevski	962373a0a7	cql3: Replace some bounds_ranges calls We will remove bounds_ranges when we kill the restrictions class hierarchy. Of the several call sites, two can be easily modified to avoid it. Others are more complicated and will be modified in a subsequent commit. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-23 15:01:39 -04:00
Dejan Mircevski	b432bdb24e	cql3: Get partition range from expr::expression ... instead of a restrictions subclass, which will soon be eliminated. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-23 15:01:39 -04:00
Pavel Emelyanov	13d264d6bd	migration_manager: Make it main-local Now everybody is patched to use component-local instance of migration manager and its global instance can be moved into main() scope. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	b7a4fb0cf0	tests: Have own migration manager instances No more global migration manager usage left, so all the tests can be patched to use local migration manager instance. In fact, it's only the cql_test_env that's such. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	37c91c4c5c	tests: Use migration_manager from cql_test_env All the tests that need migration manager are run inside cql_test_env context and can use the migration manager from the env. For now this is still the global one, but next patch will change this. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	3c3535b4d8	migration_manager: Call maybe_sync from this The only caller of maybe_sync() method is now the method itself and can stop using global migration manager instance and switch to using 'this'. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	6b31c47a75	migration_manager: Make get_schema_for_... methods These two helpers are now namespace-scoped methods, but both need the migration manager instance inside. All their callers are now patched to have the migration manager at hands, so the helpers can be turned into methods. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	1021a180e7	migration_manager: Hide get_schema_definition This method is exclusively used inside migration manager code, so (for now) no use in keeping it exposed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	e0ca3ccc1c	streaming: Keep migration_manager ptr in rpc lambdas Same as previous patch, but for streaming. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	d76ff4b32f	storage_proxy: Keep migration_manager ptr in rpc lambdas This patch is the bridge between the previous one and the next one and is quite messy to be merged with either. No heavy changes -- just copy the migration manager's ptr onto rpc lambdas. Will be used in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	423d0baa65	streaming: Get migration_manager shared_ptr in messaging Same as in previous patch, but for streaming code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	a4569a30f3	storage_proxy: Get migration_manager shared_ptr in messaging The proxy's messaging code uses migration manager to obtain schema. Since proxy is a lower-level service than migration manager, it's incorrect to make proxy reference the manager directly. Instead, push the shared_ptr into proxy's messaging code. This kills two birds with one stone: 1: lets proxy use migration manager 2: makes sure that by the time migration manager is stopped the proxy's use of this pointer is gone (unregistered from rpc) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	b73a93dab7	migration_manager: Make maybe_sync a method Right now the maybe_sync is namespace-scope function. Turn it into a migration_manager method so that it can use 'this' instead of get_local_migration_manager(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	46bf6872d5	migration_manager: Open-code merge lambda This lambda uses global migration manager instance. Open-coding this short lambda makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	ef20d4ee59	migration_manager: Turn do_announce_new_type non-static It's the only place that calls recently patched .announce() method, so instead of grabbing global migration manager, use 'this'. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	7aa1b5d395	migration_manager: Make announce() non-static method This method needs to get migration manager instance to call methods on it, so turn it non-static to have the instance in 'this'. Caller (yes, only one) gets local migration manager itself, but will be patched soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	877ad36424	storage_service: Use local migration manager Now that the migration manager is on board, storage service can use it instead of the global instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	b7fe191e3d	storage_service: Keep migration manager on board The storage service needs migration manager to sync schema on lifecycle notifiers and to stop the guy on drain. So this patch just pushes the migration manager reference all the way through the storage service constructor. Few words about tests. Since now storage service needs the migration manager in constructor, some tests should take it from somewhere. The cql_test_env already has (and uses) it, all the others can just provide a not-started sharded one, it won't be in use in _those_ tests anyway. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	6d1eede472	migration_manager: Use 'this' where appropriate Some of its non-static methods call get_local_migration_manager instead of using 'this'. None of these places use this to get cross-shard instance, so it's safe to use 'this' there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	0223644ac5	repair: Use private migration manager pointer Nothing special here, just replace the code-wide global with repair-wide global. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	4c30556b8e	repair: Keep private sharded migration manager pointer It's nowadays standard for repair to keep global pointers on the needed services. Keep the migration manager there too to avoid explicit call to get_local_migration_manager. Later this pile of global pointers will be encapsulated on redis service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	2e74dc5fd7	redis: Carry sharded migration manager over init The only place in redis that needs migration manager is the ::init method that's called on start. It's possible to pass the migration manager as an argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Pavel Emelyanov	e7dc059917	migration_manager: Merge migration_task in The migration_task is the class with the single static method that's called from a single place in migration manager and this method calls migration manager back right at once. There's not much sense in keeping this abstraction, merge it into the migration manager. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-23 17:13:24 +03:00
Benny Halevy	f7e00e781c	repair: row_level: run: row_level_stop_finished incorrectly set too early Should set_repair_state to row_level_stop_started before calling repair_row_level_stop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210422111723.401719-1-bhalevy@scylladb.com>	2021-04-23 11:25:02 +02:00
Piotr Dulikowski	5a49fe74bb	db/hints: add a metric for counting processed files Adds a field to `end_point_hints_manager::sender`: `_total_replayed_segments_count` which keeps track of how many segments were replayed so far. This metric will be used to calculate the sequence number of the last current hint segments in the queue - so that we can implement waiting for current segments to be replayed.	2021-04-22 18:45:34 +02:00
Avi Kivity	0af7a22c21	repair: remove partition_checksum and related code `80ebedd242` made row-level repair mandatory, so there remain no callers to partition_checksum. Remove it. Closes #8537	2021-04-22 18:56:53 +03:00
Dejan Mircevski	da844a4b59	cql3: Track partition-range expressions Add a statement_restrictions member that tracks expressions that together define the partition range. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-04-22 11:35:37 -04:00
Piotr Dulikowski	e48739a6da	db/hints: allow to forcefully update segment list on flush Endpoint hints manager keeps a list of segments to replay. New segments are appended to it lazily - only when a hint flush occurs (hints commitlog instance is re-created) and the list is empty. Because of that, this list cannot be currently used to tell how many segments are on disk. This commit allows to trigger hints flush and forcefully update the list of segments to replay. In later commits, a mechanism will be implemented which will allow to wait until a given number of hint segments is replayed. Triggering a hints flush with segment list update will allow us to properly synchronize and determine up to which segment we need to wait.	2021-04-22 17:34:04 +02:00
Avi Kivity	c36549b22e	Merge 'rjson: Add throwing allocator' from Piotr Sarna This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory. This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in invalid state after throwing. The last part is verified by the fact that its destructor ran without errors. Fixes #8521 Refs #8515 Tests: * unit(release) * YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust relevant YCSB workload config for using 1.5MB objects: ```yaml fieldcount=150 fieldlength=10000 ``` Closes #8529 * github.com:scylladb/scylla: test: add a test for rjson allocation test: rename alternator_base64_test to alternator_unit_test rjson: add a throwing allocator	2021-04-22 17:12:02 +03:00
Piotr Sarna	83a45adbb7	test: add a test for rjson allocation The test cases check if the new rjson allocator throws when it fails to allocate/reallocate memory.	2021-04-22 15:59:13 +02:00
Avi Kivity	34b57688b9	tools: toolchain: dbuild: define die() earlier die() is called before it is defined, so it doesn't work. Move it earlier. Ref #8520. Closes #8523	2021-04-22 15:38:10 +02:00
Eliran Sinvani	480a12d7b3	Materialized views: fix possibly old views coming from other nodes Migration manager has a function to get a schema (for read or write), this function queries a peer node and retrieves the schema from it. One scenario where it can happen is if an old node queries an old not fixed index. This makes a hole through which views that are only adjusted for reading can slip through. Here we plug the hole by fixing such views before they are registered. Closes #8509	2021-04-22 15:38:10 +02:00
Kamil Braun	8e9a9f8bd3	raft: fsm: include config entries in output.committed Otherwise waiters on committed configuration changes (e.g. `server::set_configuration`) would never get notified. Also if we tried to send another entry concurrently we would get replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed. (not sure if this commit also fixes whatever caused that). Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>	2021-04-22 15:38:10 +02:00
Avi Kivity	350f79c8ce	Merge 'sstables: remove large allocations when parsing cells' from Wojciech Mitros sstable cells are parsed into temporary_buffers, which causes large contiguous allocations for some cells. This is fixed by storing fragments of the cell value in a fragmented_temporary_buffer instead. To achieve this, this patch also adds new methods to the fragmented_temporary_buffer(size(), ostream& operator<<()) and adds methods to the underlying parser(primitive_consumer) for parsing byte strings into fragmented buffers. Fixes #7457 Fixes #6376 Closes #8182 * github.com:scylladb/scylla: primitive_consumer: keep fragments of parsed buffer in a small_vector sstables: add parsing of cell values into fragmented buffers sstables: add non-contiguous parsing of byte strings to the primitive_consumer utils: add ostream operator<<() for fragmented_temporary_buffer::view compound_type: extend serialize_value for all FragmentedView types	2021-04-22 15:38:10 +02:00
Nadav Har'El	fc2da8058c	Merge 'qos: make sure to wait for service level updates on shutdown' from Piotr Sarna The service level controller spawns an updating thread, which wasn't properly waited for during shutdown. This behavior is now fixed. Tests: manual Fixes #8468 Closes #8470 * github.com:scylladb/scylla: qos: make sure to wait for sl updates on shutdown db: stop using infinite timeout for service level updates	2021-04-22 15:38:09 +02:00
Pekka Enberg	0ddbed2513	dist: Add support for disabling writeback cache This adds support for disabling writeback cache by adding a new DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which makes the "scylla_prepare" script (that is run before Scylla starts up) call perftune.py with appropriate parameters. Also add a "--disable-writeback-cache" option to "scylla_sysconfig_setup", which can be called by scylla-machine image scripts, for example. Refs: #7341 Tests: dtest (next-gating) Closes #8526	2021-04-22 11:24:49 +03:00
Asias He	b6104e5f44	doc: Update bootstrap with everywhere_topology Document how we choose node to sync with if everywhere_topology is used. Refs #8503 Closes #8518	2021-04-22 11:24:49 +03:00
Avi Kivity	a063173ace	Merge "Fix unbounded memory usage and high write amplification in TWCS reshape" from Raphael " Memory usage is considerably reduced by making reshape switch to partitioned set, given that input sstables are disjoint. This will benefit reshape for all strategies, not only TWCS. Write amplification is reduced a lot by compacting all input sstables at once, which is possible given that unbounded memory usage is fixed too. With both these issues fixed, TWCS reshape will be much more efficient. tests: mode(dev). " * 'twcs_reshape_fixes' of github.com:raphaelsc/scylla: tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently TWCS: Reshape all sstables in a time window at once if they're disjoint sstables: Extract code to count amount of overlapping into a function LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint compaction: Make reshape compaction always use partitioned_sstable_set compaction: Allow a compaction type to override the sstable_set for input sstables	2021-04-22 11:24:49 +03:00
Piotr Sarna	55ae110774	qos: make sure to wait for sl updates on shutdown The service level controller spawns an updating thread, which wasn't properly waited for during shutdown. This behavior is now fixed. In order to make the shutdown order more standardized, the operation is split into two phases - draining and stopping. Tests: manual Fixes #8468	2021-04-22 09:58:27 +02:00
Piotr Sarna	ad661561c8	db: stop using infinite timeout for service level updates Due to a porting bug, the routines for updating service levels used the default infinite timeout for internal CQL queries, which causes Scylla to hang on shutdown. The behavior is now fixed and the routines use the same timeout as the other similar functions - 10s at the time of writing this message.	2021-04-22 09:03:21 +02:00
Raphael S. Carvalho	394b9ddb31	tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-21 11:03:16 -03:00
Raphael S. Carvalho	d5fc2f3839	TWCS: Reshape all sstables in a time window at once if they're disjoint With repair-based operations, each window will have 256 disjoint sstables due to data segregation which produces N sstables for each vnode range, where N = # of existing windows. So each window ends up with one sstable per vnode range = 256. Given that reshape now unconditionally uses partitioned set's incremental selector, all the 256 sstables can be compacted at once as compaction essentially becomes a copy operation, where only one sstable will be opened at a time, making its memory usage very efficient. By compacting all sstables at once, write amplification is greatly reduced because each byte is now only rewritten once. Previously, with the initial set of 256 sstables, write amp could be up to 8, which makes reshape for TWCS very slow. Refs #8449. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-21 11:03:16 -03:00
Raphael S. Carvalho	0f7774a6f8	sstables: Extract code to count amount of overlapping into a function This function will be reused by TWCS reshape when checking if all sstables in a window are disjoint and can be all compacted together. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-21 11:03:16 -03:00
Raphael S. Carvalho	39ecddbd34	LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint Wrong comparison operator is used when checking for overlapping. It would miss overlapping when last key of a sstable is equal to the first key of another sstable that comes next in the set, which is sorted by first key. Fixes #8531. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-21 11:03:07 -03:00
Asias He	1513de633b	repair: Switch to use NODE_OPS_CMD for decommission operation In commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation), we switched replace operation to use the new NODE_OPS_CMD infrastructure. In this patch, we continue the work to switch decommission operation to use NODE_OPS_CMD. The benefits: - A UUID is used to identify each node operation across the cluster. - It is more reliable to detect pending node operations, to avoid multiple topology changes at the same time. - The cluster reverts to a state before the decommission operation automatically in case of error. Without this patch, the node to be decommissioned will be stuck in decommission status forever until it is restarted and goes back to normal status. - Allows users to pass a list of dead nodes to ignore for decommission explicitly. - The LEAVING gossip status is not needed any more. This is one step closer to achieve gossip-less topology change. - Allows us to trigger off-strategy easily on the node receiving the ranges Fixes #8471	2021-04-21 20:35:54 +08:00
Piotr Sarna	dfd1ea6b92	test: rename alternator_base64_test to alternator_unit_test With the more generic name, I would no longer feel bad adding non-base64 test cases to it.	2021-04-21 14:26:40 +02:00
Piotr Sarna	45d7144529	rjson: add a throwing allocator The default rapidjson allocator returns nullptr from a failed allocation or reallocation. It's not a bug by itself, but rapidjson internals usually don't check for these return values and happily use nullptr as a valid pointer, which leads to segmentation faults and memory corruptions. In order to prevent these bugs, the default allocator is wrapped with a class which simply throws once it fails to allocate or reallocate memory, thus preventing the use of nullptr in the code. One exception is Malloc/Realloc with size 0, which is expected to return nullptr by rapidjson code.	2021-04-21 14:26:38 +02:00
Takuya ASADA	00dcaf2896	dist/debian: rename .default file correctly On 'product != scylla' environment, we have a bug with .default file (sysconfig file) handling. Since the .default file should be installed under its original name, the package name may not match the .default filename. (ex: default file is /etc/default/scylla-node-exporter, but package name is scylla-enterprise-node-exporter) When the filename doesn't match the package name, it should be renamed as follows: <package name>.<filename>.default We already do this on .service file, but mistakenly haven't handled .default file, so let's add it too. Related scylladb/scylla-enterprise#1718 Fixes #8527 Closes #8528	2021-04-21 14:24:21 +03:00
Piotr Sarna	2ad09d0bf8	Merge 'treewide: remove inclusions of storage_proxy.hh from headers' from Avi Kivity Reduce rebuilds and build time by removing unnecessary includes. Along the way, improve header sanity. Ref #1. Test: dev-headers, unit(dev). Closes #8524 * github.com:scylladb/scylla: treewide: remove inclusions of storage_proxy.hh from headers storage_proxy: unnest coordinator_query_result treewide: make headers self-sufficient utils: intrusive_btree: add missing #pragma once	2021-04-21 08:22:52 +02:00
Avi Kivity	09819a4c62	Update seastar submodule * seastar 0b2c25d133...980a29fb70 (1): > Merge "Assorted set of improvements over io-queue" from Pavel E Fixes #8378	2021-04-21 08:22:52 +02:00
Benny Halevy	7130e2e7ff	sstables: harden unlink Make sure that sstable::unlink will never fail. It will terminate in the unlikely case toc_filename throws (e.g., on bad_alloc), otherwise it ignores any other error and just warns about it. Make unlink a coroutine to simplify the implementation without introducing additional allocations. Note that remove_by_toc_name and maybe_delete_large_data_entries are executed asynchronously and concurrently. Waiting for them to finish is serialized by co_await, making sure that both are being waited on so as not to leave abandoned futures behind. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>	2021-04-21 08:22:52 +02:00
Raphael S. Carvalho	678e4c0bb9	compaction: Make reshape compaction always use partitioned_sstable_set Reshape compaction potentially works with disjoint sstables, so it will benefit a lot from using partitioned_sstable_set, which is able to incrementally open the disjoint sstables. Without it, all sstables are opened at once, which means unbounded memory usage. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-20 15:39:51 -03:00
Avi Kivity	daeddda7cc	treewide: remove inclusions of storage_proxy.hh from headers storage_proxy.hh is huge and includes many headers itself, so remove its inclusions from headers and re-add smaller headers where needed (and storage_proxy.hh itself in source files that need it). Ref #1.	2021-04-20 21:23:00 +03:00
Avi Kivity	cdf30524f3	storage_proxy: unnest coordinator_query_result Nested classes cannot be forward declared, and storage_proxy::coordinator_query_result is used in pagers, where we'd like to forward-declare it. Unnest it and introduce an alias for compatibility.	2021-04-20 21:23:00 +03:00
Avi Kivity	14a4173f50	treewide: make headers self-sufficient In preparation for some large header changes, fix up any headers that aren't self-sufficient by adding needed includes or forward declarations.	2021-04-20 21:23:00 +03:00
Avi Kivity	6db1a71775	utils: intrusive_btree: add missing #pragma once Interferes with making headers self-sufficient, so add it now.	2021-04-20 21:23:00 +03:00
Raphael S. Carvalho	ad9bc808b9	compaction: Allow a compaction type to override the sstable_set for input sstables By default, compaction will pick an implementation of sstable_set as defined by the underlying compaction strategy. However, reshape compaction potentially works with disjoint sstables and will benefit a lot from always using partitioned set. For example, when reshaping a TWCS table, it's better to use the partitioned set rather than the time window set, as the former will be much more memory efficient by incrementally selecting sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-20 12:03:44 -03:00
Nadav Har'El	50f3201ee2	alternator: fix inequality check of two sets In issue #5021 we noted that Alternator's equality operator needs to be fixed for the case of comparing two sets, because the equality check needs to take into account the possibility of different element order. Unfortunately, we fixed only the equality check operator, but forgot there is also an inequality operator! So in this patch we fix the inequality operator, and also add a test for it that was previously missing. The implementation of the inequality operator is trivial - it's just the negation of the equality test. Our pre-existing tests verify that this is the correct implementation (e.g., if attribute x doesn't exist, then "x = 3" is false but "x <> 3" is true). Refs #5021 Fixes #8513 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419141450.464968-1-nyh@scylladb.com>	2021-04-20 13:14:19 +02:00
Nadav Har'El	dae7528fe5	alternator: fix equality check of nested document containing a set In issue #5021 we noticed that the equality check in Alternator's condition expressions needs to handle sets differently - we need to compare the set's elements ignoring their order. But the implementation we added to fix that issue was only correct when the entire attribute was a set... In the general case, an attribute can be a nested document, with only some inner set. The equality-checking function needs to traverse this nested document, and compare the sets inside it as appropriate. This is what we do in this patch. This patch also adds a new test comparing equality of a nested document with some inner sets. This test passes on DynamoDB, failed on Alternator before this patch, and passes with this patch. Refs #5021 Fixes #8514 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419184840.471858-1-nyh@scylladb.com>	2021-04-20 13:14:10 +02:00
Nadav Har'El	46448b0983	alternator: fix equality check of two unset attributes When a condition expression (ConditionExpression, FilterExpression, etc.) checks for equality of two item attributes, i.e., "x = y", and when one of these attributes was missing we correctly returned false. However, we also need to return false when both attributes are missing in the item, because this is what DynamoDB does in this case. In other words an unset attribute is never equal to anything - not even to another unset attribute. This was not happening before this patch: When x and y were both missing attributes, Alternator incorrectly returned true for "x = y", and this patch fixes this case. It also fixes "x <> y" which should be true when both x and y are unset (but was false before this patch). The other comparison operators - <, <=, >, >=, BETWEEN, were all implemented correctly even before this patch. This patch also includes tests for all the two-unset-attribute cases of all the operators listed above. As usual, we check that these tests pass on both DynamoDB and Alternator to confirm our new behavior is the correct one - before this patch, two of the new tests failed on Alternator and passed on DynamoDB. Fixes #8511 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210419123911.462579-1-nyh@scylladb.com>	2021-04-20 13:14:00 +02:00
Botond Dénes	4c3454dd07	database: get_reader_concurrency_semaphore(): make the user semaphore the catch-all Currently said method uses the system semaphore as a catch-all for all scheduling groups it doesn't know about. This is incompatible with the recent forward-porting of the service-level infrastructure as it means that all service level related scheduling groups will fall back to the system scheduling group, which causes two problems: * They will experience much more limited concurrency, as the system semaphore is assigned far fewer count units, to match the much more limited internal traffic. * They compete with internal reads, severely impacting the respective internal processes, potentially causing extreme slowdown, or even deadlock in the case of an internal query executed on behalf of a user query being blocked on the latter. Even if we don't have any custom service level scheduling groups at the moment, it is better to change this such that unknown scheduling groups fall back to using the user semaphore. We don't expect any new internal scheduling group to pop up any time soon (and if they do we can adjust get_reader_concurrency_semaphore() accordingly), but we do expect user scheduling groups to be created in the future, even dynamically. To minimize the chance of the wrong workload being associated with the user semaphore, all statically created scheduling groups are now explicitly listed in `get_reader_concurrency_semaphore()`, to make their association with the respective semaphore explicit and documented. Added a unit test which also checks the correct association for all these scheduling groups. Fixes: #8508 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210420105156.94002-1-bdenes@scylladb.com>	2021-04-20 14:06:25 +03:00
Piotr Sarna	ec750e5f49	rjson: make the max nested level configurable Back when rjson was only part of alternator, there was a hardcoded limit of nested levels - 78. The number was calculated as: - taking the DynamoDB limit (32) - adding 7 to it to make alternator support more cases - doubling it because rjson internals bump the level twice for each alternator object (because the alternator object is represented as a 2-level JSON object). Since rjson is no longer specific to alternator, this limit is now configurable, and the original default value is explained in a comment. Message-Id: <51952951a7cd17f2f06ab36211f74086e1b60d2d.1618916299.git.sarna@scylladb.com>	2021-04-20 14:05:03 +03:00
Nadav Har'El	c29f55e801	Merge 'Unify CQL and Redis server code' from Pekka Enberg The Redis server started as a copy of the CQL server, but did not receive all the fixes of the CQL server over time. For example, commit `1a8630e` ("transport: silence "broken pipe" and "connection reset by peer" errors") was only done on the CQL server. To remedy the situation, this pull request unifies code between the CQL and Redis servers by introducing a "generic_server" component, and switching CQL and Redis to use it. Test: dtest(dev) Closes #8388 * github.com:scylladb/scylla: generic_server: Rename "maybe_idle" to "maybe_stop" generic_server: API documentation for connection and server classes transport, redis: Use generic server::listen() transport/server: Remove "redis_server" prefix from logging transport/server: Remove "cql_server" prefix from logging generic_server: Remove unneeded static_pointer_cast<> transport, redis: Use generic server::do_accepts() transport, redis: Use generic server::process() redis: Move Redis specific code to handle_error() transport: Move CQL specific error handling to handle_error() transport, redis: Move connection tracking to generic_server::server class transport, redis: Move _stopped and _connections_list to generic_server::server class transport, redis: Move total_connections to generic_server::server class transport, redis: Use generic server::maybe_idle() transport, redis: Move list_base_hook<> inheritance to generic_server::connection transport, redis: Use generic connection::shutdown()	2021-04-20 12:20:25 +03:00
Tomasz Grabiec	dc7beec382	Merge "Tweak cache_flat_mutation_reader" from Pavel Emelyanov The set recycles 16 bytes from the reader class, makes use of rows collection sugar, generalizes range tombstones emission and adds an invariant-check. tests: unit(dev) * xemul/br-cache-reader-cleanups-1.2: cache_flat_mutation_reader: Generalize range tombstones emission cache_flat_mutation_reader: Tune forward progress check cache_flat_mutation_reader: Use rows insertion sugar cache_flat_mutation_reader: Move state field cache_flat_mutation_reader: Remove raiish comparator cache_flat_mutation_reader: Remove unused captured variable cache_flat_mutation_reader: Fix trace message text	2021-04-19 21:21:49 +02:00
Benny Halevy	a57459e983	compaction: cleanup_compaction: no need to filter tokens belonging to other shards As sstables are always resharded if needed when loaded. Refs #6807 Test: unit(release,debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210419142743.265729-1-bhalevy@scylladb.com>	2021-04-19 17:22:53 +02:00
Benny Halevy	9c89702fb2	perf_simple_query: use tests::random::get_int for reproducible results Support for random-seed was added in `4ad06c7eeb` but the program still uses std::rand() to draw random keys. Use tests::random::get_int instead so we can get a reproducible sequence of keys given a particular random-seed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210418104455.82086-1-bhalevy@scylladb.com>	2021-04-19 17:22:53 +02:00
Piotr Sarna	2591cbb62e	main: add a debug symbol for service level controller It's notoriously hard to find the service level controller symbol (possible by guessing the offset based on system_distributed_keyspace address, but it's very cumbersome). To make the debugging process easier, the symbol is exported via the `debug` namespace. Closes #8506	2021-04-19 11:29:01 +03:00
Kamil Braun	617813ba66	sys_dist_ks: new keyspace for system tables with Everywhere strategy `system_distributed_everywhere` is a new keyspace that uses Everywhere replication strategy. This is useful, for example, when we want to store internal data that should be accessible by every node; the data can be written using CL=ALL (e.g. during node operations such as node bootstrap, which require all nodes to be alive - at least currently) and then read by each node locally using CL=ONE (e.g. during node restarts). Closes #8457	2021-04-19 11:22:57 +03:00
Nadav Har'El	13104bd7e2	Merge 'repair: Handle everywhere_topology in bootstrap_with_repair ' from Asias He repair: Handle everywhere_topology in bootstrap_with_repair The everywhere_topology returns the number of nodes in the cluster as RF. This makes only streaming from the node losing the range impossible since no node is losing the range after bootstrap. Shortcut to stream from all nodes in local dc in case the keyspace is everywhere_topology. Fixes #8503 Closes #8505 * github.com:scylladb/scylla: repair: Make the log more accurate in bootstrap_with_repair repair: Handle everywhere_topology in bootstrap_with_repair	2021-04-19 11:19:01 +03:00
Asias He	4c4334e912	repair: Make the log more accurate in bootstrap_with_repair We have the log message "expected 1 node losing range but found more nodes". However, we can find zero nodes as well. Drop the word "more" in the log. In addition, print the number of nodes found. Refs #8503	2021-04-19 15:15:05 +08:00
Takuya ASADA	0b01e1a167	dist: add DefaultDependencies=no to .mount units To avoid ordering cycle error on Ubuntu, add DefaultDependencies=no on .mount units. Fixes #8482 Closes #8495	2021-04-19 09:06:42 +03:00
Botond Dénes	8287cdb2ff	scripts/build-help.sh: extend help text with more targets Mention executables (scylla, tools and tests) as well as how to build individual object files and how to verify individual headers. Also mention the not-at-all obvious trick of how to build tests with debug symbols. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210416131950.175413-1-bdenes@scylladb.com>	2021-04-19 06:33:01 +02:00
Asias He	3c36517598	repair: Handle everywhere_topology in bootstrap_with_repair The everywhere_topology returns the number of nodes in the cluster as RF. This makes only streaming from the node losing the range impossible since no node is losing the range after bootstrap. Shortcut to stream from all nodes in local dc in case the keyspace is everywhere_topology. Fixes #8503	2021-04-19 10:47:36 +08:00
Tomasz Grabiec	dbd0b9a3ef	gdb: Fix miscalculation of small pool memory usage "scylla memory" It should not count free pages which used to belong to a given pool. Message-Id: <20210415175923.683555-1-tgrabiec@scylladb.com>	2021-04-18 14:03:17 +03:00
Tomasz Grabiec	68cde23912	gdb: Fix --size option of "scylla task_histogram" By default, argparse will provide the value of the option as str. Later, we compare it with int, which will be always False. Fix by telling argparse to provide as int. Message-Id: <20210415182149.686355-1-tgrabiec@scylladb.com>	2021-04-18 14:03:17 +03:00
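The str-versus-int comparison described in the `--size` fix above can be reproduced outside gdb. This is a minimal sketch, assuming a hypothetical parser with a single `--size` option; it is not the actual scylla-gdb.py code.

```python
import argparse


def parse_size(argv, coerce):
    """Parse a hypothetical --size option, optionally asking argparse for an int."""
    parser = argparse.ArgumentParser()
    if coerce:
        # With type=int, argparse converts the raw string before storing it.
        parser.add_argument("--size", type=int)
    else:
        # Default behaviour: the value is stored as str.
        parser.add_argument("--size")
    return parser.parse_args(argv).size


as_str = parse_size(["--size", "64"], coerce=False)
as_int = parse_size(["--size", "64"], coerce=True)

# A str never compares equal to an int, so a filter like `size == 64`
# silently matches nothing unless the option is declared with type=int.
print(as_str == 64, as_int == 64)
```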
Botond Dénes	8a43a11f7b	scylla-gdb.py: get_base_class_offset(): make sure offset is returned as int Looks like in python 3, division automatically yields a double/float, even if both operands are integers. This results in get_base_class_offset() returning a double/float, which breaks pointer arithmetics (which is what the returned value is used for), because now instead of decrementing/incrementing the pointer, the pointer will be converted to a double itself silently, then back to some corrupt pointer value. One user visible effect is `intrusive_list` being broken, as it uses the above method to calculate the member type pointer from the node pointers. Fix by coercing the returned value to int. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210415080034.167762-1-bdenes@scylladb.com>	2021-04-18 14:03:17 +03:00
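The Python 3 division behaviour behind the fix above is easy to check in isolation. The sketch below assumes the offset is derived from a bit position divided by 8; the function names and that formula are illustrative, not taken from scylla-gdb.py.

```python
def offset_true_division(bitpos):
    """Python 3 `/` is true division: the result is a float even for exact results."""
    return bitpos / 8


def offset_coerced(bitpos):
    """Coerce to int, as the fix above describes, so the value is safe for pointer arithmetic."""
    return int(bitpos / 8)


def offset_floor_division(bitpos):
    """An alternative: `//` keeps int operands as int."""
    return bitpos // 8


print(type(offset_true_division(64)).__name__)   # float
print(type(offset_coerced(64)).__name__)         # int
print(type(offset_floor_division(64)).__name__)  # int
```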
Pavel Emelyanov	5ecbc33be5	database.*: Remove unused headers The database.hh is the central recursive-headers knot -- it has ~50 includes. This patch leaves only 34 (it remains the champion though). Similar thing for database.cc. Both changes help the latter compile ~4% faster :) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210414183107.30374-1-xemul@scylladb.com>	2021-04-18 14:03:17 +03:00
Pavel Emelyanov	2a7171110d	cache_flat_mutation_reader: Generalize range tombstones emission The range tombstone can be added-to-buffer from two places: when it was found in cache and when it was read from the underlying reader. Both adders can now be generalized. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:46 +03:00
Pavel Emelyanov	2e98cfbf1d	cache_flat_mutation_reader: Tune forward progress check When adding a range tombstone to the buffer, the check for whether to stop stuffing the already full buffer is only done if this particular range tombstone changes the lower_bound. This check can be tuned -- if the lower bound changed _at_ _all_ after a range tombstone was added, we may still abort the loop. This change will allow the next patch to generalize range tombstone emission. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:46 +03:00
Pavel Emelyanov	a35de6ea3e	cache_flat_mutation_reader: Use rows insertion sugar When inserting a rows_entry via unique_ptr, the ptr in question can be pushed as is; the intrusive btree code releases the pointer (to be exception safe) itself. This makes the code a bit shorter and simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:46 +03:00
Pavel Emelyanov	df488dd8ac	cache_flat_mutation_reader: Move state field There are two alignment gaps in the middle of the c_f_m_r -- one after the state and another one after the set of bools. Keeping them together allows the compiler to pack the c_f_m_r better. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:46 +03:00
Pavel Emelyanov	bc3f910fc1	cache_flat_mutation_reader: Remove raiish comparator The instance of position_in_partition::tri_compare sits on the reader itself and just occupies memory. It can be created on demand, all the more so because only one place needs it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:46 +03:00
Pavel Emelyanov	41352334ba	cache_flat_mutation_reader: Remove unused captured variable The captured timeout is not used in lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:41 +03:00
Pavel Emelyanov	eb65f8ed6b	cache_flat_mutation_reader: Fix trace message text The entry inserted in this branch is not dummy, but an empty row. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-16 17:55:22 +03:00
Nadav Har'El	8751728314	Merge 'Improve validation of "enable", "postimage" and "ttl" CDC options' from Piotr Grabowski First commit: In the first commit, add validation of `enable` and `postimage` CDC options. Both options are boolean options, but previously they were not validated, meaning you could issue a query: ``` CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'}; ``` and it would be executed without any errors, silently interpreting `dsfdsd` as false. The first commit narrows possible values of those boolean CDC options to `false`, `true`, `0`, `1`. After applying this change, issuing the query above would result in this error message: ``` ConfigurationException: Invalid value for CDC option "enabled": dsfdsd ``` I actually encountered this lacking validation myself, as I mistakenly issued a query: ``` CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': 'full'}; ``` incorrectly assigning `full` to `postimage`, instead of `preimage`. However, before this commit, this query ran correctly and it interpreted `full` as `false` and disabled postimages altogether. Second commit: The second commit improves the error message of invalid `ttl` CDC option: Before: ``` CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'}; ServerError: stoi ``` After: ``` CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'}; ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd ``` ``` CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'}; ConfigurationException: Invalid CDC option: ttl too large ``` Closes #8486 * github.com:scylladb/scylla: cdc: improve exception message of invalid "ttl" cdc: add validation of "enable" and "postimage"	2021-04-15 11:59:41 +02:00
Takuya ASADA	cbbd5b2b6f	unified: abort install when non-bash shell detected On Debian variants, sh -x ./install.sh will fail since our script is written in bash, and /bin/sh in Debian variants is dash, not bash. So detect a non-bash shell and print an error message asking users to run the script in bash. Fixes #8479 Closes #8484	2021-04-15 11:59:41 +02:00
Avi Kivity	935378fa53	main: start background reclaim before bootstrap We start background reclaim after we bootstrap, so bootstrap doesn't benefit from it, and sees long stalls. Fix by moving background reclaim initialization early, before storage_service::join_cluster(). (storage_service::join_cluster() is quite odd in that main waits for it synchronously, compared to everything else which is just a background service that is only initialized in main). Fixes #8473. Closes #8474	2021-04-15 11:59:41 +02:00
Raphael S. Carvalho	84f7ae2c82	table: remove unneeded code as sstables are not shared anymore given that resharding is now a synchronous mandatory step, before table is populated, snapshot() can now get rid of code which takes into account whether or not a sstable is shared. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>	2021-04-15 11:59:41 +02:00
Avi Kivity	b19d318701	Update seastar submodule * seastar d2dcda96bb...0b2c25d133 (4): > reactor: reactor_backend_epoll: stop using signals for high resolution timers > reactor: move task_quota_timer_thread_fn from reactor to reactor_backend_epoll > Merge "Report maximum IO lenghts via file API" from Pavel E > Merge "Improve efficiency of io-tester" from Pavel E	2021-04-15 11:59:41 +02:00
Piotr Grabowski	61c8e196be	cdc: improve exception message of invalid "ttl" Improve the exception message of providing invalid "ttl" value to the table. Previously, if you executed a CREATE TABLE query with invalid "ttl" value, you would get a non-descriptive error message: CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'invalid'}; ServerError: stoi This commit adds more descriptive exception messages: CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': 'kgjhfkjd'}; ConfigurationException: Invalid value for CDC option "ttl": kgjhfkjd CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': true, 'ttl': '75747885787487'}; ConfigurationException: Invalid CDC option: ttl too large	2021-04-14 17:40:23 +02:00
Piotr Grabowski	10390afc10	cdc: add validation of "enable" and "postimage" Add validation of "enable" and "postimage" CDC options. Both options are boolean options, but previously they were not validated, meaning you could issue a query: CREATE TABLE ks.t(pk int, PRIMARY KEY(pk)) WITH cdc = {'enabled': 'dsfdsd'}; and it would be executed without any errors, silently interpreting "dsfdsd" as false. This commit narrows possible values of those boolean CDC options to false, true, 0, 1. After applying this change, issuing the query above would result in this error message: ConfigurationException: Invalid value for CDC option "enabled": dsfdsd	2021-04-14 17:36:38 +02:00
Nadav Har'El	4cf21f3a0f	cql-pytest: update run-cassandra script for Java 11 This patch fixes cql-pytest/run-cassandra to work on systems which default to Java 11, including Fedora 33. Recent versions of Cassandra can run on Java 11 fine, but require a bunch of weird JVM options to work around its JPMS (Java Platform Module System) feature. Cassandra's start scripts require these options to be listed in conf/jvm11-server.options, which is read by the startup script cassandra.in.sh. Because our "run-cassandra" builds its own "conf" directory, we need to create a jvm11-server.options file in that directory. This is ugly, but unfortunately necessary if cql-pytest/run-cassandra is to run on systems defaulting to Java 11. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210406220039.195796-1-nyh@scylladb.com>	2021-04-14 13:16:00 +02:00
Asias He	9ea57dff21	gossip: Relax failure detector update We currently only update the failure detector for a node when a higher version of application state is received. Since gossip syn messages do not contain application state, this means we do not update the failure detector upon receiving gossip syn messages, even though receiving a message from a peer node implies that the peer node is alive. This patch relaxes the failure detector update rule to update the failure detector for the sender of gossip messages directly. Refs #8296 Closes #8476	2021-04-14 13:16:00 +02:00
Tomasz Grabiec	320f6bf220	Merge 'test: perf: perf_simple_query: collect allocation and task statistics' from Avi Kivity Calculate and display the number of memory allocations and tasks executed per operation. Sample results (--smp 1): 180022.46 tps (90 allocs/op, 20 tasks/op) 178963.44 tps (90 allocs/op, 20 tasks/op) 178702.41 tps (90 allocs/op, 20 tasks/op) 177679.74 tps (90 allocs/op, 20 tasks/op) 179539.36 tps (90 allocs/op, 20 tasks/op) median 178963.44 tps (90 allocs/op, 20 tasks/op) median absolute deviation: 575.92 maximum: 180022.46 minimum: 177679.74 This allows less noisy tracking of how some changes impact performance. Closes #8425 * github.com:scylladb/scylla: test: perf: perf_simple_query: collect allocation and task statistics perf: deinline some functions in perf.hh	2021-04-14 13:16:00 +02:00
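The summary figures in the sample output above can be recomputed from the five per-run tps values. A small sketch using only the numbers quoted in that commit message (how perf_simple_query itself computes them is not shown here):

```python
from statistics import median

# The five tps samples quoted in the commit message above.
samples = [180022.46, 178963.44, 178702.41, 177679.74, 179539.36]

med = median(samples)
# Median absolute deviation: the median of each sample's distance from the median.
mad = median(abs(s - med) for s in samples)

print(f"median {med:.2f} tps")
print(f"median absolute deviation: {mad:.2f}")
print(f"maximum: {max(samples):.2f}")
print(f"minimum: {min(samples):.2f}")
```

Because the median absolute deviation ignores how far the extreme runs stray, it is less sensitive to a single noisy run than a standard deviation would be.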
Kamil Braun	5c7ed7a83f	time_series_sstable_set: return partition start if some sstables were ck-filtered out When a particular partition exists in at least one sstable, the cache expects any single-partition query to this partition to return a `partition_start` fragment, even if the result is empty. In `time_series_sstable_set::create_single_key_sstable_reader` it could happen that all sstables containing data for the given query get filtered out and only sstables without the relevant partition are left, resulting in a reader which immediately returns end-of-stream (while it should return a `partition_start` and if not in forwarding mode, a `partition_end`). This commit fixes that. We do it by extending the reader queue (used by the clustering reader merger) with a `dummy_reader` which will be returned by the queue as the very first reader. This reader only emits a `partition_start` and, if not in forwarding mode, a `partition_end` fragment. Fixes #8447. Closes #8448	2021-04-14 13:16:00 +02:00
Calle Wilund	03590c8254	commitlog_test: Add test for deadlock in shutdown w. segment wait Refs #8438 Ensures shutting down (well behaved) works even if an allocating path is stuck waiting for a new segment - i.e. other aspect of Closes #8475	2021-04-14 13:16:00 +02:00
Michael Livshin	4ccb1b3a2f	build: add nix-shell support Support native building & unit testing in the Nix ecosystem under nix-shell. Actual dist packaging for Nixpkgs/NixOS is not there (yet?), because: * Does not exactly seem like a huge priority. * I don't even have a firm idea of how much work it would entail (it certainly does not need the ld.so trickery, so there's that. But at least some work would be needed, seeing how ScyllaDB needs to integrate with its environment and NixOS is a little unorthodox). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210413110508.5901-4-michael.livshin@scylladb.com>	2021-04-14 13:15:59 +02:00
Michael Livshin	d87e751182	build: add a structural way to distro-extend configure.py For now just for additional cflags, ldflags & cmake arguments. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210413110508.5901-3-michael.livshin@scylladb.com>	2021-04-14 13:15:59 +02:00
Michael Livshin	5cb4005e84	build: extend configure.py's subprocess environment properly The `env` parameter to `subprocess.Popen()` and friends, when it is not `None`, is not an addition to the subprocess environment but the _whole_ subprocess environment. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210413110508.5901-2-michael.livshin@scylladb.com>	2021-04-14 13:15:59 +02:00
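The `env` pitfall described above is commonly avoided by copying the parent environment and then adding to it. A minimal sketch; the helper names and the `HYPOTHETICAL_FLAG` variable are illustrative and do not come from configure.py.

```python
import os
import subprocess
import sys


def extended_env(extra):
    """Return a copy of the current environment with `extra` merged in."""
    env = os.environ.copy()  # start from everything the parent process has
    env.update(extra)        # then add or override the extra variables
    return env


def run_with_extra_env(cmd, extra):
    """Run `cmd` with the inherited environment plus `extra`.

    Passing env=extra alone would drop PATH, HOME and everything else,
    because `env` replaces the whole subprocess environment.
    """
    return subprocess.run(cmd, env=extended_env(extra),
                          capture_output=True, text=True, check=True)


result = run_with_extra_env(
    [sys.executable, "-c", "import os; print(os.environ['HYPOTHETICAL_FLAG'])"],
    {"HYPOTHETICAL_FLAG": "1"},
)
print(result.stdout.strip())
```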
Avi Kivity	b756693e64	Merge "mutation_query: move query methods into table" from Botond " These methods are generic ways to query a mutation source. At least they used to be, but nowadays they are pretty specific to how tables are queried -- they use a querier cache to lookup queriers from and save them into. With the coming changes to how permits are obtained, they are about to get even more specific to tables. Instead of forcing the genericity and keep adding new parameters, this patchset bites the bullet and moves them to table. `data_query()` is inlined into `table::query()`, while `mutation_query()` is replaced with `table::mutation_query()`. The only other users besides table are tests and they are adjusted to use similarly named local methods that just combine the right querier with the right result builder. This combination is what the tests really want to test, as this is also what is used by the table methods behind the scenes. Tests: unit(release, debug) " * 'mutation-query-move-query-methods-into-table/v1' of https://github.com/denesb/scylla: mutation_query: remove now unused mutation_query() test: mutation_query_test: use local mutation_query() implementation database: mutation_query(): use table::mutation_query() table: add mutation_query() query: remove the now unused data_query() test: mutation_query_test: use local data_query() implementation table: query(): inline data_query() code into query() table: make query() a coroutine	2021-04-14 13:15:59 +02:00
Pekka Enberg	2b6438c044	generic_server: Rename "maybe_idle" to "maybe_stop"	2021-04-13 14:13:24 +03:00
Pekka Enberg	66276d6636	generic_server: API documentation for connection and server classes	2021-04-13 14:13:24 +03:00
Pekka Enberg	16f262b852	transport, redis: Use generic server::listen() Let's pull up cql_server listen() to generic_server::server base class and convert redis_server to use it.	2021-04-13 14:13:24 +03:00
Pekka Enberg	6c619e4462	transport/server: Remove "redis_server" prefix from logging The logger itself has the name "redis_server" that appears in the logs.	2021-04-13 13:57:22 +03:00
Pekka Enberg	7ef3c60864	transport/server: Remove "cql_server" prefix from logging The logger itself has the name "cql_server" that appears in the logs.	2021-04-13 13:57:22 +03:00
Pekka Enberg	f560b3daa3	generic_server: Remove unneeded static_pointer_cast<> Now that do_accepts() is in generic_server, we can get rid of the static_pointer_cast<>.	2021-04-13 13:57:22 +03:00
Pekka Enberg	ac90a8ea50	transport, redis: Use generic server::do_accepts() The cql_server and redis_server share the same ancestor of do_accepts(). Let's pull up the cql_server version of do_accept() (that has more functionality) to generic_server::server and use it in the redis_server too.	2021-04-13 13:57:21 +03:00
Pekka Enberg	3689db26fc	transport, redis: Use generic server::process() Pull up the cql_server process() to base class and convert redis_server to use it. Please note that this fixes EPIPE and connection reset issue in the Redis server, which was fixed in the CQL server in commit `1a8630e6a` ("transport: silence "broken pipe" and "connection reset by peer" errors").	2021-04-13 13:56:45 +03:00
Pekka Enberg	ef39216667	redis: Move Redis specific code to handle_error() This moves the Redis specific error handling to handle_error() to make process() more generic in preparation for move to generic_server.	2021-04-13 13:56:45 +03:00
Pekka Enberg	66d6899727	transport: Move CQL specific error handling to handle_error() This moves the CQL specific error handling to handle_error() to make process() more generic in preparation for move to generic_server.	2021-04-13 13:56:45 +03:00
Pekka Enberg	ab339cfaf7	transport, redis: Move connection tracking to generic_server::server class The cql_server and redis_server classes have identical connection tracking code. Pull it up to the generic_server::server base class.	2021-04-13 13:56:45 +03:00
Pekka Enberg	deac5b1810	transport, redis: Move _stopped and _connections_list to generic_server::server class The cql_server and redis_server both have the same "_stopped" and "_connections_list" member variables. Pull them up to the generic_server::server base class.	2021-04-13 13:56:45 +03:00
Pekka Enberg	1af73bec7b	transport, redis: Move total_connections to generic_server::server class Both cql_server and redis_server have the same "total_connections" member variable so pull that up to the generic_server::server base class.	2021-04-13 13:56:45 +03:00
Pekka Enberg	7b46c2da53	transport, redis: Use generic server::maybe_idle() The cql_server and redis_server classes have a maybe_idle() method, which sets the _all_connections_stopped promise if server wants to stop and can be stopped. Pull up the duplicated code to generic_server::server class.	2021-04-13 13:56:45 +03:00
Pekka Enberg	4664a55e05	transport, redis: Move list_base_hook<> inheritance to generic_server::connection Both cql_server::connection and redis_server::connection inherit boost::intrusive::list_base_hook<>, so let's pull up that to the generic_server::connection class that both inherit.	2021-04-13 13:56:45 +03:00
Pekka Enberg	19507bb7ea	transport, redis: Use generic connection::shutdown() This patch moves the duplicated connection::shutdown() method to a new generic_server::connection base class that is now inherited by cql_server and redis_server.	2021-04-13 13:56:44 +03:00
Tomasz Grabiec	163f2be277	Merge 'Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader' from Piotr Jastrzębski It is possible that a partition is in cache but is not present in sstables that are underneath. In such case: 1. cache_flat_mutation_reader will fast forward underlying reader to that partition 2. The underlying reader will enter the state when it's empty and its is_end_of_stream() returns true 3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader 4. This PR fixes that Test: unit(dev) Fixes #8435 Fixes #8411 Closes #8437 * github.com:scylladb/scylla: row_cache: remove redundant check in make_reader cache_flat_mutation_reader: fix do_fill_buffer read_context: add _partition_exists read_context: remove skip_first_fragment arg from create_underlying read_context: skip first fragment in ensure_underlying	2021-04-13 00:45:10 +02:00
Piotr Jastrzebski	cb3dbb1a4b	row_cache: remove redundant check in make_reader This check is always true because a dummy entry is added at the end of each cache entry. If that wasn't true, the check in else-if would be an UB. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 21:12:33 +02:00
Piotr Jastrzebski	1f644df09d	cache_flat_mutation_reader: fix do_fill_buffer Make sure that when a partition does not exist in underlying, do_fill_buffer does not try to fast forward within this nonexistent partition. Test: unit(dev) Fixes #8435 Fixes #8411 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 21:08:40 +02:00
Piotr Jastrzebski	ceab5f026d	read_context: add _partition_exists This new state stores the information whether current partition represented by _key is present in underlying. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 20:57:20 +02:00
Piotr Jastrzebski	b3b68dc662	read_context: remove skip_first_fragment arg from create_underlying All callers pass false for its value so no need to keep it around. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 19:51:06 +02:00
Piotr Jastrzebski	088a02aafd	read_context: skip first fragment in ensure_underlying This was previously done in create_underlying but ensure_underlying is a better place because we will add more related logic to this consumption in the following patches. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 19:46:04 +02:00
Avi Kivity	fcc17d43a6	treewide: correct mislicensed source files alternator/expressions.g had both AGPL and proprietary licensing. The proprietary one is removed. gms/inet_address_serializer.hh had only a proprietary license; it is replaced by the AGPL. Fixes #8465. Closes #8466	2021-04-12 17:42:59 +03:00
Avi Kivity	e3db889057	Merge 'Introduce service levels' from Piotr Sarna This series introduces service level syntax borrowed from https://docs.scylladb.com/using-scylla/workload-prioritization/ , but without workload prioritization itself - just for the sake of using identical syntax to provide different parameters later. The new parameters may include: * per-service-level timeouts * oltp/olap declaration, which may change the way Scylla treats long requests - e.g. time them out (the oltp way) or keep them sustained with empty pages (the olap way) Refs #7617 Closes #7867 * github.com:scylladb/scylla: transport: initialize query state with service level controller main: add initializing service level data accessor service: make enable_shared_from_this inheritance public cql3: add SERVICE LEVEL syntax (without an underscore) unit test: Add unit test for per user sla syntax cql: Add support for service level cql queries auth: Add service_level resource for supporting in authorization of cql service_level cql: Support accessing service_level_controller from query state instantiate and initialize the service_level_controller qos: Add a standard implementation for service level data accessor qos: add waiting for the updater future service/qos: adding service level controller service_levels: Add documentation for distributed tables service/qos: adding service level table to the distributed keyspace service/qos: add common definitions auth: add support for role attributes	2021-04-12 17:34:43 +03:00
Piotr Sarna	26ee6aa1e9	transport: initialize query state with service level controller Query state should be aware of the service level controller in order to properly serve service-level-related CQL queries.	2021-04-12 16:31:27 +02:00
Piotr Sarna	32bcbe59ad	main: add initializing service level data accessor The accessor must be set up in order to be able to use statement related to service level management.	2021-04-12 16:31:27 +02:00
Piotr Sarna	3626bc253d	service: make enable_shared_from_this inheritance public Without being public, making shared pointer from the service level accessor is not accessible outside of the class.	2021-04-12 16:31:27 +02:00
Piotr Sarna	c7f66d6fdd	cql3: add SERVICE LEVEL syntax (without an underscore) In order for the syntax to be more natural, it's now possible to use SERVICE LEVEL instead of SERVICE_LEVEL in all appropriate places. The old syntax is supported as well.	2021-04-12 16:31:27 +02:00
Eliran Sinvani	144fe02c23	unit test: Add unit test for per user sla syntax This commit adds the infrastructure needed to test per user sla, more specifically, a service level accessor that triggers the update_service_levels_from_distributed_data function upon any change to the distributed sla data. A test was added that indirectly consumes this infrastructure by changing the distributed service level data with cql queries. Message-Id: <23b2211e409446c4f4e3e57b00f78d9ff75fc978.1609249294.git.sarna@scylladb.com>	2021-04-12 16:31:26 +02:00
Eliran Sinvani	2701481cbc	cql: Add support for service level cql queries This patch adds support for new service level cql queries. The queries implemented are: CREATE SERVICE_LEVEL [IF NOT EXISTS] <service_level_name> ALTER SERVICE_LEVEL <service_level_name> WITH param = <something> DROP SERVICE_LEVEL [IF EXISTS] <service_level_name> ATTACH SERVICE_LEVEL <service_level_name> TO <role_name> DETACH SERVICE_LEVEL FROM <role_name> LIST SERVICE_LEVEL <service_level_name> LIST ALL SERVICE_LEVELS LIST ATTACHED SERVICE_LEVEL OF <role_name> LIST ALL ATTACHED SERVICE_LEVELS	2021-04-12 16:30:01 +02:00
Eliran Sinvani	a88929da15	auth: Add service_level resource for supporting in authorization of cql service_level queries In order to be able to manage service_level configuration one must be authorized to do so, or be a superuser. This commit adds support for the service_levels resource. Since service_levels are relative, reconfiguring one service level is not localized to that service level and will affect the QOS for all of the service levels, so there is not much sense in granting permissions to manage individual service_levels. This is why only a root resource named service_levels, which represents all service levels, is used. This commit also implements the unit test additions for the newly introduced resource. Message-Id: <81ab16fa813b61be117155feea405da6266921e3.1609237687.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Eliran Sinvani	f78707d3fb	cql: Support accessing service_level_controller from query state In order to implement service level cql queries, the queries objects needs access to the service_level_controller object when processing. This patch adds this access by embedding it into the query state object. In order to accomplish the above the query processor object needs an access to service_level_controller in order to instantiate the query state. Message-Id: <68f5a7796068a49d9cd004f1cbf34bdf93b418bc.1609234193.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Eliran Sinvani	e173eaa032	instantiate and initialize the service_level_controller This patch adds the initialization of service_level_controller. It constructs the distributed service and start the watch loop for distributed data changes. Message-Id: <e97661194833d576aa39b3e7886366590f272612.1609175402.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Eliran Sinvani	8493e19840	qos: Add a standard implementation for service level data accessor service_level_controller defines an interface for accessing the service level distributed data, this patch implements a standard implementation of the interface that delegates to the system distributed keyspace. Message-Id: <25e68302f6f4d4fe5fcb66ea19159ad68506ba64.1609175314.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Piotr Sarna	41951d34ad	qos: add waiting for the updater future The distributed data updater used to spawn a future without waiting for it. It was quite safe, since the future had its own abort source, but it's better to remember it and wait for it during stop() anyway.	2021-04-12 16:01:04 +02:00
Eliran Sinvani	a54ea4667b	service/qos: adding service level controller Adding the service level controller implementation. The implementation follows the design in: https://docs.google.com/document/d/1RrSTZ3ZX86-YDt2POwAVwFeKN9uX8frEvATJda5n1FU/edit?usp=sharing Some interfaces were added for registration with system components. The method of registration is chosen over a constructor parameter, due to the components being initialized prior to the service level controller being created. Message-Id: <e9c4e7d5b411062b6a553f5c6861e7875cd71d2c.1609171761.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Eliran Sinvani	3ecdab30a1	service_levels: Add documentation for distributed tables This patch adds documentation for the distributed tables used for service_level feature and their meaning and usage. Message-Id: <5b7d2be166c2381ed33094b4545fafe0f142583f.1609170862.git.sarna@scylladb.com>	2021-04-12 16:01:03 +02:00
Eliran Sinvani	dd74556ad9	service/qos: adding service level table to the distributed keyspace This patch adds the service level table and functions to manipulate it to the distributed keyspace. Message-Id: <b6cb7f311ac1ee6802d8f3d78eac9cf40fe21f68.1609161341.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00
Eliran Sinvani	4fea0762c2	service/qos: add common definitions Adding common definitions that will be used by the performance isolation classes. Mainly defines the common ground for configuring a service level through the service level options structure. Message-Id: <12476f4a8e21af3a4c7a892683940698f3beacce.1609160860.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00
Eliran Sinvani	23e889d710	auth: add support for role attributes In the general case, roles might come with attributes attached to them. These attributes can originate in mechanisms such as LDAP, where each entity in the underlying directory can have a key:value data structure. This patch adds support for such attributes in the role manager interface. It also implements the attribute support in the standard role manager in the form of a table with an attribute map in the distributed system keyspace. Message-Id: <f53c74a7ac315c4460ff370ea6dbb1597821edc2.1609158013.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00
Ivan Prisyazhnyy	0836efd830	tracing: test/boost/tracing: fix use after free Fixes AddressSanitizer: stack-buffer-underflow on address 0x7ffd9a375820 at pc 0x555ac9721b4e bp 0x7ffd9a374e70 sp 0x7ffd9a374620 The backend registry holds a unique pointer to the backend implementation, which must outlive the whole tracing lifetime until the shutdown call. So it must be captured/moved before the program exits its scope, by passing it down the lambda chain. Regarding deletion of the default destructor: moving the object requires a move constructor (for do_with), which is not implicitly provided if a user-defined destructor is declared, even though its implementation is defaulted. Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com> Closes #8461	2021-04-12 16:44:07 +03:00
Avi Kivity	bad4924868	Merge 'Add a ninja help build target' from Pekka Enberg This pull request adds a "ninja help" build target in hopes of making the different build targets more discoverable to developers. Closes #8454 * github.com:scylladb/scylla: building.md: Document "ninja help" target configure.py: "ninja help" target building.md: Document "ninja <mode>-dist" target configure.py: Add <mode>-dist target as alias for dist-<mode>	2021-04-12 16:30:37 +03:00
Avi Kivity	80529f7097	Revert "nonroot: generate scylla_sysconfdir.py correctly" This reverts commit `e991e01f2e`. It breaks installation on CentOS 7. Fixes #8456.	2021-04-12 16:19:39 +03:00
Gleb Natapov	9fdb3d3d98	raft: stop using seastar::pipe to pass log entries to apply_fiber Stop using seastar::pipe and use seastar::queue directly to pass log entries to apply_fiber. The pipe is a layer above queue anyway, and it adds functionality that we do not need (EOS) and hides functionality that we do (being able to abort()). This fixes a crash during abort where the pipe was used after being destroyed. Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>	2021-04-12 13:18:03 +02:00
Avi Kivity	a24771125e	Update seastar submodule * seastar 1c1f610ceb...d2dcda96bb (3): > closeable: add with_closeable and with_stoppable helpers > circleci: relax concurrency of the build process > logger: failed_to_log: print source location and format string	2021-04-12 12:52:01 +03:00
Raphael S. Carvalho	224120f7df	sstables: rewrite compound_sstable_set::all() Procedure is rewritten using std::partition, making it easier to maintain and it also fixes a theoretical quadratic behavior because list is entirely copied when extending it, which isn't harmful because maintenance set will be rarely populated and there are only 2 sets at most. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210409171412.57729-1-raphaelsc@scylladb.com>	2021-04-12 12:45:43 +03:00
Piotr Sarna	d77eb39076	Merge 'cdc: log: avoid linearizations' from Michał Chojnowski CDC log uses `bytes` to deal with cells and their values, and linearizes all values indiscriminately. This series makes a switch from `bytes` to `managed_bytes` to avoid that linearization. Fixes #7506. Closes #8429 * github.com:scylladb/scylla: cdc: log: change yet another occurence of `bytes` to `managed_bytes` cdc: log: switch the remaining usages of `bytes` to `managed_bytes` in collection_visitor cdc: log: change `deleted_elements` in log_mutation_builder from bytes to managed_bytes cdc: log: rewrite collection merge to use managed_bytes instead of bytes cdc: log: don't linearize collections in get_preimage_col_value cdc: log: change return type of get_preimage_col_value to managed_bytes cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell cdc: log: switch cell_map from bytes to managed_bytes cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view cdc: log: don't linearize the primary key in log_mutation_builder atomic_cell: add yet another variant of make_live for managed_bytes_view compound: add explode_fragmented	2021-04-12 10:56:12 +02:00
Avi Kivity	bd16e98019	expr: give a name to a tuple of columns Right now, binary_operator::lhs is a variant<column_value, std::vector<column_value>, token>. The role of the second branch (a vector of column values) is to represent a tuple of columns (e.g. "WHERE (a, b, c) = ?"), but this is not clear from the type name. Introduce a wrapper type around the vector, column_value_tuple, to make it clear we're dealing with tuples of CQL references (a column_value is really a column_ref, since it doesn't actually contain any value). Closes #8208	2021-04-12 09:40:16 +02:00
Pekka Enberg	d34571dfd9	building.md: Document "ninja help" target	2021-04-12 10:35:02 +03:00
Pekka Enberg	698710598a	configure.py: "ninja help" target This adds a "help" build target, which prints out important build targets. The printing is done in a separate shell script, because "ninja" insists on printing out the "command" before executing it, which makes the help text unreadable.	2021-04-12 10:35:02 +03:00
Kamil Braun	7ffb0d826b	clustering_order_reader_merger: handle empty readers The merger could return end-of-stream if some (but not all) of the underlying readers were empty (i.e. not even returning a `partition_start`). This could happen in places where it was used (`time_series_sstable_set::create_single_key_sstable_reader`) if we opened an sstable which did not have the queried partition but passed all the filters (specifically, the bloom filter returned a false positive for this sstable). The commit also extends the random tests for the merger to include empty readers and adds an explicit test case that catches this bug (in a limited scope: when we merge a single empty reader). It also modifies `test_twcs_single_key_reader_filtering` (regression test for #8432) because the time where the clustering key filter is invoked changes (some invocations move from the constructor of the merger to operator()). I checked manually that it still catches the bug when I reintroduce it. Fixes #8445. Closes #8446	2021-04-12 10:34:52 +03:00
Pekka Enberg	e77c7f4543	building.md: Document "ninja <mode>-dist" target Let's document the new "dist-<mode>" to encourage people to use it.	2021-04-12 10:31:46 +03:00
Pekka Enberg	e959c90af8	configure.py: Add <mode>-dist target as alias for dist-<mode> The build and test build targets put "mode" as prefix, so let's unify the dist target too in preparation for "ninja help".	2021-04-12 10:29:54 +03:00
Michael Livshin	09f221203f	build: tolerate ./build being a symbolic link Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210411122951.14196-1-michael.livshin@scylladb.com>	2021-04-12 10:08:56 +03:00
Avi Kivity	9bc45d9243	build: drop lld from install-dependencies.sh on s390x lld is not available any more on s390x. Since it's optional, we can just drop it on that platform. Closes #8430	2021-04-12 09:46:33 +03:00
Nadav Har'El	2932f20b40	cql-pytest: translate Cassandra's reproducers for issue #2963 This is a translation of Cassandra's CQL unit test source file validation/entities/SecondaryIndexOnStaticColumnTest.java into our cql-pytest framework. This test file checks various features of indexing (with secondary index) static rows. All these tests pass on Cassandra, but fail on Scylla because of issue #2963 - we do not yet support indexing of a static row. The failing tests currently fail as soon as they try to create the index, with the message: "Indexing static columns is not implemented yet." Refs #2963. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210411153014.311090-1-nyh@scylladb.com>	2021-04-12 08:11:35 +02:00
Nadav Har'El	989589b570	test/cql-pytest,alternator,redis: avoid an annoying warning This patch avoids an annoying warning Warning: Unknown config ini key: flake8-ignore when running one of the pytest-based test projects (cql-pytest, alternator and redis) on recent versions of pytest. In commit `2022da2405`, we added to the toplevel Scylla directory a "tox.ini" file with some intention to configure Python syntax checking. One of the configurations in this tox.ini is: [pytest] flake8-ignore = E501 It turns out that pytest, if a certain test directory does not have its own pytest.ini file, looks up in ancestor directory for various configuration files (the configuration file precedence is described in https://docs.pytest.org/en/stable/customize.html), and this includes this tox.ini configuration section. Recent versions of pytest complain about the "flake8-ignore" configuration parameter, which they don't recognize. This parameter may be ok (?) if you install a flake8 pytest plugin, but we do not require users to do this for running these tests. Moreover, whatever noble intentions this commit and its tox.ini had, nobody ever followed up on it. The three pytest-based test directories never adhered to flake8's recommended syntax, and never intended to do so. None of the developers of these tests use flake8, or seem to wish to do so. If this ever changes, we can change the pytest.ini or undo this commit and go back to a top-level tox.ini, but I don't see this happening anytime soon. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210411085708.300851-1-nyh@scylladb.com>	2021-04-12 08:04:06 +02:00
Avi Kivity	35a3d65ee7	install.sh: document pathname components install.sh supports two different ways of redirecting paths: --root for creating a chroot-style tree, and --prefix for changing the installed file location. Document them. Closes #8389	2021-04-11 21:03:57 +03:00
Avi Kivity	ec3db140cb	utils: data_input: replace enable_if with tightened concept std::is_fundamental isn't a good constraint since it include nullptr_t and void. Replace with std::integral which is sufficient. Use a concept instead of enable_if to simplify the code. Closes #8450	2021-04-11 18:56:21 +03:00
Nadav Har'El	d5121d1476	scripts/refresh-submodules.sh: allow choosing which submodule to refresh Currently, scripts/refresh-submodules.sh always refreshes all submodules, i.e., takes the latest version of all of them and commits it. But sometimes, a committer only wants to refresh a specific submodule, and doesn't want to deal with the implications of updating a different one. As a recent example, for issue #8230, I wanted to update the tools/java submodule, which included a fix for sstableloader, without updating the Seastar submodule - which contained completely irrelevant changes. So in this patch we add the ability to override the default list of submodules that refresh-submodules.sh uses, with one or more command line parameters. For example: scripts/refresh-submodules.sh tools/java will update only tools/java. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210411151421.309483-1-nyh@scylladb.com>	2021-04-11 18:35:04 +03:00
Benny Halevy	705f9c4f79	commitlog: segment_manager: max_size must be aligned This was triggered by the test_total_space_limit_of_commitlog dtest. When it passes a very large commitlog_segment_size_in_mb (1/6th of the free memory size, in mb), segment_manager constructor limits max_size to std::numeric_limits<position_type>::max() which is 0xffffffff. This causes allocate_segment_ex to loop forever when writing the segment file since `dma_write` returns 0 when the count is unaligned (seen 4095). The fix here is to select a slightly smaller max_size that is aligned down to a multiple of 1MB. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>	2021-04-11 13:17:50 +03:00
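The align-down arithmetic described in the commit above can be sketched as follows. This is an illustration only, not the actual commitlog code: `align_down` and `clamp_max_size` are hypothetical names, and the 0xffffffff limit and 1MB alignment are taken from the message.

```python
MIB = 1024 * 1024
# The message gives std::numeric_limits<position_type>::max() as 0xffffffff.
POSITION_MAX = 0xFFFFFFFF


def align_down(value, alignment):
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)


def clamp_max_size(requested_bytes, limit=POSITION_MAX, alignment=MIB):
    """Hypothetical helper: cap the requested size at the limit,
    then align the result down so it is a multiple of the alignment."""
    return align_down(min(requested_bytes, limit), alignment)


if __name__ == "__main__":
    # A very large request is capped and aligned instead of ending in 0xfffff.
    print(hex(clamp_max_size(10**12)))
```

Clamping alone would leave a size ending in 0xfffff, which is not a multiple of 1MB; aligning down after the clamp is the step the message describes as the fix.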
Avi Kivity	3814f74a74	Update seastar submodule * seastar caba9fda34...1c1f610ceb (3): > scripts/perftune.py: allow configuring disks write cache mode > test: file_utils: tmp_dir_do_with_fail_remove_test: rename inner tmp_dir to trigger error > circleci: switch to dedicated machine	2021-04-11 13:13:53 +03:00
Raphael S. Carvalho	5c630f405a	table: introduce trigger_offstrategy_compaction() this function will be used on repair-based operation completion, to notify table about the need to start offstrategy compaction process on the maintenance sstables produced by the operation. Function which notifies about bootstrap and replace completion is changed to use this new function. Removenode and decommission will reuse this function. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-09 14:53:14 -03:00
Raphael S. Carvalho	f60f32f7fa	repair/row_level: make operations_supported static const Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-04-09 14:42:10 -03:00
Tomasz Grabiec	305372820d	Merge "Make position_in_partition::tri_compare use strong_ordering" from Pavel Emelyanov There are some users of that tri_comparator which are also converted to strong_ordering. Most of the code using those is, in turn, already handling return values interchangeably. The bound_view::tri_compare, which's used by the guy, is still returning int. tests: unit(dev) * xemul/br-position-tri-compare: code: Relax position_in_partition::tri_compare users position_in_partition: Convert tri_compare to strong_ordering test: Convert clustering_fragment_summary::tri_cmp to strong_ordering repair: Convert repair_sync_boundary::tri_compare to strong_ordering view: Don't expect int from position_in_partition::tri_compare	2021-04-09 17:54:38 +02:00
Pavel Emelyanov	64074f45ce	code: Relax position_in_partition::tri_compare users There are some pieces left doing res <=> 0 with the res now being a strong_ordering itself. All these can be just dropped. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Pavel Emelyanov	92e72c62dc	position_in_partition: Convert tri_compare to strong_ordering All its users are now ready to accept both - int and the strong_ordering value, so the change is pretty straightforward. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Pavel Emelyanov	a15f158661	test: Convert clustering_fragment_summary::tri_cmp to strong_ordering Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Pavel Emelyanov	ba4699ffca	repair: Convert repair_sync_boundary::tri_compare to strong_ordering The change partially reverts `37855641` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Pavel Emelyanov	70c851e69b	view: Don't expect int from position_in_partition::tri_compare Now it's int, but soon will be std::strong_ordering. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Tomasz Grabiec	20af98c61b	Merge "Tweak partition_snapshot_row_cursor" from Pavel Emelyanov The set puts the partition_snapshot_row_cursor on a diet: 320 -> 224 bytes, makes use of btree API sugar to save some CPU cycles and compacts the code. tests: unit(dev) * xemul/br-row-cursor-cleanups-2: partition_snapshot_row_cursor: Rewrite row() with consume_row() partition_snapshot_row-cursor: Add const consume_row() version partition_snapshot_row_cursor: Add concept to .consume_row() partition_snapshot_row_cursor: Don't carry end iterators partition_snapshot_row_cursor: Move cells hash creation to reader partition_snapshot_row_cursor: Move read_partition into test partition_snapshot_row_cursor: Move is_in_latest_version inline partition_snapshot_row_cursor: Use is_in_latest_version where appropriate partition_snapshot_row_cursor: Less dereferences in key() method partition_snapshot_row_cursor: Update change mark in prepare_heap partition_snapshot_row_cursor: Clear current row when recreating partition_snapshot_row_cursor: Use btree::lower_bound sugar partition_snapshot_row_cursor: Factor out next() and erase_and_advance() partition_snapshot_row_cursor: Relax vector of iterators btree: Add operator bool() clustering_row: Add new .apply() overload	2021-04-09 14:51:24 +02:00
Botond Dénes	54edd613c8	mutation_query: remove now unused mutation_query() If somebody wants to query a generic mutation source in the future, they can still do it via `mutation_querier::consume_page()` and the right result builder.	2021-04-09 13:40:27 +03:00
Botond Dénes	3dbb456fba	test: mutation_query_test: use local mutation_query() implementation Add a local `mutation_query()` variant, which only contains the pieces of logic the test really wants to test: invoking `mutation_querier::consume_page()` with a `reconcilable_result_builder`. This allows us to get rid of the now otherwise unused `mutation_query()`.	2021-04-09 13:40:27 +03:00
Botond Dénes	80a03826e3	database: mutation_query(): use table::mutation_query() Instead of `mutation_query()` from `mutation_query.hh`. The latter is about to be retired as we want to migrate all users to `table::mutation_query()`. As part of this change, move away from `mutation_query_stage` too. This brings the code paths of the two query variants closer together, as they both have an execution stage declared in `database`.	2021-04-09 13:40:27 +03:00
Botond Dénes	5c8f142fe5	table: add mutation_query() We want to migrate `database::mutation_query()` off `mutation_query()` to use `table::mutation_query()` instead. The reason is the same as for making `table::query()` standalone: the `mutation_query()` implementation increasingly became specific to how tables are queried and is about to become even more specific due to impending changes to how permits are obtained. As no-one in the codebase is doing generic mutation queries on generic mutation sources we can just make this a member of table. This patch just adds `table::mutation_query()`, no user exists yet. `table::mutation_query()` is identical to `mutation_query()`, except that it is a coroutine.	2021-04-09 13:40:27 +03:00
Botond Dénes	a4facf316d	query: remove the now unused data_query() If somebody wants to query a generic mutation source in the future, they can still do it via `data_querier::consume_page()` and the right result builder.	2021-04-09 13:40:27 +03:00
Botond Dénes	59ea36731b	test: mutation_query_test: use local data_query() implementation The test only wants to test result size calculation so it doesn't need the whole `data_query()` logic. Replace the call to `data_query()` with one to a local alternative which contains just the necessary bits -- invoking `data_querier::consume_page()` with the right result builder. This allows us to get rid of the now otherwise unused `data_query()`.	2021-04-09 13:40:27 +03:00
Botond Dénes	c3f0681011	table: query(): inline data_query() code into query() `data_query()` is now just a thin wrapper over `data_querier::consume_page()`. Furthermore, contrary to the old data query method, it is not a generic way of querying a mutation source, it is now closely tied to how we query tables. It does a querier lookup and save. In the future we plan on tying it even closer to the table in how permits are obtained. For this reason it is better to just inline it into the `query()` method which invokes it.	2021-04-09 13:40:27 +03:00
Pavel Emelyanov	89eece3aca	partition_snapshot_row_cursor: Rewrite row() with consume_row() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	ae6b677f9a	partition_snapshot_row-cursor: Add const consume_row() version It's the same as the existing one, but doesn't modify anything (cursor and pointing rows_entry's) and calls consumer with const row reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	5e28075ec0	partition_snapshot_row_cursor: Add concept to .consume_row() Nothing special here Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	d891cfe6cd	partition_snapshot_row_cursor: Don't carry end iterators The btree's iterator can be checked to reach the tree's end without holding the ending iterator itself. This makes the whole p_s_r_c 20% smaller (288 bytes -> 224 bytes) since it now keeps 4 extra iterators on-board -- inside small vectors for heap and current_row. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	4558eb3afc	partition_snapshot_row_cursor: Move cells hash creation to reader Right now call to .row() method may create hash on row's cells. It's counterintuitive to see a const method that transparently changes something it points to. Since the only caller of a row() who knows whether the hash creation is required is the cache reader, it's better to move the call to prepare_hash() into it. Other than making the .row() less surprising this also helps to get rid of the whole method by the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	00caf5f219	partition_snapshot_row_cursor: Move read_partition into test The method in question is test-only helper, there's no need in keeping it as a part of the API. Another reason to move is that the method is O(number of rows) and doesn't preempt while looping, but cursor code users try hard not to stall the reactor. So even though this method has a meaningful semantics within the class, it will better be reinvented if needed in core code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:16:13 +03:00
Avi Kivity	ad10a6a220	Update seastar submodule * seastar fcd46c138...caba9fda3 (5): > file: mark overlayfs as not supporting RWF_NOWAIT > dns: fix tcp sendv return value to c-ares Fixes #8442. > test: closeable: allocate variables accessed by continuations using do_with > test: Fix leak in io_queue_test > test: rpc_test: reduce memory usage in compression tests	2021-04-09 11:48:50 +03:00
Pavel Emelyanov	9f323355a6	partition_snapshot_row_cursor: Move is_in_latest_version inline The method is currently defined outside of the class which gives compiler less chances to really inline it when needed. Also, keeping this simple piece of code inline is less code to read (and compile). Mark the guy noexcept while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	cc57e35c6a	partition_snapshot_row_cursor: Use is_in_latest_version where appropriate Checking for _current_row[0].version being 0 (or not being 0) is better understood if done with a well named existing helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	353a8f66a2	partition_snapshot_row_cursor: Less dereferences in key() method The valid cursor's key is kept on the _position as well, but getting it from there is 1 dereference less: _current_row -()-> row -> key _position -()-> std::optional -> key iterator's -> is pointer dereference ** std::optional is designed not to be a pointer Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	353a1306ce	partition_snapshot_row_cursor: Update change mark in prepare_heap The heap's iterators validity is checked with the change mark, which is updated every time heap is recreated. Factor these updates out and keep the mark together with the heap it protects. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	1a1f05f50b	partition_snapshot_row_cursor: Clear current row when recreating The cursor keeps current row in a separate vector of iterators and reconstructs it in a dedicated method, which _expects_ that the vector is empty on entry. It's better to keep the logic of current row construction in one place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	2edd072d27	partition_snapshot_row_cursor: Use btree::lower_bound sugar When checking if the lower-bound entry matched the search key it's possible to avoid extra comparison with the help of the collection used to store the rows (btree). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	9aee0ad8b3	partition_snapshot_row_cursor: Factor out next() and erase_and_advance() Both helpers do the same -- advance the cursor to the next row. The latter may additionally remove the row from the uniquely owned version. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Pavel Emelyanov	2fb0f7315c	partition_snapshot_row_cursor: Relax vector of iterators The cursor maintains a vector of iterators that correspond to each of the versions scanned. However, only the iterator in the latest one is really needed, so the whole vector can be reduced down to an optional. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 11:45:45 +03:00
Botond Dénes	b03f360bb0	table: make query() a coroutine This method is very hard to read or modify in its current form due to all the continuation-chain boilerplate. Make it a coroutine to facilitate future changes in the next patches but not just.	2021-04-09 11:04:35 +03:00
Pavel Emelyanov	26e27e27e8	btree: Add operator bool() The btree's iterators allow for simple checking for '== tree.end()' condition. For this check neither the tree itself, nor the ending iterator is required. One just needs to check if the _idx value is the npos. One additional change to make it work is required -- when removing an entry from the inline node the _idx should be set to npos. This change is, well, a bugfix. An iterator left with 0 in _idx is treated as a valid one. However, the bug is non-triggerable. If such an "invalid" iterator is compared against tree.end() the check would return true, because the tree pointers would coincide. So this patch adds an operator bool() to btree iterator to facilitate simpler checking if it reached the end of the collection or not. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 10:05:47 +03:00
Pavel Emelyanov	772fe2b089	clustering_row: Add new .apply() overload The clustering_row is a wrapper over the deletable_row and facilitates the apply-creation of the latter from some other objects. Soon it will accept the deletable_row itself for apply()-ing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 10:05:47 +03:00
Benny Halevy	830128cd95	streaming: stream_session: do not log err.c_str verbatim It is dangerous to print a formatted string as is, like sslog.warn(err.c_str()) since it might hold curly braces ('{}') and those require respective runtime args. Instead, it should be logged as e.g. sslog.warn("{}", err.c_str()). This will prevent issues like #8436. Refs #8436 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210408173048.124417-2-bhalevy@scylladb.com>	2021-04-09 08:36:49 +03:00
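The hazard described in the commit above, passing an already-formatted message as the format string, has a close analogue in Python's `str.format`. The sketch below is only an analogy and does not use the Seastar logger: `render` is a hypothetical stand-in for a logger that treats its first argument as a format string.

```python
def render(fmt, *args):
    """Stand-in for a formatting logger: the first argument is a format string."""
    return fmt.format(*args)


def unsafe_render_fails(message):
    """Return True if using the message itself as the format string raises."""
    try:
        render(message)
    except (IndexError, KeyError, ValueError):
        return True
    return False


def safe_render(message):
    """Pass the message as an argument to a constant "{}" format string."""
    return render("{}", message)


if __name__ == "__main__":
    err = "stream failed for range {}"
    print(unsafe_render_fails(err))  # braces with no matching argument
    print(safe_render(err))          # message is emitted verbatim
```

With a constant `"{}"` format string, whatever braces the message happens to contain are treated as data rather than as placeholders.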
Benny Halevy	76cd315c42	streaming: stream_session: do not escape curly braces in format strings Those turn into '{}' in the formatted strings and trigger a logger error in the following sslog.warn(err.c_str()) call. Fixes #8436 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210408173048.124417-1-bhalevy@scylladb.com>	2021-04-09 08:36:49 +03:00
Gleb Natapov	b9175edea4	raft: test: check that a server with id zero cannot be neither created nor added to a config Message-Id: <20210407134853.1964226-2-gleb@scylladb.com>	2021-04-08 17:07:18 +02:00
Gleb Natapov	fb938a36d4	raft: disallow adding and creating servers with id zero Id zero has special meaning in the code and cannot be valid server id. Message-Id: <20210407134853.1964226-1-gleb@scylladb.com>	2021-04-08 17:07:18 +02:00
Kamil Braun	3687757115	sstables: fix TWCS single key reader sstable filter The filter passed to `min_position_reader_queue`, which was used by `clustering_order_reader_merger`, would incorrectly include sstables as soon as they passed through the PK (bloom) filter, and would include sstables which didn't pass the PK filter (if they passed the CK filter). Fortunately this wouldn't cause incorrect data to be returned, but it would cause sstables to be opened unnecessarily (these sstables would immediately return eof), resulting in a performance drop. This commit fixes the filter and adds a regression test which uses statistics to check how many times the CK filter was invoked. Fixes #8432. Closes #8433	2021-04-08 18:03:49 +03:00
Avi Kivity	3a58985674	Merge 'scylla_ntp_setup: detect already installed ntp client' from Takuya ASADA In the current implementation, we may re-run the NTP configuration even if it is already configured. Also, the system may be configured with a non-default NTP client; we just ignore that and configure the default NTP client. This patch minimizes unnecessary re-configuration of the NTP client. It runs in the following order: 1. Check whether an NTP client is already running. If it is running, skip setup 2. Check whether an NTP client is already installed. If it is installed, use it 3. If no NTP client package is installed, - if it's CentOS, install chrony - if it's another distribution, install systemd-timesyncd Closes #8431 * github.com:scylladb/scylla: scylla_ntp_setup: detect already installed ntp client scylla_util.py: return bool value on systemd_unit.is_active()	2021-04-08 17:27:15 +03:00
Takuya ASADA	735c83b27f	scylla_ntp_setup: detect already installed ntp client In the current implementation, we may re-run the NTP configuration even if it is already configured. Also, the system may be configured with a non-default NTP client; we just ignore that and configure the default NTP client. This patch minimizes unnecessary re-configuration of the NTP client. It runs in the following order: 1. Check whether an NTP client is already running. If it is running, skip setup 2. Check whether an NTP client is already installed. If it is installed, use it 3. If no NTP client package is installed, - if it's CentOS, install chrony - if it's another distribution, install systemd-timesyncd Related with #8344, #8339	2021-04-08 22:52:02 +09:00
Botond Dénes	32ae51dc2c	table: query(): fix typo (short_read_allwoed) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210408133018.65692-1-bdenes@scylladb.com>	2021-04-08 16:34:08 +03:00
Tomasz Grabiec	6d6f39a7b3	Merge "fixes for stepdown and quorum check" from Gleb The series contains code cleanups and fixes for stepdown process and quorum check code. Note this is re-send of already posted patches lumped together for convenience. * scylla-dev/raft-fixes-v1: raft: add test for check quorum on a leader raft: fix quorum check code for joint config and non-voting members raft: do not hang on waiting for entries on a leader that was removed from a cluster raft: add more tracing to stepdown code raft: use existing election_elapsed() function instead of redo the calculation raft: test: add test case for stepdown process raft: check that a node is still the leader after initiating stepdown process	2021-04-08 15:18:52 +02:00
Takuya ASADA	2545d7fd43	scylla_util.py: return bool value on systemd_unit.is_active() Currently, 'if unit.is_active():' is always True since is_active() returns its result as a string (active, inactive, unknown). To avoid such scripting bugs, change the return value to bool.	2021-04-08 21:54:05 +09:00
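The truthiness pitfall in the commit above is easy to reproduce in isolation. The following sketch does not reproduce the real `systemd_unit` class: both functions are hypothetical stand-ins that take the unit state as a plain string.

```python
def is_active_as_string(state):
    """Old behaviour per the message: return the state string itself."""
    return state


def is_active_as_bool(state):
    """New behaviour per the message: return a real boolean."""
    return state == "active"


if __name__ == "__main__":
    for state in ("active", "inactive", "unknown"):
        # Any non-empty string is truthy, so the string variant always passes an `if`.
        print(state, bool(is_active_as_string(state)), is_active_as_bool(state))
```

Because every non-empty string is truthy in Python, `if unit.is_active():` could never take the false branch while the method returned "inactive" or "unknown".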
Michał Chojnowski	6b31f73987	cdc: log: change yet another occurence of `bytes` to `managed_bytes`	2021-04-08 10:16:21 +02:00
Michał Chojnowski	061f72166c	cdc: log: switch the remaining usages of `bytes` to `managed_bytes` in collection_visitor	2021-04-08 10:16:21 +02:00
Michał Chojnowski	2760382a68	cdc: log: change `deleted_elements` in log_mutation_builder from bytes to managed_bytes	2021-04-08 10:16:21 +02:00
Michał Chojnowski	ba53c85829	cdc: log: rewrite collection merge to use managed_bytes instead of bytes	2021-04-08 10:16:21 +02:00
Michał Chojnowski	42acdc4d09	cdc: log: don't linearize collections in get_preimage_col_value	2021-04-08 10:16:21 +02:00
Michał Chojnowski	70a2bed70b	cdc: log: change return type of get_preimage_col_value to managed_bytes	2021-04-08 10:16:21 +02:00
Michał Chojnowski	4214e74678	cdc: log: remove an unnecessary copy in process_row_visitor::live_atomic_cell	2021-04-08 10:16:11 +02:00
Michał Chojnowski	c2b43c8daf	cdc: log: switch cell_map from bytes to managed_bytes	2021-04-08 10:05:30 +02:00
Michał Chojnowski	4e8eb07de4	cdc: log: change the argument of log_mutation_builder::set_value to managed_bytes_view	2021-04-08 10:05:00 +02:00
Michał Chojnowski	f18b74eee5	cdc: log: don't linearize the primary key in log_mutation_builder	2021-04-08 10:04:31 +02:00
Michał Chojnowski	890e6377ab	atomic_cell: add yet another variant of make_live for managed_bytes_view We will use it in the next patches of this series.	2021-04-08 10:04:23 +02:00
Michał Chojnowski	5a2b492f09	compound: add explode_fragmented We will use it in the next patches in this series.	2021-04-08 10:02:54 +02:00
Asias He	a8c90a5848	storage_service: Reject replacing a node that has left the ring 1) start n1, n2, n3 2) decommission n3 3) remove /var/lib/scylla for n3 4) start n4 with the same ip address as n3 to replace n3 5) replace will be successful If a node has left the ring, we should reject the replace operation. This patch makes the check during replace operation more strict and rejects the replace if the node has left the ring. After the patch, we will see ERROR 2021-04-07 08:02:14,099 [shard 0] init - Startup failed: std::runtime_error (Cannot replace_adddress 127.0.0.3 because it has left the ring, status=LEFT) Fixes #8419 Closes #8420	2021-04-07 19:42:28 +03:00
Avi Kivity	202c631dee	test: perf: perf_simple_query: collect allocation and task statistics Calculate and display the number of memory allocations and tasks executed per operation. Sample results (--smp 1): 180022.46 tps (90 allocs/op, 20 tasks/op) 178963.44 tps (90 allocs/op, 20 tasks/op) 178702.41 tps (90 allocs/op, 20 tasks/op) 177679.74 tps (90 allocs/op, 20 tasks/op) 179539.36 tps (90 allocs/op, 20 tasks/op) median 178963.44 tps (90 allocs/op, 20 tasks/op) median absolute deviation: 575.92 maximum: 180022.46 minimum: 177679.74 This allows less noisy tracking of how some changes impact performance.	2021-04-07 17:54:48 +03:00
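The summary statistics in the sample output above can be recomputed from the five listed tps values. This is a sketch using Python's standard library, not the perf_simple_query code; `median_absolute_deviation` is a name chosen here for illustration.

```python
from statistics import median

# The five tps samples quoted in the commit message.
SAMPLES = [180022.46, 178963.44, 178702.41, 177679.74, 179539.36]


def median_absolute_deviation(values):
    """Median of the absolute differences between each value and the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


if __name__ == "__main__":
    print("median", median(SAMPLES))
    print("median absolute deviation", round(median_absolute_deviation(SAMPLES), 2))
    print("maximum", max(SAMPLES))
    print("minimum", min(SAMPLES))
```

The result is rounded to two decimals because floating-point subtraction leaves a small residue; the rounded values agree with the figures quoted in the message.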
Avi Kivity	3a90df39c5	perf: deinline some functions in perf.hh Those functions were defined in a header, but not marked inline. This made including the header from two source files impossible, as the linker would complain about duplicate symbols. Rather than making them inline, put them in a new source file perf.cc as they don't need to be inline.	2021-04-07 17:51:58 +03:00
Avi Kivity	29a674cd94	test: perf: perf_fast_forward: report allocation rate and tasks These are more stable than cpu consumed across runs, and impact performance directly. Closes #8422	2021-04-07 15:41:43 +02:00
Piotr Sarna	8e808a56d2	Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund Fixes #8363 Fixes #8376 Delete segments has two issues when running with size-limited commit log and strict adherence to said limit. 1.) It uses parallel processing, with deferral. This means that the disk usage variables it looks at might not be fully valid - i.e. we might have already issued a file delete that will reduce disk footprint such that a segment could instead be recycled, but since vars are (and should) only updated _post_ delete, we don't know. 2.) It does not take into account edge conditions, when we only delete a single segment, and this segment is the border segment - i.e. the one pushing us over the limit, yet allocation is desperately waiting for recycling. In this case we should allow it to live on, and assume that next delete will reduce footprint. Note: to ensure exact size limit, make sure total size is a multiple of segment size. If we had an error in recycling (disk rename?), and no elements are available, we could have waiters hoping they will get segments. Abort the queue (not permanent, but wakes up waiters), and let them retry. Since we did deletions instead, disk footprint should allow for new allocs at least. Or more likely, everything is broken, but we will at least make more noise. Closes #8372 * github.com:scylladb/scylla: commitlog: Add signalling to recycle queue iff we fail to recycle commitlog: Fix race and edge condition in delete_segments commitlog: coroutinize delete_segments commitlog_test: Add test for deadlock in recycle waiter	2021-04-07 15:13:25 +02:00
Nadav Har'El	0dd6f2db8f	Merge 'CDC generations: refactors and improvements' from Kamil Braun The "most important" major changes are: 1. storage_service: simplify CDC generation management during node replace Previously, when node A replaced node B, it would obtain B's generation timestamp from its application state (gossiped by other nodes) and start gossiping it immediately on bootstrap. But that's not necessary: - if this is the timestamp of the last (current) generation, we would obtain it from other nodes anyway (every node gossips the last known timestamp), - if this is the timestamp of an earlier generation, we would forget it immediately and start gossiping the last timestamp (obtained from other nodes). This commit simplifies the bootstrap code (in node-replace case) a bit: the replacing node no longer attempts to retrieve the CDC generation timestamp from the node being replaced. 2. tree-wide: introduce cdc::generation_id type Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. 
This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: - if a piece of code doesn't care about the timestamp, it just passes the ID around - if it does care, it can access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. 3. cdc: handle missing generation case in check_and_repair_cdc_streams check_and_repair_cdc_streams assumed that there is always at least one generation being gossiped by at least one of the nodes. Otherwise it would enter undefined behavior. I'm not aware of any "real" scenario where this assumption wouldn't be satisfied at the moment where check_and_repair_cdc_streams makes it except perhaps some theoretical races. But it's best to stay on the safe side. --- Additionally the PR does some simplifications, stylistic improvements, removes some dead code, coroutinizes some functions, uncoroutinizes others (due to miscompiles), adds additional logging, updates some stale comments. Read commit messages for more details. Closes #8283 * github.com:scylladb/scylla: cdc: log a message when creating a new CDC generation cdc: handle missing generation case in check_and_repair_cdc_streams tree-wide: introduce cdc::generation_id type tree-wide: rename "cdc streams timestamp" to "cdc generation id" cdc: remove some functions from generation.hh storage_service: make set_gossip_tokens a static free-function db: system_keyspace: group cdc functions in single place cdc: get rid of "get_local_streams_timestamp" sys_dist_ks: update comment at quorum_if_many storage_service: simplify CDC generation management during node replace	2021-04-07 14:49:02 +03:00
Kamil Braun	6525111d21	cdc: log a message when creating a new CDC generation	2021-04-07 13:47:16 +02:00
Kamil Braun	0978155bec	cdc: handle missing generation case in check_and_repair_cdc_streams check_and_repair_cdc_streams assumed that there is always at least one generation being gossiped by at least one of the nodes. Otherwise it would enter undefined behavior. I'm not aware of any "real" scenario where this assumption wouldn't be satisfied at the moment where check_and_repair_cdc_streams makes it except perhaps some theoretical races. But it's best to stay on the safe side.	2021-04-07 13:47:16 +02:00
Kamil Braun	99fd2244a3	tree-wide: introduce cdc::generation_id type This is a follow-up to the previous commit. Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: 1. if a piece of code doesn't care about the timestamp, it just passes the ID around 2. if it does care, it can simply access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. Using the occasion, we change the `do_handle_cdc_generation_intercept...` function to be a standard function, not a coroutine. It turns out that - depending on the shape of the passed-in argument - the function would sometimes miscompile (the compiled code would not copy the argument to the coroutine frame).	2021-04-07 13:47:13 +02:00
Raphael S. Carvalho	8e0a1ca866	sstable_set: Implement compound_sstable_set's create_single_key_sstable_reader() The compound set isn't overriding create_single_key_sstable_reader(), so the default implementation is always called. Although the default impl will provide correct behavior, specialized ones which provide better perf, currently only available for TWCS, were being ignored. The compound set impl of the single key reader will basically combine the single key readers of all sets managed by it. Fixes #8415. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>	2021-04-07 12:36:30 +03:00
Nadav Har'El	da11cd99f7	Merge 'Add a (failing) test for picking secondary indexes in order' from Piotr Sarna Currently the heuristics for picking an index for a query are not very well defined. It would be best if we used statistics to pick the index which is likely to perform the fastest, but for starters we should at least let the user decide which index to pick by picking the first one by the order of restrictions passed to the query. The (failing) test case from this patch shows the expected results. Ref: #7969 Closes #8414 * github.com:scylladb/scylla: cql-pytest: add a failing test for index picking order cql3: add tracing used secondary index	2021-04-07 11:40:37 +03:00
Piotr Sarna	1f7b972db7	cql-pytest: add a failing test for index picking order Currently the heuristics for picking an index for a query are not very well defined. It would be best if we used statistics to pick the index which is likely to perform the fastest, but for starters we should at least let the user decide which index to pick by picking the first one by the order of restrictions passed to the query. The (failing) test case from this patch shows the expected results. Ref: #7969	2021-04-07 10:05:00 +02:00
Gleb Natapov	68d73bd4c8	raft: add test for check quorum on a leader	2021-04-07 10:15:33 +03:00
Gleb Natapov	b3cb4f3966	raft: fix quorum check code for joint config and non-voting members The current leader code checks that most nodes are alive, but this is incorrect, since some nodes may be non-voting and hence should not cause a leader to step down if dead. It is also incorrect with a joint config, since quorum is calculated differently there. Fix it by introducing an activity_tracker class that knows how to handle all the above details.	2021-04-07 10:15:33 +03:00
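The quorum rule described in the commit above can be illustrated with a minimal Python sketch. It is not the `activity_tracker` implementation; the function names and the set-based representation are hypothetical and chosen only for illustration. It assumes the usual Raft rule that only voting members count, and that under a joint configuration a majority is needed in both the previous and the current voter sets.

```python
from typing import Optional, Set


def majority_active(voters: Set[str], active: Set[str]) -> bool:
    """Return True if a strict majority of `voters` is in `active`."""
    return len(voters & active) > len(voters) // 2


def leader_has_quorum(current_voters: Set[str],
                      active: Set[str],
                      previous_voters: Optional[Set[str]] = None) -> bool:
    """Hypothetical quorum check for a leader.

    `active` holds the nodes heard from recently (the leader counts itself).
    Non-voting members are simply absent from the voter sets, so a dead
    non-voter can never cost the leader its quorum.
    """
    if not majority_active(current_voters, active):
        return False
    # Joint configuration: the previous voter set must agree as well.
    if previous_voters is not None and not majority_active(previous_voters, active):
        return False
    return True
```

With voters `{"a", "b", "c"}` and a dead non-voting node `d`, the leader keeps its quorum when only `a` and `b` are active, even though just two of the four nodes are alive.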
Gleb Natapov	a48a2c454b	raft: do not hang on waiting for entries on a leader that was removed from a cluster If a leader is removed from a cluster, it will never know when entries that it has not committed yet will be committed, so abort the wait in this case with an uncertainty error.	2021-04-07 10:15:33 +03:00
Gleb Natapov	db03c94692	raft: add more tracing to stepdown code	2021-04-07 10:15:33 +03:00
Gleb Natapov	7dec56721c	raft: use existing election_elapsed() function instead of redo the calculation	2021-04-07 10:15:33 +03:00
Gleb Natapov	bdb59307d3	raft: test: add test case for stepdown process Add a test for the case where the C_new entry is not the last one in a leader that has been removed from a cluster. In this case the leader will continue replication even after committing C_new and will start the stepdown process later, when at least one follower is fully synchronized.	2021-04-07 10:15:33 +03:00
Gleb Natapov	3bcd3212e2	raft: check that a node is still the leader after initiating stepdown process Usually initiation of stepdown process does not immediately depose the current leader, but if the current leader is no longer part of the cluster it will happen. We were missing the check after initiating stepdown process in append reply handling.	2021-04-07 10:15:33 +03:00
Avi Kivity	5109bf8b99	config: relax batch size warning and failure thresholds We inherited very low thresholds for warning about and failing multi-partition batches, but these warnings aren't useful. The size of a batch in bytes has no impact on node stability. In fact, the warnings can cause more problems if they flood the log. Fix by raising the warning threshold to 128 KiB (our magic size) and the fail threshold to 1 MiB. Fixes #8416. Closes #8417	2021-04-06 20:56:06 +03:00
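As a rough illustration of the new thresholds, the following Python sketch classifies a batch by its size in bytes. The function and constant names are hypothetical, and whether the real comparison is strict or inclusive is an assumption; only the 128 KiB and 1 MiB values come from the commit message.

```python
KIB = 1024

# Values taken from the commit message; the names are illustrative only.
WARN_THRESHOLD_BYTES = 128 * KIB
FAIL_THRESHOLD_BYTES = 1024 * KIB


def classify_batch(size_bytes: int,
                   warn: int = WARN_THRESHOLD_BYTES,
                   fail: int = FAIL_THRESHOLD_BYTES) -> str:
    """Return "fail", "warn" or "ok" for a batch of the given size.

    Assumption: a batch must strictly exceed a threshold to trigger it.
    """
    if size_bytes > fail:
        return "fail"
    if size_bytes > warn:
        return "warn"
    return "ok"
```

Under these assumptions a 64 KiB batch is accepted silently, a 200 KiB batch only logs a warning, and a 2 MiB batch is rejected.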
Calle Wilund	d734f85280	commitlog: Add signalling to recycle queue iff we fail to recycle Fixes #8376 If a recycle should fail, we will sort of handle it by deleting the segment, so no leaks. But if we have waiter(s) on the recycle queue, we could end up deadlocked/starved because nothing is incoming there. This adds an abort of the queue iff we failed and no objects are available. This will wake up any waiter, and he should retry, and hopefully at least be able to create a new segment. We then reset the queue to a new one. So we can go on. v2: * Forgot to reset queue v3: * Nicer exception handling in allocate_segment_ex	2021-04-06 16:38:14 +00:00
Calle Wilund	15dd76f0c2	commitlog: Fix race and edge condition in delete_segments Fixes #8363 Delete segments has two issues when running with size-limited commit log and strict adherence to said limit. 1.) It uses parallel processing, with deferral. This means that the disk usage variables it looks at might not be fully valid - i.e. we might have already issued a file delete that will reduce disk footprint such that a segment could instead be recycled, but since vars are (and should) only updated _post_ delete, we don't know. 2.) It does not take into account edge conditions, when we only delete a single segment, and this segment is the border segment - i.e. the one pushing us over the limit, yet allocation is desperately waiting for recycling. In this case we should allow it to live on, and assume that next delete will reduce footprint. Note: to ensure exact size limit, make sure total size is a multiple of segment size. Fixed by a.) Doing delete serialized. It is not like being parallel here will win us speed awards. And now we can know exact footprint, and how many segments we have left to delete b.) Check if we are a block across the footprint boundary, and people might be waiting for a segment. If so, don't delete segment, but recycle. As a follow-up, we should probably instead adjust the commitlog size limit (per shard) to be a multiple of segment sizes, but there are risks in that too.	2021-04-06 16:38:14 +00:00
Calle Wilund	d9a9897892	commitlog: coroutinize delete_segments Because we like cow routines.	2021-04-06 16:38:14 +00:00
Calle Wilund	813694b617	commitlog_test: Add test for deadlock in recycle waiter Not a very good test, mind you. Nothing to verify, just see if the test times out. But try to make it at least complete for failure report.	2021-04-06 16:38:14 +00:00
Piotr Sarna	1c99ed6ced	cql3: add tracing used secondary index The indexed queries will now record which index was chosen for fetching the base table keys. Example output: activity ------------------------------------------------------------------------------------------------------------------------ Parsing a statement Processing a statement Consulting index my_v2_idx for a single slice of keys Creating read executor for token -3248873570005575792 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE read_data: querying locally Start querying singular range {{-3248873570005575792, pk{000400000002}}} Querying cache for range {{-3248873570005575792, pk{000400000002}}} and slice {(-inf, +inf)} Querying is done Done processing - preparing a result	2021-04-06 17:16:29 +02:00
Tomasz Grabiec	4b10247a4f	Merge "raft: do not assert when receiving unexpected messages in a leader state" from Gleb * scylla-dev/raft-cleanup-v2: raft: test: add test that leader behaves as expected when it gets unexpected messages raft: do not assert when receiving unexpected messages in a leader state raft: use existing function to check if election timeout elapsed	2021-04-06 16:52:23 +02:00
Konstantin Osipov	c83cf1f965	uuid: switch the API to use std::chrono A follow up for the patch for #7611. This change was requested during review and moved out of #7611 to reduce its scope. The patch switches UUID_gen API from using plain integers to hold time units to units from std::chrono. For one, we plan to switch the entire code base to std::chrono units, to ensure type safety. Secondly, using std::chrono units allows increasing code reuse with template metaprogramming and removing a few UUID_gen functions that became redundant as a result. * switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch min_time_UUID(), max_time_UUID(), create_time_safe() to std::chrono * remove unused variant of from_unix_timestamp() * remove unused get_time_UUID_bytes(), create_time_unsafe(), redundant get_adjusted_timestamp() * inline get_raw_UUID_bytes() * collapse two similar implementations of get_time_UUID() * switch internal constants to std::chrono * remove unnecessary unique_ptr from UUID_gen::_instance Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>	2021-04-06 17:12:54 +03:00
Nadav Har'El	91249e9683	Update tools/java submodule * tools/java 5756445ec7...57eb143119 (1): > sstableloader: Handle non-prepared batches with ":" in identifier names Fixes #8230.	2021-04-06 16:37:03 +03:00
Nadav Har'El	0d0db05cf3	test/alternator: speed up two slow xfailing tests By far the two slowest Alternator tests when running a development build on my laptop are test_gsi.py::test_gsi_projection_include and test_gsi.py::test_gsi_projection_keys_only Each of those takes around 3.2 seconds, and the sum of just these two tests is as much as 10% (!) of all other 600 tests. The reason why these tests are slow is that they check scanning a GSI with projection. Scylla currently ignores the projection, so the scan returns the wrong value. Because this is a GSI, which supports only eventually-consistent reads, we need to retry the read - and did it for up to 3 seconds! But this retry only makes sense if the GSI read did not yet return the expected data. But in these xfailing tests, we read a wrong item (with too many attributes) almost immediately, and this should indicate an immediate failure - no amount of retry would help. So in this patch we detect this case and fail the test immediately instead of wasting 3 seconds in retries. On my laptop with dev build, this patch reduces the time to run the entire Alternator test suite from 70 seconds to 63 seconds. Also, now that we never just waste time until the timeout, we can increase it to any number, and in this patch we increase it from 3 seconds to 5. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210317183918.1775383-1-nyh@scylladb.com>	2021-04-06 14:49:15 +02:00
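The fail-fast retry idea from the commit above might look roughly like the following Python sketch. It is not the actual helper from test/alternator; `wait_for_items`, its parameters, and the order-sensitive list comparison are simplifying assumptions. The point is that a missing item justifies another retry, while an item that can never become correct ends the wait at once.

```python
import time


def wait_for_items(fetch, expected, timeout=5.0, delay=0.1):
    """Poll `fetch()` until it returns `expected` (a list of dicts).

    Hypothetical helper: retries while the result is merely incomplete,
    but fails immediately if the result contains an item that is not in
    `expected`, since no amount of retrying would fix that.
    """
    deadline = time.monotonic() + timeout
    while True:
        got = fetch()
        if got == expected:
            return got
        # An unexpected item (e.g. one with too many attributes) is a
        # definite failure, not eventual-consistency lag.
        if any(item not in expected for item in got):
            raise AssertionError(f"unexpected items in result: {got!r}")
        if time.monotonic() >= deadline:
            raise AssertionError(f"timed out waiting for {expected!r}")
        time.sleep(delay)
```

Because a wrong result no longer burns the whole timeout, the timeout itself can be generous without slowing down failing tests.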
Nadav Har'El	15cab90f7b	test/alternator: switch some fixture scopes from "session" to "module" In conftest.py we have several fixtures creating shared tables which many test files can share, so they are marked with the "session" scope - all the tests in the testing session may share the same instance. This is fine. Some of the test files have additional fixtures for creating special tables needed only in those files. Those were, unnecessarily, marked with the "session" scope as well. This means that these temporary tables are only deleted at the very end of the test suite, even though they can be deleted at the end of the test file which needed them. This is exactly what the "module" fixture scope is, so this patch changes all the fixtures private to one test file to be "module". After this patch, the teardown of the last test in the suite goes down from 4 seconds to just 1.5 seconds (it's still long because there are still plenty of session-scoped fixtures in conftest.py). Another small benefit is that the peak disk usage of the test suite is lower, because some of the temporary tables are deleted sooner. This patch does not change any test functionality, and also does not make any test faster - it just changes the order of the fixture teardowns. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210317175036.1773774-1-nyh@scylladb.com>	2021-04-06 14:43:36 +02:00
Takuya ASADA	0b2c1edddc	scylla_ntp_setup: support systemd-timesyncd On Ubuntu/Debian systemd-timesyncd is default NTP client, and installed by default. So use it instead of installing chrony. Fixes #8339 Closes #8344	2021-04-06 15:28:34 +03:00
Kamil Braun	e486e0f759	tree-wide: rename "cdc streams timestamp" to "cdc generation id" Each CDC generation always has a timestamp, but the fact that the timestamp identifies the generation is an implementation detail. We abstract away from this detail by using a more generic naming scheme: a generation "identifier" (whatever that is - a timestamp or something else). It's possible that a CDC generation will be identified by more than a timestamp in the (near) future. The actual string gossiped by nodes in their application state is left as "CDC_STREAMS_TIMESTAMP" for backward compatibility. Some stale comments have been updated.	2021-04-06 13:15:31 +02:00
Kamil Braun	0cb2f58514	cdc: remove some functions from generation.hh They are not used outside of the generation module.	2021-04-06 13:15:31 +02:00
Kamil Braun	deae0aa8ba	storage_service: make set_gossip_tokens a static free-function It's always good to make the storage_service class smaller.	2021-04-06 13:15:31 +02:00
Kamil Braun	1019ff07cb	db: system_keyspace: group cdc functions in single place	2021-04-06 13:15:31 +02:00
Kamil Braun	2e2d51cf2b	cdc: get rid of "get_local_streams_timestamp" This function retrieves the persisted timestamp of the last known CDC generation (which this node is currently gossiping to other nodes). It checks that the timestamp is present; if not, it throws an error. The check is unnecessary. It's used only in a quite esoteric place (start_gossiping, which implements an almost-never-used API call), and it's fine if the timestamp is gone - in start_gossiping, we can start gossiping the tokens without the CDC generation timestamp (well, if the timestamp is not present in system tables, something weird must have happened, but that doesn't mean we can't resume gossiping - fixing CDC generation management in such a case is a separate problem).	2021-04-06 13:15:31 +02:00
Kamil Braun	3cebe99613	sys_dist_ks: update comment at quorum_if_many The comment mentioned tables that no longer exist: their names were changed some time ago. Update the comment to be name-agnostic. Furthermore, the second part of the comment related to a case of "joining a node without bootstrapping". Fortunately this operation is no longer possible (after #6848 which became part of Scylla 4.3) so we can shorten the comment.	2021-04-06 13:15:31 +02:00
Kamil Braun	bb477b9bb4	storage_service: simplify CDC generation management during node replace Previously, when node A replaced node B, it would obtain B's generation timestamp from its application state (gossiped by other nodes) and start gossiping it immediately on bootstrap. But that's not necessary: 1. if this is the timestamp of the last (current) generation, we would obtain it from other nodes anyway (every node gossips the last known timestamp), 2. if this is the timestamp of an earlier generation, we would forget it immediately and start gossiping the last timestamp (obtained from other nodes). This commit simplifies the bootstrap code (in node-replace case) a bit: the replacing node no longer attempts to retrieve the CDC generation timestamp from the node being replaced.	2021-04-06 13:15:31 +02:00
Takuya ASADA	e991e01f2e	nonroot: generate scylla_sysconfdir.py correctly We have a scripting bug: when /var/log/journal exists, install.sh does not generate scylla_sysconfdir.py. Stop generating scylla_sysconfdir.py inside the if/else condition and do it unconditionally in install.sh; also drop the pre-generated scylla_sysconfdir.py from dist/common/scripts. Also, $rsysconfdir, not $sysconfdir, is the correct path to the nonroot-mode sysconfdir. Fixes #8385 Closes #8386	2021-04-05 15:31:12 +03:00
Avi Kivity	56cd058b34	config: correct description of listen_address - it does not support using interface names - listen_interface is not supported - 0.0.0.0 will work (and is reasonable) if you set broadcast_address - empty setting is not supported Fixes #8381. Closes #8409	2021-04-05 14:06:48 +03:00
Gleb Natapov	cd24dfc7e5	storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication If a table that is not replicated to a certain DC (rf=0) is accessed with LOCAL_QUORUM on that DC the current code will crash since the 'targets' array will be empty and read executor does not handle it. Fix it by replying with empty result. Fixes #8354 Message-Id: <YGro+l2En3fF80CO@scylladb.com>	2021-04-05 14:04:58 +03:00
Avi Kivity	b2f0a9d05c	caching_options.hh: move code to .cc caching_options is by no means performance sensitive, but it is included in many places (via schema.hh), and in turn it pulls in other includes. Reduce include load by deinlining it. Ref #1. Closes #8408	2021-04-05 13:05:43 +03:00
Avi Kivity	a9835ec128	caching_options: detemplate from_map() I wanted to move caching_option.hh's content to .cc (so that it doesn't pull in rjson.hh everywhere), but for that, we must first make from_map() a non-template function. Luckily, it is only called with one parameter type, so just substitute that type for the template parameter. Closes #8406	2021-04-04 21:29:25 +03:00
Avi Kivity	832117c6d9	types: convert has_empty predicate to a concept has_empty is a textbook example of a concept: it checks whether a type has an empty() method that returns bool. It is now implemented with enable_if, simplify it to a concept. I verified that the debug build doesn't contain any incorrect emtpyable<T> (e.g. for strings). Closes #8404	2021-04-04 21:24:05 +03:00
Michał Chojnowski	f23a47e365	utils: fragment_range: fix FragmentedView utils for views with empty fragments The copying and comparing utilities for FragmentedView are not prepared to deal with empty fragments in non-empty views, and will fall into an infinite loop in such case. But data coming in result_row_view can contain such fragments, so we need to fix that. Fixes #8398. Closes #8397	2021-04-04 15:31:51 +03:00
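To make the failure mode above concrete, here is a small Python sketch of a chunk-wise comparison of two fragmented byte sequences that tolerates empty fragments. It is not the FragmentedView code from utils/fragment_range; the function name and the list-of-bytes representation are assumptions for illustration. The key step is to skip every exhausted fragment, including empty ones, before computing the chunk size, so each iteration compares at least one byte and the loop always makes progress.

```python
from typing import Sequence


def compare_fragmented(a: Sequence[bytes], b: Sequence[bytes]) -> int:
    """Lexicographically compare two fragmented byte strings.

    Returns -1, 0 or 1. Empty fragments may appear anywhere.
    """
    ia = ib = 0  # index of the current fragment in a / b
    oa = ob = 0  # offset inside the current fragment
    while True:
        # Skip fragments with nothing left to read. Without this step an
        # empty fragment would yield a zero-length chunk on every pass.
        while ia < len(a) and oa == len(a[ia]):
            ia, oa = ia + 1, 0
        while ib < len(b) and ob == len(b[ib]):
            ib, ob = ib + 1, 0
        a_done, b_done = ia == len(a), ib == len(b)
        if a_done and b_done:
            return 0
        if a_done:
            return -1  # a is a proper prefix of b
        if b_done:
            return 1   # b is a proper prefix of a
        # Both fragments have bytes left, so n >= 1.
        n = min(len(a[ia]) - oa, len(b[ib]) - ob)
        chunk_a, chunk_b = a[ia][oa:oa + n], b[ib][ob:ob + n]
        if chunk_a != chunk_b:
            return -1 if chunk_a < chunk_b else 1
        oa, ob = oa + n, ob + n
```

For example, `[b"ab", b"", b"c"]` and `[b"a", b"bc"]` compare as equal even though they are fragmented differently and one of them contains an empty fragment.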
Avi Kivity	82c76832df	treewide: don't include "db/system_distributed_keyspace.hh" from headers This just causes unneeded and slower recompilations. Instead, replace it with forward declarations, or includes of smaller headers that were incidentally brought in by the one removed. The .cc files that really need it gain the include, but they are few. Ref #1. Closes #8403	2021-04-04 14:00:26 +03:00
Avi Kivity	9853e07821	composite: replace enable_if with constraints Easier to read. Closes #8399	2021-04-04 13:56:51 +03:00
Kamil Braun	641040d465	sys_dist_ks: remove dead code (expire_cdc_* functions) These functions were not used anywhere but had to be maintained anyway. When (if) the expiration algorithm actually gets implemented (see issue #7300), the functions can be added back (perhaps they will need to look differently at that time, and it's likely that the `expire` column won't be used in the expiration algorithm in the end anyway).	2021-04-04 13:12:12 +03:00
Kamil Braun	4f3f245188	sys_dist_ks: coroutinize system_distributed_keyspace::start	2021-04-04 13:10:44 +03:00
Avi Kivity	40b60e8f09	Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node is changed to postpone the responding of gossip echo message to avoid other nodes sending read requests to the replacing node. It works as following: 1) replacing node does not respond echo message to avoid other nodes to mark replacing node as alive 2) replacing node advertises hibernate state so other nodes knows replacing node is replacing 3) replacing node responds echo message so other nodes can mark replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem, we can do the replacing operation in multiple stages. 
One solution is to introduce a new gossip status state as proposed here: gossip: Introduce STATUS_PREPARE_REPLACE #7416 1) replacing node does not respond echo message 2) replacing node advertises prepare_replace state (Remove replacing node from natural endpoint, but do not put in pending list yet) 3) replacing node responds echo message 4) replacing node advertises hibernate state (Put replacing node in pending list) Since we now have the node ops verb introduced in `829b4c1438` (repair: Make removenode safe by default), we can do the multiple stages without introducing a new gossip status state. This patch uses the NODE_OPS_CMD infrastructure to implement replace operation. Improvements: 1) It solves the race between marking replacing node alive and sending writes to replacing node 2) The cluster reverts to a state before the replace operation automatically in case of error. As a result, it solves the issue where, if the replacing node fails in the middle of the operation, the replacing node stays in HIBERNATE status forever. 3) The gossip status of the node to be replaced is not changed until the replace operation is successful. HIBERNATE gossip status is not used anymore. 4) Users can now pass a list of dead nodes to ignore explicitly. Fixes #8013 Closes #8330 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for replace operation gossip: Add advertise_to_nodes gossip: Add helper to wait for a node to be up gossip: Add is_normal_ring_member helper	2021-04-04 12:54:09 +03:00
Gleb Natapov	10781037f5	raft: test: add test that leader behaves as expected when it gets unexpected messages	2021-04-04 11:33:35 +03:00
Gleb Natapov	28add88a1f	raft: do not assert when receiving unexpected messages in a leader state The current code asserts when it gets InstallSnapshot/AppendRequest in a leader state and the term in the message is equal to the current term. It is true that such messages cannot be received if the protocol works correctly, but we should not crash on a network input nonetheless.	2021-04-04 11:33:35 +03:00
Gleb Natapov	995cd1c8a7	raft: use existing function to check if election timeout elapsed is_past_election_timeout() repeats the calculation that election_elapsed() is doing. Use existing function instead.	2021-04-04 11:33:35 +03:00
Piotr Dulikowski	f186de909d	storage_service/removenode: update gossiper state before excise In `storage_service::removenode`, in "Step 5", services which implement `endpoint_lifecycle_subscriber` are first notified about the node leaving the cluster, and only after that the gossiper state is updated (comments added by me): // This function indirectly notifies subscribers ss.excise(std::move(tmp), endpoint); // This function updates the gossiper state ss._gossiper.advertise_token_removed(endpoint, host_id).get(); This order is confusing for those subscribers which expect the fact that the node is leaving to be reflected in the gossiper state - more specifically, for hints manager. The hints manager has a function `can_send()` which determines if it is OK for it to try send hints. More specifically, it looks at the gossiper state to see if the destination node is ALIVE or if it has left the ring. The first case is obvious as the destination node will be able to receive the hints as writes, while the other means that the hints will be sent with CL=ALL to its new replicas. When a node leaves the cluster, all hint queues either to or from that node enter the "drain" mode - the queue will attempt to send out all hints and will drop those hints which failed to be sent. This mode is triggered by a notification from the storage_service (hints manager is a lifecycle subscriber). The core drain logic for a queue looks as follows: manager_logger.trace("Draining for {}: start", end_point_key()); set_draining(); send_hints_maybe(); _ep_manager.flush_current_hints().handle_exception([] (auto e) { manager_logger.error("Failed to flush pending hints: {}. 
Ignoring...", e); }).get(); send_hints_maybe(); manager_logger.trace("Draining for {}: end", end_point_key()); And `send_hints_maybe` contains the following loop: while (replay_allowed() && have_segments() && can_send()) { if (!send_one_file(*_segments_to_replay.begin())) { break; } _segments_to_replay.pop_front(); ++replayed_segments_count; } Coming back to the `storage_service::removenode` - because of the order of `excise` and `advertise_token_removed`, draining starts before the node which is being removed is removed from gossiper state. In turn, it might happen that the drain logic calls `send_hints_maybe` twice and does not send any hints - the loop in that function will immediately stop because `can_send()` is false because the gossiper state still reports that the target node is not alive. The logic expects `can_send` to be true here because the node has left the ring. This patch changes the order of `excise` and `advertise_token_removed` in `storage_service::removenode` - now, the first one is called after the other. This ensures that the gossiper state is updated before listeners are called, and the race described in the commit message does not happen anymore - `can_send` is true when the node is being drained. The race described here was exposed by the following commit: `77a0f1a153` Fixes: #5087 Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) - dtest(topology_test.py) Closes #8284	2021-04-02 11:05:16 +02:00
Avi Kivity	fb890889cc	version: prepare for the 4.6 cycle	2021-04-01 20:40:52 +03:00
Avi Kivity	eeaceb4bff	Update seastar submodule * seastar 398f1c3274...fcd46c1387 (1): > cmake: tighten check for -fstack-clash-protection	2021-04-01 18:49:16 +03:00
Wojciech Mitros	201b86b042	primitive_consumer: keep fragments of parsed buffer in a small_vector When we want to parse a linearized buffer of bytes, we're copying them into the first and only element of the _read_bytes vector. Thus _read_bytes often contains only one element, which makes a small_vector a better alternative. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-04-01 16:05:52 +02:00
Tomasz Grabiec	307bd354d2	Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function. Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node. Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) - dtest(topology_test.py) Closes #8387 * github.com:scylladb/scylla: hints: clarify docstring comment for can_send hints: use token_metadata to tell if node is in the ring hints: slightly reorganize "if" statement in can_send storage_service: release token_metadata lock before notify_left storage_service: notify_left after token_metadata is replicated	2021-04-01 15:51:46 +02:00
Avi Kivity	e45466ed07	Update seastar submodule * seastar 72e3baed9c...398f1c3274 (4): > coroutine: Remove return_value for future<void> > tls: preserve exact error state so repeated calls generate same message Fixes #8391. > Add deferred_close and deferred_stop > httpd: add status_types 406, 415, 422	2021-04-01 16:41:59 +03:00
Wojciech Mitros	599cfe586f	sstables: add parsing of cell values into fragmented buffers The entire sstable cell value is currently stored in a single temporary_buffer. Cells may be very large, so to avoid large contiguous allocations, the buffer is changed to a fragmented_temporary_buffer. Fixes #7457 Fixes #6376 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-04-01 15:36:58 +02:00
Avi Kivity	bfed9c15d5	Update tools/java submodule * tools/java fb21784b91...5756445ec7 (1): > sstableloader: fix handling of rewritten partition Ref scylladb/scylla-tools-java#238.	2021-04-01 16:07:46 +03:00
Avi Kivity	4739df2cb1	Merge 'cql3: remove linearizations in the write path' from Michał Chojnowski As a part of the effort of removing big, contiguous buffers from the codebase, cql3::raw_value should be made fragmented. Unfortunately a straightforward rewrite to a fragmented buffer type is not possible, because we want cql3::raw_value to be compatible with cql3::raw_value_view, and we want that view to be based on fragmented_temporary_buffer::view, so that it can be used to view data coming directly from seastar without copying. This patch makes cql3::raw_value fragmented by making cql3::raw_value_view a `variant` of managed_bytes_view and fragmented_temporary_buffer::view. Code users which depended on `cql3::raw_value` being `bytes`, and `cql3::raw_value_view` being `fragmented_temporary_buffer::view` underneath were adjusted to the new, dual representation, mainly through the `cql3::raw_value_view::with_value` visitor and deserialization/validation helpers added to `cql3::raw_value_view`. The second part of this series gets rid of linearizations occurring when processing compound types in the CQL layer. This is achieved by storing their elements in `managed_bytes` instead of `bytes` in the partially deserialized form (`lists::value` `tuples::value`, etc.) outputting `managed_bytes` instead of `bytes` in functions which go from the partially deserialized form to the atomic cell format (for frozen types), and avoiding calling deserialize/serialize on individual elements when it's not necessary. (It's only necessary for CQLv2, because since CQLv3 the format on the wire is the same as our internal one). The above also forces some changes to `expression.cc`, and `restrictions`, mainly because `IN` clauses store their arguments as `lists` and `tuples`, and the code which handled this clause expected `bytes`. After this series, the path from prepared CQL statements to `atomic_cell_or_collection` is almost completely linearization-free. 
The last remaining place is `collection_mutation_description`, where map keys are linearized to `bytes`. Closes #8160 * github.com:scylladb/scylla: cql3: update_parameters: remove unused version of make_cell for bytes_view types: collection: remove an unused version of pack_fragmented cql3: optimize the deserialization of collections cql3: maps, sets: switch the element type from bytes to managed_bytes cql3: expression: use managed_bytes instead of bytes where possible cql3: expr: expression: make the argument of to_range a forwarding reference cql3: don't linearize elements of lists, tuples, and user types cql3: values: add const managed_bytes& constructor to raw_value_view cql3: output managed_bytes instead of bytes in get_with_protocol_version types: collection: add versions of pack for fragmented buffers types: add write_collection_{value,size} for managed_bytes_mutable_view cql3: tuples, user_types: avoid linearization in from_serialized() and get() types: tuple: add build_value_fragmented cql3: update_parameters: add make_cell version for managed_bytes_view cql3: remove operation::make_*cell cql3: values: make raw_value fragmented cql3: values: remove raw_value_view::operator== cql3: switch users of cql3::raw_value_view to internals-independent API cql3: values: add an internals-independent API to raw_value_view utils: managed_bytes: add a managed_bytes constructor from FragmentedView utils: managed_bytes: add operator<< and to_hex for managed_bytes utils: fragment_range: add to_hex configure: remove unused link dependencies from UUID_test	2021-04-01 15:21:32 +03:00
Takuya ASADA	3af31eebeb	scylla_setup: stop hardcode product name on scylla_setup Stop hardcoding the product name in scylla_setup; generate scylla_product.py dynamically in install.sh. Fixes #8367 Closes #8384	2021-04-01 15:07:58 +03:00
Avi Kivity	ecc5b57183	Merge "reader_concurrency_semaphore: refactor do_wait_admission() to facilitate changes to admission conditions" from Botond " This small patchset restructures the do_wait_admission() to facilitate future changes to the wait/admission conditions. The changes we want to facilitate are Benny's flat mutation reader close series and my stalled readers series. As an added benefit the code is more readable and a small theoretical corner-case bug is fixed. No logical changes (besides the small bug-fix). Tests: unit(dev) " * 'reader-concurrency-semaphore-refactor/v1' of https://github.com/denesb/scylla: reader_concurrency_semaphore: remove now unused may_proceed() reader_concurrency_semaphore: restructure do_wait_admission() reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter() reader_concurrency_semaphore: make admission conditions consistent	2021-04-01 13:50:32 +03:00
Raphael S. Carvalho	bb9a109c1a	distributed_loader: inform which table is being resharded Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210330163956.60585-1-raphaelsc@scylladb.com>	2021-04-01 13:08:59 +03:00
Pavel Emelyanov	8bbe2eae5e	btree: Convert comparator to <=> It turned out that all the users of btree can already be converted to use safer std::strong_ordering. The only meaningful change here is the btree code itself -- no more ints there. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210330153648.27049-1-xemul@scylladb.com>	2021-04-01 12:56:08 +03:00
Pavel Emelyanov	ccc1f24097	row_cache: Remove mentionings of cache_streamed_mutation This class was replaced by cache_flat_mutation_reader long ago and doesn't exist. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210330153942.27222-1-xemul@scylladb.com>	2021-04-01 12:54:45 +03:00
Michał Chojnowski	c555e84a77	cql3: update_parameters: remove unused version of make_cell for bytes_view It became unused after previous patches in this series changed the representation of collections in cql3 from bytes_view to managed_bytes_view.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	472f0eb932	types: collection: remove an unused version of pack_fragmented It was made unused by previous patches in this series.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	458878a414	cql3: optimize the deserialization of collections Before this patch, deserializing a collection from a (prepared) CQL request involved deserializing every element and serializing it again. Originally this was a hacky method of validation, and it was also needed to reserialize nested frozen collections from the CQLv2 format to the CQLv3 format. But since then we started doing validation separately (before calls to from_serialized) and CQLv2 became irrelevant, making reserialization of elements (which, among other things, involves a memory allocation for every element) pure waste. This patch adds a faster path for collections in the v3 format, which does not involve linearizing or reserializing the elements (since v3 is the same as our internal format). After this patch, the path from prepared CQL statements to atomic_cell_or_collection is almost completely linearization-free. The last remaining place is collection_mutation_description, where map keys are linearized.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	a0f12b8d63	cql3: maps, sets: switch the element type from bytes to managed_bytes	2021-04-01 10:44:21 +02:00
Michał Chojnowski	979666075f	cql3: expression: use managed_bytes instead of bytes where possible	2021-04-01 10:44:21 +02:00
Michał Chojnowski	6e7e795dfd	cql3: expr: expression: make the argument of to_range a forwarding reference Make to_range able to handle rvalues. We will pass managed_bytes&& to it in the next patch to avoid pointless copying. The public declaration of to_range is changed to a concrete function to avoid having to explicitly instantiate to_range for all possible reference types of clustering_key_prefix.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	0bb959e890	cql3: don't linearize elements of lists, tuples, and user types This patch switches the type used to store collection elements inside the intermediate form used in lists::value, tuples::value etc. from bytes to managed_bytes. After this patch, tuple and list elements are only linearized in from_serialized, which will be corrected soon. This commit introduces some additional copies in expression.cc, which will be dealt with in a future commit.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	fa2749c2a0	cql3: values: add const managed_bytes& constructor to raw_value_view Will be used in the next patch. Separated for clarity.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	8927aaf225	cql3: output managed_bytes instead of bytes in get_with_protocol_version	2021-04-01 10:44:21 +02:00
Michał Chojnowski	aab9509775	types: collection: add versions of pack for fragmented buffers We will need them to port the representation of collection types in cql3/ from bytes to managed_bytes. The version which takes an iterator of `bytes` as an argument will be removed after that transition is complete.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	e9c05582a4	types: add write_collection_{value,size} for managed_bytes_mutable_view We will use them to avoid linearization when going from the intermediate std::vector<bytes> form in cql3/ to the atomic_cell format, by outputting managed_bytes instead of bytes in get_with_protocol_version.	2021-04-01 10:44:21 +02:00
Michał Chojnowski	3387d43a34	cql3: tuples, user_types: avoid linearization in from_serialized() and get() Deserialize from raw_value_view without linearizing and output managed_bytes instead of bytes.	2021-04-01 10:44:20 +02:00
Michał Chojnowski	a10a82da30	types: tuple: add build_value_fragmented A version of build_value which produces fragmented output. We will use it to avoid linearization in tuples::value and user_types::value.	2021-04-01 10:42:07 +02:00
Michał Chojnowski	9777026e71	cql3: update_parameters: add make_cell version for managed_bytes_view We will use it to port the representation of collections in cql3/ from bytes to managed_bytes. The duplicate version for bytes_view will be removed after that transition is complete.	2021-04-01 10:42:07 +02:00
Michał Chojnowski	c2c6b2abfa	cql3: remove operation::make_cell The operation::make_cell functions are useless aliases to methods of update_parameters, and are used interchangeably with them throughout the code. Remove them. Also, remove the now-unused update_parameters::make_cell version for fragmented_temporary_buffer::view.	2021-04-01 10:42:07 +02:00
Michał Chojnowski	463ec1b082	cql3: values: make raw_value fragmented As a part of the effort of removing big, contiguous buffers from the codebase, cql3::raw_value should be made fragmented. Unfortunately the change involves some nontrivial work, because raw_value must be viewable with raw_value_view, and raw_value_view must accommodate both raw_value (that's where we store values in prepared queries) and fragmented_temporary_buffer::view (because that's the type of values coming from the wire). This patch makes raw_value fragmented, by changing the backing type from bytes to managed_bytes. raw_value_view is modified accordingly by changing the backing type from fragmented_temporary_buffer::view to a variant of fragmented_temporary_buffer::view and managed_bytes_view. We have prepared the users of raw_value{_view} for this change in preceding commits.	2021-04-01 10:42:07 +02:00
Michał Chojnowski	5984d6b2ce	cql3: values: remove raw_value_view::operator== It's only used in a single test, and there is no reason why it should ever be used anywhere else. So let's remove it from the public header and move it to that test.	2021-04-01 10:42:07 +02:00
Michał Chojnowski	b9322a6b71	cql3: switch users of cql3::raw_value_view to internals-independent API We want to change the internals of cql3::raw_value{_view}. However, users of cql3::raw_value and cql3::raw_value_view often use them by extracting the internal representation, which will be different after the planned change. This commit prepares us for the change by making all accesses to the value inside cql3::raw_value(_view) be done through helper methods which don't expose the internal representation publicly. After this commit we are free to change the internal representation of raw_value_{view} without messing up their users.	2021-04-01 10:42:04 +02:00
Michał Chojnowski	b3167ac0a6	cql3: values: add an internals-independent API to raw_value_view Currently, raw_value_view is backed by a fragmented_temporary_buffer::view, and many users of this type use it by extracting that internal representation. However, we want to change raw_value_view so that it can be created both from fragmented_temporary_buffer and from managed_bytes, so that we can switch the internals of raw_value from bytes to managed_bytes. To do that we need to prepare all users for that more general representation. This commit adds an API which allows using raw_value_view without accessing its internal representation. In the next commits of this series we will switch all callers who currently depend on that representation to the new API, and then we will remove the old accessors and change the internals.	2021-04-01 10:39:42 +02:00
Michał Chojnowski	45e0ef26d3	utils: managed_bytes: add a managed_bytes constructor from FragmentedView Just for convenience. We will use it in an upcoming patch where we switch the inner representation of cql3::raw_value from bytes to managed_bytes, and we will want to construct managed_bytes from fragmented_temporary_buffer::view.	2021-04-01 10:39:42 +02:00
Michał Chojnowski	4715268e30	utils: managed_bytes: add operator<< and to_hex for managed_bytes We will need them to replace bytes with managed_bytes in some places in an upcoming patch. The change to configure.py is necessary because operator<< links to to_hex in bytes.cc.	2021-04-01 10:39:42 +02:00
Michał Chojnowski	14c4639994	utils: fragment_range: add to_hex	2021-04-01 10:39:42 +02:00
Michał Chojnowski	b6740a01ac	configure: remove unused link dependencies from UUID_test	2021-04-01 10:39:42 +02:00
Piotr Dulikowski	6a1152ea9b	hints: clarify docstring comment for can_send Now, the docstring comment next to can_send better represents the condition that is checked inside that function. The statement about returning true when destination left the NORMAL state is replaced with a statement about returning true when the destination has left the ring.	2021-04-01 03:58:29 +02:00
Piotr Dulikowski	4f90514247	hints: use token_metadata to tell if node is in the ring Now, instead of looking at the gossiper state to check if the destination node is still in the ring, we are using token_metadata as a source of truth. This results in much simpler code in can_send() as token_metadata has an is_member method which does exactly what we want.	2021-04-01 03:58:29 +02:00
Piotr Dulikowski	e7d9057d0c	hints: slightly reorganize "if" statement in can_send This commit reverses the order of if-else blocks in can_send, which makes it - in my opinion, at least - slightly easier to read.	2021-04-01 03:58:29 +02:00
Piotr Dulikowski	b7f4f47608	storage_service: release token_metadata lock before notify_left There is no need to keep holding the token_metadata lock after metadata was successfully updated on all shards.	2021-04-01 03:58:21 +02:00
Asias He	323f72e48a	repair: Switch to use NODE_OPS_CMD for replace operation In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node is changed to postpone the responding of gossip echo message to avoid other nodes sending read requests to the replacing node. It works as following: 1) replacing node does not respond echo message to avoid other nodes to mark replacing node as alive 2) replacing node advertises hibernate state so other nodes knows replacing node is replacing 3) replacing node responds echo message so other nodes can mark replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem, we can do the replacing operation in multiple stages. 
One solution is to introduce a new gossip status state as proposed here: gossip: Introduce STATUS_PREPARE_REPLACE #7416 1) replacing node does not respond echo message 2) replacing node advertises prepare_replace state (Remove replacing node from natural endpoint, but do not put in pending list yet) 3) replacing node responds echo message 4) replacing node advertises hibernate state (Put replacing node in pending list) Since we now have the node ops verb introduced in `829b4c1438` (repair: Make removenode safe by default), we can implement the multiple stages without introducing a new gossip status state. This patch uses the NODE_OPS_CMD infrastructure to implement replace operation. Improvements: 1) It solves the race between marking replacing node alive and sending writes to replacing node 2) The cluster reverts to a state before the replace operation automatically in case of error. As a result, it solves the issue where, if the replacing node fails in the middle of the operation, the replacing node stays in HIBERNATE status forever. 3) The gossip status of the node to be replaced is not changed until the replace operation is successful. HIBERNATE gossip status is not used anymore. 4) Users can now pass a list of dead nodes to ignore explicitly. Refs #8013	2021-04-01 09:38:54 +08:00
Asias He	bdb95233e8	gossip: Add advertise_to_nodes gossiper::advertise_to_nodes() is added to allow responding to gossip echo messages from specified nodes, together with the current gossip generation number for those nodes. This helps to avoid a restarted node being marked as alive during a pending replace operation. After this patch, when a node sends an echo message, the gossip generation number is sent in the echo message. Since the generation number changes after a restart, the receiver of the echo message can compare the generation number to tell if the node has restarted. Refs #8013	2021-04-01 09:38:54 +08:00
Asias He	f690f3ee8e	gossip: Add helper to wait for a node to be up This patch adds gossiper::wait_alive helper to wait for nodes to be up on all shards. Refs #8013	2021-04-01 09:38:54 +08:00
Asias He	4f5676630e	gossip: Add is_normal_ring_member helper Check if a node is in NORMAL or SHUTDOWN status which means the node is part of the token ring from the gossip point of view and operates in normal status or was in normal status but is shutdown. Refs #8013	2021-04-01 09:38:54 +08:00
Piotr Dulikowski	ca65f012b0	storage_service: notify_left after token_metadata is replicated Previously, at the end of the removenode operation, endpoint lifecycle subscribers are informed about the node being removed (the "on_leave_cluster" method) before the token_metadata is updated to reflect the fact that a node was removed. Although no subscriber currently depends on token_metadata being up-to-date when "on_leave_cluster" is called, the hints manager will become sensitive to this in a later commit in this series. This commit gets rid of the future problem by notifying subscribers later, only after token_metadata is fully updated and replicated to all shards.	2021-04-01 02:13:27 +02:00
Avi Kivity	bbec43f9a1	Update tools/java submodule * tools/java ccc4201ded...fb21784b91 (2): > fix: Add dummy implementation of getToppartitions > nodetool: Make toppartitions call the generic endpoint Fixes #4520.	2021-03-31 17:38:03 +03:00
Wojciech Mitros	b1b5bda848	sstables: add non-contiguous parsing of byte strings to the primitive_consumer Currently, the primitive_consumer parses all values in contiguous buffers. A string of bytes may be very long, so parsing it in a single buffer can cause a big allocation. This patch allows parsing into fragmented_temporary_buffers instead of temporary_buffers. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 12:09:52 +02:00
Wojciech Mitros	3f529b2860	utils: add ostream operator<<() for fragmented_temporary_buffer::view We are going to store sstable cells' values in fragmented_temporary_buffers. This patch will allow checking these values with loggers. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 12:09:52 +02:00
Wojciech Mitros	74a0c158c5	compound_type: extend serialize_value for all FragmentedView types compound_type::serialize_value is currently implemented for arguments of type 'bytes_view', 'managed_bytes', or 'managed_bytes_view'. We will want to use it for a fragmented_temporary_buffer::view, so we extend it for all FragmentedView types. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 12:09:52 +02:00
Pavel Emelyanov	887a1b0d3d	tracing: Stop tracing in main's deferred action Tracing is created in two steps and is destroyed in two too. The 2nd step doesn't have the corresponding stop part, so here it is -- defer tracing stop after it was started. But need to keep in mind, that tracing is also shut down on drain, so the stopping should handle this. Fixes #8382 tests: unit(dev), manual(start-stop, aborted-start) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210331092221.1602-1-xemul@scylladb.com>	2021-03-31 12:28:37 +03:00
Piotr Jastrzebski	57c7964d6c	config: ignore enable_sstables_mc_format flag Don't allow users to disable MC sstables format any more. We would like to retire some old cluster features that have been around for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this we first have to make sure that all existing clusters have them enabled. It is impossible to know that unless we stop supporting enable_sstables_mc_format flag. Test: unit(dev) Refs #8352 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8360	2021-03-31 12:23:59 +03:00
Avi Kivity	f9244734f9	Update seastar submodule * seastar 48376c76a...72e3baed9 (3): > file: Add RFW_NOWAIT detection case for AuFS > sharded: provide type info on no sharded instance exception > iotune: Estimate accuracy of measurement Added missing include "database.hh" to api/lsa.cc since seastar::sharded<> now needs full type information.	2021-03-31 10:40:04 +03:00
Avi Kivity	de10a74a84	Merge 'types: remove linearization from abstract_type::compare' from Wojciech Mitros This patch is another series on removing big allocations from scylla. The buffers in `compare_visitor` were replaced with `managed_bytes_view`; a similar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well. Tests:unit(dev) Closes #8357 * github.com:scylladb/scylla: types: remove linearization from abstract_type::compare types: replace buffers in tuple_deserializing_iterator with fragmented ones types: make tuple_type_impl::split work with any FragmentedViews types: move read_collection_size/value specialization to header file	2021-03-31 08:50:52 +03:00
Wojciech Mitros	f57fa935a2	types: remove linearization from abstract_type::compare To avoid high latencies caused by large contiguous allocations needed by linearizing, work on fragmented buffers instead. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 06:35:10 +02:00
Wojciech Mitros	daa31be37f	types: replace buffers in tuple_deserializing_iterator with fragmented ones In preparation for removing linearization from abstract_type::compare, add options to avoid linearization in tuple_deserializing_iterator. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 06:35:09 +02:00
Wojciech Mitros	823d4c7529	types: make tuple_type_impl::split work with any FragmentedViews We may want to store a tuple in a fragmented buffer. To split it into a vector of optional bytes, tuple_type_impl::split can be used. To split a contiguous buffer(bytes_view), simply pass single_fragmented_view(bytes_view). Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-31 06:34:37 +02:00
Piotr Sarna	6a2377a233	Merge 'Fast slow query trace doc' from Ivan Addresses https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234 (write issue: "Tracing: slow query fast mode documentation request"). Adds documentation for the fast slow-query tracing mode to docs/guide/tracing.md. A patch to scylla-doc will be duplicated after this one is merged. cc @nyh cc @vladzcloudius Closes #8373 * github.com:scylladb/scylla: tracing: api: fast mode doc improvement tracing: fast slow query tracing mode docs	2021-03-30 17:57:04 +02:00
Botond Dénes	4762b84b44	reader_concurrency_semaphore: remove now unused may_proceed()	2021-03-30 17:54:34 +03:00
Botond Dénes	94c7e619af	reader_concurrency_semaphore: restructure do_wait_admission() Currently the code is structured such that first the conditions required for admission are checked. The success paths have early returns and if all of them fail, we fall back to enqueueing the permit. This patch restructures the code such that the wait conditions are checked first, and if all of them fail, we fall back to admitting the permit. This structure allows for easier introduction of additional wait/admit conditions in the future.	2021-03-30 17:51:17 +03:00
Botond Dénes	d1dd55d98f	reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter() Besides making the code more readable, this also enables restructuring `do_wait_admission()`, without moving too much code around. As a bonus, queue length is now only checked when the permit actually has to be enqueued.	2021-03-30 17:49:30 +03:00
Botond Dénes	d90cd6402c	reader_concurrency_semaphore: make admission conditions consistent Currently there are two places where we check admission conditions: `do_wait_admission()` and `signal()`. Both use `has_available_units()` to check resource availability, but the former has some additional resource related conditions on top (in `may_proceed()`), which lead to the two paths working with slightly different conditions. To fix, push down all resource availability related checks to `has_available_units()` to ensure admission conditions are consistent across all paths.	2021-03-30 17:39:57 +03:00
Ivan Prisyazhnyy	778d9217f3	tracing: api: fast mode doc improvement Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>	2021-03-30 16:22:56 +02:00
Ivan Prisyazhnyy	b3b66fb629	tracing: fast slow query tracing mode docs Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>	2021-03-30 16:22:56 +02:00
Avi Kivity	d2921b5112	Merge 'Clean up > 2-year-old features' from Piotr Sarna Following the work started in `253a7640e`, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years. Features picked this time via `git blame`, along with the date of their introduction: * `fe4afb1aa3` (Asias He 2018-09-05 14:52:10 +0800 109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR"; * `ff5e541335` (Calle Wilund 2019-02-05 13:06:07 +0000 110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE"; * `fefef7b9eb` (Tomasz Grabiec 2019-03-05 19:08:07 +0100 111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC"; Tests: unit(dev) Closes #8235 * github.com:scylladb/scylla: sstables,test: remove variables depending on old features gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature gms: make TRUNCATION_TABLE feature unconditionally true gms: make ROW_LEVEL_REPAIR feature unconditionally true repair: stop relying on ROW_LEVEL_REPAIR feature	2021-03-30 16:13:35 +03:00
Calle Wilund	c0666ea89b	commitlog: Fix inner loop condition in allocation pre-fill Fixes #8369 This was originally found (and fixed) by @gleb-cloudius, but the patch set with the fix was reverted at some point, and the fix went away. Now the error remains even in new, nice coroutine code. We check the wrong var in the inner loop of the pre-fill path of allocate_segment_ex, often causing us to generate giant writev:s of more or less the whole file. Not intended. Closes #8370	2021-03-30 12:14:55 +02:00
Avi Kivity	c2866f46b5	test: relax quota for tests on machines with small page size `8a8589038c` ("test: increase quota for tests to 6GB") increased the quota for tests from 2GB to 6GB. I later found that the increased requirement is related to the page size: Address Sanitizer allocates at least a page per object, and so if the page size is larger the memory requirement is also larger. Make use of this by only increasing the quota if the page size is greater than 4096 (I've only seen 4096 and 65536 in the wild). This allows greater parallelism when the page size is small. Closes #8371	2021-03-30 12:13:42 +02:00
Avi Kivity	8785dd62cb	tests: use kernel page cache Tests are short-lived and use a small amount of data. They are also often run repeatedly, and the data is deleted immediately after the test. This is a good scenario for using the kernel page cache, as it can cache read-only data from test to test, and avoid spilling write data to disk if it is deleted quickly. Acknowledge this by using the new --kernel-page-cache option for tests. This is expected to help on large machines, where the disk can be overloaded. Smaller machines with NVMe disks probably will not see a difference. Closes #8347	2021-03-30 12:04:55 +02:00
Piotr Sarna	6de2691bbd	sstables,test: remove variables depending on old features In order to maintain backward compatibility wrt. cluster features, two boolean variables were kept in sstable writers: - correctly_serialize_non_compound_range_tombstones - correctly_serialize_static_compact_in_mc Since these features are assumed to always be present now, the above variables are no longer needed and can be purged.	2021-03-30 09:37:41 +02:00
Piotr Sarna	e42dee6afb	gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true The feature is assumed to be true due to being over 2 years old. It's still advertised in gossip, but it's assumed to always be present.	2021-03-30 09:37:13 +02:00
Piotr Sarna	28c9af6fa5	sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature The feature bit is going away because it's over 2 years old, so the code which depended on it becomes unconditional.	2021-03-30 09:37:04 +02:00
Piotr Sarna	08c4350968	gms: make TRUNCATION_TABLE feature unconditionally true Turns out the feature was not used presently. Historically, the commit which removed the support is `30a700c5b0`.	2021-03-30 09:36:45 +02:00
Piotr Sarna	c070178c7e	gms: make ROW_LEVEL_REPAIR feature unconditionally true The feature is assumed to be true due to being over 2 years old. It's still advertised in gossip, but it's assumed to always be present.	2021-03-30 09:36:11 +02:00
Piotr Sarna	80ebedd242	repair: stop relying on ROW_LEVEL_REPAIR feature The feature is going away because it's over 2 years old, so the code which depended on it becomes unconditional.	2021-03-30 09:35:40 +02:00
Avi Kivity	c1badc6317	noexcept_traits: convert enable_if to concepts A little easier to read. Closes #8329	2021-03-30 09:30:23 +02:00
Avi Kivity	405c4e7af1	serializer: replace enable_if in deserialized_bytes_proxy with constraint Simpler to read and understand. Closes #8303	2021-03-30 09:30:06 +02:00
Avi Kivity	7c953f33d5	utils: disk-error-handler: replace enable_if with concepts Simpler, cleaner. We also replace the deprecated std::result_of_t with std::invoke_result_t. Closes #8305	2021-03-30 09:29:46 +02:00
Nadav Har'El	115324f71a	Merge 'Add partial admission control to Thrift frontend' from Piotr Sarna This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later. Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share. Refs #4826 Closes #8313 * github.com:scylladb/scylla: thrift: add support for max_concurrent_requests_per_shard thrift: add metrics for admission control thrift: add a counter for in-flight requests thrift: add a counter for blocked requests thrift: partially add admission control service_permit: add a getter for the number of units held thrift: coroutinize processing a request memory_limiter: add a missing seastarx include	2021-03-29 21:36:50 +03:00
Raphael S. Carvalho	a390f4eb61	sstables: optimize LCS reshape for repair-based operations LCS reshape is currently inefficient for repair-based operation, because the disjoint run of 256 sstables is reshaped into bigger L0 files, which will be then integrated into the main sstable set. On reshape completion, LCS has to compact those big L0 files onto higher levels, until last level is reached, producing bad write amplification. A much better approach is to instead compact that disjoint run into the best possible level L, which can be figured out with: log (base fan_out) of (total_size / max_sstable_size) This compaction will be essentially a copy operation. It's important to do it rather than only mutating the level of sstables because we have to reshape the input run according to LCS parameters like sstable size. For repair-based bootstrap/replace, the input disjoint run is now efficiently reshaped into an ideal level L, so there's no compaction backlog once reshape completes. This behavior will manifest in the log as this: LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2 For repair-based decommission/removenode though, which reshape wasn't wired on yet, level L may temporarily hold 2 disjoint runs, which overlap one another, but LCS itself will incrementally merge them through either promotion of L-1 into L, or by detecting overlapping in level L and merging the overlapping sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>	2021-03-29 20:22:04 +03:00
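The level formula quoted in `a390f4eb61` can be sketched as below. This is a minimal illustration, not the code from the commit: the function name `ideal_level_for_run` is hypothetical, and rounding the logarithm up to the next whole level is an assumption, since the commit message does not say how the result is rounded. Integer arithmetic is used instead of `std::log` to avoid floating-point edge cases.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not Scylla's actual function): returns the smallest
// level L such that fan_out^L >= ceil(total_size / max_sstable_size),
// i.e. log base fan_out of (total_size / max_sstable_size), rounded up.
// Rounding up is an assumption made for this sketch.
unsigned ideal_level_for_run(std::uint64_t total_size,
                             std::uint64_t max_sstable_size,
                             std::uint64_t fan_out) {
    if (max_sstable_size == 0 || fan_out < 2) {
        throw std::invalid_argument("max_sstable_size must be > 0 and fan_out >= 2");
    }
    // Number of max-sized sstables needed to hold the run, rounded up.
    const std::uint64_t fragments =
        total_size / max_sstable_size + (total_size % max_sstable_size != 0 ? 1 : 0);

    unsigned level = 0;
    std::uint64_t capacity = 1;  // fan_out^level, in units of max-sized sstables
    while (capacity < fragments) {
        ++level;
        if (capacity > std::numeric_limits<std::uint64_t>::max() / fan_out) {
            break;  // the next power would overflow, so it certainly covers `fragments`
        }
        capacity *= fan_out;
    }
    return level;
}
```

With a fan-out of 10, a run worth 100 max-sized sstables maps to level 2 under these assumptions, and a run worth 101 maps to level 3.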
Botond Dénes	3c54c990ab	test: view_build_test: test_view_update_generator_buffering: fail gracefully Failures in this test typically happen inside the test consumer object. These however don't stop the test as the code invoking the consumer object handles exceptions coming from it. So the test will run to completion and will fail again when comparing the produced output with the expected one. This results in distracting failures. The real problem is not the difference in the output, but the first check that failed, which is however buried in the noise. To prevent this add an "ok" flag which is set to false if the consumer fails. In this case the additional checks are skipped in the end to not generate useless noise. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>	2021-03-29 17:58:28 +03:00
Avi Kivity	a8463cfb37	Merge "reader_permit: signal leaked resources" from Botond " When a permit is destroyed we check if it still holds on to any resources in the destructor. Any resources the permit still holds on to are leaked resources, as users should have released these. Currently we just invoke `on_internal_error_noexcept()` to handle this, which -- depending on the configuration -- will result in an error message or an assert. In the former case, the resources will be leaked for good. This mini-series fixes this, by signaling back these resources to the semaphore. This helps avoid an eventual complete dry-up of all semaphore resources and a subsequent complete shutdown of reads. Tests: unit(release, debug) " * 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla: reader_permit: signal leaked resources test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore	2021-03-29 17:57:31 +03:00
Botond Dénes	9e01c4c667	test: view_build_test: test_view_update_generator_buffering: use separate permit for readers Said test has two separate logical readers, but they share the same permit, which is illegal. This didn't cause any problems yet, but soon the semaphore will start to keep score of active/inactive permits which will be confused by such sharing, so have them use separate permits. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>	2021-03-29 17:35:51 +03:00
Takuya ASADA	6f678ab7ff	aws: initialize self._disks['ebs'] when no EBS disks Seems like aws_instance.ebs_disks() causes traceback when no EBS disks available, need to initialize with empty list. Fixes #8365 Closes #8366	2021-03-29 17:21:14 +03:00
Gleb Natapov	13a3cf62bb	raft: move incoming message processing into per state functions Clean up step() function by moving state specific processing into per state functions. This way it is easier to see how each state handles individual messages. No functional changes here. Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>	2021-03-29 15:48:43 +02:00
Tomasz Grabiec	43fd322856	Merge 'scylla-gdb.py: Add io-queues command' from Piotr Sarna The command can be used to inspect IO queues of a local reactor. Example output: ``` (gdb) scylla io-queues Dev 0: Class: \|shares: \|ptr: -------------------------------------------------------------------------------- "default" \|1 \|(seastar::priority_class_data )0x6000002c6500 "commitlog" \|1000 \|(seastar::priority_class_data )0x6000003ad940 "memtable_flush" \|1000 \|(seastar::priority_class_data )0x6000005cb300 "streaming" \|200 \|(seastar::priority_class_data )0x0 "query" \|1000 \|(seastar::priority_class_data )0x600000718580 "compaction" \|1000 \|(seastar::priority_class_data )0x6000030ef0c0 Max request size: 2147483647 Max capacity: Ticket(weight: 4194303, size: 4194303) Capacity tail: Ticket(weight: 73168384, size: 100561888) Capacity head: Ticket(weight: 77360511, size: 104242143) Resources executing: Ticket(weight: 2176, size: 514048) Resources queued: Ticket(weight: 384, size: 98304) Handles: (1) Class 0x6000005d7278: Ticket(weight: 128, size: 32768) Ticket(weight: 128, size: 32768) Ticket(weight: 128, size: 32768) Pending in sink: (0) ``` Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations. Closes #8362 * github.com:scylladb/scylla: scylla-gdb: add io-queues command scylla-gdb.py: add parsing std::priority_queue scylla-gdb.py: add parsing std::atomic scylla-gdb.py: add parsing std::shared_ptr scylla-db.py: add parsing intrusive_slist	2021-03-29 15:31:48 +02:00
Piotr Sarna	adf07eb8fb	scylla-gdb: add io-queues command The command can be used to inspect reactor's IO queues. Example output: (gdb) scylla io-queues Dev 0: Class: \|shares: \|ptr: -------------------------------------------------------------------------------- "default" \|1 \|(seastar::priority_class_data )0x6000002c6500 "commitlog" \|1000 \|(seastar::priority_class_data )0x6000003ad940 "memtable_flush" \|1000 \|(seastar::priority_class_data )0x6000005cb300 "streaming" \|200 \|(seastar::priority_class_data )0x0 "query" \|1000 \|(seastar::priority_class_data )0x600000718580 "compaction" \|1000 \|(seastar::priority_class_data )0x6000030ef0c0 Max request size: 2147483647 Max capacity: Ticket(weight: 4194303, size: 4194303) Capacity tail: Ticket(weight: 73168384, size: 100561888) Capacity head: Ticket(weight: 77360511, size: 104242143) Resources executing: Ticket(weight: 2176, size: 514048) Resources queued: Ticket(weight: 384, size: 98304) Handles: (1) Class 0x6000005d7278: Ticket(weight: 128, size: 32768) Ticket(weight: 128, size: 32768) Ticket(weight: 128, size: 32768) Pending in sink: (0)	2021-03-29 15:01:25 +02:00
Piotr Sarna	f162423b8a	scylla-gdb.py: add parsing std::priority_queue The parsing assumes that the underlying storage is a vector, which is often enough the case.	2021-03-29 13:10:36 +02:00
Piotr Sarna	e36c1f1d25	scylla-gdb.py: add parsing std::atomic	2021-03-29 13:10:36 +02:00
Piotr Sarna	0d4d04d3e6	scylla-gdb.py: add parsing std::shared_ptr	2021-03-29 13:10:36 +02:00
Piotr Sarna	c61822bc86	scylla-db.py: add parsing intrusive_slist	2021-03-29 13:10:36 +02:00
Piotr Sarna	bc1c92fd05	Merge 'Improve flat_mutation_reader::consume_pausable' from Piotr Jastrzębski `flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places worth mentioning are memtables and combined readers but there are others as well. This patchset improves `consume_pausable` in three ways: 1. it removes unnecessary allocation 2. it rearranges ifs to not check the same thing twice 3. for a consumer that returns plain stop_iteration not a future<stop_iteration> it reduces the amount of future usage Test: unit(dev, release, debug) Combined reader microbenchmark has shown from 2% to 22% improvement in median execution time while memtable microbenchmark has shown from 3.6% to 7.8% improvement in median execution time. Before the change: ``` ./build/release/test/perf/perf_mutation_readers --random-seed 3549335083 single run iterations: 0 single run duration: 1.000s number of runs: 5 number of cores: 16 random seed: 3549335083 test iterations median mad min max combined.one_row 1316234 140.120ns 0.020ns 140.074ns 140.141ns combined.single_active 7332 91.484us 31.890ns 91.453us 91.778us combined.many_overlapping 945 870.973us 429.720ns 868.625us 871.403us combined.disjoint_interleaved 7102 85.989us 7.847ns 85.973us 85.997us combined.disjoint_ranges 7129 85.570us 7.840ns 85.562us 85.596us combined.overlapping_partitions_disjoint_rows 5458 124.787us 56.738ns 124.731us 125.370us clustering_combined.ranges_generic 1920688 217.940ns 0.184ns 217.742ns 218.275ns clustering_combined.ranges_specialized 1935318 194.610ns 0.199ns 194.210ns 195.228ns memtable.one_partition_one_row 624001 1.600us 1.405ns 1.599us 1.605us memtable.one_partition_many_rows 79551 12.555us 1.829ns 12.549us 12.558us memtable.many_partitions_one_row 40557 24.748us 77.083ns 24.644us 25.135us memtable.many_partitions_many_rows 3220 310.429us 57.628ns 310.295us 311.189us ``` After the change: ``` ./build/release/test/perf/perf_mutation_readers --random-seed 3549335083 single run 
iterations: 0 single run duration: 1.000s number of runs: 5 number of cores: 16 random seed: 3549335083 test iterations median mad min max combined.one_row 1358839 109.222ns 0.122ns 109.089ns 109.348ns combined.single_active 7525 87.305us 25.540ns 87.273us 87.362us combined.many_overlapping 962 853.195us 1.904us 851.244us 855.142us combined.disjoint_interleaved 7310 81.988us 28.877ns 81.949us 82.032us combined.disjoint_ranges 7315 81.699us 37.144ns 81.662us 81.874us combined.overlapping_partitions_disjoint_rows 5591 120.964us 15.294ns 120.949us 121.120us clustering_combined.ranges_generic 1954722 211.993ns 0.052ns 211.883ns 212.084ns clustering_combined.ranges_specialized 2042194 187.807ns 0.066ns 187.732ns 188.289ns memtable.one_partition_one_row 648701 1.542us 0.339ns 1.542us 1.543us memtable.one_partition_many_rows 85007 11.759us 1.168ns 11.752us 11.782us memtable.many_partitions_one_row 43893 22.805us 17.147ns 22.782us 22.843us memtable.many_partitions_many_rows 3441 290.220us 41.720ns 290.172us 290.306us ``` Closes #8359 * github.com:scylladb/scylla: flat_mutation_reader: optimize consume_pausable for some consumers flat_mutation_reader: special case consumers in consume_pausable flat_mutation_reader: Change order of checks in consume_pausable flat_mutation_reader: fix indentation in consume_pausable flat_mutation_reader: Remove allocation in consume_pausable perf: Add benchmarks for large partitions	2021-03-29 13:06:56 +02:00
Piotr Sarna	4c79f132b6	thrift: add support for max_concurrent_requests_per_shard The Thrift frontend is now capable of limiting the max number of concurrent in-flight requests. Surplus requests are shed. Tests: manual	2021-03-29 13:05:16 +02:00
Piotr Sarna	9f53327c9d	thrift: add metrics for admission control The new metrics include information about how many requests were blocked on memory, how much is still available, etc.	2021-03-29 13:05:16 +02:00
Piotr Sarna	6b021779d2	thrift: add a counter for in-flight requests	2021-03-29 13:05:16 +02:00
Piotr Sarna	9391515461	thrift: add a counter for blocked requests The counter tracks how many requests were blocked by the memory estimation based admission control semaphore.	2021-03-29 13:05:16 +02:00
Piotr Sarna	ef1de114f0	thrift: partially add admission control This commit adds admission control in the form of passing service permits to the Thrift server. The support is partial, because Thrift also supports running CQL queries, and for that purpose a query_state object is kept in the Thrift handler. However, the handler is generally created once per connection, not once per query, and the query_state object is supposed to keep the state of a single query only. In order to keep this series simpler, the CQL-on-top-of-Thrift layer is not touched and is left as TODO. Moreover, the Thrift layer does not make it easy to pass custom per-query context (like service_permit), so the implementation uses a trick: the service permit is created on the server and then passed as reference to its connections and their respective Thrift handlers. Then, each time a query is read from the socket, this service permit is overwritten and then read back from the Thrift handler. This mechanism heavily relies on the fact that there are zero preemption points between overwriting the service permit and reading it back by the handler. Otherwise, races may occur. This assumption was verified by code inspection + empirical tests, but if somebody is aware that it may not always hold, please speak up.	2021-03-29 13:05:16 +02:00
Nadav Har'El	ccc75bfe2a	Merge 'Disable thrift by default' from Piotr Sarna The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default. Fixes #8336 Closes #8338 * github.com:scylladb/scylla: docs: mention disabling Thrift by default db,config: disable Thrift by default	2021-03-29 12:48:20 +03:00
Piotr Sarna	3388694e69	service_permit: add a getter for the number of units held The helper function makes debugging considerably easier.	2021-03-29 11:34:18 +02:00
Piotr Sarna	364b921e25	thrift: coroutinize processing a request While not particularly useful now, it will facilitate later changes which introduce service permits.	2021-03-29 11:34:18 +02:00
Piotr Sarna	09621e5fc5	memory_limiter: add a missing seastarx include It's that or declaring everything that belongs to seastar namespace explicitly, and including "seastarx.hh" is more standard.	2021-03-29 11:34:18 +02:00
Michał Chojnowski	8c45225f21	docs: remove the obsolete IMR design note IMR, as described in this design note, was removed in `001652815c`. This doc should have been removed back then, but was overlooked. Closes #8340	2021-03-29 10:58:05 +02:00
Pekka Enberg	aec33c599b	Update tools/python3 submodule * tools/python3 6f3bcbe...ad04e8e (2): > dist/debian: fix renaming debian/scylla-* files rule > fix license of package build script to AGPL	2021-03-29 11:50:24 +03:00
Pekka Enberg	203b7394d7	Update tools/java submodule * tools/java 7b66b7a0fc...ccc4201ded (1): > dist/debian: fix renaming debian/scylla-* files rule	2021-03-29 11:50:19 +03:00
Tomasz Grabiec	c0ce122f77	Merge "raft: wire up rpc add_server/remove_server for configuration changes" from Pavel Solodovnikov Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes, i.e. the nodes which are no longer part of the most recent configuration. The effective scope of RPC mappings is limited by the piece of code which sends messages to both the "new" nodes (which are added to the cluster with the most recent configuration change) and the "old" nodes which are removed from the cluster. Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed of. There is also another problem to be solved: in Raft an instance may need to communicate with a peer outside its current configuration. This may happen, e.g., when a follower falls out of sync with the majority and then a configuration is changed and a leader not present in the old configuration is elected. The solution is to introduce the concept of "expirable" updates to the RPC subsystem. When RPC receives a message from an unknown peer, it also adds the return address of the peer to the address map with a TTL. Should we need to respond to the peer, its address will be known. An outgoing communication to an unconfigured peer is impossible. 
* manmanson/raft_mappings_wiring_v12: raft: update README.md with info on RPC server address mappings raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes raft/fsm: add optional `rpc_configuration` field to fsm_output raft: maintain current rpc context in `server_impl` raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff` raft: unit-tests for `raft_address_map` raft: support expiring server address mappings for rpc module	2021-03-29 10:28:45 +02:00
Piotr Jastrzebski	86cf566692	flat_mutation_reader: optimize consume_pausable for some consumers consumers that return stop_iteration not future<stop_iteration> don't have to consume a single fragment per each iteration of repeat. They can consume whole buffer in each iteration. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:55:14 +02:00
Piotr Jastrzebski	26cc4f112d	flat_mutation_reader: special case consumers in consume_pausable consume_pausable works with consumers that return either stop_iteration or future<stop_iteration>. So far it was calling futurize_invoke for both. This patch special cases consumers that return future<stop_iteration> and don't call futurize_invoke for them as this is unnecessary work. More importantly, this will allow the following patch to optimize consumers that return plain stop_iteration. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:55:14 +02:00
Piotr Jastrzebski	164e23d2b1	flat_mutation_reader: Change order of checks in consume_pausable This way we can avoid checking is_buffer_empty twice. Compiler might be able to optimize this out but why depend on it when the alternative is not less readable. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:55:14 +02:00
Piotr Jastrzebski	776ba29cec	flat_mutation_reader: fix indentation in consume_pausable Code was left with wrong indentation by the previous commit that removed do_with call around the code that's currently present. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:55:14 +02:00
Piotr Jastrzebski	9fb0014d72	flat_mutation_reader: Remove allocation in consume_pausable The allocation was introduced in `515bed90bb` but I couldn't figure out why it's needed. It seems that the consumer can just be captured inside lambda. Tests seem to support the idea. Indentation will be fixed in the following commit to make the review easier. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:55:14 +02:00
Piotr Jastrzebski	3aa7bee5e3	perf: Add benchmarks for large partitions in perf_mutation_readers. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-03-29 09:48:11 +02:00
Avi Kivity	ec4d91f9eb	tools: toolchain: dbuild: improve cgroupv2 detection code dbuild detects if the kernel is using cgroupv2 by checking if the cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu 20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second mount matches the search expression and gives a false positive. Fix by adding a space at the end; this will fail to match /sys/fs/cgroup/unified. Closes #8355	2021-03-29 09:31:29 +03:00
Pavel Solodovnikov	2d9e94f050	raft: update README.md with info on RPC server address mappings Describe the high-level scheme of managing RPC mappings and also expand on the introduction of "expirable" RPC mappings concept and why these are needed. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:13 +03:00
Pavel Solodovnikov	f61206e483	raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes, i.e. the nodes which are no longer part of the most recent configuration. The effective scope of RPC mappings is limited by the piece of code which sends messages to both the "new" nodes (which are added to the cluster with the most recent configuration change) and the "old" nodes which are removed from the cluster. Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed of. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:09 +03:00
Pavel Solodovnikov	16d9e8e9af	raft/fsm: add optional `rpc_configuration` field to fsm_output The field is set in `fsm.get_output` whenever `_log.last_conf_idx()` or the term changes. Also, add `_last_conf_idx` and `_last_term` to `fsm::last_observed_state`, they are utilized in the condition to evaluate current rpc configuration in `fsm.get_output()`. This will be used later to update rpc config state stored in `server_impl` and maintain rpc address map. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:05 +03:00
Tomasz Grabiec	6035fd05b3	Merge "Unify drain() and drain_on_shutdown()" from Pavel Emelyanov The start-stop code is drifting towards a straightforward scheme of a bunch of service foo; foo.start(); auto stop_foo = defer([&foo] { foo.stop(); }); blocks. The drain_on_shutdown() and its relation to drain() and decommission() is a big hurdle on the way of this effort. This set unifies drain() and drain_on_shutdown() so that drain really becomes just some first steps of the regular shutdown, i.e. -- what it should be. Some synchronisation bits around it are still needed, though. This unification also closes a bunch of not-yet-caught bugs where parts of the system remained running if shutdown happened after nodetool drain. In this case the whole drain_on_shutdown() becomes a noop (just returns drain()'s future) and what's missing in drain() becomes missing on shutdown. tests: unit(dev), dtest(simple_boot_shutdown : dev), manual(start+stop, start+drain+stop : dev) refs: #2737 * xemul/br-drain-on-shutdown: drain_on_shutdown: Simplify drain: Fix indentation storage_service: Unify drain and drain_on_shutdown storage_proxy: Drain and unsubscribe in main.cc migration_manager: Stop it in two phases stream_manager: Stop instances on drain batchlog_manager: Stop its instances on shutdown tracing: Shutdown tracing in drain tracing: Stop it in main.cc system_distributed_keyspace: Stop it in main.cc storage_service: Move (un)subscription to migration events	2021-03-26 18:37:27 +01:00
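The "service foo; foo.start(); auto stop_foo = defer(...)" scheme described in `6035fd05b3` relies on scope guards running in reverse order of construction. The following is a standalone sketch of that idea only: `deferred_action`, `defer_action`, `toy_service` and `run_start_stop_sequence` are hypothetical names invented for illustration, standing in for Seastar's `defer` and the real services, which are asynchronous and considerably more involved.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scope guard: runs the stored callable
// when the guard object goes out of scope.
template <typename Func>
class deferred_action {
    Func _func;
    bool _armed = true;
public:
    explicit deferred_action(Func f) : _func(std::move(f)) {}
    deferred_action(deferred_action&& o) noexcept
        : _func(std::move(o._func)), _armed(std::exchange(o._armed, false)) {}
    deferred_action(const deferred_action&) = delete;
    deferred_action& operator=(const deferred_action&) = delete;
    ~deferred_action() {
        if (_armed) {
            _func();
        }
    }
};

template <typename Func>
deferred_action<Func> defer_action(Func f) {
    return deferred_action<Func>(std::move(f));
}

// Toy synchronous service that records its lifecycle events.
struct toy_service {
    std::string name;
    std::vector<std::string>& log;
    void start() { log.push_back("start " + name); }
    void stop() { log.push_back("stop " + name); }
};

// Starts two services in order; the guards stop them in reverse order
// when the inner scope ends.
std::vector<std::string> run_start_stop_sequence() {
    std::vector<std::string> log;
    {
        toy_service a{"a", log};
        a.start();
        auto stop_a = defer_action([&a] { a.stop(); });

        toy_service b{"b", log};
        b.start();
        auto stop_b = defer_action([&b] { b.stop(); });
    }
    return log;
}
```

Because every service is stopped by its own guard, shutdown needs no separate hand-written sequence, which is the property the series is working towards.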
Pavel Solodovnikov	19cc85b3b6	raft: maintain current rpc context in `server_impl` Introduce rpc server_address that represents the last observed state of address mappings for RPC module. It does not correspond to any kind of configuration in the raft sense, just an artificial construct corresponding to the largest set of server addresses coming from both previous and current raft configurations (to be able to contact both joining and leaving servers). This will be used later to update rpc module mappings when cluster configuration changes. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	8799ccbab0	raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff` `std::unordered_set::contains` is introduced in C++20 and provides clearer semantics to check existence of a given element in a set. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	7c229998e8	raft: unit-tests for `raft_address_map` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	3c4d46728d	raft: support expiring server address mappings for rpc module This patch introduces `raft_address_map` class to abstract the notion of expirable address mappings for a raft rpc module. In Raft an instance may need to communicate with a peer outside its current configuration. This may happen, e.g., when a follower falls out of sync with the majority and then a configuration is changed and a leader not present in the old configuration is elected. The solution is to introduce the concept of "expirable" updates to the RPC subsystem. When RPC receives a message from an unknown peer, it also adds the return address of the peer to the address map with a TTL. Should we need to respond to the peer, its address will be known. An outgoing communication to an unconfigured peer is impossible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
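The "expirable" mapping idea from `3c4d46728d` can be sketched as a map whose entries optionally carry an expiry time. This is an illustration under stated assumptions, not the `raft_address_map` class from the commit: the class name `expiring_address_map`, its method names, the use of plain integers and strings for server ids and addresses, and the rule that an expiring update never overrides a permanent entry are all hypothetical choices made for this sketch. Time is passed in explicitly so the behaviour is deterministic.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of an address map with optional per-entry TTL.
class expiring_address_map {
public:
    using clock = std::chrono::steady_clock;
    using time_point = clock::time_point;
    using duration = clock::duration;

private:
    struct entry {
        std::string address;
        // Empty for configured servers, which never expire.
        std::optional<time_point> expires_at;
    };
    std::unordered_map<std::uint64_t, entry> _map;

public:
    // Mapping for a server that is part of the configuration.
    void set_permanent(std::uint64_t id, std::string address) {
        _map[id] = entry{std::move(address), std::nullopt};
    }

    // Mapping learned from an incoming message of an unknown peer.
    // Assumption: a permanent entry is never downgraded to an expiring one.
    void set_expiring(std::uint64_t id, std::string address,
                      time_point now, duration ttl) {
        auto it = _map.find(id);
        if (it != _map.end() && !it->second.expires_at) {
            return;
        }
        _map[id] = entry{std::move(address), now + ttl};
    }

    // Returns the address if known and not expired; drops expired entries.
    std::optional<std::string> find(std::uint64_t id, time_point now) {
        auto it = _map.find(id);
        if (it == _map.end()) {
            return std::nullopt;
        }
        if (it->second.expires_at && *it->second.expires_at <= now) {
            _map.erase(it);
            return std::nullopt;
        }
        return it->second.address;
    }

    // Removes a mapping, e.g. once a server has left the configuration.
    void erase(std::uint64_t id) { _map.erase(id); }
};
```

A reply to an unconfigured peer can then be routed as long as the TTL has not elapsed, while configured servers stay reachable until they are explicitly removed.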
Pavel Emelyanov	d1796ab3dc	drain_on_shutdown: Simplify The modern version of this method doesn't need the run_with_no_api_lock(), as it's launched on shard 0 anyway, nor does it need logging before and after, as that's done by the deferred action from main that calls it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:46 +03:00
Pavel Emelyanov	58b47efe16	drain: Fix indentation Previous patch left it broken for readability. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:46 +03:00
Pavel Emelyanov	8d7ad6de03	storage_service: Unify drain and drain_on_shutdown Now they only differ in one bit -- compaction manager is drained on drain and is left running (until regular stop) on shutdown. So this unification adds a boolean flag for this case. Also the indentation is deliberately left broken for the sake of patch readability. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:46 +03:00
Pavel Emelyanov	b60099c2f8	storage_proxy: Drain and unsubscribe in main.cc Currently shutdown after drain leaves storage proxy subscribed on storage_service events and without the storage_proxy::drain_on_shutdown being called. So it seems safe if the whole thing is relocated closer to its starting peers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:46 +03:00
Pavel Emelyanov	9a8f125890	migration_manager: Stop it in two phases Before the patch the migration manager was stopped in two ways and one was buggy. Plain shutdown -- it's just sharded::stop-ed by defer in main(), but this happens long after the shutdown of commitlog, which is not correct. Shutdown after drain -- it's stopped twice, first time right before the commitlog shutdown, second -- the same defer in main(). And since the sharded::stop is re-entrant, the 2nd stop is a no-op. This patch splits the stop into two phases: first it stops the instances and does this in _both_ -- plain shutdown and shutdown after drain. This phase is done before commitlog shutdown in both cases. Second, the existing deferred sharded::stop in main.cc. This change needs the migration_manager::stop() to become re-entrant, but that's easily checked with the help of abort_source the migration_manager has. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	de8a7fe798	stream_manager: Stop instances on drain It's not seen directly from this patch itself, but the only difference between first several calls that drain() makes and the stop_transport() is the do_stop_stream_manager() in the latter. Again, it's partially a bugfix (shutdown after drain leaves streaming running), partially a must-have thing (streaming is not expected in the background after drain), partially a unification of two drains out there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	9a7e2a218b	batchlog_manager: Stop its instances on shutdown It's now stopped (not sharded::stop(), but batchlog_manager::stop) on plain drain, but plain shutdown leaves it running, so fill this gap. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	bcc3935ce7	tracing: Shutdown tracing in drain First of all, shutdown that happens after nodetool drain leaves tracing up-n-running, so it's effectively a bugfix. But also a step towards unified drain and drain_on_shutdown. Keeping this bit in drain seems to be required because drain stops transport, flushes column families and shuts commitlog down. Any tracing activity happening after it looks uncalled for. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	f1d7804102	tracing: Stop it in main.cc The tracing::stop() just checks that it was shutdown()-ed and otherwise a noop, so it's OK to stop tracing later. This brings drain() and drain_on_shutdown() closer to each other and makes main.cc look more like it should. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	d7cccec97f	system_distributed_keyspace: Stop it in main.cc It's now stopped in drain_on_shutdown, but since its stop() method is a noop, it doesn't matter where it is. Keeping it in main.cc next to related start brings drain_on_shutdown() closer to drain() and the whole thing closer to the Ideal start-stop sequence. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Pavel Emelyanov	5456174d69	storage_service: Move (un)subscription to migration events After the patch the subscription effectively happens at the same time as before, but is now located in main.cc, so no real change here. The unsubscription was in the drain_on_shutdown before the patch, but after it it happens to be a defer next to its peer, i.e. later, but it shouldn't be disastrous for two reasons. First -- client services and migration manager are already stopped. Second -- before the patch this subscription was _not_ cancelled if shutdown ran after nodetool drain and it didn't cause troubles. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-26 18:58:45 +03:00
Botond Dénes	d64b1fdd6a	reader_permit: signal leaked resources When destroying a permit with leaked resources we call `on_internal_error_noexcept()` in the destructor. This method logs an error or asserts depending on the configuration. When not asserting, we need to return the leaked units to the semaphore, otherwise they will be leaked for good. We can do this because we know exactly how many resources the user of the permit leaked (never signalled).	2021-03-26 14:23:32 +02:00
Botond Dénes	0f1a72ba59	test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease To ensure the semaphores outlive all permits created as part of the tests.	2021-03-26 14:22:43 +02:00
Botond Dénes	f843e3de08	sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore Used to produce the needed permits for the index reads, such that it outlives all the permits in use.	2021-03-26 11:06:02 +02:00
Tomasz Grabiec	f86896d387	Merge "Iterate range tombstones in partition_snapshot_reader" from Pavel Emelyanov Currently the guy copies and merges all range tombstones from all partition versions (that match the given range, but still) when being initialized or decides to refresh iterators. This is a lot of potentially useless work and memory, as the reader may be dropped before it emits all the mutations from the given range(s). It's better to walk the tombstones step-by-step, like it's done for rows. fixes: #1671 tests: unit(dev) * xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2: range_tombstone_stream: Remove unused methods partition_snapshot_reader: Emit range tombstones on demand partition_snapshot_reader: Introduce maybe_refresh_state partition_snapshot_reader: Move range tombstone stream member partition_snapshot_reader: Add reset_state method to helper class partition_snapshot_reader: Downgrade heap comparator partition_snapshot_reader: Use on-demand comparators range_tombstone_list: Add new slice() helper range_tombstone_list: Introduce iterator_range alias	2021-03-26 01:27:18 +01:00
Pavel Emelyanov	c6a0e0439e	files: Construct file_impls properly Constructors of classes inherited from file_impl copy alignment values by hand, but miss the overwrite one, thus on a new file it remains default-initialized. To fix this and not to forget to properly initialize future fields from file_impl, use the impl's copy constructor. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210325104830.31923-1-xemul@scylladb.com>	2021-03-26 00:22:11 +01:00
Tomasz Grabiec	ef06a939c4	Merge "raft: seven etcd unit tests ported" from Alejo Seven etcd unit tests as boost tests. * alejo/raft-tests-etcd-08-v4-communicate-v5: raft: etcd unit tests: test proposal handling scenarios raft: etcd unit tests: test old messages ignored raft: etcd unit tests: test single node precandidate raft: etcd unit tests: test dueling precandidates raft: etcd unit tests: test dueling candidates raft: etcd unit tests: test cannot commit without new term raft: etcd unit tests: test single node commit raft: etcd unit tests: update test_leader_election_overwrite_newer_logs raft: etcd unit tests: fix test_progress_leader raft: testing: log comparison helper functions raft: testing: helper to make fsm candidate raft: testing: expose log for test verification raft: testing: use server_address_set raft: testing: add prevote configuration raft: testing: make become_follower() available for tests	2021-03-25 20:27:07 +01:00
Alejo Sanchez	ace0ee514f	raft: etcd unit tests: test proposal handling scenarios TestProposal For multiple scenarios, check proposal handling. Note, instead of expecting an explicit result for each specified case, the test automatically checks for expected behavior when quorum is reached or not. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	77163ea76a	raft: etcd unit tests: test old messages ignored TestOldMessages Checks an append request from a leader from a previous term is ignored. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	bf65b19803	raft: etcd unit tests: test single node precandidate TestSingleNodePreCandidate Checks a single node configuration with precandidate on works to automatically elect the node. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	de7051467b	raft: etcd unit tests: test dueling precandidates TestDuelingPreCandidates In a configuration of 3 nodes, two nodes don't see each other and they compete for leadership. Loser (3) should revert to follower when prevote is rejected and revert to term 1. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	aa7d23f86b	raft: etcd unit tests: test dueling candidates TestDuelingCandidates In a configuration of 3 nodes, two nodes don't see each other and they compete for leadership. Once reconnected, loser should not disrupt. But note it will remain candidate with current algorithm without prevoting and other fsms will not bump term. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	1eac94e7d6	raft: etcd unit tests: test cannot commit without new term TestCannotCommitWithoutNewTermEntry tests the entries cannot be committed when leader changes, no new proposal comes in and ChangeTerm proposal is filtered. NOTE: this doesn't check committed but it's implicit for next round; this could also use communicate() providing committed output map Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	b421fe3605	raft: etcd unit tests: test single node commit Port etcd TestSingleNodeCommit In a single node configuration elect the node, add 2 entries and check number of committed entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	9b4538476b	raft: etcd unit tests: update test_leader_election_overwrite_newer_logs Make test_leader_election_overwrite_newer_logs use newer communicate() and other new helpers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	368eec1190	raft: etcd unit tests: fix test_progress_leader Make implementation follow closer to original test. Use newer boost test helpers. NOTE: in etcd it seems a leader's self progress is in PIPELINE state. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:28 -04:00
Alejo Sanchez	ba29970e29	raft: testing: log comparison helper functions Two helper functions to compare logs. For now only index, term, and data type are used. Data content comparison does not seem to be necessary for now. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:28 -04:00
Alejo Sanchez	aeab4cf4a9	raft: testing: helper to make fsm candidate Current election_timeout() helper might bump the term twice. It's convenient and less error prone to have a more fine grained helper that stops right when candidate state is reached. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:19 -04:00
Alejo Sanchez	7a6616f1cb	raft: testing: expose log for test verification Let derived classes access the log to verify its contents. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:03:46 -04:00
Alejo Sanchez	05b1f57e67	raft: testing: use server_address_set Use server_address_set in local namespace for brevity. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:01:12 -04:00
Alejo Sanchez	9d0a7d8ccf	raft: testing: add prevote configuration Provide a generic prevote configuration for tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:00:28 -04:00
Dejan Mircevski	b2a04985f7	cql-pytest: Drop needless INSERT in test_null One INSERT statement was unnecessary for the test, so delete it. Another was necessary, so explain it. Tests: cql-pytest/test_null on both Scylla and Cassandra Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8304	2021-03-25 16:37:00 +01:00
Tomasz Grabiec	7b30d31d77	Merge "raft: test configuration changes" from Kostja Test raft configuration changes: a node with empty configuration, transitioning to an entirely different cluster, transitioning in presence of down nodes, leader change during configuration change, stray replies, etc. * scylla-dev/raft-empty-confchange-v5: (21 commits) raft: (testing) stray replies from removed followers raft: always return a non-zero configuration index from the log raft: (testing) leader change during configuration change raft: (testing) test confchange {ABCDE} -> {ABCDEFG} raft: (testing) test confchange {ABCDEF} -> {ABCGH} raft: (testing) test confchange {ABC} -> {CDE} raft: (testing) test confchange {AB} -> {CD} raft: (testing) test confchange {A} -> {B} raft: (testing) test a server with empty configuration raft: (testing) introduce testing utilities raft: (testing) simplify id allocation in test raft: (testing) add select_leader() helper raft: (testing) introduce communicate() helper raft: (testing) style cleanup in raft_fsm_test raft: (testing) fix bug in election_threshold raft: minor style changes & comments raft: do not assert when transitioning to empty config raft: assert we never apply a snapshot over uncommitted entries (leader) raft: improve tracing raft: add fsm_output::empty() helper to aid testing ...	2021-03-25 14:01:09 +01:00
Wojciech Mitros	b152dc8c86	types: move read_collection_size/value specialization to header file The template method needs to be specialized in each file that is using it. To avoid rewriting the specialization into multiple files, move it to the header file. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-03-25 12:18:38 +01:00
Avi Kivity	46185d7d82	Update tools/jmx submodule * tools/jmx 9c687b5...440313e (1): > storage_service: Add a generic toppartitions endpoint	2021-03-25 12:36:10 +02:00
Alejo Sanchez	7e6807e8fc	raft: testing: make become_follower() available for tests Some etcd tests need to force a follower with a specific leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-24 19:11:09 -04:00
Piotr Wojtczak	c1daf2bb24	column_family: Make toppartitions queries more generic Right now toppartitions can only be invoked on one column family at a time. This change introduces a natural extension to this functionality, allowing to specify a list of families. We provide three ways for filtering in the query parameter "name_list": 1. A specific column family to include in the form "ks:cf" 2. A keyspace, telling the server to include all column families in it. Specified by omitting the cf name, i.e. "ks:" 3. All column families, which is represented by an empty list The list can include any amount of one or both of the 1. and 2. option. Fixes #4520 Closes #7864	2021-03-24 17:54:05 +02:00
Raphael S. Carvalho	bcbb39999b	LCS: Fix terrible write amplification when reshaping level 0 LCS reshape is basically 'major compacting' level 0 until it contains less than N sstables. That produces terrible write amplification, because any given byte will be compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially contained 256 ssts, there would be a WA of about 8. This terrible write amplification can be reduced by performing STCS instead on L0, which will leave L0 in a good shape without hurting WA as it happens now. Fixes #8345. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>	2021-03-24 17:48:50 +02:00
Piotr Sarna	24a43681b4	thrift: handle gate closed exception on retry During the retry mechanism, it's possible to encounter a gate closed exception, which should simply be ignored, because it indicates that the server is shutting down. Closes #8337	2021-03-24 17:41:58 +02:00
Konstantin Osipov	1a1d7ab662	raft: (testing) stray replies from removed followers	2021-03-24 14:05:55 +03:00
Konstantin Osipov	0295163f6f	raft: always return a non-zero configuration index from the log Return snapshot index for last configuration index if there is no configuration in the log.	2021-03-24 14:05:55 +03:00
Konstantin Osipov	cec59e53ef	raft: (testing) leader change during configuration change	2021-03-24 14:05:36 +03:00
Pavel Emelyanov	37bec6fb76	commitlog: Open files with append_is_unlikely This open option tells seastar that the file in question will be truncated to the needed size right at once and all the subsequent writes will happen within this size. This hint turns off append optimization in seastar that's not that cheap and helps to save a few cpu cycles. The option was introduced in seastar by 8bec57bc. tests: unit(dev), dtest(commitlog: test_batch_commitlog, test_periodic_commitlog, test_commitlog_replay_on_startup) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210323115409.31215-1-xemul@scylladb.com>	2021-03-24 13:05:33 +02:00
Konstantin Osipov	a203c8833f	raft: (testing) test confchange {ABCDE} -> {ABCDEFG}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	40e117d36e	raft: (testing) test confchange {ABCDEF} -> {ABCGH}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	14b2d5d308	raft: (testing) test confchange {ABC} -> {CDE} Test leader change during configuration change.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	3c718a175e	raft: (testing) test confchange {AB} -> {CD}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	2e30c8540e	raft: (testing) test confchange {A} -> {B} Test non-restart and leader restart scenario.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	e23da06fef	raft: (testing) test a server with empty configuration Try becoming a candidate for such server, or adding it to an existing configuration.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	b18599c630	raft: (testing) introduce testing utilities Add a discrete_failure_detector, to be able to mark a single server dead.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	8d26d24370	raft: (testing) simplify id allocation in test	2021-03-24 14:04:18 +03:00
Konstantin Osipov	322a15ec33	raft: (testing) add select_leader() helper With leader stepdown extension, leadership transfer can happen to any follower with long enough log. Add a helper to select that follower from a list.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	4a00da276d	raft: (testing) introduce communicate() helper Allow to communicate between arbitrary number of FSMs. Drop messages to FSMs which are not in the argument list. Stop communication upon predicate.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	7182323ac0	raft: (testing) style cleanup in raft_fsm_test 1) Avoid memory violations on test failure 2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs BOOST_CHECK)	2021-03-24 14:04:18 +03:00
Konstantin Osipov	f0f25bf7fb	raft: (testing) fix bug in election_threshold election_threshold was ticking one extra tick, causing the follower to become candidate in some cases. This was rendering tests unstable.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	00d7379bc9	raft: minor style changes & comments Add comments explaining the rationale from transfer_leadership() (more PhD quotes), encapsulate stable leader check in tick() into a lambda and add more detailed comments to it.	2021-03-24 14:04:18 +03:00
Piotr Sarna	06131e21a3	configure.py: add customizing clang inline threshold Until clang figures things out with the now infamous `-llvm -inline-threshold X` parameter, let's allow customizing it to make the compilation of release builds less tiresome. For instance, scylla's row_level.o object file currently does not compile for me until I decrease the inline threshold to a low value (e.g. 50). Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>	2021-03-24 12:09:26 +02:00
Tomasz Grabiec	9272e74e8c	sstable: writer: ka/la: Write row marker cell after row tombstone Row marker has a cell name which sorts after the row tombstone's start bound. The old code was writing the marker first, then the row tombstone, which is incorrect. This was harmless to our sstable reader, which recognized both as belonging to the current clustering row fragment, and collects both fine. However, if both atoms trigger creation of promoted index blocks, the writer will create a promoted index with entries which violate the cell name ordering. It's very unlikely to run into in practice, since to trigger promoted index entries for both atoms, the clustering key would be so large so that the size of the marker cell exceeds the desired promoted index block size, which is 64KB by default (but user-controlled via column_index_size_in_kb option). 64KB is also the limit on clustering key size accepted by the system. This was caught by one of our unit tests: sstable_conforms_to_mutation_source_test ...which runs a battery of mutation reader tests with various desired promoted index block sizes, including the target size of 1 byte, which triggers an entry for every atom. The test started to fail for some random seeds after commit `ecb6abe` inside the test_streamed_mutation_forwarding_is_consistent_with_slicing test case, reporting a mutation mismatch in the following line: assert_that(sliced_m).is_equal_to(fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key())); It compares mutations read from the same sstable using different methods, slicing using clustering key restrictions, and fast forwarding. The reported mismatch was that fwd_m contained the row marker, but sliced_m did not. The sstable does contain the marker, so both reads should return it. After reverting the commit which introduced dynamic adjustments, the test passes, but both mutations are missing the marker, both are wrong! 
They are wrong because the promoted index contains entries whose starting positions violate the ordering, so binary search gets confused and selects the row tombstone's position, which is emitted after the marker, thus skipping over the row marker. The explanation for why the test started to fail after dynamic adjustments is the following. The promoted index cursor works by incrementally parsing buffers fed by the file input stream. It first parses the whole block and then does a binary search within the parsed array. The entries which cursor touches during binary search depend on the size of the block read from the file. The commit which enabled dynamic adjustments causes the block size to be different for subsequent reads, which allows one of the reads to walk over the corrupted entries and read the correct data by selecting the entry corresponding to the row marker. Fixes #8324 Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>	2021-03-23 16:13:47 +01:00
Tomasz Grabiec	235154cca5	Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov Clustering rows are now stored in intrusive btree, cells are now stored in radix tree, but scylla-gdb tries to walk the intrusive_set and vector/set union respectively. For the former case -- the btree wrapper is introduced. For the latter -- compiler optimizes-away too many important bits and walking the tree turns into a bunch of hard-coded hacks and reinterpret-casts. Until a better solution is found, just print the address of the tree root. * xemul/br-gdb-btree-rows: gdb: Show address of the row::_cells tree (or "empty" mark) gdb: Add support for intrusive B tree gdb: Use helper to get rows from mutation_partition	2021-03-23 12:50:17 +01:00
Pavel Emelyanov	1cd9ec952f	gdb: Show address of the row::_cells tree (or "empty" mark) Currently clang optimizes-out lots of critical stuff from compact radix tree. Until we find out the way to walk the tree in gdb, it's better to at least show where it is in memory. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-23 13:29:40 +03:00
Pavel Emelyanov	5c85fcb3c9	gdb: Add support for intrusive B tree Rows inside partition are now stored in an intrusive B-tree, so here's the helper class that wraps this collection. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-23 12:54:44 +03:00
Pavel Emelyanov	ed38b18a84	gdb: Use helper to get rows from mutation_partition Preparation for the next patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-23 12:54:14 +03:00
Avi Kivity	3c292e31af	utils: utf8: fix validate_partial() on non-SIMD-optimized architectures validate_partial() is declared in the internal namespace, but defined outside it. This causes calls to validate_partial() to be ambiguous on architectures that haven't been SIMD-optimized yet (e.g. s390x). Fix by defining it in the internal namespace. Closes #8268	2021-03-23 09:21:14 +02:00
Avi Kivity	957259fab7	tools: toolchain: prepare: adjust manifest manipulations The manifest manipulation commands stopped working with podman 3; the containers-storage: prefix now throws errors. Switch to `buildah manifest`; since we're building with buildah, we might as well maintain the manifest with buildah as well. Closes #8231	2021-03-23 09:18:19 +02:00
Avi Kivity	4dae434f69	utils: crc: fix build with big-endian architectures and 1-byte objects crc has some code to reverse endianness on big-endian machines, but does not handle the case of a 1-byte object (which doesn't need any adjustment). This causes clang to complain that the switch statement doesn't handle that case. Fix by adding a no-op case. Closes #8269	2021-03-23 09:16:20 +02:00
Konstantin Osipov	ce29fb44c3	raft: do not assert when transitioning to empty config Throw instead, to make this case testable.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	2ee15ad6c7	raft: assert we never apply a snapshot over uncommitted entries (leader)	2021-03-22 18:55:40 +03:00
Konstantin Osipov	c7f7ad2c4e	raft: improve tracing Add tracing to apply_snapshot, request_vote.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	4dd66edae5	raft: add fsm_output::empty() helper to aid testing Used in testing to implement trivial transport.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	89349f550c	raft: aid testing by providing fsm::id()	2021-03-22 18:55:40 +03:00
Botond Dénes	742a33730a	scylla-gdb.py: dereference_smart_ptr(): add support for seastar::smart_ptr Although a seastar::smart_ptr is trivial to dereference manually, so is adding support for it to dereference_smart_ptr(), avoiding the annoying (but brief) detour which is currently needed. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>	2021-03-22 17:30:35 +02:00
Piotr Sarna	b774d69ad2	docs: mention disabling Thrift by default Thrift is no longer enabled by default, so the documentation should mention that, as well as the suggested way of enabling it if necessary.	2021-03-22 14:32:51 +01:00
Raphael S. Carvalho	c86dd125a1	sstables: clean up partitioned_sstable_set::insert() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210322130227.16805-2-raphaelsc@scylladb.com>	2021-03-22 15:30:32 +02:00
Raphael S. Carvalho	48d8cc261e	sstables: don't swallow exception in partitioned_sstable_set::insert() regression introduced by `02b2df1ea9` (Fri Mar 12 01:22:41 2021 -0300). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210322130227.16805-1-raphaelsc@scylladb.com>	2021-03-22 15:30:31 +02:00
Avi Kivity	50dda795e9	Update seastar submodule * seastar 83339edb04...48376c76a1 (2): > iotune: Warn user about write-back cache mode > reactor: add --kernel-page-cache option to disable O_DIRECT	2021-03-22 13:33:08 +02:00
Avi Kivity	74df67776b	bytes_ostream: convert write_placeholder from enable_if to concepts Concepts are easier to read and result in better error messages. This change also tightens the constraint from "std::is_fundamental" to "std::integral". The differences are floating point values, nullptr_t, and void. The latter two are illegal/useless to write, and nobody uses floating point values for list lengths, so everything still compiles. Closes #8326	2021-03-22 12:00:07 +01:00
Piotr Sarna	e2443337d9	db,config: disable Thrift by default It will still be possible to use Thrift once it's enabled in the yaml file, but it's better to not open this port by default, since Thrift is definitely not the first choice for Scylla users. Fixes #8336	2021-03-22 10:54:26 +01:00
Piotr Sarna	23057dd186	Merge 'Implement RAFT's leader stepdown extension' from Gleb This series implements leader stepdown extension. See patch 4 for justification for its existence. First three patches either implement cleanups to existing code that future patch will touch or fix bugs that need to be fixed in order for stepdown test to work. * 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev: raft: add test for leader stepdown raft: introduce leader stepdown procedure raft: fix replication when leader is not part of current config raft: do not update last election time if current leader is not a part of current configuration raft: move log limiting semaphore into the leader state	2021-03-22 09:45:19 +01:00
Avi Kivity	3c44445c07	Merge "Introduce off-strategy compaction for repair-based bootstrap and replace" from Raphael " Scylla suffers from aggressive compaction after a repair-based operation has been initiated. That translates into bad latency and slowness for the operation itself. This aggressiveness comes from the fact that: 1) new sstables are immediately added to the compaction backlog, thus reducing bandwidth available for the operation. 2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant. To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set. The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint. NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series. Refs #5226. " * 'offstrategy_v7' of github.com:raphaelsc/scylla: tests: Add unit test for off-strategy sstable compaction table: Wire up off-strategy compaction on repair-based bootstrap and replace table: extend add_sstable_and_update_cache() for off-strategy sstables/compaction_manager: Add function to submit off-strategy work table: Introduce off-strategy compaction on maintenance sstable set table: change build_new_sstable_list() to accept other sstable sets table: change non_staging_sstables() to filter out off-strategy sstables table: Introduce maintenance sstable set table: Wire compound sstable set table: prepare make_reader_excluding_sstables() to work with compound sstable set table: prepare discard_sstables() to work with compound sstable set table: extract add_sstable() common code into a function sstable_set: Introduce compound sstable set reshape: STCS: preserve token contiguity when reshaping disjoint sstables	2021-03-22 10:43:13 +02:00
Gleb Natapov	272cb1c1e6	raft: add test for leader stepdown	2021-03-22 10:31:16 +02:00
Gleb Natapov	9d6bf7f351	raft: introduce leader stepdown procedure Section 3.10 of the PhD describes two cases for which the extension can be helpful: 1. Sometimes the leader must step down. For example, it may need to reboot for maintenance, or it may be removed from the cluster. When it steps down, the cluster will be idle for an election timeout until another server times out and wins an election. This brief unavailability can be avoided by having the leader transfer its leadership to another server before it steps down. 2. In some cases, one or more servers may be more suitable to lead the cluster than others. For example, a server with high load would not make a good leader, or in a WAN deployment, servers in a primary datacenter may be preferred in order to minimize the latency between clients and the leader. Other consensus algorithms may be able to accommodate these preferences during leader election, but Raft needs a server with a sufficiently up-to-date log to become leader, which might not be the most preferred one. Instead, a leader in Raft can periodically check to see whether one of its available followers would be more suitable, and if so, transfer its leadership to that server. (If only human leaders were so graceful.) The patch here implements the extension and employs it automatically when a leader removes itself from a cluster.	2021-03-22 10:28:43 +02:00
Gleb Natapov	888b52dea1	raft: fix replication when leader is not part of current config When a leader orchestrates its own removal from a cluster there is a situation where the leader is still responsible for replication, but it is no longer part of active configuration. Current code skips replication in this case though. Fix it by always replicating in the leader state.	2021-03-22 09:52:17 +02:00
Gleb Natapov	1acc8996bc	raft: do not update last election time if current leader is not a part of current configuration Since we use an external failure detector instead of relying on empty AppendRequests from a leader, there can be a situation where a node is no longer part of a certain raft group but is still alive (and may also be part of other raft groups). In such a case the last election time should not be updated even if the node is alive. It is the same as if the node had stopped sending empty AppendRequests in original raft.	2021-03-22 09:52:17 +02:00
Gleb Natapov	ccf4435759	raft: move log limiting semaphore into the leader state Log limiting semaphore is used on a leader only, so it should be stored inside the leader state.	2021-03-22 09:52:17 +02:00
Takuya ASADA	35a14ab22b	configure.py: drop compat-python3 targets Since we switched the scylla-python3 build directory to tools/python3/build on Jenkins, we no longer need the compat-python3 targets, so drop them. Related scylladb/scylla-pkg#1554 Closes #8328	2021-03-21 18:04:27 +02:00
Benny Halevy	f562c9c2f3	test: sstable_datafile_test: tombstone_purge_test: use a longer ttl As seen in next-3319 unit testing on jenkins The cell ttl may expire during the test (presuming that the test machine was overloaded), leading to: ``` INFO 2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacting [/jenkins/workspace/scylla-master/next/scylla/testlog/release/scylla-af8644ec-7f07-4ffe-80bf-6703a942e435/la-17-big-Data.db:level=0:origin=, ] INFO 2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacted 1 sstables to []. 4kB to 0 bytes (~0% of original) in 0ms = 0 bytes/s. ~128 total partitions merged to 0. ./test/lib/mutation_assertions.hh(108): fatal error: in "tombstone_purge_test": Mutations differ, expected {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: { rows: [ { cont: true, dummy: false, position: { bound_weight: 0, }, 'value': { atomic_cell{1,ts=1616313953,expiry=1616313958,ttl=5} }, }, ] } } ...but got: {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: { rows: [ { cont: true, dummy: false, position: { bound_weight: 0, }, 'value': { atomic_cell{DEAD,ts=1616313953,deletion_time=1616313953} }, }, ] } } ``` This corresponds to: ``` 2395 auto mut2 = make_expiring(alpha, ttl); 2396 auto mut3 = make_insert(beta); ... 2399 auto sst2 = make_sstable_containing(sst_gen, {mut2, mut3}); ``` Extend (logical) ttl to 10 seconds to reduce flakiness due to real-time timing. Test: sstable_datafile_test(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210321142931.1226850-1-bhalevy@scylladb.com>	2021-03-21 16:42:00 +02:00
Avi Kivity	1e820687eb	Merge "reader_concurrency_semaphore: limit non-admitted inactive reads" from Botond " Due to bad interaction of recent changes (`913d970` and `4c8ab10`) inactive readers that are not admitted have managed to completely fly under the radar, avoiding any sort of limitation. The reason is that pre-admission the permits don't forward their resource cost to the semaphore, to prevent them possibly blocking their own admission later. However this meant that if such a reader is registered as inactive, it completely avoids the normal resource based eviction mechanism and can accumulate without bounds. The real solution to this is to move the semaphore before the cache and make all reads pass admission before they get started (#4758). Although work has been started towards this, it is still a while until it lands. In the meanwhile this patchset provides a workaround in the form of a new inactive state, which -- like admitted -- causes the permit to forward its cost to the semaphore, making sure these un-admitted inactive reads are accounted for and evicted if there are too many of them. Fixes: #8258 Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number) " * 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla: test: mutation_reader_test: add test for permit cleanup test: querier_cache_test: add memory based cache eviction test reader_permit: add inactive state querier: insert(): account immediately evicted querier as resource based eviction reader_concurrency_semaphore: fix clear_inactive_reads() reader_concurrency_semaphore: make inactive_read_handle a weak reference reader_concurrency_semaphore: make evict() noexcept reader_concurrency_semaphore: update out-of-date comments	2021-03-21 16:24:54 +02:00
Nadav Har'El	ab75226626	test/cql-pytest: remove xfail from passing test After commit `0bd201d3ca` ("cql3: Skip indexed column for CK restrictions") fixed issue #7888, the test cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering began passing, as expected. So we can remove its "xfail" label. Refs #7888. cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>	2021-03-21 16:02:30 +02:00
Avi Kivity	e2cd551880	Update seastar submodule * seastar ea5e529f30...83339edb04 (21): > cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc) > Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov Fixes #8327. > file: Add option to refuse the append-challenged file > Merge "Teach io-tester to work on block device" from Pavel E > Merge "Cleanup files code" from Pavel E > install-dependencies: Support rhel-8.3 > install-dependencies: Add some missing rh packages > file, reactor: reinstate RWF_NOWAIT support > file: Prevent fsxattr.fsx_extsize from overflow > cmake: enable clang's -Wno-error=#warnings if supported > cmake: harden seastar_supports_flag aginst inputs with spaces or # > cmake: fix seastar_supports_flag failing after first invocation > thread: Stop backtraces in main() on s390x architecture > intent: Explicitly declare constructors for references > test: file_io_test: parallel_overwrite: use testing::local_random_engine > util: log-impl: rework log_buf::inserter_iterator > rwlock: pass timeout parameter to get_units > concepts: require lib support to enable concepts > rpc: print more info on bad protocol magic > seastar-addr2line: strip input line to restore multiline support > log: skip on unknown nested mixing instead of stopping the logging Ref #8327.	2021-03-21 15:58:10 +02:00
Nadav Har'El	10bf2ba60a	cql-pytest: translate Cassandra's reproducers for issue #2962 This is a translation of Cassandra's CQL unit test source file validation/entities/SecondaryIndexOnMapEntriesTest.java into our cql-pytest framework. This test file checks various features of indexing (with secondary index) individual entries of maps. All these tests pass on Cassandra, but fail on Scylla because of issue #2962 - we do not yet support indexing of the content of unfrozen collections. The failing tests currently fail as soon as they try to create the index, with the message: "Cannot create secondary index on non-frozen collection or UDT column v". Refs #2962. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>	2021-03-21 12:30:00 +02:00
Avi Kivity	75da8a8d81	Merge 'Fix the retry mechanism in Thrift frontend' from Piotr Sarna Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything. The mechanism is now cleaned up, and a simple exponential backoff replaced previous constant backoff policy. Fixes #8317 Tests: unit(dev), manual(see #8317 for a simple reproducer) Closes #8318 * github.com:scylladb/scylla: thrift: add exponential backoff for retries thrift: fix and simplify retry logic	2021-03-21 12:26:13 +02:00
Avi Kivity	a78f43b071	Merge 'tracing: fast slow query tracing' from Ivan Prisyazhnyy The set of patches introduces a new tracing mode - `fast slow query tracing`. In this mode, Scylla tracks only tracing sessions and omits all tracing events if the tracing context does not have a `full_tracing` state set. Fixes #2572 Motivation --- We want to run production systems with that option always enabled so we could always catch slow queries without an overhead. The next step is we are gonna optimize further the costs of having tracing enabled to minimize session context handling overhead to allow it to be as transparent for the end-user as possible. Fast tracing mode --- To read the status do $ curl -v http://localhost:10000/storage_service/slow_query To enable fast slow-query tracing $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true Potential optimizations --- - remove tracing::begin(lazy_eval) - replace tracing::begin(string) for enum to remove copying and memory allocations - merge parameters allocations - group parameters check for trace context - delay formatting - reuse prepared statement shared_ptr instead of both copying it and copying its query Performance --- 100% cache hits --- 1 Core: ``` $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1 ./cassandra-stress write n=100000 no-warmup -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=1 -mode native cql3 curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=false for i in $(seq 5); do taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose 
-rate threads=4 throttle=30000/s -mode native cql3 done curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true for i in $(seq 5); do taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3 done curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=true for i in $(seq 5); do taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3 done ``` \| qps \| \| \| -- \| -- \| -- \| -- \| -- \| baseline \| fast, slow \| nofast, slow \| %[1-fastslow/baseline] \| 29,018 \| 26,468 \| 23,591 \| 8.79% \| 28,909 \| 26,274 \| 23,584 \| 9.11% \| 28,900 \| 26,547 \| 23,598 \| 8.14% \| 28,921 \| 26,669 \| 23,596 \| 7.79% \| 28,821 \| 26,385 \| 23,601 \| 8.45% stdev \| 70.24030182 \| 150.9678774 \| 6.670832032 \| avg \| 28,914 \| 26,469 \| 23,594 \| stderr \| 0.24% \| 0.57% \| 0.03% \| %[avg/baseline] \| \| 8.46% \| 18.40% \| 8.46% performance degradation in `fast slow query mode` for pure in-memory workload with minimum traces. 18.40% performance degradation in `original slow query mode` for pure in-memory workload with minimum traces. 
0% cache hits --- 1GB memory, 1 Core: $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --memory 1G --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1 2.4GB, 10000000 keys data: $ ./cassandra-stress write n=10000000 no-warmup -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 -mode native cql3 $ curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true CASSANDRA_STRESS prepared statements with BYPASS CACHE $ taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3 20000 reads IOPS, 100MB/s from disk \| qps \| \| \| -- \| -- \| -- \| -- \| -- \| baseline reads \| fast, slow reads \| %[1-fastslow/baseline] \| \| 9,575 \| 9,054 \| 5.44% \| \| 9,614 \| 9,065 \| 5.71% \| \| 9,610 \| 9,066 \| 5.66% \| \| 9,611 \| 9,062 \| 5.71% \| \| 9,614 \| 9,073 \| 5.63% \| stdev \| 16.75410397 \| 6.892024376 \| avg \| 9,605 \| 9,064 \| stderr \| 0.17% \| 0.08% \| %[avg/baseline] \| \| 5.63% \| 5.63% performance degradation in `fast slow query mode` for pure on-disk workload with minimum traces. Closes #8314 * github.com:scylladb/scylla: tracing: fast mode unit test tracing: rest api for lightweight slow query tracing tracing: omit tracing session events and subsessions in fast mode	2021-03-21 12:15:17 +02:00
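The degradation percentages quoted in the tracing merge above follow directly from the averaged qps figures in its tables. A minimal Python sketch of that arithmetic, for readers who want to check the numbers; the helper name `degradation_pct` is illustrative and not part of Scylla or cassandra-stress:

```python
def degradation_pct(baseline_qps: float, traced_qps: float) -> float:
    """Relative throughput loss versus the baseline, in percent."""
    return (1.0 - traced_qps / baseline_qps) * 100.0


# Average qps values copied from the tables in the commit message.
in_memory = {"baseline": 28914, "fast": 26469, "nofast": 23594}
on_disk = {"baseline": 9605, "fast": 9064}

if __name__ == "__main__":
    # 100% cache hits: fast mode versus the original slow query mode.
    print(f"{degradation_pct(in_memory['baseline'], in_memory['fast']):.2f}%")
    print(f"{degradation_pct(in_memory['baseline'], in_memory['nofast']):.2f}%")
    # 0% cache hits: fast mode only.
    print(f"{degradation_pct(on_disk['baseline'], on_disk['fast']):.2f}%")
```

The same formula applied per row gives the `%[1-fastslow/baseline]` column, for example 29,018 versus 26,468 qps for the first in-memory run.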
Dejan Mircevski	318f773d81	types: Unreverse tuple subtype for serialization When a tuple value is serialized, we go through every element type and use it to serialize element values. But an element type can be reversed, which is artificially different from the type of the value being read. This results in a server error due to the type mismatch. Fix it by unreversing the element type prior to comparing it to the value type. Fixes #7902 Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8316	2021-03-21 12:07:29 +02:00
Dejan Mircevski	0bd201d3ca	cql3: Skip indexed column for CK restrictions When querying an index table, we assemble clustering-column restrictions for that query by going over the base table token, partition columns, and clustering columns. But if one of those columns is the indexed column, there is a problem; the indexed column is the index table's partition key, not clustering key. We end up with invalid clustering slice, which can cause problems downstream. Fix this by skipping the indexed column when assembling the clustering restrictions. Tests: unit (dev) Fixes #7888 Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8320	2021-03-21 09:52:06 +02:00
Avi Kivity	58b7f225ab	keys: convert trichotomic comparators to return std::strong_ordering A trichotomic comparator returning an int can easily be mistaken for a less comparator as the return types are convertible. Use the new std::strong_ordering instead. A caller in cql3's update_parameters.hh is also converted, following the path of least resistance. Ref #1449. Test: unit (dev) Closes #8323	2021-03-21 09:30:43 +02:00
Avi Kivity	29a5047982	utils: error_injection: convert enable_if to concepts Constrain inject() with a requires clause rather than enable_if, simplifying the code and compiler diagnostics. Note that the second instance could not have been called, since the template argument does not appear in the function parameter list and thus could not be deduced. This is corrected here. Closes #8322	2021-03-21 09:28:23 +02:00
Avi Kivity	c28d67dd7f	types: time_point_to_string: convert enable_if to concepts time_point_to_string ensures its input is a time_point with millisecond resolution (though it neglects to verify the epoch is what it expects). Change the test from a clunky enable_if to a nicer concept. Closes #8321	2021-03-21 09:11:40 +02:00
Tomasz Grabiec	88a019ba21	Merge "raft: respond with snapshot_reply to send_snapshot RPC" from Kostja Currently send_snapshot is the only two-way RPC used by Raft. However, the sender (the leader) does not look at the receiver's reply, other than checking that it is not an error. This has the following issues: - if the follower has a newer term and rejects the snapshot for that reason, the leader will not learn about a newer follower term and will not step down - the send_snapshot message doesn't pass through a single-endpoint fsm::step() and thus may not follow the general Raft rules which apply for all messages. - making a general purpose transport that simply calls fsm::step() for every message becomes impossible. Fix it by actually responding with snapshot_reply to send_snapshot RPC, generating this reply in fsm::step() on the follower, and feeding into fsm::step() on the leader. * scylla-dev/raft-send-snapshot-v2: raft: pass snapshot_reply into fsm::step() raft: respond with snapshot_reply to send_snapshot RPC raft: set follower's next_idx when switching to SNAPSHOT mode raft: set the current leader upon getting InstallSnapshot	2021-03-19 18:13:40 +01:00
Piotr Sarna	31d3854bb7	thrift: add exponential backoff for retries The original backoff mechanism which just retries after 1ms may still lead to rapid resource depletion. Instead, an exponential backoff is used, with a cap of ~2s. Tests: manual, with cassandra-stress and browsing logs	2021-03-19 13:16:39 +01:00
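The capped exponential backoff described in the thrift commit above can be sketched in a few lines. This is an illustration only, not the Thrift frontend's code: the 1ms starting delay and the ~2s cap come from the commit message, while the doubling factor and the name `backoff_delays` are assumptions.

```python
from typing import Iterator


def backoff_delays(base: float = 0.001, cap: float = 2.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays in seconds: base, base*factor, ... never above cap.

    The doubling factor is an assumption; the commit only states that the
    backoff is exponential and capped at roughly two seconds.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


if __name__ == "__main__":
    gen = backoff_delays()
    # The first delays grow geometrically until they reach the cap.
    print([next(gen) for _ in range(13)])
```

With these assumed parameters the delay reaches the cap after eleven doublings, so a persistently failing retry settles at one attempt every two seconds instead of one per millisecond.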
Piotr Sarna	f81044d75d	thrift: fix and simplify retry logic The retry logic for Thrift frontend had two bugs: 1. Due to missing break in a switch statement, two retry calls were always performed instead of one, which acts a little bit like a Seastar forkbomb 2. The delayed action was not guarded with any gate, so it was theoretically possible to access a captured `this` pointer of an object which already got deallocated. In order to fix the above, the logic is simplified to always retry with backoff - it makes very little sense to skip the backoff and immediate retries are not needed by anyone, while they cause severe overload risk. Tests: manual - a simple cassandra-stress invocation was able to crash scylla with a segfault: $ cassandra-stress write -mode thrift -rate threads=2000 Fixes #8317	2021-03-19 13:15:35 +01:00
Nadav Har'El	abab1d906c	Merge 'sstables: convert enable_if to equivalent concepts' from Avi Kivity enable_if is hard to understand, especially its error messages. Convert enable_if in sstable code to concepts. A new concept is introduced, self_describing, for the case of a type that follows the obj.describe_type() protocol. Otherwise this is quite straightforward. Closes #8315 * github.com:scylladb/scylla: sstables: vector write: convert to concepts sstables: check_truncated_and_assign: convert to concept sstables: convert write() to concepts sstables: convert write_vint() to concepts sstables: vector parse(): convert to concept sstables: convert parse() for a self-describing type to concept sstables: read_vint(): convert enable_if to concepts sstables: add concept for self-describing type	2021-03-18 23:09:34 +02:00
Raphael S. Carvalho	64d78eae6a	tests: Add unit test for off-strategy sstable compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 16:56:00 -03:00
Avi Kivity	bf0c7d1340	sstables: vector write: convert to concepts We have an integral and a non-integral overload, each constrained with enable_if. We use std::integral to constrain the integral overload and leave the other unconstrained, as C++ will choose the more constrained version when applicable.	2021-03-18 19:26:54 +02:00
Avi Kivity	11636563d9	sstables: check_truncated_and_assign: convert to concept Use std::integral instead of static_assert to reject non-integral parameters.	2021-03-18 19:26:54 +02:00
Avi Kivity	42e3f33722	sstables: convert write() to concepts There are three variants: integral, enum, and self-describing (currently expressed as not integral and not enum). Convert to concepts by using the standard concepts or the new self_describing concept.	2021-03-18 19:26:43 +02:00
Avi Kivity	4832041857	sstables: convert write_vint() to concepts Instead of a maze of deleted functions, enable_if, and static_assert, use the standard std::integral concept.	2021-03-18 19:24:42 +02:00
Nadav Har'El	0b2cf21932	alternator-test: increase read timeout and avoid retries By default the boto3 library waits up to 60 seconds for a response, and if it gets no response, it sends the same request again, multiple times. We already noticed in the past that it retries too many times thus slowing down failures, so in our test configuration we lowered the number of retries to 3, but the setting of 60-second-timeout plus 3 retries still causes two problems: 1. When the test machine and the build are extremely slow, and the operation is long (usually, CreateTable or DeleteTable involving multiple views), the 60 second timeout might not be enough. 2. If the timeout is reached, boto3 silently retries the same operation. This retry may fail because the previous one really succeeded at least partially! The symptom is tests which report an error when creating a table which already exists, or deleting a table which doesn't exist. The solution in this patch is first of all to never do retries - if a query fails on internal server error, or times out, just report this failure immediately. We don't expect to see transient errors during local tests, so this is exactly the right behavior. The second thing we do is to increase the default timeout. If 1 minute was not enough, let's raise it to 5 minutes. 5 minutes should be enough for every operation (famous last words...). Even if 5 minutes is not enough for something, at least we'll now see the timeout errors instead of some weird errors caused by retrying an operation which was already almost done. Fixes #8135 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>	2021-03-18 18:58:08 +02:00
Avi Kivity	777d48e78d	sstables: vector parse(): convert to concept The two vector parse() overloads select between integral members and non-integral members. Use std::integral to constrain the integral overload and leave the other unconstrained; C++ will choose the more constrained version when it applies.	2021-03-18 18:48:11 +02:00
Avi Kivity	bc42aee7c1	sstables: convert parse() for a self-describing type to concept This parse() overload uses "not integral and not enum" to reject non-self-describing types. Express it directly with the self_describing concept instead.	2021-03-18 18:47:00 +02:00
Avi Kivity	a96b8e8aed	sstables: read_vint(): convert enable_if to concepts Convert read_vint() to a concept. The explicitly deleted version is no longer needed since wrongly-typed inputs will be rejected by the constraint. Similarly the static assert can be dropped for the same reason.	2021-03-18 18:45:05 +02:00
Avi Kivity	bba9c1c616	sstables: add concept for self-describing type Our sstable parsing and writing code contains a self-describing type concept, where a type can advertise its members via a describe_type() member function with a specific protocol. Formalize that into a C++ concept. This is a little tricky, since describe_type() accepts a parameter that is itself a template, and requires clauses only work with concrete types. To handle this problem, create such a concrete example type and use it in the concept.	2021-03-18 17:52:54 +02:00
Botond Dénes	7980140549	test: test_utils: do_check()/do_require(): tone down log to trace They are way too noisy to be at debug level. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210318143547.101932-1-bdenes@scylladb.com>	2021-03-18 16:59:59 +02:00
Raphael S. Carvalho	65b09567dd	table: Wire up off-strategy compaction on repair-based bootstrap and replace Now, sstables created by bootstrap and replace will be added to the maintenance set, and once the operation completes, off-strategy compaction will be started. We wait until the end of operation to trigger off-strategy, as reshaping can be more efficient if we wait for all sstables before deciding what to compact. Also, waiting for completion is no longer an issue because we're able to read from new sstables using partitioned_sstable_set and their existence isn't accounted for by the compaction backlog tracker yet. Refs #5226. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	c45d2e1d27	table: extend add_sstable_and_update_cache() for off-strategy Function is extended to add sstable to maintenance set if requested by the caller. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	6ca2ac34ac	sstables/compaction_manager: Add function to submit off-strategy work This new variant will allow its caller to submit off-strategy job asynchronously on behalf of a given table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	e0e5bf8285	table: Introduce off-strategy compaction on maintenance sstable set Off-strategy compaction is about incrementally reshaping the off-strategy sstables in the maintenance set, using our existing reshape mechanism, until the set is ready for integration into the main sstable set. The whole operation is done in maintenance mode, using the streaming scheduling group. We can do it this way because data in the maintenance set is disjoint, so effects on read amplification are avoided by using partitioned_sstable_set, which is able to efficiently and incrementally retrieve data from disjoint sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	439e9b6fab	table: change build_new_sstable_list() to accept other sstable sets Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	6e95860e09	table: change non_staging_sstables() to filter out off-strategy sstables SSTables that are off-strategy should be excluded by this function as it's used to select candidates for regular compaction. So in addition to only returning candidates from the main set, let's also rename it to precisely reflect its behavior. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:49 -03:00
Raphael S. Carvalho	c64a156c53	table: Introduce maintenance sstable set This new sstable set will hold sstables created by repair-based operations. A repair-based op creates 1 sstable per vrange (256), so sstables added to this new set are disjoint, therefore they can be efficiently read from using partitioned_sstable_set. Compound set is changed to include this new set, so sstables in this new set are automatically included when creating readers, computing statistics, and so on. This new set is not backlog tracked, so changes were needed to prevent a sstable in this set from being added or removed from the tracker. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:47:47 -03:00
Raphael S. Carvalho	1e7a444a8b	table: Wire compound sstable set From now on, _sstables becomes the compound set, and _main_sstables refers only to the main sstables of the table. In the near future, maintenance set will be introduced and will also be managed by the compound set. So add_sstable() and on_compaction_completion() are changed to explicitly insert and remove sstables from the main set. By storing compound set in _sstables, functions which used _sstables for creating reader, computing statistics, etc, will not have to be changed when we introduce the maintenance set, so the amount of code change is greatly reduced by this approach. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:46:06 -03:00
Raphael S. Carvalho	42b309b43e	table: prepare make_reader_excluding_sstables() to work with compound sstable set Compound set will not be inserted or erased directly, so let's change this function to build a new set from scratch instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:42:50 -03:00
Raphael S. Carvalho	4e142458eb	table: prepare discard_sstables() to work with compound sstable set After compound set, discard_sstables() will have to prune each set individually and later refresh the compound set. So let's change the function to support multiple sstable sets, taking into account that a sstable set may not want to be backlog tracked. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:42:50 -03:00
Raphael S. Carvalho	d25822a030	table: extract add_sstable() common code into a function The purpose is to allow the code to be eventually reused by maintenance sstable set, which will be soon introduced. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:42:50 -03:00
Raphael S. Carvalho	e4b5f5ba33	sstable_set: Introduce compound sstable set This new sstable set implementation is useful for combining operation of multiple sstable sets, which can still be referenced individually via its shared ptr reference. It will be used when maintenance set is introduced in table, so a compound set is required to allow both sets to have their operations efficiently combined. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:42:49 -03:00
Raphael S. Carvalho	1261519266	reshape: STCS: preserve token contiguity when reshaping disjoint sstables When reshaping hundreds of disjoint sstables, like on bootstrap, contiguity wasn't being preserved because the heuristic for picking candidates didn't take into account their token range, which resulted in reshape messing with the contiguity that could otherwise be preserved by respecting the token order of the disjoint sstables. In other words, sstables with the smallest first tokens should be compacted first. By doing that, the contiguity is preserved even across size tiers, after reshape has completed its possible multiple rounds to get all the data in shape. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-18 11:36:18 -03:00
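The idea in the reshape commit above, picking candidates in token order so that disjoint sstables stay contiguous, can be illustrated with a toy model. This is a sketch under stated assumptions, not Scylla's reshape code: the `SSTable` record, the `pick_reshape_candidates` helper and the `max_batch` parameter are all hypothetical, and size tiers are ignored.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SSTable:
    """Toy stand-in for an sstable covering [first_token, last_token]."""
    name: str
    first_token: int
    last_token: int


def pick_reshape_candidates(sstables: Sequence[SSTable],
                            max_batch: int) -> List[SSTable]:
    """Pick up to max_batch sstables, smallest first token first.

    Compacting neighbours in token order keeps the output covering one
    contiguous token range, so later rounds still see disjoint inputs.
    """
    ordered = sorted(sstables, key=lambda s: s.first_token)
    return ordered[:max_batch]


if __name__ == "__main__":
    tables = [
        SSTable("c", 200, 299),
        SSTable("a", 0, 99),
        SSTable("d", 300, 399),
        SSTable("b", 100, 199),
    ]
    print([s.name for s in pick_reshape_candidates(tables, max_batch=2)])
```

Picking by arrival order or by size alone could instead merge, say, the first and last ranges, producing an output that overlaps every sstable in between.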
Botond Dénes	ad02f313dd	test: mutation_reader_test: add test for permit cleanup Check that a permit correctly restores the units on the semaphore in each state it can be destroyed in.	2021-03-18 16:18:22 +02:00
Raphael S. Carvalho	e53cedabb1	LCS: reshape: tolerate more sstables in level 0 with relaxed mode Relaxed mode, used during initialization, of reshape only tolerates min_threshold (default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in level 0, otherwise boot will have to reshape level 0 every time it crosses the min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32. This change is beneficial because once table is populated, LCS regular compaction can decide to merge those sstables in level 0 into level 1 instead, therefore reducing WA. Refs #8297. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>	2021-03-18 15:58:21 +02:00
Botond Dénes	2b7c1bce86	scylla-gdb.py: add variant_member convenience function Allow conveniently accessing the active member of an `std::variant` instance. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210318134427.92668-1-bdenes@scylladb.com>	2021-03-18 15:57:51 +02:00
Konstantin Osipov	fcc6e621f8	raft: pass snapshot_reply into fsm::step() By the time we receive snapshot_reply from a follower we may no longer be the leader. Follower term may be different from snapshot term, e.g. the follower may be aware of a new leader already and have a higher term. We should pass this information into (possibly ex-) leader FSM via fsm::step() so that it can correctly change its state, and not call FSM directly.	2021-03-18 16:56:46 +03:00
Konstantin Osipov	4afa662d62	raft: respond with snapshot_reply to send_snapshot RPC Raft send_snapshot RPC is actually two-way, the follower responds with snapshot_reply message. This message until now was, however, muted by RPC. Do not mute snapshot_reply any more: - to make it obvious the RPC is two way - to feed the follower response directly into leader's FSM and thus ensure that FSM testing results produced when using a test transport are representative of the real world uses of raft::rpc.	2021-03-18 16:56:42 +03:00
Konstantin Osipov	cb3314d756	raft: set follower's next_idx when switching to SNAPSHOT mode Set follower's next_idx to snapshot index + 1 when switching it to snapshot mode. If snapshot transfer succeeds, that's the best match for the follower's next replication index. If it fails, the leader will send a new probe to find out the follower position again and re-try sending a possibly newer snapshot. The change helps reduce protocol state managed outside FSM.	2021-03-18 16:35:11 +03:00
Ivan Prisyazhnyy	f00391af8b	tracing: fast mode unit test Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>	2021-03-18 15:05:09 +02:00
Ivan Prisyazhnyy	7cbe2aa9c6	tracing: rest api for lightweight slow query tracing The patch adds REST API support for the lightweight slow query tracing (fast) mode that is implemented by omitting all of the trace events during the tracing. $ curl -v http://localhost:10000/storage_service/slow_query $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>	2021-03-18 15:05:05 +02:00
Ivan Prisyazhnyy	85fbca2049	tracing: omit tracing session events and subsessions in fast mode If tracing::tracing::_ignore_trace_events is enabled then the tracing system must ignore all sessions events for non full_tracing sessions (probability tracing and user requested) and creating subsessions with the make_trace_info. Patch introduces the slow query tracing fast mode that omits all events during tracing. Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>	2021-03-18 15:04:47 +02:00
Botond Dénes	c822f0d02a	test: querier_cache_test: add memory based cache eviction test Ensure that the memory consumption of querier cache entries is kept under the limit.	2021-03-18 14:58:21 +02:00
Botond Dénes	a14bb4ba94	reader_permit: add inactive state This state will be used for permits that are not in admitted state when registered as inactive. We can have such reads if a read can be served entirely from cache/memtables and it doesn't have to go to disk and hence doesn't go through admission. These permits currently don't forward their cost to the semaphore, so they won't prevent their own admission and create a deadlock. However, when in inactive state, we do want to keep tabs on their resource consumption so we don't accumulate too many of these inactive reads. So introduce a new state for these non-admitted inactive reads. When entering the inactive state, the permit registers its cost with the semaphore, and when unregistered as inactive, it retracts it. This is a workaround (khm hack) until #4758 is solved and all permits will be admitted on creation.	2021-03-18 14:58:21 +02:00
Botond Dénes	594636ebbf	querier: insert(): account immediately evicted querier as resource based eviction `reader_concurrency_semaphore::register_inactive_read()` drops the registered inactive read immediately if there is a resource shortage. This is in effect a resource based eviction, so account it as such in `querier::insert()`.	2021-03-18 14:57:57 +02:00
Botond Dénes	1a337d0ec1	reader_concurrency_semaphore: fix clear_inactive_reads() Broken by the move to an intrusive container (`9cbbf40`), which caused said method to only clear the container but not destroy the inactive reads contained therein. This patch restores the previous behaviour and also adds a call to it in the destructor (to ensure inactive reads are cleaned up under any circumstances), as well as a unit test.	2021-03-18 14:57:57 +02:00
Botond Dénes	581edc4e4e	reader_concurrency_semaphore: make inactive_read_handle a weak reference Having the handle keep an owning reference to the inactive read led to awkward situations, where the inactive read is destroyed during eviction in certain situations only (querier cache) and not in other cases. Although the users didn't notice anything from this, it led to very brittle code inside the reader concurrency semaphore. Among others, the inactive read destructor has to be open coded in evict(), which already led to mistakes. This patch goes back to the weak pointer paradigm used a while ago, which is a much more natural fit for this. Inactive reads are still kept in an intrusive list in the semaphore but the handle now keeps a weak pointer to them. When destroyed, the handle will destroy the inactive read if it is still alive. When evicting the inactive read, it will set the pointer in the handle to null.	2021-03-18 14:57:57 +02:00
Botond Dénes	cbc83b8b1b	reader_concurrency_semaphore: make evict() noexcept In the next patch it will be called from a destructor.	2021-03-18 14:57:57 +02:00
Botond Dénes	2d348e0211	reader_concurrency_semaphore: update out-of-date comments	2021-03-18 14:57:57 +02:00
Botond Dénes	3b8220f777	scylla-gdb.py: update w.r.t. storage_proxy::_hints_manager not being optional Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210318110256.50137-1-bdenes@scylladb.com>	2021-03-18 12:47:57 +01:00
Piotr Sarna	2509b7dbde	Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity As #1449 notes, trichotomic comparators returning int are dangerous as they can be mistaken for less comparators. This series converts dht::ring_position and dht::decorated_key, as well as a few closely related downstream types, to return std::strong_ordering. Closes #8225 * github.com:scylladb/scylla: dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering pager: rephrase misleading comparison check test: total_order_checks: prepare for std::strong_ordering test: mutation_test: prepare merge_container for std::strong_ordering intrusive_array: prepare for std::strong_ordering utils: collection-concepts: prepare for std::strong_ordering	2021-03-18 11:51:54 +01:00
Avi Kivity	378556418c	dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering Convert tri_comparators to return std::strong_ordering rather than int, to prevent confusion with less comparators. Downstream users are either also converted, or adjust the return type back to int, whichever happens to be simpler; in all cases the change is trivial.	2021-03-18 12:40:05 +02:00
Avi Kivity	4ead1a79ce	pager: rephrase misleading comparison check We check !result_of_tri_compare, which makes it look like we're checking a boolean predicate, whereas we're really checking for equality. Change to result_of_tri_compare == 0, which is less likely to be confusing, and is also compatible with std::strong_ordering.	2021-03-18 12:40:05 +02:00
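The hazard described in #1449 and in the pager commit above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not Scylla code; `tri_compare`, `is_less` and `misused_as_less` are hypothetical names introduced only for this illustration. An int-returning trichotomic comparator is truthy for both "less" and "greater", so treating its result as a boolean silently gives the wrong answer, and negating it actually tests for equality.

```python
def tri_compare(a, b):
    """Trichotomic comparator: negative if a < b, zero if equal, positive if a > b."""
    return (a > b) - (a < b)


def is_less(a, b):
    """Correct use: compare the trichotomic result against zero."""
    return tri_compare(a, b) < 0


def misused_as_less(a, b):
    """Incorrect use: the int result is treated as if it were a 'less' predicate."""
    return bool(tri_compare(a, b))


if __name__ == "__main__":
    # 2 is not less than 1, but the misused form says it is.
    print(is_less(2, 1), misused_as_less(2, 1))
    # Negating the result looks like a boolean check but really tests equality.
    print(not tri_compare(3, 3), tri_compare(3, 3) == 0)
```

A dedicated ordering type such as `std::strong_ordering` avoids this class of mistake in C++ because it does not convert implicitly to `bool`.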
Avi Kivity	a5f17b9a2d	test: total_order_checks: prepare for std::strong_ordering Adjust the total_order_check template to work with comparators returning either int (as a temporary compatibility measure) or std::strong_ordering (for #1449 safety).	2021-03-18 12:40:05 +02:00
Avi Kivity	f0092ae475	test: mutation_test: prepare merge_container for std::strong_ordering The function merge_container() accepts a trichotomic comparator returning an int. As #1449 explains, this is dangerous as it could be mistaken for a less comparator. Switch to std::strong_ordering, but leave a compatible merge_container() in place as it is still needed (even after this series).	2021-03-18 12:40:05 +02:00
Avi Kivity	fe0f983dfb	intrusive_array: prepare for std::strong_ordering Newer comparators can return std::strong_ordering, so don't expect an int.	2021-03-18 12:40:05 +02:00
Avi Kivity	9fbe4850c9	utils: collection-concepts: prepare for std::strong_ordering collection-concepts includes a Comparable concept for a trichotomic comparator function, used in intrusive btree and double_decker. Prepare for std::strong_ordering by also allowing std::strong_ordering as a return type. Once we've cleaned the code base, we can tighten it to only allow std::strong_ordering.	2021-03-18 12:40:03 +02:00
Piotr Sarna	0bcf584992	docs: mention --no-rebase in maintainer.md For a default git config it's enough to pull with --no-ff to ensure that a merge commit is created, but with a custom configuration, it's better to also explicitly prevent rebasing. Message-Id: <7dc6027f1f38fa4db7435592a3b72308b1a08614.1616063525.git.sarna@scylladb.com>	2021-03-18 12:38:29 +02:00
Piotr Sarna	5a852d3812	Merge 'Decouple memory limiter sem from storage service' from Pavel This set removes a few more calls to the global storage service and prevents more of them from appearing in thrift, which is about to start using the memory limiter semaphore too. The set turns this semaphore into a sharded one living in the scope of main(), makes others use the local instance and removes the no longer needed bits from storage service. tests: unit(dev) branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem * xemul_drop_memory_limiter: storage_service: Drop memory limiter memory_limiter: Use main-local instance everyehere main: Have local memory limiter and carry where needed memory_limiter: Encapsulate memory limiting facility cql_server: Remove semaphore getter fn from config	2021-03-18 11:29:32 +01:00
Pavel Emelyanov	dcdd207349	storage_service: Drop memory limiter Nobody uses it now. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-18 11:28:45 +01:00
Pavel Emelyanov	f0a79574d4	memory_limiter: Use main-local instance everyehere The cql_server and alternator both need the limiter, so patch them to stop using storage service's one and use the main-local one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-18 11:28:45 +01:00
Pavel Emelyanov	359e9caf54	main: Have local memory limiter and carry where needed Prepare memory limiters to have non-global instance of the service. For now the main-local instance is not used and (!) is not stopped for real, just like the storage_service's one is. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-18 11:28:45 +01:00
Pavel Emelyanov	4ca2ae1341	memory_limiter: Encapsulate memory limiting facility The storage service carries a semaphore and a size_t value to facilitate the memory limiting for client services. This patch encapsulates both fields in a separate helper class that will be used by whoever needs it without messing with the storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-18 11:28:45 +01:00
Pavel Emelyanov	c2f94fb527	cql_server: Remove semaphore getter fn from config The cql_server() needs to get the memory limiter semaphore from the local storage service instance. To make this happen a callback is introduced on the config structure. The same can be achieved in a simpler manner -- by providing the local storage service instances directly. Actually, the storage service will be removed in further patches from this place, so this patch is mostly to get rid of the callback from the config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-18 11:28:45 +01:00
Nadav Har'El	4a7d3175e9	test/alternator: make another test faster The slowest test in test_streams.py is test_list_streams_paged. It is meant to test the ListStreams operation with paging. The existing test repeated its test four times, for four different stream types. However, there is no reason to suspect that the ListStreams operation might somehow be different for the four stream types... We already have other tests which create streams of the four types, and use these streams - we don't need the test for ListStreams to also test creating the four types. By doing this test just once, not four times, we can save around 1.5 seconds of test time. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>	2021-03-18 11:24:18 +01:00
Nadav Har'El	79af728335	test/alternator: make tracing test a bit faster In the test test_tracing.py::test_tracing_all, we do some operations and then need to wait until they appear in the tracing table. The current code used an exponentially-increasing delay during this wait, starting with 0.1 seconds and then doubling the delay until we find what we're looking for. However, it turns out that the delay until the data appears in the table is deliberately chosen by Scylla - and is always around 2 seconds. In this case, an exponential delay is really bad - we will usually wait for around 1 second too long after the needed wait of 2 seconds. So in this patch we replace the exponential delay by a constant delay - we wait 0.3 seconds between each retry. This change makes the test test_tracing.py::test_tracing_all finish in a little over 2 seconds, instead of a little over 3 seconds before this patch. We cannot reduce this 2 second time any further unless we make the 2-second tracing delay configurable. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>	2021-03-18 11:24:18 +01:00
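The timing argument in the commit above can be checked with a small simulation. This is only a sketch under stated assumptions, not the test's actual code: `time_until_seen`, `exponential_delays` and `constant_delays` are hypothetical helpers, the data is assumed to become visible exactly 2000 ms after the operation, and the poll is assumed to check first and sleep afterwards. Times are integers in milliseconds to avoid floating-point drift.

```python
from itertools import repeat


def exponential_delays(start_ms):
    """Yield start_ms, 2*start_ms, 4*start_ms, ... forever."""
    delay = start_ms
    while True:
        yield delay
        delay *= 2


def constant_delays(step_ms):
    """Yield the same delay forever."""
    return repeat(step_ms)


def time_until_seen(ready_at_ms, delays):
    """Return the simulated time (ms) at which a poll first finds the data.

    The poller checks, and if the data is not there yet, sleeps for the
    next delay before checking again.
    """
    now = 0
    for delay in delays:
        if now >= ready_at_ms:
            return now
        now += delay
    raise TimeoutError("ran out of retries before the data appeared")


if __name__ == "__main__":
    print("exponential:", time_until_seen(2000, exponential_delays(100)), "ms")
    print("constant:   ", time_until_seen(2000, constant_delays(300)), "ms")
```

With these assumptions the exponential schedule polls at 0, 100, 300, 700, 1500 and 3100 ms, while the constant schedule first succeeds at 2100 ms, which matches the "little over 3 seconds" and "little over 2 seconds" figures quoted in the message.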
Nadav Har'El	4e87f95b42	test/alternator: remove slow and unhelpful test The test test_table.py::test_table_streams_on creates tables with various stream types, and then immediately deletes them without testing anything. This is a slow test (taking almost a full second on my laptop), and is redundant because in test_streams.py we have tests which create tables with streams in the same way - but then actually test that things work with these streams. So this test might as well be removed, and this is what we do in this patch. Removing this test shaves another second from the Alternator test suite's run time. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>	2021-03-18 11:24:18 +01:00
Nadav Har'El	879656e3e0	test/alternator: make a test faster, safer and more correct The test test_condition_expression.py::test_condition_expression_with_forbidden_rmw takes half a second to run (dev build, on my laptop), one of the slowest tests in Alternator's test suite. Part of the reason was that it needlessly set the same table to forbidden_rmw, multiple times. Instead of doing that, we switch to using the test_table_s_forbid_rmw fixture, which is a table like test_table_s but created just once in forbid_rmw mode. The result is a faster test (0.05 seconds instead of 0.5 seconds), but also safer if we ever want to run tests in parallel. It also fixes a bug in the test: At the end of the test, we intended to double-check that although the forbid_rmw table forbids read-modify-write operations, it does allow pure writes. Yet the test did this after clearing the forbid_rmw mode... So after this patch the test verifies this on the forbid_rmw table, as intended. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>	2021-03-18 11:24:18 +01:00
Nadav Har'El	1c2e473e62	test/alternator: make a test faster The test test_condition_expression.py::test_condition_expression_with_permissive_write_isolation currently takes (on my laptop, dev build) a full two seconds, one of the slowest tests. It is not surprising it is slow - it runs five other tests three times each (for three different write isolation modes), but it doesn't have to be this slow. Before this patch, for each of the five tests we switch the write isolation mode three times, and these switches involve schema changes and are fairly slow. So in this patch we invert the loops and move the write isolation mode switch to the outer loop. This patch halves the runtime of this test - from two seconds to one. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>	2021-03-18 11:24:18 +01:00
Takuya ASADA	d9a625c842	scylla_setup: don't run node-exporter setup when it's not installed We need to run the package existence check before running the setup of node-exporter. Fixes #8276 Closes #8278	2021-03-18 11:24:18 +01:00
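The effect of swapping the loop nesting can be sketched by counting how often the expensive mode switch happens. This is an illustration only: the test and mode names below are placeholders, not the names used in the Alternator test suite, and `count_switches` is a hypothetical helper that treats every change of mode as one schema-changing switch.

```python
from itertools import product

# Placeholder names; the real suite runs five sub-tests under three
# write isolation modes, as described in the commit message.
TESTS = [f"test_{i}" for i in range(5)]
MODES = ["mode_a", "mode_b", "mode_c"]


def count_switches(mode_sequence):
    """Count how many times the active mode has to change."""
    current = None
    switches = 0
    for mode in mode_sequence:
        if mode != current:
            current = mode
            switches += 1
    return switches


# Before: tests in the outer loop, modes in the inner loop.
tests_outer = [mode for _test, mode in product(TESTS, MODES)]
# After: modes in the outer loop, tests in the inner loop.
modes_outer = [mode for mode, _test in product(MODES, TESTS)]

if __name__ == "__main__":
    print("test runs in both orders:", len(tests_outer), len(modes_outer))
    print("switches, tests outer:", count_switches(tests_outer))
    print("switches, modes outer:", count_switches(modes_outer))
```

Both orders run the same 15 test/mode combinations, but the mode-outer order needs 3 switches instead of 15, which is where the saving comes from.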
Avi Kivity	f038d1555c	Merge 'Add more context to configure.py' from Piotr Sarna This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker. Closes #8267 * github.com:scylladb/scylla: configure: print more context if the linking attempt failed configure: provide more context on failed ./configure.py run configure: add verbose option to try_compile_and_link	2021-03-18 11:24:18 +01:00
Takuya ASADA	0424a41e30	tools/toolchain: stop ignoring error on install-dependencies.sh, run jmx/java script correctly We should run install-dependencies.sh with -e option to prevent ignoring error in the script. Also, need to add tools/jmx/install-dependencies.sh and tools/java/install-dependencies.sh, to fix 'No such file or directory' error on them. Fixes #8293 Closes #8294 [avi: did not regenerate toolchain image, since no new packages are installed]	2021-03-18 11:24:18 +01:00
Avi Kivity	b91d6776a0	Update tools/java submodule * tools/java fdc8fcc22c...7b66b7a0fc (1): > dist/redhat: add support SLES	2021-03-18 11:24:18 +01:00
Nadav Har'El	bd742f2951	Merge 'treewide: get rid of incorrect reinterpret casts' from Michał Chojnowski In some places we use the `reinterpret_cast<const net::packed<T>>(&x)` pattern to reinterpret memory. This is a violation of C++'s aliasing rules, which invokes undefined behaviour. The blessed way to correctly reinterpret memory is to copy it into a new object. Let's do that. Note: the reinterpret_cast way has no performance advantage. Compilers recognize the memory copy pattern and optimize it away. Closes #8241 * github.com:scylladb/scylla: treewide: get rid of unaligned_cast treewide: get rid of incorrect reinterpret casts	2021-03-18 11:24:18 +01:00
Benny Halevy	7862cad669	sstable_set: partitioned_sstable_set: clone: do clone all sstables The existing implementation wrongfully shares _all sstables rather than cloning it. This caused a use-after-free in `repair_meta::do_estimate_partitions_on_local_shard` when traversing a shared sstable_set, during which `table::make_reader_excluding_sstables` erased an entry. The erase should have happened on a cloned copy of the sstable_list, not on a shared copy. The regression was introduced in `c3b8757fa1`. Added a unit test that reproduces the share-on-copy issue for partitioned_stable_set (sstables::sstable_set). Fixes #8274 Test: unit(release, debug) DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>	2021-03-18 11:15:59 +02:00
Piotr Sarna	ea096de1b4	service, transport: avoid using private storage_service fields ... in the transport controller. Instead, simple getters would suffice. Message-Id: <582a71d0c1b61edf0107f5a2ef96536c395972d0.1615988516.git.sarna@scylladb.com>	2021-03-18 11:15:59 +02:00
Nadav Har'El	42169b2eef	Merge 'Alternator: add slow query logging' from Piotr Sarna This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced. In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but https://github.com/scylladb/scylla/issues/2572 's assumption is that this tracing might not always be available in the future. This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`. Tests: unit(dev, alternator pytest) manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced. Fixes #8292 Example trace: ```cql cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955; parameters \| duration ---------------------------------------------------------------------------------------------+---------- {'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} \| 75732 ``` Closes #8298 * github.com:scylladb/scylla: alternator: add test for slow query logging alternator: allow enabling slow query logging tracing: allow providing a custom session record param	2021-03-18 11:15:59 +02:00
Avi Kivity	de45575ea9	Merge "Allow all supported compaction types to be stopped by nodetool stop" from Raphael " All compaction types can now be stopped with the nodetool stop command, example: nodetool stop SCRUB Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB, INDEX_BUILD, RESHARD, UPGRADE, RESHAPE. " * 'stop_compaction_types_v2' of github.com:raphaelsc/scylla: compaction: Allow all supported compaction types to be stopped compaction: introduce function to map compaction name to respective type compaction: refactor mapping of compaction type to string compaction: move compaction_name() out of line	2021-03-18 11:15:59 +02:00
Botond Dénes	981699ae76	sstables: move promoted_index_blocks_reader into own header index_entry.hh (the current home of `promoted_index_blocks_reader`) is included in `sstables.hh` and thus in half our code-base. All that code really doesn't need the definition of the promoted index blocks reader which also pulls in the sstables parser mechanism. Move it into its own header and only include it where it is actually needed: the promoted index cursor implementations. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>	2021-03-18 11:15:59 +02:00
Botond Dénes	5859195b36	sstables: mx/parser.hh: add missing include Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210317093806.34858-1-bdenes@scylladb.com>	2021-03-18 11:15:59 +02:00
Benny Halevy	2e7677f76b	sstables: sstable_set_impl: include mutation_reader.hh To make sstables/sstable_set_impl.hh self-sufficient mutation_reader.hh provides position_reader_queue, needed by time_series_sstable_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210317094223.590067-1-bhalevy@scylladb.com>	2021-03-18 11:15:59 +02:00
Konstantin Osipov	66c729da66	raft: set the current leader upon getting InstallSnapshot If the current leader is set, the follower will not vote for another candidate. This is also known as "sticky leadership" rule. Before this change, the rule was enacted only upon receiving AppendEntries RPC from the leader. Turn it on also upon receiving InstallSnapshot RPC.	2021-03-18 08:36:57 +03:00
Michał Chojnowski	5c3385730b	treewide: get rid of unaligned_cast unaligned_cast violates strict aliasing rules. Replace it with safe equivalents.	2021-03-17 17:00:41 +01:00
Michał Chojnowski	4e35befcf2	treewide: get rid of incorrect reinterpret casts In some places we use the `reinterpret_cast<const net::packed<T>>(&x)` pattern to reinterpret memory. This is a violation of C++'s aliasing rules, which invokes undefined behaviour. The blessed way to correctly reinterpret memory is to copy it into a new object. Let's do that. Note: the reinterpret_cast way has no performance advantage. Compilers recognize the memory copy pattern and optimize it away.	2021-03-17 17:00:38 +01:00
Piotr Sarna	efe734c575	alternator: add test for slow query logging The test checks whether slow queries are properly logged in the system_traces.node_slow_log system table. The test is deterministic because it uses the threshold of 0ms to qualify a query as slow, which effectively makes all queries "slow enough".	2021-03-17 13:24:26 +01:00
Benny Halevy	6846319e65	partitioned_sstables_set: insert: propagate exception Do not swallow the caught exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210316170821.496218-1-bhalevy@scylladb.com>	2021-03-17 13:29:03 +02:00
Piotr Sarna	f9adee70d2	alternator: allow enabling slow query logging Alternator is now aware of the slow query logging configuration and can start tracing slow queries.	2021-03-17 11:20:42 +01:00
Piotr Sarna	5386739354	tracing: allow providing a custom session record param The mechanism of session record params is currently only used to store query strings and a couple more params like consistency level, but since we now have more frontends than just CQL and Thrift, it would be nice to also allow the users to put custom parameters in there. An immediate first user of this mechanism would be alternator, which is going to put the operation type under the "alternator_op" key. The operation type is not part of the query string due to how DynamoDB's protocol works - the op type is stored separately in the HTTP header. While it's possible to extract the operation type from the session_id, it might not be the case once #2572 is implemented.	2021-03-17 11:14:28 +01:00
Gleb Natapov	32d386d0d8	raft: fix use after free during logging in append_entries_reply() As the existing comment explains a progress can be deleted at the point of logging. The logging should only be done if the progress still exists. Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>	2021-03-17 09:59:22 +02:00
Dejan Mircevski	8db24fc03b	cql3/expr: Handle `IN ?` bound to null Previously, we crashed when the IN marker is bound to null. Throw invalid_request_exception instead. Fixes #8265 Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8287	2021-03-17 09:59:22 +02:00
Avi Kivity	1afd6fbe06	hashing: appending_hash: convert from enable_if to concepts A little simpler to understand. Closes #8288	2021-03-17 09:59:22 +02:00
Piotr Sarna	7961a28835	Merge 'storage_proxy: Include counter writes in... ... `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz With this change, coordinator prefers himself as the "counter leader", so if another endpoint is chosen as the leader, we know that coordinator was not a member of replica set. With this guarantee we can increment `scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric after electing different leader (that metric used to neglect the counter updates). The motivation for this change is to have more reliable way of counting non-token-aware queries. Fixes #4337 Closes #8282 * github.com:scylladb/scylla: storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set` counters: Favor coordinator as leader	2021-03-17 09:59:22 +02:00
Avi Kivity	972ea9900c	Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund Refs #7794 Iff we need to pre-fill segment file in O_DSYNC mode, we should drop this for the pre-fill, to avoid issuing flushes until the file is filled. Done by temporarily closing, re-opening in "normal" mode, filling, then re-opening. Closes #8250 * github.com:scylladb/scylla: commitlog: Make pre-allocation drop O_DSYNC while pre-filling commitlog: coroutinize allocate_segment_ex	2021-03-17 09:59:22 +02:00
Dejan Mircevski	992d5c6184	cql3/expr: Improve column printing Before this change, we would print an expression like this: ((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b) Now, we print the same expression like this: (c = 0000007b) Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8285	2021-03-17 09:59:22 +02:00
Tomasz Grabiec	40121621f6	Merge "Kill some get_local_migration_manager() calls" from Pavel Emelyanov There are a bunch of such calls in schema altering statements and there's currently no way to obtain the migration manager for such statements, so a relatively big rework is needed. The solution in this set is -- all statements' execute() methods are called with query processor as first argument (now the storage proxy is there), query processor references and provides migration manager for statements. Those statements that need proxy can already get it from the query processor. Afterwards table_helper and thrift code can also stop using the global migration manager instance, since they both have query processor in needed places. While patching them a couple of calls to global storage proxy also go away. The new query processor -> migration manager dependency fits into current start-stop sequence: the migration manager is started early, the query processor is started after it. On stop the query processor remains alive, but the migration manager stops. But since no code currently (should) call get_local_migration_manager() it will _not_ call the query_processor::get_migration_manager() either, so this dangling reference is ugly, but safe. Another option could be to make storage proxy reference migration manager, but this dependency doesn't look correct -- migration manager is a higher-level service than the storage proxy is; it is the migration manager that currently calls the storage proxy, not vice versa. 
* xemul/br-kill-some-migration-managers-2: cql3: Get database directly from query processor thrift: Use query_processor::get_migration_manager() table_helper: Use query_processor::get_migration_manager() cql3: Use query_processor::get_migration_manager() (lambda captures cases) cql3: Use query_processor::get_migration_manager() (alter_type statement) cql3: Use query_processor::get_migration_manager() (trivial cases) query_processor: Keep migration manager onboard cql3: Pass query processor to announce_migration:s cql3: Switch to qp (almost) in schema-altering-stmt cql3: Change execute()'s 1st arg to query_processor	2021-03-17 09:59:22 +02:00
Raphael S. Carvalho	2065e2c912	partitioned_sstable_set: adjust select_sstable_runs() to work with compound set compound set will select runs from all of its managed sets, so let's adjust select_sstable_runs() to only return runs which belong to it. without this adjustment, selection of runs would fail because function would try to unconditionally retrieve the run which may live somewhere else. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>	2021-03-17 09:59:22 +02:00
Raphael S. Carvalho	02b2df1ea9	sstable_set: move select_sstable_runs() into partitioned_sstable_set after compound set is introduced, select_sstable_runs() will no longer work because the sstable runs live in sstable_set, but they should actually live in the sstable_set being written to. Given that runs is a concept that belongs only to strategies which use partitioned_sstable_set, let's move the implementation of select_sstable_runs() to it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>	2021-03-17 09:59:22 +02:00
Avi Kivity	11308c05f4	Update tools/jmx submodule * tools/jmx 15c1d4f...9c687b5 (1): > dist/redhat: add support SLES	2021-03-17 09:59:22 +02:00
Calle Wilund	a0745f9498	messaging_service: Enforce dc/rack membership iff required for non-tls connections When internode_encryption is "rack" or "dc", we should enforce incoming connections are from the appropriate address spaces iff answering on non-tls socket. This is implemented by having two protocol handlers. One for tls/full notls, and one for mixed (needs checking) connections. The latter will ask snitch if remote address is kosher, and refuse the connection otherwise. Note: requires seastar patches: "rpc: Make is possible for rpc server instance to refuse connection" "RPC: (client) retain local address and use on stream creation" Note that ip-level checks are not exhaustive. If a user is also using "require_client_auth" with dc/rack tls setting we should warn them that there is a possibility that someone could spoof their way past the authentication. Closes #8051	2021-03-17 09:59:22 +02:00
Avi Kivity	bcd41cb32d	Merge 'Support installing our rpm to SLES' from Takuya ASADA Basically SLES support is already done in `f20736d93d`, but it was for offline installer. This fixes a few more problems with installing our rpm on SLES. After this change, we can just install our rpm for both CentOS/RHEL and SLES in single image, like unified deb. SLES uses its own package manager called 'zypper', but it does support yum repositories, so no change is required for the repo. Closes #8277 * github.com:scylladb/scylla: scylla_coredump_setup: support SLES scylla_setup: use rpm to check package availability for SLES dist: install optional packages for SLES	2021-03-17 09:59:22 +02:00
Tomasz Grabiec	cc0bb92afe	Merge "raft: provide a ticker for each raft server" from Pavel Solodovnikov Automatically initialize and start a timer in `raft_services::add_server` for each raft server instance created. The patch set also changes several other things in order for tickers to work: 1. A bug in `raft_sys_table_storage` which caused an exception if `raft::server::start` is called without any persisted state. 2. `raft_services::add_server` now automatically calls `raft::server::start()` since a server instance should be started before any of its methods can be called. 3. Raft servers can now start with initial term = 0. There was an artificial restriction which is now lifted. 4. Raft schema state machine now returns a ready future instead of throwing "not implemented" exception in `abort()`. * github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase: raft/raft_services: provide a ticker for each raft server raft/raft_services: switch from plain `throw` to `on_internal_error` raft/raft_services: start server instance automatically in `add_server` raft: return ready future instead of throwing in schema_raft_state_machine raft: allow raft server to start with initial term 0 raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state	2021-03-17 09:59:22 +02:00
Nadav Har'El	e344f74858	Merge 'logalloc: improve background reclaim shares management' from Avi Kivity The log structured allocator's background reclaimer tries to allocate CPU power proportional to memory demand, but a bug made that not happen. Fix the bug, add some logging, and future-proof the timer. Also, harden the test against overcommitted test machines. Fixes #8234. Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each) Closes #8281 * github.com:scylladb/scylla: test: logalloc_test: harden background reclain test against cpu overcommit logalloc: background reclaim: use default scheduling group for adjusting shares logalloc: background reclaim: log shares adjustment under trace level logalloc: background reclaim: fix shares not updated by periodic timer	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	aaea8c6c7d	raft/raft_services: provide a ticker for each raft server Automatically initialize a ticker for each raft server instance when `raft_services::add_server` is called. A ticker is a timer which regularly calls `raft::server::tick` in order to tick its raft protocol state machine. Note that the timer should start after the server calls its `start()` method, because otherwise it would crash since fsm is not initialized yet. Currently, the tick interval is hardcoded to be 100ms. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	1496a3559f	raft/raft_services: switch from plain `throw` to `on_internal_error` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	975c9a8021	raft/raft_services: start server instance automatically in `add_server` Raft server instance cannot be used in any way prior to calling the `start()` method, which initializes its internal state, e.g. raft protocol state machine. Otherwise, it will likely result in a crash. Also, properly stop the servers on shutdown via `raft_services::stop_servers()`. In case some exception happened inside `add_server`, the `init` function will de-initialize what it already initialized, i.e. raft rpc verbs. This is important since otherwise it would break further initialization process and, what is more important, will prevent raft rpc verbs deinitialization. This will cause a crash in `messaging_service` uninit procedure, because raft rpc handlers would still be initialized. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	0b3dba07bd	raft: return ready future instead of throwing in schema_raft_state_machine The current implementation throws an exception, which will cause a crash when stopping scylla. This will be used in the next patch. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	93c565a1bf	raft: allow raft server to start with initial term 0 Prior to the fix there was an assert to check in `raft::server_impl::start` that the initial term is not 0. This restriction is completely artificial and can be lifted without any problems, which will be described below. The only place that is dependent on this corner case is in `server_impl::io_fiber`. Whenever term or vote has changed, they will be both set in `fsm::get_output`. `io_fiber` checks whether it needs to persist term and vote by validating that the term field is set (by actually executing a `term != 0` condition). This particular check is based on a non-obvious fact that the term will never be 0 in case `fsm::get_output` saves term and vote values, indicating that they need to be persisted. Vote and term can change independently of each other, so that checking only for term obscures what is happening and why even more. In either case term will never be 0, because: 1. If the term has changed, then it's naturally greater than 0, since it's a monotonically increasing value. 2. If the vote has changed, it means that we received a vote request message. In such case we have already updated our term to the requester's term. Switch to using an explicit optional in `fsm_output` so that a reader doesn't have to think about the motivation behind this `if` and can just check that the `term_and_vote` optional is engaged. Given the motivation described above, the corresponding assert(_fsm->get_current_term() != term_t(0)); in `server_impl::start` is removed. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Pavel Solodovnikov	ae5f26adec	raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state When a raft server is started for the first time and there isn't any persisted state yet, provide default return values for `load_term_and_vote` and `load_snapshot`. The code currently does not handle this corner case correctly and fails with an exception. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Juliusz Stasiewicz	f77d0f5439	storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set` The coordinator prefers itself as the "counter leader", so if another endpoint is chosen as the leader, we know that the coordinator was not a member of the replica set. We can use this information to increment the relevant metric (which used to neglect counters completely). Fixes #4337	2021-03-16 12:07:16 +01:00
Juliusz Stasiewicz	5689106b92	counters: Favor coordinator as leader This not only reduces internode traffic but is also needed for a later change in this PR: metrics for non-token-aware writes including counter updates.	2021-03-16 12:07:13 +01:00
Pavel Emelyanov	a7a5ad4ded	range_tombstone_stream: Remove unused methods Both methods apply a list of tombstones to the stream. One was unused even before the set, the other one became unused after previous patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:08:18 +03:00
Pavel Emelyanov	2e6255c499	partition_snapshot_reader: Emit range tombstones on demand Currently the reader gets all range tombstones from the given range and places them into a stream. When filling the buffer with fragments the range tombstones are extracted from the stream one by one. This is memory consuming, the reader's memory usage shouldn't depend on the number of inhabitants in the partition range. The patch implements the heap-based cursor for range tombstones almost like it's done for rows. The heap contains range_tombstone_list::iterator_ranges, the tombstones are popped from the heap when needed, are applied into the stream and then are emitted from it into the buffer. The refresh_state() is called on each new range to set up the iterators, and when lsa reports references invalidation to refresh the iterators. To let the refresh_state revalidate the iterators, the position at which the last range tombstone was emitted is maintained. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:08:18 +03:00
Pavel Emelyanov	ef61f84426	partition_snapshot_reader: Introduce maybe_refresh_state The existing refresh_state() is supposed to setup or revalidate iterators to rows inside partition versions if needed. It will be called in more than one place soon, so here's the helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:08:18 +03:00
Pavel Emelyanov	5e0a8130d4	partition_snapshot_reader: Move range tombstone stream member The lsa_partition_reader is the helper sub-class for partition_snapshot_reader that, among other things, is responsible for filling the stream of range tombstones, that's then used by the reader itself. Next patches will change the way range tombstones are emitted by the reader, so hide the stream inside the helper subclass in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:08:18 +03:00
Pavel Emelyanov	755d993031	partition_snapshot_reader: Add reset_state method to helper class This method "notifies" the lsa_reader helper class when the owning reader moves to a new range. This method is now empty, but will be used by next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:07:20 +03:00
Pavel Emelyanov	a387fbd984	partition_snapshot_reader: Downgrade heap comparator Next patch will extend the comparator to manage heap of range tombstones. Not to add yet another comparator to it (and not to create another heap comparator class) just use the comparator that's common for both -- rows and range tombstones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:06:19 +03:00
Pavel Emelyanov	2179014efa	partition_snapshot_reader: Use on-demand comparators There are already two raii-sh comparators on reader, next patch will need to add the third. This just bloats the reader, the comparators in question are state-less and can be created on demand for free. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 12:04:47 +03:00
Pavel Emelyanov	c8b2079705	range_tombstone_list: Add new slice() helper There are two of them now -- one to return iterator_range that covers the given query::clustering_range, the other to return it for two given positions. In the next patch the 3rd one is needed -- the slice() to get iterator_range that a) starts strictly after a given position b) ends after the given clustering_range's end It will be used to refresh the range tombstones iterators after some of them have been emitted. The same thing is currently done by partition_snapshot_reader's refresh_state wrt rows: if (last_row) start = rows.upper_bound(last_row) // continuation else start = rows.lower_bound(range.start) // initial end = rows.upper_bound(range.end) // end is the same in // either case Respectively for range tombstones the goal is the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 11:55:28 +03:00
Pavel Emelyanov	7e1170ecb9	range_tombstone_list: Introduce iterator_range alias The range_tombstone_list::slice() set of methods return a pair of iterators representing a range. In the next patches this pair will be actively used, and it's handy to have a shorter alias for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-16 11:55:28 +03:00
Piotr Sarna	2201c9b146	configure: print more context if the linking attempt failed Previously, when a linking attempt failed, configure.py immediately printed that neither lld nor gold was found, which might be misleading if the linkers are installed, but the compilation failed anyway. The printed information is now more specific, and combined with the previous commit, it will also provide more information why the compilation attempt failed.	2021-03-16 07:39:05 +01:00
Piotr Sarna	f86b879933	configure: provide more context on failed ./configure.py run If the configuration step failed, it used to only inform that it must be due to the wrong GCC version, which can be misleading. For instance, trying to compile on clang with incorrect flags also resulted in a "wrong GCC version" message. Now, the message is more generic, but it also prints the stderr output from the failed compilation, which may help pinpoint the problem: $ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang Note: neither lld nor gold found; using default system linker Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 [] // clang pretends to be gcc (defined __GNUC__), so we // must check it first #ifdef __clang__ #if __clang_major__ < 10 #error "MAJOR" #endif #elif defined(__GNUC__) #if __GNUC__ < 10 #error "MAJOR" #elif __GNUC__ == 10 #if __GNUC_MINOR__ < 1 #error "MINOR" #elif __GNUC_MINOR__ == 1 #if __GNUC_PATCHLEVEL__ < 1 #error "PATCHLEVEL" #endif #endif #endif #else #error "Unrecognized compiler" #endif int main() { return 0; } clang-11: error: unknown argument: '-fhello' distcc[4085341] ERROR: compile (null) on localhost failed Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.	2021-03-16 07:39:03 +01:00
Piotr Sarna	6389246d6e	configure: add verbose option to try_compile_and_link Which will be useful later for providing more context why a ./configure.py run failed.	2021-03-16 07:35:16 +01:00
Pavel Emelyanov	12e4269dce	cql3: Get database directly from query processor After previous patches some places in cql3 code take a long path to get database reference: query processor -> storage proxy -> database The query processor can provide the database reference by itself, so take this chance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:36:04 +03:00
Pavel Emelyanov	fb49550943	thrift: Use query_processor::get_migration_manager() Thrift needs the migration manager to call announce_<something> on it and currently it grabs the global migration manager instance. Since the thrift handler has a query processor reference onboard and the query processor can provide the migration manager reference, it's time to remove a few more globals from thrift code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:35:59 +03:00
Pavel Emelyanov	6dc9a16b4e	table_helper: Use query_processor::get_migration_manager() Now that the migration manager can be obtained from the query processor, the table helper can also benefit from it and no longer call for the global migration manager instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:35:53 +03:00
Pavel Emelyanov	a9646dd779	cql3: Use query_processor::get_migration_manager() (lambda captures cases) There are a few schema altering statements that need to have the query processor inside lambda continuations. Fortunately, they all are continuations of make_ready_future<>()s, so the query processor can be simply captured by reference and used. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:35:48 +03:00
Pavel Emelyanov	50e4eacd08	cql3: Use query_processor::get_migration_manager() (alter_type statement) This statement needs the query processor one step below the stack from its .announce_migration method. So here's the dedicated patch for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:35:43 +03:00
Pavel Emelyanov	464e58abf7	cql3: Use query_processor::get_migration_manager() (trivial cases) Most of the schema altering statement implementations can now stop calling for the global migration manager instance and get it from the query processor. Here are the trivial cases when the query processor is just available at the place where it's needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:35:36 +03:00
Pavel Emelyanov	1de235f4da	query_processor: Keep migration manager onboard The query processor sits above the migration manager in the services layering: it's started after and (will be) stopped before the migration manager. The migration manager is needed in schema altering statements which are called with a query processor argument. They will later get the migration manager from the query processor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:00:58 +03:00
Pavel Emelyanov	1e8f0963f9	cql3: Pass query processor to announce_migration:s Now that the only call to .announce_migration has the query processor at hand -- pass it to the real statements. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:00:33 +03:00
Pavel Emelyanov	470928dd94	cql3: Switch to qp (almost) in schema-altering-stmt The schema altering statements are all inherited from the same base class which declares a pure virtual .announce_migration() method. All the real statements are called with a storage proxy argument, while they need the migration manager. So like in the previous patch -- replace storage proxy with query processor. While doing the replacement also get the database instance from the query processor, not from the proxy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:00:33 +03:00
Pavel Emelyanov	26c115f379	cql3: Change execute()'s 1st arg to query_processor Currently the statement's execute() method accepts storage proxy as the first argument. This is enough for all of them but the schema altering ones, because the latter need to call the migration manager's announce. To provide the migration manager to those who need it, some higher-level service than the proxy is required. The query processor seems to be a good candidate for it. That said -- all the .execute()s now accept the query processor instead of the proxy and get the proxy itself from the query processor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-15 19:00:33 +03:00
Avi Kivity	65fea203d2	test: logalloc_test: harden background reclain test against cpu overcommit Use thread CPU time instead of real time to avoid an overcommitted machine from not being able to supply enough CPU for the test.	2021-03-15 13:54:49 +02:00
Avi Kivity	290897ddbc	logalloc: background reclaim: use default scheduling group for adjusting shares If the shares are currently low, we might not get enough CPU time to adjust the shares in time. This is currently no-op, since Seastar runs the callback outside scheduling groups (and only uses the scheduling group for inherited continuations); but better be insulated against such details.	2021-03-15 13:54:49 +02:00
Avi Kivity	a87f6498c3	logalloc: background reclaim: log shares adjustment under trace level Useful when debugging, but too noisy at any other time.	2021-03-15 13:54:49 +02:00
Avi Kivity	ce1b1d6ec4	logalloc: background reclaim: fix shares not updated by periodic timer adjust_shares() thinks it needs to do nothing if the main loop is running, but in reality it can only avoid waking the main loop; it still needs to adjust the shares unconditionally. Otherwise, the background reclaim shares can get locked into a low value. Fix by splitting the conditional into two.	2021-03-15 13:54:37 +02:00
Tomasz Grabiec	bf6c4e0b24	Merge "raft: consolidate tests in raft directory" from Alejo Move boost tests to tests/raft and factor out common helpers. * alejo/raft-tests-reorg-5-rebase-next-2: raft: tests: move common helpers to header raft: tests: move boost tests to tests/raft	2021-03-15 11:59:16 +01:00
Takuya ASADA	e8cfd5114f	scylla_coredump_setup: support SLES SLES requires installing the systemd-coredump package and enabling systemd-coredump.socket to use systemd-coredump.	2021-03-15 19:19:56 +09:00
Takuya ASADA	13871ff1f8	scylla_setup: use rpm to check package availability for SLES Use rpm to check scylla packages installed on SLES.	2021-03-15 19:18:44 +09:00
Takuya ASADA	e3b5ffcf14	dist: install optional packages for SLES Support SUSE's own package manager 'zypper' in the pkg_install() function.	2021-03-15 19:17:48 +09:00
Alejo Sanchez	88063b6e3e	raft: tests: move common helpers to header Move common test helper functions and data structures to a common helpers.hh header. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-15 06:16:58 -04:00
Alejo Sanchez	6139ad6337	raft: tests: move boost tests to tests/raft Move raft boost tests to test/raft directory. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-15 06:16:58 -04:00
Calle Wilund	48ca01c3ab	commitlog: Make pre-allocation drop O_DSYNC while pre-filling Refs #7794 Iff we need to pre-fill a segment file in O_DSYNC mode, we should drop this for the pre-fill, to avoid issuing flushes until the file is filled. Done by temporarily closing, re-opening in "normal" mode, filling, then re-opening. v2: * More comment v3: * Add missing flush v4: * comment v5: * Split coroutine and fix into separate patches	2021-03-15 09:35:45 +00:00
Calle Wilund	ae3b8e6fdf	commitlog: coroutinize allocate_segment_ex To make further changes here easier to write and read.	2021-03-15 09:35:37 +00:00
Avi Kivity	f326a2253c	Update tools/java submodule * tools/java 2c6110500c...fdc8fcc22c (1): > sstableloader: Use compound "where" restrictions for clustering	2021-03-15 11:19:22 +02:00
Raphael S. Carvalho	7171244844	compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism Prior to `463d0ab`, only one table could be cleaned up at a time on a given shard. Since then, all tables belonging to a given keyspace are cleaned up in parallel. Cleanup serialization on each shard was enforced with a semaphore, which was incorrectly removed by the aforementioned patch. So the space requirement for cleanup to succeed can be up to the size of the keyspace, increasing the chances of the node running out of space. The node could also run out of memory if there are tons of tables in the keyspace. The memory requirement is at least #_of_tables * 128k (not taking into account write behind, etc). With 5k tables, it's ~0.64G per shard. Also, all tables being cleaned up in parallel will compete for the same disk and cpu bandwidth, making them all much slower, and consequently the operation time is significantly higher. This problem was detected with cleanup, but scrub and upgrade go through the same rewrite procedure, so they're affected by exactly the same problem. Fixes #8247. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>	2021-03-14 14:31:26 +02:00
Nadav Har'El	d73934372d	storage_service: correct missing exception in logging rebuild failure When failing to rebuild a node, we would print the error with the useless explanation "<no exception>". The problem was a typo in the logging command which used std::current_exception() - which wasn't relevant in that point - instead of "ep". Refs #8089 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>	2021-03-14 14:11:11 +02:00
Tomasz Grabiec	f2ecb4617e	Merge "raft: implement prevoting stage in leader election" from Gleb This is how the PhD explains the need for the prevoting stage: One downside of Raft's leader election algorithm is that a server that has been partitioned from the cluster is likely to cause a disruption when it regains connectivity. When a server is partitioned, it will not receive heartbeats. It will soon increment its term to start an election, although it won't be able to collect enough votes to become leader. When the server regains connectivity sometime later, its larger term number will propagate to the rest of the cluster (either through the server's RequestVote requests or through its AppendEntries response). This will force the cluster leader to step down, and a new election will have to take place to select a new leader. The prevoting stage addresses that. In the Prevote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes (if the candidate's log is sufficiently up-to-date, and the voters have not received heartbeats from a valid leader for at least a baseline election timeout). The Prevote algorithm solves the issue of a partitioned server disrupting the cluster when it rejoins. While a server is partitioned, it won't be able to increment its term, since it can't receive permission from a majority of the cluster. Then, when it rejoins the cluster, it still won't be able to increment its term, since the other servers will have been receiving regular heartbeats from the leader. Once the server receives a heartbeat from the leader itself, it will return to the follower state (in the same term). In our implementation we have a "stable leader" extension that prevents a spurious RequestVote from deposing an active leader, but AppendEntries with a higher term will still do that, so the prevoting extension is also required. 
* scylla-dev/raft-prevote-v5: raft: store leader and candidate state in state variant raft: add boost tests for prevoting raft: implement prevoting stage in leader election raft: reset the leader on entering candidate state raft: use modern unordered_set::contains instead of find in become_candidate	2021-03-12 11:15:51 +01:00
Gleb Natapov	e231186a7b	raft: store leader and candidate state in state variant We already have server state dependent state in fsm, so there is no need to maintain "voters" and "tracker" optionals as well. The upside is that optional and variant states cannot drift apart now.	2021-03-12 11:12:57 +02:00
Gleb Natapov	e17e7d57bd	raft: add boost tests for prevoting	2021-03-12 11:12:57 +02:00
Gleb Natapov	1f868d516e	raft: implement prevoting stage in leader election This is how the PhD explains the need for the prevoting stage: One downside of Raft's leader election algorithm is that a server that has been partitioned from the cluster is likely to cause a disruption when it regains connectivity. When a server is partitioned, it will not receive heartbeats. It will soon increment its term to start an election, although it won't be able to collect enough votes to become leader. When the server regains connectivity sometime later, its larger term number will propagate to the rest of the cluster (either through the server's RequestVote requests or through its AppendEntries response). This will force the cluster leader to step down, and a new election will have to take place to select a new leader. The prevoting stage addresses that. In the Prevote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes (if the candidate's log is sufficiently up-to-date, and the voters have not received heartbeats from a valid leader for at least a baseline election timeout). The Prevote algorithm solves the issue of a partitioned server disrupting the cluster when it rejoins. While a server is partitioned, it won't be able to increment its term, since it can't receive permission from a majority of the cluster. Then, when it rejoins the cluster, it still won't be able to increment its term, since the other servers will have been receiving regular heartbeats from the leader. Once the server receives a heartbeat from the leader itself, it will return to the follower state (in the same term). In our implementation we have a "stable leader" extension that prevents a spurious RequestVote from deposing an active leader, but AppendEntries with a higher term will still do that, so the prevoting extension is also required.	2021-03-12 11:09:21 +02:00
Raphael S. Carvalho	f6fc32c8da	table: use new sstable_set::for_each_sstable for_each_sstable() is preferred over all() because it's guaranteed to perform no copy. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210311163009.42210-2-raphaelsc@scylladb.com>	2021-03-11 18:47:17 +02:00
Raphael S. Carvalho	e7a6f3926a	sstable_set: introduce for_each_sstable() This new method is preferred over all() for iterations purposes, because all() may have to copy sstables into a temporary. For example, all() implementation of the upcoming compound_sstable_set will have no choice but to merge all sstables from N managed sets into a temporary. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>	2021-03-11 18:47:16 +02:00
Avi Kivity	486f6bf29c	Merge "sstables: move format specific reader code to kl/, mx/" from Botond " Currently the sstable reader code is scattered across several source files as follows (paths are relative to sstables/): * partition.cc - generic reader code; * row.hh - format specific code related to building mutation fragments from cells; * mp_row_consumer.hh - format specific code related to parsing the raw byte stream; This is a strange organization scheme given that the generic sstable reader is a template and as such it doesn't itself depend on the other headers where the consumer and context implementations live. Yet these are all included in partition.cc just so the reader factory function can instantiate the sstable reader template with the format specific objects. This patchset reorganizes this code such that the generic sstable reader is exposed in a header. Furthermore, format specific code is moved to the kl/ and mx/ directories respectively. Each directory has a reader.hh with a single factory function which creates the reader, all the format specific code is hidden from sight. The added benefit is that now reader code specific to a format is centralized in the format specific folder, just like the writer code. This patchset only moves code around, no logical changes are made. Tests: unit(dev) " * 'sstable-reader-separation/v1' of https://github.com/denesb/scylla: sstables: get rid of mp_row_consumer.{hh,cc} sstables: get rid of row.hh sstables/mp_row_consumer.hh: remove unused struct new_mutation sstables: move mx specific context and consumer to mx/reader.cc sstables: move kl specific context and consumer to kl/reader.cc sstables: mv partition.cc sstable_mutation_reader.hh	2021-03-11 16:57:54 +02:00
Raphael S. Carvalho	6ff8bb4eac	compaction: Allow all supported compaction types to be stopped Let's make stop_compaction() use sstables::to_compaction_type(), so all supported compaction types can now be aborted. Refs #7738. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:30:11 -03:00
Raphael S. Carvalho	f1b8d5f20f	compaction: introduce function to map compaction name to respective type Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:29:59 -03:00
Raphael S. Carvalho	a44bc233f5	compaction: refactor mapping of compaction type to string This will make it easier to introduce new type and also to map type to string and vice-versa, using reverse lookup. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:29:53 -03:00
Raphael S. Carvalho	503a0ea928	compaction: move compaction_name() out of line Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:29:46 -03:00
Botond Dénes	361ba473c7	sstables: get rid of mp_row_consumer.{hh,cc} Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	3ba782bddd	sstables: get rid of row.hh Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	f5b0657fa5	sstables/mp_row_consumer.hh: remove unused struct new_mutation	2021-03-11 12:17:13 +02:00
Botond Dénes	cecc7f8064	sstables: move mx specific context and consumer to mx/reader.cc Move all the mx format specific context and consumer code to mx/reader.cc and add a factory function `mx::make_reader()` which takes over the job of instantiating the `sstable_mutation_reader` with the mx specific context and consumer.	2021-03-11 12:17:13 +02:00
Botond Dénes	4e3ae9d913	sstables: move kl specific context and consumer to kl/reader.cc Move all the kl format specific context and consumer code to kl/reader* and add a factory function `kl::make_reader()` which takes over the job of instantiating the `sstable_mutation_reader` with the kl specific context and consumer. Code which is used by tests is moved to kl/reader_impl.hh, while code that can be hidden is moved to kl/reader.cc. Users who just want to create a reader only have to include kl/reader.hh.	2021-03-11 12:17:13 +02:00
Botond Dénes	0ec040921d	sstables: mv partition.cc sstable_mutation_reader.hh The sstable reader currently knows the definition of all the different consumers and contexts. But it doesn't really need to, as it is a template. Exploit this and prepare for an organization scheme where the consumers and contexts live hidden in a cc file which includes and instantiates the sstable reader template. As a first step expose `sstable_mutation_reader` in a header.	2021-03-11 12:17:13 +02:00
Avi Kivity	a49c4ab754	Update tools/java submodule * tools/java c5d9e8513e...2c6110500c (1): > cassandra.in.sh: Add path to rack/dc properties file to classpath Fixes #7930.	2021-03-11 12:03:01 +02:00
Asias He	d5e6ba1ff1	repair: Shortcut when no followers to repair with - 3 nodes in the cluster with rf = 3 - run repair on node1 with ignore_nodes to ignore node2 and node3 - node1 has no followers to repair with However, currently node1 will walk through the repair procedure to read data from disk and calculate hashes which are unnecessary. This patch fixes this issue, so that in case there are no followers, we skip the range and avoid the unnecessary work. Before: $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3" repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats: repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769, sub_ranges_nr=769, round_nr=1456, round_nr_fast_path_already_synced=1456, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{127.0.0.1, 2822972}}, row_from_disk_nr={{127.0.0.1, 6218}}, row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s, row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={} Data was read from disk. After: $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3" repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats: repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769, sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={}, row_from_disk_nr={}, row_from_disk_bytes_per_sec={} MiB/s, row_from_disk_rows_per_sec={} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={} No data was read from disk. 
Fixes #8256 Closes #8257	2021-03-11 11:53:22 +02:00
Avi Kivity	c8f692e526	Merge 'cql3: Rewrite get_clustering_bounds() using expressions' from Dejan Mircevski Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause. This will allow us to eventually drop the `restrictions` hierarchy altogether. Tests: unit (dev, debug) Closes #8227 * github.com:scylladb/scylla: cql3: Make get_clustering_bounds() use expressions cql3/expr: Add is_multi_column() cql3/expr: Add more operators to needs_filtering cql3: Replace CK-bound mode with comparison_order cql3/expr: Make to_range globally visible cql3: Gather slice-defining WHERE expressions cql3: Add statement_restrictions::_where test: Add unit tests for get_clustering_bounds	2021-03-11 11:46:52 +02:00
Gleb Natapov	a849246cfc	raft: reset the leader on entering candidate state Not resetting a leader causes vote requests to be ignored instead of rejected which will make voting round to take more time to fail and may slow down new leader election.	2021-03-11 10:36:43 +02:00
Gleb Natapov	20d6bb36cd	raft: use modern unordered_set::contains instead of find in become_candidate	2021-03-11 10:36:43 +02:00
Dejan Mircevski	990de02d28	cql3: Make get_clustering_bounds() use expressions Use expressions instead of _clustering_columns_restrictions. This is a step towards replacing the entire restrictions class hierarchy with expressions. Update some expected results in unit tests to reflect the new code. These new results are equivalent to the old ones in how storage_proxy::query() will process them (details: bound_view::from_range() returns the same result for an empty-prefix singular as for (-inf,+inf)). Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	8dac132581	cql3/expr: Add is_multi_column() It will come in handy when we start using expressions to calculate the clustering slice. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	1f591bd16e	cql3/expr: Add more operators to needs_filtering Omitting these operators didn't cause bugs, because needs_filtering() is never invoked on them. But that will likely change in the future, so add them now to prevent problems down the road. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	c0c93982d0	cql3: Replace CK-bound mode with comparison_order Instead of defining this enum in multi_column_restriction::slice, put it in the expr namespace and add it to binary_operator. We will need it when we switch bounds calculation from multi_column_restriction to expr classes. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	7dfe471b5a	cql3/expr: Make to_range globally visible It will be used in statement_restrictions for calculating clustering bounds. And it will come in handy elsewhere in the future, I'm sure. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	28b5a372f8	cql3: Gather slice-defining WHERE expressions Add statement_restrictions::_clustering_prefix_restrictions and fill it with relevant expressions. Explain how to find all such expressions in the WHERE clause. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	da096bfdce	cql3: Add statement_restrictions::_where ... and collect all restrictions' expressions into it. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:25:43 -05:00
Dejan Mircevski	2525759027	test: Add unit tests for get_clustering_bounds ... as guardrails for the upcoming rewrite. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-03-10 21:17:26 -05:00
Calle Wilund	f44420f2c9	snapshot: Add filter to check for existing snapshot Fixes #8212 Some snapshotting operations call in on a single table at a time. When checking for existing snapshots in this case, we should not bother with snapshots in other tables. Add an optional "filter" to check routine, which if non-empty includes tables to check. Use case is "scrub" which calls with a limited set of tables to snapshot. Closes #8240	2021-03-10 20:21:38 +02:00
Benny Halevy	ff5b42a0fa	bytes_ostream: max_chunk_size: account for chunk header Currently, if the data_size is greater than max_chunk_size - sizeof(chunk), we end up allocating up to max_chunk_size + sizeof(chunk) bytes, exceeding buf.max_chunk_size(). This may lead to allocation failures, as seen in https://github.com/scylladb/scylla/issues/7950, where we couldn't allocate 131088 (= 128K + 16) bytes. This change adjusts the exposed max_chunk_size() to be max_alloc_size (128KB) - sizeof(chunk) so that the allocated chunks would normally be allocated in 128KB chunks in the write() path. Added a unit test - test_large_placeholder that stresses the chunk allocation path from the write_place_holder(size) entry point to make sure it handles large chunk allocations correctly. Refs #7950 Refs #8081 Test: unit(release), bytes_ostream_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>	2021-03-10 19:54:12 +02:00
Asias He	268fa9d9fe	main: Lower shares for main scheduling group The main scheduling group has shares of 1000, which is as high as the statement group. From time to time, we see unexpected scheduling group leaking to the main group, which causes a drop in query performance. This patch reduces the main scheduling shares to 200, which is the same as the maintenance scheduling group. It is a safer default in case code leaks to the main scheduling group. Refs: #7720 Closes #8243	2021-03-10 19:34:45 +02:00
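The arithmetic behind the bytes_ostream fix can be sketched as follows. This is only an illustration: the 16-byte header size is inferred from the "131088 (= 128K + 16)" figure in the message, and the constant and function names are hypothetical, not the actual bytes_ostream API.

```python
# Illustrative constants; the names are hypothetical, not bytes_ostream's API.
MAX_ALLOC_SIZE = 128 * 1024  # largest allocation we want to request (128 KiB)
CHUNK_HEADER_SIZE = 16       # assumed sizeof(chunk), from 131088 = 128K + 16


def old_max_chunk_size():
    # Before the fix: the payload limit equals the allocation limit,
    # so payload + header can exceed it.
    return MAX_ALLOC_SIZE


def new_max_chunk_size():
    # After the fix: leave room for the chunk header inside the limit.
    return MAX_ALLOC_SIZE - CHUNK_HEADER_SIZE


def allocation_size(payload_size):
    # Every chunk allocation carries its header in addition to the payload.
    return payload_size + CHUNK_HEADER_SIZE
```

With the old limit a full-sized chunk needs 131088 bytes; with the new limit it fits in exactly 128 KiB.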
Takuya ASADA	af8eae317b	scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero On the environment hard limit of coredump is set to zero, coredump test script will fail since the system does not generate coredump. To avoid such issue, set ulimit -c 0 before generating SEGV on the script. Note that scylla-server.service can generate coredump even ulimit -c 0 because we set LimitCORE=infinity on its systemd unit file. Fixes #8238 Closes #8245	2021-03-10 19:28:10 +02:00
Avi Kivity	5342d79461	Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael * 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla: sstable_set: move all() implementation into sstable_set_impl sstable_set: preparatory work to change sstable_set::all() api sstables: remove bag_sstable_set	2021-03-10 19:19:26 +02:00
Botond Dénes	cf28552357	mutation_test: test_mutation_diff_with_random_generator: compact input mutations This test checks that `mutation_partition::difference()` works correctly. One of the checks it does is: m1 + m2 == m1 + (m2 - m1). If the two mutations are identical but have compactable data, e.g. a shadowable tombstone shadowed by a row marker, the apply will collapse these, causing the above equality check to fail (as m2 - m1 is null). To prevent this, compact the two input mutations. Fixes: #8221 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>	2021-03-10 16:28:14 +01:00
Raphael S. Carvalho	c3b8757fa1	sstable_set: move all() implementation into sstable_set_impl The main motivation behind this is that by moving all() impl into sstable_set_impl, sstable_set no longer needs to maintain a list with all sstables, which in turn may disagree with the respective sstable_set_impl. This will be very important for compound_sstable_set_impl which will be built from existing sets, and will implement all() by combining the all() of its managed sets. Without this patch, we'd have to insert the same sstable at both compound set and also the set managed by it, to guarantee all() of compound set would return the correct data, which would be expensive and error prone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:13 -03:00
Raphael S. Carvalho	05b07c7161	sstable_set: preparatory work to change sstable_set::all() api users of sstable_set::all() rely on the set itself keeping a reference to the returned list, so user can iterate through the list assuming that it is alive all the way through. this will change in the future though, because there will be a compound set impl which will have to merge the all() of multiple managed sets, and the result is a temporary value. so even range-based loops on all() have to keep a ref to the returned list, to avoid the list from being prematurely destroyed. so the following code for (auto& sst : sstable_set.all()) { ...} becomes for (auto sstables = sstable_set.all(); auto& sst : sstables) { ... } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:12 -03:00
Avi Kivity	746798fd56	Merge "sstables: get rid of data_consume_context" from Botond " This class is basically a wrapper around a unique pointer and a few short convenience methods, but is otherwise a distraction in trying to untangle the maze that is the sstable reader class hierarchy. So this patchset folds it into its only real user: the sstable reader. " * 'data_consume_context_bye' of https://github.com/denesb/scylla: sstable: move data_consume_* factory methods to row.hh sstables: fold data_consume_context: into its users sstables: partition.cc: remove data_consume_* forward declarations	2021-03-10 16:45:32 +02:00
Nadav Har'El	a1725217e1	Merge 'alternator: coroutinize handle_api_request' from Piotr Sarna The indentation level is significantly reduced, and so is the number of allocations. The function signature is changed from taking an rvalue ref to taking the unique_ptr by value, because otherwise the coroutine captures the request as a reference, which results in use-after-free. Tests: unit(dev) Closes #8249 * github.com:scylladb/scylla: alternator: drop read_content_and_verify_signature alternator: coroutinize handle_api_request	2021-03-10 16:08:08 +02:00
Piotr Sarna	ba264e7199	alternator: drop read_content_and_verify_signature The only use of this helper function was inlined in a bigger coroutine, so it's no longer needed.	2021-03-10 14:42:53 +01:00
Piotr Sarna	35da51879f	alternator: coroutinize handle_api_request The indentation level is significantly reduced, and so is the number of allocations. The function signature is changed from taking an rvalue ref to taking the unique_ptr by value, because otherwise the coroutine captures the request as a reference, which results in use-after-free.	2021-03-10 14:42:52 +01:00
Botond Dénes	1aa2424dcf	sstable: move data_consume_* factory methods to row.hh	2021-03-10 15:40:50 +02:00
Botond Dénes	a06465a8f3	sstables: fold data_consume_context: into its users `data_consume_context` is a thin wrapper over the real context object and it does little more than forward method calls to it. The few methods doing more than mere forwarding can be folded into its single real user: `sstable_reader`.	2021-03-10 15:38:58 +02:00
Botond Dénes	37eb547224	sstables: partition.cc: remove data_consume_* forward declarations They don't seem to serve any purpose, everything builds fine without them.	2021-03-10 15:23:54 +02:00
Raphael S. Carvalho	f7cc431477	compaction_manager: Fix use-after-free in rewrite_sstables() Use-after-free introduced by `2cf0c4bbf1`. That's because compacting is moved into then_wrapped() lambda, so it's potentially freed on the next iteration of repeat(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>	2021-03-10 13:18:38 +02:00
Nadav Har'El	f41dac2a3a	alternator: avoid large contiguous allocation for request body Alternator request sizes can be up to 16 MB, but the current implementation had the Seastar HTTP server read the entire request as a contiguous string, and then processed it. We can't avoid reading the entire request up-front - we want to verify its integrity before doing any additional processing on it. But there is no reason why the entire request needs to be stored in one big contiguous allocation. This is always a bad idea. We should use a non-contiguous buffer, and that's the goal of this patch. We use a new Seastar HTTPD feature where we can ask for an input stream, instead of a string, for the request's body. We then begin the request handling by reading the content of this stream into a vector<temporary_buffer<char>> (which we alias "chunked_content"). We then use this non-contiguous buffer to verify the request's signature and if successful - parse the request JSON and finally execute it. Beyond avoiding contiguous allocations, another benefit of this patch is that while parsing a long request composed of chunks, we free each chunk as soon as its parsing completed. This reduces the peak amount of memory used by the query - we no longer need to store both unparsed and parsed versions of the request at the same time. Although we already had tests with requests of different lengths, most of them were short enough to only have one chunk, and only a few had 2 or 3 chunks. So we also add a test which makes a much longer request (a BatchWriteItem with large items), which in my experiment had 17 chunks. The goal of this test is to verify that the new signature and JSON parsing code which needs to cross chunk boundaries work as expected. Fixes #7213. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>	2021-03-10 09:22:34 +01:00
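The idea of verifying a body held in non-contiguous chunks can be sketched as below. This is a generic illustration with Python's hashlib, not Alternator's code: SHA-256 is assumed here only as an example digest, and `split_into_chunks` and `digest_chunked` are hypothetical helper names.

```python
import hashlib


def split_into_chunks(body, chunk_size):
    # Stand-in for reading an input stream into a list of buffers.
    return [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]


def digest_chunked(chunks):
    # Feed each chunk to the hash in order; no contiguous copy is needed.
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()
```

An incremental hash over the chunks gives the same digest as hashing the contiguous body, which is what makes it possible to check integrity without ever joining the chunks.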
Juliusz Stasiewicz	382545a614	docs: explain SSL/non-SSL and shard-aware CQL ports I added short description of shard-aware ports + updated the rules for disabling ports and enabling SSL introduced by #7992. Fixes #8146 Closes #8152	2021-03-09 22:48:30 +02:00
Tomasz Grabiec	c9c2beabc0	Merge "raft: replication tests as individual boost tests" from Alejo * alejo/raft-tests-replication-boost-5: raft: replication test: use Seastar random generator raft: replication test: rename drop_replication raft: replication test: change to Boost test raft: replication test: id helper functions raft: replication test: improve handling connectivity raft: replication test: parametrize snapshots raft: replication test: parametrize drop_replication raft: replication test: remove unused configuration raft: replication test: add license	2021-03-09 17:58:59 +01:00
Pavel Emelyanov	096e452db9	test: Fix exit condition of row_cache_test::test_eviction_from_invalidated The test populates the cache, then invalidates it, then tries to push huge (10x times the segment size) chunks into seastar memory hoping that the invalid entries will be evicted. The exit condition on the last stage is -- total memory of the region (sum of both -- used and free) becomes less than the size of one chunk. However, the condition is wrong, because cache usually contains a dummy entry that's not necessarily on lru and on some test iteration it may happen that evictable size < chunk size < evictable size + dummy size In this case test fails with bad_alloc being unable to evict the memory from under the dummy. fixes: #7959 tests: unit(row_cache_test), unit(the failing case with the triggering seed from the issue + 200 times more with random seeds) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210309134138.28099-1-xemul@scylladb.com>	2021-03-09 17:57:52 +01:00
Alejo Sanchez	f67b85e2b3	raft: replication test: use Seastar random generator Use the random generator provided by Seastar test suite. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	1bf10a87c6	raft: replication test: rename drop_replication Rename drop_replication to packet_drops for readability. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	6e193ee3bf	raft: replication test: change to Boost test Change test/raft directory to Boost test type. Run replication_test cases with their own test. RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet loss named name_drops. The directory test/raft is changed to host Boost tests instead of unit. While there improve the documentation. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	8d9c797954	raft: replication test: id helper functions In raft the UUID 0 is a special case so server ids start at 1. Add two helper functions. Convert local 0-based id to raft 1-based UUID. And from UUID to raft_id. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:50:12 -04:00
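A minimal sketch of the two id helpers described above, using Python's uuid module. The function names are hypothetical and the real helpers live in the C++ test; only the 0-based to 1-based mapping is taken from the commit message.

```python
import uuid


def to_raft_uuid(local_id):
    # UUID 0 is special in raft, so local id 0 maps to UUID 1, and so on.
    return uuid.UUID(int=local_id + 1)


def to_local_id(raft_uuid):
    # Inverse mapping: raft 1-based UUID back to the local 0-based id.
    return raft_uuid.int - 1
```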
Alejo Sanchez	0ffa450222	raft: replication test: improve handling connectivity Change global map of disconnected servers to a more intuitive class connected. The class is callable for the most common case connected(id). Methods connect(), disconnect(), and all() are provided for readability instead of directly calling map methods (insert, erase, clear). They also support both numerical (0 based) and server_id (UUID, 1 based) ids. The actual shared map is kept in a lw_shared_ptr. The class is passed around to be copy-constructed which is practically just creating a new lw_shared_ptr. Internally it tracks disconnected servers but externally it's more intuitive to use connect instead of disconnect. So it reads "connected id" and "not disconnected id", without double negatives. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:39:29 -04:00
Alejo Sanchez	7a644f37d3	raft: replication test: parametrize snapshots Snapshots and persisted snapshots created per test instead of globals. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	f72e89fcfe	raft: replication test: parametrize drop_replication Pass drop_replication down instead of keeping it global. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	5a03670f91	raft: replication test: remove unused configuration Remove test case configuration as it's not implemented yet. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	efc6681cd6	raft: replication test: add license Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Piotr Sarna	d473bc9b06	Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani This is a reworked submission of #7686 which has been reverted. This series fixes some race conditions in MV/SI schema creation and load, we spotted some places where a schema without a base table reference can sneak into the registry. This can cause an unrecoverable error since write commands with those schemas can't be issued from other nodes. Most of those cases can occur in 2 main and uncommon scenarios: in a mixed cluster (during an upgrade) and in a small window after a view or base table altering. Fixes #7709 Closes #8091 * github.com:scylladb/scylla: database: Fix view schemas in place when loading global_schema_ptr: add support for view's base table materialized views: create view schemas with proper base table reference. materialized views: Extract fix legacy schema into its own logic	2021-03-09 16:27:34 +01:00
Asias He	61ac8d03b9	repair: Add ignore_nodes option In some cases, user may want to repair the cluster, ignoring the node that is down. For example, run repair before run removenode operation to remove a dead node. Currently, repair will ignore the dead node and keep running repair without the dead node but report the repair is partial and report the repair is failed. It is hard to tell if the repair is failed only due to the dead node is not present or some other errors. In order to exclude the dead node, one can use the hosts option. But it is hard to understand and use, because one needs to list all the "good" hosts including the node itself. It will be much simpler, if one can just specify the node to exclude explicitly. In addition, we support ignore nodes option in other node operations like removenode. This change makes the interface to ignore a node explicitly more consistent. Refs: #7806 Closes #8233	2021-03-09 16:03:13 +01:00
Gleb Natapov	2a41ad0b57	raft: add testing for non-voting members Add tests to check if quorum (for leader election and commit index purposes) is calculated correctly in the presence of non-voting members. Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>	2021-03-09 13:51:09 +01:00
Gleb Natapov	dd6ba3d507	raft: add non-voting member support This patch adds a support for non-voting members. Non voting member is a member which vote is not counted for leader election purposes and commit index calculation purposes and it cannot become a leader. But otherwise it is a normal raft node. The state is needed to let new nodes to catch up their log without disturbing a cluster. All kind of transitions are allowed. A node may be added as a voting member directly or it may be added as non-voting and then changed to be voting one through additional configuration change. A node can be demoted from voting to non-voting member through a configuration change as well. Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>	2021-03-09 13:47:48 +01:00
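How non-voting members stay out of quorum calculations can be sketched as follows. This is an assumption-laden illustration, not the raft implementation in the tree: the function names, the set-of-ids representation, and the `match_index` dictionary are all hypothetical.

```python
def has_election_quorum(voters, votes_granted):
    # Only votes from voting members count; non-voters are ignored.
    granted = len(set(voters) & set(votes_granted))
    return granted > len(voters) // 2


def commit_index(match_index, voters):
    # Highest log index replicated on a majority of voting members.
    # Progress of non-voting members does not influence the result.
    indices = sorted((match_index.get(v, 0) for v in voters), reverse=True)
    majority = len(indices) // 2 + 1
    return indices[majority - 1]
```

For example, with voters A, B, C and a non-voter D, a vote from D does not help a candidate reach quorum, and D being far ahead in its log does not advance the commit index.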
Raphael S. Carvalho	863b95aa34	sstables: remove bag_sstable_set bag_sstable_set can be replaced with partitioned_sstable_set, which will provide the same functionality, given that L0 sstables go to a "bag" rather than interval map. STCS, for example, will only have L0 sstables, so it will get exactly the same behavior with partitioned_sstable_set. it also gives us the benefit of keeping the leveled sstables in the interval map if user has switched from LCS to STCS, until they're all compacted into size-tiered ssts. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-09 08:39:48 -03:00
Avi Kivity	9038a81317	treewide: drop SEASTAR_CONCEPT Since Scylla requires C++20, there is no need to protect concept definitions or usages with SEASTAR_CONCEPT; it just clutters the code. This patch therefore removes all uses. Closes #8236	2021-03-08 16:04:20 +01:00
Asias He	dc40184faa	gossip: Handle timeout error in gossiper::do_shadow_round Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb is not handled in gossiper::do_shadow_round. If the GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes timeout, gossiper::do_shadow_round will throw an exception and fail the whole boot up process. It is fine that some of the remote nodes timeout in shadow round. It is not a must to talk to all nodes. This patch fixes an issue we saw recently in our sct tests: ``` INFO \| scylla[1579]: [shard 0] init - Shutting down gossiping INFO \| scylla[1579]: [shard 0] gossip - gossip is already stopped INFO \| scylla[1579]: [shard 0] init - Shutting down gossiping was successful ... ERR \| scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out) ``` Fixes #8187 Closes #8213	2021-03-08 13:03:41 +01:00
Nadav Har'El	28804a50f7	alternator-test: test that index can't be a name reference (#xyz) We already have a test which verifies that DynamoDB and Alternator do not allow an index in an attribute path - like a[0].b - to be a value reference - a[:xyz].b. We forgot to verify that the index also can't be a name reference - a[#xyz].b is a syntax error. So here we add a test which confirms that this is indeed the case - DynamoDB doesn't allow it, and neither does Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>	2021-03-08 10:17:19 +01:00
Avi Kivity	938761f49f	types.cc: drop unused #include "compaction_garbage_collector.hh" Garbage-collect unused #includes. Closes #8232	2021-03-08 06:44:03 +01:00
Takuya ASADA	2d9feaacea	scylla_raid_setup: don't abort using raiddev when array_state is 'clear' On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes '/dev/md0 is already using' (issue #7627). So we merged the patch to find free mdX (`587b909`). However, looking into /proc/mdstat of the AMI, it actually says no active md device available: ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat Personalities : unused devices: <none> We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True, but according to kernel doc, the file may be available even when the array is STOPPED: clear No devices, no size, no level Writing is equivalent to STOP_ARRAY ioctl https://www.kernel.org/doc/html/v4.15/admin-guide/md.html So we should also check array_state != 'clear', not just array_state existence. Fixes #8219 Closes #8220	2021-03-07 18:30:11 +02:00
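The check described above might look roughly like this. It is a sketch rather than the actual scylla_raid_setup code: `md_in_use` and its `sys_block` parameter are hypothetical, and the parameter exists only so the function can be exercised against a temporary directory.

```python
import os


def md_in_use(md_name, sys_block="/sys/block"):
    # Path layout taken from the commit message: /sys/block/mdX/md/array_state
    state_path = os.path.join(sys_block, md_name, "md", "array_state")
    if not os.path.exists(state_path):
        return False
    with open(state_path) as f:
        # 'clear' means no devices, no size, no level: the array is stopped,
        # so the device should be treated as free.
        return f.read().strip() != "clear"
```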
Avi Kivity	1287a5e1d0	test: index_reader_assertions: fix misuse of trichotomic comparator in has_monotonic_positions has_monotonic_positions() wants to check for a greater-than-or-equal-to relation, but actually tests for not-equal, since it treats a trichotomic comparator as a less-than comparator. This is clearly seen in the BOOST_FAIL message just below. Fix by aligning the test with the intended invariant. Luckily, the tests still pass. Ref #1449. Closes #8222	2021-03-07 13:44:37 +02:00
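The class of bug described above can be illustrated generically. The exact expression in has_monotonic_positions() is not shown in the message, so the functions below are hypothetical stand-ins that only demonstrate why treating a three-way result as a boolean degenerates into a not-equal test.

```python
def tri_compare(a, b):
    # Three-way comparator: negative if a < b, zero if equal, positive if a > b.
    return (a > b) - (a < b)


def buggy_check(prev, cur):
    # Uses the three-way result as if it were a boolean "less than".
    # Any non-zero value is truthy, so this really tests prev != cur.
    return bool(tri_compare(prev, cur))


def fixed_check(prev, cur):
    # Intended invariant: cur is greater than or equal to prev.
    return tri_compare(prev, cur) <= 0
```

The buggy form accepts a decreasing pair and rejects an equal pair, which is the opposite of what a monotonicity check should do in both cases.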
Eliran Sinvani	0220786710	database: Fix view schemas in place when loading On restart the view schemas are loaded and might contain old views with an unmarked computed column. We already have code to update the schema, but before we do it we load the view as is. This is not desired since once registered, this view version can be used for writes, which is forbidden since we will spot a non-computed column which is in the view's primary key but not in the base table at all. To solve this, in addition to altering the persistent schema, we fix the view's loaded schema in place. This is safe since a computed column is just involved in generating a value for this column when creating a view update, so the effect of this manipulation stays internal. The second stage of the in place fixing is to persist the changes made in the in place fixing so the view is ready for the next node restart, in particular the `computed_columns` table.	2021-03-07 12:57:16 +02:00
Eliran Sinvani	04de770566	global_schema_ptr: add support for view's base table Up until now, the global_schema_ptr object was a crack through which a view schema with an uninitialized base reference could sneak. Even if the schema itself contained a base reference, the base schema didn't carry over to shards different than the shard on which the global_schema_ptr was created. Since once the schema is in the registry it might be used for everything (reads and writes), we also need to make sure that global schemas for an incomplete view schemas will not be created.	2021-03-07 12:50:42 +02:00
Eliran Sinvani	9162748b18	materialized views: create view schemas with proper base table reference. Newly created view schemas don't always have their base info, this is bad since such schemas don't support read nor write. This leaves us vulnerable to a race condition where there is an attempt to use this schema for read or write. Here we initialize the base reference and also reconfigure the view to conform to the new computed column type, which makes it usable for write and not only reads. We do it for views created in the migration manager following announcements and also for copied schemas.	2021-03-07 12:50:42 +02:00
Eliran Sinvani	39cd9dae4e	materialized views: Extract fix legacy schema into its own logic We extract the logic for fixing the view schema into its own logic as we will need to use it in more places in the code. This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since it becomes a two liner wrapper for this logic. We also remove it here and replace the call to it with the equivalent code.	2021-03-07 12:50:42 +02:00
Takuya ASADA	53c7600da8	dist: increase fs.aio-max-nr value for other apps The current fs.aio-max-nr value, cpu_count() * 11026, is exactly what scylla uses; if other apps in the environment also try to use aio, aio slots will run out. So increase the value by 65536 for other apps. Related #8133 Closes #8228	2021-03-07 12:11:36 +02:00
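The resulting value can be sketched as a small calculation. The two numbers come from the commit message; the constant and function names are hypothetical and the real setting is applied by the dist scripts.

```python
AIO_SLOTS_PER_CPU = 11026         # per-CPU figure quoted in the message
AIO_SLOTS_FOR_OTHER_APPS = 65536  # extra headroom added by this change


def aio_max_nr(cpus):
    # Old value was cpus * 11026; the change adds a fixed reserve on top.
    return cpus * AIO_SLOTS_PER_CPU + AIO_SLOTS_FOR_OTHER_APPS
```

On an 8-CPU machine this gives 8 * 11026 + 65536 = 153744.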
Piotr Sarna	7106ca27e6	service: reduce continuation length for paxos pruning A pair of (finally, handle_exception) is reduced to a single use of then_wrapped(), which saves an allocation. Message-Id: <01949e286db93397209435a85fcc46a8beef6d24.1614937462.git.sarna@scylladb.com>	2021-03-07 11:59:10 +02:00
Nadav Har'El	ad563c6279	Update tools/java submodule Fixes an sstableloader bug where we quoted twice column names that had to be quoted, and therefore failed on such tables - and in particular Alternator tables which always have a column called ":attrs". Fixes #8229 * tools/java 142f517a23...c5d9e8513e (1): > sstableloader: Only escape column names once	2021-03-07 10:33:49 +02:00
Botond Dénes	debaae41f9	mutation_partition: operator<<(mutation_partition::printer) Include row tombstones in the row printout. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210305094106.210249-1-bdenes@scylladb.com>	2021-03-05 14:39:39 +02:00
Botond Dénes	45471419d0	multishard_mutation_query: re-enable reverse queries `034cb81323` and `0f0c3be` disallowed reverse partition-range scans based on the observation that the CQL frontend disallows them, assuming that other client APIs also disallow them. As it turns out this is not true and there is at least one client API (Thrift) which does allow reverse range scans. So re-enable them. Fixes: #8211 Tests: unit(release), dtest(thrift_tests.py) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>	2021-03-04 17:06:16 +02:00
Nadav Har'El	acfa180766	cql-pytest: recognize when Scylla crashes Before this patch, if Scylla crashes during some test in cql-pytest, all tests after it will fail because they can't connect to Scylla - and we can get a report on hundreds of failures without a clear sign of where the real problem was. This patch introduces an autouse fixture (i.e., a fixture automatically used by every test) which tries to run a do-nothing CQL command after each test. If this CQL command fails, we conclude that Scylla crashed and report the test in which this happened - and exit pytest instead of failing a hundred more tests. Fixes #8080 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>	2021-03-04 16:00:00 +02:00
Raphael S. Carvalho	1226fc755f	compaction_manager: Increase cleanup compaction resilience when low on disk space In a scenario where node is running out of disk space, which is a common cause of cluster expansion, it's very important to clean up the smallest files first to increase the chances of success when the biggest files are reached down the road. That's possible given that cleanup operates on a single file at a time, and that the smaller the file the smaller the space requirement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>	2021-03-04 15:11:06 +02:00
Botond Dénes	25367deb01	mutation_partition: make row::consume_with() exception safe This function currently eagerly decrements `_size`, before `func()` is invoked. If `func()` throws the consumption fails but the size remains decremented. If this happens right at the last element in the row, the `row::empty()` will incorrectly return `true`, even though there is still one cell left in it. Move the decrement after the `func()` invocation to avoid this by only decrementing if the consumption was successful. Fixes: #8154 Tests: unit(mutation_test:release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>	2021-03-04 15:07:15 +02:00
Piotr Sarna	added53b7d	Merge 'hints: use a soft disk space limit in hints commitlog' from Piotr Dulikowski A recent change to the commitlog (`4082f57`) caused its configurable size limit to be strictly enforced - after reaching the limit, new segments wouldn't be allocated until some of the previous segments are freed. This flow can work for the regular commitlog, however the hints commitlog does not delete the segments itself - instead, hints manager recreates its commitlog every 10 seconds, picks up segments left by the previous instance and deletes each segment manually only after all hints are sent out from a segment. Because of the non-standard flow, it is possible that the hints commitlog fills up and stops accepting more hints. Hints manager uses a relatively low limit for each commitlog instance (128MB divided by shard count), so it's not hard to fill it up. What's worse, hints manager tries to acquire file_update_mutex in exclusive mode before re-creating the commitlog, while hints waiting to be written acquire this lock in shared mode - which causes hints flushing to completely deadlock and no more hints be admitted to the commitlog. The queue of hints waiting to be admitted grows very quickly and soon all writes which could result in a hint being generated are rejected with OverloadedException. To solve this problem, it is now possible to bring back the soft disk space limit by setting a flag in commitlog's configuration. Tests: - unit(dev) - wrote hints for 15 minutes in order to see if it gets stuck again Fixes #8137 Closes #8206 * github.com:scylladb/scylla: hints_manager: don't use commitlog hard space limit commitlog: add an option to allow going over size limit	2021-03-04 12:24:05 +01:00
Tomasz Grabiec	d6a94a7db1	Merge 'Make dht::token tri_compare safer' from Avi Kivity tri_compare() returns an int, which is dangerous as a tri_compare can be misused where a less_compare is expected. To prevent such misuse, convert the interval<> template to accept comparators that return std::strong_ordering, and then convert dht::token's comparator to do the same. Ref #1449. Closes #8181 * github.com:scylladb/scylla: dht: convert token tri_compare to std::strong_ordering interval: support C++20 three-way comparisons	2021-03-04 11:55:08 +01:00
Nadav Har'El	3e66a5cd43	Merge 'More Redis cleanups' from Pekka Enberg This pull request removes seastar namespace imports from the header files. There are some additional cleanups to make that easier and to remove some commented out code. Closes #8202 * github.com:scylladb/scylla: redis: Remove seastar namespace import from query_processor.hh redis: Switch to seastar::sharded<> in query_procesor.hh redis: Remove seastar namespace import from query_utils.hh redis: Remove seastar namespace import from reply.hh redis: Remove commented out code from options.hh redis: Remove seastar namespace import from options.hh redis: Remove seastar namespace import from service.hh redis: Switch to seastar::sharded<> in service.{hh,cc} redis: Remove unneeded include from keyspace_utils.hh redis: Remove seastar namespace import from keyspace_utils.hh redis: Remove seastar namespace import from command_factory.hh redis: Fix include path in command_factory.hh redis: Remove unneeded includes from command_factory.hh	2021-03-04 11:08:24 +02:00
Pekka Enberg	6066db7c90	Update tools/jmx submodule * tools/jmx bac7d0b...15c1d4f (2): > StorageService: Add a method to return the uptime > Bump Jackson version in scylla-apiclient	2021-03-04 10:56:37 +02:00
Nadav Har'El	e12e57c915	Merge 'Fix alternator streams management regression' from Calle Wilund Refs: #8012 Fixes: #8210 With the update to CDC generation management, the way we retrieve and process these changed. One very bad bug slipped through though; the code for getting versioned streams did not take into account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator shard info became quite rump-stumped, leading to more or less no data depending on when generations changed w.r. data. Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check. Closes #8209 * github.com:scylladb/scylla: alternator::streams: Use better method for generation timestamp system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range	2021-03-04 09:43:56 +02:00
Pekka Enberg	1d8a94f941	Update tools/jmx submodule * tools/jmx c2fc96b...bac7d0b (1): > Merge 'Fix locking in APIBuilder.remove()' from Pekka Enberg	2021-03-03 18:30:48 +02:00
Calle Wilund	8bbc976ff1	alternator::streams: Use better method for generation timestamp Get timestamp via system_distributed, instead of local gen.	2021-03-03 15:46:38 +00:00
Calle Wilund	5da0129775	system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp Since we have a table of cdc version timestamps, conveniently sorted in reverse, we can just query this and get the latest known gen ts.	2021-03-03 15:44:54 +00:00
Calle Wilund	5a69250d7e	system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range With the new scheme for cdc generation management, one of the last changes was to make the time ordering of the stream timestamps reversed. However, cdc_get_versioned_streams forgot to take this into account when sifting out timestamp ranges for stream retrieval (based on low mark). Fixed by doing reverse iteration.	2021-03-03 15:41:42 +00:00
Tomasz Grabiec	3cb01f218f	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev.git/raft-confchange-test-v4: raft: fix spelling raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::in_memory_size() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-03-03 16:29:40 +01:00
Tomasz Grabiec	0dc57db248	Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja" This reverts commit `f94f70cda8`, reversing changes made to `5206a97915`. The version of the series that was merged was not the latest one. Revert prior to merging the latest one.	2021-03-03 16:29:02 +01:00
Avi Kivity	facc7c370e	Update tools/jmx submodule * tools/jmx 8073af6...c2fc96b (1): > APIBuilder: Remove RW-lock in JMX server repository wrapper Fixes #7991.	2021-03-03 15:41:09 +02:00
Avi Kivity	aae43e1a20	Merge 'Untyped_result_set: make non-copying and fragment retaining' from Calle Wilund Refs #7961 Fixes #8014 The "untyped_result_set" object was created for small, internal access to cql-stored metadata. It is nowadays used for rather more than that (cdc). This has the potential of mixing badly with the fact that the type does deep copying of data and linearizes all (not to mention handles multiple rows rather inefficiently). Instead of doing a deep copy of input, we assume ownership and build rows of the views therein, potentially retaining fragmented data as-is and avoiding premature linearization. Note that this is not all sugar and flowers though. Any data access will by nature be more expensive, and the view collections we create are potentially just as expensive as copying for small cells. Otoh, it allows writing code using this that avoids data copying, depending on destination. v2: * Fixed wrong collection reserved in visitor * Changed row index from shared ptr to ref * Moved typedef * Removed non-existing constructors * Added const ref to index build * Fixed raft usage after rebase v3: * Changed shared_ptr to unique Closes #8015 * github.com:scylladb/scylla: untyped_result_set: Do not copy data from input store (retain fragmented views) result_generator: make visitor callback args explicit optionals listlike_partial_deserializing_iterator: expose templated collection routines	2021-03-03 13:13:18 +02:00
Nadav Har'El	4e3db5297a	cql-pytest: rework tests for filtering leaving out most rows Previously, we had two tests demonstrating issue #7966. But since then, our understanding of this issue has improved which resulted in issue #8203, so this patch improves those tests and makes them reproduce the new issue. Importantly, we now know that this problem is not specific to a full-table scan, and also happens in a single-partition scan, so we fix the test to demonstrate this (instead of the old test, which missed the problem so the test passed). Both tests pass on Cassandra, and fail on Scylla. Refs #8203. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>	2021-03-03 11:22:08 +01:00
Calle Wilund	e4d6c8904f	untyped_result_set: Do not copy data from input store (retain fragmented views) Refs #7961 Fixes #8014 Instead of doing a deep copy of input, we assume ownership and build rows of the views therein, potentially retaining fragmented data as-is and avoiding premature linearization. Note that this is not all sugar and flowers though. Any data access will by nature be more expensive, and the view collections we create are potentially just as expensive as copying for small cells. Otoh, it allows writing code using this that avoids data copying, depending on destination. v2: * Fixed wrong collection reserved in visitor * Changed row index from shared ptr to ref * Moved typedef * Removed non-existing constructors * Added const ref to index build * Fixed raft usage after rebase v3: * Changed shared_ptr to unique	2021-03-03 10:19:46 +00:00
Calle Wilund	353730d4bb	result_generator: make visitor callback args explicit optionals This allows a visitor to separate temporaries (non-optional views) from store backed views (optionals) when traversing.	2021-03-03 10:19:46 +00:00
Calle Wilund	bba43ce31a	listlike_partial_deserializing_iterator: expose templated collection routines To allow using fragmented types as input.	2021-03-03 10:19:46 +00:00
Nadav Har'El	0fea089b37	Merge 'Fix reading whole requests during shedding' from Piotr Sarna When shedding requests (e.g. due to their size or number exceeding the limits), errors were returned right after parsing their headers, which resulted in their bodies lingering in the socket. The server always expects a correct request header when reading from the socket after the processing of a single request is finished, so shedding the requests should also take care of draining their bodies from the socket. Fixes #8193 Closes #8194 * github.com:scylladb/scylla: cql-pytest: add a shedding test transport: return error on correct stream during size shedding transport: return error on correct stream during shedding transport: skip the whole request if it is too large transport: skip the whole request during shedding	2021-03-03 08:52:48 +02:00
Piotr Sarna	4499f89916	cql-pytest: add a shedding test This scylla-only test case tries to push a too-large request to Scylla, and then retries with a smaller request, expecting a success this time. Refs #8193	2021-03-03 07:08:55 +01:00
Pekka Enberg	310b5c9592	redis: Fix license text in server.hh The search and replace pattern went a bit overboard. Let's fix up the license text. Message-Id: <20210302171150.3346-1-penberg@scylladb.com>	2021-03-03 07:06:45 +01:00
Dejan Mircevski	05497fe14d	cql3/maps: Drop redundant if condition Accidentally introduced in `9eed26ca3d`, it can never be true due to code above it. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8201	2021-03-03 07:06:45 +01:00
Nadav Har'El	d6335b7fda	test/alternator: better tests of oversized requests Like DynamoDB, Alternator rejects requests larger than some fixed maximum size (16MB). We had a test for this feature - test_too_large_request, but it was too blunt, and missed two issues: Refs #8195 Refs #8196 So this patch adds two better tests that reproduce these two issues: First, test_too_large_request_chunked verifies that an oversized request is detected even if the body is sent with chunked encoding. Second, both tests - test_too_large_request_chunked and test_too_large_request_content_length - verify that the rather limited (and arguably buggy) Python HTTP client is able to read the 413 status code - and doesn't report some generic I/O error. Both tests pass on DynamoDB, but fail on Alternator because of these two open issues. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>	2021-03-03 07:06:45 +01:00
Nadav Har'El	c6ca1ec643	cql-pytest: add reproducers for two filtering-related issues The main goal of this patch is to add a reproducer for issue #7966, where partition-range scan with filtering that begins with a long string of non-matches aborts the query prematurely - but the same thing is fine with a single-partition scan. The test, test_filtering_with_few_matches, is marked as "xfail" because it still fails on Scylla. It passes on Cassandra. I put a lot of effort into making this reproducer fast - the dev-build test takes 0.4 seconds on my laptop. Earlier reproducers for the same problem took as much as 30 seconds, but 0.4 seconds turns this test into a viable regression test. We also add a test, test_filter_on_unset, which reproduces issue #6295 (or the duplicate #8122); that issue was already solved, so this test passes. Refs #6295 Refs #7966 Refs #8122 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>	2021-03-03 07:06:45 +01:00
Calle Wilund	58489dc003	cql3::restrictions: Add SCYLLA_CLUSTERING_BOUND keyword for sstableloader Refs #8093 Refs /scylladb/scylla-tools-java#218 Adds a keyword that can preface value tuples in (a, b, c) > (1, 2, 3) expressions, forcing the restriction to bypass column sort order treatment, and instead just create the raw ck bounds accordingly. This is a very limited and simple version, but since we only need to cover the exact syntax above, this should be sufficient. v2: * Add small cql test v3: * Added comment in multi_column_restriction::slice, on what "mode" means and is for * Added small document of our internal CQL extension keywords, including this. v4: * Added a few more cases to tests to verify multi-column restrictions * Reworded docs a bit v5: * Fixed copy-paste error in comment v6: * Added negative (error) test cases v7: * Added check + reject of trying to combine SCYLLA_CLUST... slice and normal one Closes #8094	2021-03-03 07:06:45 +01:00
Pekka Enberg	9d54a3e743	redis: Remove seastar namespace import from query_processor.hh	2021-03-02 18:39:30 +02:00
Pekka Enberg	27c5041c86	redis: Switch to seastar::sharded<> in query_procesor.hh	2021-03-02 18:38:41 +02:00
Pekka Enberg	ee8fe53b3c	redis: Remove seastar namespace import from query_utils.hh	2021-03-02 18:37:31 +02:00
Pekka Enberg	c90c1ccd44	redis: Remove seastar namespace import from reply.hh	2021-03-02 18:36:30 +02:00
Pekka Enberg	1075d72780	redis: Remove commented out code from options.hh	2021-03-02 18:34:46 +02:00
Pekka Enberg	1c222fda65	redis: Remove seastar namespace import from options.hh	2021-03-02 18:34:30 +02:00
Pekka Enberg	d452c8f42e	redis: Remove seastar namespace import from service.hh	2021-03-02 18:33:31 +02:00
Pekka Enberg	d0594a86aa	redis: Switch to seastar::sharded<> in service.{hh,cc}	2021-03-02 18:30:39 +02:00
Avi Kivity	ee9db75210	Merge 'Clean up Redis transport layer' from Pekka Enberg The Redis transport layer seems to have originated as a copy-paste of the CQL transport layer. This pull request removes bunch of unused and commented out bits of code, and also does some minor cleanups like organizing includes, to make the code more readable. Closes #8198 * github.com:scylladb/scylla: redis: Remove unused to_bytes_view() function from server.cc redis: Remove unused tracing_request_type enum redis: Remove unneeded connection friend declaration redis: Remove unused process_request_executor friend declaration redis: Remove unused _request_cpu class member redis: Remove commented out code from server.hh redis: Remove duplicate request.hh include redis: Remove unused db::config forward declaration redis: Remove unused fmt_visitor forward declaration redis: Organize includes in server.{cc,hh} redis: Switch to seastar::sharded<> redis: Remove redundant access modifiers from server.hh	2021-03-02 18:27:38 +02:00
Pekka Enberg	097aaa6dc2	redis: Remove unneeded include from keyspace_utils.hh	2021-03-02 18:16:29 +02:00
Pekka Enberg	7f4de3f915	redis: Remove seastar namespace import from keyspace_utils.hh	2021-03-02 18:15:37 +02:00
Pekka Enberg	bf47b58b8a	redis: Remove seastar namespace import from command_factory.hh	2021-03-02 18:13:49 +02:00
Pekka Enberg	92e257d5bd	redis: Fix include path in command_factory.hh	2021-03-02 18:13:08 +02:00
Pekka Enberg	ac4b8e4534	redis: Remove unneeded includes from command_factory.hh	2021-03-02 18:12:30 +02:00
Piotr Dulikowski	376da49cf4	hints_manager: don't use commitlog hard space limit This commit disables the hard space limit applied by commitlogs created to store hints. The hard limit causes problems for hints because they use small-sized commitlogs to store hints (128MB, currently). Instead of letting the commitlog delete the segments itself, it recreates the commitlog every 10 seconds and manually deletes old segments after all hints are sent out from them. If the 128MB limit is reached, the hints manager will get stuck. A future which puts hint into commitlog holds a shared lock, and commitlog recreation needs to get an exclusive lock, which results in a deadlock. No more hints will be admitted, and eventually we will start rejecting writes with OverloadedException due to too many hints waiting to be admitted to the commitlog. By disabling the hard limit for hints commitlog, the old behavior is brought back - commitlog becomes more conservative with the space used after going over its size limit, but does not block until some of its segments are deleted.	2021-03-02 16:53:50 +01:00
Piotr Sarna	8635094144	transport: return error on correct stream during size shedding When a request is shed due to being too large, its response was sent with stream id 0 instead of the stream id that matches the communication lane. That in turn confused the client, which is no longer the case.	2021-03-02 15:10:46 +01:00
Piotr Sarna	d6ea6937ee	transport: return error on correct stream during shedding When a request is shed due to exceeding the max number of concurrent requests, its response was sent with stream id 0 instead of the stream id that matches the communication lane. That in turn confused the client, which is no longer the case.	2021-03-02 15:10:46 +01:00
Pekka Enberg	01a785f561	redis: Remove unused to_bytes_view() function from server.cc	2021-03-02 14:29:52 +02:00
Pekka Enberg	fb6eecfae2	redis: Remove unused tracing_request_type enum	2021-03-02 14:29:52 +02:00
Pekka Enberg	8d79deb973	redis: Remove unneeded connection friend declaration	2021-03-02 14:29:51 +02:00
Pekka Enberg	ff81f7bc23	redis: Remove unused process_request_executor friend declaration	2021-03-02 14:29:51 +02:00
Pekka Enberg	87c5968602	redis: Remove unused _request_cpu class member	2021-03-02 14:29:51 +02:00
Pekka Enberg	11fa32e8c9	redis: Remove commented out code from server.hh	2021-03-02 14:29:51 +02:00
Pekka Enberg	ddab15c47f	redis: Remove duplicate request.hh include	2021-03-02 14:29:51 +02:00
Pekka Enberg	07bd125a59	redis: Remove unused db::config forward declaration	2021-03-02 14:29:51 +02:00
Pekka Enberg	5a7e6b6c09	redis: Remove unused fmt_visitor forward declaration	2021-03-02 14:29:51 +02:00
Pekka Enberg	298bf19981	redis: Organize includes in server.{cc,hh}	2021-03-02 14:29:51 +02:00
Pekka Enberg	23c2f47054	redis: Switch to seastar::sharded<>	2021-03-02 14:29:51 +02:00
Pekka Enberg	7bd4ff9d75	redis: Remove redundant access modifiers from server.hh	2021-03-02 14:13:45 +02:00
Avi Kivity	5f4bf18387	Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros" This reverts commit `31909515b3`, reversing changes made to `ef97adc72a`. It shows many serious regressions in dtest. Fixes #8197.	2021-03-02 13:21:22 +02:00
Takuya ASADA	870c3a28c1	scylla_setup: strip spaces of comma separated list On the RAID prompt, we can type a disk list like this: /dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1 However, if the list contains spaces, it doesn't work: /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1 This is because the script mistakenly treats the space as part of a device path. So we need to strip() each item of the input. Fixes #8174 Closes #8190	2021-03-02 12:48:18 +02:00
Piotr Sarna	4a24d7dca0	transport: skip the whole request if it is too large When a request is shed due to being too large, only the header was actually read, and the body was still stuck in the socket - and would be read in the next iteration, which would expect to actually read a new request header. Instead, the whole message is now skipped, so that a new request can be correctly read and parsed. Fixes #8193	2021-03-02 10:10:19 +01:00
Piotr Sarna	3eb7e768cb	transport: skip the whole request during shedding When a request is shed due to exceeding the number of max concurrent requests, only its header was actually read, and the body was still stuck in the socket - and would be read in the next iteration, which would expect to actually read a new request header. Instead, the whole message is now skipped, so that a new request can be correctly read and parsed. Refs #8193	2021-03-02 10:10:19 +01:00
Avi Kivity	10364fca6e	Merge "Build query::result directly in range scan queries" from Botond " Currently range scans build their results on the replica in the `reconcilable_result` format, that -- as its name suggests -- is normally used for reconciliation (read repair). As such this result format is quite inefficient for normal queries: it contains all columns and all tombstones in the requested range. These are all unnecessary for normal queries which only want live data and only those columns that are requested by the user. Furthermore, as the coordinator works in terms of `query::result` for normal queries anyway, this intermediate result has to be converted to the final `query::result` format adding an unnecessary intermediate conversion step. This series gets rid of this problem by introducing `query_data_on_all_shards()`, a variant of `query_mutations_on_all_shards()` that builds `query::result` directly. Reverse queries still use the old intermediate method behind the scenes. Fixes #8061 Refs #7434 Tests: unit(release, debug) " * 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla: cql_query_test: add unit test for the more efficient range scan result format test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter cql_query_test: test_query_limit: clean up scheduling groups storage_proxy: use query_data_on_all_shards() for data range scan queries query: partition_slice: add range_scan_data_variant option gms: add RANGE_SCAN_DATA_VARIANT cluster feature multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries multishard_mutation_query: add query_data_on_all_shards() mutation_partition.cc: fix indentation query_result_builder: make it a public type multishard_mutation_query: generalize query code w.r.t. the result builder used multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine multishar_mutation_query: do_query_mutations(): convert to coroutine multishard_mutation_query: read_page(): convert to coroutine multishard_mutation_query: extract page reading logic into separate method	2021-03-02 08:54:41 +02:00
Botond Dénes	257c295cff	cql_query_test: add unit test for the more efficient range scan result format The most user-visible aspect of this change is range scans which select a small subset of the columns. These queries work as the user expects them to work: unselected columns are not included in determining the size of the result (or that of the page). This is the aspect this test is checking for. While at it, also test single partition queries too.	2021-03-02 08:01:53 +02:00
Botond Dénes	af0a23e75c	test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter To allow conveniently setting the scheduling group `func` is to be run in.	2021-03-02 07:53:53 +02:00
Botond Dénes	fe280271a6	cql_query_test: test_query_limit: clean up scheduling groups Destroy scheduling groups created for this test, so other tests can create scheduling groups with the same name, without conflicts.	2021-03-02 07:53:53 +02:00
Botond Dénes	f8ce168c8e	storage_proxy: use query_data_on_all_shards() for data range scan queries Currently range scans build their result using the `reconcilable_result` format and then convert it to `query::result`. This is inefficient for multiple reasons: 1) it introduces an additional intermediate result format and a subsequent conversion to the final one; 2) the reconcilable result format was designed for reconciliation so it contains all data, including columns unselected by the query, dead rows and tombstones, which takes much more memory to build. There is no reason to go through all this trouble; if there ever was one in the past, it no longer stands. So switch to the newly introduced `query_data_on_all_shards()` when doing normal data range scans, but only if all the nodes in the cluster support it, to avoid artificial differences in page sizes due to how reconcilable result and query::result calculate result size and the consequent false-positive read repair. The transition to this new more efficient method is coordinated by a cluster feature and whether to use it is decided by the coordinator (instead of each replica individually). This is to avoid needless reconciliation due to the different page sizes the two formats will produce.	2021-03-02 07:53:53 +02:00
Botond Dénes	f15551d23a	query: partition_slice: add range_scan_data_variant option Switching to the data variant of range scans has to be coordinated by the coordinator to avoid replicas noticing the availability of the respective feature at different times, resulting in some using the mutation variant and some using the data variant. So the plan is that it will be the coordinator's job to check the cluster feature and set the option in the partition slice which will tell the replicas to use the data variant for the query.	2021-03-02 07:53:53 +02:00
Botond Dénes	5c84aa52db	gms: add RANGE_SCAN_DATA_VARIANT cluster feature To control the transition to the data variant of range scans. As there is a difference in how the data and mutation variants calculate page sizes, the transition to the former has to happen in a controlled manner, when all nodes in the cluster support it, to avoid artificial differences in page content and subsequently triggering false-positive read repair.	2021-03-02 07:53:53 +02:00
Botond Dénes	0f0c3be63e	multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries Refuse reverse queries just like in the new `query_data_on_all_shards()`. The reason is the same, reverse range scans are not supported on the client API level and hence they are underspecified and more importantly: not tested.	2021-03-02 07:53:53 +02:00
Botond Dénes	034cb81323	multishard_mutation_query: add query_data_on_all_shards() A data query variant of the existing `query_mutations_on_all_shards()`. This variant builds a `query::result`, instead of `reconcilable_result`. This is actually the result format coordinators want when executing range scans, the reason for using the reconcilable result for these queries is historic, and it just introduces an unnecessary intermediate format. This new method allows the storage proxy to skip this intermediate format and the associated conversion to `query::result`, just like we do for single partition queries. Reverse queries are refused because they are not supported on the client API (CQL) level anyway and hence it is unspecified how they should work and more importantly: they are not tested.	2021-03-02 07:53:53 +02:00
Botond Dénes	df0f501ba2	mutation_partition.cc: fix indentation Left broken from the previous patch.	2021-03-02 07:53:53 +02:00
Botond Dénes	950150c6df	query_result_builder: make it a public type We will want to use it in multishard_mutation_query.cc.	2021-03-02 07:53:53 +02:00
Botond Dénes	f19ab5cff1	multishard_mutation_query: generalize query code w.r.t. the result builder used We want to add support to building `query::result` directly and reuse the code path we use to build reconcilable result currently for it. So templatize said code path on the result builder used. Since the different result builders don't have a source level compatible interface an adaptor class is used.	2021-03-02 07:53:53 +02:00
Botond Dénes	bddb0d35d6	multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method In the next patches we are going to generalize the query logic w.r.t. the result builder used, so query_mutations_on_all_shards() will be just a facade parametrizing the actual query code with the right result builder.	2021-03-02 07:53:53 +02:00
Botond Dénes	b0b620b501	multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine In preparation to generalizing it w.r.t. the result builder used. This change will be much simpler with the coroutine code.	2021-03-02 07:53:53 +02:00
Botond Dénes	5d85615698	multishar_mutation_query: do_query_mutations(): convert to coroutine In preparation to generalizing it w.r.t. the result builder used. This change will be much simpler with the coroutine code.	2021-03-02 07:53:53 +02:00
Botond Dénes	8138bdb434	multishard_mutation_query: read_page(): convert to coroutine In preparation to generalizing it w.r.t. the result builder used. This change will be much simpler with the coroutine code.	2021-03-02 07:53:53 +02:00
Botond Dénes	29195f67f1	multishard_mutation_query: extract page reading logic into separate method The block of code moved also coincides with the scope in which the reader has to be alive, making the code more clear.	2021-03-02 07:53:53 +02:00
Benny Halevy	baf5d05631	storage_service: use atomic_vector for lifecycle_subscribers So it can be modified while walked to dispatch subscribed event notifications. In #8143, there is a race between scylla shutdown and notify_down(), causing use-after-free of cql_server. Using an atomic vector instead and futurizing unregister_subscriber allows deleting from _lifecycle_subscribers while walked using atomic_vector::for_each. Fixes #8143 Test: unit(release) DTest: update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release) materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>	2021-03-01 20:34:42 +02:00
Benny Halevy	1ed04affab	cql_server: event_notifier: unregister_subscriber in stop Move unregister_subscriber from the destructor to stop as preparation for moving storage_service lifecycle_subscribers to atomic_vector and futurizing unregister_subscriber. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>	2021-03-01 20:34:42 +02:00
Avi Kivity	fe8f9039a2	Update seastar submodule * seastar 803e790598...ea5e529f30 (3): > Merge "Teach io_tester to generate YAML output" from Pavel E > bitset: set_range: mark constructor constexpr > Update dpdk submodule	2021-03-01 20:34:35 +02:00
Avi Kivity	8747c684e0	Merge 'Move timeouts to client state' from Piotr Sarna This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests. The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867). Closes #8140 * github.com:scylladb/scylla: treewide: remove timeout config from query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options service: add timeout config to client state	2021-03-01 20:34:35 +02:00
Raphael S. Carvalho	2cf0c4bbf1	compaction: Prevent cleanup and regular from compacting the same sstable Due to regression introduced by `463d0ab`, regular can compact in parallel a sstable being compacted by cleanup, scrub or upgrade. This redundancy causes resources to be wasted, write amplification is increased and so does the operation time, etc. That's a potential source of data resurrection because the now-owned data from a sstable being compacted by both cleanup and regular will still exist in the node afterwards, so resurrection can happen if node regains ownership. Fixes #8155. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>	2021-03-01 20:34:35 +02:00
Tomasz Grabiec	cb0b8d1903	row_cache: Zap dummy entries when populating or reading a range This will prevent accumulation of unnecessary dummy entries. A single-partition populating scan with clustering key restrictions will insert dummy entries positioned at the boundaries of the clustering query range to mark the newly populated range as continuous. Those dummy entries may accumulate with time, increasing the cost of the scan, which needs to walk over them. In some workloads we could prevent this. If a populating query overlaps with dummy entries, we could erase the old dummy entry since it will not be needed, it will fall inside a broader continuous range. This will be the case for time series workloads which scan with a decreasing (newest) lower bound. Refs #8153. _last_row is now updated atomically with _next_row. Before, _last_row was moved first. If an exception was thrown and the section was retried, this could cause the wrong entry to be removed (new next instead of old last) by the new algorithm. I don't think this was causing problems before this patch. The problem is not solved for all the cases. After this patch, we remove dummies only when there is a single MVCC version. We could patch apply_monotonically() to also do it, so that dummies which are inside continuous ranges are eventually removed, but this is left for later. perf_row_cache_reads output after that patch shows that the second scan touches no dummies: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 265320 Scanning read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB] read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB] Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>	2021-03-01 20:34:35 +02:00
Tomasz Grabiec	761f89e55e	api: Introduce system/drop_sstable_caches RESTful API Evicts objects from caches which reflect sstable content, like the row cache. In the future, it will also drop the page cache and sstable index caches. Unlike lsa/compact, doesn't cause reactor stalls. The old lsa/compact call invokes memory reclamation, which is non-preemptible. It also compacts LSA segments, so does more work. Some use cases don't need to compact LSA segments, just want the row cache to be wiped. Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>	2021-03-01 16:13:04 +02:00
Piotr Dulikowski	aa2df75321	commitlog: add an option to allow going over size limit This commit adds an option which, when turned on, allows the commitlog to go over configured size limit. After reaching the limit, commitlog will be more conservative with its usage of the disk space - for example, it won't increase the segment reserve size or reuse recycled segments. Most importantly, it won't block writes until the space used by the commitlog goes down. This change is necessary for hinted handoff to keep its current behavior. Hinted handoff does not let the commitlog free segments itself - instead, it re-creates it every 10 seconds and manually deletes segments after all hints are sent from a segment.	2021-03-01 14:16:05 +01:00
Takuya ASADA	d0297c599a	dist: tune fs.aio-max-nr based on the number of cpus Current aio-max-nr is set up statically to 1048576 in /etc/sysctl.d/99-scylla-aio.conf. This is sufficient for most use cases, but falls short on larger machines such as i3en.24xlarge on AWS that has 96 vCPUs. We need to tune the parameter based on the number of cpus, instead of static setting. Fixes #8133 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #8188	2021-03-01 14:18:24 +02:00
Avi Kivity	31909515b3	Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that keeps all its versions that are referenced somewhere and provides a way of getting a reference to an immutable version of the set. Each sstable in the set is associated with the versions it is alive in, and is removed when all such versions don't have references anymore. To avoid copying, the object holding all sstables in the set version is changed to a new structure, sstable_list, which was previously an alias for std::unordered_set<shared_sstable>, and which implements most of the methods of an unordered_set, but its iterator uses the actual set with all sstables from all referenced versions and iterates over those sstables that belong to the captured version. The methods that modify the set's contents give a strong exception guarantee by trying to insert new sstables into its containers, and erasing them in the case of a caught exception. To release shared_sstables as soon as possible (i.e. when all references to versions that contain them die), each time a version is removed, all sstables that were referenced exclusively by this version are erased. We are able to find these sstables efficiently by storing, for each version, all sstables that were added and erased in it, and, when a version is removed, merging it with the next one. When a version that adds an sstable gets merged with a version that removes it, this sstable is erased. 
Fixes #2622 Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com Closes #8111 * github.com:scylladb/scylla: sstables: add test for checking the latency of updating the sstable_set in a table sstables: move column_family_test class from test/boost to test/lib sstables: use fast copying of the sstable_set instead of rebuilding it sstables: replace the sstable_set with a versioned structure sstables: remove potential ub sstables: make sstable_set constructor less error-prone	2021-03-01 14:16:36 +02:00
Avi Kivity	ef97adc72a	Merge "Validate token monotonicity on the sstable write path" from Botond " We have recently seen out-of-order partitions getting into sstables causing major disruption later on. Given the damage caused, it was again raised that we should enable partition key monotonicity validation unconditionally in the sstable write path. This was also raised in the past but dismissed as key validation was suspected (but not measured) to add considerable per-fragment overhead. One of the problems was that the key monotonicity validation was all or nothing. It either validated all (clustering and partition) key monotonicity or none of it. This series takes a second look at this and solves the all-or-nothing problem by making the configuration of the key monotonicity check more fine grained, allowing for enabling just token monotonicity validation separately, then enables it unconditionally. Refs: #7623 Tests: unit(release) " * 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla: sstables: enable token monotonicity validation by default mutation_fragment_stream_validator: add token validation level mutation_fragment_stream_validating_filter: make validation levels more fine-grained	2021-03-01 11:23:51 +02:00
Amnon Heiman	0595596172	api/compaction_manager: add the compaction id in get_compaction This patch adds the compaction id to the get_compaction structure. While it was supported, it was not used and up until now wasn't needed. After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions' will include the compaction id. Relates to #7927 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #8186	2021-03-01 10:51:31 +02:00
Piotr Sarna	7936652322	db,view: improve verbosity of errors coming from view updates The error now contains information about the view table that failed, as well as base and view tokens. Example: view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index, base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error) Fixes #8177 Closes #8178	2021-03-01 10:46:14 +02:00
Avi Kivity	86d8977c96	Update tools/python3 submodule * tools/python3 199ac90...6f3bcbe (2): > Add support pip modules > create-relocatable-package.py: add support python libraries in /usr/local	2021-03-01 10:10:13 +02:00
Avi Kivity	8ac0d6d15d	Update tools/jmx submodule * tools/jmx bf8bb16...8073af6 (1): > CompactionManager: add the compaction id when available Fixes #7927.	2021-03-01 10:09:16 +02:00
Takuya ASADA	4cf9b6988e	scylla_coredump_setup: don't run apt-get when systemd-coredump is already installed Check systemd-coredump existence before running apt-get install systemd-coredump. Closes #8185	2021-03-01 09:38:51 +02:00
Botond Dénes	f0b284dab8	sstables: enable token monotonicity validation by default Partition key order validation in data written to sstables can be very disruptive. All our components in the storage layers assume that partitions are in order, which means that reading out-of-order partitions triggers undefined behaviour. Computer scientists often joke that undefined behaviour can erase your hard drive and in this case the damage done by undefined behaviour caused by out-of-order partitions is very close to that. The corruption is known to mutate, causing crashes, corrupting more data and even losing data. For this reason it is imperative that out-of-order partitions cannot get into sstables. This patch enables token monotonicity validation unconditionally in the sstable writer. As partition key monotonicity checks involve a key copy per partition, which might have an impact on the performance, we do the next best thing instead and enable only token monotonicity validation.	2021-03-01 07:49:23 +02:00
Botond Dénes	727bc0f5d4	mutation_fragment_stream_validator: add token validation level In some cases the full-blown partition key validation and especially the associated key copy per partition might be deemed too costly. As a next best thing this patch adds a token only validation, which should cover 99% (number pulled out of my sleeve) of the cases. Let's hope no one gets unlucky.	2021-03-01 07:49:23 +02:00
Botond Dénes	694f8a4ec6	mutation_fragment_stream_validating_filter: make validation levels more fine-grained Currently key order validation for the mutation fragment stream validating filter is all or nothing. Either no keys (partition or clustering) are validated or all of them. As we suspect that clustering key order validation would add a significant overhead, this discourages turning key validation on, which means we miss out on partition key monotonicity validation which has a much more moderate cost. This patch makes this configurable in a more fine-grained fashion, providing separate levels for partition and clustering key monotonicity validation. As the choice for the default validation level is not as clear-cut as before, the default value for the validation level is removed in the validating filter's constructor.	2021-03-01 07:49:23 +02:00
Avi Kivity	3cd2f00438	dht: convert token tri_compare to std::strong_ordering Change token's tri_compare functions to return std::strong_ordering, which is not convertible to bool and therefore not susceptible to being misused where a less-compare is expected. Two of the users (ring_position and decorated_key) have to undo the conversion, since they still return int. A follow up will convert them too. Ref #1449.	2021-02-28 21:03:59 +02:00
Avi Kivity	d3d7698502	interval: support C++20 three-way comparisons Allow the tri-comparator input to range functions to return std::strong_ordering, e.g. the result of operator<=>. An int input is still allowed, and coerced to std::strong_ordering by tri-comparing it against zero. Once all users are converted, this will be disallowed. The clever code that performs boundary comparisons unfortunately has to be dumbed down to conditionals. A helper require_ordering_and_on_equal_return() is introduced that accepts a comparison result between bound values, an expected comparison result, and what to return if the bound value matches (this depends on whether individual bounds are exclusive or inclusive, on whether the bounds are start bounds or end bounds, and on the sense of the comparison). Unfortunately, the code is somewhat pessimized, and there is no way to optimize it as the enum underlying std::strong_ordering is hidden.	2021-02-28 21:03:25 +02:00
Avi Kivity	d980f550d1	Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec fill_buffer() will keep scanning until _lower_bound_changed is true, even if preemption is signaled, so that the reader makes forward progress. Before the patch, we did not update _lower_bound on touching a dummy entry. The read will not respect preemption until we hit a non-dummy row. If there are a lot of dummy rows, that can cause reactor stalls. Fix that by updating _lower_bound on dummy entries as well. Refs #8153. Tested with perf_row_cache_reads: ``` $ build/release/test/perf/perf_row_cache_reads -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB] read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB] ``` Notice that max preemption latency is low in the second "read:" line. Closes #8167 * github.com:scylladb/scylla: row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows tests: perf: Introduce perf_row_cache_reads row_cache: Add metric for dummy row hits	2021-02-28 21:00:20 +02:00
Botond Dénes	1d9b5911fe	time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free The optimal path of said method mistakenly captures `pos` (a local variable) in its reader factory method and passes a temporary range implicitly constructed from said `pos` as the range parameter to the sstable reader. This will lead to the sstable reader using a dangling range and will result in returning no result for queries. This patch fixes this bug and adds a unit test to cover this code path. Fixes #8138. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>	2021-02-26 23:57:25 +02:00
Botond Dénes	dd5a601aaa	result_memory_accounter: abort unpaged queries hitting the global limit The `result_memory_accounter` terminates a query if it reaches either the global or shard-local limit. This used to be so only for paged queries, unpaged ones could grow indefinitely (until the node OOM'd). This was changed in `fea5067` which enforces the local limit on unpaged queries as well, by aborting them. However a loophole remained in the code: `result_memory_accounter::check_and_update()` has another stop condition, besides `check_local_limit()`, it also checks the global limit. This stop condition was not updated to enforce itself on unpaged queries by aborting them, instead it silently terminated them, causing them to return less data than requested. This was masked by most queries reaching the local limit first. This patch fixes this by aborting unpaged mutation queries when they hit the global limit. Fixes: #8162 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>	2021-02-26 23:43:16 +02:00
Botond Dénes	bc1fcd3db2	multishard_combining_reader: only read from needed shards The multishard combining reader currently assumes that all shards have data for the read range. This however is not always true and in extreme cases (like reading a single token) it can lead to huge read amplification. Avoid this by not pushing shards to `_shard_selection_min_heap` if the first token they are expected to produce falls outside of the read range. Also change the read ahead algorithm to select the shards from `_shard_selection_min_heap`, instead of walking them in shard order. This was wrong in two ways: * Shards may be ordered differently with respect to the first partition they will produce; reading ahead on the next shard in shard order might not bring in data on the next shard the read will continue on. Shard order is only correct when starting a new range and shards are iterated over in the order they own tokens according to the sharding algorithm. * Shards that may not have data relevant to the read range are also considered for read ahead. After this patch, the multishard reader will only read from shards that have data relevant to the read range, both in the case of normal reads and also for read-ahead. Fixes: #8161 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>	2021-02-26 23:29:20 +02:00
Piotr Sarna	0e0282cdf1	Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun Currently all management of CDC generations happens in storage_service, which is a big ball of mud that does many unrelated things. This PR introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. We plug the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service call the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart). Some parts of generation management still remain in storage_service: the bootstrap procedure, which happens inside storage_service, must also do some initialization regarding CDC generations, for example: on restart it must retrieve the latest known generation timestamp from disk; on bootstrap it must create a new generation and announce it to other nodes. The order of these operations w.r.t the rest of the startup procedure is important, hence the startup procedure is the only right place for them. We may try decoupling these services even more in follow-up PRs, but that requires a bit of careful reasoning. What this PR does is a low-hanging fruit. Still, what remains in storage_service is a small part of the entire CDC generation management logic; most of it has been moved to the new service. This includes listening for generation changes and updating the data structures for performing CDC log writes (cdc::metadata). Furthermore these handling functions now return futures (and are internally coroutines), where previously they required a seastar::async context. This PR is a prerequisite to fixing #7985. The fact that all the CDC generation management code was in storage_service is technical debt. It will be easier to modify the management algorithms when they sit in their own module. 
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java Closes #8172 * github.com:scylladb/scylla: cdc: move (most of) CDC generation management code to the new service cdc: coroutinize make_new_cdc_generation cdc: coroutinize update_streams_description cdc: introduce cdc::generation_service main: move cdc_service initialization just prior to storage_service initialization	2021-02-26 12:42:27 +01:00
Kamil Braun	e2f03e4aba	cdc: move (most of) CDC generation management code to the new service Currently all management of CDC generations happens in storage_service, which is a big ball of mud that does many unrelated things. Previous commits have introduced a new service for managing CDC generations. This code moves most of the relevant code to this new service. However, some part still remains in storage_service: the bootstrap procedure, which happens inside storage_service, must also do some initialization regarding CDC generations, for example: on restart it must retrieve the latest known generation timestamp from disk; on bootstrap it must create a new generation and announce it to other nodes. The order of these operations w.r.t the rest of the startup procedure is important, hence the startup procedure is the only right place for them. Still, what remains in storage_service is a small part of the entire CDC generation management logic; most of it has been moved to the new service. This includes listening for generation changes and updating the data structures for performing CDC log writes (cdc::metadata). Furthermore these functions now return futures (and are internally coroutines), where previously they required a seastar::async context.	2021-02-26 12:06:12 +01:00
Tomasz Grabiec	b9c3b6c10f	row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows fill_buffer() will keep scanning until _lower_bound_changed is true, even if preemption is signalled, so that the reader makes forward progress. Before the patch, we did not update _lower_bound on touching a dummy entry. The read will not respect preemption until we hit a non-dummy row. If there are a lot of dummy rows, that can cause reactor stalls. Fix that by updating _lower_bound on dummy entries as well. Refs #8153. Tested with perf_row_cache_reads: $ build/release/test/perf/perf_row_cache_reads -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB] read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB] Notice that max preemption latency is low in the second "read:" line.	2021-02-26 01:20:38 +01:00
Tomasz Grabiec	52e411df36	tests: perf: Introduce perf_row_cache_reads Tests performance of various read patterns from the row cache. Example: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Filling memtable Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 156.288986 [ms], preemption: {count: 702, 99%: 0.545791 [ms], max: 0.537537 [ms]}, cache: 99/100 [MB] read: 106.480766 [ms], preemption: {count: 6, 99%: 0.006866 [ms], max: 106.496168 [ms]}, cache: 99/100 [MB]	2021-02-26 01:20:38 +01:00
Tomasz Grabiec	f0a3272a5f	row_cache: Add metric for dummy row hits This will help to diagnose performance problems related to the read having to walk through a lot of dummy rows to fill the buffer. Refs #8153	2021-02-25 18:26:01 +01:00
Piotr Sarna	c5214eb096	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2021-02-25 17:20:27 +01:00
Piotr Sarna	f973e09454	cql3: use timeout config from client state instead of query options ... in batch statement, in order to be able to remove the timeout from query options later.	2021-02-25 17:20:27 +01:00
Piotr Sarna	639d90d2d6	cql3: use timeout config from client state instead of query options ... in modification statement, in order to be able to remove the timeout from query options later.	2021-02-25 17:20:27 +01:00
Piotr Sarna	b71665efe8	cql3: use timeout config from client state instead of query options ... in select statement, in order to be able to remove the timeout from query options later.	2021-02-25 17:20:27 +01:00
Piotr Sarna	7ceafda70a	service: add timeout config to client state Future patches will use this per-connection timeout config to allow setting different timeouts for each session, based on roles.	2021-02-25 17:20:26 +01:00
Takuya ASADA	aabc67e386	dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name dh_installinit --name <service> forces installation of debian/.service and debian/.default files that do not match the package name. If we have subpackages, the packager is responsible for renaming debian/.service to debian/<subpackage>.service. However, we are currently mistakenly running dh_installinit --name scylla-node-exporter for debian/scylla-node-exporter.service; the packaging system tries to find the destination package for the .service file, does not find a subpackage name in it, and so picks the first subpackage ordered by name, scylla-conf. To solve the issue, we just need to run dh_installinit without --name when $product == 'scylla'. Fixes #8163 Closes #8164	2021-02-25 17:05:20 +02:00
Avi Kivity	032fdfe855	Update seastar submodule * seastar e53a1059f9...803e790598 (9): > io_queue: Count total time spent in the queue > io_queue: Fix "delay" metrics Fixes #8166. > file: expose disk offset alignment for overwrites Ref #7663. > RPC: (client) retain local address and use on stream creation > rpc: sink_impl: align _{last,next}_seq_num to cache-line size > reactor: Fix outdated comment > fair_queue: Remove now dead ticket strictly_less method > io_queue: Double max request size > bitsets: set_iterator: correctly implement pre- and post-increment operators	2021-02-25 16:58:06 +02:00
Takuya ASADA	f3a82f4685	scylla_setup: allow running scylla_setup with strict umask setting We currently deny running scylla_setup when umask != 0022. To remove this limitation, run os.chmod(0o644) on every file creation so that the scylla user can read the files. Note that perftune.yaml does not really need to be set to 0644, since perftune.py runs as root, but we set it anyway to align its permissions with the other files. Fixes #8049 Closes #8119	2021-02-25 16:42:45 +02:00
Nadav Har'El	750d7903be	cql-pytest: fix some comments in util.py Fix some incorrect comments, pasted from other files or mentioning wrong names. No other changes except comments Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210225133237.1403891-1-nyh@scylladb.com>	2021-02-25 16:00:20 +02:00
Raphael S. Carvalho	7bf0744d36	reshape/TWCS: Fix off-by-one in threshold check A given time bucket should also be reshaped if its # of sstables has reached the threshold. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>	2021-02-24 15:12:40 +02:00
Raphael S. Carvalho	21608bd677	sstables: Fix TWCS reshape for windows with at least min_threshold sstables TWCS reshape was silently ignoring windows which contain at least min_threshold sstables (can happen with data segregation). When resizing candidates, size of multi_window was incorrectly used and it was always empty in this path, which means candidates was always cleared. Fixes #8147. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>	2021-02-24 15:11:19 +02:00
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Piotr Sarna	25f47561cb	transport: fix an outdated comment The comment mentions calling a lambda in-place, but the lambda is no longer there since 2019! Message-Id: <3903c84d5c151415409f28935e328b552dd548f8.1614155567.git.sarna@scylladb.com>	2021-02-24 11:14:01 +02:00
Avi Kivity	15d3797e97	test: logalloc_test: test background reclaim Test that the background reclaimer is able to compete with a fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU" is fully randomized. If the background reclaimer is disabled, the test fails as soon as the 20MB "gap" is exhausted. With the reclaimer enabled, it is able to free memory ahead of the allocations.	2021-02-23 19:42:42 +02:00
Nadav Har'El	d905e71a90	Alternator: add support for CORS protocol This patch adds to Alternator support for the CORS (Cross-Origin Resource Sharing) protocol - a simple extension over the HTTP protocol which browsers use when Javascript code contacts HTTP-based servers. Although we usually think of Alternator as being used in a three-tier application, in some setups there is no middle layer and the user's browser, running Javascript code, wants to communicate directly with the database. However, for security reasons, by default Javascript loaded from domain X is not allowed to communicate with different domains Y. The CORS protocol is meant to allow this, and Alternator needs to participate in this protocol if it is to be used directly from Javascript in browsers. To implement CORS, Alternator needs to respond to the OPTIONS method which it didn't allow before - with certain headers based on the input headers. It also needs to do some of these things for the regular methods (mostly, POST). The patch includes a comprehensive test that runs against both Alternator and DynamoDB and shows that Alternator handles these headers and methods the same as DynamoDB. Additionally, I tested manually a Javascript DynamoDB client - which didn't work prior to this patch (the browser reported CORS errors), and works after this patch. Fixes #8025. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>	2021-02-23 13:15:03 +01:00
Asias He	7018377bd7	messaging_service: Move gossip ack message verb to gossip group Fix a scheduling group leak: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip After the fix: INFO [shard 0] gossip - gossiper::run sg=gossip INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip Fixes #7986 Closes #8129	2021-02-23 10:10:00 +02:00
Tomasz Grabiec	fb1d3fe2cf	table: Fix schema mismatch between memtable reader and sstable writer The schema used to create the sstable writer has to be the same as the schema used by the reader, as the former is used to interpret mutation fragments produced by the reader. Commit `9124a70` introduced a deferring point between reader creation and writer creation which can result in schema mismatch if there was a concurrent alter. This could cause the sstable write to crash or to generate a corrupted sstable. Fixes #7994 Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>	2021-02-22 17:51:00 +02:00
Raphael S. Carvalho	81d773e5d8	compaction_manager: Redefine weight for better control of parallel compactions Compaction manager allows compaction of different weights to proceed in parallel. For example, a small-sized compaction job can happen in parallel to a large-sized one, but similar-sized jobs are serialized. The problem is the current definition of weight, which is the log (base 4) of total size (size of all sstables) of a job. This is what we get with the current weight definition: weight=5 for sizes=[1K, 3K] weight=6 for sizes=[4K, 15K] weight=7 for sizes=[16K, 63K] weight=8 for sizes=[64K, 255K] weight=9 for sizes=[258K, 1019K] weight=10 for sizes=[1M, 3M] weight=11 for sizes=[4M, 15M] weight=12 for sizes=[16M, 63M] weight=13 for sizes=[64M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 12 Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5 jobs smaller than 1MB could proceed in parallel. High number of parallel compactions can be observed after repair, which potentially produces tons of small sstables of varying sizes. That causes compaction to use a significant amount of resources. To fix this problem, let's add a fixed tax to the size before taking the log, so that jobs smaller than 1M will all have the same weight. Look at what we get with the new weight definition: weight=10 for sizes=[1K, 2M] weight=11 for sizes=[3M, 14M] weight=12 for sizes=[15M, 62M] weight=13 for sizes=[63M, 254M] weight=14 for sizes=[256M, 1022M] weight=15 for sizes=[1033M, 4078M] weight=16 for sizes=[4119M, 10188M] total weights: 7 Fixes #8124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>	2021-02-22 15:50:29 +02:00
Asias He	554ab035dd	main: Run init_server and join_cluster inside maintenance scheduling group Currently, init_server and join_cluster which initiate the bootstrap and replace operations on the new node run inside the main scheduling group. We should run them inside the maintenance scheduling group to reduce the impact on the user workload. This patch fixes a scheduling group leak for bootstrap and replace operation. Before: [shard 0] storage_service - storage_service::bootstrap sg=main [shard 0] repair - bootstrap_with_repair sg=main After: [shard 0] storage_service - storage_service::bootstrap sg=streaming [shard 0] repair - bootstrap_with_repair sg=streaming Fixes #8130 Closes #8131	2021-02-22 14:55:02 +02:00
Michał Chojnowski	a24f83852e	atomic_cell: fix operator<< for atomic_cell_or_collection operator<< used the wrong criterion for deciding whether the data is stored as atomic_cell or collection_mutation, resulting in catastrophic failure if it was used with frozen collections or UDTs. Since frozen collections and UDTs are stored as atomic_cell, not collection_mutation, the correct criterion is not is_collection(), but is_multi_cell(). Closes #8134	2021-02-22 14:45:34 +02:00
Kamil Braun	022d7773f4	cdc: coroutinize make_new_cdc_generation	2021-02-22 12:47:44 +01:00
Kamil Braun	26ca9d6c33	cdc: coroutinize update_streams_description	2021-02-22 12:46:53 +01:00
Kamil Braun	d4937daaea	cdc: introduce cdc::generation_service This commit introduces a new service crafted to handle CDC generation management: listening and reacting to generation changes in the cluster. The implementation is a stub for now, the service reacts to generation changes by simply logging the event. The commit plugs the service in, initializing it in main and test code, passing a reference to storage_service and having storage_service start the service (using the `after_join` method): the service only starts doing its job after the node joins the token ring (either on bootstrap or restart).	2021-02-22 12:45:43 +01:00
Kamil Braun	8e72c33d7c	main: move cdc_service initialization just prior to storage_service initialization As a preparation for introducing CDC generation management service. cdc_service will depend on the generation service. But the generation service needs some other services to work properly. In particular, it uses the local database, so it should be initialized after the local database. The only service that will need the cdc generation service is storage_service, so we can place the generation service initialization code right before storage_service initialization code. So the order will be cdc_generation_service -> cdc_service -> storage_service.	2021-02-22 12:43:10 +01:00
Liu Lan	d2378129a3	docs: fix invalid path in README.mds Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com> Closes #8126	2021-02-21 13:49:12 +02:00
Konstantin Osipov	95ee8e1b90	raft: fix spelling Fix spelling of a few comments.	2021-02-19 22:56:26 +03:00
Pekka Enberg	d483922671	Update tools/java submodule * tools/java 0187829d5e...142f517a23 (2): > nodetool: Enable resetlocalschema > sstableloader: Make progress printout less eager.	2021-02-19 12:37:04 +02:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, up to 2x for wide rows. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), perf(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Nadav Har'El	02dde2aca1	cql-pytest: port Cassandra's unit test validation/entities/json_test In this patch, we port validation/entities/json_test.java, containing 21 tests for various JSON-related operations - SELECT JSON, INSERT JSON, and the fromJson() and toJson() functions. In porting these tests, I uncovered 19 (!!) previously unknown bugs in Scylla: Refs #7911: Failed fromJson() should result in FunctionFailure error, not an internal error. Refs #7912: fromJson() should allow null parameter. Refs #7914: fromJson() integer overflow should cause an error, not silent wrap-around. Refs #7915: fromJson() should accept "true" and "false" also as strings. Refs #7944: fromJson() should not accept the empty string "" as a number. Refs #7949: fromJson() fails to set a map<ascii, int>. Refs #7954: fromJson() fails to set null tuple elements. Refs #7972: toJson() truncates some doubles to integers. Refs #7988: toJson() produces invalid JSON for columns with "time" type. Refs #7997: toJson() is missing a timezone on timestamp. Refs #8001: Documented unit "µs" not supported for assigning a "duration" type. Refs #8002: toJson() of decimal type doesn't use exponents so can produce huge output. Refs #8077: SELECT JSON output for function invocations should be compatible with Cassandra. Refs #8078: SELECT JSON ignores the "AS" specification. Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest error, not internal error. Refs #8086: INSERT JSON cannot handle user-defined types with case-sensitive component names. Refs #8087: SELECT JSON incorrectly quotes strings inside map keys. Refs #8092: SELECT JSON missing null component after adding field to UDT definition. Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY. Due to these bugs, 8 out of the 21 tests here currently xfail and one has to be skipped (issue #8100 causes the sanitizer to detect a use after free, and crash Scylla). As usual in this sort of test, all 21 tests pass when running against Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>	2021-02-18 20:44:04 +02:00
Takuya ASADA	32d4ec6b8a	scylla_util.py: resolve /dev/root to get actual device on aws When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly reports that the root partition is part of the ephemeral disks, and RAID construction will fail. This prevents the error and reports the correct free disks. Fixes #8055 Closes #8040	2021-02-18 20:25:45 +02:00
Avi Kivity	90a7f76fb6	Merge 'cdc: log: fix a use-after-free in process_bytes_visitor' from Michał Chojnowski Due to small value optimization used in `bytes`, views to `bytes` stored in `vector` can be invalidated when the vector resizes, resulting in use-after-free and data corruption. Fix that. Closes #8105 * github.com:scylladb/scylla: cdc: log: avoid an unnecessary copy cdc: log: fix use-after-free in process_bytes_visitor	2021-02-18 20:23:41 +02:00
Michał Chojnowski	96c22cf3f8	cdc: log: avoid an unnecessary copy There is no need to copy `bytes_view` into `bytes` here.	2021-02-18 14:08:18 +01:00
Michał Chojnowski	8cc4f39472	cdc: log: fix use-after-free in process_bytes_visitor Due to small value optimization used in `bytes`, views to `bytes` stored in `vector` can be invalidated when the vector resizes, resulting in use-after-free and data corruption. Fix that. Fixes #8117	2021-02-18 14:08:17 +01:00
Konstantin Osipov	32952a744a	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	e49d5f89a5	raft: do not account for the same vote twice While a conforming Raft implementation never sends a duplicate vote from the same server, Raft's assumptions about the network permit duplicates. So, in theory, it is possible that a vote message is delivered multiple times. The current voting implementation does reject votes from non-members, but doesn't check for duplicate votes. Keep track of who has already voted, and reject duplicate votes. A unit test follows.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	7ea064ac04	raft: remove fsm::set_configuration() Set either tracker or votes configuration explicitly. This saves a few lines and simplifies unit tests.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	4083026b65	raft: consistently use configuration from the log	2021-02-18 16:04:44 +03:00
Konstantin Osipov	c4552ffb9a	raft: add ostream serialization for enum vote_result	2021-02-18 16:04:44 +03:00
Konstantin Osipov	2ae04d8a47	raft: advance commit index right after leaving joint configuration Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}. The leader's view of stable indexes (server: match index) is: A: 5, B: 5, C: 6, D: 7, E: 8. The commit index would be 5 if we use joint configuration, and 6 if we assume we left it. Let it happen without an extra FSM step.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	132db931da	raft: add tracker test	2021-02-18 16:04:44 +03:00
Konstantin Osipov	6e3932bbc7	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	ed65a8635e	raft: update raft::log::apply_snapshot() assert apply_snapshot() doesn't support applying the same snapshot twice. The caller must check the current snapshot before applying.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	e58a3e42ca	raft: add a unit test for raft::log	2021-02-18 16:04:44 +03:00
Konstantin Osipov	51c968bcb4	raft: rename log::non_snapshoted_length() to log::in_memory_size() The old name was incorrect, in case apply_snapshot() was called with non-zero trailing entries, the total log length is greater than the length of the part that is not stored in a snapshot. Fix spelling in related comments. Rename fsm::wait() to fsm::wait_max_log_size(), it's a more specific name. Rename max_log_length to max_log_size to use 'size' rather than 'length' consistently for log size.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	cfe407b402	raft: inline raft::log::truncate_tail() It's the core of apply_snapshot() work and is only used in it. Now that truncate_tail is inline, rename truncate_head() to truncate_uncommitted().	2021-02-18 16:04:44 +03:00
Konstantin Osipov	e0011c6e4d	raft: ignore AppendEntries RPC with a very old term Do not assert on an outdated message.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	805d52eb16	raft: remove log::start_idx() Replace it with a private _first_idx, which is maintained along with the rest of class log state. _first_idx is a name consistent with counterpart last_idx(). Do not use a function since going forward we may want to remove Raft index from struct log_entry, so should rely less on it. This fixes a bug when _last_conf_idx was not reset after apply_snapshot() because start_idx() was pointing to a non-existent entry.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	af8770da63	raft: return a correct last term on an empty log If the log is empty, we must use snapshot's term, since the log could be right after taking a snapshot when no trailing entries were kept. This fixes a rare possible bug when a log matching rule could be violated during elections by a follower with a log which was just truncated after a snapshot. A separate unit test for the issue will follow.	2021-02-18 16:04:43 +03:00
Konstantin Osipov	cb035a7c8d	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-18 16:04:43 +03:00
Konstantin Osipov	04b4d97d6a	raft: rename progress.hh to tracker.hh class tracker is the main class of this module.	2021-02-18 16:04:43 +03:00
Konstantin Osipov	97a16c0f77	raft: extend single_node_is_quiet test	2021-02-18 16:04:43 +03:00
Avi Kivity	f0950e023d	Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started	2021-02-18 12:49:43 +02:00
Kamil Braun	4bf28aad7a	tests: add a simple CDC cql pytest	2021-02-18 11:44:59 +01:00
Kamil Braun	841f07e9b7	cdc: add config option to disable streams rewriting Rewriting stream descriptions is a long, expensive, and prone-to-failure operation. Due to #8061 it may consume a lot of memory. In general, it may keep failing (and being retried) endlessly, straining the cluster. As a backdoor we add this flag for potential future needs of admins or field engineers. I don't expect it will ever be used, but it won't hurt and may save us some work in the worst case scenario.	2021-02-18 11:44:59 +01:00
Kamil Braun	9bdd000e97	cdc: rewrite streams to the new description table Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. This commit adds an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations).	2021-02-18 11:44:59 +01:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit does the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	7c91894ddf	cdc: introduce no_generation_data_exception exception type	2021-02-18 11:44:59 +01:00
Kamil Braun	99cc9b8051	docs: cdc: mention system.cdc_local table	2021-02-18 11:44:59 +01:00
Kamil Braun	44aab61aea	cdc: coroutinize do_update_streams_description	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
Kamil Braun	ba920361b3	cdc: use chunked_vector for streams in streams_version The vector may get quite long (say... 1.6M stream IDs). We prevent a large allocation by using utils::chunked_vector.	2021-02-18 11:44:59 +01:00
Kamil Braun	9ae4467970	cdc: remove `streams_version::expired` field This field was not used anywhere.	2021-02-18 11:44:59 +01:00
Kamil Braun	3d7b990300	system_distributed_keyspace: use mutation API to insert CDC streams The `storage_proxy::mutate` low-level API is much more powerful than the CQL API. This power is not needed for this commit but for the next.	2021-02-18 11:44:59 +01:00
Kamil Braun	0df15ca8cc	storage_service: don't use `sys_dist_ks` before it is started It could happen that system_distributed_keyspace was used by storage_service before it was fully started (inside `handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned (on shard 0). It only checked whether `local_is_initialized()` returns true, so it only ensured that the service is constructed. Currently, sys_dist_ks' `start` only announces migrations, so this was mostly harmless. More concretely: it could result in the node trying to send CQL requests using a table that it didn't yet recognize by calling sys_dist_ks' methods before the `announce_migration` call inside `start` has returned. This would result in an exception; however, the exception would be caught by the caller and the procedure would be retried, succeeding eventually. See `handle_cdc_generation` for details. Still, the initial intention of the code was to wait for the sys_dist_ks service to be fully started before it was used. This commit fixes that.	2021-02-18 11:44:59 +01:00
Tomasz Grabiec	f94f70cda8	Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja Test log consistency after apply_snapshot() is called. Ensure log::last_term() log::last_conf_index() and log::size() work as expected. Misc cleanups. * scylla-dev/raft-confchange-test: raft: add a unit test for voting raft: do not account for the same vote twice raft: remove fsm::set_configuration() raft: consistently use configuration from the log raft: add ostream serialization for enum vote_result raft: advance commit index right after leaving joint configuration raft: add tracker test raft: tidy up follower_progress API raft: update raft::log::apply_snapshot() assert raft: add a unit test for raft::log raft: rename log::non_snapshoted_length() to log::length() raft: inline raft::log::truncate_tail() raft: ignore AppendEntries RPC with a very old term raft: remove log::start_idx() raft: return a correct last term on an empty log raft: do not use raft::log::start_idx() outside raft::log() raft: rename progress.hh to tracker.hh raft: extend single_node_is_quiet test	2021-02-18 10:55:59 +01:00
Raphael S. Carvalho	5206a97915	compaction: Fix leak of expired sstable in the backlog tracker Expired sstables are skipped in the compaction setup phase, because they don't need to be actually compacted, but rather only deleted at the end. That causes such sstables to not be removed from the backlog tracker, meaning that backlog caused by expired sstables will not be removed even after their deletion, which means shares will be higher than needed, making compaction potentially more aggressive than it has to be. To fix this bug, let's manually register these sstables into the monitor, such that they'll be removed from the tracker once compaction completes. Fixes #6054. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>	2021-02-18 11:12:00 +02:00
Takuya ASADA	d7f202f900	dist/debian: fix renaming debian/scylla-* files rule The current renaming rule for debian/scylla-* files is buggy: it fails to install some .service files when a custom product name is specified. Introduce regex-based rewriting instead of ad hoc renaming, and fix the wrong renaming rule. Fixes #8113 Closes #8114	2021-02-18 10:35:19 +02:00
Pekka Enberg	843bf57c3c	Update tools/jmx submodule * tools/jmx 949cefc...bf8bb16 (1): > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA	2021-02-18 10:35:00 +02:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	4b46793c19	row_cache: scanning_and_populating_reader: add _read_next_partition flag Instead of resetting _reader in scanning_and_populating_reader::fill_buffer in the `reader_finished` case, use a gentler _read_next_partition flag on which `read_next_partition` will be called in the next iteration. Then, read_next_partition can close _reader only before overwriting it with a new reader. Otherwise, if _reader is always closed in the `reader_finished` case, we end up hitting premature end_of_stream. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>	2021-02-17 19:06:21 +02:00
Benny Halevy	57540dae42	mutation_query: mark reconcilable_result_builder constructor noexcept With result_memory_accounter being nothrow move constructible, the reconcilable_result_builder constructor does not throw. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>	2021-02-17 18:56:12 +02:00
Benny Halevy	92e0e84ee5	database: futurize remove In preparation for futurizing the querier_cache api. Coroutinize drop_column_family while at it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>	2021-02-17 18:52:53 +02:00
Benny Halevy	5263ab0e9d	row_cache: read_context: use query-request is_single_partition helper Rather than hand-coding the same logic. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>	2021-02-17 18:29:39 +02:00
Benny Halevy	35256d1b92	treewide: explicitly use flat_mutation_reader_opt Unlike flat_mutation_reader_opt that is defined using optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate to `false` after being moved, only after it is explicitly reset. Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader> to make it easier to check if it was closed before it's destroyed or being assigned-over. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>	2021-02-17 17:57:34 +02:00
Avi Kivity	c63e26e26f	Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8048 * github.com:scylladb/scylla: cdc: Limit size of topology description cdc: Extract create_stream_ids from topology_description_generator	2021-02-17 15:43:53 +02:00
Piotr Jastrzebski	649f254863	cdc: Limit size of topology description Currently, whole topology description for CDC is stored in a single row. This means that for a large cluster of strong machines (say 100 nodes 64 cpus each), the size of the topology description can reach 32MB. This causes multiple problems. First of all, there's a hard limit on mutation size that can be written to Scylla. It's related to commit log block size which is 16MB by default. Mutations bigger than that can't be saved. Moreover, such big partitions/rows cause reactor stalls and negatively influence latency of other requests. This patch limits the size of topology description to about 4MB. This is done by reducing the number of CDC streams per vnode and can lead to CDC data not being fully colocated with Base Table data on shards. It can impact performance and consistency of data. This is just a quick fix to make it easily backportable. A full solution to the problem is under development. For more details see #7961, #7993 and #7985. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-17 13:24:40 +01:00
Avi Kivity	001652815c	Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). 
Fixes: #5578 Closes #8106 * github.com:scylladb/scylla: imr: switch back to open-coded description of structures utils: managed_bytes: add a few trivial helper methods utils: fragment_range: move FragmentedView helpers to fragment_range.hh utils: fragment_range: add single_fragmented_mutable_view utils: fragment_range: implement FragmentRange for fragment_range utils: mutable_view: add front() types: remove an unused helper function test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions test: mutation_test: remove an obsolete assertion test: mutation_test: initialize an uninitialized variable test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test	2021-02-17 13:40:16 +02:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	25a9569cc4	utils: managed_bytes: add a few trivial helper methods We will use them in the upcoming IMR removal patch.	2021-02-16 23:43:07 +01:00
Michał Chojnowski	3f248ca7cc	utils: fragment_range: move FragmentedView helpers to fragment_range.hh In the upcoming IMR removal patch we will need read_simple() and similar helpers for FragmentedView outside of types.hh. For now, let's move them to fragment_range.hh, where FragmentedView is defined. Since it's a widely included header, we should consider moving them to a more specialized header later.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	8a06a576aa	utils: fragment_range: add single_fragmented_mutable_view We will use it later in the upcoming IMR removal patch.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	7b662b9315	utils: fragment_range: implement FragmentRange for fragment_range This will allow us to pass FragmentedView instances to places where FragmentRange is expected.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	f972f90193	utils: mutable_view: add front() We will use it in the upcoming patches.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	9e591c6634	types: remove an unused helper function	2021-02-16 21:35:14 +01:00
Michał Chojnowski	6b8a69e01f	test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions The off-by-one error would cause test_multishard_combining_reader_non_strictly_monotonic_positions to fail if the added range_tombstones filled the buffer exactly to the end. In such a situation, with the old loop condition, make_fragments_with_non_monotonic_positions would add one range_tombstone too many to the deque, violating the test assumptions.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	5b79d6ca4c	test: mutation_test: remove an obsolete assertion Due to small value optimizations, the removed assertions are not true in general. Until now, atomic_cell did not use small value optimizations, but it will after upcoming changes.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	aa60f28a09	test: mutation_test: initialize an uninitialized variable It was assumed to be zero-initialized, but C++ does not guarantee that. It has to be initialized explicitly.	2021-02-16 21:35:14 +01:00
Michał Chojnowski	52bd190bb3	test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test sstable_run_based_compaction_test assumed that sstables are freed immediately after they are fully processed. However, since commit `b524f96a74`, mutation_reader_merger releases sstables in batches of 4, which breaks the assumption. This fix adjusts the test accordingly. Until now, the test only kept working by chance: by coincidence, the number of test sstables processed by merging_reader in a single fill_buffer() call was divisible by 4. Since the test checks happen between those calls, the test never witnessed a situation when an sstable was fully processed, but not released yet. The error was noticed during the work on an upcoming patch which changes the size of mutation_fragment, and reduces the number of test sstables processed in a single fill_buffer() call, which breaks the test.	2021-02-16 21:35:14 +01:00
Konstantin Osipov	d293966366	raft: add a unit test for voting Test duplicate votes, votes from non-members and voting in joint configuration.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	3478389d60	raft: do not account for the same vote twice While duplicate votes are not allowed by Raft rules, it is possible that a vote message is delivered multiple times. The current voting implementation does reject votes from non-members, but doesn't check for duplicate votes. Keep track of who has already voted, and reject duplicate votes. A unit test follows.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	ffd38de5fe	raft: remove fsm::set_configuration() Set either tracker or votes configuration explicitly. This saves a few lines and simplifies unit tests.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	b941ca9bae	raft: consistently use configuration from the log	2021-02-16 23:15:16 +03:00
Konstantin Osipov	75eddaf493	raft: add ostream serialization for enum vote_result	2021-02-16 23:15:16 +03:00
Konstantin Osipov	e099003c7c	raft: advance commit index right after leaving joint configuration Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}. Server stable indexes are: A=5, B=5, C=6, D=7, E=8. The commit index would be 5 if we use joint configuration, and 6 if we assume we left it. Let it happen without an extra FSM step.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	1bdb3fc8a9	raft: add tracker test	2021-02-16 23:15:16 +03:00
Konstantin Osipov	63965f46f4	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-16 23:15:16 +03:00
Konstantin Osipov	74879fab09	raft: update raft::log::apply_snapshot() assert apply_snapshot() doesn't support applying the same snapshot twice. The caller must check the current snapshot before applying.	2021-02-16 23:15:12 +03:00
Konstantin Osipov	6ee3aedcc2	raft: add a unit test for raft::log	2021-02-16 23:12:01 +03:00
Konstantin Osipov	c35f029be1	raft: rename log::non_snapshoted_length() to log::length() The old name was incorrect, in case apply_snapshot() was called with non-zero trailing entries, the total log length is greater than the length of the part that is not stored in a snapshot. Fix spelling in related comments. Rename fsm::wait() to fsm::wait_max_log_length(), it's a more specific name.	2021-02-16 23:12:01 +03:00
Konstantin Osipov	9e1a652805	raft: inline raft::log::truncate_tail() It's the core of apply_snapshot() work and is only used in it. Now that truncate_tail is inline, truncate_head() can be called simply truncate().	2021-02-16 23:10:58 +03:00
Konstantin Osipov	f7fb788edf	raft: ignore AppendEntries RPC with a very old term Do not assert on an outdated message.	2021-02-16 23:07:58 +03:00
Konstantin Osipov	7236f081c1	raft: remove log::start_idx() Replace it with a private _first_idx, which is maintained along with the rest of class log state. _first_idx is a name consistent with counterpart last_idx(). Do not use a function since going forward we may want to remove Raft index from struct log_entry, so should rely less on it. This fixes a bug when _last_conf_idx was not reset after apply_snapshot() because start_idx() was pointing to a non-existent entry.	2021-02-16 23:06:23 +03:00
Konstantin Osipov	59ea383c7d	raft: return a correct last term on an empty log If the log is empty, we must use snapshot's term, since the log could be right after taking a snapshot when no trailing entries were kept. This fixes a rare possible bug when a log matching rule could be violated during elections by a follower with a log which was just truncated after a snapshot. A separate unit test for the issue will follow.	2021-02-16 21:07:05 +03:00
Konstantin Osipov	6c14775b20	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-16 21:05:44 +03:00
Nadav Har'El	946e63ee6e	cql-pytest: remove "xfail" tag from two passing tests Issue #7595 was already fixed last week, in commit `b6fb5ee912`, so the two tests which failed because of this issue no longer fail and their "xfail" tag can be removed. Refs #7595. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>	2021-02-16 19:17:22 +02:00
Nadav Har'El	737c1c6cc7	cql-pytest: Additional JSON tests This patch adds several additional tests to test/cql-pytest/test_json.py to reproduce additional bugs or clarify some non-bugs. First, it adds a reproducer for issue #8087, where SELECT JSON may create invalid JSON - because it doesn't quote a string which is part of a map's key. As usual for these reproducers, the test passes on Cassandra, and fails on Scylla (so marked xfail). We have a bigger test translated from Cassandra's unit tests, cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys which demonstrates the same problem, but the test added in this patch is much shorter and focuses on demonstrating exactly where the problem is. Second, this patch adds a test that verifies that SELECT JSON works correctly for UDTs or tuples where one of their components was never set - in such a case the SELECT JSON should also output this component, with a "null" value. And this test works (i.e., produces the same result in Cassandra and Scylla). This test is interesting because it shows that issue #8092 is specific to the case of an altered UDT, and doesn't happen for every case of null component in a UDT. Refs #8087 Refs #8092 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>	2021-02-16 16:05:31 +01:00
Avi Kivity	2f3b265dac	Update seastar submodule * seastar 76cff58964...e53a1059f9 (18): > rpc: streaming sink: order outgoing messages Fixes #7552. > http: fix compilation issues when using clang++ > http/file_handler: normalize file-type for mime detection > http/mime_types: add support for svg+xml > reactor: simplify get_sched_stats() > Merge "output_stream: make api noexcept" from Benny > Merge " input_stream: make api noexcept" from Benny > rpc: mark 'protocol' class as final > tls: reloadable_certificate inotify flag is wrong Fixes #8082. > cli: Ignore the --num-io-queues option > io_queue: Do not carry start time in lambda capture > fstream: Cancel all IO-s on file_data_source_impl close > http: add "Transfer-Encoding: chunked" handling > http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked > http: add request content streaming > http: add reading/skipping all bytes in an input_stream > Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov > seastar-addr2line: split multiple addresses on the same line	2021-02-16 16:19:26 +02:00
Avi Kivity	789233228b	messaging: don't inherit from seastar::rpc::protocol messaging_service's rpc_protocol_server_wrapper inherits from seastar::rpc::protocol::server as a way to avoid a is unfortunate, as protocol.hh wasn't designed for inheritance, and is not marked final. Avoid this inheritance by hiding the class as a member. This causes a lot of boilerplate code, which is unfortunate, but this random inheritance is bad practice and should be avoided. Closes #8084	2021-02-16 16:04:44 +02:00
Gleb Natapov	c9392095ce	cql3: store cf_prop_defs as optional instead of shared_ptr It being a shared_ptr is a remnant of translation from Java. Message-Id: <20210216123931.80280-3-gleb@scylladb.com>	2021-02-16 15:58:38 +02:00
Gleb Natapov	805da054e7	cql3: store cf_name as optional in cf_statement instead of shared_ptr It being a shared_ptr is a remnant of translation from Java. Message-Id: <20210216123931.80280-2-gleb@scylladb.com>	2021-02-16 15:58:37 +02:00
Gleb Natapov	6335af625e	cql3: assert that unengaged optional is not accessed in keyspace_element_name::get_keyspace() Message-Id: <20210216085545.54753-2-gleb@scylladb.com>	2021-02-16 15:36:00 +02:00
Gleb Natapov	200ca974c3	Do not access potentially unengaged optional in keyspace_element_name Currently there are places that call keyspace_element_name::get_keyspace() without checking that _ks_name is engaged. Fix those places. Message-Id: <20210216085545.54753-1-gleb@scylladb.com>	2021-02-16 15:35:59 +02:00
Botond Dénes	4d309fc34a	repair: row_level: invoke on_internal_error() on out-of-order partitions repair_writer::do_write() already has a partition compare for each mutation fragment written, to determine whether the fragment belongs to another partition or not. This equal compare can be converted to a tri_compare at no extra cost allowing for detecting out-of-order partitions, in which case `on_internal_error()` is called. Refs: #7623 Refs: #7552 Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>	2021-02-16 15:31:40 +02:00
Benny Halevy	50ca693a02	main: disable stall detector during startup We see long reactor stalls from `logalloc::prime_segment_pool` in debug mode yet the stall detector's purpose is to detect reactor stalls during normal operation where they can increase the latency of other queries running in parallel. Since this change doesn't actually fix the stalls but rather hides them, the following annotations will just reference the respective github issues rather than auto-close them. Refs #7150 Refs #5192 Refs #5960 Restore blocked_reactor_notify_ms right before starting storage_proxy. Once storage_proxy is up, this node affects cluster latency, and so stalls should be reported so they can be fixed. Test: secondary_index_test --blocked-reactor-notify-ms 1 (release) DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>	2021-02-16 13:28:31 +02:00
Tomasz Grabiec	446ea07ac6	Merge "raft: server instance init and raft RPC handlers" from Pavel Solodovnikov This series provides a `raft_services` class to create and store raft schema changes server instances, and also wires up the RPC handlers for Raft RPC verbs. * manmanson/raft-api-server-handlers-v10: raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances raft: move server address handling from `raft_rpc` to `raft_services` class raft: wire up schema Raft RPC handlers raft: raft_rpc: provide `update_address_mapping` and dispatcher functions raft: pass `group_id` as an argument to raft rpc messages raft: use a named constant for pre-defined schema raft group	2021-02-16 11:14:50 +01:00
Pavel Solodovnikov	1ada0abf81	raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances Store an instance inside `raft_services` and reuse it for all raft groups created and managed by `raft_services` instance. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-16 13:09:12 +03:00
Pavel Solodovnikov	8c2a904dc8	raft: move server address handling from `raft_rpc` to `raft_services` class This decouples `raft_gossip_failure_detector` from any particular rpc instance and thus makes it possible to share the same failure detector instance among all raft servers since they are managed in a centralized way by a `raft_services` instance. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-16 13:09:06 +03:00
Pavel Solodovnikov	63cdf4694d	raft: wire up schema Raft RPC handlers This patch adds registration and de-registration of the corresponding Raft RPC verbs handlers. There is a new `raft_services` class that is responsible for initializing the raft RPC verbs and managing raft server instances. The service inherits `seastar::peering_sharded_service<T>`, because we need to route the request to the appropriate shard which is handled by the `shard_for_group` function (currently only handling schema raft group to land on shard 0, otherwise throws an exception). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-16 13:08:59 +03:00
Nadav Har'El	1e1cbaf589	docs/alternator: clean up description of DynamoDB compatibility We had Alternator's current compatibility with DynamoDB described in two places - alternator.md and compatibility.md. This duplication was not only unnecessary, in some places it led to inconsistent claims. In general, the better description was in compatibility.md, so in this patch we remove the compatibility section from alternator.md and instead link to compatibility.md. There was a bit of information that was missing in compatibility.md, so this patch adds it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>	2021-02-16 08:48:28 +01:00
Pavel Emelyanov	9baf1226dc	test/memory_footprint: Print radix tree node sizes After switching cells storage onto compact radix tree it becomes useful to know the tree nodes' sizes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:41:09 +03:00
Pavel Emelyanov	1bdfa355ea	row: Remove old storages Now when the 3rd storage type (radix tree) is all in, old storage can be safely removed. The result is: 1. memory footprint: sizeof(class row): 112 => 16 bytes; sizeof(rows_entry): 126 => 120 bytes; the "in cache" value depends on the number of cells (num of cells: master => patch): 1: 752 => 656; 2: 808 => 712; 3: 864 => 768; 4: 920 => 824; 5: 968 => 936; 6: 1136 => 992; ...; 16: 1840 => 1672; 17: 1904 => 1992 (+88); 18: 1976 => 2048 (+72); 19: 2048 => 2104 (+56); 20: 2120 => 2160 (+40); 21: 2184 => 2208 (+24); 22: 2256 => 2264 (+8); 23: 2328 => 2320; ...; 32: 2960 => 2808. After 32 cells the storage switches into rbtree with 24-bytes per-cell overhead and the radix tree improvement rocketlaunches: 64: 7872 => 6056; 128: 15040 => 9512; 256: 29376 => 18568. 2. perf_mutation test is enhanced by this series and the results differ depending on the number of columns used (tps value, --column-count: master => patch): 1: 59.9k => 57.6k (-3.8%); 2: 59.9k => 57.5k; 4: 59.8k => 57.6k; 8: 57.6k => 57.7k (<- eq); 16: 56.3k => 57.6k; 32: 53.2k => 57.4k (+7.9%). A note on this. Last time 1-column test was ~5% worse which was explained by inline storage of 5 cells that's present on current implementation and was absent in radix tree. An attempt to make inline storage for small radix trees resulted in complete loss of memory footprint gain, but gave fraction of percent to perf_mutation performance. So this version doesn't have inline nodes. The 1.2% improvement from v2 surprisingly came from the tree::clone_from() which in v2 was work-around-ed by slow walk+emplace sequence while this version has the optimized API call for cloning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:35:06 +03:00
Pavel Emelyanov	2053b1c202	row: Prepare row::equal for switch Same as the previous patch, re-implement the row::equal to use the radix_tree iterator for comparison of two index:cell sequences. The std::equal() doesn't work here, since the predicate-fn needs to look at both iterators to call it.key() on (radix tree API feature), while std::equal provides only the T&s in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:31:52 +03:00
Pavel Emelyanov	b5527b3635	row: Prepare row::difference for switch The method effectively walks two pairs of <column_id, cell> and applies the difference to separate row instance. The code added is the copy of the same code below this hunk with the mechanical substitution: c.first -> c.key() c.second -> c->cell it->first -> it.key() it->second -> it.cell because first-s are column_id-s reported by radix tree iterator .key() method and second-s are cells, that were referenced by current code in get_..._vector() from boost::irange and are now directly pointed to by radix tree iterator. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Pavel Emelyanov	f006acc853	row: Introduce radix tree storage type Currently class row uses a union of a vector and a set to keep the cells and switches between them. Add the 3rd type with the radix tree, but never switch to it, just to show how the operations would look like. Later on vector and set will be removed and the whole row will be immediately switched to the radix tree storage. NB: All the added places have indentation deliberately broken, so that next patch will just remove the surrounding (old) code away and (most of) the new one will happen in its place instantly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Pavel Emelyanov	5f276b279e	row-equal: Re-declare the cells_equal lambda For further patching it's handy to have this helper to accept column_id and atomic_cell_or_collection arguments, instead of an std::pair of these two. This is to facilitate next patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Pavel Emelyanov	aa85bc790b	test: Add tests for radix tree Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:27:00 +03:00
Pavel Emelyanov	a5bd68ae5d	utils: Compact radix tree The tree uses integral type as a search key. On each level the local index is next 7 bits from the key, respectively for 32-bit key we have 5 levels. The tree uses 2 memory packing techniques -- prefix compaction and growing node layouts. The prefix compaction is used when a node has only one child. In this case such a node is replaced in its parent with this only child and the child in question keeps "prefix:length" pair on board, that's used to check if the short-cut lookup took the correct path. The growing node layouts makes the nodes occupy as much memory as needed to keep the _present_ keys and there are 2 kinds of layouts. Direct layout is array, intra-node search is plain indexing. The layout storage grows in vector-like manner, but there's a special case for the maximum-sized layout that helps avoiding some boundary checks. Indirect layout keeps two arrays on board -- with values and with indices. The intra-node search is thus a lookup in the latter array first. This layout is used to save memory for sparse keys. Lookup is optimized with SIMD instructions. Inner nodes use direct layouts, as they occupy ~1% of memory and thus need not to be memory efficient. At the same time lookup of a key in the tree potentially walks several inner nodes, so speeding up search for them is beneficial. Leaf nodes are indirect, since they are 99% of memory and thus need to be packed well. The single indirect lookup when searching in the tree doesn't slow things down notably even on insertion stress test. That said, * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects or header + 16 bytes bitmap + 128 objects The header is - backreference (8 bytes) - prefix (4 bytes) - size, layout, capacity (1 byte each) The iterator is one-direction (for simplicity) but it is enough for its main target -- the sparse array of cells on a row. 
Also the iterator has an .index() method that reports back the index of the entry at which it points. This greatly simplifies the tree scans by the class row further. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 19:25:10 +03:00
Piotr Sarna	495b7b5596	alternator: use unique_ptr for storing attribute paths Previous commit eliminated the only copying of the attribute paths, so it's now safe to make the object noncopyable. Message-Id: <5468e8c17d3d42a03c1dd33706bbaac0c58959ce.1613398751.git.sarna@scylladb.com>	2021-02-15 18:22:59 +02:00
Piotr Sarna	7e1641224c	alternator: batch: pass attrs_to_get by a shared pointer The attrs_to_get object was previously copied, but it's quite a heavyweight operation, since this object may contain an instance of std::map or std::unordered_map. To avoid copying whole maps, the object is wrapped in a shared const pointer. Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>	2021-02-15 18:22:56 +02:00
Tomasz Grabiec	f86108aef1	Merge "raft: move ticking to external code" from Alejo As Gleb suggested in a previous review, remove ticker from raft and leave calling tick() to external code. While there, tick faster to speed up tests. * https://github.com/alecco/scylla/tree/tests-17-remove-ticker: raft: replication test: reduce ticker from 100ms to 1ms raft: drop ticker from raft	2021-02-15 18:14:03 +02:00
Botond Dénes	c24f350846	scylla-gdb.py: nonwrapping_interval_printer: fix compatibility with 4.2+ Use the `_interval` member instead of the old `_range` field, but stay compatible with pre 4.2 releases, falling back to `_range` when `_interval` doesn't exist. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>	2021-02-15 18:14:03 +02:00
Pavel Emelyanov	d43ad8738c	array-search: Add helpers to search for a byte in array The radix tree code will need the code to find 8-bit value in an array of some fixed size, so here are the helpers. Those that allow for SIMD implementations have them for x86_64. TODO: Add aarch64 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:47:59 +03:00
Pavel Emelyanov	0ad361b380	test/perf_collection: Add callback to check the speed of clone In some places scylla clones collections of objects, so it's sometimes needed to measure the speed of this operation. This patch adds a placeholder for it, but no implementations for any supported collections. It will be added soon for radix tree. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:46:37 +03:00
Pavel Emelyanov	767253fe24	test/perf_mutation: Add option to run with more than 1 column The --column-count option makes the test generate a schema with the given number of columns and makes the mutation maker fill a random column with the value on each iteration. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:45:42 +03:00
Pavel Emelyanov	fc84ab3418	test/perf_mutation: Prepare to have several regular columns Teach the schema builder and test itself to work on more than one regular column, but for now only use 1, as before. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:44:34 +03:00
Pavel Emelyanov	21adff2a41	test/perf_mutation: Use builder to build schema The test will be taught to use more than one regular column, so switch to builder in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:44:06 +03:00
Piotr Sarna	cbbb7f08a0	Merge 'Alternator: support nested attribute paths... in all expressions' from Nadav Har'El. This series fixes #5024 - which is about adding support for nested attribute paths (e.g., a.b.c[2]) to Alternator. The series adds complete support for this feature in ProjectionExpression, ConditionExpression, FilterExpression and UpdateExpression - and also its combination with ReturnValues. Many relevant tests - and also some new tests added in this series - now pass. The first patch in the series fixes #8043 a bug in some error cases in conditions, which was discovered while working in this series, and is conceptually separate from the rest of the series. Closes #8066 * github.com:scylladb/scylla: alternator: correct implemention of UpdateItem with nested attributes and ReturnValues alternator: fix bug in ReturnValues=UPDATED_NEW alternator: implemented nested attribute paths in UpdateExpression alternator: limit the depth of nested paths alternator: prepare for UpdateItem nested attribute paths alternator: overhaul ProjectionExpression hierarchy implementation alternator: make parsed::path object printable alternator-test: a few more ProjectionExpression conflict test cases alternator-test: improve tests for nested attributes in UpdateExpression alternator: support attribute paths in ConditionExpression, FilterExpression alternator-test: improve tests for nested attributes in ConditionExpression alternator: support attribute paths in ProjectionExpression alternator: overhaul attrs_to_get handling alternator-test: additional tests for attribute paths in ProjectionExpression alternator-test: harden attribute-path tests for ProjectionExpression alternator: fix ValidationException in FilterExpression - and more	2021-02-15 15:45:49 +02:00
Tomasz Grabiec	508f928220	tests: sstables: Test sstable write fails on missing partition_end mid-stream Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>	2021-02-15 15:45:49 +02:00
Benny Halevy	e532585126	test: sstables::test_env: do_with: futurize_invoke func Otherwise, if `func` throws, test_env isn't stopped, as it should. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210214190157.211858-1-bhalevy@scylladb.com>	2021-02-15 15:45:49 +02:00
Wojciech Mitros	1819be5ebc	canonical_mutation: make the data type non-contiguous The canonical_mutation type can contain a large mutation, particularly when the mutation is a result of converting a big schema. Its data was stored in a field of type 'bytes', which is contiguous and may cause a large allocation. This is fixed by simply changing the type to 'bytes_ostream', which is fragmented. The change is compatible because the idl type 'bytes' is compatible with 'bytes_ostream' as a result of `dcf794b`, and all of canonical_mutation's methods use the field as an input stream (ser::as_input_stream), which can be used on 'bytes_ostream' too. Fixes #8074 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #8075	2021-02-15 10:24:47 +01:00
Nadav Har'El	f884104eed	cql-pytest: add more JSON tests This patch adds several more tests reproducing bugs in toJson() and SELECT JSON. First add two xfailing tests reproducing two toJson() issues - #7988 and #8002. The first is that toJson() incorrectly formats values of the "time" type - it should be a string but Scylla forgets the quotes. The second is that toJson() formats "decimal" values as JSON numbers without using an exponent, resulting in memory allocation failure for numbers with high exponents, like 1e1000000000. The actual test for 1e1000000000 has to be skipped because in debug build mode we get a crash trying this huge allocation. So instead, we check 1e1000 - this generates a string of 1000 characters, which is much too much (should just be "1e1000") but doesn't crash. Then we add a reproducing test for issue #8077: When using SELECT JSON on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has a specific way of formatting the result in JSON, and Scylla should do it the same way unless we have a good reason not to. As usual, the new tests pass on Cassandra and fail on Scylla, so they are marked xfail. Refs #7988 Refs #8002 Refs #8077. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>	2021-02-15 10:55:43 +02:00
Nadav Har'El	9e029f09e5	docs: improve CONTRIBUTING.md Start improving CONTRIBUTING.md, as suggested in issue #8037: 1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md. This was mostly a pointer to Seastar's coding style anyway, so it's not helpful to have a separate file which hopeful developers will not find anyway. 2. Mention the Scylla developers mailing list, not just the Scylla users mailing list. The Scylla developers mailing list is where all the action happens, and it's very odd not to mention it. 3. The decision that GitHub pull requests are forbidden was retracted a long time ago, so change the explanation on pull requests. 4. Some smaller phrasing changes. Refs #8037. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>	2021-02-14 22:09:24 +02:00
Nadav Har'El	8c9935359c	build: Stir -O levels Merged patch series from Pavel Emelyanov: The default -O<> levels are considered to produce code that is slow and tedious to test, so it's tempting to increase the level. On the other hand, there were some complaints that recompile-mostly work would suffer from slower builds. This set tries to find a reasonable compromise -- raise the default opt levels and provide the ability to configure one if needed. * 'br-cxx-o-levels-2' of github.com:xemul/scylla: configure: Switch debug build from -O0 to -Og configure: Switch dev build from -O1 to -O2 configure: Make -O flag configurable	2021-02-14 22:09:24 +02:00
Avi Kivity	cb4e1bb0b9	logalloc: reduce gap between std min_free and logalloc min_free With the larger gap, logalloc reserved more memory for std than the background reclaim threshold for running, so it was triggered rarely. With the gap reduced, background reclaim is constantly running in an allocating workload (e.g. cache misses).	2021-02-14 19:09:29 +02:00
Avi Kivity	ca0c006b37	logalloc: background reclaim Set up a coroutine in a new scheduling group to ensure there is a "cushion" of free memory. It reclaims in preemptible mode in order to reduce reactor stalls (contrast with synchronous reclaim that cannot preempt until it achieved its goal). The free memory target is arbitrarily set at 60MB. The reclaimer's shares are proportional to the distance from the free memory target; so a workload that allocates memory rapidly will have the background reclaimer working harder. I rolled my own condition variable here, mostly as an experiment. seastar::condition_variable requires several allocations, while the one here requires none. We should formalize it after we gain more experience with it.	2021-02-14 19:09:29 +02:00
Avi Kivity	35076dd2d3	logalloc: preemptible reclaim Add an option (currently unused by all callers) to preempt reclaim. If reclaim is preempted, it just stops what it is doing, even if it reclaimed nothing. This is useful for background reclaim. Currently, preemption checks are on segment granularity. This is probably too coarse, and should be refined later, but is already better than the current granularity which does not allow preemption until the entire requested memory size was reclaimed.	2021-02-14 19:09:29 +02:00
Alejo Sanchez	5e49650146	raft: replication test: reduce ticker from 100ms to 1ms To speed up replication test reduce the tick time from 100ms to 1ms Speed up: debug 3.7 to 2.5, dev 2.9 to 2.1 seconds Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-14 09:59:06 -04:00
Alejo Sanchez	b41a6822e8	raft: drop ticker from raft Remove ticker callbacks from raft::server. External code should periodically call raft::server::tick(). Update replication_test accordingly. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-14 09:41:42 -04:00
Nadav Har'El	ea338db581	cql-pytest: reproduce bug in setting time column with integer This test reproduces issue #7987, where Scylla cannot set a time column with an integer - whereas the documentation says this should be possible and it also works in Cassandra. The test file includes tests for both ways of setting a time column (using an integer and a string), with both prepared and unprepared statements, and demonstrates that only one combination fails in Scylla - an unprepared statement with an integer. This test xfails on Scylla and passes on Cassandra, and the rest pass on both. Refs #7987. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210128215103.370723-1-nyh@scylladb.com>	2021-02-14 15:09:38 +02:00
Nadav Har'El	49cd9b3fd5	alternator: correct implemention of UpdateItem with nested attributes and ReturnValues This patch fixes the last missing part of nested attribute support in UpdateItem - returning the correct attributes when ReturnValues is requested. When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD or UPDATED_NEW, only the actual updated attribute a.b should be returned, not the entire top-level attribute a as we did before this patch. This patch was made very simple because our existing hierarchy_filter() function already does exactly the right thing, and can trivially be made to accept any attribute_path_map<T> (in our case attribute_path_map<action>), not just attrs_to_get as it did until now. This patch also adds several more checks to the test in test_returnvalues.py to improve the test's coverage even more. Interestingly, I discovered two esoteric cases where DynamoDB does something which makes little sense, but apparently simplified their implementation - but the beautiful thing is that it also simplifies our implementation! See long comments about these two cases in the test code. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	964500e47a	alternator: fix bug in ReturnValues=UPDATED_NEW Commit `0c460927bf` broke UpdateItem's ReturnValues=UPDATED_NEW by moving previous_item while it is still needed. None of the existing tests broke because none of them needed previous_item after it was moved - but it started to break when we add support for nested attribute paths, which need this previous_item. So this patch returns the move to a copy, as it was before the aforementioned patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	33685a683e	alternator: implemented nested attribute paths in UpdateExpression This patch adds full support for nested attribute paths (e.g., a.b[3].c) in UpdateExpression. After in previous patches we already added such support for ProjectionExpression, ConditionExpression and FilterExpression this means the nested attribute paths feature is now complete, so we remove the warning from the documents. However, there is one last loose end to tie and we will do it in the next patch: After this patch, the combination of UpdateExpression with nested attributes and ReturnValues is still wrong, and the test for it in test_returnvalues.py still xfails. Note that previous patches already implemented support for attribute paths in expression evaluations - i.e., the right-hand side of UpdateExpression actions, and in this patch we just needed to implement the left hand side: When an update action is on an attribute a.b we need to read the entire content of the top-level a (an RWM operation), modify just the b part of its json with the result of the action, and finally write back the entire content of a. Of course everything gets complicated by the fact that we can have multiple actions on multiple pieces of the same JSON, and we also need to detect overlapping and conflicting actions (we already have this detection in the attribute_path_map<> class we introduced in a previous patch). I decided to leave one small esoteric difference, reproduced by the xfailing test_update_expression.py::test_nested_attribute_remove_from_missing_item: As expected, "SET x.y = :val" fails for an item if its attribute x doesn't exist or the item itself does not exist. For the update expression "REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly silently passes if the entire item doesn't exist. Alternator does not currently reproduce this oddity - it will fail this write as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	7789606545	alternator: limit the depth of nested paths DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d") to 32 levels. This patch adds the same limit also to Alternator. The exact value of this limit is less important (although it did make sense to choose the same limit as DynamoDB does), but it's important to have some limit: It's often convenient to handle paths with a recursive algorithm, and if we allow unlimited path depth, it can result in unlimited recursion depth, and a crash. Let's avoid this possibility. We detect the over-long path while building the parsed::path object in the parser, and generate a parse error. This patch also includes a test that verifies that both Alternator and DynamoDB have the same 32-level nesting limit on paths. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
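The depth check described in the commit above can be illustrated with a minimal Python sketch. This is not Alternator's parser (which is written in C++): the names `parse_path` and `MAX_PATH_DEPTH` and the path grammar are hypothetical, and counting each `.name` or `[index]` step as one level is an assumption made for the illustration. The point is only that the limit is enforced while the path is being built, so later recursive processing never sees an over-long path.

```python
import re

# Hypothetical constant; 32 is the limit the commit message cites for DynamoDB.
MAX_PATH_DEPTH = 32

# Simplified grammar (an assumption): a root name followed by ".name" or "[index]" steps.
_ROOT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def parse_path(text, max_depth=MAX_PATH_DEPTH):
    """Parse 'a.b[2].c' into ('a', ['b', 2, 'c']), rejecting over-deep paths."""
    root = _ROOT.match(text)
    if root is None:
        raise ValueError(f"invalid path: {text!r}")
    steps = []
    pos = root.end()
    while pos < len(text):
        step = _STEP.match(text, pos)
        if step is None:
            raise ValueError(f"invalid path: {text!r}")
        # Fail during parsing, before the over-long path object exists.
        if len(steps) >= max_depth:
            raise ValueError(f"path nested deeper than {max_depth} levels")
        name, index = step.group(1), step.group(2)
        steps.append(name if name is not None else int(index))
        pos = step.end()
    return root.group(0), steps
```

With this sketch, `parse_path("a" + ".b" * 32)` succeeds, while one more `.b` raises `ValueError`.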
Nadav Har'El	4c7e27c688	alternator: prepare for UpdateItem nested attribute paths This patch prepares UpdateItem for updating of nested attribute paths (e.g., "SET a.b = :val"), but does not yet support them. Instead of _update_expression holding an unsorted list of "actions", we change it to hold a attribute_path_map of actions. This will allow us to process all the actions on a top-level attribute together, and moreover gets us "for free" the correct checking for overlapping and conflicting updates - exactly the same checking we already had in attribute_path_map for ProjectionExpression. Other than this change, most of this patch is just code movement, not functional changes. After this patch, the tests for update path overlap and conflict pass: test_update_expression_multi_overlap_nested and test_update_expression_multi_conflict_nested. We can also mark test_update_expression_nested_attribute_rhs as passing - this test involves an attribute path in the right-hand-side of an update, but the left-hand-side is still a top-level attribute, so it works (it actually worked before this patch - it started working when we implemented attribute paths in expressions, for ConditionExpression and FilterExpression). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	7c5db2da83	alternator: overhaul ProjectionExpression hierarchy implementation For ProjectionExpression we implemented a hierarchical filter object which can be used to hold a tree of attribute paths grouped by the top-level attributes, and also detect overlapping and conflicting entries. For UpdateExpression, we need almost exactly the same object: We need to group update actions (e.g., SET a.b=3) by the top-level attribute, and also detect and fail overlapping or conflicting paths. So in this patch we rewrite the data structure we had for ProjectionExpression in a more generic manner, using the template attribute_path_map<T> - which holds data of type T for each attribute path. We also implement a template function attribute_path_map_add() to add a path/value pair to this map, which includes all the overlap and conflict detecting logic. There shouldn't be functional changes in this patch. The ProjectionExpression code uses the new generic code instead of the specific code, but should work the same. In the next patch we can use the new generic code to implement UpdateExpression as well. The only somewhat functional change is better error messages for conflicting or overlapping paths - which now include one of the conflicting paths. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	f78d33dd73	alternator: make parsed::path object printable Make the parsed::path object printable - which is useful for error messages. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	c2f18e56ea	alternator-test: a few more ProjectionExpression conflict test cases Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:34 +02:00
Nadav Har'El	de62a8c2d3	alternator-test: improve tests for nested attributes in UpdateExpression We already had many tests for nested attributes in UpdateExpression, but this patch adds even more: * Test nested attribute in right-hand-side in assignment: z = a.c.x. * Test for making multiple changes to the same and different top-level attributes in the same update. * Additional cases of overlap between multiple changes. * Tests for conflict between multiple changes. * Tests for writing to a nested path on a non-existent attribute or item. * A stronger test for array append sorts the added items. As this feature was not yet implemented, these tests fail on Alternator, and pass on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-14 12:21:24 +02:00
Pavel Solodovnikov	2ed445bfdd	raft: raft_rpc: provide `update_address_mapping` and dispatcher functions Provide several utility functions which will be used in rpc message handlers: 1. `update_address_mapping` -- add a new (server_id -> inet_address) mapping for a `raft_rpc` instance. This is used to update rpc module with a caller address upon receiving an rpc message from a yet unknown server. 2. A set of dispatcher functions for every rpc call that forward calls to an appropriate `raft::rpc_server` instance (for which `raft::rpc` has a back-pointer). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-12 17:55:48 +03:00
Pavel Emelyanov	ffc9cc9aec	range-streamer: Remove global storage service reference The reference is used by range streamer and (!) storage service itself to find out if the consistent_rangemovement option is ON/OFF. Both places already have the database with config at hands and can be simplified. v2: spellchecking Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210212095403.22662-1-xemul@scylladb.com>	2021-02-12 15:50:30 +01:00
Tomasz Grabiec	26aa000493	Merge "raft: replication test fixes" from Alejo Fix rare debug mode hang and a minor fix. * alejo/tests-16-fix-debug-hang-disruptive-ticks-master-v3: raft: replication test: fix debug mode hangs raft: replication test: remove unnecessary param	2021-02-11 20:35:35 +01:00
Nadav Har'El	a03a8a89a9	cql-pytest: fix flaky timeuuid_test.py The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out the reason was a bug in the test - which this patch fixes. The buggy test created a timeuuid and then compared the time stored in it to the result of the dateOf() CQL function. The problem is that dateOf() returns a CQL "timestamp", which has millisecond resolution, while the timeuuid may have finer than millisecond resolution. The reason why this test rarely failed is that in our implementation, the timeuuid almost always gets a millisecond-resolution timestamp. Only if now() gets called more than once in one millisecond, does it pick a higher time incremented by less than a millisecond. What this patch does is to truncate the time read from the timeuuid to millisecond resolution, and only then compare it to the result of dateOf(). We cannot hope for more. Fixes #8060 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210211165046.878371-1-nyh@scylladb.com>	2021-02-11 19:03:58 +02:00
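The truncation described in the commit above can be sketched in Python with the standard `uuid` module. This is an illustration, not the code from timeuuid_test.py: the helper names `timeuuid_unix_ms` and `make_timeuuid` are hypothetical. A version-1 UUID stores its time as a count of 100-nanosecond intervals since 1582-10-15, so integer division by 10,000 discards the sub-millisecond part that dateOf() cannot represent.

```python
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def timeuuid_unix_ms(u):
    """Unix time of a version-1 UUID, truncated to whole milliseconds."""
    return (u.time - UUID_EPOCH_OFFSET_100NS) // 10_000


def make_timeuuid(ticks_100ns):
    """Hypothetical helper: build a version-1 UUID from a raw 60-bit timestamp."""
    time_low = ticks_100ns & 0xFFFFFFFF
    time_mid = (ticks_100ns >> 32) & 0xFFFF
    time_hi_version = ((ticks_100ns >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))


# Two timeuuids generated within the same millisecond differ in their raw
# timestamps, but agree once truncated to millisecond resolution.
base = UUID_EPOCH_OFFSET_100NS + 1_613_000_000_123 * 10_000
first = make_timeuuid(base)
second = make_timeuuid(base + 1)  # 100 ns later, same millisecond
```

Comparing `timeuuid_unix_ms(first)` with a millisecond-resolution timestamp is then stable, whereas comparing the raw `first.time` and `second.time` values would show the sub-millisecond difference.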
Alejo Sanchez	97338ab53f	raft: replication test: fix debug mode hangs In certain situations where barely enough nodes to elect a new leader are connected, a disruptive candidate can occasionally block the election. For example having servers A B C D E and only A B C are active in a partition. If the test wants to elect A, it has to first make all 3 servers reach election timeout threshold (to make B and C receptive). Then A is ticked till it becomes a candidate and has to send vote requests to the other servers. But all servers have a timer (_ticker) calling their periodic tick() functions. If one of the other servers, say B, gets its timer tick before A sends vote requests, B becomes a (disruptive) candidate and will refuse to vote for A. In our case of only having 3 out of 5 servers connected a single missing vote can hang the election. This patch disables timer ticks for all servers when running custom elections and partitioning. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-11 11:42:31 -04:00
Pavel Solodovnikov	d8dfdfba1e	raft: pass `group_id` as an argument to raft rpc messages This will be used later to filter the requests which belong to the schema raft group and route them to shard 0. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-11 16:25:33 +03:00
Pavel Solodovnikov	3b50cdf1ed	raft: use a named constant for pre-defined schema raft group Introduce a static `schema_raft_state_machine::group_id` constant, which denotes the raft group id for the schema changes server. Also fix the comment on the state machine class declaration to emphasize that the instance will be managed by shard 0. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-02-11 16:24:39 +03:00
Tomasz Grabiec	234f9dbe85	Merge 'Fix mixed cluster schema sync' from Eliran Sinvani When a table is altered in a mixed cluster by a node with a more recent version, the request can fail if there is a difference in schema_features between the two versions. This miniset handles the two problems that prevent the sync. Closes #8011 * github.com:scylladb/scylla: schema: recalculate digest when computed_columns feature is enabled schema tables: Remove mutations to unknown tables when adapting schema mutations schema tables: Register 'scylla_tables' versions that were sent to other nodes	2021-02-11 13:03:38 +01:00
Eliran Sinvani	63b794d104	schema: recalculate digest when computed_columns feature is enabled The schema digest is affected by the computed_columns feature, this means that we have to recalculate our schema digest when this feature is enabled.	2021-02-11 13:48:58 +02:00
Eliran Sinvani	178ced9014	schema tables: Remove mutations to unknown tables when adapting schema mutations Whenever an alter table occurs, the mutations for the just altered table are sent over to all of the replicas from the coordinator. In a mixed cluster the mutations should be adapted to a specific version of the schema. However, the adaptation that happens today doesn't omit mutations to newly added schema tables, to be more specific, mutations to the `computed_columns` table which doesn't exist for example in version 2019.1 This makes altering a table during a rolling upgrade from 2019.1 to 2020.1 dangerous.	2021-02-11 13:48:55 +02:00
Eliran Sinvani	ff1ba9bc2b	schema tables: Register 'scylla_tables' versions that were sent to other nodes In a mixed cluster there can be a situation where `scylla_tables` needs to be sent over to another node, either because of a schema sync or because the node pulls it because it is referenced by a frozen_mutation. The former is not a problem since the sending node chooses the version to send. However, the latter is problematic since `scylla_tables` versions are not registered anywhere. This registers every `scylla_tables` schema version which is used to adapt mutations, since after this happens a schema pull for this version might follow.	2021-02-11 13:47:16 +02:00
Takuya ASADA	856fe12e13	dist/debian: install scylla-node-exporter.service correctly node-exporter systemd unit name is "scylla-node-exporter.service", not "node-exporter.service". Fixes #8054 Closes #8053	2021-02-11 12:19:38 +02:00
Benny Halevy	d01e7e7b58	stream_session: prepare: fix missing string format argument As seen in mv_populating_from_existing_data_during_node_decommission_test dtest: ``` ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found) ``` Fixes #8067 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>	2021-02-11 12:05:32 +02:00
Wojciech Mitros	17634d141b	sstables: add test for checking the latency of updating the sstable_set in a table Added a test which measures the time it takes to replace sstables in a table's sstable_set, using the leveled compaction strategy. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	693b4e0fcd	sstables: move column_family_test class from test/boost to test/lib Column_family_test allows performing private methods on column_family's sstable_set. It may be useful not only in the boost tests, so it's moved from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh. sstable_test.hh includes sstable_test_env.hh, so no includes need to be changed. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	0feff8712e	sstables: use fast copying of the sstable_set instead of rebuilding it The sstable_set enables copying without iterating over all its elements, so it's faster to copy a set and modify it than copy all its elements while filtering the ones that were erased. The modifications are done on a temporary version of the set, so that if an operation fails the base version remains unchanged Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	aa0cd940d6	sstables: replace the sstable_set with a versioned structure Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that allows copying without actually copying all the sstables in the set, while providing the same methods (and some extra) without majorly decreasing their speed. This is achieved by associating all copies with sstable_set versions which hold the changes that were performed in them, and references to the versions that were copied, a.k.a. their parents. The set represented by a version is the result of combining all changes of its ancestors. This causes most methods of the version to have a time complexity dependent on the number of its ancestors. To limit this number, versions that represent copies that have already been deleted are merged with their descendants. The strategy used for deciding when and with which of its children a version should be merged heavily depends on the use case of sstable_sets: there is a main copy of the set in a table class which undergoes many insertions and deletions, and there are copies of it in compaction or mutation readers which are further copied or edited few or zero times. It's worth mentioning that when a copy is made, the copied set should not be modified anymore, because it would also modify the results given by the copy. In order to still allow modifying the copied set, if a change is to be performed on it, the version associated with this set is replaced with a new version depending on the previous one. As we can see, in our use case there is a main chain of versions (with changes from the table), and smaller branches of versions that start from a version from this chain, but are deleted soon after. 
In such case we can merge a version when it has exactly one descendant, as this limits the number of concurrent ancestors of a version to the number of copies of its ancestors are concurrently used. During each such merge, the parent version is removed and the child version is modified so that all operations on it give the same results. In order to preserve the same interface, the sstable_set still contains a lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for unordered_set<shared_sstable>) is now a new structure. Each sstable_set contains a sstable_list but not every sstable_list has to be contained by a sstable_set, and we also want to allow fast copying of sstable_lists, so the reference to the sstable_set_version is kept by the sstable_lists and the sstable_set can access the sstable_set_version it's associated with through its sstable_list. Accessing sstables that are elements of a certain sstable_set copy(so the select, select_sstable_runs and sstable_list's iterator) get results from containers that hold all sstables from all versions(which are stored in a single, shared "versioned_sstable_set_data" structure), and then filter out these sstables that aren't present in the version in question. This version of the sstable_set allows adding and erasing the same sstable repeatedly. Inserting and erasing from the set modifies the containers in a version only when it has an actual effect: if an sstable has been added in the parent version, and hasn't been erased in the child version, adding it again will have no effect. This ensures that when merging versions, the versions have disjoint sets of added, and erased sstables (an sstable can still be added in one and erased in the second). 
It's worth noting that if an sstable has been added in one of the merged sets and erased in the second, the version that remains after merging doesn't need to have any info about the sstable's inclusion in the set - it can be inferred from the changes in previous versions (and it doesn't matter if the sstable has been erased before or after being added). To release pointers to sstables as soon as possible (i.e. when all references to versions that contain them die), if an sstable is added/erased in all child versions that are based on a version which has no external references, this change gets removed from these versions and added to the parent version. If an sstable's insertion gets overwritten as a result, we might be able to remove the sstable completely from the set. We know how many times this needs to happen by counting, for each sstable, in how many different versions it has been added. When a change that adds an sstable gets merged with a change that removes it, or when such a change simply gets deleted alongside its associated version, this count is reduced, and when an sstable gets added to a version that doesn't already contain it, this count is increased. The methods that modify the set's contents give a strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. Fixes #2622 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	48153a1e2c	sstables: remove potential ub If the range expression in a range based for loop returns a temporary, its lifetime is extended until the end of the loop. The same can't be said about temporaries created within the range expression. In our case, *t->get_sstables_including_compacted_undeleted() returns a reference to a const sstable_list, but the t->get_sstables_including_compacted_undeleted() is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the end of the loop, and it may be the sole owner of the referenced sstable_list, so the referenced sstable_list may be already deleted inside the loop too. Fix by creating a local copy of the lw_shared_ptr, and get reference from it in the loop. Fixes #7605 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Wojciech Mitros	e1b494633b	sstables: make sstable_set constructor less error-prone Adding a non-empty set of sstables as the set of all sstables in an sstable_set could cause inconsistencies with the values returned by select_sstable_runs because the _all_runs map would still be initialized empty. For similar reasons, the provided sstable_set_impl should also be empty. Dispel doubts by removing the unordered_set from the constructor, and adding a check of emptiness of the sstable_set_impl. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-02-11 11:02:55 +01:00
Shlomi Livne	718976e794	scylla_io_setup did not configure pre tuned gce instances correctly scylla_io_setup condition for nr_disks was using the bitwise operator (&) instead of logical and operator (and) causing the io_properties files to have incorrect values Fixes #7341 Reviewed-by: Lubos Kosco <lubos@scylladb.com> Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Closes #8019	2021-02-11 11:06:00 +02:00
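The operator mix-up described in 718976e794 can be illustrated with a minimal Python sketch. The function names and the disk-count thresholds below are hypothetical and are not taken from scylla_io_setup; the sketch only shows why `&` and `and` can disagree inside a comparison.

```python
# Hypothetical illustration; names and thresholds are NOT from scylla_io_setup.

def in_range_buggy(nr_disks: int) -> bool:
    # `&` binds tighter than comparisons, so this is parsed as the chained
    # comparison: nr_disks >= (1 & nr_disks) < 4
    return nr_disks >= 1 & nr_disks < 4


def in_range_fixed(nr_disks: int) -> bool:
    # `and` combines the two comparisons as intended.
    return nr_disks >= 1 and nr_disks < 4


if __name__ == "__main__":
    for n in (2, 8):
        print(n, in_range_buggy(n), in_range_fixed(n))
```

With 8 disks the buggy form evaluates `8 >= (1 & 8) < 4`, i.e. `8 >= 0 < 4`, and so accepts a value the intended range check rejects.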
Avi Kivity	9cbbf40710	Merge "register_inactive_read: error handling" from Benny " Currently, register_inactive_read accepts an eviction_notify_handler to be called when the inactive_read is evicted. However, in case there was an error in register_inactive_read the notification function isn't called leaving behind state that needs to be cleaned up. This series separates the register_inactive_reader interface into 2 parts: 1. register_inactive_reader(flat_mutation_reader) - which just registers the reader and returns an inactive_read_handle, if permitted. Otherwise, the notification handler is not called (it is not known yet) and the caller is not expected to do anything fancy at this point that will require cleanup. This optimizes the server when overloaded since we do less work that we'd need to undo in case the reader_concurrency_semaphore runs out of resources. 2. After register_inactive_reader succeeds in returning a valid inactive_read_handle, the caller sets up its local state and may call `set_notify_handler` to set the optional notify_handler and ttl on the i_r_h. After this state, the notify_handler will be called when the inactive_reader is evicted, for any reason. querier_cache::insert_querier was modified to use the above procedure and to handle (and log/ignore) any error in the process. inactive_read_handle and inactive_read keeping track of each other was simplified by keeping an iterator in the handle and a backpointer in the inactive_read object. The former is used to evict the reader and to set the notify_handler and/or ttl without having to lookup the i_r. The latter is used to invalidate the i_r_h when the i_r is destroyed. 
Test: unit(release), querier_cache_test(debug) " * tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla: querier_cache: insert_querier: ignore errors to register inactive reader querier_cache: insert_querier: handle errors querier_utils: mark functions noexcept reader_concurrency_semaphore: register_inactive_read: make noexcept reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional reader_concurrency_semaphore: inactive_read: use intrusive list reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged reader_concurrency_semaphore: inactive_read_handle: swap definition order reader_lifecycle_policy: retire low level try_resume method reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader	2021-02-10 19:09:21 +02:00
Alejo Sanchez	941eceb9c8	raft: replication test: remove unnecessary param Remove unnecessary param from wait_log() Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-10 09:11:17 -04:00
Piotr Sarna	2aa4631148	test: fix a flaky timeout test depending on TTL One of the USING TIMEOUT tests relied on a specific TTL value, but that's fragile if the test runs on the boundary of 2 seconds. Instead, the test case simply checks if the TTL value is present and is greater than 0, which makes the test robust unless its execution lasts for more than 1 million seconds, which is highly unlikely. Fixes #8062 Closes #8063	2021-02-10 14:20:02 +02:00
Piotr Sarna	8f98c0585f	failure_detector: add a missing const qualifier The mean() method is effectively const, so it should be marked as such. Message-Id: <14dd39e8419136909fcf10508c34de3752faa7fe.1612953601.git.sarna@scylladb.com>	2021-02-10 13:04:37 +02:00
Piotr Sarna	aa39130a20	bounded_stats_queue: add missing const qualifiers Most of the methods of this utility are effectively const. Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>	2021-02-10 13:04:35 +02:00
Piotr Jastrzebski	390cef6a96	cdc: Extract create_stream_ids from topology_description_generator This new function will be used in the following patches in additional places. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-02-10 10:24:06 +01:00
Gleb Natapov	d06d21bfae	database: remove add_keyspace() function It is no longer used. Message-Id: <20210209175931.1796263-2-gleb@scylladb.com>	2021-02-10 00:36:02 +01:00
Nadav Har'El	54785604b4	Merge 'Add max concurrent requests to alternator' from Piotr Sarna Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297 Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279. This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client. Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed expected failures are: `botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17 ` Fixes #7294 Closes #8039 * github.com:scylladb/scylla: alternator: server: return api_error instead of throwing alternator: add requests_shed metrics alternator: add handling max_concurrent_requests_per_shard alternator: add RequestLimitExceeded error	2021-02-09 19:55:31 +02:00
Gleb Natapov	d8345c67d9	Consolidate system and non system keyspace creation The code that creates system keyspace open code a lot of things from database::create_keyspace(). The patch makes create_keyspace() suitable for both system and non system keyspaces and uses it to create system keyspaces as well. Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>	2021-02-09 17:18:04 +01:00
Gleb Natapov	51037e94ec	lwt: handle an error during prune operation The error is benign but if it is not handled "unhandled exception" error will be printed in the logs. Message-Id: <20210209150313.GA1708015@scylladb.com>	2021-02-09 16:26:00 +01:00
Tomasz Grabiec	3dd9c5596a	Merge 'Minor tweaks to the failure detector interface' from Piotr Sarna The interface of the failure detector service is cleaned up a little: - an unimplemented method is removed (is_alive) - a return type of another method is fixed (arrival_samples) - a getter for the most recent successful update is added (last_update) This code was tested manually during various overload protection experiments, which check if the failure detector can be used to reject requests which have a very small chance of succeeding within their timeout. Closes #8052 * github.com:scylladb/scylla: failure_detector: add getting last update time point failure_detector: return arrival samples by const reference failure_detector: remove unimplemented is_alive method	2021-02-09 15:23:09 +01:00
Konstantin Osipov	86dec79c1b	raft: rename progress.hh to tracker.hh class tracker is the main class of this module.	2021-02-09 17:07:25 +03:00
Konstantin Osipov	41387225c3	raft: extend single_node_is_quiet test	2021-02-09 17:04:13 +03:00
Piotr Sarna	4acc6fecf0	Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz The same trick is used as in C*: `79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)` The edited CQL test relied on quietly accepting non-existing DCs, so it had to be removed. Also, one boost-test referred to nonexistent `datacenter2` and had to be removed. Fixes #7595 Closes #8056 * github.com:scylladb/scylla: tests: Adjusted tests for DC checking in NTS locator: Check DC names in NTS	2021-02-09 14:45:20 +02:00
Botond Dénes	3d001b5587	query: use local limit for non-limited queries in mixed cluster Since `fea5067df` we enforce a limit on the memory consumption of otherwise non-limited queries like reverse and non-paged queries. This limit is sent down to the replicas by the coordinator, ensuring that each replica is working with the same limit. This however doesn't work in a mixed cluster, when upgrading from a version which doesn't have this series. This has been worked around by falling back to the old max_result_size constant of 1MB in mixed clusters. This however resulted in a regression when upgrading from a pre `fea5067df` to a post `fea5067df` one. Pre `fea5067df` already had a limit for reverse queries, which was generalized to also cover non-paged ones by `fea5067df`. The regression manifested in previously working reverse queries being aborted. This happened because even though the user has set a generous limit for them before the upgrade, in the mixed cluster replicas fall back to the much stricter 1MB limit, temporarily ignoring the configured limit if the coordinator is an old node. This patch solves this problem by using the locally configured limit instead of the max_result_size constant. This means that the user has to take extra care to configure the same limit on all replicas, but at least they will have working reverse queries during the upgrade. Fixes: #8022 Tests: unit(release), manual test by user who reported the issue Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>	2021-02-09 14:45:20 +02:00
Avi Kivity	2f50bf2029	Update seastar submodule * seastar 4c7c5c7c4...76cff5896 (6): > rpc: Make is possible for rpc server instance to refuse connection > reactor: expose cumulative tasks processed statistic > fair_queue: add missing #include <optional> > reactor: optimize need_preempt() thread-local-storage access > Merge " Use reference for backend->reactor link" from Pavel E > test: coroutines: failed coroutine does not throw	2021-02-09 14:45:20 +02:00
Avi Kivity	37b41d7764	test: add missing #include <fstream> std::ofstream is used, but there is no direct include for it. This fails the build with libstdc++ 11. Closes #8050	2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz	97bb15b2f2	tests: Adjusted tests for DC checking in NTS CQL test relied on quietly accepting non-existing DCs, so it had to be removed. Also, one boost-test referred to nonexistent `datacenter2` and had to be removed.	2021-02-09 08:29:35 +01:00
Juliusz Stasiewicz	b6fb5ee912	locator: Check DC names in NTS The same trick is used as in C*: `79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)` Fixes #7595	2021-02-09 07:04:17 +01:00
Benny Halevy	d2b8b3041d	querier_cache: insert_querier: ignore errors to register inactive reader Since the reader may normally be dropped upon registration, hitting an error is equivalent to having it evicted at any time, so just log the exception and ignore it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	9bdb8190ce	querier_cache: insert_querier: handle errors Make insert_querier exception safe. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	b8f935457a	querier_utils: mark functions noexcept They all are trivially noexcept. Mark them so to simplify error handling assumptions in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	6e92f07630	reader_concurrency_semaphore: register_inactive_read: make noexcept Catch errors when allocating an inactive_read and just log them. Return an empty inactive_read_handle in this case, as if the inactive reader was evicted due to lack of resources. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	46c2229b78	reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader Register the inactive reader first with no evict_notify_handler and ttl. Those can be set later, only if registration succeeded. Otherwise, as in the querier example, there is no need to place the querier in the index and erase it on eviction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	d752ea7e91	reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional By default it will be unarmed and with no callback so there's no need to wrap it in a std::optional. This saves an allocation and another potential error case. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	a12c9638b6	reader_concurrency_semaphore: inactive_read: use intrusive list To simplify insertion and eviction into the inactive_reads container, use an intrusive list that requires a single allocation for the inactive_read object itself. This allows passing a reference to the inactive_read to evict it. Note that the reader will be unlinked automatically from the inactive_readers list if the inactive_read_handle is destroyed. This is okay since there is no need to track the inactive_read if the caller loses the i_r_h (e.g. if an error is thrown). It is also safe to evict the inactive_reader while the i_r_h is alive. In this case the i_r will be unlinked after the flat_mutation_reader it holds is moved out of it. bi::auto_unlink will detect that it's already unlinked when destroyed and do nothing. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 22:31:01 +02:00
Benny Halevy	f751e42bf9	reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:52:16 +02:00
Benny Halevy	81cd3d0c51	reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason So try_evict_one_inactive_read could be used also in do_wait_admission in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	e072199b8d	reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error Calling unregister_inactive_read on the wrong semaphore is a blatant bug so better call on_internal_error so it'd be easier to catch and fix. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	9c9b4c85ae	reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged There is no need to lookup the inactive_read if the i_r_h is disengaged, it should not be registered so just return quickly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	769dff6c54	reader_concurrency_semaphore: inactive_read_handle: swap definition order For using boost::intrusive::list for _inactive_reads. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	d565e3fb57	reader_lifecycle_policy: retire low level try_resume method The caller can now just call sem.unregister_inactive_read(irh) directly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	4e8f29ef14	reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader There's no need to hold a unique_ptr<flat_mutation_reader> as flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl> and functions as a unique ptr via flat_mutation_reader_opt. With that, unregister_inactive_read was modified to return a flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>, keeping exactly the same semantics. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Nadav Har'El	e52785be08	alternator: support attribute paths in ConditionExpression, FilterExpression This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3]) for the ConditionExpression in conditional updates, and FilterExpression in queries and scans. After this patch, all previously-xfailing tests in test_condition_expression.py and test_filter_expression.py now pass. The fix is simple: Both ConditionExpression and FilterExpression use the function calculate_value() to calculate the value of the expression. When this function calculates the value of a path, it mustn't just take the top-level attribute - it needs to walk into the specific sub-object as specified by the attribute path. This is not the end of attribute path support, UpdateExpression and ReturnValues are not yet fully supported. This will come in following patches. Refs #5024 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 19:19:09 +02:00
Nadav Har'El	579c7b8dae	alternator-test: improve tests for nested attributes in ConditionExpression Strengthen the tests in test_condition_expression.py for nested attribute paths (e.g., b.y[1]): 1. The test test_update_condition_nested_attributes only tested successful conditions involving nested attributes. Let's also add an unsuccessful condition, to verify we don't accidentally pass every condition involving a nested attribute. 2. Test a case where a non-existent nested attribute is involved in the condition. 3. In the test for an attribute path with references - "#name1.#name2", make sure the test doesn't pass if #name2 is silently ignored. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 19:19:09 +02:00
Piotr Sarna	faca59efa6	failure_detector: add getting last update time point It can be useful to use the information how long ago an endpoint responded to heartbeat.	2021-02-08 16:45:58 +01:00
Tomasz Grabiec	c16e4a0423	migration_manager: Propagate schema changes with reads like we do on writes This fixes the problem where the coordinator already knows about the new schema and issues a read which uses new objects, but the replica doesn't know those objects yet. The read will fail in this case. We can avoid this if we propagate schema changes with reads, like we already do for writes. Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>	2021-02-08 16:49:55 +02:00
Avi Kivity	4082f57edc	Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund Refs #6148 Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing. This patch set does: * Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle. * Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results) * Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp. * Force more eager flush/recycle if we're out of segments Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should. Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating. Closes #7879 * github.com:scylladb/scylla: commitlog: Force earlier cycle/flush iff segment reserve is empty commitlog: Make segment allocation wait iff disk usage > max commitlog: Do partial (memtable) flushing based on threshold commitlog: Make flush threshold configurable table: Add a flush RP mark to table, and shortcut if not above	2021-02-08 16:44:05 +02:00
Avi Kivity	af2d1fa0de	Update abseil submodule Compiles with newer compilers. Added new library wyhash.a to configure.py. * abseil 1e3d25b...9c6a50f (51): > Export of internal Abseil changes > Do not set mvsc linker flags for clang-cl (fixes #874) (#891) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add support for Elbrus 2000 (e2k) (#889) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Add missing word 'library' in the 'status' description (#868) > Export of internal Abseil changes > Export of internal Abseil changes > Include the status library into the main README. (#863) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > fix build dll (#797) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Fix stacktrace on aarch64 architecture. Fixes #805 (#827) > moved deleted functions to public for better compiler errors. 
(#828) > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes > Export of internal Abseil changes	2021-02-08 15:41:46 +02:00
Gleb Natapov	b9a5aff7a6	distributed_loader: drop execute_futures function execute_futures() is just a local reimplementation of when_all_succeed(). Use the latter directly. Message-Id: <20210208114816.GA1658725@scylladb.com>	2021-02-08 13:24:19 +01:00
Nadav Har'El	104ef5242b	alternator: support attribute paths in ProjectionExpression This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3]) for the ProjectionExpression in the various operations where this parameter is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all xfailing tests in test_projection_expression.py now pass. In the previous patch we remembered in the "attrs_to_get" object not only the top-level attributes to read from the table, but also how to filter from it only the desired pieces of the nested document. In this patch we add a filter() function to do this filtering, and call it in the right places to post-process the JSON objects we read from the table. We also had to fix reference resolution in paths to resolve all the components of the path (e.g., #name1.#name2) and not just the top-level attribute. This is not the end of attribute path support, there are still other expressions (ConditionExpression, UpdateExpression, FilterExpression, ReturnValues) where they are not yet supported. This will come in following patches. Refs #5024 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 14:16:40 +02:00
Nadav Har'El	6340619e69	alternator: overhaul attrs_to_get handling In the existing code, the variable "attrs_to_get" is a list of top-level attributes to fetch for an item. It is used to implement features like ProjectionExpression or AttributesToGet in GetItem and other places. However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression, i.e., issue #5024, we need more than that. We still need to know the top- level attribute "a", because this is the granularity we have in the Scylla table (all the content inside "a" is serialized as a single JSON); But we also need to remember exactly which parts inside "a" we will need to extract and return. So in this patch we add a new type, "attrs_to_get", which is more than just a list of top-level attributes. Instead, it is a map, whose keys are the top-level attributes, and the value for each of them is a "hierarchy_filter", an object which describes which part of the attribute is needed. This patch includes the code which converts the AttributesToGet and ProjectionExpression into the new attrs_to_get structure. During this conversion, we recognize two kinds of errors which DynamoDB complains about: We recognize "overlapping" attributes (e.g., requesting both a.b and a.b.c) and "conflicting" attributes (e.g, requesting both a.b and a[1]). After this, two xfailing tests we had for detecting these overlap and conflicts finally pass and their "xfail" label is removed. After this patch, we have the attrs_to_get object which can allow us to filter only the requested pieces of the top-level attributes, but we don't use it yet - so this patch is not enough for complete support of attribute paths in ProjectionExpression. We will complete this support in the next patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 14:16:40 +02:00
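The "overlapping" and "conflicting" path checks described in 6340619e69 can be sketched as follows. This is a hypothetical Python illustration, not the Alternator C++ implementation: `new_node` and `add_path` are invented names, a path is assumed to be already parsed into a list whose string components are member accesses and whose integer components are list indexes, and repeating the same path is treated here as an overlap.

```python
# Hypothetical sketch; not the actual attrs_to_get / hierarchy_filter code.

def new_node():
    # kind: None (unknown), "member" (a.b) or "index" (a[1]) for the children
    # whole: True when the entire sub-object at this node was requested
    return {"kind": None, "children": {}, "whole": False}


def add_path(root, path):
    """Record one requested path, raising ValueError on overlap or conflict."""
    node = root
    for comp in path:
        kind = "index" if isinstance(comp, int) else "member"
        if node["whole"]:
            # A prefix of this path was already requested in full (a.b, then a.b.c).
            raise ValueError("overlapping paths")
        if node["kind"] not in (None, kind):
            # The same node was addressed both as a map and as a list (a.b and a[1]).
            raise ValueError("conflicting paths")
        node["kind"] = kind
        node = node["children"].setdefault(comp, new_node())
    if node["whole"] or node["children"]:
        # The path repeats an earlier one or is a prefix of it (a.b.c, then a.b).
        raise ValueError("overlapping paths")
    node["whole"] = True


if __name__ == "__main__":
    root = new_node()
    add_path(root, ["a", "b"])
    add_path(root, ["d"])
    for bad in (["a", "b", "c"], ["a", 1]):
        try:
            add_path(root, bad)
        except ValueError as e:
            print(bad, "->", e)
```

The tree keyed by top-level attribute mirrors the idea in the commit message that the map's keys are top-level attributes and each value describes which part of the attribute is needed.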
Nadav Har'El	b2dbd56a3a	alternator-test: additional tests for attribute paths in ProjectionExpression This patch adds more tests for attribute paths in ProjectionExpression, that deal with document paths which do not fit the content of the item - e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an integer or a dictionary. Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB fails this request as a "conflict". The reasoning is that no single item can ever have both a.b and a[2] (the first is only valid for dictionaries, the second for lists). It's not clear to me why we still can't return whichever of the two actually is relevant, but the fact is that DynamoDB does not allow it. The new tests fail on Alternator (marked xfailed) and pass on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 14:16:40 +02:00
Nadav Har'El	2a2c5563ba	alternator-test: harden attribute-path tests for ProjectionExpression We have 7 xfailing tests for usage of nested attribute paths (e.g., "a.b.c[7]") in a ProjectionExpression. But some of these tests were too "easy" to pass - a trivial and wrong implementation that just ignores the path and uses the top level attribute (in the above example, "a"), would cause some of them to start passing. So this patch strengthens these tests. They still pass on AWS DynamoDB, and now continue to fail with the aforementioned broken implementation. Refs #5024. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 14:16:40 +02:00
Nadav Har'El	653610f4bc	alternator: fix ValidationException in FilterExpression - and more The first condition expressions we implemented in Alternator were the old "Expected" syntax of conditional updates. That implementation had some specific assumptions on how it handles errors: For example, in the "LT" operator in "Expected", the second operand is always part of the query, so an error in it (e.g., an unsupported type) resulted in a ValidationException error. When we implemented ConditionExpression and FilterExpression, we wrongly used the same functions check_compare(), check_BETWEEN(), etc., to implement them. This results in some inaccurate error handling. The worst example is what happens when you use a FilterExpression with an expression such as "x < y" - this filter is supposed to silently skip items whose "x" and "y" attributes have unsupported or different types, but in our implementation a bad type (e.g., a list) for y resulted in a ValidationException which aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH) we actually noticed the slightly different behavior needed and implemented the same operator twice - with ugly code duplication. But in other operators we missed this problem completely. This patch first adds extensive tests of how the different expressions (Expected, QueryFilter, FilterExpression, ConditionExpression) and the different operators handle various input errors - unsupported types, missing items, incompatible types, etc. Importantly, the tests demonstrate that there is often different behavior depending on whether the bad input comes from the query, or from the item. Some of the new tests fail before this patch, but others pass and were useful to verify that the patch doesn't break anything that already worked correctly previously. As usual, all the tests pass on DynamoDB. Finally, this patch fixes all these problems. 
The comparison functions like check_compare() and check_BETWEEN() now not only take the operands, they also take booleans saying if each of the operands came from the query or from an item. The old-syntax callers (Expected or QueryFilter) always say that the first operand is from the item and the second is from the query - but in the new-syntax callers (ConditionExpression or FilterExpression) any or all of the operands can come from the query and need verification. The old duplicated code for check_BEGINS_WITH() - with a TODO to remove it - is finally removed. Instead we use the same idea of passing booleans saying if each of its operands came from an item or from the query. Fixes #8043 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-02-08 14:16:30 +02:00
Pavel Emelyanov	a05adb8538	database: Remove global storage proxy reference The db::update_keyspace() needs sharded<storage_proxy> reference, but the only caller of it already has it and can pass one as argument. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210205175611.13464-3-xemul@scylladb.com>	2021-02-08 12:59:46 +01:00
Pavel Emelyanov	8490c9ff6a	transport: Remove global storage service reference On start the transport controller keeps the storage service on server config's lambda just to let the server grab a database config option. The same can be achieved by passing the sharded database reference to sharded<server>::start, so that each server instance gets local database with config. As a nice side effect transport::server's config looks more like a config with simple values and without methods and/or lambdas on board. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210205175611.13464-1-xemul@scylladb.com>	2021-02-08 12:58:49 +01:00
Piotr Sarna	d23584c8f7	failure_detector: return arrival samples by const reference There's no point in always returning the whole map by value - callers can decide to copy the map on their own if need be.	2021-02-08 11:50:32 +01:00
Piotr Sarna	445e6e44f4	failure_detector: remove unimplemented is_alive method The method was never implemented, so it makes no sense to keep it in the header.	2021-02-08 11:49:50 +01:00
Amnon Heiman	4498bb0a48	API: Fix aggregation in column_family A few methods in the column_family API were doing the aggregation wrong, specifically, bloom filter disk size. The issue is not always visible; it happens when there are multiple filter files per shard. Fixes #4513 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #8007	2021-02-08 12:11:30 +02:00
Raphael S. Carvalho	e1261d10f1	table: Avoid useless allocations when updating cache on memtable flush completion we're unconditionally using make_combined_mutation_source(), which causes extra allocations, even if memtable was flushed into a single sstable, which is the most common case. memtable will only be flushed into more than one sstable if TWCS is used and memtable had old data written into it due to out-of-order writes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>	2021-02-06 20:03:33 +02:00
Pavel Emelyanov	7e68ed6a5d	configure: Switch debug build from -O0 to -Og Previous patch changed the -O flag for dev builds. This had no effect on unit tests compile+run time, and was aimed at improving the individual tests, dtest, stress- and other tests runtimes. This change is mainly focused on improving the debug-mode full unit tests running, while keeping the debuggability: the compile+run time gets ~10 minutes shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-05 19:46:28 +03:00
Pavel Emelyanov	4fd5ef92ae	configure: Switch dev build from -O1 to -O2 Based on the original patch from Nadav. The -O1-generated code is too slow. Raising the opt level slows compilation down ~9%, but greatly improves the testing time. E.g. running the alternator test alone is 2.5 times faster with -O2 (118 vs 48 seconds). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-05 19:46:28 +03:00
Pavel Emelyanov	7ced07d22c	configure: Make -O flag configurable It was noticed that current optimization levels do not generate fast enough code for dev builds. On the other hand just increasing the default optimization level will make re-compile-mostly work much more frustrating. The new configure.py option allows selecting the desired -O option value by hand. Current hard-coded values are used as defaults. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-05 19:46:28 +03:00
Botond Dénes	7910e745bc	scylla-gdb.py: std_list: restore python2 compatibility std_list has an iterator object which provides the python3 `__next__()` method only. Python2 wants a method called `next()`. As it is trivial to provide both, do that to allow debugging on centos7. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>	2021-02-05 12:47:53 +01:00
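The std_list compatibility fix described in the entry above can be illustrated with a minimal sketch. The `ListIterator` class below is hypothetical and is not the actual scylla-gdb.py code; it only shows the general idea of exposing the Python 3 `__next__()` method under the Python 2 name `next()` as well.

```python
class ListIterator(object):
    """Hypothetical iterator usable from both Python 2 and Python 3."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol entry point.
        if self._pos >= len(self._items):
            raise StopIteration
        value = self._items[self._pos]
        self._pos += 1
        return value

    # Python 2 looks for next() instead, so alias it to the same method.
    next = __next__


collected = list(ListIterator([1, 2, 3]))
```

Aliasing keeps a single implementation, so the two entry points cannot drift apart.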
Gleb Natapov	8dbe222331	raft: compile raft by default	2021-02-05 12:40:20 +01:00
Konstantin Osipov	adc87aa278	raft: re-lookup progress object after a configuration change Fix raft_fsm_test failure in debug mode. ASAN complained that follower_progress is used in append_entries_reply() after it was destroyed. This could happen if in maybe_commit() we switched to a new configuration and destroyed old progress objects. The fix is to lookup the object one more time after maybe_commit().	2021-02-05 12:40:19 +01:00
Piotr Sarna	d7848750d8	alternator: server: return api_error instead of throwing Throwing a C++ exception creates unnecessary overhead, so when an unsupported operation is encountered, the api error is directly returned instead of being thrown.	2021-02-04 17:23:41 +01:00
Piotr Sarna	868e04e8e2	alternator: add requests_shed metrics The counter shows the total number of requests shed due to overload.	2021-02-04 17:23:41 +01:00
Piotr Sarna	1b8c946ad7	alternator: add handling max_concurrent_requests_per_shard The config value is already used to set an upper limit of concurrent CQL requests, and now it's also abided by alternator. Excessive requests result in returning RequestLimitExceeded error to the client. Tests: manual Running multiple concurrent requests via the test suite results in: botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17	2021-02-04 17:23:41 +01:00
Piotr Sarna	32dc692b8b	alternator: add RequestLimitExceeded error The error code is used when requests are shed due to crossing the user-defined threshold of the rate of incoming requests.	2021-02-04 17:14:21 +01:00
Avi Kivity	7f3083739f	Merge "sstables: Share partition index pages between readers" from Tomasz " Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. It used to be like that a few years ago, but we moved to per-reader cache to implement incremental promoted index parsing, to avoid OOMs with large partitions. At that time, the solution involved caching input streams inside partition index entries, which couldn't be reused between readers. This could have been solved differently. Instead of caching input streams, we can cache information needed to create them (temporary_buffer<>). This solution takes this approach. This series is also needed before we can implement promoted index caching. That's because before the promoted index can be shared by readers, the partition index entries, which hold the promoted index, must also be shareable. The pages live as long as there is at least one index reader referencing them. So it only helps when there is concurrent access. In the future we will keep them for longer and evict on memory pressure. Promoted index cursor is no longer created when the partition index entry is parsed, but it's created on-demand when the top-level cursor enters the partition. The promoted index cursor is owned by the top-level cursor, not by the partition index entry. Below are the results of an experiment performed on my laptop which demonstrates the improvement in performance. Load driver command line: ./scylla-bench \ -workload uniform \ -mode read \ --partition-count=10 \ -clustering-row-count=1 \ -concurrency 100 Scylla command line: scylla --developer-mode=1 -c1 -m1G --enable-cache=0 The workload is IO-bound. Before, we needed 2 I/O per read, now we need 1 (amortized). The throughput is ~70% higher. 
Before: time ops/s rows/s errors max 99.9th 99th 95th 90th median mean 1s 4706 4706 0 35ms 30ms 27ms 25ms 24ms 21ms 21ms 2s 4646 4646 0 42ms 31ms 31ms 27ms 25ms 21ms 22ms 3.1s 4670 4670 0 40ms 27ms 26ms 25ms 25ms 21ms 21ms 4.1s 4581 4581 0 39ms 33ms 33ms 27ms 26ms 21ms 22ms 5.1s 4345 4345 0 40ms 37ms 35ms 32ms 31ms 21ms 23ms 6.1s 4328 4328 0 49ms 40ms 34ms 32ms 31ms 22ms 23ms 7.1s 4198 4198 0 45ms 36ms 35ms 31ms 30ms 22ms 24ms 8.2s 3913 3913 0 51ms 50ms 50ms 39ms 35ms 24ms 26ms 9.2s 4524 4524 0 34ms 31ms 30ms 28ms 27ms 21ms 22ms After: time ops/s rows/s errors max 99.9th 99th 95th 90th median mean 1s 7913 7913 0 25ms 25ms 20ms 15ms 14ms 12ms 13ms 2s 7913 7913 0 18ms 18ms 18ms 16ms 14ms 12ms 13ms 3s 8125 8125 0 20ms 20ms 17ms 15ms 14ms 12ms 12ms 4s 5609 5609 0 41ms 35ms 29ms 28ms 27ms 13ms 18ms 5.1s 8020 8020 0 18ms 17ms 17ms 15ms 14ms 12ms 13ms 6.1s 7102 7102 0 27ms 27ms 24ms 19ms 18ms 13ms 14ms 7.1s 5780 5780 0 26ms 26ms 26ms 23ms 22ms 17ms 18ms 8.1s 6530 6530 0 37ms 34ms 26ms 22ms 20ms 15ms 15ms 9.1s 7937 7937 0 19ms 19ms 17ms 17ms 16ms 12ms 13ms Tests: - unit [release] - scylla-bench " * tag 'share-partition-index-v1' of github.com:tgrabiec/scylla: sstables: Share partition index pages between readers sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream() sstables: index_reader: Do not store cluster index cursor inside partition indexes	2021-02-04 17:27:49 +02:00
Nadav Har'El	1953b1b006	alternator-test: increase timeout in tracing test Our test for tracing Alternator requests can't be sure when tracing a request finished, because tracing is asynchronous and has no official ending signal. So before we can conclude that tracing failed, we need to wait until a timeout, which in the current code was roughly 6.4 seconds (the timeout logic is unnecessarily convoluted, but to make a long story short it has exponential sleeps starting with 0.1 second and ending with 3.2 seconds, totaling 6.4 seconds). It turns out that sporadically, in test runs on overcommitted test machines with the very slow debug build, we fail this test with this timeout. So this patch increases the timeout to 51.2 seconds. It should be more than enough for everyone. Famous last words :-) Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210204151554.582260-1-nyh@scylladb.com>	2021-02-04 17:17:07 +02:00
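The timeout arithmetic mentioned in the entry above can be checked with a short sketch. It assumes a plain doubling schedule of sleeps; the helper `total_wait` is hypothetical, and the real test's loop may be structured differently.

```python
def total_wait(first_sleep, last_sleep):
    """Sum a doubling sequence of sleeps from first_sleep up to last_sleep."""
    total = 0.0
    sleep = first_sleep
    # Small tolerance so float rounding cannot drop the final step.
    while sleep <= last_sleep * 1.0001:
        total += sleep
        sleep *= 2
    return total


old_total = total_wait(0.1, 3.2)   # 0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2
new_total = total_wait(0.1, 25.6)  # assumed schedule with three more doublings
```

Under this assumption the old schedule sums to about 6.3 seconds, matching the "roughly 6.4 seconds" quoted above, and a schedule ending at 25.6 seconds sums to about 51.1 seconds, close to the new 51.2-second figure.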
Tomasz Grabiec	63188abb87	sstables: Share partition index pages between readers Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. This change is also needed before we can implement promoted index caching. That's because before the promoted index can be shared by readers, the partition index entries, which hold the promoted index, must also be shareable. The pages live as long as there is at least one index reader referencing them. So it only helps when there is concurrent access. In the future we will keep them for longer and evict on memory pressure. Promoted index cursor is no longer created when the partition index entry is parsed, but it's created on-demand when the top-level cursor enters the partition. The promoted index cursor is owned by the top-level cursor, not by the partition index entry.	2021-02-04 15:24:07 +01:00
Tomasz Grabiec	c232d71fc8	sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()	2021-02-04 15:24:07 +01:00
Tomasz Grabiec	5ed559c8c6	sstables: index_reader: Do not store cluster index cursor inside partition indexes Currently, the partition index page parser will create and store promoted index cursors for each entry. The assumption is that partition index pages are not shared by readers so each promoted index cursor will be used by a single index_reader (the top-level cursor). In order to be able to share partition index entries we must make the entries immutable and thus move the cursor outside. The promoted index cursor is now created and owned by each index_reader. There is at most one such active cursor per index_reader bound (lower/upper).	2021-02-04 15:23:55 +01:00
Avi Kivity	713a159600	tools: toolchain: add simplified procedure for creating dbuild images The current procedure for building images is complicated, as it requires access to x86_64, aarch64, and s390x machines. Add an alternative procedure that is fully automated, as it relies on emulation on a single machine. It is slow, but requires less attention. Closes #8024	2021-02-04 15:37:36 +02:00
Avi Kivity	bd7fbcc0cf	tools: toolchain: dbuild: keep original user's groups The supplementary groups are removed by default, so add them back. Supplementary groups are useful for group-shared directories like ccache. I added them to the podman-only branch since I don't know if this works for docker. If a docker user verifies it works there too, we can move it to the generic code. Closes #8020	2021-02-04 15:36:55 +02:00
Gleb Natapov	e9043565b3	raft: add counters to raft server The patch adds a set of counters for various events inside the raft implementation to facilitate monitoring and debugging. Message-Id: <20210204125313.GA1513786@scylladb.com>	2021-02-04 14:19:54 +01:00
Benny Halevy	f5fe8283cc	test: reader_permit: do not include reader_concurrency_semaphore.hh in header file We can do with a forward declaration instead to reduce the dependency, and include reader_concurrency_semaphore.hh in test/lib/reader_permit.cc instead. We need to include "../../reader_permit.hh" to get the definition of class reader_permit. We need the include path to prevent recursive include (or rename test/lib/reader_permit.hh but this creates a lot of code churn). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>	2021-02-04 15:02:16 +02:00
Benny Halevy	338c190842	reader_concurrency_semaphore: inactive_read_handle: mark methods noexcept All are trivially noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210204113327.1027792-1-bhalevy@scylladb.com>	2021-02-04 13:57:42 +02:00
Benny Halevy	ba4b8dd6e5	sstables: row.hh: no need to include reader_concurrency_semaphore.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>	2021-02-04 13:42:06 +02:00
Tomasz Grabiec	2e3f6a9622	tests: perf_fast_forward: Print output directory Message-Id: <20210203180053.230627-1-tgrabiec@scylladb.com>	2021-02-04 10:39:41 +02:00
Tomasz Grabiec	e0ceb454c0	tests: perf_fast_forward: Print error hints to stdout They point to lines printed to stdout, so should be aligned with them. Message-Id: <20210203180016.230547-1-tgrabiec@scylladb.com>	2021-02-04 10:39:41 +02:00
Avi Kivity	fcd48adcc4	Update seastar submodule * seastar b5b2ee53d...4c7c5c7c4 (1): > Merge "add support for printing backtraces on one line" from Benny Fixes #5464.	2021-02-03 14:01:45 +02:00
Benny Halevy	ca6f5cb0bc	test: commitlog_test: test_allocation_failure: fill memory using smaller allocations commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output). So if there are discontiguous small memory blocks, they can be used to satisfy an allocation even if no contiguous memory blocks are available. To prevent that, as Avi suggested, this change allocates in 128K blocks and frees the last one to succeed (so that we won't fail on allocating continuations). Fixes #8028 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>	2021-02-03 12:21:20 +02:00
Pavel Solodovnikov	856b0b3a58	raft: introduce `raft_gossip_failure_detector` class This is an implementation of `raft::failure_detector` for Scylla that uses gms::gossiper to query `is_alive` state for a given raft server id. Server ids are translated to `gms::inet_address` to be consumed by `gms::gossiper` with the help of `raft_rpc` class, which manages the mapping. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>	2021-02-03 10:45:18 +01:00
Tomasz Grabiec	f8ae46f294	Merge "raft: RPC module implementation" from Pavel Solodovnikov This series provides additional RPC verbs and corresponding methods in `messaging_service` class, as well as a scylla-specific Raft RPC module implementation that uses `netw::messaging_service` under the hood to dispatch RPC messages. * https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6: raft: introduce `raft_rpc` class raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls configure.py: compile serializer.cc	2021-02-03 10:43:58 +01:00
Benny Halevy	55e3df8a72	dist: scylla_util: prevent IndexError when no ephemeral_disks were found Currently we call firstNvmeSize before checking that we have enough (at least 1) ephemeral disks. When none are found, we hit the following error (see #7971): ``` File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in if idata.is_recommended_instance(): File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance diskSize = self.firstNvmeSize File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize firstDisk = ephemeral_disks[0] IndexError: list index out of range ``` This change reverses the order and first checks that we found enough disks before getting the first disk size. Fixes #7971 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8027	2021-02-03 11:30:18 +02:00
Avi Kivity	10606aadb5	Update tools/java submodule * tools/java 78c8ef4f54...0187829d5e (1): > nodetool: alternate way to specify table name which includes a dot Fixes #6521.	2021-02-03 11:27:33 +02:00
Botond Dénes	46b795b5fd	mutation: consume(): add reverse mode `mutation::consume()` is used by range scans to convert the intermediate `reconcilable_result` to the final `query::result` format. When the range scan is in reverse, `mutation::consume()` has to feed the clustering fragments to the consumer in reverse order, but currently `mutation::consume()` always uses the natural order, breaking reverse range scans. This patch fixes this by adding a `consume_in_reverse` parameter to `mutation::consume()`, and consequently support for consuming clustering fragments in reverse order. Fixes: #8000 Tests: unit(release, debug), dtest(thrift_tests.py:TestMutations.test_get_range_slice) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>	2021-02-03 11:00:47 +02:00
Piotr Sarna	c03363b520	README: fix a dead link for building instructions The link was outdated, since its destination was moved to a subdirectory. Message-Id: <b0e0eedaea4f26acf050a91ab9eed1ca37a838bb.1612338584.git.sarna@scylladb.com>	2021-02-03 10:59:50 +02:00
Avi Kivity	913d970c64	Merge "Unify inactive readers" from Botond " Currently inactive readers are stored in two different places: * reader concurrency semaphore * querier cache With the latter registering its inactive readers with the former. This is an unnecessarily complex (and possibly surprising) setup that we want to move away from. This series solves this by moving the responsibility of storing inactive reads solely to the reader concurrency semaphore, including all supported eviction policies. The querier cache is now only responsible for indexing queriers and maintaining relevant stats. This makes the ownership of the inactive readers much more clear, hopefully making Benny's work on introducing close() and abort() a little bit easier. Tests: unit(release, debug:v1) " * 'unify-inactive-readers/v2' of https://github.com/denesb/scylla: reader_concurrency_semaphore: store inactive readers directly querier_cache: store readers in the reader concurrency semaphore directly querier_cache: retire memory based cache eviction querier_cache: delegate expiry to the reader_concurrency_semaphore reader_concurrency_semaphore: introduce ttl for inactive reads querier_cache: use new eviction notify mechanism to maintain stats reader_concurrency_semaphore: add eviction notification facility reader_concurrency_semaphore: extract evict code into method evict()	2021-02-03 10:59:04 +02:00
Piotr Sarna	d395305ddd	api: fix retrieving replied RPC messages The API call referred to a nonexistent callback, which is now renamed to better match the API path and actually implemented. Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>	2021-02-03 09:42:17 +02:00
Pekka Enberg	5670276163	Update seastar submodule * seastar cb3aaf07...b5b2ee53 (1): > perftune.py: fix assignment after extend and add asserts Fixes #8008	2021-02-02 15:27:13 +02:00
Tomasz Grabiec	873e732042	Merge "Switch partition rows onto B-tree" from Pavel Emelyanov This is the continuation of the row-cache performance improvements, this time -- the rework of clustering keys part. The goal is to solve the same set of problems: - logN eviction complexity - deep and sparse tree Unlike partitions, this cache has one big feature that makes it impossible to just use existing B+ tree: There's no copyable key at hand. The clustering key is the managed_bytes() that is not nothrow-copy-constructible, nor is it hashable for lookup due to prefix lookup. Thus the choice is the B-tree, which is also N-ary one, but doesn't copy keys around. B-trees are like B+, but can have key:data pairs in inner nodes, thus those nodes may be significantly bigger than B+ ones, that have data-s only in leaf trees. Not to make the memory footprint worse, the tree assumes that keys and data live on the same object (the rows_entry one), and the tree itself manages only the key pointers. Not to invalidate iterators on insert/remove the tree nodes keep pointers on keys, not the keys themselves. The tree uses tri-compare instead of less-compare. This makes the .find and .lower_bound methods do ~10% fewer comparisons on random insert/lookup test. 
Numbers: - memory_footprint: B-tree master rows_entry size: 216 232 1 row in-cache: 968 960 (because of dummy entry) in-memtable: 1006 1022 100 rows in-cache: 50774 50856 in-memtable: 50620 50918 - mutation_test: B-tree master tps.average: 891177 833896 - simple_query: B-tree master tps.median: 71807 71656 tps.maximum: 71847 71708 * xemul/clustering-cache-over-btree-4: mutation_partition: Save one keys comparison partition_snapshot_row_cursor: Remove rows pointer mutation_partition: Use B-tree insertion sugar perf-test : Print B-tree sizes mutation_partition: Switch cache of rows onto B-tree partition_snapshot_reader: Rename cmp to less for explicity mutation_partition: Make insertion bullet-proof mutation_partition: Use tri-compare in non-set places flat_mutation_reader: Use clear() in destroy_current_mutation() rows_entry: Generalize compare utils: Intrusive B-tree (with tests) tests: Generalize bptree compaction test tests: Generalize bptree stress test	2021-02-02 12:26:02 +01:00
Tomasz Grabiec	75eb97b12c	Merge 'Commitlog multi-entry write' from Calle Wilund Fixes #7615 Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles. Guarantees that all entries are written or none are, and that they will be flushed to disk together. Small test included. Closes #7616 * github.com:scylladb/scylla: commitlog_test: Add multi-entry write test commitlog: Add "add_entries" call to allow inputting N mutations commitlog: Make commitlog entries optionally multi-entry commitlog: Move entry_writer definition to cc file	2021-02-02 12:23:19 +01:00
Tomasz Grabiec	7b17969a6e	Merge 'sstable: reader: preempt after every fragment' from Avi Kivity Whenever we push a fragment, we check whether the buffer is full and return proceed::no if so, so that the state machine pauses and lets the consumer continue. This patch adds an additional condition - if preemption is needed, we also return proceed::no. This drops us back to the outer loop (in sstable_mutation_reader::fill_buffer), which will yield to the reactor as part of seastar::do_until(). Two cases (partition_start and partition_end) did not have the check for is_buffer_full(); it is added now. This can trigger if the partition has no rows. Unlike the previous attempt, push_ready_fragments() is not touched. The extra preemption opportunities triggered a preexisting bug in clustering_ranges_walker; it is fixed in the first patch of the series. I tested this by reading from a large partition with a simple schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE. However, even without the patch I only got sporadic stalls with the detector set to 1ms, so it's possible I'm not testing correctly. Test: unit (dev, debug, release) Fixes #7883. Closes #7928 * github.com:scylladb/scylla: sstable: reader: preempt after every fragment clustering_range_walker: fix false discontiguity detected after a static row	2021-02-02 12:21:58 +01:00
Benny Halevy	0fecc78d88	user_function: throw on_internal_error if executed outside a seastar thread Rather than asserting, as seen in #7977. This shouldn't crash the server in production. Add unit test that reproduces this scenario and verifies the internal error exception. Fixes #7977 Test: unit(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>	2021-02-02 13:03:39 +02:00
Calle Wilund	720a47fe8a	commitlog_test: Add multi-entry write test	2021-02-02 10:41:08 +00:00
Calle Wilund	c5f6125039	commitlog: Add "add_entries" call to allow inputting N mutations Fixes #7615 Allows N mutations to be written "atomically" (i.e. in the same call). Either all are added to the segment, or none. Returns rp_handle vector corresponding to the call vector.	2021-02-02 10:41:08 +00:00
Calle Wilund	5fcc2066ed	commitlog: Make commitlog entries optionally multi-entry Allows writing more than one blob of data using a single "add" call into segment. The old call sites will still just provide a single entry. To ensure we can determine the health of all the entries as a unit, we need to wrap them in a "parent" entry. For this, we bump the commitlog segment format and introduce a magic marker, which if present, means we have entries in entry, totalling "size" bytes. We checksum the entra header, and also checksum the individual checksums of each sub-entry (faster). This is added as a post-word. When parsing/replaying, if v2+ and marker, we have to read all entries + checksums into memory, verify, and _then_ we can actually send the info to caller.	2021-02-02 10:41:08 +00:00
Calle Wilund	6bef3f9cc3	commitlog: Move entry_writer definition to cc file Should not be public/visible	2021-02-02 10:32:44 +00:00
Juliusz Stasiewicz	29e4737a9b	transport: Fix abort on certain configurations of native_transport_port(_ssl) The reason was accessing the `configs` table out of index. Also, native_transport_port-s can no longer be disabled by setting to 0, as per the table below. Rules for port/encryption (the same apply to shard_aware counterpart): np := native_transport_port.is_set() nps := native_transport_port_ssl.is_set() ceo := ceo.at("enabled") == "true" eq := native_transport_port_ssl() == native_transport_port() +-----+-----+-----+-----+ \| np \| nps \| ceo \| eq \| +-----+-----+-----+-----+ \| 0 \| 0 \| 0 \| * \| => listen on native_transport_port, unencrypted \| 0 \| 0 \| 1 \| * \| => listen on native_transport_port, encrypted \| 0 \| 1 \| 0 \| * \| => nonsense, don't listen \| 0 \| 1 \| 1 \| * \| => listen on native_transport_port_ssl, encrypted \| 1 \| 0 \| 0 \| * \| => listen on native_transport_port, unencrypted \| 1 \| 0 \| 1 \| * \| => listen on native_transport_port, encrypted \| 1 \| 1 \| 0 \| * \| => listen on native_transport_port, unencrypted \| 1 \| 1 \| 1 \| 0 \| => listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted \| 1 \| 1 \| 1 \| 1 \| => native_transport_port(_ssl), encrypted +-----+-----+-----+-----+ Fixes #7783 Fixes #7866 Closes #7992	2021-02-02 11:32:31 +02:00
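The port/encryption rules in the table above can be restated as a small decision function. This is a sketch only: `listeners` and its return format are hypothetical and do not mirror the actual transport code; the four booleans correspond to the `np`, `nps`, `ceo` and `eq` columns.

```python
def listeners(np, nps, ceo, eq):
    """Return (port_option, encrypted) pairs to listen on, per the table.

    np, nps: whether native_transport_port / native_transport_port_ssl is set
    ceo:     whether client encryption is enabled
    eq:      whether both ports have the same value
    """
    if not nps:
        # Only the plain port matters; encryption follows ceo.
        return [("native_transport_port", ceo)]
    if not ceo:
        # SSL port set but encryption disabled: nonsense unless np is set.
        return [("native_transport_port", False)] if np else []
    if not np:
        return [("native_transport_port_ssl", True)]
    if eq:
        # Both options name the same port: one encrypted listener.
        return [("native_transport_port", True)]
    return [("native_transport_port", False),
            ("native_transport_port_ssl", True)]


both = listeners(True, True, True, False)
```

Writing the rules as early returns makes it easy to see that `eq` is consulted only in the last two rows, which is what the `*` entries in the table express.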
Avi Kivity	285303b131	Update tools/jmx submodule * tools/jmx 2c95650...949cefc (2): > dist/redhat: stop using systemd macros, call systemctl directly > Remove obsolete FIXME See scylladb/scylla-jmx#94.	2021-02-02 11:29:36 +02:00
Takuya ASADA	7b310c591e	dist/redhat: stop using systemd macros, call systemctl directly Fedora version of systemd macros does not work correctly on CentOS7, since CentOS7 does not support "file trigger" feature. To fix the issue we need to stop using systemd macros, call systemctl directly. See scylladb/scylla-jmx#94 Closes #8005	2021-02-02 11:28:07 +02:00
Avi Kivity	da4fa0629a	Merge "sstables: add sstable_origin to scylla_metadata" from Benny " This series extends the scylla_metadata sstable component to hold an optional textual description of the sstable origin. It describes where the sstables originated from (e.g. memtable, repair, streaming, compaction, etc.) The origin string is provided by the sstable writer via sstable_writer_config, written to the scylla_metadata component, and loaded on sstable::load(). A get_origin() method was added to class sstable to retrieve its origin. It returns an empty string by default if the origin is missing. Compaction now logs the sstable origin for each sstable it compacts, and it generates the sstable origin for all sstables it generates. Regular compaction origin is simply set to "compaction" while other compaction types are mentioned by name, as "cleanup", "resharding", "reshaping", etc. A unit test was added to test the sstable_origin by writing either an empty origin or a random string, and then comparing the origin retrieved by sstable::load to the one written. Test: unit(release) Fixes #7880 " * tag 'sstable-origin-v2' of github.com:bhalevy/scylla: compaction: log sstable origin sstables: scylla_metadata: add support for sstable_origin sstables: sstable_writer_config: add origin member	2021-02-02 10:35:11 +02:00
Pavel Emelyanov	54ddb5a70a	mutation_partition: Save one keys comparison The apply_monotonically checks if the cursor is behind the source position to decide whether or not to push it forward (with the lower_bound call). The 2nd comparison is done to check if either the cursor was ahead or if lower_bound result actually hit the key. This 2nd comparison can be avoided: - the 1st case needs B-tree lower_bound API extension that reports if the bound is match or not. - the 2nd one is covered with reusing tri-compare result from the 1st comparison Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	4ccce97396	partition_snapshot_row_cursor: Remove rows pointer The pointer is needed to erase an element by its iterator from the rows container. The B-tree has this method on iterator and it does NOT need to walk up the tree to find its root, so the complexity is still amortized constant. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	8e7c1e049b	mutation_partition: Use B-tree insertion sugar The B-tree .insert methods accept unique pointers and release them Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	a92eb2f7a9	perf-test : Print B-tree sizes After the switch from BST to B-tree the memory footprint includes inner/leaf nodes from the B-tree, so it's useful to know their sizes too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	5c0f9a8180	mutation_partition: Switch cache of rows onto B-tree The switch is pretty straightforward, and consists of - change less-compare into tri-compare - rename insert/insert_check into insert_before_hint - use tree::key_grabber in mutation_partition::apply_monotonically to exception-safely transfer a row from one tree to another - explicitly erase the row from tree in rows_entry::on_evicted, there's an O(1) tree::iterator method for this - rewrite rows_entry -> cache_entry transformation in the on_evicted to fit the B-tree API - include the B-tree's external memory usage into stats That's it. The number of keys per node is set to 12 with linear search and linear extension of 20 because - experimenting with tree shows that numbers 8 through 10 keys with linear search show the best performance on stress tests for insert/find-s of keys that are memcmp-able arrays of bytes (which is an approximation of current clustering key compare). More keys work slower, but still better than any bigger value with any type of search up to 64 keys per node - having 12 keys per node is the threshold at which the memory footprint for B-tree becomes smaller than for boost::intrusive::set for partitions with 32+ keys - 20 keys for linear root eats the first-split peak and still performs well in linear search As a result the footprint for B tree is bigger than the one for BST only for trees filled with 21...32 keys by 0.1...0.7 bytes per key. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	165255e2bd	partition_snapshot_reader: Rename cmp to less for explicity This is less comparator, cmp is used as a sign of tri-compare in this set. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	ee9e104541	mutation_partition: Make insertion bullet-proof The bi::intrusive::set::insert-s are non-throwing, so it's safe to add new entry like this auto* ne = new entry; set.insert(ne); and not worry about memory leak. B-tree's insert will be throwing, so we need some way to free the new entries in case of exception. There's already a way for this: std::unique_ptr<entry> ne = std::make_unique<entry>(); set.insert(*ne); ne.release(); so make every insertion into the set work this way in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	926f748a3d	mutation_partition: Use tri-compare in non-set places The mutation_partition::_rows will be switched onto B-tree with tri comparator, so to clearly identify the places not affected by it, switch them onto tri-compare in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	bfcd6a4bb7	flat_mutation_reader: Use clear() in destroy_current_mutation() Currently the code uses a loop of unlink_leftmost_without_rebalance calls. B-tree does have it, but plain clearing of the tree is a bit faster with clear(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	306c40939b	rows_entry: Generalize compare Turn the rows_entry less-comparator's calls into a template as they are nothing but wrappers on top of rows_entry tri-comparator. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:30 +03:00
Pavel Emelyanov	2f7c03d84c	utils: Intrusive B-tree (with tests) The design of the tree goes from the row-cache needs, which are 1. Insert/Remove do not invalidate iterators 2. Elements are LSA-manageable 3. Low key overhead 4. External tri-comparator 5. As little actions on insert/remove as possible With the above the design is Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes and N pointers on keys (not keys themselves). Two differences: inner nodes have array of pointers on kids, leaf nodes keep pointer on the tree (to update left- and rightmost tree pointers on node move). Nodes do not keep pointers/references on trees, thus we have O(1) move of any object, but O(logN) to get the tree size. Fortunately, with big keys-per-node value this won't result in too many steps. In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter is for constant-time begin() and end(). Keys are managed by user with the help of embeddable member_hook instance, which is 1 pointer in size. The code was copied from the B+ tree one, then heavily reworked, the internal algorithms turned out to differ quite significantly. For the sake of mutation_partition::apply_monotonically(), which needs to move an element from one tree into another, there's a key_grabber helping wrapper that allows doing this move respecting the exception-safety requirement. As measured by the perf_collections test the B-tree with 8 keys is faster than the std::set, but slower than the B+tree: vs set vs b+tree fill: +13% -6% find: +23% -35% Another neat thing is that 1-key insertion-removal is ~40% faster than for BST (the same number of allocations, but the key object is smaller, less pointers to set-up and less instructions to execute when linking node with root). v4: - equip insertion methods with on_alloc_point() calls to catch potential exception guarantees violations earlier - add unlink_leftmost_without_rebalance.
The method is borrowed from boost intrusive set, and is added to kill two birds -- provide it, as it turns out to be popular, and use a bit faster step-by-step tree destruction than plain begin+erase loop v3: - introduce "inline" root node that is embedded into tree object and in which the 1st key is inserted. This greatly improves the 1-key-tree performance, which is pretty common case for rows cache v2: - introduce "linear" root leaf that grows on demand This improves the memory consumption for small trees. This linear node may and should over-grow the NodeSize parameter. This comes from the fact that there are two big per-key memory spikes on small trees -- 1-key root leaf and the first split, when the tree becomes 1-key root with two half-filled leaves. If the linear extension goes above NodeSize it can flatten even the 2nd peak - mitigate the keys indirection a bit Prefetching the keys while doing the intra-node linear scan and the nodes while descending the tree gives ~+5% of fill and find - generalize stress tests for B and B+ trees - cosmetic changes TODO: - fix few inefficiencies in the core code (walks the sub-tree twice sometimes) - try to optimize the leaf nodes, that are not left-/rightmost not to carry unused tree pointer on board Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:29 +03:00
Pavel Emelyanov	6d63bdbefe	tests: Generalize bptree compaction test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:28:59 +03:00
Pavel Emelyanov	8bdad0bb28	tests: Generalize bptree stress test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:28:57 +03:00
Avi Kivity	db4b9215dd	sstable: reader: preempt after every fragment Whenever we push a fragment, we check whether the buffer is full and return proceed::no if so, so that the state machine pauses and lets the consumer continue. This patch adds an additional condition - if preemption is needed, we also return proceed::no. This drops us back to the outer loop (in sstable_mutation_reader::fill_buffer), which will yield to the reactor as part of seastar::do_until(). Two cases (partition_start and partition_end) did not have the check for is_buffer_full(); it is added now. This can trigger if the partition has no rows. Unlike the previous attempt, push_ready_fragments() is not touched. I tested this by reading from a large partition with a simple schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE. However, even without the patch I only got sporadic stalls with the detector set to 1ms, so it's possible I'm not testing correctly. Test: unit (dev) Fixes #7883.	2021-02-01 19:32:07 +02:00
Avi Kivity	7634a90dd2	clustering_range_walker: fix false discontiguity detected after a static row clustering_range_walker detects when we jump from one row range to another. When a static row is included in the query, the constructor sets up the first before/after bounds to be exactly that static row. That creates an artificial range crossing if the first clustering range is contiguous with the static row. This can cause the index to be consulted needlessly if we happen to fall back to sstable_mutation_reader after reading the static row. A unit test is added. Ref #7883.	2021-02-01 19:32:07 +02:00
Pavel Solodovnikov	9d17a654a6	raft: use null_sharder for raft tables Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>	2021-02-01 18:52:04 +02:00
Gleb Natapov	382ee066bf	database: drop duplicated function The database class has two duplicated functions, keyspaces() and get_keyspaces(). Drop the former since it is used in one place only. Message-Id: <20210201135333.GA1403508@scylladb.com>	2021-02-01 18:52:04 +02:00
Tomasz Grabiec	eac9c1d80a	Merge "raft: configuration changes with joint consensus" from Kostja Support configuration changes based on joint consensus. When a user adds a configuration entry, commit an interim "joint consensus" configuration to the log first, and transition to the final configuration once both C_old and C_new configurations accept the joint entry. Misc cleanups. * scylla-dev/raft-config-changes-v2: raft: update README.md raft: add a simple test for configuration changes raft: joint consensus, wire up configuration changes in the API raft: joint consensus, count votes using joint config raft: joint consensus, wire up configuration changes in FSM raft: joint consensus, update progress tracker with joint configuration raft: joint consensus, don't store configuration in FSM raft: joint consensus, keep track of the last confchange index in the log raft: joint consensus, implement helpers in class configuration raft: joint consensus, use unordered_set for server_address list raft: joint consensus, switch configuration to joint raft: rename check_committed() to maybe_commit() raft: fix spelling and add comments	2021-02-01 18:52:04 +02:00
Benny Halevy	4b309e0829	compaction: log sstable origin Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Benny Halevy	77328a936a	sstables: scylla_metadata: add support for sstable_origin Add new scylla_metadata_type::SSTableOrigin. Store and retrieve a sstring to the scylla metadata component. Pass sstable_writer_config::origin from the mx sstable writer and ignore it in the k_l writer. Add unit test to verify the sstable_origin extension using both empty and a random string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Benny Halevy	22f6023ac3	sstables: sstable_writer_config: add origin member Add a string describing where the sstables originated from (e.g. memtable, repair, streaming, compaction, etc.) If configure_writer is called with a nullptr, the origin will be equal to an empty string. Introduce test_env_sstables_manager that provides an overload of configure_writer with no parameters that calls the base-class' configure_writer with "test" origin. This was to reduce the code churn in this patch and to keep the tests simple. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-01 16:45:52 +02:00
Nadav Har'El	75a4281bff	cql-pytest: test the units supposed to be usable for "duration" type This patch adds a test for the different units which are supposed to be usable for assigning a "duration" type in CQL. It turns out that all documented units are supported correctly except µs (with a unicode mu), so the test reproduces issue #8001. The test xfails on Scylla (because µs is not supported) and passes on Cassandra. Refs: #8001. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210131192220.407481-1-nyh@scylladb.com>	2021-02-01 11:05:10 +01:00
Avi Kivity	bb202db1ff	Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA Since makeself script changes current umask, scylla_setup causes "scylla does not work with current umask setting (0077)" error. To fix that we need to use the latest version of makeself, and specify the --keep-umask option. Fixes #6243 Closes #6244 * github.com:scylladb/scylla: dist/offline_redhat: fix umask error dist/offline_installer/redhat: support cross build	2021-01-31 18:47:27 +02:00
Takuya ASADA	49e4f318a0	dist/offline_redhat: fix umask error Since makeself script changes current umask, scylla_setup causes "scylla does not work with current umask setting (0077)" error. To fix that we need to use the latest version of makeself, and specify the --keep-umask option. Fixes #6243	2021-01-31 21:37:49 +09:00
Takuya ASADA	74d7e31576	dist/offline_installer/redhat: support cross build Support cross build by running CentOS 7 on docker, so it is now possible to build on Fedora. Also support switching the container image; tested on Oracle Linux 7 and CentOS 7/8.	2021-01-31 21:37:49 +09:00
Avi Kivity	9271e4bf6e	Update seastar submodule * seastar 52d41277a...cb3aaf07e (2): > tls: reloadable_credentials_base: add_dir_watch: fix root dir detection > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility	2021-01-31 13:28:45 +02:00
Raphael S. Carvalho	298d54ceb0	utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned last fragment is unconditionally pushed to set of fragments, so if data size is fragment-aligned, an empty fragment will be needlessly pushed to the back of the fragment set. note: i haven't tested if empty fragment at back of set will cause issues, i think it won't, but this should be avoided anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Raphael S. Carvalho	e745f1e697	utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Raphael S. Carvalho	08e838d4b5	utils/fragmented_temporary_buffer: simplify allocate_to_fit() 1) reuse default_fragment_size for knowledge of max fragment size 2) fragments_count is not a good name as it doesn't include last non-full fragment (if present), so rename it. 3) simplify calculation of last fragment size Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Pavel Solodovnikov	b9a280161d	raft: introduce `raft_rpc` class The patch contains a skeleton implementation for the Scylla-specific Raft RPC module. It uses `netw::messaging_service` as underlying mechanism to send RPC messages. The instance is supposed to be bound to a single raft group. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-30 01:12:35 +03:00
Pavel Solodovnikov	1a979dbba2	raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls All RPC module APIs except for `send_snapshot` should resolve as soon as the message is sent, so these messages are passed via `send_message_oneway_timeout`. `send_snapshot` message is sent via `send_message_timeout` and returns a `future<>`, which resolves when snapshot transfer finishes or fails with an exception. All necessary functions to wire the new Raft RPC verbs are also provided (such as `register` and `unregister` handlers). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-30 01:11:17 +03:00
Pavel Solodovnikov	e30a55ba2f	configure.py: compile serializer.cc This file was not added to the configure.py, which `raft_sys_table_storage` series was supposed to do. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-30 01:09:32 +03:00
Konstantin Osipov	a8f2fa7fa0	raft: update README.md	2021-01-29 22:07:08 +03:00
Konstantin Osipov	b7692af8bc	raft: add a simple test for configuration changes Test adding, removing, and replacing a node. With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-29 22:07:08 +03:00
Konstantin Osipov	c7b5a60320	raft: joint consensus, wire up configuration changes in the API Now that we've implemented joint consensus based configuration changes, replace add_server()/remove_server() with a more general set_configuration().	2021-01-29 22:07:08 +03:00
Konstantin Osipov	afadc7c0a1	raft: joint consensus, count votes using joint config Send RequestVote to a joint config. We need to exclude self from the list of peers if we're not part of the current configuration. Avoid disrupting the cluster in this case. Maintain separate status for previous and current config when counting votes.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	8b86d91754	raft: joint consensus, wire up configuration changes in FSM When add_entry() with new configuration is submitted, create a joint configuration and switch to it immediately. Refuse to enter joint configuration if a configuration change is already in progress. When the leader has committed an entry with joint configuration, append a new entry with final configuration and switch to it. Resign leadership if the current leader is not part of a new configuration. When we change from A, B, C to B, C, D and the leader is A, then, when C_new starts to be used, the leader is not part of the current configuration, so it doesn't have to be in the tracker. Do not try to find & advance leader progress unconditionally then.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	18a684ba11	raft: joint consensus, update progress tracker with joint configuration The leader doesn't have to be part of the current configuration, so add a way to access follower_progress for the leader only if it is present. Upon configuration changes, preserve progress information for intact nodes, remove for removed, and create a new progress object for added nodes. When tracking commit progress in joint configuration mode, calculate two commit indexes for two configurations, and choose the smallest one.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	20df1955b2	raft: joint consensus, don't store configuration in FSM In follower state, FSM doesn't know the current cluster configuration. Instead of trying to watch the follower log for configuration changes to keep FSM copy up to date, remove it from FSM altogether since the follower doesn't need it anyway. When entering candidate or leader state, fetch the most recent configuration from the log and initialize the state specific state with it.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	b29181875c	raft: joint consensus, keep track of the last confchange index in the log When initializing the log, find the most recent configuration change index, if present. Maintain the most recent configuration change index when the log is truncated or entries are appended to it. The last configuration change index will be used by FSM when it enters candidate or leader state to fetch the current configuration. We never truncate beyond a single in-progress configuration change, so storing the previous value of last_conf_idx helps avoid log backward scan on truncation in 100% of cases. Remove all unused log constructors.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	6e128aa357	raft: joint consensus, implement helpers in class configuration	2021-01-29 22:07:07 +03:00
Konstantin Osipov	1ca738d9a2	raft: joint consensus, use unordered_set for server_address list	2021-01-29 22:07:07 +03:00
Konstantin Osipov	df944f953c	raft: joint consensus, switch configuration to joint In order to work correctly in transitional configuration, participants must enter it after crashes, restarts and state changes. This means it must be stored in Raft log and snapshot on the leader and followers. This is most easily done if transitional configuration is just a flavour of standard configuration. In FSM, rename _current_config to _configuration, it now contains both current and future configuration at all times.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	076e46af9e	raft: rename check_committed() to maybe_commit() This is what the function does, and it's the name used in other implementations.	2021-01-29 22:07:07 +03:00
Gleb Natapov	aad0209b1c	raft: fix spelling and add comments Fix spelling errors in a few comments, improve comments. With fix-ups by Gleb Natapov <gleb@scylladb.com>	2021-01-29 22:07:07 +03:00
Pavel Emelyanov	575c992a35	test: Bring test_apply_monotonically_is_monotonic back to work The idea of the monotonicity checking test is: try to apply one random partition to another random one sequentially failing allocations. Each time allocation fails (with the bad_alloc exception) -- check the exception guarantee is respected, then apply (!) the very same two partitions to each other. At the end of the test we make sure, that an exception may pop up at any point of application and it will be safe. This idea is flawed currently. When verifying the guarantee the test moves the 2nd partition and leaves it empty for the next loop iteration. So right on the 2nd attempt to apply partitions it becomes a no-op, doesn't fail and no more exceptions arise. Fix by restoring both partitions at the end of each check. Broken since `74db08165d`. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210129153641.5449-1-xemul@scylladb.com>	2021-01-29 18:47:15 +01:00
Tomasz Grabiec	16eb4c6ce2	Merge "raft: system table backed persistency module" from Pavel Solodovnikov This series contains an initial implementation of raft persistency module that uses `raft` system table as the underlying storage model. "system.raft" table will be used as a backend storage for implementing raft persistence module in Scylla. It combines both raft log, persisted vote and term, and snapshot info. The table is partitioned by group id, thus allowing multi-raft operation. The rest of the table structure mirrors the fields of corresponding core raft structures defined in `raft.hh`, such as `raft::log_entry`. The raft table stores only the latest snapshot id while the actual snapshot will be available in a separate table called `system.raft_snapshots`. The schema of `raft_snapshots` mirrors the fields of `raft::snapshot` structure. IDL definitions are also added for every raft struct so that we automatically provide serialization and deserialization facilities needed both for persistency module and for future RPC implementation. The first patch is a side-change needed to provide complete serialization/deserialization for `bytes_ostream`, which we need when persisting the raft log in the table (since `data` is a variant containing `raft::command` (aka `bytes_ostream`) among others). `bytes_ostream` was lacking `deserialize` function, which is added in the patch. The second patch provides serializer for `lw_shared_ptr<T>` which will be used for `raft::append_entries`, which has a field with `std::vector<const lw_shared_ptr<raft::log_entry>>` type. There is also a patch to extend `fragmented_temporary_buffer` with a static function `allocate_to_fit` that allocates an instance of the fragmented buffer that has a specified size. Individual fragment size is limited to 128kb. The patch-set also contains the test suite covering basic functionality of the persistency module.
* manmanson/raft-api-impl-v11: raft/sys_table_storage: add basic tests for raft_sys_table_storage raft: introduce `raft_sys_table_storage` class utils: add `fragmented_temporary_buffer::allocate_to_fit` raft: add IDL definitions for raft types raft: create `system.raft` and `system.raft_snapshots` tables serializer: add `serializer<lw_shared_ptr<T>>` specialization serializer: add `deserialize` function overload for `bytes_ostream`	2021-01-29 11:40:39 +02:00
Pavel Solodovnikov	e309502c42	raft/sys_table_storage: add basic tests for raft_sys_table_storage The test suite covers the most basic use cases for the system table backed raft persistency module: * store/load vote and term * store/load snapshot * store snapshot with log tail truncation * store/load log entries * log truncation Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 02:00:27 +03:00
Pavel Solodovnikov	aebb1987b5	raft: introduce `raft_sys_table_storage` class This is the implementation of raft persistency module that uses `raft` system table as the underlying storage model. The instance is supposed to be bound to a single raft group. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 02:00:12 +03:00
Pavel Solodovnikov	d14dc030ac	utils: add `fragmented_temporary_buffer::allocate_to_fit` Introduce `fragmented_temporary_buffer::allocate_to_fit` static function returning an instance of the buffer of a specified size. The allocated buffer fragments have a size of at most 128kb. `bytes_ostream` has the same hard-coded limit, so just use the same here. This patch will be later needed for `raft::log_entry` raw data serialization when writing to the underlying persistent storage. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 01:59:16 +03:00
Pavel Solodovnikov	e1504bbf0e	raft: add IDL definitions for raft types Changes to the `configuration` and `tagged_uint64` classes are needed to overcome limitations of the IDL compiler tool, i.e. we need to supply a constructor to the struct initializing all the members (raft::configuration) and also need to make an accessor function for private members (in case of raft::tagged_uint64). All other structs mirror raft definitions in exactly the same way they are declared in `raft.hh`. `tagged_id` and `tagged_uint64` are used directly instead of their typedef-ed companions defined in `raft.hh` since we don't want to introduce indirect dependencies. In such case it can be guaranteed that no accidental changes made outside of the idl file will affect idl definitions. This patch also fixes a minor typo in `snapshot_id_tag` struct used in `snapshot_id` typedef.	2021-01-29 01:59:10 +03:00
Pavel Solodovnikov	cf5b8c4b79	raft: create `system.raft` and `system.raft_snapshots` tables System raft table will be used as a backend storage for implementing raft persistence module in Scylla. It combines both raft log, persisted vote and term, and snapshot info. The table is partitioned by group id, thus allowing multi-raft operation. The rest of the table structure mirrors the fields of corresponding core raft structures defined in `raft.hh`, such as `raft::log_entry`. The raft table stores only the latest snapshot id while the actual snapshot will be available in a separate table called `system.raft_snapshots`. The schema of `raft_snapshots` mirrors the fields of `raft::snapshot` structure. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 01:59:04 +03:00
Pavel Solodovnikov	83c26e542d	serializer: add `serializer<lw_shared_ptr<T>>` specialization This one works similar to `serializer<optional<T>>` and will be later needed for serializing `raft::append_request`, which has a field containing `lw_shared_ptr`. Users to be warned, though: this code assumes that the pointer is never null. This is done to mirror the serialize implementation for `lw_shared_ptr:s` in the messaging_service.cc, which is subject to being deleted in favor of the impl in the `serializer_impl.hh`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 01:58:46 +03:00
Avi Kivity	b32ece6975	Update tools/java submodule * tools/java 4a55b81941...78c8ef4f54 (1): > nodetool: do no treat table name with dot as a secondary index Fixes #6521.	2021-01-28 16:16:47 +02:00
Kamil Braun	bf115e7d69	schema_tables: put schema tables on shard 0 We use a custom sharder for all schema tables: every table under the `system_schema` keyspace, plus `system.scylla_table_schema_history`. This sharder puts all data on shard 0. To achieve this, we hardcode the sharder in initial schema object definitions. Furthermore - since the sharder is not stored inside schema mutations yet - whenever we deserialize schema objects from mutations, we modify the sharder based on the schema's keyspace and table names. A regression test is added to ensure no one forgets to set the special sharder for newly added schema tables. This test assumes that all newly added schema tables will end up in the `system_schema` keyspace (other tables may go unnoticed, unfortunately). Closes #7947	2021-01-28 13:28:22 +02:00
Avi Kivity	32cdcc0c8b	Merge "sstables: consolidate reader factory methods" from Botond " Currently there are three different methods for creating an sstable reader: * one for single key reads * one for ranged reads * and one nobody uses This patch-set consolidates all these into a single `make_reader()` method, which behind the scenes uses the same logic to dispatch to the right sstable reader constructor that `sstables::as_mutation_source()` uses. This patch-set is part of an effort to clean up the jungle that is the various reader creation methods. The next step is to clean up the sstable_set, which has even more methods. One very sad discovery I made while working on this patch-set is that we still default `mutation_reader::forwarding` to `yes` in the sstable range reader creator method and in the `mutation_source::make_reader()`. I couldn't assume that all callers are passing what they mean as the value for that parameter. I found many sites in tests that create forwardable single partition readers. This is also something we should address soon. Tests: unit(release, debug:v3) " * 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla: cql_query_test: add unit test covering the non-optimal TWCS sstable read path sstable_mutation_reader: consolidate constructors tests: don't pass temporary ranges to readers sstables: sstable_mutation_reader: remove now unused whole sstable constructor sstables: stats: remove now unused sstable_partition_reads counter sstable: remove read_.row._flat() methods tree-wide: use sstables::make_reader() instead of the read_.row._flat() methods sstables: pass partition_range to create_single_key_sstable_reader() sstables: sstable: add make_reader()	2021-01-28 12:05:06 +02:00
Botond Dénes	1e9ce62ee6	cql_query_test: add unit test covering the non-optimal TWCS sstable read path The sstable read path for TWCS tables takes a different path when the optimized read path cannot be used. This path was found to be not covered at all by unit tests which allowed a trivial use-after-free to slip in. Add a unit test to cover this path as well, so ASAN can catch such bugs in the future.	2021-01-28 11:34:03 +02:00
Avi Kivity	55609f2033	Update seastar submodule * seastar a287bb1a3...52d41277a (8): > fair_queue: Preempted requests got re-queued too far > scripts/perftune.py: remove repeated items after merging options from file > file.hh: Remove fair_queue.hh > Merge "Reloadable TLS certificate tolerance" from Calle > Merge "Cancellable IO" from Pavel E > abort-source: Improve the subscriptions management > fair_queue: Improve requests preemption while in pending state > http: add support for Default handler (/*)	2021-01-28 08:45:33 +01:00
Konstantin Osipov	b4f875f08e	uuid: reduce code dependency on UUID_gen.hh Do not include UUID_gen.hh in trace_state.hh and lists.hh to reduce header level dependency on it. Message-Id: <20210127173114.725761-2-kostja@scylladb.com>	2021-01-27 20:08:29 +02:00
Botond Dénes	6024ef5dad	sstable_mutation_reader: consolidate constructors The two remaining sstable constructors are very similar apart from the content of the initialize lambda. Speaking of which, the two remaining initializer lambdas can be easily merged into one too. So this patch does just that: it consolidates the two constructors into one, and consolidates the initializer lambdas as well, extracting them into a member method. This means we have to store the previously captured variables as members, but this is actually a good thing: when debugging we can see the range and slice the reader is reading, and we are not actually paying for it either -- they were already stored, just out of sight.	2021-01-27 17:38:17 +02:00
Botond Dénes	dd26a96e63	tests: don't pass temporary ranges to readers The sstable_mutation_reader, like all other mutation readers, expects that the partition-range passed to it is kept alive by its creator for the duration of its lifetime. However, the single-key constructor of the sstable reader was more tolerant, as it only extracted the key from the range, essentially requiring only the key to be kept alive (but not the containing range). Naturally in time some code came to rely on it and ended up passing temporary ranges to the reader. This behaviour will no longer be acceptable as we are about to consolidate the various sstable reader constructors, uniformly requiring that the range is kept alive. So this patch fixes up the tests so they work with this stricter requirement. Only two occurrences were found.	2021-01-27 17:38:17 +02:00
Botond Dénes	43ad64db78	sstables: sstable_mutation_reader: remove now unused whole sstable constructor	2021-01-27 17:38:17 +02:00
Botond Dénes	ec6c540c30	sstables: stats: remove now unused sstable_partition_reads counter	2021-01-27 17:38:17 +02:00
Botond Dénes	5f18e9eb37	sstable: remove read_.row._flat() methods	2021-01-27 17:38:17 +02:00
Botond Dénes	c3b4e990a2	tree-wide: use sstables::make_reader() instead of the read_.row._flat() methods	2021-01-27 17:38:17 +02:00
Botond Dénes	080bc2ffec	sstables: pass partition_range to create_single_key_sstable_reader() We want to unify the various sstable reader creation methods and this method taking a ring position instead of a partition range like everybody else stands in the way of that. This in effect reverts `68663d0de`.	2021-01-27 17:38:14 +02:00
Wojciech Mitros	a1f93e4297	api: use a list instead of a vector to remove a large allocation in api handler Follow-up to #7917 The size of a cf::column_family_info is 224 bytes, so an std::vector that contains one for each column family may be very large, causing allocations of over 1MB. Considering the vector is used only for iteration, it can be changed to a non-contiguous list instead. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #7973	2021-01-27 16:02:07 +02:00
Avi Kivity	aec231ba2e	Merge "Unify query paths" from Botond " Currently we have two parallel query paths: * database::query() -> table::query() -> data_query() * mutation::query() The former is used by single partition queries, the latter by range scans, as mutation::query() is used to convert reconcilable_result to query::result (which means it is also used in single partition queries if it triggers read repair). This is a rather unfortunate situation as we have two parallel implementations of the query code, which means they are prone to diverge, and in fact they already have -- more on that later. This patchset aims to remedy this situation by retiring `mutation::query()` and migrating users to an implementation based on the "standard" query path, in other words one using the same building blocks as the `database::query()` path. This means using `compact_mutation` for compacting and `query_result_builder` for result building. These components however were created to work with `flat_mutation_reader`, and introducing a reader into this pipeline would mean that we'd have to make all the related APIs asynchronous, which would cause an insane amount of churn. To avoid this, this patchset adds an API compatible `consume()` method to `mutation`, which can accept a `compact_mutation` instance as-is. This allows an elegant and succinct reimplementation. So far so good. Like mentioned above, the two implementations have diverged in time, or have been different from the start. The difference manifests when calculating digests, more precisely in which tombstones are included in the digest. The retired `mutation::query()` path incorporates only non-purgeable tombstones in the digest. The standard query path however incorporates all tombstones, even those that can be purged.
After some scrutiny however this difference proved to be completely theoretical, as the code path where this would matter -- converting reconcilable result to query result -- passes min timestamp as the query time to the compaction, so nothing is compacted and hence the difference has no chance to manifest. This patch-set was motivated by the desire to provide a single solution to #7434, instead of two, one for each path. Tests: unit(release:v2, debug:v2, dev:v3) " * 'unified-query-path/v3' of https://github.com/denesb/scylla: mutation: remove now unused query() and query_compacted() treewide: use query_mutations() instead of mutation::query() mutation_test: test_query_digest: ensure digest is produced consistently mutation_query: introduce query_mutation() mutation_query: to_data_query_result(): migrate to standard query code mutation_query: move to_data_query_result() to mutation_partition.cc mutation: add consume() flat_mutation_reader: move mutation consumer concepts to separate header mutation compactor: query compaction: ignore purgeable tombstones	2021-01-27 15:58:47 +02:00
Botond Dénes	a5a8037f6e	sstables: sstable: add make_reader() This will be the only method to create sstable readers with. For now we leave the other variants, they as well as their users will be removed in a following patch.	2021-01-27 15:20:06 +02:00
Nadav Har'El	2113849a2b	cql-pytest: reproducer for toJson() bug with doubles This patch adds a cql-pytest, test_json.py::test_tojson_double(), which reproduces issue #7972 - where toJson() prints some doubles incorrectly - truncated to integers, but some it prints fine (I still don't know why, this will need to be debugged). The test is marked xfail: It fails on Scylla, and passes on Cassandra. Refs #7972. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210127124338.297544-1-nyh@scylladb.com>	2021-01-27 14:00:25 +01:00
Pavel Solodovnikov	10b117aada	raft: create dummy impl for schema changes state machine This patch introduces `schema_raft_state_machine` class which is currently just a dummy implementation throwing a "not implemented" exception for every call. Will be needed later to construct an instance of `raft::server`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>	2021-01-27 12:33:27 +01:00
Pavel Solodovnikov	223c823963	serializer: add `deserialize` function overload for `bytes_ostream` For some reason we had a distinct specialization of `serialize` function to handle `bytes_ostream` but not `deserialize`. This will be used in the following patches. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-26 23:21:15 +03:00
Asias He	c82250e0cf	gossip: Allow deferring advertise of local node to be up Currently the replacing node sets the status as STATUS_UNKNOWN when it starts gossip service for the first time before it sets the status to HIBERNATE to start the replacing operation. This introduces the following race: 1) Replacing node using the same IP address of the node to be replaced starts gossip service without setting the gossip STATUS (will be seen as STATUS_UNKNOWN by other nodes) 2) Replacing node waits for gossip to settle and learns status and tokens of existing nodes 3) Replacing node announces the HIBERNATE STATUS. After Step 1 and before Step 3, existing nodes will mark the replacing node as UP, but haven't marked the replacing node as doing replacing yet. As a result, the replacing node will not be excluded from the read replicas and will be considered a target node to serve CQL reads. To fix, we make the replacing node avoid responding echo message when it is not ready. Fixes #7312 Closes #7714	2021-01-26 19:02:11 +01:00
Pekka Enberg	9fc83ac627	Update tools/java submodule * tools/java 8080009794...4a55b81941 (1): > cassandra.in.sh: remove debug message	2021-01-26 15:56:58 +02:00
Avi Kivity	90a6c3bd7a	build: reduce release mode inline tuning on aarch64 I see a miscompile on aarch64 where a call to format("{}", uuid) translates a function pointer to -1. When called, this crashes. Reduce the inline threshold from 2500 to 600. This doesn't guarantee no miscompiles but all the tests pass with this parameter. Closes #7953	2021-01-26 11:14:42 +02:00
Tomasz Grabiec	90f6bb754e	Merge "raft: replication tests: fixes for debug mode" from Alejo The following patches fix issues seen occasionally in debug mode. Notes: - In debug mode there's still the UB nullptr arithmetic warning. * https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation: raft: replication test: wait for log propagation raft: replication test: move wait for log to a function raft: replication test: remove unused member raft: replication test: use later() raft: testing: remove election wait time and just yield	2021-01-26 11:14:42 +02:00
Avi Kivity	f58151d191	test: mutation_test: fix initialization order bug with thread local storage test_cell_external_memory_usage uses with_allocator() to observe how some types allocate memory. However, compiler reordering (observed with clang 11 on aarch64) can move the various thread-local CQL type object initialization into the with_allocator() scope; so any managed object allocated as part of this initialization also gets measured, and the test fails. The code movement is legal, as far as I can tell. Fix this by initializing the type object early; use an atomic_thread_fence as an optimization barrier so the compiler doesn't eliminate or move the early initialization. Closes #7951	2021-01-26 11:14:42 +02:00
Nadav Har'El	356250f720	cql-pytest: tests for fromJson() failing to set tuple elements to null This patch adds a test for trying to set a tuple element to null with fromJson(), which works on Cassandra but fails on Scylla. So the test xfails on Scylla. Reproduces issue #7954. Refs #7954. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210124082311.126300-1-nyh@scylladb.com>	2021-01-26 11:14:42 +02:00
Avi Kivity	05c435dddc	Merge "mutation readers: remove next_partition() workarounds" from Botond " `next_partition()` used to return void, so readers that had to call future returning code had to work around this. Now that `next_partition()` returns a future, we can get rid of these workarounds. Tests: unit(release, debug) " * 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla: mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag mutation_reader: evictable_reader: remove next_partition() workaround mutation_reader: shard_reader: remove next_partition() workaround mutation_reader: foreign_reader: remove next_partition() workaround	2021-01-26 11:14:42 +02:00
Nadav Har'El	067330c08f	Merge 'redis: support large redis message' from Takuya ASADA If the message is larger than the current buffer size, we need to consume more data until we reach the tail of the message. To do so, we need to return nullptr when we are not yet at the tail. Fixes #7273 Closes #7903 * github.com:scylladb/scylla: redis: rename _args_size/_size_left There are two types of numerical parameter in redis protocol: - *[0-9]+ defined array size - $[0-9]+ defined string size redis: fix large message handling	2021-01-25 10:11:17 +02:00
Takuya ASADA	229940aaff	redis: rename _args_size/_size_left There are two types of numerical parameter in redis protocol: - *[0-9]+ defined array size - $[0-9]+ defined string size Currently, array size is stored to args_count, and string size is stored to _arg_size / _size_left. It's a bit hard to understand since both use the same word "arg(s)", so let's rename the string size variables to _bytes_count / _bytes_left.	2021-01-25 10:26:37 +09:00
Takuya ASADA	7a6ee9858f	redis: fix large message handling If the message is larger than the current buffer size, we need to consume more data until we reach the tail of the message. To do so, we need to return nullptr when we are not yet at the tail. Fixes #7273	2021-01-25 10:26:37 +09:00
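The two redis commits above describe length-prefixed parsing where the parser must signal "not enough data yet" when a message is larger than the buffer. A minimal Python sketch of that idea for a single `$<n>` bulk string follows. The function name `parse_bulk_string` and its return convention are hypothetical and only illustrate the technique; they are not taken from Scylla's parser, where the commit says `nullptr` plays the role that `None` plays here.

```python
def parse_bulk_string(buf: bytes):
    """Parse one redis bulk string ("$<n>\r\n<n bytes>\r\n") from buf.

    Returns (payload, bytes_consumed) when the whole string is present,
    or None when more data must be read first (hypothetical convention).
    """
    if not buf:
        return None  # nothing buffered yet
    if not buf.startswith(b"$"):
        raise ValueError("not a bulk string")
    header_end = buf.find(b"\r\n")
    if header_end == -1:
        return None  # the length prefix itself is still incomplete
    bytes_count = int(buf[1:header_end])  # the "$[0-9]+" string size
    start = header_end + 2
    end = start + bytes_count
    if len(buf) < end + 2:
        return None  # payload or trailing CRLF has not arrived yet
    return buf[start:end], end + 2
```

A caller would keep appending received bytes to its buffer and retry until the function stops returning `None`.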
Alejo Sanchez	0d694990cf	raft: replication test: wait for log propagation Wait until entries propagate after adding and before changing leader using the same code as done for partitioning. This fixes occasional hangs in debug mode when a test switches to a different leader without leaving enough time for full propagation. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:33:54 -04:00
Alejo Sanchez	4d1ec88f90	raft: replication test: move wait for log to a function Move wait for log propagation to its own function for reuse. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:25:48 -04:00
Alejo Sanchez	72f9b108e3	raft: replication test: remove unused member Initial state doesn't need to specify total entries anymore. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:25:48 -04:00
Alejo Sanchez	db95d6e7f1	raft: replication test: use later() Instead of sleep 1us use later() Also use later to yield after sending append entries in rpc test impl. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:25:48 -04:00
Alejo Sanchez	f875ff72c9	raft: testing: remove election wait time and just yield Replace sleep time for elect_me_leader with yield to speed things up. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:25:48 -04:00
Pekka Enberg	8258556832	Update tools/python3 submodule * tools/python3 c579207...199ac90 (1): > dist: debian: adjust .orig tarball name for .rc releases	2021-01-24 21:30:59 +02:00
Gleb Natapov	020da49c89	storage_proxy: remove no longer needed range_slice_read_executor After support for mixed cluster compatibility feature DIGEST_MULTIPARTITION_READ was dropped in `854a44ff9b` range_slice_read_executor and never_speculating_read_executor become identical, so remove the former for good. Message-Id: <20210124122731.GA1122499@scylladb.com>	2021-01-24 14:45:22 +02:00
Benny Halevy	088f92e574	paxos_state: learn: fix injected error description It was copy-pasted from another injection point. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201220091439.3604201-1-bhalevy@scylladb.com>	2021-01-24 11:51:23 +02:00
Takuya ASADA	5d527bd17e	scylla_ntp_setup: use chrony on all distributions To simplify scylla_ntp_setup, use chrony on all distributions. Closes #7922	2021-01-24 11:45:58 +02:00
Takuya ASADA	984dc44ebf	dist: drop /etc/security/limits.d/scylla.conf Drop limits.d conf file, since we don't use it. We set these parameters via systemd unit file instead. Fixes #7925 Closes #7941	2021-01-24 11:43:39 +02:00
Benny Halevy	1847d49971	test: test_env: pick the highest sstable version by default If possible, test the highest sstable format version, as it's the most used. If there are pre-written sstables we need to load from the test directory from an older version, either specify their version explicitly, or use the new test_env::reusable_sst method that looks up the latest sstable version in the given directory and generation. Test: unit(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>	2021-01-24 10:38:55 +02:00
Botond Dénes	226088d12e	mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag It's not used anymore.	2021-01-22 16:18:59 +02:00
Botond Dénes	4eb65b12a0	mutation_reader: evictable_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 16:18:30 +02:00
Botond Dénes	febd2feb4c	mutation_reader: shard_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 15:53:05 +02:00
Botond Dénes	9c96d74b72	mutation: remove now unused query() and query_compacted()	2021-01-22 15:36:37 +02:00
Botond Dénes	1a3ee71b39	treewide: use query_mutations() instead of mutation::query() We want to retire the latter.	2021-01-22 15:36:37 +02:00
Botond Dénes	81da6b756f	mutation_reader: foreign_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 15:30:36 +02:00
Nadav Har'El	cb9e2ee00a	cql-pytest: tests for fromJson() setting a map<ascii, int> The fromJson() function can take a map JSON and use it to set a map column. However, the specific example of a map<ascii, int> doesn't work in Scylla (it does work in Cassandra). The xfailing tests in this patch demonstrate this. Although the tests use perfectly legal ASCII, scylla fails the fromJson() function, with a misleading error. Refs #7949. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210121233855.100640-1-nyh@scylladb.com>	2021-01-22 14:29:25 +01:00
Botond Dénes	a9d726c7ba	mutation_test: test_query_digest: ensure digest is produced consistently Before we retire the mutation::query() code, expand the digest test to check that the new code replacing it produces identical digest on all possible equivalent mutations.	2021-01-22 15:27:48 +02:00
Botond Dénes	821ed96e0e	mutation_query: introduce query_mutation() This is a replacement of `mutation::query()`, but with an implementation based on the standard query result building code. This will allow us to migrate the remaining `mutation::query()` users off of said method, which in turn will allow us to retire it finally.	2021-01-22 15:27:48 +02:00
Botond Dénes	c4f12221b8	mutation_query: to_data_query_result(): migrate to standard query code Reimplement in terms of the standard query result building code. We want to retire the alternative query result code in `mutation::query()` and `to_data_query_result()` is one of the main users.	2021-01-22 15:27:48 +02:00
Botond Dénes	164582f33b	mutation_query: move to_data_query_result() to mutation_partition.cc We want to rewrite the above mentioned method's implementation in terms of the standard query result building code (that of the `data_query()` path), in order to retire the alternative query code in the mutation class. The `data_query()` code uses classes private to `mutation_partition.cc` and instead of making these public, just move `to_data_query_result()` to `mutation_partition.cc`.	2021-01-22 15:27:48 +02:00
Botond Dénes	d0c5f550a9	mutation: add consume() This consume method accepts a `FlattenedConsumer`, the same one that the name-sake `flat_mutation_reader::consume()` does. Indeed the main purpose of this method is to allow using the standard query result building stack with a mutation, the same way said stack is used with mutation readers currently. This will allow us to replace the parallel query result building code that currently exists in the `mutation::query()` and friends, with the standard one.	2021-01-22 15:27:48 +02:00
Botond Dénes	9153f63135	flat_mutation_reader: move mutation consumer concepts to separate header In the next patch we will want to use these concepts in `mutation.hh`. To avoid pulling in the entire `flat_mutation_reader.hh` just for these, and create a circular dependency in doing so, move them to a dedicated header instead.	2021-01-22 15:27:48 +02:00
Botond Dénes	73808c12eb	mutation compactor: query compaction: ignore purgeable tombstones Not ignoring them makes query result building sensitive to whether the data was recently compacted or not; in particular, different digests will be produced depending on whether purgeable tombstones happened to be compacted (and thus purged) or not. This means that two replicas can produce different digests for the same data if one has compacted some purgeable tombstones and the other has not. To avoid this, drop purgeable tombstones during query compaction as well.	2021-01-22 15:27:48 +02:00
Pavel Emelyanov	90d445464b	compaction: Remove compaction_manager::enabled() This method was marked with 'FIXME -- should not be public' when it was introduced. Since then it has stopped being used and can even be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210122083146.5886-1-xemul@scylladb.com>	2021-01-22 14:07:38 +02:00
Kamil Braun	570d15c7bc	multishard_combining_reader: do not use `smp::count` `multishard_combining_reader` currently only works under the assumption that every table uses the same sharder configured using the node's number of shards. But we could potentially specify a different sharder for a chosen table, e.g. one that puts everything on shard 0. Then this assumption will be broken and the reader causes a segfault. Fixes #7945.	2021-01-21 18:28:18 +02:00
Nadav Har'El	328be1ca7c	cql-pytest: tests for fromJson() not accepting empty string as integer When writing to an integer column, Cassandra's fromJson() function allows not just JSON number constants, it also allows a string containing a number. Strings which do not hold a number fail with a FunctionFailure. In particular, the empty string "" is an invalid number, and should fail. The tests in this patch check this for two integer types: int and varint. Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails to recognize the error for varint, while Cassandra fails to recognize the error for int. The tests in this patch reproduce these bugs. The tests demonstrating Scylla's bug are marked xfail, and the test demonstrating Cassandra's bug is marked "cassandra_bug" (which means it is marked xfail only when running against Cassandra, but expected to succeed on Scylla). Refs #7944. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210121133833.66075-1-nyh@scylladb.com>	2021-01-21 15:24:48 +01:00
Nadav Har'El	702b1b97bf	cql: fix error return from execution of fromJson() and other functions As reproduced in cql-pytest/test_json.py and reported in issue #7911, failing fromJson() calls should return a FUNCTION_FAILURE error, but currently produce a generic SERVER_ERROR, which can lead the client to think the server experienced some unknown internal error and the query can be retried on another server. This patch adds a new cassandra_exception subclass that we were missing - function_execution_exception - properly formats this error message (as described in the CQL protocol documentation), and uses this exception in two cases: 1. Parse errors in fromJson()'s parameters are converted into a function_execution_exception. 2. Any exceptions during the execute() of a native_scalar_function_for function is converted into a function_execution_exception. In particular, fromJson() uses a native_scalar_function_for. Note, however, that functions which already took care to produce a specific Cassandra error, this error is passed through and not converted to a function_execution_exception. An example is the blobAsText() which can return an invalid_request error, so it is left as such and not converted. This also happens in Cassandra. All relevant tests in cql-pytest/test_json.py now pass, and are no longer marked xfail. This patch also includes a few more improvements to test_json.py. Fixes #7911 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>	2021-01-21 15:21:13 +01:00
Nadav Har'El	49440d67ad	Merge: Fix multiple issues with timeuuid type Merged patch series by Konstantin Osipov: "These series improve uniqueness of generated timeuuids and change list append/prepend logic to use client/LWT timestamp in timeuuids generated for list keys. Timeuuid compare functions are optimized. The test coverage is extended for all of the above." uuid: add a comment warning against UUID::operator< uuid: replace slow versions of timeuuid compare with optimized/tested versions. test: add tests for legacy uuid compare & msb monotonicity test: add a test case for append/prepend limit test: add a test case for monotonicity of timeuuid least significant bits uuid: implement optimized timeuuid compare test: add a test case for list prepend/append with custom timestamp lists: rewrite list prepend to use append machinery lists: use query timestamp for list cell values during append uuid: fill in UUID node identifier part of UUID test: add a CQL test for list append/prepend operations	2021-01-21 13:20:07 +02:00
Konstantin Osipov	e18e2cb9f2	uuid: add a comment warning against UUID::operator<	2021-01-21 13:03:59 +03:00
Konstantin Osipov	845f6c667b	uuid: replace slow versions of timeuuid compare with optimized/tested versions.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	56d8d166cb	test: add tests for legacy uuid compare & msb monotonicity	2021-01-21 13:03:59 +03:00
Konstantin Osipov	257c5b0879	test: add a test case for append/prepend limit	2021-01-21 13:03:59 +03:00
Konstantin Osipov	d6e65a3735	test: add a test case for monotonicity of timeuuid least significant bits Ensure that timeuuid least significant bits are compared correctly.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	0af3758aff	uuid: implement optimized timeuuid compare Introduce uint64_t based comparator for serialized timeuuids. Respect Cassandra legacy for timeuuid compare order. Scylla uses two versions of timeuuid compare: - one for timeuuid values stored in uuid columns - a different one for timeuuid values stored in timeuuid columns. This commit re-implements the implementations of these comparators in types.cc and deprecates the respective implementations types.cc. They will be removed in a following patch. A micro-benchmark at https://github.com/alecco/timeuuid-bench/ shows 2-4x speed up of the new comparators.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	b4500a55c7	test: add a test case for list prepend/append with custom timestamp Scylla now takes a custom timestamp into account when executing list append/prepend operations. Test the new semantics.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	232ce6f611	lists: rewrite list prepend to use append machinery Rewrite list prepend to use the same machinery as append, and thus produce correct results when used in LWT. After this patch, list prepend begins to honor user supplied timestamps. If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00 an exception is thrown. Fixes #7611	2021-01-21 13:03:59 +03:00
Konstantin Osipov	2b8ce83eea	lists: use query timestamp for list cell values during append Scylla list cells are represented internally as a map of timeuuid => value. To append a new value to a list the coordinator generates a timeuuid reflecting the current time as key and adds a value to the map using this key. Before this patch, Scylla always generated a timeuuid for a new value, even if the query had a user supplied or LWT timestamp. This could break LWT linearizability. User supplied timestamps were ignored. This is reported as https://github.com/scylladb/scylla/issues/7611 A statement which appended multiple values to a list or a BATCH generated its own microsecond-resolution timeuuid for each value: BEGIN BATCH UPDATE ... SET a = a + [3] UPDATE ... SET a = a + [4] APPLY BATCH UPDATE ... SET a = a + [3, 4] To fix the bug, it's necessary to preserve monotonicity of timeuuids within a batch or multi-value append, but make sure they all use the microsecond time, as is set by LWT or user. To explain the fix, it's first necessary to recall the structure of time-based UUIDs: 60 bits: time since start of GMT epoch, year 1582, represented in 100-nanosecond units; 4 bits: version; 14 bits: clock sequence, a random number to avoid duplicates in case system clock is adjusted; 2 bits: type; 48 bits: MAC address (or other hardware address). The purpose of the clockseq bits, as defined in https://tools.ietf.org/html/rfc4122#section-4.1.5, is to reduce the probability of UUID collision in case clock goes back in time or node id changes. The implementation should reset it whenever one of these events may occur. Since LWT microsecond time is guaranteed to be unique by Paxos, the RFC provisioning for clockseq and MAC slots becomes excessive. The fix thus changes timeuuid slot content in the following way: - time component now contains the same microsecond time for all values of a statement or a batch. The time is unique and monotonic in case of LWT.
Otherwise it's almost always monotonic, but may not be unique if two timestamps are created on different coordinators. - clockseq component is used to store a sequence number which is unique and monotonic for all values within the statement/batch. - to protect against time back-adjustments and duplicates if time is auto-generated, MAC component contains a random (spoof) MAC address, re-created on each restart. The address is different at each shard. The change is made for all sources of time: user, generated, LWT. Conditioning the list key generation algorithm on the source of time would unnecessarily complicate the code without increasing the quality (uniqueness) of created list keys. Since 14 bits of clockseq provide us with only 16383 distinct slots per statement or batch, 3 extra bits in nanosecond part of the time are used to extend the range to 131071 values per statement/batch. If the range is exceeded, an exception is produced. A twist on the use of clockseq to extend timeuuid uniqueness is that Scylla, like Cassandra, uses int8 compare to compare lower bits of timeuuid for ordering. The patch takes this into account and sign-complements the clockseq value to make it monotonic according to the legacy compare function. Fixes #7611 test: unit (dev)	2021-01-21 13:03:59 +03:00
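The time-based UUID layout recalled in the commit above (60-bit time, 4-bit version, 14-bit clock sequence, 2-bit type, 48-bit node) can be illustrated with Python's standard `uuid` module. This is only a sketch of the generic RFC 4122 packing and of the 16383 / 131071 limits quoted in the message; the helper `make_timeuuid` and the constant names are hypothetical, and the sketch does not reproduce Scylla's sign-complementing of clockseq or its use of the extra time bits.

```python
import uuid

CLOCKSEQ_BITS = 14    # clock sequence width in a time-based UUID
EXTRA_TIME_BITS = 3   # extra sub-microsecond bits mentioned in the commit

# Largest sequence numbers representable with 14 and with 14 + 3 bits.
MAX_CLOCKSEQ = (1 << CLOCKSEQ_BITS) - 1                             # 16383
MAX_SEQ_WITH_EXTRA_BITS = (1 << (CLOCKSEQ_BITS + EXTRA_TIME_BITS)) - 1  # 131071


def make_timeuuid(ts_100ns: int, clock_seq: int, node: int) -> uuid.UUID:
    """Pack a version-1 UUID from its fields (hypothetical helper)."""
    if not 0 <= ts_100ns < (1 << 60):
        raise ValueError("timestamp must fit in 60 bits")
    if not 0 <= clock_seq <= MAX_CLOCKSEQ:
        raise ValueError("clock sequence must fit in 14 bits")
    if not 0 <= node < (1 << 48):
        raise ValueError("node must fit in 48 bits")
    time_low = ts_100ns & 0xFFFFFFFF
    time_mid = (ts_100ns >> 32) & 0xFFFF
    time_hi_version = ((ts_100ns >> 48) & 0x0FFF) | (1 << 12)  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80    # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))
```

Reading the fields back through `UUID.time`, `UUID.clock_seq`, and `UUID.node` recovers the values that were packed, which makes the slot boundaries easy to check.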
Konstantin Osipov	6d1781be36	uuid: fill in UUID node identifier part of UUID Before this patch, UUID generation code was not creating sufficiently unique IDs: the 6 byte node identifier was mostly empty, i.e. only containing shard id. This could lead to collisions between queries executed concurrently at different coordinators, and, since timeuuid is used as key in list append and prepend operations, lead to lost updates. To generate a unique node id, the patch uses a combination of hardware MAC address (or a random number if no hardware address is available) and the current shard id. The shard id is mixed into higher bits of MAC, to reduce the chances of NIC collision within the same network. With sufficiently unique timeuuids as list cell keys, such updates are no longer lost, but multi-value update can still be "merged" with another multi-value update. E.g. if node A executes SET l = l + [4, 5] and node B executes SET l = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4, 6, 5, 7], [6, 4, 5, 7] and so on. At least we are now less likely to get any value lost. Fixes #6208. @todo: initialize UUID subsystem explicitly in main() and switch to using seastar::engine().net().network_interfaces() test: unit (dev)	2021-01-21 13:03:53 +03:00
Avi Kivity	4cfaab208e	allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations Now that managed_bytes and its users do not assume that a managed_bytes instance allocated using standard_allocation_strategy is non-fragmented, we can set the preferred max contiguous allocation to 128k. This causes managed_bytes to fragment instances that are larger than this size. Note that managed_bytes is the only user. Closes #7943	2021-01-21 11:15:13 +02:00
Tomasz Grabiec	f08a3e3fd8	Merge "raft: test fixes, etcd tests, simplification" from Alejo This patch set adds etcd unit tests for raft. It also includes a fix for replication test in debug mode and a simplification for append_request. Tests: unit ({dev}), unit ({debug}), unit ({release}) * https://github.com/alecco/scylla/tree/raft-ale-tests-09b: raft: etcd unit tests: test log replication raft: boost test etcd: test fsm can vote from any state raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs raft: replication test: add etcd test for cycling leaders raft: testing: provide primitives to wait for log propagation raft: etcd unit tests: initial boost tests raft: combine append_request _receive and _send	2021-01-21 10:41:33 +02:00
Pekka Enberg	7d98e05923	Update tools/python3 submodule * tools/python3 1763a1a...c579207 (1): > dist/debian: handle rc version correctly	2021-01-21 10:41:33 +02:00
Avi Kivity	daa0e964fc	dbuild: avoid --pids-limit with podman and cgroupsv1 Podman doesn't correctly support --pids-limit with cgroupsv1. Some versions ignore it, and some versions reject the option. To avoid the error, don't supply --pids-limit if cgroupsv2 is not available (detected by its presence in /proc/filesystems). The user is required to configure the pids limit in /etc/containers/containers.conf. Fixes #7938. Closes #7939	2021-01-21 10:41:33 +02:00
Botond Dénes	4d581f1bb3	docs/README.md: guides: also mention running and debugging Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>	2021-01-20 16:07:29 +02:00
Avi Kivity	f11a0700a8	Merge "mutation_writer: explicitly close writers" from Benny " _consumer_fut is expected to return an exception on the abort path. Wait for it and drop any exception so it won't be abandoned as seen in #7904. A future<> close() method was added to return _consumer_fut. It is called both after abort() in the error path, and after consume_end_of_stream, on the success path. With that, consume_end_of_stream was made void as it doesn't return a future<> anymore. Fixes #7904 Test: unit(release) " * tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla: mutation_writer: bucket_writer: add close mutation_writer/feed_writers: refactor bucket/shard writers mutation_writer: update bucket/shard writers consume_end_of_stream	2021-01-20 16:07:29 +02:00
Pekka Enberg	6cc981d089	scylla: Add "--build-mode" command line option This adds a "--build-mode" command line option to "scylla" executable: $ ./build/dev/scylla --build-mode dev This allows you to discover the build mode of a "scylla" executable without resorting to "readelf", for example, to verify that you are looking at the correct executable while debugging packaging issues. Closes #7865	2021-01-20 16:07:29 +02:00
Botond Dénes	7eb8c71342	tools/scylla-types: add link to cql3-type-mapping.md Just like scylla-sstable-index, scylla-types accepts types in (short) cassandra class name notation. The mapping from the cql3 type names to the class names is not straight-forward in all cases, so provide a link to a table which lists the cassandra class name of all supported types (and more). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>	2021-01-20 10:50:33 +02:00
Botond Dénes	882ade7c6a	types/scylla-sstable-index: update URL to cql3-type-mapping.md Said document was recently moved but the URL was not updated. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>	2021-01-20 10:50:33 +02:00
Avi Kivity	114da51d73	Revert "commitlog: fix size of a write used to zero a segment" This reverts commit `df2f67626b`. The fix is correct, but has an unfortunate side effect with O_DSYNC: each 128k write also needs to flush the XFS log. This translates to 32MB/128k = 256 flushes, compared to one flush with the original code. A better fix would be to prezero without O_DSYNC, then reopen the file with O_DSYNC, but we can do that later. Reopens #5857.	2021-01-20 10:23:43 +02:00
Avi Kivity	586f16bf79	Merge "Cut snitch -> storage service dependency" from Pavel E " Currently storage service and snitch implicitly depend on each other. Storage service gossips snitch data on start, snitch kicks the storage service when its configuration changes. This interdependency is relaxed: - snitch gossips all its state itself without using the storage service as a mediator - storage service listens for snitch updates with the help of self-breaking subscription Both changes make snitch independent from storage service, remove yet another call for global storage service from the codebase and make the storage service -> snitch reference robust against dangling pointers/references tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev) " * 'br-snitch-gossip-2' of https://github.com/xemul/scylla: storage-service: Subscribe to snitch to update topology snitch: Introduce reconfiguration signal snitch: Always gossip snitch info itself snitch: Do gossip DC and RACK itself snitch: Add generic gossiping helper	2021-01-20 10:23:43 +02:00
Pavel Solodovnikov	041072b59f	raft: rename `storage` to `persistence` The new naming scheme more clearly communicates to the client of the raft library that the `persistence` interface implements persistency layer of the fsm that is powering the raft protocol itself rather than the client-side workflow and user-provided `state_machine`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>	2021-01-20 10:23:43 +02:00
Gleb Natapov	248449816b	raft: fix snapshot transfer with existing log prefix Current code that checks when snapshot has to be transferred does not take in account the case where there can be log entries preceding the snapshot. Fix the code to correctly test for snapshot transfer condition. Message-Id: <20210117095801.GB733394@scylladb.com>	2021-01-20 10:23:43 +02:00
Gleb Natapov	1ab262e86b	raft: test: change replication_test to submit one entry at a time replication_test's state machine is not commutative, so if commands are applied in different order the states will be different as well. Since the preemption check was added into co_await in seastar even waiting for a ready future can preempt which will cause reordering of simultaneously submitted entries in debug mode. For a long time we tried to keep entries submission parallel in the test, but with the above seastar change it is no longer possible to maintain it without changing the state machine to be commutative. The patch changes the test to submit entries one by one. Message-Id: <20210117095147.GA733394@scylladb.com>	2021-01-20 10:23:43 +02:00
Benny Halevy	f29732573a	mutation_writer: bucket_writer: add close bucket_writer::close waits for the _consumer_fut. It is called both after consume_end_of_stream() and after abort(). _consumer_fut is expected to return an exception on the abort path. Wait for it and drop any exception so it won't be abandoned as seen in #7904. With that moved to close() time, consume_end_of_stream doesn't need to return a future and is made void all the way in the stack. This is ok since queue_reader_handle::push_end_of_stream is synchronous too. Added a unit test that aborts the reader consumer during `segregate_by_timestamp`, reproducing the Exceptional future ignored issue without the fix. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 19:03:58 +02:00
Benny Halevy	fc3f9a57ff	mutation_writer/feed_writers: refactor bucket/shard writers Consolidate shard_based_splitting_writer::shard_writer and timestamp_based_splitting_writer::bucket_writer common code into mutation_writer::bucket_writer. This provides a common place to handle consume_end_of_stream() and abort(), and in particular the handling of the underlying _consumer_fut. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 18:48:01 +02:00
Benny Halevy	a9d91a2d09	mutation_writer: update bucket/shard writers consume_end_of_stream After `61520a33d6` feed_writers doesn't call consume_end_of_stream after abort() so no need to test if (!_handle.is_terminated()) { and consume_end_of_stream is now called in then_wrapped rather than `finally` so it's ok if it throws. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 18:44:40 +02:00
Kamil Braun	1a8630e6a7	transport: silence "broken pipe" and "connection reset by peer" errors The code would already silence broken pipe exceptions since it's expected when the other side closes the connection or when we shutdown the socket during Scylla shutdown, but the code wouldn't handle the following: 1. "Connection reset by peer" errors: these can also happen in the aforementioned two scenarios; the conditions that determine which of the two types of errors occur are unclear. 2. The scenarios would sometimes result in a `seastar::nested_exception`, mainly during shutdown. The errors could happen once when trying to send a response to a request (`_write_buf.write(...)/flush(...)`) and then again when trying to close the connection in a `finally` block. These nested exceptions were not silenced. The commit handles each of these cases. Closes #7907. Closes #7931	2021-01-19 10:30:17 +02:00
Tomasz Grabiec	94749b01eb	Merge "futurize flat_mutation_reader::next_partition" from Benny The main motivation for this patchset is to prepare for adding an async close() method to flat_mutation_reader. In order to close the reader before destroying it in all paths we need to make next_partition asynchronous so it can asynchronously close a current reader before destroying it, e.g. by reassignment of flat_mutation_reader_opt, as done in scanning_reader::next_partition. Test: unit(release, debug) * git@github.com:bhalevy/scylla.git futurize-next-partition-v1: flat_mutation_reader: return future from next_partition multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard mutation_reader: filtering_reader: fill_buffer: futurize inner loop flat_mutation_reader::impl: consumer_adapter: futurize handle_result flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer flat_mutation_reader:impl: get rid of _consume_done member	2021-01-19 10:19:03 +02:00
Alejo Sanchez	8a61e7defc	raft: etcd unit tests: test log replication etcd TestLogReplication Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	417b18aaad	raft: boost test etcd: test fsm can vote from any state etcd TestVoteFromAnyState Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	5a75c0e06a	raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs Log truncation of follower when node re-gains leadership. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	f14c44c686	raft: replication test: add etcd test for cycling leaders This test cycles 3 nodes as leaders without adding entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	f627972186	raft: testing: provide primitives to wait for log propagation For tests to be able to transition in a consistent state, in some cases it's needed to allow the followers to catch up with the leader. This prevents occasional hangs in debug mode for incoming tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	948ae813e4	raft: etcd unit tests: initial boost tests First batch of ported etcd raft unit tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:12 -04:00
Gleb Natapov	6d47a535b9	raft: combine append_request _receive and _send Combine structs for append request send and receive into a single struct. Author: Gleb Natapov <gleb@scylladb.com> Date: Mon Nov 23 14:33:14 2020 +0200	2021-01-18 12:24:13 -04:00
Konstantin Osipov	bf1a031bd6	test: add a CQL test for list append/prepend operations Test single- and multi- value list append, prepend, append and prepend in a batch, conditional statements. This covers the parts of Cassandra which are working as documented and which we intend to preserve compatibility with.	2021-01-18 17:32:00 +03:00
Jenkins	faf71c6f75	release: prepare for 4.5.dev	2021-01-18 16:05:25 +02:00
Avi Kivity	df3ef800c2	Merge 'Introduce load and stream feature' from Asias He storage_service: Introduce load_and_stream === Introduction === This feature extends the nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk and calculates the owning nodes of the data and streams to the owners automatically. For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process. This can make restores and migrations much easier. === Performance === I managed to get 40MB/s per shard on my build machine. CPU: AMD Ryzen 7 1800X Eight-Core Processor DISK: Samsung SSD 970 PRO 512GB Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32 shards, we can finish the load and stream 1TB of data in 13 mins on each node. 1TB / (40 MB/s per shard * 32 shards) / 60 s = 13 mins === Tests === backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test which creates a cluster with 4 nodes and inserts data, then use load_and_stream to restore to a 2 nodes cluster. === Usage === curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true" === Notes === Btw, with the old nodetool refresh, the node will not pick up the data that does not belong to this node but it will not delete it either. One has to run nodetool cleanup to remove those data manually which is a surprise to me and probably to users as well. With load and stream, the process will delete the sstables once it finishes streaming, so no nodetool cleanup is needed. The name of this feature load and stream follows load and store in CPU world. Fixes #7831 Closes #7846 * github.com:scylladb/scylla: storage_service: Introduce load_and_stream distributed_loader: Add get_sstables_from_upload_dir table: Add make_streaming_reader for given sstables set	2021-01-18 15:08:19 +02:00
Avi Kivity	60f5ec3644	Merge 'managed_bytes: switch to explicit linearization' from Michał Chojnowski This is a revival of #7490. Quoting #7490: The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope. We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes. As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized(). Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view). By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure. This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator. 
Closes #7820 * github.com:scylladb/scylla: test: add hashers_test memtable: fix accounting of managed_bytes in partition_snapshot_accounter test: add managed_bytes_test utils: fragment_range: add a fragment iterator for FragmentedView keys: update comments after changes and remove an unused method mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator row_cache: more indentation fixes utils: remove unused linearization facilities in `managed_bytes` class misc: fix indentation treewide: remove remaining `with_linearized_managed_bytes` uses memtable, row_cache: remove `with_linearized_managed_bytes` uses utils: managed_bytes: remove linearizing accessors keys, compound: switch from bytes_view to managed_bytes_view sstables: writer: add write_* helpers for managed_bytes_view compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view types: change equal() to accept managed_bytes_view types: add parallel interfaces for managed_bytes_view types: add to_managed_bytes(const sstring&) serializer_impl: handle managed_bytes without linearizing utils: managed_bytes: add managed_bytes_view::operator[] utils: managed_bytes: introduce managed_bytes_view utils: fragment_range: add serialization helpers for FragmentedMutableView bytes: implement std::hash using appending_hash utils: mutable_view: add substr() utils: fragment_range: add compare_unsigned utils: managed_bytes: make the constructors from bytes and bytes_view explicit utils: managed_bytes: introduce with_linearized() utils: managed_bytes: constrain with_linearized_managed_bytes() utils: managed_bytes: avoid internal uses of managed_bytes::data() utils: managed_bytes: extract do_linearize_pure() thrift: do not depend on implicit conversion of keys to bytes_view clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view cql3: expression: linearize get_value_from_mutation() earlier bytes: add to_bytes(bytes) cql3: expression: mark do_get_value() as static	2021-01-18 11:01:28 +02:00
Asias He	4d32d03172	storage_service: Introduce load_and_stream === Introduction === This feature extends the nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk and calculates the owning nodes of the data and streams to the owners automatically. For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process. This can make restores and migrations much easier. === Performance === I managed to get 40MB/s per shard on my build machine. CPU: AMD Ryzen 7 1800X Eight-Core Processor DISK: Samsung SSD 970 PRO 512GB Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32 shards, we can finish the load and stream 1TB of data in 13 mins on each node. 1TB / (40 MB/s per shard * 32 shards) / 60 s = 13 mins === Tests === backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test which creates a cluster with 4 nodes and inserts data, then use load_and_stream to restore to a 2 nodes cluster. === Usage === curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true" === Notes === Btw, with the old nodetool refresh, the node will not pick up the data that does not belong to this node but it will not delete it either. One has to run nodetool cleanup to remove those data manually which is a surprise to me and probably to users as well. With load and stream, the process will delete the sstables once it finishes streaming, so no nodetool cleanup is needed. The name of this feature load and stream follows load and store in CPU world. Fixes #7831	2021-01-18 16:32:33 +08:00
Avi Kivity	ab44464911	Revert "docker: remove sshd from the image" This reverts commit `32fd38f349`. Some tests (in scylla-cluster-tests) depend on it.	2021-01-17 14:34:40 +02:00
Raphael S. Carvalho	00c29e1e24	table: Move notify_bootstrap_or_replace_*() out of line Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>	2021-01-17 10:36:13 +02:00
Asias He	28007f13f8	distributed_loader: Add get_sstables_from_upload_dir This function scans sstables under the upload directory and return a list of sstables for each shard. Refs #7831	2021-01-16 20:03:17 +08:00
Michał Chojnowski	5b72fb65ae	test: add hashers_test This test is a sanity check. It verifies that our wrappers over well known hashes (xxhash, md5, sha256) actually calculate exactly those hashes. It also checks that the `update()` methods of used hashers are linear with respect to concatenation: that is, `update(a + b)` must be equivalent to `update(a); update(b)`. This wasn't relied on before, but now we need to confirm that hashing fragmented keys without linearizing them won't break backward compatibility.	2021-01-15 18:28:24 +01:00
Michał Chojnowski	85048b349b	memtable: fix accounting of managed_bytes in partition_snapshot_accounter managed_bytes has a small overhead per each fragment. Due to that, managed_bytes containing the same data can have different total memory usage in different allocators. The smaller the preferred max allocation size setting is, the more fragments are needed and the greater total per-fragment overhead is. In particular, managed_bytes allocated in the LSA could grow in memory usage when copied to the standard allocator, if the standard allocator had a preferred max allocation setting smaller than the LSA. partition_snapshot_accounter calculates the amount of memory used by mutation fragments in the memtable (where they are allocated with LSA) based on the memory usage after they are copied to the standard allocator. This could result in an overestimation, as explained above. But partition_snapshot_accounter must not overestimate the amount of freed memory, as doing otherwise might result in OOM situations. This patch prevents the overaccounting by adding minimal_external_memory_usage(): a new version of external_memory_usage(), which ignores allocator-dependent overhead. In particular, it includes the per-fragment overhead in managed_bytes only once, no matter how many fragments there are.	2021-01-15 18:21:13 +01:00
Michał Chojnowski	d31771c0b2	test: add managed_bytes_test	2021-01-15 18:21:13 +01:00
Michał Chojnowski	72ecbd6936	utils: fragment_range: add a fragment iterator for FragmentedView A stylistic change. Iterators are the idiomatic way to iterate in C++.	2021-01-15 14:05:44 +01:00
Michał Chojnowski	2e38647a95	keys: update comments after changes and remove an unused method The comments were outdated after the latest changes (bytes_view vs managed_bytes_view). compound_view_wrapper::get_component() is unused, so we remove it.	2021-01-15 14:05:44 +01:00
Piotr Sarna	6ae94d31c1	treewide: remove shared pointer usage from the pager The pager interface doesn't really need to be virtual, so the next step could be to remove the need for pointers entirely, but migrating from shared_ptr to unique_ptr is a low-hanging fruit. Message-Id: <a5bdecb17ae58e914da020fb58a41f4574565c66.1610709560.git.sarna@scylladb.com>	2021-01-15 15:03:14 +02:00
Avi Kivity	f20736d93d	Merge 'Support unofficial distributions' from Takuya ASADA Since we introduced relocatable package and offline installer, the scylla binary itself can run on almost any distribution. However, setup scripts are not designed to run on unsupported distributions, and they cause errors in such environments. This PR adds minimal support to run offline installation on unsupported distributions, tested on SLES, Arch Linux and Gentoo. Closes #7858 * github.com:scylladb/scylla: dist: use sysconfig_parser to parse gentoo config file dist: add package name translation dist: support SLES/OpenSUSE install.sh: add systemd existence check install.sh: ignore error missing sysctl entries dist: show warning on unsupported distributions dist: drop Ubuntu 14.04 code dist: move back is_amzn2() to scylla_util.py dist: rename is_gentoo_variant() to is_gentoo() dist: support Arch Linux dist: make sysconfig directory detectable	2021-01-14 16:59:49 +02:00
Raphael S. Carvalho	97e076365e	Fix stalls on Memtable flush by preempting across fragment generation if needed Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer() generates mutation fragment until buffer is full[1] without yielding. this is the code path: flush_reader::fill_buffer() <---------\| flat_mutation_reader::consume_pausable() <--------\| partition_snapshot_flat_reader::fill_buffer() -\| [1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261 This is fixed by breaking the loop in do_fill_buffer() if preemption is needed, allowing do_until() to yield in sequence, and when it resumes, continue from where it left off, until buffer is full. Fixes #7885. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>	2021-01-14 16:30:55 +02:00
Ivan Prisyazhnyy	32fd38f349	docker: remove sshd from the image implicit revert of `6322293263` sshd previously was used by the scylla manager 1.0. new version does not need it. there is no point of having it currently. it also confuses everyone. Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com> Closes #7921	2021-01-14 12:52:24 +02:00
Pavel Emelyanov	2b31be0daa	client-state,cdc: Remove call for storage_service from permissions check The client_state::check_access() calls for global storage service to get the features from it and check if the CDC feature is on. The latter is needed to perform CDC-specific checks. However it was noticed, that the check for the feature is excessive as all the guarded if-s will resolve to false in case CDC is off and the check_access will effectively work as it would with the feature check. With that observation, it's possible to ditch one more global storage service reference. tests: unit(dev), dtest(dev, auth) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210105063651.7081-1-xemul@scylladb.com>	2021-01-14 12:52:24 +02:00
Benny Halevy	29002e3b48	flat_mutation_reader: return future from next_partition To allow it to asynchronously close underlying readers on next_partition(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	ff931c2ecc	multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard The reader_meta in _readers[shard] is created on shard 0 and must be destroyed on it as well. A following patch changes next_partition() to return a future<> thus it introduces a continuation that requires access to `rm`. We cannot move it down to the continuation safely, since it will be wrongly destroyed in the invoked shard, so use do_with to hold it in the scope of the calling shard until the invoked function completes. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	75c0c05f71	mutation_reader: filtering_reader: fill_buffer: futurize inner loop Prepare for futurizing next_partition(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	cd4d082e51	flat_mutation_reader::impl: consumer_adapter: futurize handle_result Prepare for futurizing next_partition. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	d8ae6d7591	flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer To support both sync and async consumers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	fdb3c59e35	flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer So that consumer_adapter and other consumers in the future may return a future from consumer(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Benny Halevy	515bed90bb	flat_mutation_reader:impl: get rid of _consume_done member It is only used in consume_pausable, that can easily do without it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Pavel Emelyanov	d3ee8774ad	storage-service: Subscribe to snitch to update topology Currently snitch explicitly calls storage service (if it's initialized) to update topology on snitch data change. Instead of it -- make storage service subscribe on the snitch reconfigure signal upon creation. This finally makes snitch fully independent from storage service. In tests the snitch instance is not created, so check for it before subscribing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-01-13 16:41:34 +03:00
Pavel Emelyanov	d1a2d0f894	snitch: Introduce reconfiguration signal Add a notifier to snitch_base that gets triggered when the snitch configuration changes to which others may subscribe. For now only the gossiping-file-snitch triggers it when it re-reads its config file. Other existing snitches are kinda static in this sense. The subscribe-trigger engine is based on scoped connection from boost::signals2. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-01-13 16:41:34 +03:00
Pavel Emelyanov	ca336409d7	snitch: Always gossip snitch info itself The gossiping_property_file_snitch updates the gossip RACK and DC values upon config change. Right now this is done with the help of storage service, but the needed code to gossip rack and dc is already available in the snitch itself. Said that -- gossip snitch info by snitch helper and remove the storage_service's one. This makes the 2nd step decoupling snitch and storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-01-13 16:41:34 +03:00
Pavel Emelyanov	99e71bd1f6	snitch: Do gossip DC and RACK itself This is the 2nd step in generalizing the snitch data gossiping and at the same time the 1st step in decoupling storage service and snitch. During start storage service starts gossiper, which notifies the snitch with .gossiper_starting() call, then the storage service calls gossip_snitch_info. This patch makes snitch itself do the last step. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-01-13 16:41:34 +03:00
Pavel Emelyanov	bc1a3a358d	snitch: Add generic gossiping helper Nowadays some snitch implementations gossip the INTERNAL_IP value and storage_service gossip RACK and DC for all of them. This functionality is going to be generalized and the first step is in making a common method for a snitch to gossip its data. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-01-13 16:41:34 +03:00
Takuya ASADA	7a74f8cd2e	dist: use sysconfig_parser to parse gentoo config file Use sysconfig_parser instead of regex, to improve code readability.	2021-01-13 21:34:23 +09:00
Takuya ASADA	2a4d293841	dist: add package name translation Translate package name from CentOS package to different distribution package name, to use single package name for pkg_install().	2021-01-13 21:27:14 +09:00
Takuya ASADA	0a9843842d	dist: support SLES/OpenSUSE Add support SLES/OpenSUSE on setup script.	2021-01-13 19:32:46 +09:00
Takuya ASADA	a34edf8169	install.sh: add systemd existence check The offline installer can run on non-systemd distributions, but it won't work since we only have systemd units. So check for systemd existence and print an error message.	2021-01-13 19:32:45 +09:00
Takuya ASADA	b8c35772b3	install.sh: ignore error missing sysctl entries Some kernels may not have the specified sysctl parameter, so we should ignore the error.	2021-01-13 19:32:45 +09:00
Takuya ASADA	e8f74e800c	dist: show warning on unsupported distributions Add warning message on unsupported distributions, for scylla_cpuscaling_setup and scylla_ntp_setup.	2021-01-13 19:32:45 +09:00
Takuya ASADA	2f344cf50d	dist: drop Ubuntu 14.04 code We don't support Ubuntu 14.04 anymore, drop them	2021-01-13 19:32:45 +09:00
Takuya ASADA	8e59f70080	dist: move back is_amzn2() to scylla_util.py Distribution detection functions should be placed same place, so move back it to scylla_util.py	2021-01-13 19:32:45 +09:00
Takuya ASADA	921b1676c0	dist: rename is_gentoo_variant() to is_gentoo() is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL, and is_debian_variant() is the function to detect Debian/Ubuntu. Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants", we should rename it to is_gentoo().	2021-01-13 19:32:45 +09:00
Takuya ASADA	fffa8f5ded	dist: support Arch Linux Add support Arch Linux on setup script.	2021-01-13 19:32:45 +09:00
Takuya ASADA	0d11f9463d	dist: make sysconfig directory detectable Currently, install.sh provides a way to customize the sysconfig directory, but the sysconfig directory is hardcoded in the scripts. Also, /etc/sysconfig seems correct to use as the default value, but the current code specifies /etc/default for non-redhat distributions. Instead of hardcoding, generate a python script in install.sh to save the specified sysconfig directory path in python code.	2021-01-13 19:32:45 +09:00
Wojciech Mitros	93613e20a3	api: remove potential large allocation in /column_family/ GET request handler The reply to a /column_family/ GET request contains info about all column families. Currently, all this info is stored in a single string when replying, and this string may require a big allocation when there are many column families. To avoid that allocation, instead of a single string, use a body_writer function, which writes chunks of the message content to the output stream. Fixes #7916 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com> Closes #7917	2021-01-13 12:04:18 +02:00
Avi Kivity	ed53b3347e	Merge 'idl: remove the large allocation in mutation_partition_view::rows()' from Wojciech Mitros After these changes the generated code deserializes the stream into a chunked vector, instead of an contiguous one, so even if there are many fields in it, there won't be any big allocations. I haven't run the scylla cluster test with it yet but it passes the unit tests. Closes #7919 * github.com:scylladb/scylla: idl: change the type of mutation_partition_view::rows() to a chunked_vector idl-compiler: allow fields of type utils::chunked_vector	2021-01-13 11:07:29 +02:00
Nadav Har'El	711b311d47	cql-pytest: tests for fromJson() integer overflow Numbers in JSON are not limited in range, so when the fromJson() function converts a number to a limited-range integer column in Scylla, this conversion can overflow. The following tests check that this conversion should result in an error (FunctionFailure), not silent truncation. Scylla today does silently wrap around the number, so these tests xfail. They pass on Cassandra. Refs #7914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>	2021-01-13 11:07:29 +02:00
Nadav Har'El	617e1be1b6	cql-pytest: expand tests for fromJson() failures This patch adds more (failing) tests for issue #7911, where fromJson() failures should be reported as a clean FunctionFailure error, not an internal server error. The previous tests we had were about JSON parse failures, but a different type of error we should support is valid JSON which returned the wrong type - e.g., the JSON returning a string when an integer was expected, or the JSON returning a string with non-ASCII characters when ASCII was expected. So this patch adds more such tests. All of them xfail on Scylla, and pass on Cassandra. Refs #7911. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>	2021-01-13 11:07:29 +02:00
Nadav Har'El	2ebe8055ee	cql-pytest: add test for fromJson() null parameter. This patch adds a reproducer test for issue #7912, which is about passing a null parameter to the fromJson() function. This is supposed to be legal (and return a null value), and is legal in Cassandra, but isn't allowed in Scylla. There are two tests - for a prepared and unprepared statement - which fail in different ways. The issue is still open so the tests xfail on Scylla - and pass on Cassandra. Refs #7912. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>	2021-01-13 11:07:29 +02:00
dgarcia360	78e9f45214	docs: update url Related issue scylladb/sphinx-scylladb-theme#88 Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com Frequently asked questions: Should we change the links in the README/docs folder? GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections. Do I need to add this new domain in the custom dns domain section on GitHub settings? It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged. The DNS doesn't seem to have the right SSL certificates GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert. Closes #7877	2021-01-13 11:07:29 +02:00
Avi Kivity	d508a63d4b	row_cache: linearize key in cache_entry::do_read() do_read() does not linearize cache_entry::_key; this can cause a crash with keys larger than 13k. Fixes #7897. Closes #7898	2021-01-13 11:07:29 +02:00
dgarcia360	36f8d35812	docs: added multiversion_regex_builder Fixed makefile Added path Closes #7876	2021-01-13 11:07:29 +02:00
Benny Halevy	5e41228fe8	test: everywhere: use seastar::testing::local_random_engine Use the thread_local seastar::testing::local_random_engine in all seastar tests so they can be reproduced using the --random-seed option. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>	2021-01-13 11:07:29 +02:00
Benny Halevy	43ab094c88	configure: add utf8_test to pure_boost_tests Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210112103713.578301-1-bhalevy@scylladb.com>	2021-01-13 11:07:29 +02:00
Dejan Mircevski	d79c2cab63	cql3: Use correct comparator in timeuuid min/max The min/max aggregators use aggregate_type_for comparators, and the aggregate_type_for<timeuuid> is regular uuid. But that yields wrong results; timeuuids should be compared as timestamps. Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid, so aggregators can distinguish between the two. Then specialize the aggregation utilities for timeuuid. Add a cql-pytest and change some unit tests, which relied on naive uuid comparators. Fixes #7729. Tests: unit (dev, debug) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #7910	2021-01-13 11:07:29 +02:00
Avi Kivity	96d64b7a1f	Merge "Wire interposer consumer for memtable flush" from Raphael " Without interposer consumer on flush, it could happen that a new sstable, produced by memtable flush, will not conform to the strategy invariant. For example, with TWCS, this new sstable could span multiple time windows, making it hard for the strategy to purge expired data. If interposer is enabled, the data will be correctly segregated into different sstables, each one spanning a single window. Fixes #4617. tests: - mode(dev). - manually tested it by forcing a flush of memtable spanning many windows " * 'segregation_on_flush_v2' of github.com:raphaelsc/scylla: test: Add test for TWCS interposer on memtable flush table: Wire interposer consumer for memtable flush table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader table: Allow sstable write permit to be shared across monitors memtable: Track min timestamp table: Extend cache update to operate a memtable split into multiple sstables	2021-01-13 11:07:29 +02:00
Nadav Har'El	8164c52871	cql-pytest: add test for fromJson() parse error This patch adds a reproducer test for issue #7911, which is about a parse error in JSON string passed to the fromJson() function causing an internal error instead of the expected FunctionFailure error. The issue is still open so the test xfails on Scylla (and passes on Cassandra). Refs #7911. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>	2021-01-13 11:07:29 +02:00
Pavel Solodovnikov	10e3da692f	lwt: validate `paxos_grace_seconds` table option The option can only take integer values >= 0, since negative TTL is meaningless and is expected to fail the query when used with `USING TTL` clause. It's better to fail early on `CREATE TABLE` and `ALTER TABLE` statement with a descriptive message rather than catch the error during the first lwt `INSERT` or `UPDATE` while trying to insert to system.paxos table with the desired TTL. Tests: unit(dev) Fixes: #7906 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>	2021-01-13 11:07:29 +02:00
Gleb Natapov	51bf5f5846	raft: test: do not check snapshot during backpressure test Unfortunately snapshot checking still does not work in the presence of log entries reordering. It is impossible to know when exactly the snapshot will be taken and if it is taken before all smaller than snapshot idx entries are applied the check will fail since it assumes that. This patch disabled snapshot checking for SUM state machine that is used in backpressure test. Message-Id: <20201126122349.GE1655743@scylladb.com>	2021-01-13 11:07:29 +02:00
Wojciech Mitros	59769efd3b	idl: change the type of mutation_partition_view::rows() to a chunked_vector The value of mutation_partition_view::rows() may be very large, but is used almost exclusively for iteration, so in order to avoid a big allocation for an std::vector, we change its type to an utils::chunked_vector. Fixes #7918 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-01-13 04:25:53 +01:00
Wojciech Mitros	88e750f379	idl-compiler: allow fields of type utils::chunked_vector The utils::chunked_vector has practically the same methods as a std::vector, so the same code can be generated for it. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-01-13 04:09:18 +01:00
Avi Kivity	ccd09f1398	Update seastar submodule * seastar 6b36e84c3...a287bb1a3 (1): > merge: file: correct dma alignment for odd filesystems Ref #7794.	2021-01-11 20:38:59 +02:00
Tomasz Grabiec	6cfc949e62	Merge "sstables: validate the writer's input with the mutation fragment stream validator" from Botond We have recently seen a suspected corrupt mutation fragment stream to get into an sstable undetected, causing permanent corruption. One of the suspected ways this could happen is the compaction sstable write path not being covered with a validator. To prevent events like this in the future make sure all sstable write paths are validated by embedding the validator right into the sstable writer itself. Refs: #7623 Refs: #7640 Tests: unit(release) * https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2: sstable_writer: add validation test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation mutation_fragment_stream_validator: make it easier to validate concrete fragment types flat_mutation_reader: extract fragment stream validator into its own header	2021-01-11 14:57:48 +01:00
Calle Wilund	4be718ebfa	commitlog: Force earlier cycle/flush iff segment reserve is empty Attempt to hurry flushing/segment delete/recycle if we are trying to get a segment for allocation, and reserve is empty when above disk threshold. This is to minimize time waited in allocation semaphore.	2021-01-11 12:45:36 +00:00
Calle Wilund	be8c359a62	commitlog: Make segment allocation wait iff disk usage > max Instead of allowing new segments to be added, explicitly wait for either disk delete or recycle to happen iff current disk usage is larger than limit.	2021-01-11 12:45:36 +00:00
Calle Wilund	ab55a1b4e6	commitlog: Do partial (memtable) flushing based on threshold Instead of asking to flush data for all segments, just request up to an RP where we get comfortably below disk usage threshold.	2021-01-11 12:45:10 +00:00
Pekka Enberg	42806c6f40	Update seastar submodule * seastar ed345cdb...6b36e84c (3): > perftune.py: Don't print nic driver name to avoid Fixes #7905 > io_tester: Make file sizes configurable > io_queue: Limit tickets for oversized requests	2021-01-11 14:12:06 +02:00
Pavel Solodovnikov	0981b786a8	db/query_options: specify serial consistency for DEFAULT specific_options Cassandra constructs `QueryOptions.SpecificOptions` in the same way that we do (by not providing `serial_constency`), but they do have a user-defined constructor which does the following thing: this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency; This effectively means that DEFAULT `SpecificOptions` always have `SerialConsistency` set to `SERIAL`, while we leave this `std::nullopt`, since we don't have a constructor for `specific_options` which does this. Supply `db::consistency_level::SERIAL` explicitly to the `specific_options::DEFAULT` value. Tests: unit(dev) Fixes: #7850 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>	2021-01-11 12:12:29 +02:00
Nadav Har'El	a3f9bd9c3f	cql-pytest: add xfailing reproducer for issue #7888 This adds a simple reproducer for a bug involving a CONTAINS relation on frozen collection clustering columns when the query is restricted to a single partition - resulting in a strange "marshalling error". This bug still exists, so the test is marked xfail. Refs #7888. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>	2021-01-11 08:49:16 +01:00
Nadav Har'El	678da50a10	cql-pytest: add reproducers for reversed frozen collection bugs We add a reproducer for issues #7868 and #7875 which are about bugs when a table has a frozen collection as its clustering key, and it is sorted in reverse order: If we tried to insert an item to such a table using an unprepared statement, it failed with a wrong error ("invalid set literal"), but if we try to set up a prepared statement, the result is even worse - an assertion failure and a crash. Interestingly, neither of these problems happen without reversed sort order (WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which demonstrates that with default (increasing) order, everything works fine. All tests pass successfully when run against Cassandra. The fix for both issues was already committed, so I verified these tests reproduced the bug before that commit, and pass now. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>	2021-01-11 08:48:30 +01:00
Nadav Har'El	f32c34d8ad	cql-pytest: port Cassandra's unit test validation/entities/frozen_collections_test In this patch, we port validation/entities/frozen_collections_test.java, containing 33 tests for frozen collections of all types, including nesting collections. In porting these tests, I uncovered four previously unknown bugs in Scylla: Refs #7852: Inserting a row with a null key column should be forbidden. Refs #7868: Assertion failure (crash) when clustering key is a frozen collection and reverse order. Refs #7888: Certain combination of filtering, index, and frozen collection, causes "marshalling error" failure. Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections. These tests also provide two more reproducers for an already known bug: Refs #7745: Length of map keys and set items are incorrectly limited to 64K in unprepared CQL. Due to these bugs, 7 out of the 33 tests here currently xfail. We actually had more failing tests, but we fixed issue #7868 before this patch went in, so its tests are passing at the time of this submission. As usual in these sort of tests, all 33 pass when running against Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>	2021-01-11 08:48:08 +01:00
Nadav Har'El	0516cd1609	alternator test: de-duplicate some duplicate code In test_streams.py we had some code to get a list of shards and iterators duplicated three times. Put it in a function, shards_and_latest_iterators(), to reduce this duplication. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201006112421.426096-1-nyh@scylladb.com>	2021-01-11 08:47:25 +01:00
Botond Dénes	cb4d92aae4	sstable_writer: add validation Add a mutation_fragment_stream_validating_filter to sstables::writer_impl and use it in sstable_writer to validate the fragment stream passed down to the writer implementation. This ensures that all fragment streams written to disk are validated, and we don't have to worry about validating each source separately. The current validator from sstable::write_components() is removed. This covers only part of the write paths. Ad-hoc validations in the reader implementations are removed as well as they are now redundant.	2021-01-11 09:12:56 +02:00
Botond Dénes	4b254a26ab	test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation The test violates clustering key order on purpose to produce a corrupt sstable (to test scrub). Disable key validation so when we move the validator into the writer itself in the next patch it doesn't abort the test.	2021-01-11 09:12:56 +02:00
Botond Dénes	8dae6152bf	mutation_fragment_stream_validator: make it easier to validate concrete fragment types The current API is tailored to the `mutation_fragment` type. In the next patch we will want to use the validator from a context where the mutation fragments are already decomposed into their respective concrete types, e.g. static_row, clustering_row, etc. To avoid having to reconstruct a mutation fragment type just to use the validator, add an API which allows validating these concrete types conveniently too.	2021-01-11 08:07:42 +02:00
Botond Dénes	495f9d54ba	flat_mutation_reader: extract fragment stream validator into its own header To allow using it without pulling in the huge `flat_mutation_reader.hh`.	2021-01-11 08:07:42 +02:00
Dejan Mircevski	3aa80f47fe	abstract_type: Rework unreversal methods Replace two methods for unreversal (`as` and `self_or_reversed`) with a new one (`without_reversed`). More flexible and better named. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #7889	2021-01-10 19:30:12 +02:00
Tomasz Grabiec	15b5b286d9	Merge "frozen_mutation: better diagnostics for out-of-order and duplicate rows" from Botond Currently, frozen mutations, that contain partitions with out-of-order or duplicate rows will trigger (if they even do) an assert in `row::append_cell()`. However, this results in poor diagnostics (if at all) as the context doesn't contain enough information on what exactly went wrong. This results in a cryptic error message and an investigation that can only start after looking at a coredump. This series remedies this problem by explicitly checking for out-of-order and duplicate rows, as early as possible, when the supposedly empty row is created. If the row already existed (is a duplicate) or it is not the last row in the partition (out-of-order row) an exception is thrown and the deserialization is aborted. To further improve diagnostics, the partition context is also added to the exception. Tests: unit(release) * botond/frozen-mutation-bad-row-diagnostics/v3: frozen_mutation: add partition context to errors coming from deserializing partition_builder: accept_row(): use append_clustering_row() mutation_partition: add append_clustered_row()	2021-01-10 19:30:12 +02:00
Pekka Enberg	e5fe0acd15	Update seastar submodule * seastar 56cfe179...ed345cdb (1): > perftune.py: Fix the dump options after adding multiple nics option Refs #6266	2021-01-08 18:13:26 +01:00
Benny Halevy	60bde99e8e	flat_mutation_reader: consume_in_thread: always filter.on_end_of_stream on return Since we're calling _consumer.consume_end_of_stream() unconditionally when consume_pausable_in_thread returns. Refs #7623 Refs #7640 Test: unit(dev) Dtest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_with_resharding_low_to_half_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210106103024.3494569-1-bhalevy@scylladb.com>	2021-01-08 18:13:26 +01:00
Michał Chojnowski	f317b3c39f	mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator measuring_allocator is a wrapper around standard_allocator, but it exposed the default preferred_max_contiguous_allocation, not the one from standard_allocator. Thus managed_bytes allocated in those two allocators had fragments of different size, and their total memory usage differed, causing test_external_memory_usage to fail if standard_allocator::preferred_max_contiguous_allocation was changed from the default. Fix that.	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	907b73a652	row_cache: more indentation fixes Fixup indentation issues introduced in recent patches. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	eb523d4ac8	utils: remove unused linearization facilities in `managed_bytes` class Remove the following bits of `managed_bytes` since they are unused: * `with_linearized_managed_bytes` function template * `linearization_context_guard` RAII wrapper class for managing `linearization_context` instances. * `do_linearize` function * `linearization_context` class Since there is no more public or private methods in `managed_class` to linearize the value except for explicit `with_linearized()`, which doesn't use any of aforementioned parts, we can safely remove these. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	8709844566	misc: fix indentation The patch fixes indentation issues introduced in previous patches related to removing `with_linearized_managed_bytes` uses from the code tree. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	e04eb68a9c	treewide: remove remaining `with_linearized_managed_bytes` uses There is no point in calling the wrapper since linearization code is private in `managed_bytes` class and there is no one to call `managed_bytes::data` because it was deleted recently. This patch is a prerequisite for removing `with_linearized_managed_bytes` function completely, alongside with the corresponding parts of implementation in `managed_bytes`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Pavel Solodovnikov	bf8b138b42	memtable, row_cache: remove `with_linearized_managed_bytes` uses Since `managed_bytes::data()` is deleted as well as other public APIs of `managed_bytes` which would linearize stored values except for explicit `with_linearized`, there is no point invoking `with_linearized_managed_bytes` hack which would trigger automatic linearization under the hood of managed_bytes. Remove useless `with_linearized_managed_bytes` wrapper from memtable and row_cache code. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Avi Kivity	3bf6b78668	utils: managed_bytes: remove linearizing accessors Accessors that require linearization, such as data(), begin(), and casting to bytes_view, are no longer used and are now removed.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	dbcf987231	keys, compound: switch from bytes_view to managed_bytes_view The keys classes (partition_key et al) already use managed_bytes, but they assume the data is not fragmented and make liberal use of that by casting to bytes_view. The view classes use bytes_view. Change that to managed_bytes_view, and adjust return values to managed_bytes/managed_bytes_view. The callers are adjusted. In some places linearization (to_bytes()) is needed, but this isn't too bad as keys are always <= 64k and thus will not be fragmented when out of LSA. We can remove this linearization later. The serialize_value() template is called from a long chain, and can be reached with either bytes_view or managed_bytes_view. Rather than trace and adjust all the callers, we patch it now with constexpr if. operator bytes_view (in keys) is converted to operator managed_bytes_view, allowing callers to defer or avoid linearization.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	a1a0839164	sstables: writer: add write_* helpers for managed_bytes_view We will use them in the upcoming patch where we transition keys from bytes_view to mutable_bytes_view.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	45c1b90eb5	compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view The underlying view will change from bytes_view to managed_bytes_view in the next commits, so we prepare for that.	2021-01-08 14:16:08 +01:00
Avi Kivity	d9fcc4f4ef	types: change equal() to accept managed_bytes_view bytes_view can convert to managed_bytes_view, so the change is compatible with the existing representation and the next patches, which change compound types to use managed_bytes_view.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	1de0b9a425	types: add parallel interfaces for managed_bytes_view We will need those to transition keys and compound from bytes_view to managed_bytes_view.	2021-01-08 14:16:08 +01:00
Avi Kivity	d1f354f5fb	types: add to_managed_bytes(const sstring&) This is a helper for tests (similar to to_bytes(const sstring&)).	2021-01-08 14:16:08 +01:00
Michał Chojnowski	c6eb485675	serializer_impl: handle managed_bytes without linearizing With managed_bytes_view implemented, it's easy to de/serialize managed_bytes without linearization.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	bf0ec63e34	utils: managed_bytes: add managed_bytes_view::operator[] This operator has a single purpose: an easier port of legacy_compound_view from bytes_view to managed_bytes_view. It is inefficient and should be removed as soon as legacy_compound_view stops using operator[].	2021-01-08 14:16:08 +01:00
Michał Chojnowski	778269151a	utils: managed_bytes: introduce managed_bytes_view managed_bytes_view is a non-owning view into managed_bytes. It can also be implicitly constructed from bytes_view. It conforms to the FragmentedView concept and is mainly used through that interface. It will be used as a replacement for bytes_view occurrences currently obtained by linearizing managed_bytes.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	cf7d25b98d	utils: fragment_range: add serialization helpers for FragmentedMutableView We will use them to write to managed_bytes_view in an upcoming patch, to avoid linearization in compound_type::serialize_value.	2021-01-08 14:16:07 +01:00
Michał Chojnowski	75898ee44e	bytes: implement std::hash using appending_hash This is a preparation for the upcoming introduction of managed_bytes_view, intended as a fragmented replacement for bytes_view. To ease the transition, we want both types to give equal hashes for equal contents.	2021-01-08 13:17:46 +01:00
Michał Chojnowski	4822730752	utils: mutable_view: add substr() Analogous to bytes_view::substr. This bit of functionality will be used to implement managed_bytes_mutable_view.	2021-01-08 13:17:46 +01:00
Dejan Mircevski	9eed26ca3d	cql3: Fix maps::setter_by_key for unset values Unset values for key and value were not handled. Handle them in a manner matching Cassandra. This fixes all cases in testMapWithUnsetValues, so re-enable it (and fix a comment typo in it). Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Dejan Mircevski	4515a49d4d	cql3: Fix `IN ?` for unset values When the right-hand side of IN is an unset value, we must report an error, like Cassandra does. This fixes testListWithUnsetValues, so re-enable it. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Dejan Mircevski	5bee97fa51	cql3: Fix handling of scalar unset value Make the bind() operation of the scalar marker handle the unset-value case (which it previously didn't). Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Dejan Mircevski	8b2f459622	cql3: Fix crash when removing unset_value from set Avoid crash described in #7740 by ignoring the update when the element-to-remove is UNSET_VALUE. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Pekka Enberg	e81f4caf67	Update seastar submodule * seastar a2fc9d72...56cfe179 (1): > perftune.py: Fix nic_is_bond_iface() and other function signatures Refs #6266	2021-01-07 13:22:20 +02:00
Takuya ASADA	10184ba64f	redis: implement parse error, reply error message correctly Since we haven't implemented parse error on redis protocol parser, reply message is broken at parse error. Implemented parse error, reply error message correctly. Fixes #7861 Fixes #7114 Closes #7862	2021-01-07 13:22:20 +02:00
Dejan Mircevski	176ff0238a	cql3: Fix handling of reverse-order maps When the clustering order is reversed on a map column, the column type is reversed_type_impl, not map_type_impl. Therefore, we have to check for both reversed type and map type in some places. This patch handles reverse types in enough places to make test_clustering_key_reverse_frozen_map pass. However, it leaves other places (invocations of is_map() and *_cast<map_type_impl>()) as they currently are; some are protected by callers from being invoked on reverse types, but some are quite possibly bugs untriggered by existing tests. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Dejan Mircevski	6bb10fcf36	cql3: Fix handling of reverse-order lists When the clustering order is reversed on a list column, the column type is reversed_type_impl, not list_type_impl. Therefore, we have to check for both reversed type and list type in some places. This patch handles reverse types in enough places to make test_clustering_key_reverse_frozen_list pass. However, it leaves other places (invocations of is_list() and *_cast<list_type_impl>()) as they currently are; some are protected by callers from being invoked on reverse types, but some are quite possibly bugs untriggered by existing tests. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Dejan Mircevski	14fa39cfa6	cql3: Fix handling of reverse-order sets When the clustering order is reversed on a set column, the column type is reversed_type_impl, not set_type_impl. Therefore, we have to check for both reversed type and set type in some places. To make such checks easier, add convenience methods self_or_reversed() and as() to abstract_type. Invoke those methods (instead of is_set() and casts) enough to make test_clustering_key_reverse_frozen_set pass. Leave other invocations of is_set() and *_cast<set_type_impl>() as they are; some are protected by callers from being invoked on reverse types, but some are quite possibly bugs untriggered by existing tests. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-01-07 13:22:20 +02:00
Calle Wilund	7c84b16cd8	commitlog: Make flush threshold configurable	2021-01-05 18:16:09 +00:00
Calle Wilund	c3d95811da	table: Add a flush RP mark to table, and shortcut if not above Adds a second RP to table, marking where we flushed last. If a new flush request comes in that is below this mark, we can skip a second flush. This is to (in future) support incremental CL flush.	2021-01-05 18:16:09 +00:00
Raphael Carvalho	28a2aca627	Fix doc for building pkgs for a specific build mode Closes #7878	2021-01-05 18:56:21 +02:00
Tomasz Grabiec	1d717f37e2	vint-serialization: Reference the correct spec We are not using the protocol buffers format for vint. Message-Id: <1609865471-22292-1-git-send-email-tgrabiec@scylladb.com>	2021-01-05 18:54:09 +02:00
Vojtech Havel	d858c57357	cql3: allow SELECTs restricted by "IN" to retrieve collections This patch enables select cql statements where collection columns are selected columns in queries where clustering column is restricted by "IN" cql operator. Such queries are accepted by cassandra since v4.0. The internals actually provide correct support for this feature already, this patch simply removes relevant cql query check. Tests: cql-pytest (testInRestrictionWithCollection) Fixes #7743 Fixes #4251 Signed-off-by: Vojtech Havel <vojtahavel@gmail.com> Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>	2021-01-05 14:39:18 +02:00
Pekka Enberg	e54cc078a1	Update seastar submodule * seastar d1b5d41b...a2fc9d72 (6): > perftune.py: support passing multiple --nic options to tune multiple interfaces at once > perftune.py recognize and sort IRQs for Mellanox NICs > perftune.py: refactor getting of driver name into __get_driver_name() Fixes #6266 > install-dependencies: support Manjaro > append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size > future: do_until: handle exception in stop condition	2021-01-05 13:32:21 +02:00
Avi Kivity	43a2636229	Merge "Remove proxy from size-estimates reader" from Pavel E " The size_estimates_mutation_reader calls for global proxy to get database from. The database is used to find keyspaces to work with. However, it's safe to keep the local database reference on the reader itself. tests: unit(debug) " * 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla: size_estimate_reader: Use local db reference not global size_estimate_reader: Keep database reference on mutation reader size_estimate_reader: Keep database reference on virtual_reader	2021-01-05 11:28:09 +02:00
Pavel Emelyanov	9632af5d6b	schema_tables: Drop unused merge_schema overload After the `d3aa1759` one of them became unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210105051724.5249-1-xemul@scylladb.com>	2021-01-05 11:25:22 +02:00
Michał Chojnowski	6c97027f85	utils: fragment_range: add compare_unsigned We will use it to compare fragmented buffers (mainly managed_bytes_view in types, compound, and tests) without linearization.	2021-01-04 22:50:45 +01:00
Michał Chojnowski	2d28471a59	utils: managed_bytes: make the constructors from bytes and bytes_view explicit Conversions from views to owners have no business being implicit. Besides, they would also cause various ambiguity problems when adding managed_bytes_view.	2021-01-04 22:22:12 +01:00
Raphael S. Carvalho	d265bb9bdb	test: Add test for TWCS interposer on memtable flush Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 16:55:06 -03:00
Raphael S. Carvalho	9124a708f1	table: Wire interposer consumer for memtable flush From now on, memtable flush will use the strategy's interposer consumer iff split_during_flush is enabled (disabled by default). It has effect only for TWCS users as TWCS is the only strategy that goes on to implement this interposer consumer, which consists of segregating data according to the window configuration. Fixes #4617. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 16:26:07 -03:00
Raphael S. Carvalho	c926a948e5	table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader This new variant will be needed for interposer consumer. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 16:23:00 -03:00
Raphael S. Carvalho	32acb44fec	table: Allow sstable write permit to be shared across monitors As a preparation for interposer on flush, let's allow database write monitor to store a shared sstable write permit, which will be released as soon as any of the sstable writers reach the sealing stage. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 14:46:43 -03:00
Nadav Har'El	ed31dd1742	cql-pytest: port Cassandra's unit test validation/entities/counters_test In this patch, we port validation/entities/counters_test.java, containing 7 tests for CQL counters. Happily, these tests did not uncover any bugs in Scylla and all pass on both Cassandra and Scylla. There is one small difference that I decided to ignore instead of reporting a bug. If you try a CREATE TABLE with both counter and non-counter columns, Scylla gives a ConfigurationException error, while Cassandra gives a more reasonable InvalidRequest. The ported test currently allows both. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>	2021-01-04 18:25:48 +01:00
Nadav Har'El	05d6eff850	cql-pytest: add tests for non-support of unicode equivalence In issue #7843 there were questions raised on how much does Scylla support the notion of Unicode Equivalence, a.k.a. Unicode normalization. Consider the Spanish letter ñ - it can be represented by a single Unicode character 00F1, but can also be represented as a 006E (lowercase "n") followed by a 0303 ("combining tilde"). Unicode specifies that these two representations should be considered "equivalent" for purposes of sorting or searching. But the following tests demonstrate that this is not, in fact, supported in Scylla or Cassandra: 1. If you use one representation as the key, then looking up the other one will not find the row. Scylla (and Cassandra) do not consider the two strings equivalent. 2. The LIKE operator (a Scylla-only extension) doesn't know that the single-character ñ begins with an n, or that the two-character ñ is just a single character. This is despite the thinking on #7843 that by using ICU in the implementation of LIKE, we somehow got support for this. We didn't. Refs #7843 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>	2021-01-04 18:25:28 +01:00
Nadav Har'El	feb028c97e	cql-pytest: add reproducer for issue 7856 This patch adds a reproducer for issue #7856, which is about frozen sets and how we can in Scylla (but not in Cassandra), insert one in the "wrong" order, but only in very specific circumstances which this reproducer demonstrates: The bug can only be reproduced in a nested frozen collection, and using prepared statements. Refs #7856 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>	2021-01-04 18:25:12 +01:00
Raphael S. Carvalho	738049cba2	memtable: Track min timestamp Tracking both min and max timestamp will be required for memtable flush to short-circuit interposer consumer if needed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 13:24:43 -03:00
Raphael S. Carvalho	5519fdba72	table: Extend cache update to operate a memtable split into multiple sstables This extension is needed for future work where a memtable will be segregated during flush into one sstable or more. So now multiple sstables can be added to the set after a memtable flush, and compaction is only triggered at the end. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-01-04 13:24:10 -03:00
Piotr Sarna	d5da455d95	schema_tables: describe calculate_schema_digest better - the mystical `accept_predicate` is renamed to `accept_keyspace` to be more self-descriptive - a short comment is added to the original calculate_schema_digest function header, mentioning that it computes schema digest for non-system keyspaces Refs #7854 Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>	2021-01-04 14:46:17 +02:00
Amos Kong	8b231a3bd9	install.sh: switch to use realpath for EnvironmentFile In scylla-jmx, we fixed a hardcoded sysconfdir in the EnvironmentFile path by using realpath to convert the path. This patch switches the scylla repo to realpath as well, to make it consistent with scylla-jmx. Suggested-by: Pekka Enberg <penberg@scylladb.com> Signed-off-by: Amos Kong <amos@scylladb.com> Closes #7860	2021-01-04 12:45:17 +02:00
Avi Kivity	33ee07a9d8	Merge 'Skip internal distributed tables in schema_change_test' from Piotr Sarna The original idea for `schema_change_test` was to ensure that if schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed. Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain. To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test. Tests: * unit(dev) * manual(rebasing on top of a change which adds two distributed system tables - all tests still passed) Refs #7617 Closes #7854 * github.com:scylladb/scylla: schema_change_test: skip distributed system tables in digest schema_tables: allow custom predicates in schema digest calc alternator: drop unneeded sstring creation system_keyspace: migrate helper functions to string_view database: migrate find_keyspace to string views	2021-01-04 12:44:03 +02:00
Piotr Sarna	e26aa836a9	schema_change_test: skip distributed system tables in digest With previous design of the schema change test, a regeneration was necessary each time a new distributed system table was added. It was not the original purpose of the test to keep track of new distributed tables which simply propagate on their own, so the test case is now modified: internal distributed tables are not part of the schema digest anymore, which means that changes inside them will not cause mismatches. This change involves a one-shot regeneration of all digests, which due to historical reasons included internal distributed tables in the digest, but no further regenerations should ever be necessary when a new internal distributed table is added.	2021-01-04 10:24:40 +01:00
Piotr Sarna	13a60b02ea	schema_tables: allow custom predicates in schema digest calc For testing purposes it would be useful to be able to skip computing schema for certain tables (namely, internal distributed tables). In order to allow that, a function which accepts a custom predicate is added.	2021-01-04 10:11:41 +01:00
Piotr Sarna	12b5184933	alternator: drop unneeded sstring creation It's now possible to use string views to check if a particular table is a system table, so it's no longer needed to explicitly create an sstring instance.	2021-01-04 09:47:01 +01:00
Piotr Sarna	f293c59a46	system_keyspace: migrate helper functions to string_view Functions for checking if the keyspace is system/internal were based on sstring references, which is impractical compared to string views and may lead to unnecessary creation of sstring instances.	2021-01-04 09:47:01 +01:00
Piotr Sarna	aba9772eff	database: migrate find_keyspace to string views ... in order to avoid creating unnecessary sstring instances just to compare strings.	2021-01-04 09:47:01 +01:00
Gleb Natapov	d3aa17591c	migration_manager: drop announce_locally flag It looks like the history of the flag begins in Cassandra's https://issues.apache.org/jira/browse/CASSANDRA-7327 where it was introduced to speed up tests by not needing to start the gossiper. The thing is we always start the gossiper in our cql tests, so the flag only introduces noise. And, of course, since we want to move schema to use raft, being able to apply a modification only locally goes against the nature of raft, so we had better get rid of the capability ASAP. Tests: units(dev, debug) Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>	2021-01-03 13:58:09 +02:00
Gleb Natapov	491f10bb70	schema-tables: make schema update global when fixing legacy SI tables When a node notices that it uses legacy SI tables, it converts them to the new format, but it updates only the local schema. This only causes schema discrepancy between nodes; the schema change should propagate globally. Fixes #7857. Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>	2021-01-03 13:57:46 +02:00
Raphael S. Carvalho	d55d65d77c	compaction: Enable filtering reader only on behalf of cleanup compaction After `13fa2bec4c`, every compaction is performed through a filtering reader because consumers cannot do the filtering if the interposer consumer is enabled. It turns out that filtering_reader adds significant overhead when regular compactions are running. As no other compaction type actually needs to do any filtering, let's limit filtering_reader to cleanup compaction. Alternatively, we could disable the interposer consumer on behalf of cleanup, or add support for the consumers to do the filtering themselves, but that would add lots of complexity. Fixes #7748. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>	2021-01-03 12:02:43 +02:00
Raphael S. Carvalho	e42d277805	compaction: Drop needless partition filter for regular compaction This filter is used to discard data that doesn't belong to current shard, but scylla will only make a sstable available to regular compaction after it was resharded on either boot or refresh. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>	2021-01-03 12:02:42 +02:00
Pekka Enberg	5872b754e0	Revert "dist/docker: Remove 'epel-release' from Docker image" This reverts commit `ceb67e7728`. The "epel-release" package is needed to install the "supervisord" package, which I somehow missed in testing... Fixes #7851	2021-01-02 12:49:12 +02:00
Nadav Har'El	93a2c52338	cql-pytest: add tests for inserting rows with missing key columns This patch adds two simple tests for what happens when a user tries to insert a row with one of the key columns missing. The first test confirms that if the column is completely missing, we correctly print an error (this was issue #3665, that was already marked fixed). However, the second test demonstrates that we still have a bug when the key column appears in the command, but with a null value. In this case, instead of failing the insert (as Cassandra does), we silently ignore it. This is the proper behavior for UNSET_VALUE, but not for null. So the second test is marked xfail, and I opened issue #7852 about it. Refs #3665 Refs #7852 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>	2020-12-30 18:20:01 +01:00
Nadav Har'El	10fbef5bff	cql-pytest: clean up test_using_timeout.py In a previous version of test_using_timeout.py, we had tables pre-filled with some content labeled "everything". The current version of the tests doesn't use it, so drop it completely. One test, test_per_query_timeout_large_enough, still had code that did `res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))` followed by `assert res == everything`. This was a bug - it only works as expected if this test is run before any other test is run, and will fail if we ever reorder or parallelize these tests. So drop these two lines. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>	2020-12-30 09:16:25 +01:00
Asias He	84f482bde4	table: Add make_streaming_reader for given sstables set Add a streaming reader that streams from a given sstables set. Refs #7831	2020-12-30 08:32:42 +08:00
Nadav Har'El	5f24ff9187	Merge 'Coroutinize alternator tagging requests' from Piotr Sarna This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine. NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla. Closes #7453 * github.com:scylladb/scylla: alternator: coroutinize untagging a resource alternator: coroutinize tagging a resource	2020-12-29 23:36:25 +02:00
Avi Kivity	700ddd1914	Merge 'scylla_setup: enable node_exporter for offline installation' from Amos Kong node_exporter had been added to scylla-server package by commit `95197a09c9`. So we can enable it by default for offline installation. Closes #7832 * github.com:scylladb/scylla: scylla_setup: cleanup if judgments scylla_setup: enable node_exporter for offline installation	2020-12-28 22:07:36 +02:00
Avi Kivity	1716359455	Update tools/jmx submodule * tools/jmx 20469bf...2c95650 (1): > install.sh: set a valid WorkingDirectory for nonroot offline install	2020-12-28 21:19:04 +02:00
Avi Kivity	f7b731bc46	Merge 'Fix potential reactor stall on LCS compaction completion' from Raphael Carvalho On every compaction completion, sstable set is rebuilt from scratch. With LCS and ~160G of data per shard, it means we'll have to create a new sstable set with ~1000 entries whenever compaction completes, which will likely result in reactor stalling for a significant amount of time. Fixes #7758. Closes #7842 * github.com:scylladb/scylla: table: Fix potential reactor stall on LCS compaction completion table: decouple preparation from execution when updating sstable set table: change rebuild_sstable_list to return new sstable set row_cache: allow external updater to decouple preparation from execution	2020-12-28 21:16:17 +02:00
Pavel Emelyanov	7ac435f67c	test: Enhance test for range_tombstone_list de-overlapping The range_tombstone_list always (unless misused?) contains de-overlapped entries. There's a test_add_random that checks this, but it suffers from several problems: - generated "random" ranges are sequential and may only overlap on their borders - test uses the keys of the same prefix length Enhance the generator part to produce a purely random sequence of ranges with bound keys of arbitrary length. Just pay attention to generate the "valid" individual ranges, whose start is not ahead of the end. Also -- rename the test to reflect what it's doing and increase the number of iterations. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20201228115525.20327-1-xemul@scylladb.com>	2020-12-28 18:26:48 +02:00
Raphael S. Carvalho	8dd7280107	table: Fix potential reactor stall on LCS compaction completion On every compaction completion, sstable set is rebuilt from scratch. With LCS and ~160G of data per shard, it means we'll have to create a new sstable set with ~1000 entries whenever compaction completes, which will likely result in reactor stalling for a significant amount of time. This is fixed by futurizing build_new_sstable_list(), so it will yield whenever needed. Fixes #7758. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-12-28 13:17:50 -03:00
Raphael S. Carvalho	6082da4703	table: decouple preparation from execution when updating sstable set row cache now allows updater to first prepare the work, and then execute the update atomically as the last step. let's do that when rebuilding the set, so now new set is created in the preparation phase, and the new set replaces the old one in the execution phase, satisfying the atomicity requirement of row cache. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-12-28 13:17:48 -03:00
Raphael S. Carvalho	43f0200b8f	table: change rebuild_sstable_list to return new sstable set procedure is changed to return the new set, so caller will be responsible for replacing the old set with the new one. this will allow our future work where building new set and enabling it will be decoupled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-12-28 13:17:47 -03:00
Raphael S. Carvalho	198b87503f	row_cache: allow external updater to decouple preparation from execution External updater may do some preparatory work like constructing a new sstable list, and at the end atomically replace the old list by the new one. Decoupling the preparation from execution will give us the following benefits: - the preparation step can now yield if needed to avoid reactor stalls, as it's been futurized. - the execution step will now be able to provide strong exception guarantees, as it's now decoupled from the preparation step which can be non-exception-safe. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-12-28 13:17:45 -03:00
Avi Kivity	3325960486	Update seastar submodule * seastar 1f5e3d3419...d1b5d41b6d (1): > append_challenged_posix_file_impl: adjust sloppy_size only in optimize_queue Fixes #7439 (the coredump part).	2020-12-28 13:00:04 +02:00
Nadav Har'El	7eda6b1e90	cql-pytest: increase default request timeout The CQL tests in test/cql-pytest use the Python CQL driver's default timeout for execute(), which is 10 seconds. This is usually more than enough. However, in extreme cases like the one noted in issue #7838, 10 seconds may not be enough. In that issue, we run a very slow debug build on a very slow test machine, and encounter a very slow request (a DROP KEYSPACE that needs to drop multiple tables). So this patch increases the default timeout to an even larger 120 seconds. We don't care that this timeout is ridiculously large - under normal operations it will never be reached; there is no code which loops for this amount of time, for example. Tested that this patch fixes #7838 by choosing a much lower timeout (1 second) and reproducing test failures caused by timeouts. Fixes #7838. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>	2020-12-28 11:19:37 +02:00
Amos Kong	8723b0ce86	leveled_compaction_strategy: fix boundary of maximum sstable level The MAX_LEVELS is the levels count, but sstable level (index) starts from 0. So the maximum and valid level is MAX_LEVELS - 1. Signed-off-by: Amos Kong <amos@scylladb.com> Closes #7833	2020-12-27 18:59:54 +02:00
Benny Halevy	8a745a0ee0	compaction: compaction_writer: destroy shared_sstable after the sstable_writer sstable_writer may depend on the sstable throughout its whole lifecycle. If the sstable is freed before the sstable_writer we might hit use-after-free as in the follwing case: ``` std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214 sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519 seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:? (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432 seastar::output_stream<char>::flush() at table.cc:? seastar::output_stream<char>::close() at table.cc:? 
sstables::file_writer::close() at sstables.cc:? sstables::mc::writer::~writer() at writer.cc:? (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790 sstables::mc::writer::~writer() at writer.cc:? flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:? (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>) const at /usr/include/c++/10/bits/unique_ptr.h:85 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, 
std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201 auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, 
noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612 ``` What happens here is that: compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc) : _out(std::move(out)) , _compression_metadata(cm) , _offsets(_compression_metadata->offsets.get_writer()) , _compression(lc) , _full_checksum(ChecksumType::init_checksum()) _compression_metadata points to a buffer held by the sstable object. and _compression_metadata->offsets.get_writer returns a writer that keeps a reference to the segmented_offsets in the sstables::compression that is used in the ~writer -> close path. Fixes #7821 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>	2020-12-27 17:02:13 +02:00
Pavel Emelyanov	387889315e	mutation-partition: Relax putting a dummy entry into a continuous range When applying a mutation partition to another, if a dummy entry from the source falls into a destination continuous range, it can be just dropped. However, the current implementation still inserts it and then instantly removes it. Relax this code-flow by dropping the unwanted entry without tossing it. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20201224130438.11389-1-xemul@scylladb.com>	2020-12-27 14:47:32 +02:00
Amos Kong	9adc6f68ee	scylla_setup: cleanup if judgments This patch merged two nested if judgments. Signed-off-by: Amos Kong <amos@scylladb.com>	2020-12-26 04:45:25 +08:00
Amos Kong	632b01ce4e	scylla_setup: enable node_exporter for offline installation node_exporter had been added to scylla-server package by commit `95197a09c9`. So we can enable it by default for offline installation. Signed-off-by: Amos Kong <amos@scylladb.com>	2020-12-25 10:54:31 +08:00
Pavel Emelyanov	72c2482f73	mutation-partition: Construct rows_entry directly from clustering_row When a rows_entry is added to row_cache it's constructed from clustering_row by unpacking all its internals and putting them into the rows_entry's deletable_row. There's a shorter way -- the clustering_row already has the deletable_row onboard, from which rows_entry can copy-construct its own. This keeps the rows_entry and deletable_row sets of constructors a bit shorter. tests: unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20201224161112.20394-1-xemul@scylladb.com>	2020-12-24 18:13:44 +02:00
Avi Kivity	8f06a687b4	Merge "idl: minor improvements to idl compiler" from Pavel S " This series does a lot of cleanups, dead code removal, and most importantly fixes the following things in IDL compiler tool: * The grammar now rejects invalid identifiers; previously it was possible, in some cases, to write things like `std:vector`. * Error reporting is improved significantly and failures are now pointing to the place of failure much more accurately. This is done by restricting rule backtracking on those rules which don't need it. " * 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla: idl: move enum and class serializer code writers to the corresponding AST classes idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums idl: minor fixes and code simplification idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn idl: fix parsing of basic types and discard unneeded terminals idl: remove unused functions idl: improve error tracing in the grammar and tighten-up some grammar rules idl: remove redundant `set_namespace` function idl: remove unused `declare_class` function idl: slightly change `str` and `repr` for AST types idl: place directly executed init code into if __name__=="__main__"	2020-12-24 15:14:09 +02:00
Takuya ASADA	95197a09c9	dist: add node_exporter to scylla-server package To support connection-less environments, we need to add the node_exporter binary to the scylla-server package instead of downloading it from the internet. Related #7765 Fixes #2190 Closes #7796	2020-12-24 11:44:13 +02:00
Pavel Solodovnikov	219ac2bab5	large_data_handler: fix segmentation fault when constructing `data_value` from a `nullptr` It turns out that `cql_table_large_data_handler::record_large_rows` and `cql_table_large_data_handler::record_large_cells` were broken for reporting static cells and static rows from the very beginning: In case a large static cell or a large static row is encountered, it tries to execute `db::try_record` with `nullptr` additional values, denoting that there is no clustering key to be recorded. These values are next passed to `qctx.execute_cql()`, which creates `data_value` instances for each statement parameter, hence invoking `data_value(nullptr)`. This uses `const char*` overload which delegates to `std::string_view` ctor overload. It is UB to pass `nullptr` pointer to `std::string_view` ctor. Hence leading to segmentation faults in the aforementioned large data reporting code. What we want here is to make a null `data_value` instead, so just add an overload specifically for `std::nullptr_t`, which will create a null `data_value` with `text` type. A regression test is provided for the issue (written in `cql-pytest` framework). Tests: test/cql-pytest/test_large_cells_rows.py Fixes: #6780 Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>	2020-12-24 11:37:43 +02:00
Nadav Har'El	79faaa34c7	alternator test: confirm that list index can't be a reference In Alternator's expression parser in alternator/expressions.g, a list can be indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference for the index, e.g., "something[:xyz]", should also work. So this patch adds a test that checks whether "something[:xyz]" works, and confirms that both DynamoDB and Alternator don't accept it and consider it a syntax error. So Alternator's parser is correct to insist that the index be a literal integer. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>	2020-12-24 11:37:29 +02:00
Piotr Sarna	b62457d5b0	test: add verification to using timeout prepared statements Previously the test cases only verified that the queries did not time out with sufficiently large timeout, but now they also check that appropriate data is inserted and can be read. Message-Id: <8bc979434fce977c30d8516dc82789d4fe317696.1608734455.git.sarna@scylladb.com>	2020-12-24 11:37:29 +02:00
Piotr Sarna	1577e6f632	test: add cases for using timeout with batches The test suite for USING TIMEOUT already included SELECT, INSERT and UPDATE statements, but missed batches. The suite is now updated to include batch tests. Tests: unit(dev) Message-Id: <a6738d2ed3d62681615523d01109362766c90325.1608734455.git.sarna@scylladb.com>	2020-12-24 11:37:29 +02:00
Piotr Sarna	4eb41b7d56	test: use random keys in tests for USING TIMEOUT Since the tables are written to and it's possible to run multiple test cases concurrently, the cases now use pseudorandom keys instead of hardcoded values. Message-Id: <d864dbb096360c17cdc2ebd8e79bfd983c19910e.1608734455.git.sarna@scylladb.com>	2020-12-24 11:37:29 +02:00
Avi Kivity	0bbd78037f	Update seastar submodule * seastar 2bd8c8d088...1f5e3d3419 (5): > Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E > doc: add a coroutines section to the tutorial > Merge "tests/perf: add random-seed config option" from Benny > iotune: Print parameters affecting the measurement results > cook: Add patch cmd for ragel build (signed char confusion on aarch64)	2020-12-24 11:37:29 +02:00
Piotr Sarna	3b26fc01c2	alternator: coroutinize untagging a resource Historically, a seastar thread was used for this request because it's not on a critical path, but a coroutine makes the code simpler.	2020-12-23 15:53:57 +01:00
Piotr Sarna	1ca39cc8c1	alternator: coroutinize tagging a resource Historically, a seastar thread was used for this request because it's not on a critical path, but a coroutine makes the code simpler.	2020-12-23 15:53:57 +01:00
Pavel Solodovnikov	3a91f1127d	idl: move enum and class serializer code writers to the corresponding AST classes Expand the role of AST classes to also supply methods for actually generating the code. More changes will follow eventually until all generation code is handled by these classes. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-22 23:23:12 +03:00
dgarcia360	fd5f0c3034	docs: add organization Closes #7818	2020-12-22 15:33:31 +02:00
Pekka Enberg	ceb67e7728	dist/docker: Remove 'epel-release' from Docker image We no longer need the 'epel-release' package for anything as our scylla-server package bundles all the necessary dependencies. Closes #7823	2020-12-22 14:55:17 +02:00
Avi Kivity	e2dfa24540	Merge "token_metadata: add clear_gently" from Benny " We've encountered a number of reactor stalls related to token_metadata that were fixed in `052a8d036d`. This is a follow-up series that adds a clear_gently method to token_metadata that uses continuations to prevent reactor stalls when destroying token_metadata objects. Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug) " * tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla: token_metadata: add clear_gently token_metadata: shared_token_metadata: add mutate_token_metadata token_metdata: futurize update_normal_tokens abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield repair: replace_with_repair: convert to coroutine	2020-12-22 13:23:31 +02:00
Nadav Har'El	f2978e1873	cql-pytest: port Cassandra's collection_test.py A previous patch added test/cql-pytest/cassandra_tests - a framework for porting Cassandra's unit tests to Python - but only ported two tiny test files with just 3 tests. In this patch, we finally port a much larger test file validation/entities/collection_test.java. This file includes 50 separate tests, which cover a lot of aspects of collection support, as well as how other stuff interact with collections. As of now, 23 (!) of these 50 tests fail, and exposed six new issues in Scylla which I carefully documented: Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax Refs #7740: CQL prepared statements incomplete support for "unset" values Refs #7743: Restrictions missing support for "IN" on tables with collections, added in Cassandra 4.0 Refs #7745: Length of map keys and set items are incorrectly limited to 64K in unprepared CQL Refs #7747: Handling of multiple list updates in a single request differs from recent Cassandra Refs #7751: Allow selecting map values and set elements, like in Cassandra 4.0 These issues vary in severity - some are simply new Cassandra 4.0 features that Scylla never implemented, but one (#7740) is an old Cassandra 2.2 feature which it seems we did not implement correctly in some cases that involve collections. Note that there are some things that the ported tests do not include. In a handful of places there are things which the Python driver checks, before sending a request - not giving us an opportunity to check how the server handles such errors. Another notable change in this port is that the original tests repeated a lot of tests with and without a "nodetool flush". In this port I chose to stub the flush() function - it does NOT flush. I think the point of these tests is to check the correctness of the CQL features - not to verify that memtable flush works correctly. 
Doing a real memtable flush is not only slow, it also doesn't really check much (Scylla may still serve data from cache, not sstables). So I decided it is pointless. An important goal of this patch is that all 50 tests (except three skipped tests because Python has client-side checking), pass when run on Cassandra (with test/cql-pytest/run-cassandra). This is very important: It was very easy to make mistakes while porting the tests, and I did make many such mistakes; But running the tests against Cassandra allowed me to fix those mistakes - because the correct tests should pass on Cassandra. And now they do. Unfortunately, the new tests are significantly slower than what we've been accustomed to in Alternator/CQL tests. The 50 tests create more than a hundred tables, udfs, udts, and similar slow operations - they do not reuse anything via fixtures. The total time for these 50 tests (in dev build mode) is around 18 seconds. Just one test - testMapWithLargePartition is responsible for almost half (!) of that time - we should consider in the future whether it's worth it or can be made smaller. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>	2020-12-22 13:22:09 +02:00
Avi Kivity	5a33ce58a7	Update seastar submodule * seastar 3b8903d406...2bd8c8d088 (8): > core: remove unused chrono.h reference > cmake: force cxx standard if dialect is specified > queue: add front() > coroutine: deprecate coroutine forwarding > memory: Use 2^n sizes when searching for preferred span size > shared_ptr: define debug_shared_ptr_counter_type constructor as noexcept > install-dependencies: add pkg-config to Debian/Ubuntu packages > log: do_log: prevent garbling due to context switch	2020-12-22 13:22:09 +02:00
Benny Halevy	322aa2f8b5	token_metadata: add clear_gently clear_gently gently clears the token_metadata members. It uses continuations to allow yielding if needed to prevent reactor stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-22 11:22:21 +02:00
Benny Halevy	56aa49ca81	token_metadata: shared_token_metadata: add mutate_token_metadata mutate_token_metadata acquires the shared_token_metadata lock, clones the token_metadata (using clone_async) and calls an asynchronous functor on the cloned copy of the token_metadata to mutate it. If the functor is successful, the mutated clone is set back to the shared_token_metadata; otherwise, the clone is destroyed. With that, get rid of shared_token_metadata::clone Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-22 11:22:19 +02:00
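The lock, clone, mutate, publish-on-success sequence described in the entry above can be sketched in a few lines. This is only an illustration in Python with asyncio, not the Scylla C++ code: `SharedState`, `mutate`, and the dictionary payload are hypothetical stand-ins, and `copy.deepcopy` merely plays the role of `clone_async`.

```python
import asyncio
import copy


class SharedState:
    """Hypothetical stand-in for shared_token_metadata."""

    def __init__(self, value):
        self._value = value
        self._lock = asyncio.Lock()

    def get(self):
        return self._value

    async def mutate(self, func):
        # Serialize mutators, like acquiring the shared_token_metadata lock.
        async with self._lock:
            clone = copy.deepcopy(self._value)  # stands in for clone_async
            await func(clone)   # if this raises, the clone is simply dropped
            self._value = clone  # publish the mutated clone only on success


async def demo():
    shared = SharedState({"tokens": [1, 2]})

    async def add_token(tm):
        tm["tokens"].append(3)

    async def failing(tm):
        tm["tokens"].clear()
        raise RuntimeError("mutation failed")

    await shared.mutate(add_token)
    try:
        await shared.mutate(failing)
    except RuntimeError:
        pass  # the failed mutation must leave the shared value untouched
    return shared.get()


result = asyncio.run(demo())
```

The failing functor clears its private clone before raising, yet the shared value still holds the three tokens, which is the property the commit message describes.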
Benny Halevy	e089c22ec1	token_metdata: futurize update_normal_tokens The function complexity is O(#tokens) in the worst case, as for each endpoint token it traverses _token_to_endpoint_map linearly to erase the endpoint mapping if it exists. This change renames the current implementation of update_normal_tokens to update_normal_tokens_sync and clones the code as a coroutine that returns a future and may yield if needed. Eventually we should futurize the whole token_metadata and abstract_replication_strategy interface and get rid of the synchronous functions. Until then the sync version is still required from call sites that are neither returning a future nor run in a seastar thread. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-22 10:35:15 +02:00
Benny Halevy	e7f4cd89a9	abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield Optimize the can_yield case by invoking the futurized version of clone_only_token_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-22 09:49:08 +02:00
Benny Halevy	55316df6bf	repair: replace_with_repair: convert to coroutine Prepare for futurizing update_normal_tokens. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-22 09:49:08 +02:00
Piotr Sarna	da7e87dc56	test: add cases for using timeout with bind markers The test suite for USING TIMEOUT already included binding the timeout value, but only for wildcard (?). The test case is now extended with named bind markers. Tests: unit(dev) Message-Id: <b5344f40d26d90b36e90a04c2474127728535eaa.1608573624.git.sarna@scylladb.com>	2020-12-22 09:03:56 +02:00
Pekka Enberg	961b9e8390	install.sh: Add seastar-cpu-map.sh to $PATH Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to create symbolic links to scripts so that they appear in $PATH. Please note that there are additional Python scripts (like perftune.py), which are not in $PATH. That's because Python scripts are handled separately in "install.sh" and no Python script has a "sbin" symlink. We might want to change this in the future, though. Fixes #6731 Closes #7809	2020-12-21 14:12:27 +02:00
Avi Kivity	0f7b6dd180	utils: managed_bytes: introduce with_linearized() This is a temporary scaffold for weaning ourselves off linearization. It differs from with_linearized_managed_bytes in that it does not rely on the environment (linearization_context) and so is easier to remove.	2020-12-20 15:14:44 +01:00
Avi Kivity	c37e495958	utils: managed_bytes: constrain with_linearized_managed_bytes() The passed function must be called with no parameters; document and enforce it.	2020-12-20 15:14:44 +01:00
Avi Kivity	a1df1b3c34	utils: managed_bytes: avoid internal uses of managed_bytes::data() We use managed_bytes::data() in a few places when we know the data is non-fragmented (such as when the small buffer optimization is in use). We'd like to remove managed_bytes::data() as linearization is bad, so in preparation for that, replace internal uses of data() with the equivalent direct access.	2020-12-20 15:14:44 +01:00
Avi Kivity	72a2554a86	utils: managed_bytes: extract do_linearize_pure() do_linearize() is an impure function as it changes state in linearization_context. Extract the pure parts into a new do_linearize_pure(). This will be used to linearize managed_bytes without a linearization_context, during the transition period where fragmented and non-fragmented values coexist.	2020-12-20 15:14:44 +01:00
Avi Kivity	4b3f0fd7c0	thrift: do not depend on implicit conversion of keys to bytes_view This implicit conversion will soon be gone, as it is dangerous. Ask for the representation explicitly.	2020-12-20 15:14:44 +01:00
Avi Kivity	8521248955	clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view This implicit conversion will soon be gone, as it is dangerous. Ask for the representation explicitly.	2020-12-20 15:14:44 +01:00
Avi Kivity	1dd6d7029a	cql3: expression: linearize get_value_from_mutation() earlier do_get_value() is careful to return a fragmented view, but its only caller get_value_from_mutation() linearizes it immediately afterwards. Linearize it sooner; this prevents mixing in fragmented values from cells (now via IMR) and fragmented values from partition/clustering keys. It only works now because keys are not fragmented outside LSA, and value_view has a special case for single-fragment values. This helps when keys become fragmented.	2020-12-20 15:14:44 +01:00
Avi Kivity	b59a21967c	bytes: add to_bytes(bytes) Converting from bytes to bytes is nonsensical, but it helps when transitioning to other types (managed_bytes/managed_bytes_view), and these types will have to_bytes() conversions.	2020-12-20 15:14:44 +01:00
Avi Kivity	28126257c2	cql3: expression: mark do_get_value() as static It is used only later in this file.	2020-12-20 15:14:44 +01:00
Avi Kivity	b3e39d81aa	Merge 'Avoid scanning sstables in parallel for TWCS single-partition queries' from Kamil Braun We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`. The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently). The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader. The following experiment was performed: 1. create a TWCS table with 1 minute windows 2. fill the table with 8 equal windows of data (each window flushed to a separate sstable) 3. perform `select * from ks.t where pk = 0 limit 1` query with and without the change The expectation is that with the commit, only one sstable will be opened to fetch that one row; without the commit all 8 sstables would be opened at once. The difference in the value of `scylla_reactor_aio_bytes_read` was measured (value after the query minus value before the query), both with and without the commit. With the commit, the difference was 67584. Without the commit, the difference was 528384. 528384 / 67584 ~= 7.8. Fixes #6418. 
Closes #7437 * github.com:scylladb/scylla: sstables: gather clustering key filtering statistics in TWCS single key reader sstables: use time_series_sstable_set in time_window_compaction_strategy sstable_set: new reader for TWCS single partition queries mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set sstable_set: introduce min_position_reader_queue sstable_set: introduce time_series_sstable_set sstables: add min_position and max_position accessors sstable_set: make create_single_key_sstable_reader a virtual method clustering_order_reader_merger: fix the 0 readers case	2020-12-19 23:53:18 +02:00
Kamil Braun	53414558a1	sstables: gather clustering key filtering statistics in TWCS single key reader	2020-12-18 16:33:27 +01:00
Kamil Braun	4f2d45001c	sstables: use time_series_sstable_set in time_window_compaction_strategy The following experiment was performed: 1. create a TWCS table with 1 minute windows 2. fill the table with 8 windows of data (each window flushed to a separate sstable) 3. perform `select * from ks.t where pk = 0 limit 1` query with and without the change The expectation is that with the commit, only one sstable will be opened to fetch that one row; without the commit all 8 sstables would be opened at once. The difference in the value of `scylla_reactor_aio_bytes_read` was measured (value after the query minus value before the query), both with and without the commit. With the commit, the difference was 67584. Without the commit, the difference was 528384. 528384 / 67584 ~= 7.8. Fixes https://github.com/scylladb/scylla/issues/6418.	2020-12-18 16:33:27 +01:00
Kamil Braun	f0842ba34e	sstable_set: new reader for TWCS single partition queries This commit introduces a new implementation of `create_single_key_sstable_reader` in `time_series_sstable_set` dedicated for TWCS-created sstables. It uses the fact that such sstables are mostly disjoint with respect to contained `position_in_partition`s in order to decrease the number of sstable readers that are opened at the same time. The implementation uses `clustering_order_reader_merger` under the hood. The reader assumes that the schema does not have static columns and none of the queried sstables contain partition tombstones; also, it assumes that the sstables have the min/max clustering key metadata in order for the implementation to be efficient. Thus, if we detect that some of these assumptions aren't true, we fall back to the old implementation.	2020-12-18 16:33:27 +01:00
Kamil Braun	b41139a07f	mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set	2020-12-18 16:33:27 +01:00
Kamil Braun	d0548aa77f	sstable_set: introduce min_position_reader_queue This is a queue of readers of sstables in a time_series_sstable_set, returning the readers in order of the smallest position_in_partition that the sstables have. It uses the min/max clustering key sstable metadata. The readers are opened lazily, at the moment of being returned.	2020-12-18 16:33:27 +01:00
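The idea of handing out readers in order of each sstable's smallest position, and opening them only at that moment, can be illustrated with a small priority queue. This is a rough Python sketch under stated assumptions, not the C++ `min_position_reader_queue`: the class name `LazyReaderQueue`, the `pop_up_to` method, the integer positions (standing in for `position_in_partition`), and the string "readers" are all hypothetical.

```python
import heapq


class LazyReaderQueue:
    """Hypothetical queue that yields readers ordered by sstable min position."""

    def __init__(self, sstables):
        # sstables: iterable of (min_position, name); idx breaks ties stably.
        self._heap = [(min_pos, idx, name)
                      for idx, (min_pos, name) in enumerate(sstables)]
        heapq.heapify(self._heap)
        self.opened = []  # names of sstables whose reader was actually opened

    def pop_up_to(self, bound):
        """Open and return readers for sstables whose min position <= bound."""
        readers = []
        while self._heap and self._heap[0][0] <= bound:
            _, _, name = heapq.heappop(self._heap)
            self.opened.append(name)          # the "open" happens only here
            readers.append(f"reader({name})")
        return readers


queue = LazyReaderQueue([(30, "sst-c"), (10, "sst-a"), (20, "sst-b")])
first = queue.pop_up_to(10)   # only sst-a starts at or before position 10
second = queue.pop_up_to(25)  # sst-b becomes relevant; sst-c stays unopened
```

A query that stops early (for example with `limit 1`) never reaches position 30 here, so the reader for `sst-c` is never opened.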
Kamil Braun	52697022b0	sstable_set: introduce time_series_sstable_set At this moment it is a slightly less efficient version of bag_sstable_set, but in following commits we will use the new data structures to gain advantage in single partition queries for sstables created by TimeWindowCompactionStrategy.	2020-12-18 16:33:27 +01:00
Kamil Braun	2a160dd909	sstables: add min_position and max_position accessors The methods return a lower-bound and an upper-bound for the position-in-partitions appearing in a given sstable.	2020-12-18 16:33:27 +01:00
Kamil Braun	fe26da82ba	sstable_set: make create_single_key_sstable_reader a virtual method ... of sstable_set_impl. Soon we shall provide a specialized implementation in one of the `sstable_set_impl` derived classes. The existing implementation is used as the default one.	2020-12-18 12:31:16 +01:00
Kamil Braun	5e846b33b8	clustering_order_reader_merger: fix the 0 readers case With 0 readers the merger would produce a `partition_end` fragment when it should immediately return `end_of_stream` instead.	2020-12-18 12:30:40 +01:00
Gleb Natapov	85cffd1aeb	lwt: rewrite storage_proxy::cas using coroutines Makes code much simpler to understand. Message-Id: <20201201160213.GW1655743@scylladb.com>	2020-12-17 18:15:35 +01:00
Avi Kivity	a60c81b615	Merge 'cql3: Fix handling of impossible restrictions on a primary-key column' from Dejan Mircevski There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0): 1. When the column is indexed, Scylla crashed (#7772) 2. Computing ranges and slices was throwing an exception This series fixes them both; it also happens to resolve some old TODOs from restriction_test. Tests: unit (dev, debug) Closes #7804 * github.com:scylladb/scylla: cql3: Fix value_for when restriction is impossible cql3: Fix range computation for p=1 AND p=1	2020-12-17 12:01:36 +02:00
Dejan Mircevski	46b4b59945	cql3: Fix value_for when restriction is impossible Previously, single_column_restrictions::value_for() assumed that a column's restriction specifies exactly one value for the column. But since `37ebe521e3`, multiple equalities on the same column are allowed, so the restriction could be a conjunction of conflicting equalities (eg, c=1 AND c=0). That violates an assert and crashes Scylla. This patch fixes value_for() by gracefully handling the impossible-restriction case. Fixes #7772 Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-12-16 15:00:29 -05:00
Dejan Mircevski	4bb1107652	cql3: Fix range computation for p=1 AND p=1 Previously compute_bounds was assuming that primary-key columns are restricted by exactly one equality, resulting in the following error: query 'select p from t where p=1 and p=1' failed: std::bad_variant_access (std::get: wrong index for variant) This patch removes that assumption and deals correctly with the multiple-equalities case. As a byproduct, it also stops raising "invalid null value" exceptions for null RHS values. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-12-16 14:46:48 -05:00
Pavel Solodovnikov	edf9ccee48	idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums Split `write`, `read` and `skip` serializer function writers to separate functions in `handle_class` and `handle_enum` functions, which slightly improves readability. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 20:33:55 +03:00
Pavel Solodovnikov	8049cb0f91	idl: minor fixes and code simplification * Introduce `ns_qualified_name` and `template_params_str` functions to simplify code a little bit in `handle_enum` and `handle_class` functions. * Previously each serializer had a separate namespace open-close statements, unify them into a single namespace scope. * Fix a few more `hout` -> `cout` argument names. * Rename `template` pattern to `template_decl` to improve clarity. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:32:08 +03:00
Pavel Solodovnikov	0de96426db	idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn Prior to the patch all functions that are called from `add_visitors` and this function itself declared the argument denoting the output file as `hout`. Though, this was quite misleading since `hout` is meant to be header file with declarations, while `cout` is an implementation file. These functions write to implementation file hence `hout` should be changed to `cout` to avoid confusion. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:32:03 +03:00
Pavel Solodovnikov	0defb52855	idl: fix parsing of basic types and discard unneeded terminals Prior to the patch `btype` production was using `with_colon` rule, which accidentally supported parsing both numbers and identifiers (along with other invalid inputs, such as "123asd"). It was changed to use `ns_qualified_ident` and those places which can accept numeric constants, are explicitly listing it as an alternative, e.g. template parameter list. Unfortunately, I had to make TemplateType to explicitly construct `BasicType` instances from numeric constants in template arguments list. This is exactly the way it was handled before, though. But nonetheless, this should be addressed sometime later. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:57 +03:00
Pavel Solodovnikov	0cc87ead3d	idl: remove unused functions Remove the following functions since they are not used: * `open_namespaces` * `close_namespaces` * `flat_template` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:51 +03:00
Pavel Solodovnikov	bea965a0a7	idl: improve error tracing in the grammar and tighten-up some grammar rules This patch replaces use of some handwritten rules to use their alternatives already defined in `pyparsing.pyparsing_common` class, i.e.: `number`, `identifier` productions. Changed ignore patterns for comments to use pre-defined `pp.cppStyleComment` instead of hand-written combination of '//'-style and C-style comment rules. Operator '-' is now used whenever possible to improve debugging experience: it disables default backtracking for productions so that compiler fails earlier and can now point more precisely to a place in the input string where it failed instead of backtracking to the top-level rule and reporting error there. Template names and class names now use `ns_qualified_ident` rule instead of `with_colon` which prevents grammar from matching invalid identifiers, such as `std:vector`. Many places are using the updated `identifier` production, which is working correctly unlike its predecessor: now inputs such as `1ident` are considered invalid. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:46 +03:00
Pavel Solodovnikov	3a037bc5b6	idl: remove redundant `set_namespace` function Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:40 +03:00
Pavel Solodovnikov	e76e8aec0e	idl: remove unused `declare_class` function Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:35 +03:00
Pavel Solodovnikov	745f4ac23b	idl: slightly change `str` and `repr` for AST types Surround string representation with angle brackets. This improves readability when printing debug output. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:31:20 +03:00
Pavel Solodovnikov	4a61270701	idl: place directly executed init code into if __name__=="__main__" Since idl compiler is not intended to be used as a module to other python build scripts, move initialization code under an if checking that current module name is "__main__". Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 19:30:33 +03:00
Gleb Natapov	37368726c9	migration_manager: remove unused announce() variant Message-Id: <20201216153150.GG3244976@scylladb.com>	2020-12-16 18:14:07 +02:00
Konstantin Osipov	2c46938c2a	commitlog: avoid a syscall in a most common case of segment recycle When recycling a segment in O_DSYNC mode if the size of the segment is neither shrunk nor grown, avoid calling file::truncate() or file::allocate(). Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>	2020-12-16 14:57:36 +02:00
Avi Kivity	fdb47c954d	Merge "idl: allow IDL compiler to parse `const` specifiers for template arguments" from Pavel S " This patch series consists of the following patches: 1. The first one turned out to be a massive rewrite of almost everything in `idl-compiler.py`. It aims to decouple parser structures from the internal representation which is used in the code-generation itself. Prior to the patch everything was working with raw token lists and the code was extremely fragile and hard to understand and modify. Moreover, every change in the parser code caused a cascade effect of breaking things at many different places, since they were relying on the exact format of output produced by parsing rules. Now there is a bunch of supplementary AST structures which provide hierarchical and strongly typed structure as the output of parsing routine. It is much easier to verify (by the means of `isinstance`, for example) and extend since the internal structures used in code-generation are decoupled from the structure of parsing rules, which are now controlled by custom parse actions providing high-level abstractions. It is tested manually by checking that the old code produces exactly the same autogenerated sources for all Scylla IDLs as the new one. 2 and 3. Cosmetics changes only: fixed a few typos and moved from old-fashioned `string.Template` to python f-strings. This improves readability of the idl-compiler code by a lot. Only one non-functional whitespace change introduced. 4. This patch adds a very basic support for the parser to understand `const` specifier in case it's used with a template parameter for a data member in a class, e.g. struct my_struct { std::vector<const raft::log_entry> entries; }; It actually does two things: * Adjusts `static_asserts` in corresponding serializer methods to match const-ness of fields. * Defines a second serializer specialization for const type in `.dist.hh` right next to non-const one. 
This seems to be sufficient for raft-related uses for now. Please note there is no support for the following cases, though: const std::vector<raft::log_entry> entries; const raft::term_t term; None of the existing IDLs are affected by the change, so that we can gradually improve on the feature and write the idl unit-tests to increase test coverage with time. 5. A basic unit-test that writes a test struct with an `std::vector<S<const T>>` field and reads it back to verify that serialization works correctly. 6. Basic documentation for AST classes. TODO: should also update the docs in `docs/IDL.md`. But it is already quite outdated, and some changes would even be out of scope for this patch set. " * 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla: idl: add docstrings for AST classes idl: add unit-test for `const` specifiers feature idl: allow to parse `const` specifiers for template arguments idl: fix a few typos in idl-compiler idl: switch from `string.Template` to python f-strings and format string in idl-compiler idl: Decouple idl-compiler data structures from grammar structure	2020-12-16 14:05:33 +02:00
Gleb Natapov	61520a33d6	mutation_writer: pass exceptions through feed_writer feed_writer() eats exception and transforms it into an end of stream instead. Downstream validators hate when this happens. Fixes #7482 Message-Id: <20201216090038.GB3244976@scylladb.com>	2020-12-16 13:18:19 +02:00
Pavel Solodovnikov	8b8dce15c3	idl: add docstrings for AST classes Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-16 09:03:39 +03:00
Botond Dénes	978ec7a4bb	tools: introduce scylla-sstable-index A tool which lists all partitions contained in an sstable index. As all partitions in an sstable are indexed, this tool can be used to find out what partitions are contained in a given sstable. The printout has the following format: $pos: $human_readable_value (pk{$raw_hex_value}) Where: * $pos: the position of the partition in the (decompressed) data file * $human_readable_value: the human readable partition key * $raw_hex_value: the raw hexadecimal value of the binary representation of the partition key For now the tool requires the types making up the partition key to be specified on the command line, using the `--type\|-t` command line argument, using the Cassandra type class name notation for types. As these are not assumed to be widely known, this patch includes a document mapping all cql3 types to their Cassandra type class name equivalent (but not just). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>	2020-12-15 18:46:47 +02:00
Calle Wilund	71c5dc82df	database: Verify iff we actually are writing memtables to disk in truncate Fixes #7732 When truncating with auto_snapshot on, we try to verify the low rp mark from the CF against the sstables discarded by the truncation timestamp. However, in a scenario like: Fill memtables Flush Truncate with snapshot A Fill memtables some more Truncate Move snapshot A to upload + refresh (load old tables) Truncate The last op will assert, because while we have sstables loaded, which will be discarded now, we did not in fact generate any _new_ ones (since memtables are empty), and the RP we get back from discard is one from an earlier generation set. (Any permutation of events that create the situation "empty memtable" + "non-empty sstables with only old tables" will generate the same error). Added a check that before flushing checks if we actually have any data, and if not, does not uphold the RP relation assert. Closes #7799	2020-12-15 16:24:36 +02:00
Avi Kivity	7636799b18	Merge 'Add waiting for flushes on table drops' from Piotr Sarna This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish. Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race: 1. Run a node with `auto_snapshot=false` 2. Schedule a memtable flush (e.g. via nodetool) 3. Get preempted in the middle of the flush 4. Drop the table 5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied) Fixes #7792 Closes #7798 * github.com:scylladb/scylla: database: add flushes to waiting for pending operations table: unify waiting for pending operations database: add a phaser for flush operations database: add waiting for pending streams on table drop	2020-12-15 16:02:47 +02:00
Pavel Solodovnikov	1e6df841a5	idl: add unit-test for `const` specifiers feature Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-15 16:03:18 +03:00
Pavel Solodovnikov	facf27dbe4	idl: allow to parse `const` specifiers for template arguments This patch introduces very limited support for declaring `const` template parameters in data members. It's not covering all the cases, e.g. `const type member_variable` and `const template_def<T1, T2, ...>` syntax is not supported at the moment. Though the changes are enough for raft-related use: this makes it possible to declare `std::vector<raft::log_entries_ptr>` (aka `std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL. Existing IDL files are not affected in any way. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-15 16:03:11 +03:00
Pavel Solodovnikov	f02703fcd7	idl: fix a few typos in idl-compiler Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-15 16:02:55 +03:00
Pavel Solodovnikov	28b602833f	idl: switch from `string.Template` to python f-strings and format string in idl-compiler Move to a modern and lightweight syntax of f-strings introduced in python 3.6. It improves readability and provides greater flexibility. A few places are now using format strings instead, though. In case when multiline substitution variable is used, the template string should be first re-indented and only after that the formatting should be applied, or we can end up with broken indentation in the generated sources. This change introduces one invisible whitespace change in `query.dist.impl.hh`, otherwise all generated code is exactly the same. Tests: build(dev) and diff generated IDL sources by hand Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-15 16:01:17 +03:00
Pavel Solodovnikov	4ab1f7f55d	idl: Decouple idl-compiler data structures from grammar structure Instead of operating on the raw lists of tokens, transform them into typed structures representation, which makes the code by many orders of magnitude simpler to read, understand and extend. This includes sweeping changes throughout the whole source code of the tool, because almost every function was tightly coupled to the way data was passed down from the parser right to the code generation routines. Tested manually by checking that old generated sources are precisely the same as the new generated sources. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2020-12-15 15:59:17 +03:00
Piotr Sarna	b1208d0fcc	database: add flushes to waiting for pending operations In order to prevent races with table drops, the helper function which waits for all pending operations to finish now also waits for pending flushes.	2020-12-15 13:11:33 +01:00
Piotr Sarna	cd1e351dc1	table: unify waiting for pending operations In order to reduce code duplication which already caused a bug, waiting for pending operations is now unified with a single helper function.	2020-12-15 13:11:25 +01:00
Piotr Sarna	df3204426d	database: add a phaser for flush operations Pending flushes can participate in races when a table with auto_snapshot==false is dropped. The race is as follows: 1. A flush of table T is initiated 2. The flush operation is preempted 3. Table T is dropped without flushing, because it has auto_snapshot off 4. The flush operation from (2.) wakes up and continues working on table T, which is already dropped 5. Segfault/memory corruption To prevent such races, a phaser for pending flushes is introduced	2020-12-15 12:59:36 +01:00
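The race in the entry above, and the way waiting on pending flushes closes it, can be reproduced in miniature. The following is a simplified Python asyncio sketch, not Scylla's phaser: the `Phaser` class here is a plain in-flight counter with hypothetical `start`, `finish`, and `wait_idle` methods, and `asyncio.sleep` stands in for the flush being preempted.

```python
import asyncio


class Phaser:
    """Simplified counter of in-flight operations (hypothetical API)."""

    def __init__(self):
        self._pending = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight initially

    def start(self):
        self._pending += 1
        self._idle.clear()

    def finish(self):
        self._pending -= 1
        if self._pending == 0:
            self._idle.set()

    async def wait_idle(self):
        await self._idle.wait()


async def demo():
    events = []
    flushes = Phaser()

    async def flush():
        flushes.start()
        try:
            await asyncio.sleep(0.01)  # simulated preemption mid-flush
            events.append("flush done")
        finally:
            flushes.finish()

    async def drop_table():
        await flushes.wait_idle()  # do not free the table under a flush
        events.append("table dropped")

    task = asyncio.create_task(flush())
    await asyncio.sleep(0)  # let the flush start before the drop is issued
    await drop_table()
    await task
    return events


events = asyncio.run(demo())
```

Without the `wait_idle()` call, the drop would be recorded while the flush is still suspended, which corresponds to step 4 of the race.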
Piotr Sarna	57d63ca036	database: add waiting for pending streams on table drop We already wait for pending reads and writes, so for completeness we should also wait for all pending stream operations to finish before dropping the table to avoid inconsistencies.	2020-12-15 12:55:45 +01:00
Takuya ASADA	ebc4076fa5	tools: toolchain: add node_exporter Download node_exporter in frozen image to prepare adding node_exporter to relocatable package. Related #2190 Closes #7765 [avi: updated toolchain, x86_64/aarch64/s390x]	2020-12-14 20:34:17 +02:00
Piotr Sarna	13317f7698	alternator: ensure correct isolation level in tracing tests Taking advantage of the fact that isolation level can be defined for a table with a tag, the tracing test that relies on CAS can now be sure to have a correct isolation level. Message-Id: <43f005ab9d566c7d3d55ce93c553127b1df9e87f.1607954739.git.sarna@scylladb.com>	2020-12-14 17:37:55 +02:00
Piotr Sarna	7081e361cc	test: add isolation level requirement message to tracing tests Alternator tracing tests require the cluster to have the 'always' isolation level configured to work properly. If that's not the case, the tests will fail due to not having CAS-related traces present in the logs. In order to help the users fix their configuration, a helper message is printed before the test case is performed. Automatic tests do not need this, because they are all run with matching isolation level, but this message could greatly improve the user experience for manual tests. Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>	2020-12-14 14:53:58 +02:00
Piotr Sarna	4b0303d8ae	tests: make alternator tracing tests idempotent The outcome of alternator tracing tests was that tracing probability was always set to 0 after the test was finished. That makes sense for most test runs, but manual tests can work on existing clusters with tracing probability set to some other value. To preserve the previous trace probability, the value is now extracted and stored, so that it can be restored after the test is done. Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>	2020-12-14 14:53:23 +02:00
Avi Kivity	19ff528ef3	Update seastar submodule * seastar 2de43eb6bf...3b8903d406 (3): > coroutines: check preemption flag in co_await > memory: consider span freelist objects in small pool diagnostics > util: noncopyable_function: avoid gcc uninitialized error in move constructor	2020-12-14 12:50:32 +02:00
Pekka Enberg	8d00c16feb	transport/server: Code cleanups Fix up some coding style issues spotted while reading the code: - Fix indentation to be 4 spaces - Remove superfluous semicolons Closes #7793	2020-12-14 12:48:05 +02:00
Konstantin Osipov	b6c6cc275f	commitlog: align input of dma_write() during segment recycle Normally a file size should be aligned around block size, since we never write to it any unaligned size. However, we're not protected against partial writes. Just to be safe, align up the amount of bytes to zerofill when recycling a segment. Message-Id: <20201211142628.608269-4-kostja@scylladb.com>	2020-12-14 12:16:18 +02:00
Konstantin Osipov	ad6817bcde	commitlog: fix typo in a comment Message-Id: <20201211142628.608269-2-kostja@scylladb.com>	2020-12-14 12:16:14 +02:00
Benny Halevy	0e79e0f215	test: mutation_diff: extend section markers When the different mutations are printed via BOOST_REQUIRE_EQUAL, we don't get the "expect {} but got {}" section markers. Instead, the parts we're interested in are bracketed like "critical check X == Y has failed [{} != {}]" Test: with both formats: - https://github.com/scylladb/scylla/files/3890627/test_concurrent_reads_and_eviction.log - https://github.com/scylladb/scylla/files/4303117/flat_mutation_reader_test.118.log - https://github.com/scylladb/scylla/files/5687372/flat_mutation_reader_test.172.log.gz Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201214100521.3814909-1-bhalevy@scylladb.com>	2020-12-14 12:11:34 +02:00
Nadav Har'El	72cb3e9255	alternator test: add missing wait for update_table to finish Three tests in test_streams.py run update_table() on a table without waiting for it to complete, and then call update_table() on the same table or delete it. This always works in Scylla, and usually works in AWS, but if we reach the second call, it may fail because the previous update_table() did not take effect yet. We sometimes see these failures when running the Alternator test suite against AWS. So in this patch, after each update_table() we wait for the table to return from UPDATING to ACTIVE status. The entire Alternator test suite now passes (or is skipped) on AWS, so: Fixes #7778. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>	2020-12-14 09:18:38 +01:00
Nadav Har'El	43ce0aef3d	alternator test: fix test wrongly failing on AWS The test test_query_filter.py::test_query_filter_paging fails on AWS and shouldn't fail, so this patch fixes the test. Note that this is only a test problem - no fix is needed for Alternator itself. The test reads 20 results with 1-result pages, and assumed that 21 pages are returned. The 21st page may happen because when the server returns the 20th, it might not yet know there will be no additional results, so another page is needed - and will be empty. Still a different implementation might notice that the last page completed the iteration, and not return an extra empty page. This is perfectly fine, and this is what AWS DynamoDB does today - and should not be considered an error. Refs #7778 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>	2020-12-14 09:18:31 +01:00
Nadav Har'El	4ab98a4c68	alternator: use a more specific error when Authorization header is missing When request signature checking is enabled in Alternator, each request should come with the appropriate Authorization header. Most errors in preparing this header will result in an InvalidSignatureException response, but DynamoDB returns a more specific error when this header is completely missing: MissingAuthenticationTokenException. We should do the same, but before this patch we return InvalidSignatureException also for a missing header. The test test_authorization.py::test_no_authorization_header used to enshrine our wrong error message, and failed when run against AWS. After this patch, we fix the error message and the test - which now passes against both Alternator and AWS. Refs #7778. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>	2020-12-14 09:18:24 +01:00
Avi Kivity	39afe14ad4	Merge 'Add per query timeout' from Piotr Sarna This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker. The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues. The series comes with a pytest test suite. Examples: ```cql SELECT * FROM t USING TIMEOUT 200ms; ``` ```cql INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms; ``` Working with prepared statements works as usual - the timeout parameter can be explicitly defined or provided as a marker: ```cql SELECT * FROM t USING TIMEOUT ?; ``` ```cql INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms; ``` Tests: unit(dev) Fixes #7777 Closes #7781 * github.com:scylladb/scylla: test: add prepared statement tests to USING TIMEOUT suite docs: add an entry about USING TIMEOUT test: add a test suite for USING TIMEOUT storage_proxy: start propagating local timeouts as timeouts cql3: allow USING clause for SELECT statement cql3: add TIMEOUT attribute to the parser cql3: add per-query timeout to select statement cql3: add per-query timeout to batch statement cql3: add per-query timeout to modification statement cql3: add timeout to cql attributes	2020-12-14 09:46:46 +02:00
Piotr Sarna	d6e7e36280	test: add prepared statement tests to USING TIMEOUT suite	2020-12-14 07:50:40 +01:00
Piotr Sarna	da77ab832b	docs: add an entry about USING TIMEOUT The paragraph describes how USING TIMEOUT clause can be used along with some simple examples.	2020-12-14 07:50:40 +01:00
Piotr Sarna	0148b41a02	test: add a test suite for USING TIMEOUT The test suite is based on cql-pytest and checks if USING TIMEOUT works as expected.	2020-12-14 07:50:40 +01:00
Piotr Sarna	27fba35832	storage_proxy: start propagating local timeouts as timeouts A local timeout was previously propagated to the client as WriteFailure, while there exists a more concrete error type for that: WriteTimeout.	2020-12-14 07:50:40 +01:00
Piotr Sarna	ddd9cb1b2a	cql3: allow USING clause for SELECT statement In order to be able to specify a timeout for SELECT statements, it's now possible to use the USING clause with it.	2020-12-14 07:50:40 +01:00
Piotr Sarna	d3896a209b	cql3: add TIMEOUT attribute to the parser It's now possible to specify TIMEOUT as part of the USING clause.	2020-12-14 07:50:40 +01:00
Piotr Sarna	157be33b89	cql3: add per-query timeout to select statement First of all, select statement is extended with an 'attrs' field, which keeps the per-query attributes. Currently, only TIMEOUT parameter is legal to use, since TIMESTAMP and TTL bear no meaning for reads. Secondly, if TIMEOUT attribute is set, it will be used as the effective timeout for a particular query.	2020-12-14 07:50:40 +01:00
Piotr Sarna	20dedd0df7	cql3: add per-query timeout to batch statement If TIMEOUT attribute is set, it will be used as the effective timeout for a particular query.	2020-12-14 07:50:40 +01:00
Piotr Sarna	3c49b6bd88	cql3: add per-query timeout to modification statement If TIMEOUT attribute is set, it will be used as the effective timeout for a particular query.	2020-12-14 07:50:40 +01:00
Piotr Sarna	5bbd0b049b	cql3: add timeout to cql attributes This attribute will be used later to specify per-query timeout.	2020-12-14 07:50:40 +01:00
Benny Halevy	c60da2e90d	cdc: remove _token_metadata from db_context 1. It's unused since `cbe510d1b8` 2. It's unsafe to keep a reference to token_metadata& potentially across yield points. The higher-level motivation is to make storage_service::get_token_metadata() private so we can control better how it's used. For cdc, if the token_metadata is going to be needed in the future, it'd be better to get it from db_context::_proxy.get_token_metadata_ptr(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>	2020-12-13 18:32:17 +02:00
Avi Kivity	0f967f911d	Merge "storage_service: get_token_metadata_ptr to hold on to token_metadata" from Benny " This series fixes use-after-free via token_metadata& We may currently get a token_metadata& via get_token_metadata() and use it across yield points in a couple of sites: - do_decommission_removenode_with_repair - get_new_source_ranges To fix that, get_token_metadata_ptr and hold on to it across yielding. Fixes #7790 Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug) Test: unit(dev) " * tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla: storage_service: get_new_source_ranges: don't hold token_metadata& across yield point storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield	2020-12-13 17:37:24 +02:00
Aleksandr Bykov	e74dc311e7	dist: scylla_util: fix aws_instance.ebs_disks method aws_instance.ebs_disks() method should return ebs disk instead of ephemeral Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com> Closes #7780	2020-12-13 17:33:37 +02:00
Benny Halevy	1fbc831dae	storage_service: get_new_source_ranges: don't hold token_metadata& across yield point Provide the token_metadata& to get_new_source_ranges by the caller, who keeps it valid throughout the call. Note that there is no need to clone_only_token_map since the token_metadata_ptr is immutable and can be used just as well for calling strat.get_range_addresses. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-13 16:42:00 +02:00
Benny Halevy	f13913d251	storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range Now that we pass can_yield::yes to calculate_natural_endpoints for each token_range. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-13 16:42:00 +02:00
Benny Halevy	89ed0705e8	storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner No need to hold on to the shared token_metadata_ptr after we got clone_after_all_left(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-13 16:42:00 +02:00
Benny Halevy	684c4143df	storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield When yielding in clone_only_token_map or clone_after_all_left the token_metadata got with get_token_metadata() may go away. Use get_token_metadata_ptr() instead to hold on to it. And with that, we don't need to clone_only_token_map. `metadata` is not modified by calculate_natural_endpoints, so we can just refer to the immutable copy retrieved with get_token_metadata_ptr. Fixes #7790 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-13 16:41:58 +02:00
Avi Kivity	65a0244614	Update tools/jmx submodule * tools/jmx 6174a47...20469bf (1): > column_family: Return proper cardinality for toppartitions requests	2020-12-13 13:51:38 +02:00
Avi Kivity	9265b87610	Merge "Remove get_local_storage_proxy from validation" from Pavel E " The validate_column_family() helper uses the global proxy reference to get database from. Fortunately, all the callers of it can provide one via argument. tests: unit(dev) " * 'br-no-proxy-in-validate' of https://github.com/xemul/scylla: validation: Remove get_local_storage_proxy call client_state: Call validate_column_family() with database arg client_state: Add database& arg to has_column_family_access storage_proxy: Add .local_db() getters validate: Mark database argument const	2020-12-13 13:12:57 +02:00
Avi Kivity	19aaf8eb83	Merge "Remove global storage service from index manager" from Pavel E " The initial intent was to remove call for global storage service from secondary index manager's create_view_for_index(), but while fixing it one of intermediate schema table's helper managed to benefit from it by re-using the database reference flying by. The cleanup is done by simply pushing the database reference along the stack from the code that already has it down the create_view_for_index(). tests: unit(dev) " * 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla: schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations schema-tables: Add database argument to make_update_table_mutations schema-tables: Factor out calls getting database instance index-manager: Move feature evaluation one level up	2020-12-13 12:41:51 +02:00
Benny Halevy	aae3991246	repair: do_decommission_removenode_with_repair: don't deref ops when null `ops` might be passed as a disengaged shared_ptr when called from `decommission_with_repair`. In this case we need to propagate to sync_data_using_repair a disengaged std::optional<utils::UUID>. Fixes #7788 DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>	2020-12-13 12:37:18 +02:00
Avi Kivity	18be57a4e5	Update seastar submodule * seastar 8b400c7b45...2de43eb6bf (3): > core: show span free sizes correctly in diagnostics > Merge "IO queues to share capacities" from Pavel E > file: make_file_impl: determine blockdev using st_mode	2020-12-12 21:57:01 +02:00
Pekka Enberg	c990f2bd34	Merge 'Reinstate [[nodiscard]] support' from Avi Kivity The switch to clang disabled the clang-specific -Wunused-value since it generated some harmless warnings. Unfortunately, that also prevented [[nodiscard]] violations from warning. Fix by clearing all instances of the warning (including [[nodiscard]] violations that crept in while it was disabled) and reinstating the warning. Closes #7767 * github.com:scylladb/scylla: build: reinstate -Wunused-value warning for [[nodiscard]] test: lib: don't ignore future in compare_readers() test: mutation_test: check both ranges when comparing summaries serialializer: silence unused value warning in variant deserializer	2020-12-12 09:54:05 +02:00
Avi Kivity	615b8e8184	dist: rpm: uninstall tuned when installing scylla-kernel-conf tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns and other sysctl tunables that we so laboriously tuned, dropping performance by a factor of 5 (due to increased latency). Fix by obsoleting tuned during install (in effect, we are a better tuned, at least for us). Not needed for .deb, since Debian/Ubuntu do not install tuned by default. Fixes #7696 Closes #7776	2020-12-12 09:54:05 +02:00
Pavel Emelyanov	3a025cfa52	schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations Two halves of the tunnel finally connect -- the latter helper needs the local database instance and is only called by the former one which already has it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 21:23:53 +03:00
Pavel Emelyanov	89fd524c5a	schema-tables: Add database argument to make_update_table_mutations There are 3 callers of this helper (cdc, migration manager and tests) and all of them already have the database object at hands. The argument will be used by next patch to remove call for global storage proxy instance from make_update_indices_mutations. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 21:21:22 +03:00
Pavel Emelyanov	1bcef04c7a	schema-tables: Factor out calls getting database instance The make_update_indices_mutations gets database instance for two things -- to find the cf to work with and to get the value of a feature for index view creation. To suit both and to remove calls for global storage proxy and service instances get the database once in the function entrance. Next patch will clean this further. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 21:17:11 +03:00
Pavel Emelyanov	6dd10e771d	index-manager: Move feature evaluation one level up The create_view_for_index needs to know the state of the correct-idx-token-in-secondary-index feature. To get one it takes quite a long route through global storage service instance. Since there's only one caller of the method in question, and the method is called in a loop, it's a bit faster to get the feature value in caller and pass it in argument. This will also help to get rid of the call for global storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 21:14:12 +03:00
Pavel Emelyanov	3a3ee45488	size_estimate_reader: Use local db reference not global The get_next_partition uses global proxy instance to get the local database reference. Now it's available in the reader object itself, so it's possible to remove this call for global storage proxy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 20:38:21 +03:00
Pavel Emelyanov	107dcbfbd6	size_estimate_reader: Keep database reference on mutation reader This reader uses the local database instance in its get_next_partition method to find keyspaces to work with Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 20:34:54 +03:00
Pavel Emelyanov	48e494fb62	size_estimate_reader: Keep database reference on virtual_reader The database will be then used to create the mutation reader Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 20:31:35 +03:00
Pavel Emelyanov	83073f4e8b	validation: Remove get_local_storage_proxy call It is used in validate_column_family. The last caller of it was removed by previous patch, so we may kill the helper itself Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 18:52:42 +03:00
Pavel Emelyanov	12cc539835	client_state: Call validate_column_family() with database arg The previous patch brought the database reference arg. And since the currently called validate_column_family() overload _just_ gets the database from global proxy, it's better to shortcut. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 18:50:49 +03:00
Pavel Emelyanov	b0c4a9087d	client_state: Add database& arg to has_column_family_access It is called from cql3/statements' check_access methods and from thrift handlers. The former have proxy argument from which they can get the database. The latter already have the database itself on board. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 18:49:16 +03:00
Pavel Emelyanov	4c7bc8a3d1	storage_proxy: Add .local_db() getters To facilitate the next patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 18:48:02 +03:00
Avi Kivity	a11ecfe231	Merge 'types: don't linearize in validate()' from Michał Chojnowski A sequel to #7692. This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing). The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`. Refs: #6138 Closes #7770 * github.com:scylladb/scylla: types: add single-fragment optimization in validate() utils: fragment_range: add with_simplified() cql3: statements: select_statement: remove unnecessary use of with_linearized cql3: maps: remove unnecessary use of with_linearized cql3: lists: remove unnecessary use of with_linearized cql3: tuples: remove unnecessary use of with_linearized cql3: sets: remove unnecessary use of with_linearized cql3: tuples: remove unnecessary use of with_linearized cql3: attributes: remove unnecessary uses of with_linearized types: validate lists without linearizing types: validate tuples without linearizing types: validate sets without linearizing types: validate maps without linearizing types: template abstract_type::validate on FragmentedView types: validate_visitor: transition from FragmentRange to FragmentedView utils: fragmented_temporary_buffer: add empty() to FragmentedView utils: fragmented_temporary_buffer: don't add to null pointer	2020-12-11 17:33:59 +02:00
Pavel Emelyanov	563b466227	validate: Mark database argument const They are indeed used like that Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-11 18:27:45 +03:00
Michał Chojnowski	150473f074	types: add single-fragment optimization in validate() Manipulating fragmented views is costlier than manipulating contiguous views, so let's detect the common situation when the fragmented view is actually contiguous underneath, and make use of that. Note: this optimization is only useful for big types. For trivial types, validation usually only checks the size of the view.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	e2d17879fc	utils: fragment_range: add with_simplified() Reading from contiguous memory (bytes_view) is significantly simpler runtime-wise than reading from a fragmented view, due to less state and less branching, so we often want to convert a fragmented view to a simple view before processing it, if the fragmented view contains at most one fragment, which is common. with_simplified() does just that.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	51ca5fa4c5	cql3: statements: select_statement: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	72186bee69	cql3: maps: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	3f3a10c588	cql3: lists: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	efa036329d	cql3: tuples: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	4f359a7a99	cql3: sets: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	281417917b	cql3: tuples: remove unnecessary use of with_linearized We can validate directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	d1d1a00311	cql3: attributes: remove unnecessary uses of with_linearized We can validate and deserialize directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	0581b3ff31	types: validate lists without linearizing We can validate collections directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	4fe41b69fd	types: validate tuples without linearizing We can validate tuples directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	a7dd736d03	types: validate sets without linearizing We can validate collections directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	1459608375	types: validate maps without linearizing We can validate collections directly from fragmented buffers now.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	82befbe8c0	types: template abstract_type::validate on FragmentedView This is primarily a stylistic change. It makes the interface more consistent with deserialize(). It will also allow us to call `validate()` for collection elements in `validate_aux()`.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	15dbe00e8a	types: validate_visitor: transition from FragmentRange to FragmentedView This will allow us to easily get rid of linearizations when validating collections and tuples, because the helpers used in validate_aux() already have FragmentedView overloads.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	3647c0ba47	utils: fragmented_temporary_buffer: add empty() to FragmentedView It's redundant with size_bytes(), but sometimes empty() is more readable and reduces churn when replacing other types with FragmentedView.	2020-12-11 09:53:07 +01:00
Michał Chojnowski	b4dd5d3bdb	utils: fragmented_temporary_buffer: don't add to null pointer When fragmented_temporary_buffer::view is created from a bytes_view, _current is null. In that case, in remove_current(), null pointer offset happens, and ubsan complains. Fix that.	2020-12-11 09:53:07 +01:00
Raphael S. Carvalho	e4b55f40f3	sstables: Fix sstable reshaping for STCS The heuristic of STCS reshape is correct, and it built the compaction descriptor correctly, but forgot to return it to the caller, so no reshape was ever done on behalf of STCS even when the strategy needed it. Fixes #7774. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>	2020-12-10 12:45:25 +02:00
Asias He	829b4c1438	repair: Make removenode safe by default Currently removenode works like below: - The coordinator node advertises the node to be removed in REMOVING_TOKEN status in gossip - Existing nodes learn the node in REMOVING_TOKEN status - Existing nodes sync data for the range it owns - Existing nodes send notification to the coordinator - The coordinator node waits for notification and announce the node in REMOVED_TOKEN Current problems: - Existing nodes do not tell the coordinator if the data sync is ok or failed. - The coordinator can not abort the removenode operation in case of error - Failed removenode operation will make the node to be removed in REMOVING_TOKEN forever. - The removenode runs in best effort mode which may cause data consistency issues. It means if a node that owns the range after the removenode operation is down during the operation, the removenode node operation will continue to succeed without requiring that node to perform data syncing. This can cause data consistency issues. For example, Five nodes in the cluster, RF = 3, for a range, n1, n2, n3 is the old replicas, n2 is being removed, after the removenode operation, the new replicas are n1, n5, n3. If n3 is down during the removenode operation, only n1 will be used to sync data with the new owner n5. This will break QUORUM read consistency if n1 happens to miss some writes. Improvements in this patch: - This patch makes the removenode safe by default. We require all nodes in the cluster to participate in the removenode operation and sync data if needed. We fail the removenode operation if any of them is down or fails. If the user want the removenode operation to succeed even if some of the nodes are not available, the user has to explicitly pass a list of nodes that can be skipped for the operation. 
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id> Example restful api: $ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5" - The coordinator can abort data sync on existing nodes For example, if one of the nodes fails to sync data. It makes no sense for other nodes to continue to sync data because the whole operation will fail anyway. - The coordinator can decide which nodes to ignore and pass the decision to other nodes Previously, there is no way for the coordinator to tell existing nodes to run in strict mode or best effort mode. Users will have to modify config file or run a restful api cmd on all the nodes to select strict or best effort mode. With this patch, the cluster wide configuration is eliminated. Fixes #7359 Closes #7626	2020-12-10 10:14:39 +02:00
Piotr Sarna	20bdeb315a	Merge ' types: add constraint on lexicographical_tri_compare()' from Avi Kivity Verify that the input types are iterators and their value types are compatible with the compare function. Because some of the inputs were not actually valid iterators, they are adjusted too. Closes #7631 * github.com:scylladb/scylla: types: add constraint on lexicographical_tri_compare() composite: make composite::iterator a real input_iterator compound: make compount_type::iterator a real input_iterator	2020-12-09 18:48:01 +01:00
Nadav Har'El	a8fdbf31cd	alternator: fix UpdateItem ADD for non-existent attribute UpdateItem's "ADD" operation usually adds elements to an existing set or adds a number to an existing counter. But it can also be used to create a new set or counter (as if adding to an empty set or zero). We unfortunately did not have a test for this case (creating a new set or counter), and when I wrote such a test now, I discovered the implementation was missing. So this patch adds both the test and the implementation. The new test used to fail before this patch, and passes with it - and passes on DynamoDB. Note that we only had this bug for the newer UpdateItem syntax. For the old AttributeUpdates syntax, we already support ADD actions on missing attributes, and already tested it in test_update_item_add(). I just forgot to test the same thing for the newer syntax, so I missed this bug :-( Fixes #7763. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>	2020-12-09 18:44:30 +01:00
Juliusz Stasiewicz	b150906d39	gossip: Added SNITCH_NAME to `application_state` Snitch name needs to be exchanged within cluster once, on shadow round, so joining nodes cannot use wrong snitch. The snitch names are compared on bootstrap and on normal node start. If the cluster already used mixed snitches, the upgrade to this version will fail. In this case customer needs to add a node with correct snitch for every node with the wrong snitch, then put down the nodes with the wrong snitch and only then do the upgrade. Fixes #6832 Closes #7739	2020-12-09 15:45:25 +02:00
Nadav Har'El	781f9d9aca	alternator: make default timeout configurable Whereas in CQL the client can pass a timeout parameter to the server, in the DynamoDB API there is no such feature. The server needs to choose reasonable timeouts for its own internal operations - e.g., writes to disk, querying other replicas, etc. Until now, Alternator had a fixed timeout of 10 seconds for its requests. This choice was reasonable - it is much higher than we expect during normal operations, and still lower than the client-side timeouts that some DynamoDB libraries have (boto3 has a one-minute timeout). However, there's nothing holy about this number of 10 seconds, and some installations might want to change this default. So this patch adds a configuration option, "--alternator-timeout-in-ms", to choose this timeout. As before, it defaults to 10 seconds (10,000ms). In particular, some test runs are unusually slow - consider for example testing a debug build (which is already very slow) in an extremely over-committed test host. In some cases (see issue #7706) we noticed the 10 second timeout was not enough. So in this patch we increase the default timeout chosen in the "test/alternator/run" script to 30 seconds. Please note that as the code is structured today, this timeout only applies to some operations, such as GetItem, UpdateItem or Scan, but does not apply to CreateTable, for example. This is a pre-existing issue that this patch does not change. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>	2020-12-09 14:30:43 +01:00
Avi Kivity	f802356572	Revert "Revert "Merge "raft: fix replication if existing log on leader" from Gleb"" This reverts commit `dc77d128e9`. It was reverted due to a strange and unexplained diff, which is now explained. The HEAD on the working directory being pulled from was set back, so git thought it was merging the intended commits, plus all the work that was committed from HEAD to master. So it is safe to restore it.	2020-12-08 19:19:55 +02:00
Avi Kivity	1badd315ef	Merge "Speed up devel tests 10 times" from Pavel E " The multishard_mutation_query test is toooo slow when built with clang in dev mode. By reducing the number of scans it's possible to shrink the full suite run time from half an hour down to ~3 minutes. tests: unit(dev) " * 'br-devel-mode-tests' of https://github.com/xemul/scylla: test: Make multishard_mutation_query test do less scans configure: Add -DDEVEL to dev build flags	2020-12-08 15:42:12 +02:00
Pavel Emelyanov	b837cf25b1	test: Make multishard_mutation_query test do less scans When built by clang this dev-mode test takes ~30 minutes to complete. Let's reduce this time by reducing the scale of the test if DEVEL is set. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-08 15:55:04 +03:00
Pavel Emelyanov	703451311f	configure: Add -DDEVEL to dev build flags To let source code tell debug, dev and release builds from each other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-08 15:54:30 +03:00
Avi Kivity	461c9826de	Merge 'scylla_setup: fix wrong command suggestion' from Takuya ASADA scylla_setup command suggestion does not show an argument of --io-setup, because we mistakenly store a bool value on it (recognized as 'store_true'). We always need to print '--io-setup X' on the suggestion instead. Also, --nic is currently ignored by command suggestion, need to print it just like other options. Related #7395 Closes #7724 * github.com:scylladb/scylla: scylla_setup: print --swap-directory and --swap-size on command suggestion scylla_setup: print --nic on command suggestion scylla_setup: fix wrong command suggestion on --io-setup scylla_setup command suggestion does not show an argument of --io-setup, because we mistakenly store a bool value on it (recognized as 'store_true'). We always need to print '--io-setup X' on the suggestion instead.	2020-12-08 13:58:55 +02:00
Avi Kivity	98271a5c57	Merge 'types: don't linearize in serialize_for_cql()' from Michał Chojnowski A sequel to #7692. This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type. The only user of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template. Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`. Refs: #6138 Closes #7771 * github.com:scylladb/scylla: types: switch serialize_for_cql from bytes to bytes_ostream types: switch serialize_for_cql_aux from bytes to bytes_ostream types: serialize user types to bytes_ostream types: serialize lists to bytes_ostream types: serialize sets to bytes_ostream types: serialize maps to bytes_ostream utils: fragment_range: use range-based for loop instead of boost::for_each types: add write_collection_value() overload for bytes_ostream and value_view	2020-12-08 12:38:36 +02:00
Lubos Kosco	a0b1474bba	scylla_util.py: Increase disk to ram ratio for GCP Increase the accepted disk-to-RAM ratio to 105 to accommodate even 7.5GB of RAM for one NVMe. Log various reasons for not recommending the instance type. Fixes #7587 Closes #7600	2020-12-08 11:20:30 +02:00
Piotr Wojtczak	c09ab3b869	api: Add cardinality to toppartitions results This change enhances the toppartitions api to also return the cardinality of the read and write sample sets. It now uses the size() method of space_saving_top_k class, counting the unique operations in the sampled set for up to the given capacity. Fixes #4089 Closes #7766	2020-12-08 09:38:59 +01:00
Nadav Har'El	86779664f4	alternator: fix broken Scan/Query paging with bytes keys When an Alternator table has partition keys or sort keys of type "bytes" (blobs), a Scan or Query which required paging used to fail - we used an incorrect function to output LastEvaluatedKey (which tells the user where to continue at the next page), and this incorrect function was correct for strings and numbers - but NOT for bytes (for bytes, we need to encode them as base-64). This patch also includes two tests - for bytes partition key and for bytes sort key - that failed before this patch and now pass. The test test_fetch_from_system_tables also used to fail after a Limit was added to it, because one of the tables it scans had a bytes key. That test is also fixed by this patch. Fixes #7768 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>	2020-12-08 09:38:23 +01:00
Eliran Sinvani	70770ff7fa	debian pkg: Make deb packages explicitly depend on versioned components Up until now, the dependency versions of Scylla's debian packages were unspecified. This was due to a technical difficulty in determining the versions of the packages depended upon (such as scylla-python3 or scylla-jmx). Now that those packages are also built as part of this repo, with a version identical to the server package itself, we can make all of our packages depend on explicit versions. The motivation for this change is that if a user tries to install a specific Scylla version by installing a specific meta package, it will silently drag in the latest components instead of the ones of the requested version. The expected change in behavior is that after this change an attempt to install a metapackage with a version which is not the latest will fail with an explicit error hinting to the user which other packages of the same version should be explicitly included in the command line. Fixes #5514 Closes #7727	2020-12-07 18:58:15 +02:00
Michał Chojnowski	d43fd456cd	types: switch serialize_for_cql from bytes to bytes_ostream Now we can serialize collections from collection_mutation_view_description without linearizations.	2020-12-07 17:55:36 +01:00
Michał Chojnowski	81a55b032d	types: switch serialize_for_cql_aux from bytes to bytes_ostream We will switch serialize_for_cql itself to bytes_ostream soon.	2020-12-07 17:55:35 +01:00
Michał Chojnowski	71183cf0bd	types: serialize user types to bytes_ostream Avoids linearization by serializing to a fragmented type. It's still linearized at the very end, this will be changed in the near future.	2020-12-07 17:52:06 +01:00
Michał Chojnowski	41b889d0c8	types: serialize lists to bytes_ostream Avoids linearization by serializing to a fragmented type. It's still linearized at the very end, this will be changed in the near future.	2020-12-07 17:49:21 +01:00
Michał Chojnowski	2b3d2c193d	types: serialize sets to bytes_ostream Avoids linearization by serializing to a fragmented type. It's still linearized at the very end, this will be changed in the near future.	2020-12-07 17:47:49 +01:00
Michał Chojnowski	35823d12db	types: serialize maps to bytes_ostream Avoids linearization by serializing to a fragmented type. It's still linearized at the very end, this will be changed in the near future.	2020-12-07 17:47:12 +01:00
Botond Dénes	ba7cf2f5fd	tools/scylla-types: update name in description to use - instead of _ The executable was renamed from using _ to using - at one point but apparently the description wasn't updated. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20201207161626.79013-1-bdenes@scylladb.com>	2020-12-07 18:34:52 +02:00
Avi Kivity	7580a93ec8	build: reinstate -Wunused-value warning for [[nodiscard]] The switch to clang disabled the clang-specific -Wunused-value since it generated some harmless warnings. Unfortunately, that also prevented [[nodiscard]] violations from warning. Fix by reinstating the warning, now that all instances of the warning have been fixed.	2020-12-07 16:51:19 +02:00
Avi Kivity	8fc0bbd487	test: lib: don't ignore future in compare_readers() A fast_forward_to() call is not waited on in compare_readers(). Since this is called in a thread, add a future::get() call to wait for it.	2020-12-07 16:50:20 +02:00
Avi Kivity	732d83dc0e	test: mutation_test: check both ranges when comparing summaries A copy/paste error means we ignore the termination of one of the ranges. Change the comma expression to a disjunction to avoid the unused value warning from clang. The code is not perfect, since if the two ranges are not the same size we'll invoke undefined behavior, but it is no worse than before (where we ignored the comparison completely).	2020-12-07 16:47:52 +02:00
Avi Kivity	fc0a45af5f	serialializer: silence unused value warning in variant deserializer The variant deserializer uses a fold expression to implement an if-tree with a short-circuit, producing an intermediate boolean value to terminate evaluation. This intermediate value is unneeded, but evokes a warning from clang when -Wunused-value is enabled. Since we want to enable the warning, add a cast to void to ignore the intermediate value.	2020-12-07 16:45:20 +02:00
Michał Chojnowski	60a3cecfea	utils: fragment_range: use range-based for loop instead of boost::for_each We want to pass bytes_ostream to this loop in later commits. bytes_ostream does not conform to some boost concepts required by boost::for_each, so let's just use C++'s native loop.	2020-12-07 12:50:36 +01:00
Piotr Sarna	1cc4ed50c1	db: fix getting local ranges for size estimates table When getting local ranges, an assumption is made that if a range does not contain an end or when its end is a maximum token, then it must contain a start. This assumption proved untrue during manual tests, so it's now fortified with an additional check. Here's a gdb output for a set of local ranges which causes an assertion failure when calling `get_local_ranges` on it: (gdb) p ranges $1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = { _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = { _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}} Closes #7764	2020-12-07 12:08:31 +02:00
Takuya ASADA	c3abba1913	scylla_setup: print --swap-directory and --swap-size on command suggestion We need to print --swap-directory and --swap-size on command suggestion just like other options. Related #7395	2020-12-07 18:40:59 +09:00
Takuya ASADA	582a3ffb2f	scylla_setup: print --nic on command suggestion We need to print --nic on command suggestion just like other options. Related #7395	2020-12-07 18:40:59 +09:00
Nadav Har'El	220d6dde17	alternator, test: make test_fetch_from_system_tables faster The test test_fetch_from_system_tables tests Alternator's system-table feature by reading from all system tables. The intention was to confirm we don't crash reading any of them - as they have different schemas and can run into different problems (we had such problems in the initial implementation). The intention was not to read a lot from each table - we only make a single "Scan" call on each, to read one page of data. However, the Scan call did not set a Limit, so the single page can get pretty big. This is not normally a problem, but in extremely slow runs - such as when running the debug build on an extremely overcommitted test machine (e.g., issue #7706) reading this large page may take longer than our default timeout. I'll send a separate patch for the timeout issue, but for now, there is really no reason why we need to read a big page. It is good enough to just read 50 rows (with Limit=50). This will still read all the different types and make the test faster. As an example, in the debug run on my laptop, this test spent 2.4 seconds to read the "compaction_history" table before this patch, and only 0.1 seconds after this patch. 2.4 seconds is close to our default timeout (10 seconds), 0.1 is very far. Fixes #7706 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>	2020-12-07 08:52:31 +01:00
Michał Chojnowski	1fe7490970	types: add write_collection_value() overload for bytes_ostream and value_view We will use it to serialize collections to bytes_ostream in serialize_for_cql().	2020-12-07 08:48:31 +01:00
Nadav Har'El	0cd05dd0fd	cql-pytest: add tests for ALLOW FILTERING The original goal of this patch was to replace the two single-node dtests allow_filtering_test and allow_filtering_secondary_indexes_test, which recently caused us problems when we wanted to change the ALLOW FILTERING behavior but the tests were outside the tree. I'm hoping that after this patch, those two tests could be removed from dtest. But this patch actually tests more cases than those original dtests, and moreover tests not just whether ALLOW FILTERING is required or not, but also that the results of the filtering are correct. Currently, four of the included tests are expected to fail ("xfail") on Scylla, reproducing two issues: 1. Refs #5545: "WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING 2. Refs #7608: "WHERE c=1" on clustering key c should require ALLOW FILTERING, but doesn't. All tests, except the one for issue #5545, pass on Cassandra. That one fails on Cassandra because it doesn't support IN on an indexed column at all (regardless of whether ALLOW FILTERING is used or not). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>	2020-12-06 19:51:25 +02:00
Pavel Solodovnikov	56c0fcfcb2	cql_query_test: handle `bounce_to_shard` msg in `test_null_value_tuple_floating_types_and_uuids` Use `prepared_on_shard` helper function to handle `bounce_to_shard` messages that can happen when using LWT statements. Fixes: #7757 Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201204172944.601730-1-pa.solodovnikov@scylladb.com>	2020-12-06 19:34:13 +02:00
Amos Kong	6b1659ee80	schema.cc/describe: fix invalid compaction options in schema The schema.cql written for a snapshot lacks a comma after the compaction strategy class, so restoring the schema from that file fails. AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'} The map_as_cql_param() function has a `first` parameter to smartly add the comma; the compaction_strategy_options map is never the first. Fixes #7741 Signed-off-by: Amos Kong <amos@scylladb.com> Closes #7734	2020-12-06 17:40:05 +02:00
Avi Kivity	ca950e6f08	Merge "Remove get_local_storage_service() from counters" from Pavel E " The storage service is called there to get the cached value of db::system_keyspace::get_local_host_id(). Keeping the value on database decouples it from storage service and kills one more global storage service reference. tests: unit(dev) " * 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla: counters: Drop call to get_local_storage_service and related counters: Use local id arg in transform_counter_update_to_shards database: Have local id arg in transform_counter_updates_to_shards() storage_service: Keep local host id to database	2020-12-06 16:15:21 +02:00
Avi Kivity	6e460e121a	Merge 'docs: Add Sphinx and ScyllaDB theme' from David Garcia This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla 1. Run the command ``make preview`` from the ``docs`` directory. 2. Check the terminal where you have executed the previous command. It should not raise warnings. 3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages. The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure. Closes #7752 * github.com:scylladb/scylla: docs: fixed warnings docs: added theme	2020-12-06 15:26:57 +02:00
Benny Halevy	64a4ffc579	large_data_handler: do not delete records in the absence of large_data_stats The previous way of deleting records based on the whole sstable data_size causes overzealous deletions (#7668) and inefficiency in the rows cache due to the large number of range tombstones created. Therefore we'd be better off by just letting the records expire using the 30 days TTL. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>	2020-12-06 11:34:37 +02:00
Avi Kivity	dc77d128e9	Revert "Merge "raft: fix replication if existing log on leader" from Gleb" This reverts commit `0aa1f7c70a`, reversing changes made to `72c59e8000`. The diff is strange, including unrelated commits. There is no understanding of the cause, so to be safe, revert and try again.	2020-12-06 11:34:19 +02:00
Pavel Emelyanov	df0e26035f	counters: Drop call to get_local_storage_service and related The local host id is now passed by argument, so we don't need the counter_id::local() and some other methods that call or are called by it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-04 16:31:12 +03:00
Pavel Emelyanov	914613b3c3	counters: Use local id arg in transform_counter_update_to_shards Only a few places in it need the uuid. And since it's only 16 bytes it's possible to safely capture it by value in the called lambdas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-04 16:30:31 +03:00
Pavel Emelyanov	62214e2258	database: Have local id arg in transform_counter_updates_to_shards() There are two places that call it -- database code itself and tests. The former already has the local host id, so just pass one. The latter are a bit trickier. Currently they use the value from storage_service created by storage_service_for_tests, but since this version of service doesn't pass through prepare_to_join() the local_host_id value there is default-initialized, so just default-initialize the needed argument in place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-04 15:09:30 +03:00
Pavel Emelyanov	5a286ee8d4	storage_service: Keep local host id to database The value in question is cached from db::system_keyspace for places that want to have it without waiting for futures. So far the only place is database counters code, so keep the value on database itself. Next patches will make use of it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-12-04 15:09:29 +03:00
Piotr Sarna	2015988373	Merge 'types: get rid of linearization in deserialize()' from Michał Chojnowski Citing #6138: > In the past few years we have converted most of our codebase to work in terms of fragmented buffers, instead of linearised ones, to help avoid large allocations that put large pressure on the memory allocator. > One prominent component that still works exclusively in terms of linearised buffers is the types hierarchy, more specifically the de/serialization code to/from CQL format. Note that for most types, this is the same as our internal format, notable exceptions are non-frozen collections and user types. > > Most types are expected to contain reasonably small values, but texts, blobs and especially collections can get very large. Since the entire hierarchy shares a common interface we can either transition all or none to work with fragmented buffers. This series gets rid of intermediate linearizations in deserialization. The next steps are removing linearizations from serialization, validation and comparison code. Series summary: - Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered while testing. Since it wasn't discovered earlier, I guess it doesn't occur in any code path in master.) - Add a `FragmentedView` concept to allow uniform handling of various types of fragmented buffers (`bytes_view`, `fragmented_temporary_buffer::view`, `ser::buffer_view` and likely `managed_bytes_view` in the future). - Implement `FragmentedView` for relevant fragmented buffer types. - Add helper functions for reading from `FragmentedView`. - Switch `deserialize()` and all its helpers from `bytes_view` to `FragmentedView`. - Remove `with_linearized()` calls which just became unnecessary. - Add an optimization for single-fragment cases. The addition of `FragmentedView` might be controversial, because another concept meant for the same purpose - `FragmentRange` - is already used. Unfortunately, it lacks the functionality we need. 
The main (only?) thing we want to do with a fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no way to do that, because it's immutable by design. We can work around that by wrapping it into a mutable view which will track the offset into the immutable `FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing around as a view - it stores a pair of fragment iterators, a fragment view and a size (11 words) to conform to the iterator-based design of `FragmentRange`, when one fragment iterator (4 words) already contains all needed state, just hidden. I suggest we replace `FragmentRange` with `FragmentedView` (or something similar) altogether. Refs: #6138 Closes #7692 * github.com:scylladb/scylla: types: collection: add an optimization for single-fragment buffers in deserialize types: add an optimization for single-fragment buffers in deserialize cql3: tuples: don't linearize in in_value::from_serialized cql3: expr: expression: replace with_linearize with linearized cql3: constants: remove unneeded uses of with_linearized cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell cql3: lists: remove unneeded use of with_linearized query-result-set: don't linearize in result_set_builder::deserialize types: remove unneeded collection deserialization overloads types: switch collection_type_impl::deserialize from bytes_view to FragmentedView cql3: sets: don't linearize in value::from_serialized cql3: lists: don't linearize in value::from_serialized cql3: maps: don't linearize in value::from_serialized types: remove unused deserialize_aux types: deserialize: don't linearize tuple elements types: deserialize: don't linearize collection elements types: switch deserialize from bytes_view to FragmentedView types: deserialize tuple types from FragmentedView types: deserialize set type from FragmentedView types: deserialize map type from 
FragmentedView types: deserialize list type from FragmentedView types: add FragmentedView versions of read_collection_size and read_collection_value types: deserialize varint type from FragmentedView types: deserialize floating point types from FragmentedView types: deserialize decimal type from FragmentedView types: deserialize duration type from FragmentedView types: deserialize IP address types from FragmentedView types: deserialize uuid types from FragmentedView types: deserialize timestamp type from FragmentedView types: deserialize simple date type from FragmentedView types: deserialize time type from FragmentedView types: deserialize boolean type from FragmentedView types: deserialize integer types from FragmentedView types: deserialize string types from FragmentedView types: remove unused read_simple_opt types: implement read_simple* versions for FragmentedView utils: fragmented_temporary_buffer: implement FragmentedView for view utils: fragment_range: add single_fragmented_view serializer: implement FragmentedView for buffer_view utils: fragment_range: add linearized and with_linearized for FragmentedView utils: fragment_range: add FragmentedView utils: fragmented_temporary_buffer: fix view::remove_prefix	2020-12-04 09:46:20 +01:00
Michał Chojnowski	a1f7fabb3d	types: collection: add an optimization for single-fragment buffers in deserialize Helpers parametrized with single_fragmented_view should compile to better code, so let's use them when possible.	2020-12-04 09:21:05 +01:00
Michał Chojnowski	08c394726e	types: add an optimization for single-fragment buffers in deserialize Values usually come in a single fragment, but we pay the cost of fragmented deserialization nevertheless: bigger view objects (4 words instead of 2 words), more state to keep updated (i.e. total view size in addition to current fragment size), and more branches. This patch adds a special case for single-fragment buffers to abstract_type::deserialize. They are converted to a single_fragmented_view before doing anything else. Templates instantiated with single_fragmented_view should compile to better code than their multi-fragmented counterparts. If abstract_type::deserialize is inlined, this patch should completely prevent any performance penalties for switching from with_linearized to fragmented deserialization.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	f75db1fcf5	cql3: tuples: don't linearize in in_value::from_serialized We can deserialize directly from fragmented buffers now.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	68177a6721	cql3: expr: expression: replace with_linearize with linearized with_linearized creates an additional internal `bytes` when the input is fragmented. linearized copies the data directly to the output `bytes`, so it's more efficient.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	5ffe40d5a2	cql3: constants: remove unneeded uses of with_linearized We can deserialize directly from fragmented buffers now.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	3c98806df9	cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell We can deserialize directly from fragmented buffers now.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	c43ef3951b	cql3: lists: remove unneeded use of with_linearized We can deserialize directly from fragmented buffers now.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	0d5c5b8645	query-result-set: don't linearize in result_set_builder::deserialize We can deserialize directly from fragmented buffers now.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	04786dee30	types: remove unneeded collection deserialization overloads Inherit the method from base class rather than reimplementing it in every child.	2020-12-04 09:19:39 +01:00
Michał Chojnowski	c08419e28d	types: switch collection_type_impl::deserialize from bytes_view to FragmentedView Devirtualizes collection_type_impl::deserialize (so it can be templated) and adds a FragmentedView overload. This will allow us to deserialize collections with explicit cql_serialization_format directly from fragmented buffers.	2020-12-04 09:19:37 +01:00
dgarcia360	1304f6a0bb	docs: fixed warnings	2020-12-03 17:40:34 +01:00
dgarcia360	a340b46a79	docs: added theme	2020-12-03 17:37:18 +01:00
Michał Chojnowski	d731b34d95	cql3: sets: don't linearize in value::from_serialized We can deserialize directly from fragmented buffers now.	2020-12-03 10:57:07 +01:00
Michał Chojnowski	64e64fd2b3	cql3: lists: don't linearize in value::from_serialized We can deserialize directly from fragmented buffers now.	2020-12-03 10:57:07 +01:00
Michał Chojnowski	536a2f8c8d	cql3: maps: don't linearize in value::from_serialized We can deserialize directly from fragmented buffers now.	2020-12-03 10:57:07 +01:00
Michał Chojnowski	58d9f52363	types: remove unused deserialize_aux Dead code.	2020-12-03 10:57:07 +01:00
Michał Chojnowski	8440279130	types: deserialize: don't linearize tuple elements We can deserialize directly from fragmented buffers now.	2020-12-03 10:57:07 +01:00
Michał Chojnowski	a216b0545f	types: deserialize: don't linearize collection elements We can deserialize directly from fragmented buffers now.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	1ccdfc7a90	types: switch deserialize from bytes_view to FragmentedView The final part of the transition of deserialize from bytes_view to FragmentedView. Adds a FragmentedView overload to abstract_type::deserialize and switches deserialize_visitor from bytes_view to FragmentedView, allowing deserialization of all types with no intermediate linearization.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	898cea4cde	types: deserialize tuple types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	507883f808	types: deserialize set type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	9b211a7285	types: deserialize map type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	5f1939554c	types: deserialize list type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	ad7ab73cd0	types: add FragmentedView versions of read_collection_size and read_collection_value We will need those to deserialize collections from FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	495bf5c431	types: deserialize varint type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	0f8ad89740	types: deserialize floating point types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	0bb0291e50	types: deserialize decimal type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	760bc5fd60	types: deserialize duration type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	75a56f439b	types: deserialize IP address types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	9f668929db	types: deserialize uuid types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	3e1a24ca0d	types: deserialize timestamp type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	a4bc43ab19	types: deserialize simple date type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	24bd986aea	types: deserialize time type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	c03ad52513	types: deserialize boolean type from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	2f351928e2	types: deserialize integer types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	28b727082f	types: deserialize string types from FragmentedView A part of the transition of deserialize from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	426308f526	types: remove unused read_simple_opt Dead code.	2020-12-03 10:57:06 +01:00
Michał Chojnowski	e1145fe410	types: implement read_simple* versions for FragmentedView We will need those to switch deserialize() from bytes_view to FragmentedView.	2020-12-03 10:57:06 +01:00
Botond Dénes	71722d8b41	frozen_mutation: add partition context to errors coming from deserializing	2020-12-02 15:08:49 +02:00
Botond Dénes	8d944ff755	partition_builder: accept_row(): use append_clustering_row() The partition builder doesn't expect the looked-up row to exist. In fact it already existing is a sign of a bug. Currently bugs resulting in duplicate rows will manifest by tripping an assert in `row::append_cell()`. This however results in poor diagnostics, so we want to catch these errors sooner to be able to provide higher level diagnostics. To this end, switch to the freshly introduced `append_clustering_row()` so that duplicate rows are found early and in a context where their identity is known.	2020-12-02 15:08:49 +02:00
Botond Dénes	63ea36e277	mutation_partition: add append_clustered_row() A variant of `clustered_row()` which throws if the row already exists, or if any greater row already exists.	2020-12-02 15:08:32 +02:00
Benny Halevy	c7311d1080	docs: sstable-scylla-format: document large_data_type in more details This adds details about large_data_type on top of `ca5184052d` and introduces structured indentation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20201202110539.634880-1-bhalevy@scylladb.com>	2020-12-02 13:25:49 +02:00
Avi Kivity	a95c2a946c	Merge 'mutation_reader: introduce clustering_order_reader_merger' from Kamil Braun This abstraction is used to merge the output of multiple readers, each opened for a single partition query, into a non-decreasing stream of mutation_fragments. It is similar to `mutation_reader_merger`, but an important difference is that the new merger may select new readers in the middle of a partition after it already returned some fragments from that partition. It uses the new `position_reader_queue` abstraction to select new readers. It doesn't support multi-partition (ring range) queries. The new merger will be later used when reading from sstable sets created by TimeWindowCompactionStrategy. This strategy creates many sstables that are mostly disjoint w.r.t the contained clustering keys, so we can delay opening sstable readers when querying a partition until after we have processed all mutation fragments with positions before the keys contained by these sstables. A microbenchmark was added that compares the existing combining reader (which uses `mutation_reader_merger` underneath) with a new combining reader built using the new `clustering_order_reader_merger` and a simple queue of readers that returns readers from some supplied set. The used set of readers is built from the following ranges of keys (each range corresponds to a single reader): `[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`. The microbenchmark runs the reader and divides the result by the number of mutation fragments. 
The results on my laptop were: ``` $ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10 single run iterations: 0 single run duration: 1.000s number of runs: 10 test iterations median mad min max clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns ``` `ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader. Split from https://github.com/scylladb/scylla/pull/7437. Closes #7688 * github.com:scylladb/scylla: tests: mutation_source_test for clustering_order_reader_merger perf: microbenchmark for clustering_order_reader_merger mutation_reader_test: test clustering_order_reader_merger in memory test: generalize `random_subset` and move to header mutation_reader: introduce clustering_order_reader_merger	2020-12-02 12:15:35 +02:00
Kamil Braun	502ed2e9f7	tests: mutation_source_test for clustering_order_reader_merger	2020-12-02 11:13:58 +01:00
Nadav Har'El	fae2ba60e9	cql-pytest: start to port Cassandra's CQL unit tests In issue #7722, it was suggested that we should port Cassandra's CQL unit tests into our own repository, by translating the Java tests into Python using the new cql-pytest framework. Cassandra's CQL unit test framework is orders of magnitude faster than dtest, and in-tree, so Cassandra have been moving many CQL correctness tests there, and we can also benefit from their test cases. In this patch, we take the first step in a long journey: 1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the translated Cassandra tests will reside. The structure of this directory will mirror that of the test/unit/org/apache/cassandra/cql3 directory in the Cassandra repository. pytest conveniently looks for test files recursively, so when all the cql-pytest tests are run, the cassandra_tests files will be run as well. As usual, one can also run only a subset of all the tests, e.g., "test/cql-pytest/run -vs cassandra_tests" runs only the tests in the cassandra_tests subdirectory (and its subdirectories). 2. I translated into Python two of the smallest test files - validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just three test functions. The plan is to translate entire Java test files one by one, and to mirror their original location in our own repository, so it will be easier to remember what we already translated and what remains to be done. 3. I created a small library, porting.py, of functions which resemble the common functions of the Java tests (CQLTester.java). These functions aim to make porting the tests easier. Despite the resemblance, the ported code is not 100% identical (of course) and some effort is still required in this porting. As we continue this porting effort, we'll probably need more of these functions, and can also continue to improve them to reduce the porting effort. Refs #7722.
Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>	2020-12-02 09:29:22 +01:00
Avi Kivity	77466177ab	Merge 'Use large_data_counters in scylla_metadata to decide when to delete large_data records' from Benny Halevy This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable. These are accounted for in the sstable writer whenever the respective large data entry is encountered. It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged. Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds. Fixes #7668 Test: unit(dev) Dtest: wide_rows_test.py (in progress) Closes #7669 * github.com:scylladb/scylla: docs: sstable-scylla-format: add large_data_stats subcomponent large_data_handler: maybe_delete_large_data_entries: use sstable large data stats large_data_handler: maybe_delete_large_data_entries: accept shared_sstable large_data_handler: maybe_delete_large_data_entries: move out of line sstables: load large_data_stats from scylla_metadata sstables: store large_data_stats in scylla_metadata sstables: writer: keep track of large data stats large_data_handler: expose methods to get threshold sstables: kl/writer: never record too many rows large_data_handler: indicate recording of large data entries large_data_handler: move constructor out of line	2020-12-02 10:08:18 +02:00
Nadav Har'El	5c08489569	cql-pytest: don't run tests if Scylla boot timed out In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla. I never expected to reach this timeout - it normally takes (in dev build mode) around 2 seconds, but in one run on Jenkins we did reach it. It turns out that the code did not recognize this timeout correctly: it assumed that Scylla had booted correctly, and then every subtest failed when it could not connect to Scylla. This patch fixes the timeout logic. After the timeout, if Scylla's CQL port is still not responsive, the test run is failed - without trying to run many individual tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>	2020-12-02 08:48:44 +02:00
Kamil Braun	2da723b9c8	cdc: produce postimage when inserting with no regular columns When a row was inserted into a table with no regular columns, and no such row existed in the first place, postimage would not be produced. Fix this. Fixes #7716. Closes #7723	2020-12-01 18:01:23 +02:00
Benny Halevy	ca5184052d	docs: sstable-scylla-format: add large_data_stats subcomponent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	4406a2514e	large_data_handler: maybe_delete_large_data_entries: use sstable large data stats If the sstable has scylla_metadata::large_data_stats use them to determine whether to delete the corresponding large data records. Otherwise, defer to the current method of comparing the sstable data_size to the respective thresholds. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	8cebe7776f	large_data_handler: maybe_delete_large_data_entries: accept shared_sstable Since the actual deletion of the large data entries is done in the background, and we don't capture the shared_sstable, we can safely pass it to maybe_delete_large_data_entries when deleting the sstable in sstable::unlink and it will be released as soon as maybe_delete_large_data_entries returns. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	f7d0ae3d10	large_data_handler: maybe_delete_large_data_entries: move out of line It is called on the cold path, when the sstable is deleted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	be4a58c34c	sstables: load large_data_stats from scylla_metadata Load the large data stats from the scylla_metadata component if they are present. Otherwise, if we're opening a legacy sstable that has no scylla_metadata_type::LargeDataStats, leave sstable::_large_data_stats disengaged. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	92443ed71c	sstables: store large_data_stats in scylla_metadata Store the large data statistics in the scylla_metadata component. These will be retrieved when loading the sstable and be used for determining whether to delete the corresponding large data entries upon sstable deletion. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:42 +02:00
Benny Halevy	79c19a166c	sstables: writer: keep track of large data stats In the next patch, this will be written to the sstable's scylla_metadata component. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:19:41 +02:00
Benny Halevy	8ab053bd44	large_data_handler: expose methods to get threshold To be used for keeping large_data statistics in sstable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Benny Halevy	f1257dfdc0	sstables: kl/writer: never record too many rows rows_count is not tracked prior to the mc format. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Benny Halevy	dd7422a713	large_data_handler: indicate recording of large data entries Return true from the maybe_{record,log}_* methods if a large data record or log entry were emitted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Benny Halevy	873107821b	large_data_handler: move constructor out of line No need for it to be inlined. Also, add debug logging to the large data handler options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-12-01 15:18:14 +02:00
Dejan Mircevski	e45af3b9b8	index: Ensure restriction is supported in find_idx Previously, statement_restrictions::find_idx() would happily return an index for a non-EQ restriction (because it checked only the column name, not the operator). This is incorrect: when the selected index is for a non-EQ restriction, it is impossible to query that index table. Fixes #7659. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #7665	2020-12-01 15:16:48 +02:00
Avi Kivity	df572b41ae	Update seastar submodule * seastar 010fb0df1e...8b400c7b45 (6): > append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size > Merge "Extend per task-queue timing statistics" from Pavel E > tls_test: Create test certs at build time > cook: upgrade hwloc version > memory: rate-limit diagnostics messages > util/log: add rate-limited version of writer version of log()	2020-12-01 15:12:25 +02:00
Tomasz Grabiec	0c5d23d274	thrift: Validate cell names when constructing clustering keys Currently, if the user provides a cell name with too many components, we will accept it and construct an invalid clustering key. This may result in undefined behavior downstream. It was caught by ASAN in a debug build when executing dtest cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with nodetool flush manually added after the write. Triggered during sstable writing to an MC-format sstable: seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577 sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180 To prevent corrupting the state in this way, we should fail early. This patch adds validation which will fail thrift requests which attempt to create invalid clustering keys. Fixes #7568. Example error: Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600 Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>	2020-12-01 15:12:08 +02:00
Avi Kivity	2fd895a367	Merge 'dist/common/scripts/scylla_setup: Optionally config rsyslog destination' from Amnon Heiman This patch adds an option to scylla_setup to configure an rsyslog destination. The monitoring stack has an option to get information from rsyslog; it requires that rsyslog on the Scylla machines send the trace lines to it. The configuration will be in a Scylla configuration file, so it is safe to run it multiple times. Fixes #7589 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #7634 * github.com:scylladb/scylla: dist/common/scripts/scylla_setup: Optionally config rsyslog destination Adding dist/common/scripts/scylla_rsyslog_setup utility	2020-12-01 13:12:32 +02:00
Amnon Heiman	4036cecdea	dist/common/scripts/scylla_setup: Optionally config rsyslog destination This patch adds an option to scylla_setup to configure an rsyslog destination. The monitoring stack has an option to get information from rsyslog; it requires that rsyslog on the Scylla machines send the trace lines to it. If the /etc/rsyslog.d/ directory exists (that means the current system runs rsyslog), it will ask whether to add rsyslog configuration and, if yes, run scylla_rsyslog_setup. Fixes #7589 Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2020-12-01 12:33:37 +02:00
Takuya ASADA	572d6b2a4e	scylla_setup: fix wrong command suggestion on --io-setup The scylla_setup command suggestion does not show the argument of --io-setup, because we mistakenly store a bool value in it (recognized as 'store_true'). We always need to print '--io-setup X' in the suggestion instead. Related #7395	2020-12-01 07:23:55 +09:00
Tomasz Grabiec	f8f81ec322	Merge "raft: various snapshot fixes" from Gleb * scylla-dev/snapshot_fixes_v1: raft: ignore append_reply from a peer in SNAPSHOT state raft: Ignore outdated snapshots raft: set next_idx to correct value after snapshot transfer	2020-11-30 21:34:31 +01:00
Alejo Sanchez	72a64b05ea	raft: replication test: fix total entries for initial snapshot Since now total expected entries are updated by load snapshot, do not trim the total entries expected values with the initial snapshot on test state machine initialization. reported by @gleb Branch URL: https://github.com/alecco/scylla/tree/raft-ale-tests-06-snapshot-total-entries Tests: unit ({dev}), unit ({debug}), unit ({release}) Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20201125171232.321992-1-alejo.sanchez@scylladb.com>	2020-11-30 21:34:31 +01:00
Kamil Braun	af49a95627	perf: microbenchmark for clustering_order_reader_merger	2020-11-30 11:55:44 +01:00
Kamil Braun	4f7e2bf920	mutation_reader_test: test clustering_order_reader_merger in memory	2020-11-30 11:55:44 +01:00
Kamil Braun	b22aa6dbde	test: generalize `random_subset` and move to header	2020-11-30 11:55:44 +01:00
Kamil Braun	0b36c5e116	mutation_reader: introduce clustering_order_reader_merger This abstraction is used to merge the output of multiple readers, each opened for a single partition query, into a non-decreasing stream of mutation_fragments. It is similar to `mutation_reader_merger`, but an important difference is that the new merger may select new readers in the middle of a partition after it already returned some fragments from that partition. It uses the new `position_reader_queue` abstraction to select new readers. It doesn't support multi-partition (ring range) queries. The new merger will be later used when reading from sstable sets created by TimeWindowCompactionStrategy. This strategy creates many sstables that are mostly disjoint w.r.t the contained clustering keys, so we can delay opening sstable readers when querying a partition until after we have processed all mutation fragments with positions before the keys contained by these sstables.	2020-11-30 11:55:44 +01:00
Avi Kivity	ea9c058be3	Merge 'Don't use secondary indices for multi-column restrictions' from Dejan Mircevski Fix #7680 by never using secondary index for multi-column restrictions. Modify expr::is_supported_by() to handle multi-column correctly. Tests: unit (dev) Closes #7699 * github.com:scylladb/scylla: cql3/expr: Clarify multi-column doesn't use indexing cql3: Don't use index for multi-column restrictions test: Add eventually_require_rows	2020-11-30 12:38:26 +02:00
Avi Kivity	12c20c4101	Merge 'test/cql-pytest: tests for string validation (UTF-8 and ASCII)' from Nadav Har'El The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively. The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra. Closes #7719 * github.com:scylladb/scylla: test/cql-pytest: add tests for validation of inserted strings test/cql-pytest: add "scylla_only" fixture test/cpy-pytest: enable experimental features	2020-11-30 12:26:25 +02:00
Piotr Wojtczak	3560acd311	cql_metrics: Add metrics for CQL errors This change adds tracking of all the CQL errors that can be raised in response to a CQL message from a client, as described in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs included. Fixes #5859 Closes #7604	2020-11-30 12:18:37 +02:00
Takuya ASADA	6238d105d9	dist/redhat: drop Conflicts with older kernel We have "Conflicts: kernel < 3.10.0-514" on the rpm package to make sure the environment is running a newer kernel. However, the user may use a non-standard kernel which has a different package name, like kernel-ml or kernel-uek. In such an environment the Conflicts tag does not work correctly. Even if the system is running a newer kernel, rpm only checks the "kernel" package version number. To avoid this issue, we need to drop the Conflicts tag. Fixes #7675	2020-11-30 11:38:42 +02:00
Nadav Har'El	48c78ade33	test/cql-pytest: add tests for validation of inserted strings This patch adds comprehensive cql-pytest tests for checking the validation of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL using several methods - a string can be a string literal as part of the statement, can be encoded as a blob (0x...), or can be a binding parameter for a prepared statement, or returned by user-defined functions - and these tests check all of them. We already have low-level unit tests for UTF-8 parsing in test/boost/utf8_test.cc, but the new tests here confirm that we really call these low-level functions in the correct way. Moreover, since these are CQL tests, they can also be run against Cassandra, and doing that demonstrated that Scylla's UTF-8 parsing is stricter than Cassandra's - Scylla's UTF-8 parser rejects the following sequences which Cassandra's accepts: 1. \xC0\x80 as another non-minimal representation of null. Note that other non-minimal encodings are rejected by Cassandra, as expected. 2. Characters beyond the official Unicode range (or what Scylla considers the end of the range). 3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra accepts them, and Scylla does not. In the future, we should consider whether Scylla is more correct than Cassandra here (so we're fine), or whether compatibility is more important than correctness (so this exposed a bug). The ASCII tests reproduce issue #5421 - that trying to insert a non-ASCII string into an "ascii" column should produce an error on insert - not later when fetching the string. This test now passes, because issue 5421 was already fixed. These tests did not expose any bug in Scylla (other than the differences with Cassandra mentioned above), so all of them pass on Scylla. Two of the tests fail on Cassandra, because Cassandra does not recognize some invalid UTF-8 (according to Scylla's definition) as invalid. Refs #5421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-11-29 17:43:20 +02:00
Dejan Mircevski	5bc7e31284	restrictions: Forbid mixing ck=0 and (ck)=(0) Reject the previously accepted case where the multi-column restriction applied to just a single column, as it causes a crash downstream. The user can drop the parentheses to avoid the rejection. Fixes #7710 Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #7712	2020-11-29 17:06:41 +02:00
Avi Kivity	0584db1eb3	Merge "Unstall cleanup_compaction::get_ranges_for_invalidation" from Benny " This series adds maybe_yield called from cleanup_compaction::get_ranges_for_invalidation to avoid reactor stalls. To achieve that, we first extract bool_class can_yield to utils/maybe_yield.hh, and add a convenience helper: utils::maybe_yield(can_yield) that conditionally calls seastar::thread::maybe_yield if it can (when called in a seastar thread). With that, we add a can_yield parameter to dht::to_partition_ranges and dht::partition_range::deoverlap (defaults to false), and use it from cleanup_compaction::get_ranges_for_invalidation, as the latter is always called from `consume_in_thread`. Fixes #7674 Test: unit(dev) " * tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla: compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points dht/i_partitioner: to_partition_ranges: support yielding locator: extract can_yield to utils/maybe_yield.hh	2020-11-29 14:10:39 +02:00
Asias He	0a3a2a82e1	api: Add force_remove_endpoint for gossip It is used to force remove a node from gossip membership if something goes wrong. Note: run the force_remove_endpoint api at the same time on _all_ the nodes in the cluster in order to prevent the removed nodes from coming back. This is because nodes that have not run the force_remove_endpoint api cmd can gossip the removed node's information to other nodes within 2 * ring_delay (2 * 30 seconds by default). For instance, in a 3-node cluster where node 3 is decommissioned, to remove node 3 from gossip membership prior to the auto removal (3 days by default), run the api cmd on both node 1 and node 2 at the same time. $ curl -X POST --header "Accept: application/json" "http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3" $ curl -X POST --header "Accept: application/json" "http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3" Then run 'nodetool gossipinfo' on all the nodes to check that the removed nodes are not present. Fixes #2134 Closes #5436	2020-11-29 13:58:46 +02:00
Nadav Har'El	0864933d4d	test/cql-pytest: add "scylla_only" fixture This patch adds a fixture "scylla_only" which can be used to mark tests for Scylla-specific features. These tests are skipped when running against other CQL servers - like Apache Cassandra. We recognize Scylla by looking at whether any system table exists with the name "scylla" in its name - Scylla has several of those, and Cassandra has none. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-11-29 10:18:58 +02:00
Nadav Har'El	91ccb2afb5	test/cpy-pytest: enable experimental features Enable experimental features, and in particular UDF, so we can test those features in our tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-11-29 10:18:58 +02:00
Michał Chojnowski	fcb258cb01	utils: fragmented_temporary_buffer: implement FragmentedView for view fragmented_temporary_buffer::view is one of the types we want to directly deserialize from.	2020-11-27 15:26:13 +01:00
Michał Chojnowski	f6cc2b6a48	utils: fragment_range: add single_fragmented_view bytes_view is one of the types we want to deserialize from (at least for now), so we want to be able to pass it to deserialize() after it's transitioned to FragmentedView. single_fragmented_view is a wrapper implementing FragmentedView for bytes_view. It's constructed from bytes_view explicitly, because it's typically used in contexts where we want to phase linearization (and by extension, bytes_view) out.	2020-11-27 15:26:13 +01:00
Michał Chojnowski	0b20c7ef65	serializer: implement FragmentedView for buffer_view buffer_view is one of the types we want to directly deserialize from.	2020-11-27 15:26:13 +01:00
Michał Chojnowski	2008c0f62f	utils: fragment_range: add linearized and with_linearized for FragmentedView We would like those helpers to disappear one day but for now we still need them until everything can handle fragmented buffers.	2020-11-27 15:26:13 +01:00
Michał Chojnowski	fc90bd5190	utils: fragment_range: add FragmentedView This patch introduces FragmentedView - a concept intended as a general-purpose interface for fragmented buffers. Another concept made for this purpose, FragmentRange, already exists in the codebase. However, it's unwieldy. The iterator-based design of FragmentRange is harder to implement and requires more code, but more importantly it makes FragmentRange immutable. Usually we want to read the beginning of the buffer and pass the rest of it elsewhere. This is impossible with FragmentRange. FragmentedView can do everything FragmentRange can do and more, except for playing nicely with iterator-based collection methods, but those are useless for fragmented buffers anyway.	2020-11-27 15:26:13 +01:00
Lubos Kosco	4d0587ed11	scylla_util.py: fix metadata gcp call for disks to get details disk parsing expects output from recursive listing of GCP metadata REST call, the method used to do it by default, but now it requires a boolean flag to run in recursive mode Fixes #7684 Closes #7685	2020-11-27 15:20:56 +02:00
Pekka Enberg	c84754a634	Update tools/java submodule * tools/java ad48b44a26...8080009794 (1): > sstableloader: Fix command line parsing of "ignore-missing-columns"	2020-11-27 15:19:48 +02:00
Avi Kivity	390e07d591	dist: sysctl: configure more inotify instances Since `f3bcd4d205` ("Merge 'Support SSL Certificate Hot Reloading' from Calle"), we reload certificates as they are modified on disk. This uses inotify, which is limited by a sysctl fs.inotify.max_user_instances, with a default of 128. This is enough for 64 shards only, if both rpc and cql are encrypted; above that startup fails. Increase to 1200, which is enough for 6 instances * 200 shards. Fixes #7700. Closes #7701	2020-11-26 23:44:48 +02:00
Takuya ASADA	5f81f97773	install.sh: apply sysctl.d files on non-packaging installation We don't apply sysctl.d files on non-packaging installation; apply them, just as the rpm/deb packages do. Fixes #7702 Closes #7705	2020-11-26 09:52:14 +02:00
Takuya ASADA	ba4d54efa3	dist/redhat: packaging dependencies.conf as normal file, not ghost When we introduced dependencies.conf, we mistakenly added it to the rpm as %ghost, but it should be a normal file that is installed on package installation. Fixes #7703 Closes #7704	2020-11-26 09:50:05 +02:00
Dejan Mircevski	7f8ed811c1	cql3/expr: Clarify multi-column doesn't use indexing Although not currently used, the old code was wrong and confusing to readers. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-11-25 10:59:13 -05:00
Avi Kivity	956f031a68	Merge 'Add missing sharded<>::stop in exceptional startup code for CQL/redis' from Calle Wilund Fixes #7211 If we start a sharded<> object, then proceed to do potentially exceptional stuff, we should destroy it on said exception. Otherwise, the exception propagation will abort on RAII destruction of the sharded<>. And we get no exception logging. Closes #7697 * github.com:scylladb/scylla: redis::service: Shut down sharded<> subobject on startup exception transport::controller: Shut down distributed object on startup exception	2020-11-25 17:57:53 +02:00
Calle Wilund	55acf09662	redis::service: Shut down sharded<> subobject on startup exception Refs #7211 If we start a sharded<> object, then proceed to do potentially exceptional stuff, we should destroy it on said exception. Otherwise, the exception propagation will abort on RAII destruction of the sharded<>. And we get no exception logging.	2020-11-25 15:52:47 +00:00
Calle Wilund	ae4d5a60ca	transport::controller: Shut down distributed object on startup exception Fixes #7211 If we start a sharded<> object, then proceed to do potentially exceptional stuff, we should destroy it on said exception. Otherwise, the exception propagation will abort on RAII destruction of the sharded<>. And we get no exception logging.	2020-11-25 15:52:47 +00:00
Dejan Mircevski	db63b40347	cql3: Don't use index for multi-column restrictions The downstream code expects a single-column restriction when using an index. We could fix it, but we'd still have to filter the rows fetched from the index table, unlike the code that queries the base table directly. For instance, WHERE (c1,c2,c3) = (1,2,3) with an index on c3 can fetch just the right rows from the base table but all the c3=3 rows from the index table. Fixes #7680 Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-11-25 10:39:04 -05:00
Dejan Mircevski	ab7aa57b24	test: Add eventually_require_rows Makes it easier to combine eventually{assert_that} with useful error messages. Refs #7573. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-11-25 10:34:44 -05:00
Benny Halevy	e1fe1f18c7	compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points Avoid reactor stalls by allowing yielding in long-running loops as seen in #7674. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-25 13:46:32 +02:00
Gleb Natapov	be6119b350	raft: ignore append_reply from a peer in SNAPSHOT state If append_reply is received from a node that is currently having a snapshot transferred to it, ignore it; it is a stray reply.	2020-11-25 12:36:41 +02:00
Gleb Natapov	851e3000c4	raft: Ignore outdated snapshots Do not try to install snapshots that are older than current one.	2020-11-25 12:36:41 +02:00
Gleb Natapov	2ce9473037	raft: set next_idx to correct value after snapshot transfer After snapshot is transferred progress::next_idx is set to its index, but the code uses current snapshot to set it instead of the snapshot that was transferred. Those can be different snapshots.	2020-11-25 11:34:49 +02:00
Tomasz Grabiec	0aa1f7c70a	Merge "raft: fix replication if existing log on leader" from Gleb * scylla-dev/add_dummy_v2: raft: test: replication works on leader change without adding an entry raft: commit a dummy entry after leader change raft: test: fix snapshot correctness check sstables: add `may_have_partition_tombstones` method	2020-11-24 11:35:18 +01:00
Gleb Natapov	51d1d20687	raft: test: replication works on leader change without adding an entry Check that a newly elected leader commits all the entries in its log without waiting for more entries to be submitted.	2020-11-24 11:35:18 +01:00
Gleb Natapov	6130fb8b39	raft: commit a dummy entry after leader change After a node becomes leader it needs to do two things: send an append message to establish its leadership and commit one entry to make sure all previous entries with smaller terms are committed as well.	2020-11-24 11:35:18 +01:00
Gleb Natapov	e3a886738b	raft: test: fix snapshot correctness check Snapshot index cannot be used to check snapshot correctness since some entries may not be commands and thus do not affect the snapshot value. Let's use the applied entries count instead.	2020-11-24 11:35:18 +01:00
Benny Halevy	37e971ad87	dht/i_partitioner: to_partition_ranges: support yielding Allow yielding to prevent reactor stalls when called with a long vector of ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-24 12:23:56 +02:00
Benny Halevy	157a964a63	locator: extract can_yield to utils/maybe_yield.hh Move the definition of bool_class can_yield to a standalone header file and define there a maybe_yield(can_yield) helper. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-24 12:23:56 +02:00
Asias He	1b2155eb1d	repair: Use same description for the same metric In commit `9b28162f88` (repair: Use label for node ops metrics), we switched to use label for different node operations. We should use the same description for the same metric name. Fixes #7681 Closes #7682	2020-11-24 09:35:39 +02:00
Avi Kivity	e8ff77c05f	Merge 'sstables: a bunch of refactors' from Kamil Braun 1. sstables: move `sstable_set` implementations to a separate module All the implementations were kept in sstables/compaction_strategy.cc which is quite large even without them. `sstable_set` already had its own header file, now it gets its own implementation file. The declarations of implementation classes and interfaces (`sstable_set_impl`, `bag_sstable_set`, and so on) were also exposed in a header file, sstable_set_impl.hh, for the purposes of potential unit testing. 2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh Files which need this definition won't have to include mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are in total smaller; mutation_reader.hh includes flat_mutation_reader.hh). 3. sstables: move sstable reader creation functions to `sstable_set` Lower level functions such as `create_single_key_sstable_reader` were made methods of `sstable_set`. The motivation is that each concrete sstable_set may decide to use a better sstable reading algorithm specific to the data structures used by this sstable_set. For this it needs to access the set's internals. A nice side effect is that we moved some code out of table.cc and database.hh which are huge files. 4. sstables: pass `ring_position` to `create_single_key_sstable_reader` instead of `partition_range`. It would be best to pass `partition_key` or `decorated_key` here. However, the implementation of this function needs a `partition_range` to pass into `sstable_set::select`, and `partition_range` must be constructed from `ring_position`s. We could create the `ring_position` internally from the key but that would involve a copy which we want to avoid. 5. sstable_set: refactor `filter_sstable_for_reader_by_pk` Introduce a `make_pk_filter` function, which given a ring position, returns a boolean function (a filter) that given a sstable, tells whether the sstable may contain rows with the given position. The logic has been extracted from `filter_sstable_for_reader_by_pk`. Split from #7437. Closes #7655 * github.com:scylladb/scylla: sstable_set: refactor filter_sstable_for_reader_by_pk sstables: pass ring_position to create_single_key_sstable_reader sstables: move sstable reader creation functions to `sstable_set` mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh sstables: move sstable_set implementations to a separate module	2020-11-24 09:23:57 +02:00
Michał Chojnowski	9bceaac44c	utils: fragmented_temporary_buffer: fix view::remove_prefix This piece of logic was wrong for two unrelated reasons: 1. When fragmented_temporary_buffer::view is constructed from bytes_view, _current is null. When remove_prefix was used on such view, null pointer dereference happened. 2. It only worked for the first remove_prefix call. A second call would put a wrong value in _current_position.	2020-11-24 03:05:13 +01:00
Kamil Braun	6c8b0af505	sstable_set: refactor filter_sstable_for_reader_by_pk Introduce a `make_pk_filter` function, which given a ring position, returns a boolean function (a filter) that given a sstable, tells whether the sstable may contain rows with the given position. The logic has been extracted from `filter_sstable_for_reader_by_pk`.	2020-11-23 12:35:10 +01:00
Kamil Braun	68663d0de0	sstables: pass ring_position to create_single_key_sstable_reader instead of partition_range. It would be best to pass `partition_key` or `decorated_key` here. However, the implementation of this function needs a `partition_range` to pass into `sstable_set::select`, and `partition_range` must be constructed from `ring_position`s. We could create the `ring_position` internally from the key but that would involve a copy which we want to avoid.	2020-11-23 12:33:24 +01:00
Amnon Heiman	9e116d136e	Adding dist/common/scripts/scylla_rsyslog_setup utility scylla_rsyslog_setup adds a configuration file to rsyslog to forward the traces to a remote server. It will override any existing file, so it is safe to run it multiple times. It takes an ip, or ip and port, from the user for that configuration; if no port is provided, the default port of Scylla-Monitoring promtail is used. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2020-11-22 15:48:48 +02:00
Kamil Braun	40d8bfa394	sstables: move sstable reader creation functions to `sstable_set` Lower level functions such as `create_single_key_sstable_reader` were made methods of `sstable_set`. The motivation is that each concrete sstable_set may decide to use a better sstable reading algorithm specific to the data structures used by this sstable_set. For this it needs to access the set's internals. A nice side effect is that we moved some code out of table.cc and database.hh which are huge files.	2020-11-19 17:52:39 +01:00
Kamil Braun	708093884c	mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh Files which need this definition won't have to include mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).	2020-11-19 17:52:39 +01:00
Kamil Braun	b02b441c2e	sstables: move sstable_set implementations to a separate module All the implementations were kept in sstables/compaction_strategy.cc which is quite large even without them. `sstable_set` already had its own header file, now it gets its own implementation file. The declarations of implementation classes and interfaces (`sstable_set_impl`, `bag_sstable_set`, and so on) were also exposed in a header file, sstable_set_impl.hh, for the purposes of potential unit testing.	2020-11-19 17:52:37 +01:00
Avi Kivity	1cf02cb9d8	types: add constraint on lexicographical_tri_compare() Verify that the input types are iterators and their value types are compatible with the compare function.	2020-11-17 15:19:46 +02:00
Avi Kivity	71e93d63c5	composite: make composite::iterator a real input_iterator Iterators require a default constructor, so add one. This helps a later patch use std::input_iterator to constrain template parameters.	2020-11-17 15:19:46 +02:00
Avi Kivity	867b41b124	compound: make compound_type::iterator a real input_iterator Iterators require a default constructor, so add one. This helps a later patch use std::input_iterator to constrain template parameters.	2020-11-17 15:19:38 +02:00

2438 changed files with 126788 additions and 49391 deletions
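
Commit 9bceaac44c in the list above fixes `view::remove_prefix` for two cases: a view built from a single contiguous buffer, and repeated calls on the same view. The sketch below shows one way to keep such a view consistent across both cases by tracking a fragment index and an offset. The class `fragmented_view_sketch` and all of its members are hypothetical; this is a simplified illustration, not the `fragmented_temporary_buffer` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical, simplified view over a sequence of byte fragments.
class fragmented_view_sketch {
    std::vector<std::string_view> _fragments; // all fragments, in order
    std::size_t _index = 0;   // fragment currently being consumed
    std::size_t _offset = 0;  // bytes already consumed in that fragment
    std::size_t _size = 0;    // total bytes still visible through the view
public:
    explicit fragmented_view_sketch(std::vector<std::string_view> fragments)
        : _fragments(std::move(fragments)) {
        for (auto f : _fragments) {
            _size += f.size();
        }
    }

    // A single contiguous buffer is treated as a one-fragment view, so it
    // goes through exactly the same code path as the fragmented case.
    explicit fragmented_view_sketch(std::string_view single)
        : fragmented_view_sketch(std::vector<std::string_view>{single}) {}

    std::size_t size() const { return _size; }

    // Drop the first n bytes. Position is always derived from the current
    // (index, offset) pair, so a second call continues where the first ended.
    void remove_prefix(std::size_t n) {
        assert(n <= _size);
        _size -= n;
        while (n > 0) {
            std::size_t left = _fragments[_index].size() - _offset;
            if (n < left) {
                _offset += n;
                return;
            }
            n -= left;
            ++_index;
            _offset = 0;
        }
    }

    // First visible byte; skips over exhausted or empty fragments.
    char front() const {
        assert(_size > 0);
        std::size_t i = _index;
        std::size_t off = _offset;
        while (off == _fragments[i].size()) {
            ++i;
            off = 0;
        }
        return _fragments[i][off];
    }
};
```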

28

.github/CODEOWNERS vendored


@@ -4,7 +4,7 @@ auth/* @elcallio @vladzcloudius
 # CACHE
 row_cache* @tgrabiec @haaawk
 *mutation* @tgrabiec @haaawk
 tests/mvcc* @tgrabiec @haaawk
 test/boost/mvcc* @tgrabiec @haaawk
 # CDC
 cdc/* @haaawk @kbr- @elcallio @piodul @jul-stas
@@ -19,13 +19,13 @@ db/batch* @elcallio
 service/storage_proxy* @gleb-cloudius
 # COMPACTION
 sstables/compaction* @raphaelsc @nyh
 compaction/* @raphaelsc @nyh
 # CQL TRANSPORT LAYER
 transport/* @penberg
 transport/*
 # CQL QUERY LANGUAGE
 cql3/* @tgrabiec @penberg @psarna
 cql3/* @tgrabiec @psarna @cvybhu
 # COUNTERS
 counters* @haaawk @jul-stas
@@ -35,7 +35,7 @@ tests/counter_test* @haaawk @jul-stas
 gms/* @tgrabiec @asias
 # DOCKER
 dist/docker/* @penberg
 dist/docker/*
 # LSA
 utils/logalloc* @tgrabiec
@@ -58,9 +58,9 @@ service/migration* @tgrabiec @nyh
 schema* @tgrabiec @nyh
 # SECONDARY INDEXES
 db/index/* @nyh @penberg @psarna
 cql3/statements/*index* @nyh @penberg @psarna
 test/boost/*index* @nyh @penberg @psarna
 db/index/* @nyh @psarna
 cql3/statements/*index* @nyh @psarna
 test/boost/*index* @nyh @psarna
 # SSTABLES
 sstables/* @tgrabiec @raphaelsc @nyh
@@ -78,10 +78,20 @@ db/hints/* @haaawk @piodul @vladzcloudius
 # REDIS
 redis/* @nyh @syuu1228
 redis-test/* @nyh @syuu1228
 test/redis/* @nyh @syuu1228
 # READERS
 reader_* @denesb
 querier* @denesb
 test/boost/mutation_reader_test.cc @denesb
 test/boost/querier_cache_test.cc @denesb
 # PYTEST-BASED CQL TESTS
 test/cql-pytest/* @nyh
 # RAFT
 raft/* @kbr- @gleb-cloudius @kostja
 test/raft/* @kbr- @gleb-cloudius @kostja
 # HEAT-WEIGHTED LOAD BALANCING
 db/heat_load_balance.* @nyh @gleb-cloudius

									
29

.github/workflows/docs-pages@v2.yaml vendored Normal file

@@ -0,0 +1,29 @@
 name: "Docs / Publish"
 on:
   push:
     branches:
     - master
     paths:
     - "docs/**"
   workflow_dispatch:
 jobs:
   release:
     runs-on: ubuntu-latest
     steps:
     - name: Checkout
       uses: actions/checkout@v2
       with:
         persist-credentials: false
         fetch-depth: 0
     - name: Set up Python
       uses: actions/setup-python@v1
       with:
         python-version: 3.7
     - name: Build docs
       run: make -C docs multiversion
     - name: Deploy
       run: ./docs/_utils/deploy.sh
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

									
25

.github/workflows/docs-pr@v1.yaml vendored Normal file

@@ -0,0 +1,25 @@
 name: "Docs / Build PR"
 on:
   pull_request:
     branches:
     - master
     paths:
     - "docs/**"
 jobs:
   build:
     name: Build
     runs-on: ubuntu-latest
     steps:
     - name: Checkout
       uses: actions/checkout@v2
       with:
         persist-credentials: false
         fetch-depth: 0
     - name: Set up Python
       uses: actions/setup-python@v1
       with:
         python-version: 3.7
     - name: Build docs
       run: make -C docs test

4

.gitignore vendored


@@ -25,3 +25,7 @@ tags
 testlog
 test/*/*.reject
 .vscode
 docs/_build
 docs/poetry.lock
 compile_commands.json
 .ccls-cache/

2

.gitmodules vendored


@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
 	url = ../seastar
 	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui

									
124

CMakeLists.txt

@@ -32,8 +32,13 @@ if(target_arch)
     set(target_arch_flag "-march=${target_arch}")
 endif()
 set(cxx_coro_flag)
 if (CMAKE_CXX_COMPILER_ID MATCHES GNU)
     set(cxx_coro_flag -fcoroutines)
 endif()
 # Configure Seastar compile options to align with Scylla
 set(Seastar_CXX_FLAGS -fcoroutines ${target_arch_flag} CACHE INTERNAL "" FORCE)
 set(Seastar_CXX_FLAGS ${cxx_coro_flag} ${target_arch_flag} CACHE INTERNAL "" FORCE)
 set(Seastar_CXX_DIALECT gnu++20 CACHE INTERNAL "" FORCE)
 add_subdirectory(seastar)
@@ -96,7 +101,7 @@ endfunction()
 scylla_generate_thrift(
     TARGET scylla_thrift_gen_cassandra
     VAR scylla_thrift_gen_cassandra_files
     IN_FILE interface/cassandra.thrift
     IN_FILE "${CMAKE_SOURCE_DIR}/interface/cassandra.thrift"
     OUT_DIR ${scylla_gen_build_dir}
     SERVICE Cassandra)
@@ -153,7 +158,7 @@ foreach(f ${antlr3_grammar_files})
     scylla_generate_antlr3(
         TARGET scylla_antlr3_gen_${grammar_file_name}
         VAR scylla_antlr3_gen_${grammar_file_name}_files
         IN_FILE ${f}
         IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
         OUT_DIR ${scylla_gen_build_dir}/${f_dir})
     list(APPEND antlr3_gen_files "${scylla_antlr3_gen_${grammar_file_name}_files}")
 endforeach()
@@ -162,7 +167,7 @@ endforeach()
 seastar_generate_ragel(
     TARGET scylla_ragel_gen_protocol_parser
     VAR scylla_ragel_gen_protocol_parser_file
     IN_FILE redis/protocol_parser.rl
     IN_FILE "${CMAKE_SOURCE_DIR}/redis/protocol_parser.rl"
     OUT_FILE ${scylla_gen_build_dir}/redis/protocol_parser.hh)
 # Generate C++ sources from Swagger definitions
@@ -194,7 +199,7 @@ foreach(f ${swagger_files})
     seastar_generate_swagger(
         TARGET scylla_swagger_gen_${fname}
         VAR scylla_swagger_gen_${fname}_files
         IN_FILE "${f}"
         IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
         OUT_DIR "${scylla_gen_build_dir}/${dir}")
     list(APPEND swagger_gen_files "${scylla_swagger_gen_${fname}_files}")
 endforeach()
@@ -229,6 +234,7 @@ set(idl_serializers
     idl/frozen_mutation.idl.hh
     idl/frozen_schema.idl.hh
     idl/gossip_digest.idl.hh
     idl/hinted_handoff.idl.hh
     idl/idl_test.idl.hh
     idl/keys.idl.hh
     idl/messaging_service.idl.hh
@@ -237,6 +243,7 @@ set(idl_serializers
     idl/partition_checksum.idl.hh
     idl/paxos.idl.hh
     idl/query.idl.hh
     idl/raft.idl.hh
     idl/range.idl.hh
     idl/read_command.idl.hh
     idl/reconcilable_result.idl.hh
@@ -260,7 +267,7 @@ foreach(f ${idl_serializers})
     scylla_generate_idl_serializer(
         TARGET scylla_idl_gen_${idl_target}
         VAR scylla_idl_gen_${idl_target}_files
         IN_FILE ${f}
         IN_FILE "${CMAKE_SOURCE_DIR}/${f}"
         OUT_FILE ${scylla_gen_build_dir}/${idl_dir}/${idl_out_hdr_name})
     list(APPEND idl_gen_files "${scylla_idl_gen_${idl_target}_files}")
 endforeach()
@@ -268,8 +275,8 @@ endforeach()
 set(scylla_sources
     absl-flat_hash_map.cc
     alternator/auth.cc
     alternator/base64.cc
     alternator/conditions.cc
     alternator/controller.cc
     alternator/executor.cc
     alternator/expressions.cc
     alternator/serialization.cc
@@ -314,6 +321,7 @@ set(scylla_sources
     auth/standard_role_manager.cc
     auth/transitional.cc
     bytes.cc
     caching_options.cc
     canonical_mutation.cc
     cdc/cdc_partitioner.cc
     cdc/generation.cc
@@ -322,6 +330,12 @@ set(scylla_sources
     cdc/split.cc
     clocks-impl.cc
     collection_mutation.cc
     compaction/compaction.cc
     compaction/compaction_manager.cc
     compaction/compaction_strategy.cc
     compaction/leveled_compaction_strategy.cc
     compaction/size_tiered_compaction_strategy.cc
     compaction/time_window_compaction_strategy.cc
     compress.cc
     connection_notifier.cc
     converting_mutation_partition_applier.cc
@@ -335,6 +349,7 @@ set(scylla_sources
     cql3/constants.cc
     cql3/cql3_type.cc
     cql3/expr/expression.cc
     cql3/expr/term_expr.cc
     cql3/functions/aggregate_fcts.cc
     cql3/functions/castas_fcts.cc
     cql3/functions/error_injection_fcts.cc
@@ -345,6 +360,7 @@ set(scylla_sources
     cql3/lists.cc
     cql3/maps.cc
     cql3/operation.cc
     cql3/prepare_context.cc
     cql3/query_options.cc
     cql3/query_processor.cc
     cql3/relation.cc
@@ -360,25 +376,32 @@ set(scylla_sources
     cql3/sets.cc
     cql3/single_column_relation.cc
     cql3/statements/alter_keyspace_statement.cc
     cql3/statements/alter_service_level_statement.cc
     cql3/statements/alter_table_statement.cc
     cql3/statements/alter_type_statement.cc
     cql3/statements/alter_view_statement.cc
     cql3/statements/attach_service_level_statement.cc
     cql3/statements/authentication_statement.cc
     cql3/statements/authorization_statement.cc
     cql3/statements/batch_statement.cc
     cql3/statements/cas_request.cc
     cql3/statements/cf_prop_defs.cc
     cql3/statements/cf_statement.cc
     cql3/statements/create_aggregate_statement.cc
     cql3/statements/create_function_statement.cc
     cql3/statements/create_index_statement.cc
     cql3/statements/create_keyspace_statement.cc
     cql3/statements/create_service_level_statement.cc
     cql3/statements/create_table_statement.cc
     cql3/statements/create_type_statement.cc
     cql3/statements/create_view_statement.cc
     cql3/statements/delete_statement.cc
     cql3/statements/detach_service_level_statement.cc
     cql3/statements/drop_aggregate_statement.cc
     cql3/statements/drop_function_statement.cc
     cql3/statements/drop_index_statement.cc
     cql3/statements/drop_keyspace_statement.cc
     cql3/statements/drop_service_level_statement.cc
     cql3/statements/drop_table_statement.cc
     cql3/statements/drop_type_statement.cc
     cql3/statements/drop_view_statement.cc
@@ -388,6 +411,8 @@ set(scylla_sources
     cql3/statements/index_target.cc
     cql3/statements/ks_prop_defs.cc
     cql3/statements/list_permissions_statement.cc
     cql3/statements/list_service_level_attachments_statement.cc
     cql3/statements/list_service_level_statement.cc
     cql3/statements/list_users_statement.cc
     cql3/statements/modification_statement.cc
     cql3/statements/permission_altering_statement.cc
@@ -397,6 +422,8 @@ set(scylla_sources
     cql3/statements/role-management-statements.cc
     cql3/statements/schema_altering_statement.cc
     cql3/statements/select_statement.cc
     cql3/statements/service_level_statement.cc
     cql3/statements/sl_prop_defs.cc
     cql3/statements/truncate_statement.cc
     cql3/statements/update_statement.cc
     cql3/statements/use_statement.cc
@@ -406,11 +433,9 @@ set(scylla_sources
     cql3/untyped_result_set.cc
     cql3/update_parameters.cc
     cql3/user_types.cc
     cql3/ut_name.cc
     cql3/util.cc
     cql3/ut_name.cc
     cql3/values.cc
     cql3/variable_specifications.cc
     data/cell.cc
     database.cc
     db/batchlog_manager.cc
     db/commitlog/commitlog.cc
@@ -422,8 +447,10 @@ set(scylla_sources
     db/data_listeners.cc
     db/extensions.cc
     db/heat_load_balance.cc
     db/hints/host_filter.cc
     db/hints/manager.cc
     db/hints/resource_manager.cc
     db/hints/sync_point.cc
     db/large_data_handler.cc
     db/legacy_schema_migrator.cc
     db/marshal/type_parser.cc
@@ -436,6 +463,7 @@ set(scylla_sources
     db/view/row_locking.cc
     db/view/view.cc
     db/view/view_update_generator.cc
     db/virtual_table.cc
     dht/boot_strapper.cc
     dht/i_partitioner.cc
     dht/murmur3_partitioner.cc
@@ -447,17 +475,18 @@ set(scylla_sources
     flat_mutation_reader.cc
     frozen_mutation.cc
     frozen_schema.cc
     generic_server.cc
     gms/application_state.cc
     gms/endpoint_state.cc
     gms/failure_detector.cc
     gms/feature_service.cc
     gms/gossip_digest_ack.cc
     gms/gossip_digest_ack2.cc
     gms/gossip_digest_ack.cc
     gms/gossip_digest_syn.cc
     gms/gossiper.cc
     gms/inet_address.cc
     gms/version_generator.cc
     gms/versioned_value.cc
     gms/version_generator.cc
     hashers.cc
     index/secondary_index.cc
     index/secondary_index_manager.cc
@@ -465,6 +494,7 @@ set(scylla_sources
     keys.cc
     lister.cc
     locator/abstract_replication_strategy.cc
     locator/azure_snitch.cc
     locator/ec2_multi_region_snitch.cc
     locator/ec2_snitch.cc
     locator/everywhere_replication_strategy.cc
@@ -478,31 +508,33 @@ set(scylla_sources
     locator/simple_strategy.cc
     locator/snitch_base.cc
     locator/token_metadata.cc
     lua.cc
     lang/lua.cc
     main.cc
     memtable.cc
     message/messaging_service.cc
     multishard_mutation_query.cc
     mutation.cc
     raft/fsm.cc
     raft/log.cc
     raft/progress.cc
     raft/raft.cc
     raft/server.cc
     mutation_fragment.cc
     mutation_partition.cc
     mutation_partition_serializer.cc
     mutation_partition_view.cc
     mutation_query.cc
     mutation_reader.cc
     mutation_writer/feed_writers.cc
     mutation_writer/multishard_writer.cc
     mutation_writer/partition_based_splitting_writer.cc
     mutation_writer/shard_based_splitting_writer.cc
     mutation_writer/timestamp_based_splitting_writer.cc
     partition_slice_builder.cc
     partition_version.cc
     querier.cc
     query-result-set.cc
     query.cc
     query-result-set.cc
     raft/fsm.cc
     raft/log.cc
     raft/raft.cc
     raft/server.cc
     raft/tracker.cc
     range_tombstone.cc
     range_tombstone_list.cc
     reader_concurrency_semaphore.cc
@@ -518,15 +550,16 @@ set(scylla_sources
     redis/server.cc
     redis/service.cc
     redis/stats.cc
     release.cc
     repair/repair.cc
     repair/row_level.cc
     row_cache.cc
     schema.cc
     schema_mutations.cc
     schema_registry.cc
     serializer.cc
     service/client_state.cc
     service/migration_manager.cc
     service/migration_task.cc
     service/misc_services.cc
     service/pager/paging_state.cc
     service/pager/query_pagers.cc
@@ -535,29 +568,33 @@ set(scylla_sources
     service/paxos/prepare_summary.cc
     service/paxos/proposal.cc
     service/priority_manager.cc
     service/qos/qos_common.cc
     service/qos/service_level_controller.cc
     service/qos/standard_service_level_distributed_data_accessor.cc
     service/raft/raft_gossip_failure_detector.cc
     service/raft/raft_group_registry.cc
     service/raft/raft_rpc.cc
     service/raft/raft_sys_table_storage.cc
     service/raft/schema_raft_state_machine.cc
     service/storage_proxy.cc
     service/storage_service.cc
     sstables/compaction.cc
     sstables/compaction_manager.cc
     sstables/compaction_strategy.cc
     sstables/compress.cc
     sstables/integrity_checked_file_impl.cc
     sstables/kl/writer.cc
     sstables/leveled_compaction_strategy.cc
     sstables/m_format_read_helpers.cc
     sstables/kl/reader.cc
     sstables/metadata_collector.cc
     sstables/mp_row_consumer.cc
     sstables/m_format_read_helpers.cc
     sstables/mx/reader.cc
     sstables/mx/writer.cc
     sstables/partition.cc
     sstables/prepended_input_stream.cc
     sstables/random_access_reader.cc
     sstables/size_tiered_compaction_strategy.cc
     sstables/sstable_directory.cc
     sstables/sstable_version.cc
     sstables/sstable_mutation_reader.cc
     sstables/sstables.cc
     sstables/sstable_set.cc
     sstables/sstables_manager.cc
     sstables/time_window_compaction_strategy.cc
     sstables/sstable_version.cc
     sstables/writer.cc
     streaming/consumer.cc
     streaming/progress_info.cc
     streaming/session_info.cc
     streaming/stream_coordinator.cc
@@ -579,11 +616,13 @@ set(scylla_sources
     thrift/server.cc
     thrift/thrift_validation.cc
     timeout_config.cc
     tools/scylla-sstable-index.cc
     tools/scylla-types.cc
     tracing/traced_file.cc
     tracing/trace_keyspace_helper.cc
     tracing/trace_state.cc
     tracing/traced_file.cc
     tracing/tracing.cc
     tracing/tracing_backend_registry.cc
     tracing/tracing.cc
     transport/controller.cc
     transport/cql_protocol_extension.cc
     transport/event.cc
@@ -592,10 +631,10 @@ set(scylla_sources
     transport/server.cc
     types.cc
     unimplemented.cc
     utils/UUID_gen.cc
     utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc
     utils/array-search.cc
     utils/ascii.cc
     utils/base64.cc
     utils/big_decimal.cc
     utils/bloom_calculations.cc
     utils/bloom_filter.cc
@@ -610,6 +649,7 @@ set(scylla_sources
     utils/file_lock.cc
     utils/generation-number.cc
     utils/gz/crc_combine.cc
     utils/gz/gen_crc_combine_table.cc
     utils/human_readable.cc
     utils/i_filter.cc
     utils/large_bitset.cc
@@ -625,10 +665,10 @@ set(scylla_sources
     utils/updateable_value.cc
     utils/utf8.cc
     utils/uuid.cc
     utils/UUID_gen.cc
     validation.cc
     vint-serialization.cc
     zstd.cc
     release.cc)
     zstd.cc)
 set(scylla_gen_sources
     "${scylla_thrift_gen_cassandra_files}"
@@ -689,7 +729,7 @@ target_link_libraries(scylla PRIVATE
 target_compile_options(scylla PRIVATE
     -std=gnu++20
     -fcoroutines # TODO: Clang does not have this flag, adjust to both variants
     ${cxx_coro_flag}
     ${target_arch_flag})
 # Hacks needed to expose internal APIs for xxhash dependencies
 target_compile_definitions(scylla PRIVATE XXH_PRIVATE_API HAVE_LZ4_COMPRESS_DEFAULT)
@@ -709,7 +749,7 @@ target_link_libraries(crc_combine_table PRIVATE seastar)
 target_include_directories(crc_combine_table PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}")
 target_compile_options(crc_combine_table PRIVATE
     -std=gnu++20
     -fcoroutines
     ${cxx_coro_flag}
     ${target_arch_flag})
 add_dependencies(scylla crc_combine_table)
@@ -722,15 +762,15 @@ target_sources(scylla PRIVATE "${scylla_gen_build_dir}/utils/gz/crc_combine_tabl
 ###
 ### Generate version file and supply appropriate compile definitions for release.cc
 ###
 execute_process(COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN RESULT_VARIABLE scylla_version_gen_res)
 execute_process(COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN --output-dir "${CMAKE_BINARY_DIR}/gen" RESULT_VARIABLE scylla_version_gen_res)
 if(scylla_version_gen_res)
     message(SEND_ERROR "Version file generation failed. Return code: ${scylla_version_gen_res}")
 endif()
 file(READ build/SCYLLA-VERSION-FILE scylla_version)
 file(READ "${CMAKE_BINARY_DIR}/gen/SCYLLA-VERSION-FILE" scylla_version)
 string(STRIP "${scylla_version}" scylla_version)
 file(READ build/SCYLLA-RELEASE-FILE scylla_release)
 file(READ "${CMAKE_BINARY_DIR}/gen/SCYLLA-RELEASE-FILE" scylla_release)
 string(STRIP "${scylla_release}" scylla_release)
 get_property(release_cdefs SOURCE "${CMAKE_SOURCE_DIR}/release.cc" PROPERTY COMPILE_DEFINITIONS)
@@ -742,7 +782,7 @@ set_source_files_properties("${CMAKE_SOURCE_DIR}/release.cc" PROPERTIES COMPILE_
 ###
 set(libdeflate_lib "${scylla_build_dir}/libdeflate/libdeflate.a")
 add_custom_command(OUTPUT "${libdeflate_lib}"
     COMMAND make -C libdeflate
     COMMAND make -C "${CMAKE_SOURCE_DIR}/libdeflate"
         BUILD_DIR=../build/${BUILD_TYPE}/libdeflate/
         CC=${CMAKE_C_COMPILER}
         "CFLAGS=${target_arch_flag}"

									
21

CONTRIBUTING.md

@@ -1,11 +1,20 @@
 # Asking questions or requesting help
 # Contributing to Scylla
 Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
 ## Asking questions or requesting help
 # Reporting an issue
 Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
 Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues.  Fill in as much information as you can in the issue template, especially for performance problems.
 Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
 # Contributing Code to Scylla
 ## Reporting an issue
 To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
 Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
 ## Contributing code to Scylla
 Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
 If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
 The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so you should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
 Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.

									
										28

HACKING.md
									
												
				@@ -172,12 +172,8 @@ and you will get output like this:

				```

				CQL QUERY LANGUAGE

				  Tomasz Grabiec <tgrabiec@scylladb.com>   [maintainer]

				  Pekka Enberg <penberg@scylladb.com>      [maintainer]

				MATERIALIZED VIEWS

				  Pekka Enberg <penberg@scylladb.com>      [maintainer]

				  Duarte Nunes <duarte@scylladb.com>       [maintainer]

				  Nadav Har'El <nyh@scylladb.com>          [reviewer]

				  Duarte Nunes <duarte@scylladb.com>       [reviewer]

				```

				### Running Scylla

				@@ -366,7 +362,27 @@ $ git remote update

				$ git checkout -t local/my_local_seastar_branch

				```

				### Generating code coverage report

				Install dependencies:

				    $ dnf install llvm # for llvm-profdata and llvm-cov

				    $ dnf install lcov # for genhtml

				Instruct `configure.py` to generate build files for `coverage` mode:

				    $ ./configure.py --mode=coverage

				Build the tests you want to run, then run them via `test.py` (important!):

				    $ ./test.py --mode=coverage [...]

				Alternatively, you can run individual tests via `./scripts/coverage.py --run`.

				Open the link printed at the end. Be horrified. Go and write more tests.

				For more details see `./scripts/coverage.py --help`.

				### Core dump debugging

				Slides:

				2018.11.20: https://www.slideshare.net/tomekgrabiec/scylla-core-dump-debugging-tools

				See [debugging.md](debugging.md).

4

NOTICE.txt


@@ -5,3 +5,7 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
 These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
 It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
 It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)
 It includes files from https://github.com/bytecodealliance/wasmtime-cpp (owned by Bytecode Alliance), licensed with Apache License 2.0.

									
										11

README.md
									
												
				@@ -42,8 +42,8 @@ For further information, please see:

				* [Docker image build documentation] for information on how to build Docker images.

				[developer documentation]: HACKING.md

				[build documentation]: docs/building.md

				[docker image build documentation]: dist/docker/redhat/README.md

				[build documentation]: docs/guides/building.md

				[docker image build documentation]: dist/docker/debian/README.md

				## Running Scylla

				@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help

				## Testing

				See [test.py manual](docs/testing.md).

				See [test.py manual](docs/guides/testing.md).

				## Scylla APIs and compatibility

				By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and

				@@ -78,10 +78,7 @@ and the current compatibility of this feature as well as Scylla-specific extensi

				## Documentation

				Documentation can be found in [./docs](./docs) and on the

				[wiki](https://github.com/scylladb/scylla/wiki). There is currently no clear

				definition of what goes where, so when looking for something be sure to check

				both.

				Documentation can be found [here](https://scylla.docs.scylladb.com).

				Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).

				User documentation can be found [here](https://docs.scylladb.com/).

76

SCYLLA-VERSION-GEN


@@ -1,7 +1,66 @@
 #!/bin/sh
 USAGE=$(cat <<-END
 Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] -- generate Scylla version and build information files.
 Options:
   -h|--help show this help message.
   -o|--output-dir PATH specify destination path at which the version files are to be created.
 By default, the script will attempt to parse 'version' file
 in the current directory, which should contain a string of
 '\$version-\$release' form.
 Otherwise, it will call 'git log' on the source tree (the
 directory, which contains the script) to obtain current
 commit hash and use it for building the version and release
 strings.
 The script assumes that it's called from the Scylla source
 tree.
 The files created are:
   SCYLLA-VERSION-FILE
   SCYLLA-RELEASE-FILE
   SCYLLA-PRODUCT-FILE
 By default, these files are created in the 'build'
 subdirectory under the directory containing the script.
 The destination directory can be overridden by
 using '-o PATH' option.
 END
 )
 while [[ $# -gt 0 ]]; do
 	opt="$1"
 	case $opt in
 		-h|--help)
 			echo "$USAGE"
 			exit 0
 			;;
 		-o|--output-dir)
 			OUTPUT_DIR="$2"
 			shift
 			shift
 			;;
 		*)
 			echo "Unexpected argument found: $1"
 			echo
 			echo "$USAGE"
 			exit 1
 			;;
 	esac
 done
 SCRIPT_DIR="$(dirname "$0")"
 if [ -z "$OUTPUT_DIR" ]; then
 	OUTPUT_DIR="$SCRIPT_DIR/build"
 fi
 # Default scylla product/version tags
 PRODUCT=scylla
 VERSION=4.4.dev
 VERSION=4.6.11
 if test -f version
 then
@@ -9,7 +68,7 @@ then
 	SCYLLA_RELEASE=$(cat version | awk -F'-' '{print $2}')
 else
 	DATE=$(date +%Y%m%d)
 	GIT_COMMIT=$(git log --pretty=format:'%h' -n 1)
 	GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1)
 	SCYLLA_VERSION=$VERSION
 	# For custom package builds, replace "0" with "counter.your_name",
 	# where counter starts at 1 and increments for successive versions.
@@ -19,16 +78,15 @@ else
 	SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
 fi
 if [ -f build/SCYLLA-RELEASE-FILE ]; then
 	RELEASE_FILE=$(cat build/SCYLLA-RELEASE-FILE)
 	GIT_COMMIT_FILE=$(cat build/SCYLLA-RELEASE-FILE |cut -d . -f 3)
 if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
 	GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
 	if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
 		exit 0
 	fi
 fi
 echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"
 mkdir -p build
 echo "$SCYLLA_VERSION" > build/SCYLLA-VERSION-FILE
 echo "$SCYLLA_RELEASE" > build/SCYLLA-RELEASE-FILE
 echo "$PRODUCT" > build/SCYLLA-PRODUCT-FILE
 mkdir -p "$OUTPUT_DIR"
 echo "$SCYLLA_VERSION" > "$OUTPUT_DIR/SCYLLA-VERSION-FILE"
 echo "$SCYLLA_RELEASE" > "$OUTPUT_DIR/SCYLLA-RELEASE-FILE"
 echo "$PRODUCT" > "$OUTPUT_DIR/SCYLLA-PRODUCT-FILE"
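
The usage text above says the `version` file holds a string of `$version-$release` form, and the hunks show the release built as `$SCYLLA_BUILD.$DATE.$GIT_COMMIT` with the commit hash read back via `cut -d . -f 3`. A minimal sketch of that splitting follows. The helper names `split_version` and `split_release` and the sample string are illustrative assumptions, not part of the script, and taking field 1 as the version is assumed by symmetry with the `print $2` line shown in the diff.

```shell
#!/bin/sh
# Hypothetical helpers; only the awk/cut invocations mirror the diff above.

split_version() {
    # Field 1 of a '$version-$release' string (assumed to be the version).
    printf '%s\n' "$1" | awk -F'-' '{print $1}'
}

split_release() {
    # Field 2 of a '$version-$release' string: the release.
    printf '%s\n' "$1" | awk -F'-' '{print $2}'
}

# Illustrative value only, shaped like $VERSION-$SCYLLA_BUILD.$DATE.$GIT_COMMIT.
sample="4.6.11-0.20221128.6c0825e2a6"

echo "version: $(split_version "$sample")"
echo "release: $(split_release "$sample")"

# The third dot-separated field of the release is the commit hash, which the
# script compares against the current commit to decide whether to regenerate.
echo "commit:  $(split_release "$sample" | cut -d . -f 3)"
```

Note that a custom build tag containing dots (the script's comment suggests `counter.your_name`) would shift the dot-separated fields, so the third field would no longer be the commit hash in this sketch.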

2

abseil

Submodule abseil updated: 1e3d25b265...f70eadadd7

									
										2

absl-flat_hash_map.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2020 ScyllaDB

				 * Copyright (C) 2020-present ScyllaDB

				 */

				/*

									
										2

absl-flat_hash_map.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2020 ScyllaDB

				 * Copyright (C) 2020-present ScyllaDB

				 */

				/*

									
										67

alternator/auth.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -24,7 +24,6 @@

				#include <string>

				#include <string_view>

				#include <gnutls/crypto.h>

				#include <seastar/util/defer.hh>

				#include "hashers.hh"

				#include "bytes.hh"

				#include "alternator/auth.hh"

				@@ -32,8 +31,13 @@

				#include "auth/common.hh"

				#include "auth/password_authenticator.hh"

				#include "auth/roles-metadata.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "service/storage_proxy.hh"

				#include "alternator/executor.hh"

				#include "cql3/selection/selection.hh"

				#include "database.hh"

				#include "query-result-set.hh"

				#include "cql3/result_set.hh"

				#include <seastar/core/coroutine.hh>

				namespace alternator {

				@@ -62,6 +66,14 @@ static std::string apply_sha256(std::string_view msg) {

				    return to_hex(hasher.finalize());

				}

				static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {

				    sha256_hasher hasher;

				    for (const temporary_buffer<char>& buf : msg) {

				        hasher.update(buf.get(), buf.size());

				    }

				    return to_hex(hasher.finalize());

				}

				static std::string format_time_point(db_clock::time_point tp) {

				    time_t time_point_repr = db_clock::to_time_t(tp);

				    std::string time_point_str;

				@@ -91,7 +103,7 @@ void check_expiry(std::string_view signature_date) {

				std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,

				        std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,

				        std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {

				        const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {

				    auto amz_date_it = signed_headers_map.find("x-amz-date");

				    if (amz_date_it == signed_headers_map.end()) {

				        throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");

				@@ -124,23 +136,36 @@ std::string get_signature(std::string_view access_key_id, std::string_view secre

				    return to_hex(bytes_view(reinterpret_cast<const int8_t*>(signature.data()), signature.size()));

				}

				future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username) {

				    static const sstring query = format("SELECT salted_hash FROM {} WHERE {} = ?",

				            auth::meta::roles_table::qualified_name, auth::meta::roles_table::role_col_name);

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {

				    schema_ptr schema = proxy.get_db().local().find_schema("system_auth", "roles");

				    partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));

				    dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};

				    std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};

				    const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));

				    if (!salted_hash_col) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));

				    }

				    auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});

				    auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));

				    auto cl = auth::password_authenticator::consistency_for_user(username);

				    return qp.execute_internal(query, cl, auth::internal_distributed_query_state(), {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {

				        auto res = f.get0();

				        auto salted_hash = std::optional<sstring>();

				        if (res->empty()) {

				            throw api_error::unrecognized_client(fmt::format("User not found: {}", username));

				        }

				        salted_hash = res->one().get_opt<sstring>("salted_hash");

				        if (!salted_hash) {

				            throw api_error::unrecognized_client(fmt::format("No password found for user: {}", username));

				        }

				        return make_ready_future<std::string>(*salted_hash);

				    });

				    service::client_state client_state{service::client_state::internal_tag()};

				    service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,

				            service::storage_proxy::coordinator_query_options(executor::default_timeout(), empty_service_permit(), client_state));

				    cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());

				    query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

				    auto result_set = builder.build();

				    if (result_set->empty()) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("User not found: {}", username)));

				    }

				    const bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column

				    if (!salted_hash) {

				        co_return coroutine::make_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));

				    }

				    co_return value_cast<sstring>(utf8_type->deserialize(*salted_hash));

				}

				}
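
The `apply_sha256` overload added in this file feeds each `temporary_buffer<char>` chunk of the request body to the hasher in turn instead of requiring one contiguous string. This relies on SHA-256 being a streaming hash: the digest of the chunks fed in order equals the digest of their concatenation. A small sketch of that property, assuming GNU coreutils `sha256sum` is installed; the function name `hash_chunks` and the sample body are made up for illustration and are not Scylla code.

```shell
#!/bin/sh
# Hypothetical helper: hash a body supplied as separate chunks by streaming
# them, in order, into a single sha256sum process.
hash_chunks() {
    for chunk in "$@"; do
        printf '%s' "$chunk"
    done | sha256sum | cut -d ' ' -f 1
}

# Digest of the whole body in one piece.
whole=$(printf '%s' '{"TableName": "t"}' | sha256sum | cut -d ' ' -f 1)

# Digest of the same bytes split into three arbitrary chunks.
chunked=$(hash_chunks '{"Table' 'Name": ' '"t"}')

[ "$whole" = "$chunked" ] && echo "digests match"
```

Where the chunk boundaries fall does not affect the result, which is presumably why the signature check can accept `std::vector<temporary_buffer<char>>` without first copying the body into one buffer.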

									
										10

alternator/auth.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -27,8 +27,8 @@

				#include "gc_clock.hh"

				#include "utils/loading_cache.hh"

				namespace cql3 {

				class query_processor;

				namespace service {

				class storage_proxy;

				}

				namespace alternator {

				@@ -39,8 +39,8 @@ using key_cache = utils::loading_cache<std::string, std::string>;

				std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,

				        std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,

				        std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);

				        const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);

				future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);

				}

									
										241

alternator/conditions.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -28,7 +28,8 @@

				#include <unordered_map>

				#include "utils/rjson.hh"

				#include "serialization.hh"

				#include "base64.hh"

				#include "utils/base64.hh"

				#include "utils/rjson.hh"

				#include <stdexcept>

				#include <boost/algorithm/cxx11/all_of.hpp>

				#include <boost/algorithm/cxx11/any_of.hpp>

				@@ -123,7 +124,7 @@ struct rjson_engaged_ptr_comp {

				// as internally they're stored in an array, and the order of elements is

				// not important in set equality. See issue #5021

				static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {

				    if (set1.Size() != set2.Size()) {

				    if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {

				        return false;

				    }

				    std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;

				@@ -137,45 +138,107 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2

				    }

				    return true;

				}

				// Moreover, the JSON being compared can be a nested document with outer

				// layers of lists and maps and some inner set - and we need to get to that

				// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).

				static bool check_EQ(const rjson::value* v1, const rjson::value& v2);

				static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {

				    if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {

				        return false;

				    }

				    auto it1 = list1.Begin();

				    auto it2 = list2.Begin();

				    while (it1 != list1.End()) {

				        // Note: Alternator limits an item's depth (rjson::parse() limits

				        // it to around 37 levels), so this recursion is safe.

				        if (!check_EQ(&*it1, *it2)) {

				            return false;

				        }

				        ++it1;

				        ++it2;

				    }

				    return true;

				}

				static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {

				    if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {

				        return false;

				    }

				    for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {

				        auto it2 = list2.FindMember(it1->name);

				        if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {

				            return false;

				        }

				    }

				    return true;

				}

				// Check if two JSON-encoded values match with the EQ relation

				static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {

				    if (!v1) {

				        return false;

				    }

				    if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {

				    if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {

				        auto it1 = v1->MemberBegin();

				        auto it2 = v2.MemberBegin();

				        if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {

				            return check_EQ_for_sets(it1->value, it2->value);

				        if (it1->name != it2->name) {

				            return false;

				        }

				        if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {

				            return check_EQ_for_sets(it1->value, it2->value);

				        } else if(it1->name == "L") {

				            return check_EQ_for_lists(it1->value, it2->value);

				        } else if(it1->name == "M") {

				            return check_EQ_for_maps(it1->value, it2->value);

				        } else {

				            // Other, non-nested types (number, string, etc.) can be compared

				            // literally, comparing their JSON representation.

				            return it1->value == it2->value;

				        }

				    } else {

				        // If v1 and/or v2 are missing (IsNull()) the result should be false.

				        // In the unlikely case that the object is malformed (issue #8070),

				        // let's also return false.

				        return false;

				    }

				    return *v1 == v2;

				}

				// Check if two JSON-encoded values match with the NE relation

				static bool check_NE(const rjson::value* v1, const rjson::value& v2) {

				    return !v1 || *v1 != v2; // null is unequal to anything.

				    return !check_EQ(v1, v2);

				}

				// Check if two JSON-encoded values match with the BEGINS_WITH relation

				static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {

				    // BEGINS_WITH requires that its single operand (v2) be a string or

				    // binary - otherwise it's a validation error. However, problems with

				    // the stored attribute (v1) will just return false (no match).

				    if (!v2.IsObject() || v2.MemberCount() != 1) {

				        throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));

				    }

				    auto it2 = v2.MemberBegin();

				    if (it2->name != "S" && it2->name != "B") {

				        throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));

				    }

				bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,

				                       bool v1_from_query, bool v2_from_query) {

				    bool bad = false;

				    if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {

				        if (v1_from_query) {

				            throw api_error::validation("begins_with() encountered malformed argument");

				        } else {

				            bad = true;

				        }

				    } else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {

				        if (v1_from_query) {

				            throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));

				        } else {

				            bad = true;

				        }

				    }

				    if (!v2.IsObject() || v2.MemberCount() != 1) {

				        if (v2_from_query) {

				            throw api_error::validation("begins_with() encountered malformed argument");

				        } else {

				            bad = true;

				        }

				    } else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {

				        if (v2_from_query) {

				            throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));

				        } else {

				            bad = true;

				        }

				    }

				    if (bad) {

				        return false;

				    }

				    auto it1 = v1->MemberBegin();

				    auto it2 = v2.MemberBegin();

				    if (it1->name != it2->name) {

				        return false;

				    }

				@@ -200,7 +263,7 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2) {

				    if (kv1.name == "S" && kv2.name == "S") {

				        return rjson::to_string_view(kv1.value).find(rjson::to_string_view(kv2.value)) != std::string_view::npos;

				    } else if (kv1.name == "B" && kv2.name == "B") {

				        return base64_decode(kv1.value).find(base64_decode(kv2.value)) != bytes::npos;

				        return rjson::base64_decode(kv1.value).find(rjson::base64_decode(kv2.value)) != bytes::npos;

				    } else if (is_set_of(kv1.name, kv2.name)) {

				        for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {

				            if (*i == kv2.value) {

				@@ -279,24 +342,40 @@ static bool check_NOT_NULL(const rjson::value* val) {

				    return val != nullptr;

				}

				// Only types S, N or B (string, number or bytes) may be compared by the

				// various comparison operators - lt, le, gt, ge, and between.

				// Note that in particular, if the value is missing (v->IsNull()), this

				// check returns false.

				static bool check_comparable_type(const rjson::value& v) {

				    if (!v.IsObject() || v.MemberCount() != 1) {

				        return false;

				    }

				    const rjson::value& type = v.MemberBegin()->name;

				    return type == "S" || type == "N" || type == "B";

				}

				// Check if two JSON-encoded values match with cmp.

				template <typename Comparator>

				bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {

				    if (!v2.IsObject() || v2.MemberCount() != 1) {

				        throw api_error::validation(

				                        format("{} requires a single AttributeValue of type String, Number, or Binary",

				                               cmp.diagnostic));

				bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,

				                   bool v1_from_query, bool v2_from_query) {

				    bool bad = false;

				    if (!v1 || !check_comparable_type(*v1)) {

				        if (v1_from_query) {

				            throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));

				        }

				        bad = true;

				    }

				    const auto& kv2 = *v2.MemberBegin();

				    if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {

				        throw api_error::validation(

				                        format("{} requires a single AttributeValue of type String, Number, or Binary",

				                               cmp.diagnostic));

				    if (!check_comparable_type(v2)) {

				        if (v2_from_query) {

				            throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));

				        }

				        bad = true;

				    }

				    if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {

				    if (bad) {

				        return false;

				    }

				    const auto& kv1 = *v1->MemberBegin();

				    const auto& kv2 = *v2.MemberBegin();

				    if (kv1.name != kv2.name) {

				        return false;

				    }

				@@ -308,9 +387,10 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara

				                   std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));

				    }

				    if (kv1.name == "B") {

				        return cmp(base64_decode(kv1.value), base64_decode(kv2.value));

				        return cmp(rjson::base64_decode(kv1.value), rjson::base64_decode(kv2.value));

				    }

				    clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");

				    // cannot reach here, as check_comparable_type() verifies the type is one

				    // of the above options.

				    return false;

				}

				@@ -341,56 +421,71 @@ struct cmp_gt {

				    static constexpr const char* diagnostic = "GT operator";

				};

				// True if v is between lb and ub, inclusive.  Throws if lb > ub.

				// True if v is between lb and ub, inclusive.  Throws or returns false

				// (depending on bounds_from_query parameter) if lb > ub.

				template <typename T>

				static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {

				static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {

				    if (cmp_lt()(ub, lb)) {

				        throw api_error::validation(

				                        format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));

				        if (bounds_from_query) {

				            throw api_error::validation(

				                format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));

				        } else {

				            return false;

				        }

				    }

				    return cmp_ge()(v, lb) && cmp_le()(v, ub);

				}

				static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {

				    if (!v) {

				static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,

				                          bool v_from_query, bool lb_from_query, bool ub_from_query) {

				    if ((v && v_from_query && !check_comparable_type(*v)) ||

				        (lb_from_query && !check_comparable_type(lb)) ||

				        (ub_from_query && !check_comparable_type(ub))) {

				        throw api_error::validation("between allow only the types String, Number, or Binary");

				    }

				    if (!v || !v->IsObject() || v->MemberCount() != 1 ||

				        !lb.IsObject() || lb.MemberCount() != 1 ||

				        !ub.IsObject() || ub.MemberCount() != 1) {

				        return false;

				    }

				    if (!v->IsObject() || v->MemberCount() != 1) {

				        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));

				    }

				    if (!lb.IsObject() || lb.MemberCount() != 1) {

				        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));

				    }

				    if (!ub.IsObject() || ub.MemberCount() != 1) {

				        throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));

				    }

				    const auto& kv_v = *v->MemberBegin();

				    const auto& kv_lb = *lb.MemberBegin();

				    const auto& kv_ub = *ub.MemberBegin();

				    bool bounds_from_query = lb_from_query && ub_from_query;

				    if (kv_lb.name != kv_ub.name) {

				        throw api_error::validation(

				        if (bounds_from_query) {

				           throw api_error::validation(

				                format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",

				                       kv_lb.name, kv_ub.name));

				        } else {

				            return false;

				        }

				    }

				    if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.

				        return false;

				    }

				    if (kv_v.name == "N") {

				        const char* diag = "BETWEEN operator";

				        return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));

				        return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);

				    }

				    if (kv_v.name == "S") {

				        return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),

				                             std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),

				                             std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));

				                             std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),

				                             bounds_from_query);

				    }

				    if (kv_v.name == "B") {

				        return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));

				        return check_BETWEEN(rjson::base64_decode(kv_v.value), rjson::base64_decode(kv_lb.value), rjson::base64_decode(kv_ub.value), bounds_from_query);

				    }

				    throw api_error::validation(

				        format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",

				    if (v_from_query) {

				        throw api_error::validation(

				            format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",

				               kv_lb.name));

				    } else {

				        return false;

				    }

				}

				// Verify one Expect condition on one attribute (whose content is "got")

				@@ -437,19 +532,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu

				            return check_NE(got, (*attribute_value_list)[0]);

				        case comparison_operator_type::LT:

				            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				            return check_compare(got, (*attribute_value_list)[0], cmp_lt{});

				            return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);

				        case comparison_operator_type::LE:

				            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				            return check_compare(got, (*attribute_value_list)[0], cmp_le{});

				            return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);

				        case comparison_operator_type::GT:

				            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				            return check_compare(got, (*attribute_value_list)[0], cmp_gt{});

				            return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);

				        case comparison_operator_type::GE:

				            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				            return check_compare(got, (*attribute_value_list)[0], cmp_ge{});

				            return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);

				        case comparison_operator_type::BEGINS_WITH:

				            verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				            return check_BEGINS_WITH(got, (*attribute_value_list)[0]);

				            return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);

				        case comparison_operator_type::IN:

				            verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);

				            return check_IN(got, *attribute_value_list);

				@@ -461,7 +556,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu

				            return check_NOT_NULL(got);

				        case comparison_operator_type::BETWEEN:

				            verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);

				            return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);

				            return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],

				                                 false, true, true);

				        case comparison_operator_type::CONTAINS:

				            {

				                verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);

				@@ -573,7 +669,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con

				            // Shouldn't happen unless we have a bug in the parser

				            throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));

				        }

				        return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);

				        return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],

				                             cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());

				    case parsed::primitive_condition::type::IN:

				        return check_IN(calculated_values);

				    case parsed::primitive_condition::type::VALUE:

				@@ -604,13 +701,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con

				    case parsed::primitive_condition::type::NE:

				        return check_NE(&calculated_values[0], calculated_values[1]);

				    case parsed::primitive_condition::type::GT:

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},

				            cond._values[0].is_constant(), cond._values[1].is_constant());

				    case parsed::primitive_condition::type::GE:

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},

				            cond._values[0].is_constant(), cond._values[1].is_constant());

				    case parsed::primitive_condition::type::LT:

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},

				            cond._values[0].is_constant(), cond._values[1].is_constant());

				    case parsed::primitive_condition::type::LE:

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});

				        return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},

				            cond._values[0].is_constant(), cond._values[1].is_constant());

				    default:

				        // Shouldn't happen unless we have a bug in the parser

				        throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));

									
										3

alternator/conditions.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)

				bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);

				bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);

				bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);

				bool verify_condition_expression(

				        const parsed::condition_expression& condition_expression,
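Note on the new `check_BEGINS_WITH(v1, v2, v1_from_query, v2_from_query)` declaration above: the two extra flags record whether each operand came from the user's request or from the stored item. Per the comment removed from expressions.cc in this change, an unsupported operand in the query is a validation error, while an unsupported operand in the item silently yields "no match". The following is a minimal sketch of that rule only, not the actual implementation. `typed_value`, `usable` and `begins_with_sketch` are hypothetical names, values are modeled as a plain type tag plus payload rather than rjson, and binary payloads are compared raw instead of going through base64.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical, simplified attribute value: a DynamoDB type tag
// ("S", "B", "N", ...) plus its payload as a plain string.
struct typed_value {
    std::string type;
    std::string payload;
};

// Returns true if the operand can take part in a begins_with comparison.
// An unusable operand that came from the query is reported as an error;
// an unusable operand that came from the item just means "no match".
static bool usable(const typed_value* v, bool from_query) {
    if (v && (v->type == "S" || v->type == "B")) {
        return true;
    }
    if (from_query) {
        throw std::invalid_argument("begins_with() supports only string or binary");
    }
    return false;
}

bool begins_with_sketch(const typed_value* v1, const typed_value& v2,
                        bool v1_from_query, bool v2_from_query) {
    // Check both operands before deciding, so that a bad constant in the
    // query is reported even when the other operand is also unusable.
    bool ok1 = usable(v1, v1_from_query);
    bool ok2 = usable(&v2, v2_from_query);
    if (!ok1 || !ok2 || v1->type != v2.type) {
        return false;
    }
    // Prefix test: compare the first v2.payload.size() characters of v1.
    return v1->payload.compare(0, v2.payload.size(), v2.payload) == 0;
}
```

In the legacy `Expected`/`AttributeValueList` path shown earlier in this diff, the call sites pass `false, true`: the attribute comes from the item and the operand from the request.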

									
										128

alternator/controller.cc
									
										Normal file
									
												
				@@ -0,0 +1,128 @@

				/*

				 * Copyright (C) 2021-present ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <seastar/net/dns.hh>

				#include "controller.hh"

				#include "server.hh"

				#include "executor.hh"

				#include "rmw_operation.hh"

				#include "db/config.hh"

				#include "cdc/generation_service.hh"

				#include "service/memory_limiter.hh"

				using namespace seastar;

				namespace alternator {

				static logging::logger logger("alternator_controller");

				controller::controller(

				        sharded<gms::gossiper>& gossiper,

				        sharded<service::storage_proxy>& proxy,

				        sharded<service::migration_manager>& mm,

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				        sharded<service::memory_limiter>& memory_limiter,

				        const db::config& config)

				    : _gossiper(gossiper)

				    , _proxy(proxy)

				    , _mm(mm)

				    , _sys_dist_ks(sys_dist_ks)

				    , _cdc_gen_svc(cdc_gen_svc)

				    , _memory_limiter(memory_limiter)

				    , _config(config)

				{

				}

				future<> controller::start() {

				    return seastar::async([this] {

				        auto preferred = _config.listen_interface_prefer_ipv6() ? std::make_optional(net::inet_address::family::INET6) : std::nullopt;

				        auto family = _config.enable_ipv6_dns_lookup() || preferred ? std::nullopt : std::make_optional(net::inet_address::family::INET);

				        // Create an smp_service_group to be used for limiting the

				        // concurrency when forwarding Alternator request between

				        // shards - if necessary for LWT.

				        smp_service_group_config c;

				        c.max_nonlocal_requests = 5000;

				        _ssg = create_smp_service_group(c).get0();

				        rmw_operation::set_default_write_isolation(_config.alternator_write_isolation());

				        executor::set_default_timeout(std::chrono::milliseconds(_config.alternator_timeout_in_ms()));

				        net::inet_address addr;

				        try {

				            addr = net::dns::get_host_by_name(_config.alternator_address(), family).get0().addr_list.front();

				        } catch (...) {

				            std::throw_with_nested(std::runtime_error(fmt::format("Unable to resolve alternator_address {}", _config.alternator_address())));

				        }

				        auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };

				        _executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();

				        _server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();

				        std::optional<uint16_t> alternator_port;

				        if (_config.alternator_port()) {

				            alternator_port = _config.alternator_port();

				        }

				        std::optional<uint16_t> alternator_https_port;

				        std::optional<tls::credentials_builder> creds;

				        if (_config.alternator_https_port()) {

				            alternator_https_port = _config.alternator_https_port();

				            creds.emplace();

				            auto opts = _config.alternator_encryption_options();

				            if (opts.empty()) {

				                // Earlier versions mistakenly configured Alternator's

				                // HTTPS parameters via the "server_encryption_option"

				                // configuration parameter. We *temporarily* continue

				                // to allow this, for backward compatibility.

				                opts = _config.server_encryption_options();

				                if (!opts.empty()) {

				                logger.warn("Setting server_encryption_options to configure "

				                        "Alternator's HTTPS encryption is deprecated. Please "

				                        "switch to setting alternator_encryption_options instead.");

				                }

				            }

				            opts.erase("require_client_auth");

				            opts.erase("truststore");

				            utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();

				        }

				        bool alternator_enforce_authorization = _config.alternator_enforce_authorization();

				        _server.invoke_on_all(

				                [this, addr, alternator_port, alternator_https_port, creds = std::move(creds), alternator_enforce_authorization] (server& server) mutable {

				            return server.init(addr, alternator_port, alternator_https_port, creds, alternator_enforce_authorization,

				                    &_memory_limiter.local().get_semaphore(),

				                    _config.max_concurrent_requests_per_shard);

				        }).then([addr, alternator_port, alternator_https_port] {

				            logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",

				                    addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");

				        }).get();

				    });

				}

				future<> controller::stop() {

				    return seastar::async([this] {

				        _server.stop().get();

				        _executor.stop().get();

				        destroy_smp_service_group(_ssg.value()).get();

				    });

				}

				}

									
										82

alternator/controller.hh
									
										Normal file
									
												
				@@ -0,0 +1,82 @@

				/*

				 * Copyright (C) 2021-present ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <seastar/core/sharded.hh>

				#include <seastar/core/smp.hh>

				namespace service {

				class storage_proxy;

				class migration_manager;

				class memory_limiter;

				}

				namespace db {

				class system_distributed_keyspace;

				class config;

				}

				namespace cdc {

				class generation_service;

				}

				namespace gms {

				class gossiper;

				}

				namespace alternator {

				using namespace seastar;

				class executor;

				class server;

				class controller {

				    sharded<gms::gossiper>& _gossiper;

				    sharded<service::storage_proxy>& _proxy;

				    sharded<service::migration_manager>& _mm;

				    sharded<db::system_distributed_keyspace>& _sys_dist_ks;

				    sharded<cdc::generation_service>& _cdc_gen_svc;

				    sharded<service::memory_limiter>& _memory_limiter;

				    const db::config& _config;

				    sharded<executor> _executor;

				    sharded<server> _server;

				    std::optional<smp_service_group> _ssg;

				public:

				    controller(

				        sharded<gms::gossiper>& gossiper,

				        sharded<service::storage_proxy>& proxy,

				        sharded<service::migration_manager>& mm,

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				        sharded<service::memory_limiter>& memory_limiter,

				        const db::config& config);

				    future<> start();

				    future<> stop();

				};

				}

									
										18

alternator/error.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -34,7 +34,7 @@ namespace alternator {

				// "ResourceNotFoundException", and a human readable message.

				// Eventually alternator::api_handler will convert a returned or thrown

				// api_error into a JSON object, and that is returned to the user.

				class api_error final {

				class api_error final : public std::exception {

				public:

				    using status_type = httpd::reply::status_type;

				    status_type _http_code;

				@@ -59,6 +59,9 @@ public:

				    static api_error invalid_signature(std::string msg) {

				        return api_error("InvalidSignatureException", std::move(msg));

				    }

				    static api_error missing_authentication_token(std::string msg) {

				        return api_error("MissingAuthenticationTokenException", std::move(msg));

				    }

				    static api_error unrecognized_client(std::string msg) {

				        return api_error("UnrecognizedClientException", std::move(msg));

				    }

				@@ -77,9 +80,20 @@ public:

				    static api_error trimmed_data_access_exception(std::string msg) {

				        return api_error("TrimmedDataAccessException", std::move(msg));

				    }

				    static api_error request_limit_exceeded(std::string msg) {

				        return api_error("RequestLimitExceeded", std::move(msg));

				    }

				    static api_error internal(std::string msg) {

				        return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);

				    }

				    // Provide the "std::exception" interface, to make it easier to print this

				    // exception in log messages. Note that this function is *not* used to

				    // format the error to send it back to the client - server.cc has

				    // generate_error_reply() to format an api_error as the DynamoDB protocol

				    // requires.

				    virtual const char* what() const noexcept override;

				    mutable std::string _what_string;

				};

				}
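Note on the `what()` / `mutable std::string _what_string` pair added to `api_error` above: the diff shows only the declaration, so the following is a hedged sketch of one common way to implement such an override, not the code from this commit. The class name `api_error_sketch`, its two fields, and the `"type: message"` format are assumptions made for illustration. The string is built lazily and cached in the mutable member, because `what()` is `const` and must return a pointer that stays valid after the call.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for alternator::api_error.
class api_error_sketch final : public std::exception {
public:
    std::string _type;
    std::string _msg;

    api_error_sketch(std::string type, std::string msg)
        : _type(std::move(type)), _msg(std::move(msg)) {}

    // Build the log-friendly text on first use and cache it. The format
    // "<type>: <message>" is an assumption, not taken from the source.
    // Note: string concatenation can throw std::bad_alloc, which would
    // terminate the program inside a noexcept function.
    const char* what() const noexcept override {
        if (_what_string.empty()) {
            _what_string = _type + ": " + _msg;
        }
        return _what_string.c_str();
    }

    mutable std::string _what_string;
};
```

With this in place, generic code that catches `const std::exception&` can log the error without knowing about the Alternator-specific type.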

966

alternator/executor.cc

File diff suppressed because it is too large

									
										106

alternator/executor.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -27,9 +27,9 @@

				#include <seastar/json/json_elements.hh>

				#include <seastar/core/sharded.hh>

				#include "service/storage_proxy.hh"

				#include "service/migration_manager.hh"

				#include "service/client_state.hh"

				#include "service_permit.hh"

				#include "db/timeout_clock.hh"

				#include "alternator/error.hh"

				@@ -50,7 +50,17 @@ namespace cql3::selection {

				}

				namespace service {

				    class storage_service;

				    class storage_proxy;

				}

				namespace cdc {

				    class metadata;

				}

				namespace gms {

				class gossiper;

				}

				namespace alternator {

				@@ -70,11 +80,81 @@ public:

				    std::string to_json() const override;

				};

				namespace parsed {

				class path;

				};

				const std::map<sstring, sstring>& get_tags_of_table(schema_ptr schema);

				future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map);

				schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);

				// An attribute_path_map object is used to hold data for various attributes

				// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path

				// has a root attribute, which is then modified by member and index operators -

				// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then

				// "[2]" index, and finally ".c" member.

				// Data can be added to an attribute_path_map using the add() function, but

				// requires that attributes with data not be *overlapping* or *conflicting*:

				//

				// 1. Two attribute paths which are identical or an ancestor of one another

				//    are considered *overlapping* and not allowed. If a.b.c has data,

				//    we can't add more data in a.b.c or any of its descendants like a.b.c.d.

				//

				// 2. Two attribute paths which need the same parent to have both a member and

				//    an index are considered *conflicting* and not allowed. E.g., if a.b has

				//    data, you can't add a[1]. The meaning of adding both would be that the

				//    attribute a is both a map and an array, which isn't sensible.

				//

				// These two requirements are common to the two places where Alternator uses

				// this abstraction to describe how a hierarchical item is to be transformed:

				//

				// 1. In ProjectionExpression: for filtering from a full top-level attribute

				//    only the parts the user asked for in ProjectionExpression.

				//

				// 2. In UpdateExpression: for taking the previous value of a top-level

				//    attribute, and modifying it based on the instructions the user

				//    wrote in UpdateExpression.

				template<typename T>

				class attribute_path_map_node {

				public:

				    using data_t = T;

				    // We need the extra unique_ptr<> here because libstdc++ unordered_map

				    // doesn't work with incomplete types :-(

				    using members_t =  std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;

				    // The indexes list is sorted because DynamoDB requires handling writes

				    // beyond the end of a list in index order.

				    using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;

				    // The prohibition on "overlap" and "conflict" explained above means

				// that only one of data, members or indexes is non-empty.

				    std::optional<std::variant<data_t, members_t, indexes_t>> _content;

				    bool is_empty() const { return !_content; }

				    bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }

				    bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }

				    bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }

				    // get_members() assumes that has_members() is true

				    members_t& get_members() { return std::get<members_t>(*_content); }

				    const members_t& get_members() const { return std::get<members_t>(*_content); }

				    indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }

				    const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }

				    T& get_value() { return std::get<T>(*_content); }

				    const T& get_value() const { return std::get<T>(*_content); }

				};

				template<typename T>

				using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;

				using attrs_to_get_node = attribute_path_map_node<std::monostate>;

				using attrs_to_get = attribute_path_map<std::monostate>;

				class executor : public peering_sharded_service<executor> {

				    gms::gossiper& _gossiper;

				    service::storage_proxy& _proxy;

				    service::migration_manager& _mm;

				    db::system_distributed_keyspace& _sdks;

				    service::storage_service& _ss;

				    cdc::metadata& _cdc_metadata;

				    // An smp_service_group to be used for limiting the concurrency when

				    // forwarding Alternator request between shards - if necessary for LWT.

				    smp_service_group _ssg;

				@@ -87,8 +167,8 @@ public:

				    static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";

				    static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";

				    executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, smp_service_group ssg)

				        : _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _ssg(ssg) {}

				    executor(gms::gossiper& gossiper, service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, cdc::metadata& cdc_metadata, smp_service_group ssg)

				        : _gossiper(gossiper), _proxy(proxy), _mm(mm), _sdks(sdks), _cdc_metadata(cdc_metadata), _ssg(ssg) {}

				    future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				    future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				@@ -107,6 +187,8 @@ public:

				    future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);

				    future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);

				@@ -117,10 +199,12 @@ public:

				    future<> create_keyspace(std::string_view keyspace_name);

				    static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);

				    static sstring table_name(const schema&);

				    static db::timeout_clock::time_point default_timeout();

				    static void set_default_timeout(db::timeout_clock::duration timeout);

				private:

				    static db::timeout_clock::duration s_default_timeout;

				public:

				    static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);

				private:

				@@ -136,16 +220,14 @@ public:

				        const query::partition_slice&,

				        const cql3::selection::selection&,

				        const query::result&,

				        const std::unordered_set<std::string>&);

				        const attrs_to_get&);

				    static void describe_single_item(const cql3::selection::selection&,

				        const std::vector<bytes_opt>&,

				        const std::unordered_set<std::string>&,

				        const attrs_to_get&,

				        rjson::value&,

				        bool = false);

				    void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;

				    void supplement_table_info(rjson::value& descr, const schema& schema) const;

				    void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;
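Note on the `attribute_path_map` comment in this file's diff: the comment describes an `add()` function that rejects *overlapping* and *conflicting* paths, but its body is not part of the hunks shown here. The following is a minimal sketch of how those two rules can be enforced while walking a path, under stated assumptions. `path_op`, `node`, `path_map` and this `add()` signature are hypothetical simplifications (data type fixed to `int`, errors reported with `std::invalid_argument`), not the Alternator code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for one parsed::path operator: ".member" or "[index]".
using path_op = std::variant<std::string, unsigned>;

// Simplified node, modeled on attribute_path_map_node<T> with T = int.
struct node {
    using members_t = std::unordered_map<std::string, std::unique_ptr<node>>;
    using indexes_t = std::map<unsigned, std::unique_ptr<node>>;
    // Empty, a leaf with data, member children, or index children.
    std::optional<std::variant<int, members_t, indexes_t>> content;
    bool has_value() const { return content && std::holds_alternative<int>(*content); }
};

using path_map = std::unordered_map<std::string, node>;

// Hypothetical add(): stores `data` at root + ops, throwing on overlap or conflict.
void add(path_map& map, const std::string& root, const std::vector<path_op>& ops, int data) {
    node* cur = &map[root];
    for (const path_op& op : ops) {
        if (cur->has_value()) {
            // An ancestor of the new path already holds data.
            throw std::invalid_argument("overlapping paths");
        }
        std::unique_ptr<node>* child = nullptr;
        if (const std::string* member = std::get_if<std::string>(&op)) {
            if (!cur->content) {
                cur->content.emplace(node::members_t{});
            }
            auto* members = std::get_if<node::members_t>(&*cur->content);
            if (!members) {
                // The parent is already used as a list.
                throw std::invalid_argument("conflicting paths");
            }
            child = &(*members)[*member];
        } else {
            if (!cur->content) {
                cur->content.emplace(node::indexes_t{});
            }
            auto* indexes = std::get_if<node::indexes_t>(&*cur->content);
            if (!indexes) {
                // The parent is already used as a map.
                throw std::invalid_argument("conflicting paths");
            }
            child = &(*indexes)[std::get<unsigned>(op)];
        }
        if (!*child) {
            *child = std::make_unique<node>();
        }
        cur = child->get();
    }
    if (cur->content) {
        // Identical path, or the new path is an ancestor of an existing one.
        throw std::invalid_argument("overlapping paths");
    }
    cur->content.emplace(data);
    assert(cur->has_value());
}
```

For example, after adding `a.b.c`, this sketch rejects `a.b.c.d` and `a.b` as overlapping and `a[1]` as conflicting, while `a.b.d` is accepted.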

									
										193

alternator/expressions.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -21,7 +21,7 @@

				#include "expressions.hh"

				#include "serialization.hh"

				#include "base64.hh"

				#include "utils/base64.hh"

				#include "conditions.hh"

				#include "alternator/expressionsLexer.hpp"

				#include "alternator/expressionsParser.hpp"

				@@ -130,6 +130,27 @@ void condition_expression::append(condition_expression&& a, char op) {

				    }, _expression);

				}

				void path::check_depth_limit() {

				    if (1 + _operators.size() > depth_limit) {

				        throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));

				    }

				}

				std::ostream& operator<<(std::ostream& os, const path& p) {

				    os << p.root();

				    for (const auto& op : p.operators()) {

				        std::visit(overloaded_functor {

				            [&] (const std::string& member) {

				                os << '.' << member;

				            },

				            [&] (unsigned index) {

				                os << '[' << index << ']';

				            }

				        }, op);

				    }

				    return os;

				}

				} // namespace parsed

				// The following resolve_*() functions resolve references in parsed

				@@ -151,10 +172,9 @@ void condition_expression::append(condition_expression&& a, char op) {

				// we need to resolve the expression just once but then use it many times

				// (once for each item to be filtered).

				static void resolve_path(parsed::path& p,

				static std::optional<std::string> resolve_path_component(const std::string& column_name,

				        const rjson::value* expression_attribute_names,

				        std::unordered_set<std::string>& used_attribute_names) {

				    const std::string& column_name = p.root();

				    if (column_name.size() > 0 && column_name.front() == '#') {

				        if (!expression_attribute_names) {

				            throw api_error::validation(

				@@ -166,7 +186,30 @@ static void resolve_path(parsed::path& p,

				                    format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));

				        }

				        used_attribute_names.emplace(column_name);

				        p.set_root(std::string(rjson::to_string_view(*value)));

				        return std::string(rjson::to_string_view(*value));

				    }

				    return std::nullopt;

				}

				static void resolve_path(parsed::path& p,

				        const rjson::value* expression_attribute_names,

				        std::unordered_set<std::string>& used_attribute_names) {

				    std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);

				    if (r) {

				        p.set_root(std::move(*r));

				    }

				    for (auto& op : p.operators()) {

				        std::visit(overloaded_functor {

				            [&] (std::string& s) {

				                r = resolve_path_component(s, expression_attribute_names, used_attribute_names);

				                if (r) {

				                    s = std::move(*r);

				                }

				            },

				            [&] (unsigned index) {

				                // nothing to resolve

				            }

				        }, op);

				    }

				}

				@@ -385,24 +428,6 @@ void for_condition_expression_on(const parsed::condition_expression& ce, const n

				// expression. The parsed expression is assumed to have been "resolved", with

				// the matching resolve_* function.

				// Take two JSON-encoded list values (remember that a list value is

				// {"L": [...the actual list]}) and return the concatenation, again as

				// a list value.

				static rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2) {

				    const rjson::value* list1 = unwrap_list(v1);

				    const rjson::value* list2 = unwrap_list(v2);

				    if (!list1 || !list2) {

				        throw api_error::validation("UpdateExpression: list_append() given a non-list");

				    }

				    rjson::value cat = rjson::copy(*list1);

				    for (const auto& a : list2->GetArray()) {

				        rjson::push_back(cat, rjson::copy(a));

				    }

				    rjson::value ret = rjson::empty_object();

				    rjson::set(ret, "L", std::move(cat));

				    return ret;

				}

				// calculate_size() is ConditionExpression's size() function, i.e., it takes

				// a JSON-encoded value and returns its "size" as defined differently for the

				// different types - also as a JSON-encoded number.

				@@ -439,11 +464,11 @@ static rjson::value calculate_size(const rjson::value& v) {

				        ret = base64_decoded_len(rjson::to_string_view(it->value));

				    } else {

				        rjson::value json_ret = rjson::empty_object();

				        rjson::set(json_ret, "null", rjson::value(true));

				        rjson::add(json_ret, "null", rjson::value(true));

				        return json_ret;

				    }

				    rjson::value json_ret = rjson::empty_object();

				    rjson::set(json_ret, "N", rjson::from_string(std::to_string(ret)));

				    rjson::add(json_ret, "N", rjson::from_string(std::to_string(ret)));

				    return json_ret;

				}

				@@ -462,7 +487,7 @@ static const rjson::value& calculate_value(const parsed::constant& c) {

				static rjson::value to_bool_json(bool b) {

				    rjson::value json_ret = rjson::empty_object();

				    rjson::set(json_ret, "BOOL", rjson::value(b));

				    rjson::add(json_ret, "BOOL", rjson::value(b));

				    return json_ret;

				}

				@@ -487,7 +512,11 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {

				            }

				            rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);

				            rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);

				            return list_concatenate(v1, v2);

				            rjson::value ret = list_concatenate(v1, v2);

				            if (ret.IsNull()) {

				                throw api_error::validation("UpdateExpression: list_append() given a non-list");

				            }

				            return ret;

				        }

				    },

				    {"if_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {

				@@ -603,52 +632,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {

				            }

				            rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);

				            rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);

				            // TODO: There's duplication here with check_BEGINS_WITH().

				            // But unfortunately, the two functions differ a bit.

				            // If one of v1 or v2 is malformed or has an unsupported type

				            // (not B or S), what we do depends on whether it came from

				            // the user's query (is_constant()), or the item. Unsupported

				            // values in the query result in an error, but if they are in

				            // the item, we silently return false (no match).

				            bool bad = false;

				            if (!v1.IsObject() || v1.MemberCount() != 1) {

				                bad = true;

				                if (f._parameters[0].is_constant()) {

				                    throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));

				                }

				            } else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {

				                bad = true;

				                if (f._parameters[0].is_constant()) {

				                    throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));

				                }

				            }

				            if (!v2.IsObject() || v2.MemberCount() != 1) {

				                bad = true;

				                if (f._parameters[1].is_constant()) {

				                    throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));

				                }

				            } else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {

				                bad = true;

				                if (f._parameters[1].is_constant()) {

				                    throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));

				                }

				            }

				            bool ret = false;

				            if (!bad) {

				                auto it1 = v1.MemberBegin();

				                auto it2 = v2.MemberBegin();

				                if (it1->name == it2->name) {

				                    if (it2->name == "S") {

				                        std::string_view val1 = rjson::to_string_view(it1->value);

				                        std::string_view val2 = rjson::to_string_view(it2->value);

				                        ret = val1.starts_with(val2);

				                    } else /* it2->name == "B" */ {

				                        ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));

				                    }

				                }

				            }

				            return to_bool_json(ret);

				            return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1,  v2,

				                                    f._parameters[0].is_constant(), f._parameters[1].is_constant()));

				        }

				    },

				    {"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {

				@@ -667,6 +652,55 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {

				    },

				};

				// Given a parsed::path and an item read from the table, extract the value

				// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null

				// value if the item or the requested attribute does not exist.

				// Note that the item is assumed to be encoded in JSON using DynamoDB

				// conventions - each level of a nested document is a map with one key -

				// a type (e.g., "M" for map) - and its value is the representation of

				// that value.

				static rjson::value extract_path(const rjson::value* item,

				        const parsed::path& p, calculate_value_caller caller) {

				    if (!item) {

				        return rjson::null_value();

				    }

				    const rjson::value* v = rjson::find(*item, p.root());

				    if (!v) {

				        return rjson::null_value();

				    }

				    for (const auto& op : p.operators()) {

				        if (!v->IsObject() || v->MemberCount() != 1) {

				            // This shouldn't happen. We shouldn't have stored malformed

				            // objects. But today Alternator does not validate the structure

				            // of nested documents before storing them, so this can happen on

				            // read.

				            throw api_error::validation(format("{}: malformed item read: {}", caller, *item));

				        }

				        const char* type = v->MemberBegin()->name.GetString();

				        v = &(v->MemberBegin()->value);

				        std::visit(overloaded_functor {

				            [&] (const std::string& member) {

				                if (type[0] == 'M' && v->IsObject()) {

				                    v = rjson::find(*v, member);

				                } else {

				                    v = nullptr;

				                }

				            },

				            [&] (unsigned index) {

				                if (type[0] == 'L' && v->IsArray() && index < v->Size()) {

				                    v = &(v->GetArray()[index]);

				                } else {

				                    v = nullptr;

				                }

				            }

				        }, op);

				        if (!v) {

				            return rjson::null_value();

				        }

				    }

				    return rjson::copy(*v);

				}

				// Given a parsed::value, which can refer either to a constant value from

				// ExpressionAttributeValues, to the value of some attribute, or to a function

				// of other values, this function calculates the resulting value.

				@@ -684,21 +718,12 @@ rjson::value calculate_value(const parsed::value& v,

				            auto function_it = function_handlers.find(std::string_view(f._function_name));

				            if (function_it == function_handlers.end()) {

				                throw api_error::validation(

				                        format("UpdateExpression: unknown function '{}' called.", f._function_name));

				                        format("{}: unknown function '{}' called.", caller, f._function_name));

				            }

				            return function_it->second(caller, previous_item, f);

				        },

				        [&] (const parsed::path& p) -> rjson::value {

				            if (!previous_item) {

				                return rjson::null_value();

				            }

				            std::string update_path = p.root();

				            if (p.has_operators()) {

				                // FIXME: support this

				                throw api_error::validation("Reading attribute paths not yet implemented");

				            }

				            const rjson::value* previous_value = rjson::find(*previous_item, update_path);

				            return previous_value ? rjson::copy(*previous_value) : rjson::null_value();

				            return extract_path(previous_item, p, caller);

				        }

				    }, v._value);

				}
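The `extract_path()` helper in the hunk above walks a path such as `a.b[1]` through DynamoDB's typed encoding, where every nested level is a one-key map naming its type. The following is a minimal, self-contained sketch of that traversal. It does not use `rjson`; the `node` struct, `make_s()` and `walk()` are hypothetical stand-ins introduced only for illustration, and the sketch starts from the already-resolved root attribute.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for rjson::value: just enough structure to model
// DynamoDB's typed encoding ({"M": {...}}, {"L": [...]}, {"S": "..."}).
struct node;
using node_ptr = std::shared_ptr<node>;
struct node {
    std::string type;                         // "S", "M" or "L"
    std::string scalar;                       // used when type == "S"
    std::map<std::string, node_ptr> members;  // used when type == "M"
    std::vector<node_ptr> elements;           // used when type == "L"
};

// Same shape as the operators stored in parsed::path: a ".name" or an "[index]".
using path_op = std::variant<std::string, unsigned>;

// Illustrative helper that builds a string ("S") node.
node_ptr make_s(std::string s) {
    auto n = std::make_shared<node>();
    n->type = "S";
    n->scalar = std::move(s);
    return n;
}

// Apply each operator in turn. A ".name" step is valid only on a map and an
// "[index]" step only on a list with enough elements; any mismatch yields
// nullptr, mirroring the "return a null value" behaviour described above.
const node* walk(const node* v, const std::vector<path_op>& ops) {
    for (const auto& op : ops) {
        if (!v) {
            return nullptr;
        }
        if (const auto* name = std::get_if<std::string>(&op)) {
            if (v->type != "M") {
                return nullptr;
            }
            auto it = v->members.find(*name);
            v = (it == v->members.end()) ? nullptr : it->second.get();
        } else {
            unsigned index = std::get<unsigned>(op);
            if (v->type != "L" || index >= v->elements.size()) {
                return nullptr;
            }
            v = v->elements[index].get();
        }
    }
    return v;
}
```

Unlike the real function, this sketch returns a pointer into the original tree instead of a copy, and it treats malformed input as "not found" rather than raising a validation error.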

5

alternator/expressions.g


@@ -1,8 +1,5 @@
 /*
  * Copyright 2019 ScyllaDB
  *
  * This file is part of Scylla. See the LICENSE.PROPRIETARY file in the
  * top-level directory for licensing information.
  * Copyright 2019-present ScyllaDB
  */
 /*

									
										2

alternator/expressions.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

									
										17

alternator/expressions_types.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -49,15 +49,23 @@ class path {

				    // dot (e.g., ".xyz").

				    std::string _root;

				    std::vector<std::variant<std::string, unsigned>> _operators;

				    // It is useful to limit the depth of a user-specified path, because it

				    // allows us to use recursive algorithms without worrying about recursion

				    // depth. DynamoDB officially limits the length of paths to 32 components

				    // (including the root) so let's use the same limit.

				    static constexpr unsigned depth_limit = 32;

				    void check_depth_limit();

				public:

				    void set_root(std::string root) {

				        _root = std::move(root);

				    }

				    void add_index(unsigned i) {

				        _operators.emplace_back(i);

				        check_depth_limit();

				    }

				    void add_dot(std::string(name)) {

				        _operators.emplace_back(std::move(name));

				        check_depth_limit();

				    }

				    const std::string& root() const {

				        return _root;

				@@ -65,6 +73,13 @@ public:

				    bool has_operators() const {

				        return !_operators.empty();

				    }

				    const std::vector<std::variant<std::string, unsigned>>& operators() const {

				        return _operators;

				    }

				    std::vector<std::variant<std::string, unsigned>>& operators() {

				        return _operators;

				    }

				    friend std::ostream& operator<<(std::ostream&, const path&);

				};
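The hunk above declares `check_depth_limit()` but its definition is not shown here. As a rough sketch of how such a limit could be enforced, the class below counts the root as one component and rejects the path once it exceeds 32 components. The class name `limited_path`, the `depth()` accessor and the use of `std::length_error` are assumptions made for this example, not the actual Alternator implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified version of the path class from the diff.
class limited_path {
    std::string _root;
    std::vector<std::variant<std::string, unsigned>> _operators;
    static constexpr unsigned depth_limit = 32;

    // Assumption: the root counts as one component, as the comment
    // "32 components (including the root)" suggests.
    void check_depth_limit() {
        if (1 + _operators.size() > depth_limit) {
            throw std::length_error("attribute path exceeds 32 components");
        }
    }
public:
    void set_root(std::string root) {
        _root = std::move(root);
    }
    void add_index(unsigned i) {
        _operators.emplace_back(i);
        check_depth_limit();
    }
    void add_dot(std::string name) {
        _operators.emplace_back(std::move(name));
        check_depth_limit();
    }
    // Number of components, including the root.
    std::size_t depth() const {
        return 1 + _operators.size();
    }
};
```

Checking on every `add_index()` / `add_dot()` call means an over-long path is rejected while it is being built, so later recursive code never sees one.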

				// When an expression is first parsed, all constants are references, like

									
										5

alternator/rmw_operation.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2020 ScyllaDB

				 * Copyright 2020-present ScyllaDB

				 */

				/*

				@@ -22,8 +22,7 @@

				#pragma once

				#include "seastarx.hh"

				#include "service/storage_proxy.hh"

				#include "service/storage_proxy.hh"

				#include "service/paxos/cas_request.hh"

				#include "utils/rjson.hh"

				#include "executor.hh"

									
										44

alternator/serialization.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -19,7 +19,8 @@

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "base64.hh"

				#include "utils/base64.hh"

				#include "utils/rjson.hh"

				#include "log.hh"

				#include "serialization.hh"

				#include "error.hh"

				@@ -68,7 +69,7 @@ struct from_json_visitor {

				        bo.write(t.from_string(rjson::to_string_view(v)));

				    }

				    void operator()(const bytes_type_impl& t) const {

				        bo.write(base64_decode(v));

				        bo.write(rjson::base64_decode(v));

				    }

				    void operator()(const boolean_type_impl& t) const {

				        bo.write(boolean_type->decompose(v.GetBool()));

				@@ -114,18 +115,18 @@ struct to_json_visitor {

				    void operator()(const decimal_type_impl& t) const {

				        auto s = to_json_string(*decimal_type, bytes(bv));

				        //FIXME(sarna): unnecessary copy

				        rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(s));

				        rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(s));

				    }

				    void operator()(const string_type_impl& t) {

				        rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(reinterpret_cast<const char *>(bv.data()), bv.size()));

				        rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(reinterpret_cast<const char *>(bv.data()), bv.size()));

				    }

				    void operator()(const bytes_type_impl& t) const {

				        std::string b64 = base64_encode(bv);

				        rjson::set_with_string_name(deserialized, type_ident, rjson::from_string(b64));

				        rjson::add_with_string_name(deserialized, type_ident, rjson::from_string(b64));

				    }

				    // default

				    void operator()(const abstract_type& t) const {

				        rjson::set_with_string_name(deserialized, type_ident, rjson::parse(to_json_string(t, bytes(bv))));

				        rjson::add_with_string_name(deserialized, type_ident, rjson::parse(to_json_string(t, bytes(bv))));

				    }

				};

				@@ -196,7 +197,7 @@ bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column

				                format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));

				    }

				    if (column.type == bytes_type) {

				        return base64_decode(it->value);

				        return rjson::base64_decode(it->value);

				    } else {

				        return column.type->from_string(rjson::to_string_view(it->value));

				    }

				@@ -301,7 +302,7 @@ rjson::value number_add(const rjson::value& v1, const rjson::value& v2) {

				    auto n2 = unwrap_number(v2, "UpdateExpression");

				    rjson::value ret = rjson::empty_object();

				    std::string str_ret = std::string((n1 + n2).to_string());

				    rjson::set(ret, "N", rjson::from_string(str_ret));

				    rjson::add(ret, "N", rjson::from_string(str_ret));

				    return ret;

				}

				@@ -310,7 +311,7 @@ rjson::value number_subtract(const rjson::value& v1, const rjson::value& v2) {

				    auto n2 = unwrap_number(v2, "UpdateExpression");

				    rjson::value ret = rjson::empty_object();

				    std::string str_ret = std::string((n1 - n2).to_string());

				    rjson::set(ret, "N", rjson::from_string(str_ret));

				    rjson::add(ret, "N", rjson::from_string(str_ret));

				    return ret;

				}

				@@ -336,7 +337,7 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {

				        }

				    }

				    rjson::value ret = rjson::empty_object();

				    rjson::set_with_string_name(ret, set1_type, std::move(sum));

				    rjson::add_with_string_name(ret, set1_type, std::move(sum));

				    return ret;

				}

				@@ -364,7 +365,7 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&

				        return std::nullopt;

				    }

				    rjson::value ret = rjson::empty_object();

				    rjson::set_with_string_name(ret, set1_type, rjson::empty_array());

				    rjson::add_with_string_name(ret, set1_type, rjson::empty_array());

				    rjson::value& result_set = ret[set1_type];

				    for (const auto& a : set1_raw) {

				        rjson::push_back(result_set, rjson::copy(a));

				@@ -372,4 +373,23 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&

				    return ret;

				}

				// Take two JSON-encoded list values (remember that a list value is

				// {"L": [...the actual list]}) and return the concatenation, again as

				// a list value.

				// Returns a null value if one of the arguments is not actually a list.

				rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2) {

				    const rjson::value* list1 = unwrap_list(v1);

				    const rjson::value* list2 = unwrap_list(v2);

				    if (!list1 || !list2) {

				        return rjson::null_value();

				    }

				    rjson::value cat = rjson::copy(*list1);

				    for (const auto& a : list2->GetArray()) {

				        rjson::push_back(cat, rjson::copy(a));

				    }

				    rjson::value ret = rjson::empty_object();

				    rjson::add(ret, "L", std::move(cat));

				    return ret;

				}

				}
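The new `list_concatenate()` above unwraps two `{"L": [...]}` values, appends the second list to a copy of the first, and wraps the result again, returning a null value when either argument is not a list. A minimal sketch of the same contract, without `rjson`, might look as follows. The `typed_value` struct and `concat_lists()` are hypothetical names, list elements are simplified to plain strings, and `std::nullopt` stands in for the null JSON value.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a DynamoDB typed value such as {"L": [...]}.
struct typed_value {
    std::string type;                // e.g. "L" for list, "S" for string
    std::vector<std::string> items;  // list elements, simplified to strings
};

// Concatenate two list values. If either argument is not a list,
// return std::nullopt (the sketch's equivalent of a null value).
std::optional<typed_value> concat_lists(const typed_value& v1, const typed_value& v2) {
    if (v1.type != "L" || v2.type != "L") {
        return std::nullopt;
    }
    typed_value ret{"L", v1.items};  // copy the first list
    ret.items.insert(ret.items.end(), v2.items.begin(), v2.items.end());
    return ret;
}
```

Returning an empty result instead of throwing leaves it to the caller to decide which error message fits the expression being evaluated.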

									
										8

alternator/serialization.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -85,5 +85,11 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2);

				// DynamoDB does not allow empty sets, so if resulting set is empty, return

				// an unset optional instead.

				std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value& v2);

				// Take two JSON-encoded list values (remember that a list value is

				// {"L": [...the actual list]}) and return the concatenation, again as

				// a list value.

				// Returns a null value if one of the arguments is not actually a list.

				rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2);

				}

									
										279

alternator/server.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -22,15 +22,20 @@

				#include "alternator/server.hh"

				#include "log.hh"

				#include <seastar/http/function_handlers.hh>

				#include <seastar/http/short_streams.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/json/json_elements.hh>

				#include <seastar/util/defer.hh>

				#include "seastarx.hh"

				#include "error.hh"

				#include "utils/rjson.hh"

				#include "auth.hh"

				#include <cctype>

				#include "cql3/query_processor.hh"

				#include "service/storage_service.hh"

				#include "service/storage_proxy.hh"

				#include "locator/snitch_base.hh"

				#include "gms/gossiper.hh"

				#include "utils/overloaded_functor.hh"

				#include "utils/fb_utilities.hh"

				static logging::logger slogger("alternator-server");

				@@ -59,6 +64,40 @@ inline std::vector<std::string_view> split(std::string_view text, char separator

				    return tokens;

				}

				// Handle CORS (Cross-origin resource sharing) in the HTTP request:

				// If the request has the "Origin" header specifying where the script which

				// makes this request comes from, we need to reply with the header

				// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.

				// Additionally, if preflight==true (i.e., this is an OPTIONS request),

				// the script can also "request" in headers that the server allows it to use

				// some HTTP methods and headers in the followup request, and the server

				// should respond by "allowing" them in the response headers.

				// We also add the header "Access-Control-Expose-Headers" to let the script

				// access additional headers in the response.

				// This handle_CORS() should be used when handling any HTTP method - both the

				// usual GET and POST, and also the "preflight" OPTIONS method.

				static void handle_CORS(const request& req, reply& rep, bool preflight) {

				    if (!req.get_header("origin").empty()) {

				        rep.add_header("Access-Control-Allow-Origin", "*");

				        // This is the list that DynamoDB returns for expose headers. I am

				        // not sure why not just return "*" here, what's the risk?

				        rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");

				        if (preflight) {

				            sstring s = req.get_header("Access-Control-Request-Headers");

				            if (!s.empty()) {

				                rep.add_header("Access-Control-Allow-Headers", std::move(s));

				            }

				            s = req.get_header("Access-Control-Request-Method");

				            if (!s.empty()) {

				                rep.add_header("Access-Control-Allow-Methods", std::move(s));

				            }

				            // Our CORS response never changes anyway, let the browser cache it

				            // for two hours (Chrome's maximum):

				            rep.add_header("Access-Control-Max-Age", "7200");

				        }

				    }

				}
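To make the header logic of `handle_CORS()` above easier to follow in isolation, here is a small sketch that computes the same response headers from a plain map of request headers. It does not use Seastar's `request`/`reply` types; the `headers` alias and `cors_headers()` are hypothetical, and header lookup is case-sensitive here, which is a simplification.

```cpp
#include <map>
#include <string>

using headers = std::map<std::string, std::string>;

// Compute the CORS response headers for a request. Nothing is added unless
// the request carries a non-empty "Origin" header. For a preflight (OPTIONS)
// request, the requested headers and method are echoed back as allowed, and
// the browser is told it may cache the answer for two hours.
headers cors_headers(const headers& request, bool preflight) {
    headers out;
    auto origin = request.find("Origin");
    if (origin == request.end() || origin->second.empty()) {
        return out;
    }
    out["Access-Control-Allow-Origin"] = "*";
    out["Access-Control-Expose-Headers"] =
        "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date";
    if (preflight) {
        // Copy a request header into a response header if it is present.
        auto echo = [&](const char* from, const char* to) {
            auto it = request.find(from);
            if (it != request.end() && !it->second.empty()) {
                out[to] = it->second;
            }
        };
        echo("Access-Control-Request-Headers", "Access-Control-Allow-Headers");
        echo("Access-Control-Request-Method", "Access-Control-Allow-Methods");
        out["Access-Control-Max-Age"] = "7200";
    }
    return out;
}
```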

				// DynamoDB HTTP error responses are structured as follows

				// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html

				// Our handlers throw an exception to report an error. If the exception

				@@ -93,6 +132,10 @@ public:

				                 [&] (const json::json_return_type& json_return_value) {

				                     slogger.trace("api_handler success case");

				                     if (json_return_value._body_writer) {

				                         // Unfortunately, write_body() forces us to choose

				                         // from a fixed and irrelevant list of "mime-types"

				                         // at this point. But we'll override it with the

				                         // one (application/x-amz-json-1.0) below.

				                         rep->write_body("json", std::move(json_return_value._body_writer));

				                     } else {

				                         rep->_content += json_return_value._res;

				@@ -105,14 +148,16 @@ public:

				             return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				         });

				    }), _type("json") { }

				    }) { }

				    api_handler(const api_handler&) = default;

				    future<std::unique_ptr<reply>> handle(const sstring& path,

				            std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {

				        handle_CORS(*req, *rep, false);

				        return _f_handle(std::move(req), std::move(rep)).then(

				                [this](std::unique_ptr<reply> rep) {

				                    rep->done(_type);

				                    rep->set_mime_type("application/x-amz-json-1.0");

				                    rep->done();

				                    return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				                });

				    }

				@@ -126,7 +171,6 @@ protected:

				    }

				    future_handler_function _f_handle;

				    sstring _type;

				};

				class gated_handler : public handler_base {

				@@ -146,6 +190,7 @@ public:

				    health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}

				protected:

				    virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {

				        handle_CORS(*req, *rep, false);

				        rep->set_status(reply::status_type::ok);

				        rep->write_body("txt", format("healthy: {}", req->get_header("Host")));

				        return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				@@ -153,21 +198,25 @@ protected:

				};

				class local_nodelist_handler : public gated_handler {

				    service::storage_proxy& _proxy;

				    gms::gossiper& _gossiper;

				public:

				    local_nodelist_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}

				    local_nodelist_handler(seastar::gate& pending_requests, service::storage_proxy& proxy, gms::gossiper& gossiper)

				        : gated_handler(pending_requests)

				        , _proxy(proxy)

				        , _gossiper(gossiper) {}

				protected:

				    virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {

				        rjson::value results = rjson::empty_array();

				        // It's very easy to get a list of all live nodes on the cluster,

				        // using gms::get_local_gossiper().get_live_members(). But getting

				        // using _gossiper.get_live_members(). But getting

				        // just the list of live nodes in this DC needs more elaborate code:

				        sstring local_dc = locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(

				                utils::fb_utilities::get_broadcast_address());

				        std::unordered_set<gms::inet_address> local_dc_nodes =

				                service::get_local_storage_service().get_token_metadata().

				                get_topology().get_datacenter_endpoints().at(local_dc);

				                _proxy.get_token_metadata_ptr()->get_topology().get_datacenter_endpoints().at(local_dc);

				        for (auto& ip : local_dc_nodes) {

				            if (gms::get_local_gossiper().is_alive(ip)) {

				            if (_gossiper.is_alive(ip)) {

				                rjson::push_back(results, rjson::from_string(ip.to_sstring()));

				            }

				        }

				@@ -178,10 +227,26 @@ protected:

				    }

				};

				future<> server::verify_signature(const request& req) {

				// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS

				// request before ("pre-flight") the main request. The response to this

				// request can be empty, but needs to have the right headers (which we

				// fill with handle_CORS())

				class options_handler : public gated_handler {

				public:

				    options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}

				protected:

				    virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {

				        handle_CORS(*req, *rep, true);

				        rep->set_status(reply::status_type::ok);

				        rep->write_body("txt", sstring(""));

				        return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				    }

				};

				future<std::string> server::verify_signature(const request& req, const chunked_content& content) {

				    if (!_enforce_authorization) {

				        slogger.debug("Skipping authorization");

				        return make_ready_future<>();

				        return make_ready_future<std::string>("<unauthenticated request>");

				    }

				    auto host_it = req._headers.find("Host");

				    if (host_it == req._headers.end()) {

				@@ -189,27 +254,34 @@ future<> server::verify_signature(const request& req) {

				    }

				    auto authorization_it = req._headers.find("Authorization");

				    if (authorization_it == req._headers.end()) {

				        throw api_error::invalid_signature("Authorization header is mandatory for signature verification");

				        throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");

				    }

				    std::string host = host_it->second;

				    std::vector<std::string_view> credentials_raw = split(authorization_it->second, ' ');

				    std::string_view authorization_header = authorization_it->second;

				    auto pos = authorization_header.find_first_of(' ');

				    if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {

				        throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));

				    }

				    authorization_header.remove_prefix(pos+1);

				    std::string credential;

				    std::string user_signature;

				    std::string signed_headers_str;

				    std::vector<std::string_view> signed_headers;

				    for (std::string_view entry : credentials_raw) {

				    do {

				        // Either one of a comma or space can mark the end of an entry

				        pos = authorization_header.find_first_of(" ,");

				        std::string_view entry = authorization_header.substr(0, pos);

				        if (pos != std::string_view::npos) {

				            authorization_header.remove_prefix(pos + 1);

				        }

				        if (entry.empty()) {

				            continue;

				        }

				        std::vector<std::string_view> entry_split = split(entry, '=');

				        if (entry_split.size() != 2) {

				            if (entry != "AWS4-HMAC-SHA256") {

				                throw api_error::invalid_signature(format("Only AWS4-HMAC-SHA256 algorithm is supported. Found: {}", entry));

				            }

				            continue;

				        }

				        std::string_view auth_value = entry_split[1];

				        // Commas appear as an additional (quite redundant) delimiter

				        if (auth_value.back() == ',') {

				            auth_value.remove_suffix(1);

				        }

				        if (entry_split[0] == "Credential") {

				            credential = std::string(auth_value);

				        } else if (entry_split[0] == "Signature") {

				@@ -219,7 +291,8 @@ future<> server::verify_signature(const request& req) {

				            signed_headers = split(auth_value, ';');

				            std::sort(signed_headers.begin(), signed_headers.end());

				        }

				    }

				    } while (pos != std::string_view::npos);

				    std::vector<std::string_view> credential_split = split(credential, '/');

				    if (credential_split.size() != 5) {

				        throw api_error::validation(format("Incorrect credential information format: {}", credential));

				@@ -243,10 +316,10 @@ future<> server::verify_signature(const request& req) {

				        }

				    }

				    auto cache_getter = [&qp = _qp] (std::string username) {

				        return get_key_from_roles(qp, std::move(username));

				    auto cache_getter = [&proxy = _proxy] (std::string username) {

				        return get_key_from_roles(proxy, std::move(username));

				    };

				    return _key_cache.get_ptr(user, cache_getter).then([this, &req,

				    return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,

				                                                    user = std::move(user),

				                                                    host = std::move(host),

				                                                    datestamp = std::move(datestamp),

				@@ -256,52 +329,102 @@ future<> server::verify_signature(const request& req) {

				                                                    service = std::move(service),

				                                                    user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {

				        std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,

				                datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");

				                datestamp, signed_headers_str, signed_headers_map, content, region, service, "");

				        if (signature != std::string_view(user_signature)) {

				            _key_cache.remove(user);

				            throw api_error::unrecognized_client("The security token included in the request is invalid.");

				        }

				        return user;

				    });

				}
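The reworked parsing loop in `verify_signature()` above first checks the `AWS4-HMAC-SHA256` prefix and then treats either a space or a comma as the end of an entry. The sketch below reproduces only that tokenizing step on a standalone string. `parse_authorization()` is a hypothetical helper, `std::invalid_argument` stands in for `api_error::invalid_signature`, and entries without an `=` are silently skipped, which is a simplification of this example rather than documented behaviour.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Split an AWS Signature V4 "Authorization" header into key=value fields
// (Credential, SignedHeaders, Signature).
std::map<std::string, std::string> parse_authorization(std::string_view header) {
    auto pos = header.find(' ');
    if (pos == std::string_view::npos || header.substr(0, pos) != "AWS4-HMAC-SHA256") {
        throw std::invalid_argument("Authorization header must use AWS4-HMAC-SHA256");
    }
    header.remove_prefix(pos + 1);

    std::map<std::string, std::string> fields;
    do {
        // Either a comma or a space can mark the end of an entry.
        pos = header.find_first_of(" ,");
        std::string_view entry = header.substr(0, pos);
        if (pos != std::string_view::npos) {
            header.remove_prefix(pos + 1);
        }
        if (entry.empty()) {
            continue;  // e.g. the space that follows a comma
        }
        auto eq = entry.find('=');
        if (eq == std::string_view::npos) {
            continue;  // simplification: ignore entries without '='
        }
        fields.emplace(std::string(entry.substr(0, eq)),
                       std::string(entry.substr(eq + 1)));
    } while (pos != std::string_view::npos);
    return fields;
}
```

Because delimiters are consumed one at a time and empty entries are skipped, the sketch accepts `", "`, `","` and `" "` between fields alike.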

				future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {

				static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {

				    tracing::trace_state_props_set props;

				    props.set<tracing::trace_state_props::full_tracing>();

				    props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());

				    return tracing_instance.create_session(tracing::trace_type::QUERY, props);

				}

				// truncated_content_view() prints a potentially long chunked_content for

				// debugging purposes. In the common case when the content is not excessively

				// long, it just returns a view into the given content, without any copying.

				// But when the content is very long, it is truncated after some arbitrary

				// max_len (or one chunk, whichever comes first), with "<truncated>" added at

				// the end. To do this modification to the string, we need to create a new

				// std::string, so the caller must pass us a reference to one, "buf", where

				// we can store the content. The returned view is only alive for as long as this

				// buf is kept alive.

				static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {

				    constexpr size_t max_len = 1024;

				    if (content.empty()) {

				        return std::string_view();

				    } else if (content.size() == 1 && content.begin()->size() <= max_len) {

				        return std::string_view(content.begin()->get(), content.begin()->size());

				    } else {

				        buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";

				        return std::string_view(buf);

				    }

				}
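The lifetime rule described for `truncated_content_view()` above (the returned view points either into the original content or into the caller-supplied buffer) can be illustrated with standard types only. In this sketch `chunks` is a hypothetical stand-in for `chunked_content`, modelled as a vector of strings, and `max_len` is exposed as a parameter purely so the example is easy to exercise.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for chunked_content: a sequence of buffers.
using chunks = std::vector<std::string>;

// Return a printable view of the content. Short single-chunk content is
// returned without copying. Otherwise the first chunk, cut at max_len, is
// copied into buf with "<truncated>" appended, and the view points into buf,
// so buf must outlive the returned view.
std::string_view truncated_view(const chunks& content, std::string& buf,
                                std::size_t max_len = 1024) {
    if (content.empty()) {
        return std::string_view();
    }
    const std::string& first = content.front();
    if (content.size() == 1 && first.size() <= max_len) {
        return std::string_view(first);
    }
    buf = first.substr(0, std::min(first.size(), max_len)) + "<truncated>";
    return std::string_view(buf);
}
```

Passing the buffer in by reference keeps the common case allocation-free while making the ownership of the truncated copy explicit at the call site.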

				static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, sstring_view op, const chunked_content& query) {

				    tracing::trace_state_ptr trace_state;

				    tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();

				    if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {

				        trace_state = create_tracing_session(tracing_instance);

				        std::string buf;

				        tracing::add_session_param(trace_state, "alternator_op", op);

				        tracing::add_query(trace_state, truncated_content_view(query, buf));

				        tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());

				        tracing::set_username(trace_state, auth::authenticated_user(username));

				    }

				    return trace_state;

				}

				future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {

				    _executor._stats.total_operations++;

				    sstring target = req->get_header(TARGET);

				    std::vector<std::string_view> split_target = split(target, '.');

				    //NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)

				    std::string op = split_target.empty() ? std::string() : std::string(split_target.back());

				    slogger.trace("Request: {} {} {}", op, req->content, req->_headers);

				    return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {

				        auto callback_it = _callbacks.find(op);

				        if (callback_it == _callbacks.end()) {

				            _executor._stats.unsupported_operations++;

				            throw api_error::unknown_operation(format("Unsupported operation {}", op));

				        }

				        return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {

				            //FIXME: Client state can provide more context, e.g. client's endpoint address

				            // We use unique_ptr because client_state cannot be moved or copied

				            return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),

				                    [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {

				                tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);

				                tracing::trace(trace_state, op);

				                // JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.

				                // FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.

				                // Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.

				                size_t mem_estimate = req->content.size() * 3 + 8000;

				                auto units_fut = get_units(*_memory_limiter, mem_estimate);

				                if (_memory_limiter->waiters()) {

				                    ++_executor._stats.requests_blocked_memory;

				                }

				                return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {

				                    return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,

				                            units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {

				                        return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});

				                    });

				                });

				            });

				        });

				    });

				    // JSON parsing can allocate up to roughly 2x the size of the raw

				    // document, + a couple of bytes for maintenance.

				    // TODO: consider the case where req->content_length is missing. Maybe

				    // we need to take the content_length_limit and return some of the units

				    // when we finish read_content_and_verify_signature?

				    size_t mem_estimate = req->content_length * 2 + 8000;

				    auto units_fut = get_units(*_memory_limiter, mem_estimate);

				    if (_memory_limiter->waiters()) {

				        ++_executor._stats.requests_blocked_memory;

				    }

				    auto units = co_await std::move(units_fut);

				    assert(req->content_stream);

				    chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);

				    auto username = co_await verify_signature(*req, content);

				    if (slogger.is_enabled(log_level::trace)) {

				        std::string buf;

				        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);

				    }

				    auto callback_it = _callbacks.find(op);

				    if (callback_it == _callbacks.end()) {

				        _executor._stats.unsupported_operations++;

				        co_return api_error::unknown_operation(format("Unsupported operation {}", op));

				    }

				    if (_pending_requests.get_count() >= _max_concurrent_requests) {

				        _executor._stats.requests_shed++;

				        co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));

				    }

				    _pending_requests.enter();

				    auto leave = defer([this] () noexcept { _pending_requests.leave(); });

				    //FIXME: Client state can provide more context, e.g. client's endpoint address

				    // We use unique_ptr because client_state cannot be moved or copied

				    executor::client_state client_state{executor::client_state::internal_tag()};

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);

				    tracing::trace(trace_state, op);

				    rjson::value json_request = co_await _json_parser.parse(std::move(content));

				    co_return co_await callback_it->second(_executor, client_state, trace_state,

				            make_service_permit(std::move(units)), std::move(json_request), std::move(req));

				}

				void server::set_routes(routes& r) {

				@@ -322,17 +445,19 @@ void server::set_routes(routes& r) {

				    // consider this to be a security risk, because an attacker can already

				    // scan an entire subnet for nodes responding to the health request,

				    // or even just scan for open ports.

				    r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));

				    r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests, _proxy, _gossiper));

				    r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));

				}

				//FIXME: A way to immediately invalidate the cache should be considered,

				// e.g. when the system table which stores the keys is changed.

				// For now, this propagation may take up to 1 minute.

				server::server(executor& exec, cql3::query_processor& qp)

				server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper)

				        : _http_server("http-alternator")

				        , _https_server("https-alternator")

				        , _executor(exec)

				        , _qp(qp)

				        , _proxy(proxy)

				        , _gossiper(gossiper)

				        , _key_cache(1024, 1min, slogger)

				        , _enforce_authorization(false)

				        , _enabled_servers{}

				@@ -389,6 +514,12 @@ server::server(executor& exec, cql3::query_processor& qp)

				        {"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));

				        }},

				        {"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.update_time_to_live(client_state, std::move(permit), std::move(json_request));

				        }},

				        {"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request));

				        }},

				        {"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				            return e.list_streams(client_state, std::move(permit), std::move(json_request));

				        }},

				@@ -405,9 +536,10 @@ server::server(executor& exec, cql3::query_processor& qp)

				}

				future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				        bool enforce_authorization, semaphore* memory_limiter) {

				        bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {

				    _memory_limiter = memory_limiter;

				    _enforce_authorization = enforce_authorization;

				    _max_concurrent_requests = std::move(max_concurrent_requests);

				    if (!port && !https_port) {

				        return make_exception_future<>(std::runtime_error("Either regular port or TLS port"

				                " must be specified in order to init an alternator HTTP server instance"));

				@@ -419,12 +551,14 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:

				            if (port) {

				                set_routes(_http_server._routes);

				                _http_server.set_content_length_limit(server::content_length_limit);

				                _http_server.set_content_streaming(true);

				                _http_server.listen(socket_address{addr, *port}).get();

				                _enabled_servers.push_back(std::ref(_http_server));

				            }

				            if (https_port) {

				                set_routes(_https_server._routes);

				                _https_server.set_content_length_limit(server::content_length_limit);

				                _https_server.set_content_streaming(true);

				                _https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {

				                    if (ep) {

				                        slogger.warn("Exception loading {}: {}", files, ep);

				@@ -462,7 +596,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {

				                return;

				            }

				            try {

				                _parsed_document = rjson::parse_yieldable(_raw_document);

				                _parsed_document = rjson::parse_yieldable(std::move(_raw_document));

				                _current_exception = nullptr;

				            } catch (...) {

				                _current_exception = std::current_exception();

				@@ -472,12 +606,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {

				    })) {

				}

				future<rjson::value> server::json_parser::parse(std::string_view content) {

				future<rjson::value> server::json_parser::parse(chunked_content&& content) {

				    if (content.size() < yieldable_parsing_threshold) {

				        return make_ready_future<rjson::value>(rjson::parse(content));

				        return make_ready_future<rjson::value>(rjson::parse(std::move(content)));

				    }

				    return with_semaphore(_parsing_sem, 1, [this, content] {

				        _raw_document = content;

				    return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {

				        _raw_document = std::move(content);

				        _document_waiting.signal();

				        return _document_parsed.wait().then([this] {

				            if (_current_exception) {

				@@ -495,5 +629,12 @@ future<> server::json_parser::stop() {

				    return std::move(_run_parse_json_thread);

				}

				const char* api_error::what() const noexcept {

				    if (_what_string.empty()) {

				        _what_string = format("{} {}: {}", _http_code, _type, _msg);

				    }

				    return _what_string.c_str();

				}

				}
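Note on the admission check in the new `handle_api_request` above: it compares `_pending_requests.get_count()` with `_max_concurrent_requests`, bumps `requests_shed` and returns an error when the limit is reached, and otherwise enters the gate and leaves it through `defer`. The pattern can be sketched without Seastar roughly as follows. This is only an illustration under stated assumptions: `request_gate`, `gate_holder`, `shedding_stats` and `try_admit` are hypothetical names and not part of the Alternator code, and the real code uses a Seastar gate and a coroutine instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical stand-in for the per-shard gate: it only counts in-flight requests.
class request_gate {
    std::size_t _count = 0;
public:
    std::size_t get_count() const noexcept { return _count; }
    void enter() noexcept { ++_count; }
    void leave() noexcept { assert(_count > 0); --_count; }
};

// RAII guard playing the role of `defer(...)` in the diff:
// enters the gate on construction and leaves it on scope exit.
class gate_holder {
    request_gate* _gate;
public:
    explicit gate_holder(request_gate& g) noexcept : _gate(&g) { _gate->enter(); }
    gate_holder(gate_holder&& o) noexcept : _gate(std::exchange(o._gate, nullptr)) {}
    gate_holder(const gate_holder&) = delete;
    gate_holder& operator=(const gate_holder&) = delete;
    gate_holder& operator=(gate_holder&&) = delete;
    ~gate_holder() { if (_gate) { _gate->leave(); } }
};

// Hypothetical counterpart of the `requests_shed` counter added to the stats.
struct shedding_stats {
    std::uint64_t requests_shed = 0;
};

// Admits the request if fewer than `max_concurrent` requests are in flight.
// Otherwise counts the request as shed and returns std::nullopt.
std::optional<gate_holder> try_admit(request_gate& gate, std::size_t max_concurrent,
                                     shedding_stats& stats) {
    if (gate.get_count() >= max_concurrent) {
        ++stats.requests_shed;
        return std::nullopt;
    }
    return std::optional<gate_holder>(std::in_place, gate);
}
```

A caller would keep the returned holder alive for the duration of the request. When the result is empty, it would answer with an error, as the diff does with `api_error::request_limit_exceeded`.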

									

alternator/server.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -23,15 +23,19 @@

				#include "alternator/executor.hh"

				#include <seastar/core/future.hh>

				#include <seastar/core/condition-variable.hh>

				#include <seastar/http/httpd.hh>

				#include <seastar/net/tls.hh>

				#include <optional>

				#include "alternator/auth.hh"

				#include "utils/small_vector.hh"

				#include "utils/updateable_value.hh"

				#include <seastar/core/units.hh>

				namespace alternator {

				using chunked_content = rjson::chunked_content;

				class server {

				    static constexpr size_t content_length_limit = 16*MB;

				    using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,

				@@ -41,7 +45,8 @@ class server {

				    http_server _http_server;

				    http_server _https_server;

				    executor& _executor;

				    cql3::query_processor& _qp;

				    service::storage_proxy& _proxy;

				    gms::gossiper& _gossiper;

				    key_cache _key_cache;

				    bool _enforce_authorization;

				@@ -50,10 +55,11 @@ class server {

				    alternator_callbacks_map _callbacks;

				    semaphore* _memory_limiter;

				    utils::updateable_value<uint32_t> _max_concurrent_requests;

				    class json_parser {

				        static constexpr size_t yieldable_parsing_threshold = 16*KB;

				        std::string_view _raw_document;

				        chunked_content _raw_document;

				        rjson::value _parsed_document;

				        std::exception_ptr _current_exception;

				        semaphore _parsing_sem{1};

				@@ -63,21 +69,25 @@ class server {

				        future<> _run_parse_json_thread;

				    public:

				        json_parser();

				        future<rjson::value> parse(std::string_view content);

				        // Moving a chunked_content into parse() allows parse() to free each

				        // chunk as soon as it is parsed, so when chunks are relatively small,

				        // we don't need to store the sum of unparsed and parsed sizes.

				        future<rjson::value> parse(chunked_content&& content);

				        future<> stop();

				    };

				    json_parser _json_parser;

				public:

				    server(executor& executor, cql3::query_processor& qp);

				    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper);

				    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				            bool enforce_authorization, semaphore* memory_limiter);

				            bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				    future<> stop();

				private:

				    void set_routes(seastar::httpd::routes& r);

				    future<> verify_signature(const seastar::httpd::request& r);

				    future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);

				    // If verification succeeds, returns the authenticated user's username

				    future<std::string> verify_signature(const seastar::httpd::request&, const chunked_content&);

				    future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);

				};

				}
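Note on the comment above `parse(chunked_content&& content)`: it says that moving the chunks into the parser lets each chunk be freed as soon as it has been parsed, so the unparsed and parsed data never have to be held in full at the same time. A minimal sketch of that accounting, assuming a hypothetical `chunked_body` (a vector of strings) and a trivial "parser" that just concatenates; it tracks logical byte counts only and is not how `rjson` is implemented:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked request body: a list of owned chunks.
using chunked_body = std::vector<std::string>;

struct consume_result {
    std::string output;          // stands in for the parsed document
    std::size_t peak_bytes = 0;  // largest (unconsumed input + output) observed
};

// Takes ownership of the chunks, "parses" them one by one and releases
// each chunk right after it has been consumed.
consume_result consume_and_free(chunked_body&& body) {
    chunked_body chunks = std::move(body);
    std::size_t unconsumed = 0;
    for (const auto& c : chunks) {
        unconsumed += c.size();
    }
    consume_result r;
    r.peak_bytes = unconsumed;
    for (auto& c : chunks) {
        r.output += c;  // consume the chunk
        // The chunk is still alive at this point, so it counts on both sides.
        r.peak_bytes = std::max(r.peak_bytes, unconsumed + r.output.size());
        assert(unconsumed >= c.size());
        unconsumed -= c.size();
        std::string().swap(c);  // release the chunk's storage immediately
    }
    return r;
}
```

In this sketch the peak is the total input size plus one chunk, rather than twice the input size, which is the effect the header comment describes for relatively small chunks.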

									

alternator/stats.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -38,6 +38,7 @@ stats::stats() : api_operations{} {

				#define OPERATION_LATENCY(name, CamelCaseName) \

				                seastar::metrics::make_histogram("op_latency", \

				                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),

				            OPERATION(batch_get_item, "BatchGetItem")

				            OPERATION(batch_write_item, "BatchWriteItem")

				            OPERATION(create_backup, "CreateBackup")

				            OPERATION(create_global_table, "CreateGlobalTable")

				@@ -96,6 +97,8 @@ stats::stats() : api_operations{} {

				                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),

				            seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,

				                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),

				            seastar::metrics::make_total_operations("requests_shed", requests_shed,

				                    seastar::metrics::description("Counts a number of requests shed due to overload.")),

				            seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,

				                    seastar::metrics::description("number of rows read during filtering operations")),

				            seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

									

alternator/stats.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

				@@ -92,6 +92,7 @@ public:

				    uint64_t write_using_lwt = 0;

				    uint64_t shard_bounce_for_lwt = 0;

				    uint64_t requests_blocked_memory = 0;

				    uint64_t requests_shed = 0;

				    // CQL-derived stats

				    cql3::cql_stats cql_stats;

				private:

									

alternator/streams.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2020 ScyllaDB

				 * Copyright 2020-present ScyllaDB

				 */

				/*

				@@ -26,7 +26,7 @@

				#include <seastar/json/formatter.hh>

				#include "base64.hh"

				#include "utils/base64.hh"

				#include "log.hh"

				#include "database.hh"

				#include "db/config.hh"

				@@ -34,13 +34,15 @@

				#include "cdc/log.hh"

				#include "cdc/generation.hh"

				#include "cdc/cdc_options.hh"

				#include "cdc/metadata.hh"

				#include "db/system_distributed_keyspace.hh"

				#include "utils/UUID_gen.hh"

				#include "cql3/selection/selection.hh"

				#include "cql3/result_set.hh"

				#include "cql3/type_json.hh"

				#include "cql3/column_identifier.hh"

				#include "schema_builder.hh"

				#include "service/storage_service.hh"

				#include "service/storage_proxy.hh"

				#include "gms/feature.hh"

				#include "gms/feature_service.hh"

				@@ -88,7 +90,7 @@ struct rapidjson::internal::TypeHelper<ValueType, utils::UUID>

				{};

				static db_clock::time_point as_timepoint(const utils::UUID& uuid) {

				    return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};

				    return db_clock::time_point{utils::UUID_gen::unix_timestamp(uuid)};

				}

				/**

				@@ -194,7 +196,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				        if (table && ks_name != table->ks_name()) {

				            continue;

				        }

				        if (cdc::is_log_for_some_table(ks_name, cf_name)) {

				        if (cdc::is_log_for_some_table(db, ks_name, cf_name)) {

				            if (table && table != cdc::get_base_table(db, *s)) {

				                continue;

				            }

				@@ -202,19 +204,19 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				            rjson::value new_entry = rjson::empty_object();

				            last = i->first;

				            rjson::set(new_entry, "StreamArn", *last);

				            rjson::set(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));

				            rjson::set(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));

				            rjson::add(new_entry, "StreamArn", *last);

				            rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));

				            rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));

				            rjson::push_back(streams, std::move(new_entry));

				            --limit;

				        }

				    }

				    rjson::set(ret, "Streams", std::move(streams));

				    rjson::add(ret, "Streams", std::move(streams));

				    if (last) {

				        rjson::set(ret, "LastEvaluatedStreamArn", *last);

				        rjson::add(ret, "LastEvaluatedStreamArn", *last);

				    }

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				@@ -294,7 +296,8 @@ sequence_number::sequence_number(std::string_view v)

				        // view directly.

				        uint128_t tmp{std::string(v)};

				        // see above

				        return utils::UUID_gen::get_time_UUID_raw(uint64_t(tmp >> 64), uint64_t(tmp & std::numeric_limits<uint64_t>::max()));

				        return utils::UUID_gen::get_time_UUID_raw(utils::UUID_gen::decimicroseconds{uint64_t(tmp >> 64)},

				            uint64_t(tmp & std::numeric_limits<uint64_t>::max()));

				    }())

				{}

				@@ -470,8 +473,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    auto status = "DISABLED";

				    if (opts.enabled()) {

				        auto& metadata = _ss.get_cdc_metadata();

				        if (!metadata.streams_available()) {

				        if (!_cdc_metadata.streams_available()) {

				            status = "ENABLING";

				        } else {

				            status = "ENABLED";

				@@ -480,18 +482,18 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    auto ttl = std::chrono::seconds(opts.ttl());

				    rjson::set(stream_desc, "StreamStatus", rjson::from_string(status));

				    rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));

				    stream_view_type type = cdc_options_to_steam_view_type(opts);

				    rjson::set(stream_desc, "StreamArn", alternator::stream_arn(schema->id()));

				    rjson::set(stream_desc, "StreamViewType", type);

				    rjson::set(stream_desc, "TableName", rjson::from_string(table_name(*bs)));

				    rjson::add(stream_desc, "StreamArn", alternator::stream_arn(schema->id()));

				    rjson::add(stream_desc, "StreamViewType", type);

				    rjson::add(stream_desc, "TableName", rjson::from_string(table_name(*bs)));

				    describe_key_schema(stream_desc, *bs);

				    if (!opts.enabled()) {

				        rjson::set(ret, "StreamDescription", std::move(stream_desc));

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    }

				@@ -499,19 +501,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    // TODO: creation time

				    auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();

				    // cannot really "resume" query, must iterate all data. because we cannot query neither "time" (pk) > something,

				    // or on expired...

				    // TODO: maybe add secondary index to topology table to enable this?

				    return _sdks.cdc_get_versioned_streams({ normal_token_owners }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

				        // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)

				        auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);

				    // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)

				    auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);

				        auto i = topologies.lower_bound(low_ts);

				        // need first gen _intersecting_ the timestamp.

				        if (i != topologies.begin()) {

				            i = std::prev(i);

				        }

				    return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, &db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

				        auto e = topologies.end();

				        auto prev = e;

				@@ -519,9 +513,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        std::optional<shard_id> last;

				        // i is now at the youngest generation we include. make a mark of it.

				        auto first = i;

				        auto i = topologies.begin();

				        // if we're a paged query, skip to the generation where we left off.

				        if (shard_start) {

				            i = topologies.find(shard_start->time);

				@@ -547,7 +539,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        };

				        // need a prev even if we are skipping stuff

				        if (i != first) {

				        if (i != topologies.begin()) {

				            prev = std::prev(i);

				        }

				@@ -598,19 +590,19 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				                        pid = std::prev(pid);

				                    }

				                    if (pid != pids.end()) {

				                        rjson::set(shard, "ParentShardId", shard_id(prev->first, *pid));

				                        rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));

				                    }

				                }

				                last.emplace(ts, id);

				                rjson::set(shard, "ShardId", *last);

				                rjson::add(shard, "ShardId", *last);

				                auto range = rjson::empty_object();

				                rjson::set(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch().count())));

				                rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));

				                if (expired) {

				                    rjson::set(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch().count())));

				                    rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));

				                }

				                rjson::set(shard, "SequenceNumberRange", std::move(range));

				                rjson::add(shard, "SequenceNumberRange", std::move(range));

				                rjson::push_back(shards, std::move(shard));

				                if (--limit == 0) {

				@@ -622,11 +614,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        }

				        if (last) {

				            rjson::set(stream_desc, "LastEvaluatedShardId", *last);

				            rjson::add(stream_desc, "LastEvaluatedShardId", *last);

				        }

				        rjson::set(stream_desc, "Shards", std::move(shards));

				        rjson::set(ret, "StreamDescription", std::move(stream_desc));

				        rjson::add(stream_desc, "Shards", std::move(shards));

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    });

				@@ -771,7 +763,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				            inclusive_of_threshold = true;

				            break;

				        case shard_iterator_type::LATEST:

				            threshold = utils::UUID_gen::min_time_UUID((db_clock::now() - confidence_interval(db)).time_since_epoch().count());

				            threshold = utils::UUID_gen::min_time_UUID((db_clock::now() - confidence_interval(db)).time_since_epoch());

				            inclusive_of_threshold = true;

				            break;

				    }

				@@ -779,7 +771,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    shard_iterator iter(stream_arn, *sid, threshold, inclusive_of_threshold);

				    auto ret = rjson::empty_object();

				    rjson::set(ret, "ShardIterator", iter);

				    rjson::add(ret, "ShardIterator", iter);

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				}

				@@ -843,7 +835,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    dht::partition_range_vector partition_ranges{ dht::partition_range::make_singular(dht::decorate_key(*schema, pk)) };

				    auto high_ts = db_clock::now() - confidence_interval(db);

				    auto high_uuid = utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch().count());

				    auto high_uuid = utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch());

				    auto lo = clustering_key_prefix::from_exploded(*schema, { iter.threshold.serialize() });

				    auto hi = clustering_key_prefix::from_exploded(*schema, { high_uuid.serialize() });

				@@ -855,16 +847,18 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");

				    static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");

				    auto key_names = boost::copy_range<std::unordered_set<std::string>>(

				    auto key_names = boost::copy_range<attrs_to_get>(

				        boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))

				        | boost::adaptors::transformed([&] (const column_definition& cdef) { return cdef.name_as_text(); })

				        | boost::adaptors::transformed([&] (const column_definition& cdef) {

				            return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })

				    );

				    // Include all base table columns as values (in case pre or post is enabled).

				    // This will include attributes not stored in the frozen map column

				    auto attr_names = boost::copy_range<std::unordered_set<std::string>>(base->regular_columns()

				    auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()

				        // this will include the :attrs column, which we will also force evaluating. 

				        // But not having this set empty forces out any cdc columns from actual result 

				        | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.name_as_text(); })

				        | boost::adaptors::transformed([] (const column_definition& cdef) {

				            return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })

				    );

				    std::vector<const column_definition*> columns;

				@@ -933,13 +927,13 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        auto maybe_add_record = [&] {

				            if (!dynamodb.ObjectEmpty()) {

				                rjson::set(record, "dynamodb", std::move(dynamodb));

				                rjson::add(record, "dynamodb", std::move(dynamodb));

				                dynamodb = rjson::empty_object();

				            }

				            if (!record.ObjectEmpty()) {

				                // TODO: awsRegion?

				                rjson::set(record, "eventID", event_id(iter.shard.id, *timestamp));

				                rjson::set(record, "eventSource", "scylladb:alternator");

				                rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));

				                rjson::add(record, "eventSource", "scylladb:alternator");

				                rjson::push_back(records, std::move(record));

				                record = rjson::empty_object();

				                --limit;

				@@ -954,10 +948,10 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            if (!dynamodb.HasMember("Keys")) {

				                auto keys = rjson::empty_object();

				                describe_single_item(*selection, row, key_names, keys);

				                rjson::set(dynamodb, "Keys", std::move(keys));

				                rjson::set(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());

				                rjson::set(dynamodb, "SequenceNumber", sequence_number(ts));

				                rjson::set(dynamodb, "StreamViewType", type);

				                rjson::add(dynamodb, "Keys", std::move(keys));

				                rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());

				                rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));

				                rjson::add(dynamodb, "StreamViewType", type);

				                //TODO: SizeInBytes

				            }

				@@ -989,17 +983,17 @@ future<executor::request_return_type> executor::get_records(client_state& client

				                auto item = rjson::empty_object();

				                describe_single_item(*selection, row, attr_names, item, true);

				                describe_single_item(*selection, row, key_names, item);

				                rjson::set(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));

				                rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));

				                break;

				            }

				            case cdc::operation::update:

				                rjson::set(record, "eventName", "MODIFY");

				                rjson::add(record, "eventName", "MODIFY");

				                break;

				            case cdc::operation::insert:

				                rjson::set(record, "eventName", "INSERT");

				                rjson::add(record, "eventName", "INSERT");

				                break;

				            default:

				                rjson::set(record, "eventName", "REMOVE");

				                rjson::add(record, "eventName", "REMOVE");

				                break;

				            }

				            if (eor) {

				@@ -1013,7 +1007,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        auto ret = rjson::empty_object();

				        auto nrecords = records.Size();

				        rjson::set(ret, "Records", std::move(records));

				        rjson::add(ret, "Records", std::move(records));

				        if (nrecords != 0) {

				            // #9642. Set next iterators threshold to > last

				@@ -1022,13 +1016,15 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            // without checking if maybe we reached the end-of-shard. If the

				            // shard did end, then the next read will have nrecords == 0 and

				            // will notice the end of shard and not return NextShardIterator.

				            rjson::set(ret, "NextShardIterator", next_iter);

				            rjson::add(ret, "NextShardIterator", next_iter);

				            _stats.api_operations.get_records_latency.add(std::chrono::steady_clock::now() - start_time);

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        }

				        // ugh. figure out if we are at end-of-shard

				        return cdc::get_local_streams_timestamp().then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {

				        auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();

				        return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {

				            auto& shard = iter.shard;            

				            if (shard.time < ts && ts < high_ts) {

				@@ -1041,8 +1037,8 @@ future<executor::request_return_type> executor::get_records(client_state& client

				                // a search from it until high_ts and found nothing, so we

				                // can also start the next search from high_ts.

				                // TODO: but why? It's simpler just to leave the iterator be.

				                shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch().count()), true);

				                rjson::set(ret, "NextShardIterator", iter);

				                shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);

				                rjson::add(ret, "NextShardIterator", iter);

				            }

				            _stats.api_operations.get_records_latency.add(std::chrono::steady_clock::now() - start_time);

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				@@ -1100,11 +1096,11 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s

				        auto& db = _proxy.get_db().local();

				        auto& cf = db.find_column_family(schema.ks_name(), cdc::log_name(schema.cf_name()));

				        stream_arn arn(cf.schema()->id());

				        rjson::set(descr, "LatestStreamArn", arn);

				        rjson::set(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));

				        rjson::add(descr, "LatestStreamArn", arn);

				        rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));

				        auto stream_desc = rjson::empty_object();

				        rjson::set(stream_desc, "StreamEnabled", true);

				        rjson::add(stream_desc, "StreamEnabled", true);

				        auto mode = stream_view_type::KEYS_ONLY;

				        if (opts.preimage() && opts.postimage()) {

				@@ -1114,8 +1110,8 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s

				        } else if (opts.postimage()) {

				            mode = stream_view_type::NEW_IMAGE;

				        }

				        rjson::set(stream_desc, "StreamViewType", mode);

				        rjson::set(descr, "StreamSpecification", std::move(stream_desc));

				        rjson::add(stream_desc, "StreamViewType", mode);

				        rjson::add(descr, "StreamSpecification", std::move(stream_desc));

				    }

				}

									
										2

alternator/tags_extension.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2019 ScyllaDB

				 * Copyright 2019-present ScyllaDB

				 */

				/*

									
										113

alternator/ttl.cc
									
										Normal file
									
												
				@@ -0,0 +1,113 @@

				/*

				 * Copyright 2021-present ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU Affero General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <seastar/core/sstring.hh>

				#include <seastar/core/coroutine.hh>

				#include "executor.hh"

				#include "service/storage_proxy.hh"

				#include "gms/feature_service.hh"

				#include "database.hh"

				#include "utils/rjson.hh"

				namespace alternator {

				// We write the expiration-time attribute enabled on a table using a

				// tag TTL_TAG_KEY.

				// Currently, the *value* of this tag is simply the name of the attribute,

				// and the expiration scanner interprets it as an Alternator attribute name -

				// It can refer to a real column or if that doesn't exist, to a member of

				// the ":attrs" map column. Although this is designed for Alternator, it may

				// be good enough for CQL as well (there, the ":attrs" column won't exist).

				static const sstring TTL_TAG_KEY("system:ttl_attribute");

				future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.update_time_to_live++;

				    if (!_proxy.get_db().local().features().cluster_supports_alternator_ttl()) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");

				    }

				    schema_ptr schema = get_table(_proxy, request);

				    rjson::value* spec = rjson::find(request, "TimeToLiveSpecification");

				    if (!spec || !spec->IsObject()) {

				        co_return api_error::validation("UpdateTimeToLive missing mandatory TimeToLiveSpecification");

				    }

				    const rjson::value* v = rjson::find(*spec, "Enabled");

				    if (!v || !v->IsBool()) {

				        co_return api_error::validation("UpdateTimeToLive requires boolean Enabled");

				    }

				    bool enabled = v->GetBool();

				    v = rjson::find(*spec, "AttributeName");

				    if (!v || !v->IsString()) {

				        co_return api_error::validation("UpdateTimeToLive requires string AttributeName");

				    }

				    // Although the DynamoDB documentation specifies that attribute names

				    // should be between 1 and 64K bytes, in practice, it only allows

				    // between 1 and 255 bytes. There are no other limitations on which

				    // characters are allowed in the name.

				    if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {

				        co_return api_error::validation("The length of AttributeName must be between 1 and 255");

				    }

				    sstring attribute_name(v->GetString(), v->GetStringLength());

				    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);

				    if (enabled) {

				        if (tags_map.contains(TTL_TAG_KEY)) {

				            co_return api_error::validation("TTL is already enabled");

				        }

				        tags_map[TTL_TAG_KEY] = attribute_name;

				    } else {

				        auto i = tags_map.find(TTL_TAG_KEY);

				        if (i == tags_map.end()) {

				            co_return api_error::validation("TTL is already disabled");

				        } else if (i->second != attribute_name) {

				            co_return api_error::validation(format(

				                "Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",

				                attribute_name, i->second));

				        }

				        tags_map.erase(TTL_TAG_KEY);

				    }

				    co_await update_tags(_mm, schema, std::move(tags_map));

				    // Prepare the response, which contains a TimeToLiveSpecification

				    // basically identical to the request's

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveSpecification", std::move(*spec));

				    co_return make_jsonable(std::move(response));

				}

				future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.describe_time_to_live++;

				    schema_ptr schema = get_table(_proxy, request);

				    std::map<sstring, sstring> tags_map = get_tags_of_table(schema);

				    rjson::value desc = rjson::empty_object();

				    auto i = tags_map.find(TTL_TAG_KEY);

				    if (i == tags_map.end()) {

				        rjson::add(desc, "TimeToLiveStatus", "DISABLED");

				    } else {

				        rjson::add(desc, "TimeToLiveStatus", "ENABLED");

				        rjson::add(desc, "AttributeName", rjson::from_string(i->second));

				    }

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveDescription", std::move(desc));

				    co_return make_jsonable(std::move(response));

				}

				} // namespace alternator
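
The enable/disable rules in `update_time_to_live` above can be exercised in isolation. The following is a minimal sketch, not the Alternator implementation: it assumes a plain `std::map<std::string, std::string>` in place of the table's tag map, and the names `tag_map`, `apply_ttl_update` and `ttl_status` are hypothetical helpers introduced only for illustration. It mirrors the validation order shown in the diff (length check, then "already enabled" / "already disabled" / mismatched attribute).

```cpp
#include <cassert>   // only needed by the usage checks described below
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the table's tags; the real code works on
// std::map<sstring, sstring> returned by get_tags_of_table().
using tag_map = std::map<std::string, std::string>;

// Same tag key as in the diff above.
static const std::string TTL_TAG_KEY = "system:ttl_attribute";

// Applies an UpdateTimeToLive-style request to the tag map.
// Returns an error message on validation failure, std::nullopt on success.
std::optional<std::string> apply_ttl_update(tag_map& tags, bool enabled,
                                            const std::string& attribute_name) {
    // The diff limits attribute names to 1..255 bytes.
    if (attribute_name.size() < 1 || attribute_name.size() > 255) {
        return "The length of AttributeName must be between 1 and 255";
    }
    if (enabled) {
        // Enabling twice is rejected, even for the same attribute.
        if (tags.count(TTL_TAG_KEY)) {
            return "TTL is already enabled";
        }
        tags[TTL_TAG_KEY] = attribute_name;
    } else {
        auto i = tags.find(TTL_TAG_KEY);
        if (i == tags.end()) {
            return "TTL is already disabled";
        }
        // Disabling must name the attribute that is currently enabled.
        if (i->second != attribute_name) {
            return "Requested to disable TTL on attribute " + attribute_name +
                   ", but a different attribute " + i->second + " is enabled.";
        }
        tags.erase(i);
    }
    return std::nullopt;
}

// Mirrors the TimeToLiveStatus reported by describe_time_to_live.
std::string ttl_status(const tag_map& tags) {
    return tags.count(TTL_TAG_KEY) ? "ENABLED" : "DISABLED";
}
```

Under these assumptions, enabling TTL on an empty map succeeds and flips `ttl_status` to `ENABLED`; a second enable, or a disable naming a different attribute, returns an error string and leaves the map unchanged.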

									
										10

api/api-doc/column_family.json
									
												
				@@ -89,7 +89,7 @@

				                     "description":"true if the output of the major compaction should be split in several sstables",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"bool",

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				@@ -2925,6 +2925,10 @@

				         "id":"toppartitions_query_results",

				         "description":"nodetool toppartitions query results",

				         "properties":{

				            "read_cardinality":{

				               "type":"long",

				               "description":"Number of the unique operations in the sample set"

				            },

				            "read":{

				               "type":"array",

				               "items":{

				@@ -2932,6 +2936,10 @@

				               },

				               "description":"Read results"

				            },

				            "write_cardinality":{

				               "type":"long",

				               "description":"Number of the unique operations in the sample set"

				            },

				            "write":{

				               "type":"array",

				               "items":{

									
										24

api/api-doc/gossiper.json
									
												
				@@ -148,6 +148,30 @@

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/gossiper/force_remove_endpoint/{addr}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Force remove an endpoint from gossip",

				               "type":"void",

				               "nickname":"force_remove_endpoint",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"addr",

				                     "description":"The endpoint address",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  }

				               ]

				            }

				         ]

				      }

				   ]

				}

									
										55

api/api-doc/hinted_handoff.json
									
												
				@@ -7,6 +7,61 @@

				      "application/json"

				   ],

				   "apis":[

				      {

				         "path":"/hinted_handoff/sync_point",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Creates a hints sync point. It can be used to wait until hints between given nodes are replayed. A sync point allows you to wait for hints accumulated at the moment of its creation - it won't wait for hints generated later. A sync point is described entirely by its ID - there is no state kept server-side, so there is no need to delete it.",

				               "type":"string",

				               "nickname":"create_hints_sync_point",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"target_hosts",

				                     "description":"A list of nodes towards which hints should be replayed. Multiple hosts can be listed by separating them with commas. If not provided or empty, the point will resolve when current hints towards all nodes in the cluster are sent.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            },

				            {

				               "method":"GET",

				               "summary":"Get the status of a hints sync point, possibly waiting for it to be reached.",

				               "type":"string",

				               "enum":[

				                  "DONE",

				                  "IN_PROGRESS"

				               ],

				               "nickname":"get_hints_sync_point",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"id",

				                     "description":"The ID of the hint sync point which should be checked or waited on",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"timeout",

				                     "description":"Timeout in seconds after which the query returns even if hints are still being replayed. No value or 0 will cause the query to return immediately. A negative value will cause the query to wait until the sync point is reached",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/hinted_handoff/hints",

				         "operations":[

									
										4

api/api-doc/messaging_service.json
									
												
				@@ -76,7 +76,7 @@

				               "items":{

				                  "type":"message_counter"

				               },

				               "nickname":"get_completed_messages",

				               "nickname":"get_replied_messages",

				               "produces":[

				                  "application/json"

				               ],

				@@ -252,7 +252,7 @@

				                 "UNUSED__STREAM_MUTATION",

				                 "STREAM_MUTATION_DONE",

				                 "COMPLETE_MESSAGE",

				                 "REPAIR_CHECKSUM_RANGE",

				                 "UNUSED__REPAIR_CHECKSUM_RANGE",

				                 "GET_SCHEMA_VERSION"

				               ]

				            }

									
										130

api/api-doc/storage_service.json
									
												
				@@ -104,6 +104,68 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/toppartitions/",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Toppartitions query",

				               "type":"toppartitions_query_results",

				               "nickname":"toppartitions_generic",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"table_filters",

				                     "description":"Optional list of table name filters in keyspace:name format",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"array",

				                     "items":{

				                        "type":"string"

				                     },

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"keyspace_filters",

				                     "description":"Optional list of keyspace filters",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"array",

				                     "items":{

				                        "type":"string"

				                     },

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"duration",

				                     "description":"Duration (in milliseconds) of monitoring operation",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type": "long",

				                     "paramType":"query"

				                  },

				                  {

				                    "name":"list_size",

				                    "description":"number of the top partitions to list",

				                    "required":false,

				                    "allowMultiple":false,

				                    "type": "long",

				                    "paramType":"query"

				                 },

				                 {

				                    "name":"capacity",

				                    "description":"capacity of stream summary: determines amount of resources used in query processing",

				                    "required":false,

				                    "allowMultiple":false,

				                    "type": "long",

				                    "paramType":"query"

				                 }

				              ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/nodes/leaving",

				         "operations":[

				@@ -575,6 +637,14 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"sf",

				                     "description":"Skip flush. When set to \"true\", do not flush memtables before snapshotting (snapshot will not contain unflushed data)",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            },

				@@ -700,7 +770,7 @@

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Scrub (deserialize + reserialize at the latest version, skipping bad rows if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false",

				               "summary":"Scrub (deserialize + reserialize at the latest version, resolving corruptions if any) the given keyspace. If columnFamilies array is empty, all CFs are scrubbed. Scrubbed CFs will be snapshotted first, if disableSnapshot is false. Scrub has the following modes: Abort (default) - abort scrub if corruption is detected; Skip (same as `skip_corrupted=true`) skip over corrupt data, omitting them from the output; Segregate - segregate data into multiple sstables if needed, such that each sstable contains data with valid order; Validate - read (no rewrite) and validate data, logging any problems found.",

				               "type": "long",

				               "nickname":"scrub",

				               "produces":[

				@@ -723,6 +793,20 @@

				                     "type":"boolean",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"scrub_mode",

				                     "description":"How to handle corrupt data (overrides 'skip_corrupted'); ",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "enum":[

				                        "ABORT",

				                        "SKIP",

				                        "SEGREGATE",

				                        "VALIDATE"

				                     ],

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace to query about",

				@@ -970,6 +1054,14 @@

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"ignore_nodes",

				                     "description":"Which hosts are to ignore in this repair. Multiple hosts can be listed separated by commas.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"trace",

				                     "description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",

				@@ -1105,6 +1197,14 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"ignore_nodes",

				                     "description":"List of dead nodes to ignore in removenode operation",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -1756,6 +1856,22 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"load_and_stream",

				                     "description":"Load the sstables and stream to all replica nodes that owns the data",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"primary_replica_only",

				                     "description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -1866,6 +1982,14 @@

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"fast",

				                     "description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            },

				@@ -2364,6 +2488,10 @@

				            "threshold":{

				               "type":"long",

				               "description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"

				            },

				            "fast":{

				               "type":"boolean",

				               "description":"Is lightweight tracing mode enabled. In that mode tracing ignore events and tracks only sessions."

				            }

				         }

				      },

									
										16

api/api-doc/system.json
									
												
				@@ -52,6 +52,22 @@

				            }

				         ]

				      },

				      {

				         "path":"/system/drop_sstable_caches",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Drop in-memory caches for data which is in sstables",

				               "type":"void",

				               "nickname":"drop_sstable_caches",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/system/uptime_ms",

				         "operations":[

									
										81

api/api.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 ScyllaDB

				 * Copyright 2015-present ScyllaDB

				 */

				/*

				@@ -75,10 +75,10 @@ future<> set_server_init(http_context& ctx) {

				    });

				}

				future<> set_server_config(http_context& ctx) {

				future<> set_server_config(http_context& ctx, const db::config& cfg) {

				    auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");

				    return ctx.http_server.set_routes([&ctx, rb02](routes& r) {

				        set_config(rb02, ctx, r);

				    return ctx.http_server.set_routes([&ctx, &cfg, rb02](routes& r) {

				        set_config(rb02, ctx, r, cfg);

				    });

				}

				@@ -109,12 +109,30 @@ future<> unset_rpc_controller(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });

				}

				future<> set_server_storage_service(http_context& ctx) {

				    return register_api(ctx, "storage_service", "The storage service API", set_storage_service);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs) {

				    return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs] (http_context& ctx, routes& r) {

				            set_storage_service(ctx, r, ss, g.local(), cdc_gs);

				        });

				}

				future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms) {

				    return ctx.http_server.set_routes([&ctx, &ms] (routes& r) { set_repair(ctx, r, ms); });

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {

				    return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });

				}

				future<> unset_server_sstables_loader(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_sstables_loader(ctx, r); });

				}

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb) {

				    return ctx.http_server.set_routes([&ctx, &vb] (routes& r) { set_view_builder(ctx, r, vb); });

				}

				future<> unset_server_view_builder(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_view_builder(ctx, r); });

				}

				future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair) {

				    return ctx.http_server.set_routes([&ctx, &repair] (routes& r) { set_repair(ctx, r, repair); });

				}

				future<> unset_server_repair(http_context& ctx) {

				@@ -133,9 +151,11 @@ future<> set_server_snitch(http_context& ctx) {

				    return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);

				}

				future<> set_server_gossip(http_context& ctx) {

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g) {

				    return register_api(ctx, "gossiper",

				                "The gossiper API", set_gossiper);

				                "The gossiper API", [&g] (http_context& ctx, routes& r) {

				                    set_gossiper(ctx, r, g.local());

				                });

				}

				future<> set_server_load_sstable(http_context& ctx) {

				@@ -153,9 +173,11 @@ future<> unset_server_messaging_service(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_messaging_service(ctx, r); });

				}

				future<> set_server_storage_proxy(http_context& ctx) {

				future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss) {

				    return register_api(ctx, "storage_proxy",

				                "The storage proxy API", set_storage_proxy);

				                "The storage proxy API", [&ss] (http_context& ctx, routes& r) {

				                    set_storage_proxy(ctx, r, ss);

				                });

				}

				future<> set_server_stream_manager(http_context& ctx) {

				@@ -168,13 +190,34 @@ future<> set_server_cache(http_context& ctx) {

				            "The cache service API", set_cache_service);

				}

				future<> set_server_gossip_settle(http_context& ctx) {

				future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g) {

				    return register_api(ctx, "hinted_handoff",

				                "The hinted handoff API", [&g] (http_context& ctx, routes& r) {

				                    set_hinted_handoff(ctx, r, g.local());

				                });

				}

				future<> unset_hinted_handoff(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_hinted_handoff(ctx, r); });

				}

				future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx, &g](routes& r) {

				        rb->register_function(r, "failure_detector",

				                "The failure detector API");

				        set_failure_detector(ctx, r, g.local());

				    });

				}

				future<> set_server_compaction_manager(http_context& ctx) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx](routes& r) {

				        rb->register_function(r, "failure_detector",

				                "The failure detector API");

				        set_failure_detector(ctx,r);

				        rb->register_function(r, "compaction_manager",

				                "The Compaction manager API");

				        set_compaction_manager(ctx, r);

				    });

				}

				@@ -182,18 +225,12 @@ future<> set_server_done(http_context& ctx) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx](routes& r) {

				        rb->register_function(r, "compaction_manager",

				                "The Compaction manager API");

				        set_compaction_manager(ctx, r);

				        rb->register_function(r, "lsa", "Log-structured allocator API");

				        set_lsa(ctx, r);

				        rb->register_function(r, "commitlog",

				                "The commit log API");

				        set_commitlog(ctx,r);

				        rb->register_function(r, "hinted_handoff",

				                "The hinted handoff API");

				        set_hinted_handoff(ctx, r);

				        rb->register_function(r, "collectd",

				                "The collectd API");

				        set_collectd(ctx, r);

									
										5

api/api.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 ScyllaDB

				 * Copyright 2015-present ScyllaDB

				 */

				/*

				@@ -29,6 +29,7 @@

				#include <boost/units/detail/utility.hpp>

				#include "api/api-doc/utils.json.hh"

				#include "utils/histogram.hh"

				#include "utils/estimated_histogram.hh"

				#include <seastar/http/exception.hh>

				#include "api_init.hh"

				#include "seastarx.hh"

				@@ -70,7 +71,7 @@ T map_sum(T&& dest, const S& src) {

				    for (auto i : src) {

				        dest[i.first] += i.second;

				    }

				    return dest;

				    return std::move(dest);

				}

				template <typename MAP>

									
										63

api/api_init.hh
									
				@@ -19,16 +19,48 @@

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "database_fwd.hh"

				#include "service/storage_proxy.hh"

				#include <seastar/http/httpd.hh>

				namespace service { class load_meter; }

				namespace locator { class shared_token_metadata; }

				#include <seastar/http/httpd.hh>

				#include <seastar/core/future.hh>

				#include "database_fwd.hh"

				#include "seastarx.hh"

				namespace service {

				class load_meter;

				class storage_proxy;

				class storage_service;

				} // namespace service

				class sstables_loader;

				namespace locator {

				class token_metadata;

				class shared_token_metadata;

				} // namespace locator

				namespace cql_transport { class controller; }

				class thrift_controller;

				namespace db { class snapshot_ctl; }

				namespace db {

				class snapshot_ctl;

				class config;

				namespace view {

				class view_builder;

				}

				}

				namespace netw { class messaging_service; }

				class repair_service;

				namespace cdc { class generation_service; }

				namespace gms {

				class gossiper;

				}

				namespace api {

				@@ -51,10 +83,14 @@ struct http_context {

				};

				future<> set_server_init(http_context& ctx);

				future<> set_server_config(http_context& ctx);

				future<> set_server_config(http_context& ctx, const db::config& cfg);

				future<> set_server_snitch(http_context& ctx);

				future<> set_server_storage_service(http_context& ctx);

				future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs);

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);

				future<> unset_server_sstables_loader(http_context& ctx);

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);

				future<> unset_server_view_builder(http_context& ctx);

				future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair);

				future<> unset_server_repair(http_context& ctx);

				future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl);

				future<> unset_transport_controller(http_context& ctx);

				@@ -62,14 +98,17 @@ future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);

				future<> unset_rpc_controller(http_context& ctx);

				future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);

				future<> unset_server_snapshot(http_context& ctx);

				future<> set_server_gossip(http_context& ctx);

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);

				future<> set_server_load_sstable(http_context& ctx);

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);

				future<> unset_server_messaging_service(http_context& ctx);

				future<> set_server_storage_proxy(http_context& ctx);

				future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss);

				future<> set_server_stream_manager(http_context& ctx);

				future<> set_server_gossip_settle(http_context& ctx);

				future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g);

				future<> unset_hinted_handoff(http_context& ctx);

				future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);

				future<> set_server_cache(http_context& ctx);

				future<> set_server_compaction_manager(http_context& ctx);

				future<> set_server_done(http_context& ctx);

				}

									
										2

api/cache_service.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/cache_service.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/collectd.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/collectd.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										133

api/column_family.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -24,10 +24,13 @@

				#include <vector>

				#include <seastar/http/exception.hh>

				#include "sstables/sstables.hh"

				#include "sstables/metadata_collector.hh"

				#include "utils/estimated_histogram.hh"

				#include <algorithm>

				#include "db/system_keyspace_view_types.hh"

				#include "db/data_listeners.hh"

				#include "storage_service.hh"

				#include "unimplemented.hh"

				extern logging::logger apilog;

				@@ -180,7 +183,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct

				static int64_t min_partition_size(column_family& cf) {

				    int64_t res = INT64_MAX;

				    for (auto i: *cf.get_sstables() ) {

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());

				    }

				    return (res == INT64_MAX) ? 0 : res;

				@@ -188,7 +191,7 @@ static int64_t min_partition_size(column_family& cf) {

				static int64_t max_partition_size(column_family& cf) {

				    int64_t res = 0;

				    for (auto i: *cf.get_sstables() ) {

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);

				    }

				    return res;

				@@ -196,7 +199,7 @@ static int64_t max_partition_size(column_family& cf) {

				static integral_ratio_holder mean_partition_size(column_family& cf) {

				    integral_ratio_holder res;

				    for (auto i: *cf.get_sstables() ) {

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        auto c = i->get_stats_metadata().estimated_partition_size.count();

				        res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;

				        res.total += c;

				@@ -274,7 +277,7 @@ public:

				static double get_compression_ratio(column_family& cf) {

				    sum_ratio<double> result;

				    for (auto i : *cf.get_sstables()) {

				    for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				        auto compression_ratio = i->get_compression_ratio();

				        if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {

				            result(compression_ratio);

				@@ -310,8 +313,8 @@ void set_column_family(http_context& ctx, routes& r) {

				        return res;

				    });

				    cf::get_column_family.set(r, [&ctx] (const_req req){

				            vector<cf::column_family_info> res;

				    cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){

				            std::list<cf::column_family_info> res;

				            for (auto i: ctx.db.local().get_column_families_mapping()) {

				                cf::column_family_info info;

				                info.ks = i.first.first;

				@@ -319,7 +322,7 @@ void set_column_family(http_context& ctx, routes& r) {

				                info.type = "ColumnFamilies";

				                res.push_back(info);

				            }

				            return res;

				            return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));

				        });

				    cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){

				@@ -331,15 +334,15 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {

				            return cf.active_memtable().partition_count();

				        }, std::plus<int>());

				        }, std::plus<>());

				    });

				    cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, 0, [](column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {

				            return cf.active_memtable().partition_count();

				        }, std::plus<int>());

				        }, std::plus<>());

				    });

				    cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
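Note on the hunk above: the memtable count handlers switch the seed from `0` to `uint64_t{0}` and the reducer from `std::plus<int>` to `std::plus<>`. The following is a minimal standalone sketch of why that pairing matters, assuming the per-table counts are 64-bit unsigned values. The `reduce_counts` and `total_partitions` helpers are hypothetical illustrations, not Scylla code. The transparent `std::plus<>` adds its operands in their own type, whereas `std::plus<int>` converts each operand to `int` first and can drop the high bits of a large count.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a map-reduce over per-table partition counts.
// Init and Reducer mirror the two arguments changed in the hunk above.
template <typename Init, typename Reducer>
Init reduce_counts(const std::vector<std::uint64_t>& counts, Init init, Reducer reducer) {
    for (std::uint64_t c : counts) {
        init = reducer(init, c);
    }
    return init;
}

// With a uint64_t seed and std::plus<>, the sum stays 64-bit throughout.
inline std::uint64_t total_partitions(const std::vector<std::uint64_t>& counts) {
    return reduce_counts(counts, std::uint64_t{0}, std::plus<>());
}
```

Calling `total_partitions` with a value above the 32-bit range, such as `std::uint64_t{1} << 40`, keeps the full value in the result.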

				@@ -424,7 +427,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto i: *cf.get_sstables() ) {

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res.merge(i->get_stats_metadata().estimated_partition_size);

				            }

				            return res;

				@@ -436,7 +439,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            uint64_t res = 0;

				            for (auto i: *cf.get_sstables() ) {

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res += i->get_stats_metadata().estimated_partition_size.count();

				            }

				            return res;

				@@ -447,7 +450,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto i: *cf.get_sstables() ) {

				            for (auto sstables = cf.get_sstables(); auto& i : *sstables) {

				                res.merge(i->get_stats_metadata().estimated_cells_count);

				            }

				            return res;

				@@ -599,7 +602,8 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_false_positive();

				            });

				        }, std::plus<uint64_t>());

				@@ -607,7 +611,8 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_false_positive();

				            });

				        }, std::plus<uint64_t>());

				@@ -615,7 +620,8 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_recent_false_positive();

				            });

				        }, std::plus<uint64_t>());

				@@ -623,7 +629,8 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_get_recent_false_positive();

				            });

				        }, std::plus<uint64_t>());

				@@ -655,48 +662,54 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->filter_size();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->filter_size();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->filter_memory_size();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_memory_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->filter_memory_size();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->filter_memory_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->get_summary().memory_footprint();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->get_summary().memory_footprint();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst->get_summary().memory_footprint();

				            auto sstables = cf.get_sstables();

				            return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst->get_summary().memory_footprint();

				            });

				        }, std::plus<uint64_t>());

				    });
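Note on the hunk above: besides caching `cf.get_sstables()` in a local, it changes each `std::accumulate` lambda from `return sst->filter_size();` (and its siblings) to `return s + sst->filter_size();`. A minimal standalone sketch of the difference follows. `fake_sstable` and both helper functions are hypothetical stand-ins for illustration, not Scylla types.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an sstable's bloom filter size in bytes.
struct fake_sstable {
    std::uint64_t filter_size;
};

// Old shape: the lambda ignores the running sum `s`, so the result is
// just the value from the last element (or the seed for an empty range).
inline std::uint64_t last_filter_size(const std::vector<fake_sstable>& ssts) {
    return std::accumulate(ssts.begin(), ssts.end(), std::uint64_t(0),
            [](std::uint64_t /*s*/, const fake_sstable& sst) { return sst.filter_size; });
}

// New shape: add each element to the running sum.
inline std::uint64_t total_filter_size(const std::vector<fake_sstable>& ssts) {
    return std::accumulate(ssts.begin(), ssts.end(), std::uint64_t(0),
            [](std::uint64_t s, const fake_sstable& sst) { return s + sst.filter_size; });
}
```

For three elements with sizes 10, 20 and 5, the old shape yields 5 and the new shape yields 35, which is why the per-table totals were under-reported before the change.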

				@@ -849,18 +862,24 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				        return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				            cf.enable_auto_compaction();

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (database& db) {

				            auto g = database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				                cf.enable_auto_compaction();

				            }).then([g = std::move(g)] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        });

				    });

				    cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				        return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				            cf.disable_auto_compaction();

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (database& db) {

				            auto g = database::autocompaction_toggle_guard(db);

				            return foreach_column_family(ctx, req->param["name"], [](column_family &cf) {

				                cf.disable_auto_compaction();

				            }).then([g = std::move(g)] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        });

				    });

				@@ -868,7 +887,7 @@ void set_column_family(http_context& ctx, routes& r) {

				        auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);

				        auto&& ks = std::get<0>(ks_cf);

				        auto&& cf_name = std::get<1>(ks_cf);

				        return db::system_keyspace::load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_build_progress>& vb) mutable {

				        return db::system_keyspace::load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace_view_build_progress>& vb) mutable {

				            std::set<sstring> vp;

				            for (auto b : vb) {

				                if (b.view.first == ks) {

				@@ -973,42 +992,20 @@ void set_column_family(http_context& ctx, routes& r) {

				        });

				    });

				    cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        auto name_param = req->param["name"];

				        auto [ks, cf] = parse_fully_qualified_cf_name(name_param);

				        auto name = req->param["name"];

				        auto [ks, cf] = parse_fully_qualified_cf_name(name);

				        api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};

				        api::req_param<unsigned> capacity(*req, "capacity", 256);

				        api::req_param<unsigned> list_size(*req, "list_size", 10);

				        apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",

				            name_param, duration.param, list_size.param, capacity.param);

				            name, duration.param, list_size.param, capacity.param);

				        return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {

				            return q.scatter().then([&q] {

				                return sleep(q.duration()).then([&q] {

				                    return q.gather(q.capacity()).then([&q] (auto topk_results) {

				                        apilog.debug("toppartitions query: processing results");

				                        cf::toppartitions_query_results results;

				                        for (auto& d: topk_results.read.top(q.list_size())) {

				                            cf::toppartitions_record r;

				                            r.partition = sstring(d.item);

				                            r.count = d.count;

				                            r.error = d.error;

				                            results.read.push(r);

				                        }

				                        for (auto& d: topk_results.write.top(q.list_size())) {

				                            cf::toppartitions_record r;

				                            r.partition = sstring(d.item);

				                            r.count = d.count;

				                            r.error = d.error;

				                            results.write.push(r);

				                        }

				                        return make_ready_future<json::json_return_type>(results);

				                    });

				                });

				            });

				        return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {

				            return run_toppartitions_query(q, ctx, true);

				        });

				    });

									
										7

api/column_family.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -77,7 +77,7 @@ struct map_reduce_column_families_locally {

				    future<std::unique_ptr<std::any>> operator()(database& db) const {

				        auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));

				        return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<table>>& i) {

				            *res = std::move(reducer(std::move(*res), mapper(*i.second.get())));

				            *res = reducer(std::move(*res), mapper(*i.second.get()));

				        }).then([res] {

				            return std::move(*res);

				        });

				@@ -116,4 +116,7 @@ future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& n

				future<json::json_return_type>  get_cf_stats(http_context& ctx,

				        int64_t column_family_stats::*f);

				std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);

				}

									
										2

api/commitlog.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/commitlog.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										16

api/compaction_manager.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -20,10 +20,11 @@

				 */

				#include "compaction_manager.hh"

				#include "sstables/compaction_manager.hh"

				#include "compaction/compaction_manager.hh"

				#include "api/api-doc/compaction_manager.json.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				#include "unimplemented.hh"

				#include <utility>

				namespace api {

				@@ -58,12 +59,13 @@ void set_compaction_manager(http_context& ctx, routes& r) {

				            for (const auto& c : cm.get_compactions()) {

				                cm::summary s;

				                s.ks = c->ks_name;

				                s.cf = c->cf_name;

				                s.id = c.compaction_uuid.to_sstring();

				                s.ks = c.ks_name;

				                s.cf = c.cf_name;

				                s.unit = "keys";

				                s.task_type = sstables::compaction_name(c->type);

				                s.completed = c->total_keys_written;

				                s.total = c->total_partitions;

				                s.task_type = sstables::compaction_name(c.type);

				                s.completed = c.total_keys_written;

				                s.total = c.total_partitions;

				                summaries.push_back(std::move(s));

				            }

				            return summaries;

									
										2

api/compaction_manager.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										15

api/config.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2018 ScyllaDB

				 * Copyright 2018-present ScyllaDB

				 */

				/*

				@@ -22,7 +22,6 @@

				#include "api/config.hh"

				#include "api/api-doc/config.json.hh"

				#include "db/config.hh"

				#include "database.hh"

				#include <sstream>

				#include <boost/algorithm/string/replace.hpp>

				@@ -89,11 +88,11 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc

				namespace cs = httpd::config_json;

				void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r) {

				    rb->register_function(r, [&ctx] (output_stream<char>& os) {

				        return do_with(true, [&os, &ctx] (bool& first) {

				void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg) {

				    rb->register_function(r, [&cfg] (output_stream<char>& os) {

				        return do_with(true, [&os, &cfg] (bool& first) {

				            auto f = make_ready_future();

				            for (auto&& cfg_ref : ctx.db.local().get_config().values()) {

				            for (auto&& cfg_ref : cfg.values()) {

				                auto&& cfg = cfg_ref.get();

				                f = f.then([&os, &first, &cfg] {

				                    return get_config_swagger_entry(cfg.name(), std::string(cfg.desc()), cfg.type_name(), first, os);

				@@ -103,9 +102,9 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx

				        });

				    });

				    cs::find_config_id.set(r, [&ctx] (const_req r) {

				    cs::find_config_id.set(r, [&cfg] (const_req r) {

				        auto id = r.param["id"];

				        for (auto&& cfg_ref : ctx.db.local().get_config().values()) {

				        for (auto&& cfg_ref : cfg.values()) {

				            auto&& cfg = cfg_ref.get();

				            if (id == cfg.name()) {

				                return cfg.value_as_json();

									
										4

api/config.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2018 ScyllaDB

				 * Copyright (C) 2018-present ScyllaDB

				 */

				/*

				@@ -26,5 +26,5 @@

				namespace api {

				void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r);

				void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r, const db::config& cfg);

				}

									
										2

api/endpoint_snitch.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/endpoint_snitch.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/error_injection.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2020 ScyllaDB

				 * Copyright (C) 2020-present ScyllaDB

				 */

				/*

									
										2

api/error_injection.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2019 ScyllaDB

				 * Copyright (C) 2019-present ScyllaDB

				 */

				/*

									
										24

api/failure_detector.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -28,10 +28,10 @@ namespace api {

				namespace fd = httpd::failure_detector_json;

				void set_failure_detector(http_context& ctx, routes& r) {

				    fd::get_all_endpoint_states.set(r, [](std::unique_ptr<request> req) {

				void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {

				        std::vector<fd::endpoint_state> res;

				        for (auto i : gms::get_local_gossiper().endpoint_state_map) {

				        for (auto i : g.endpoint_state_map) {

				            fd::endpoint_state val;

				            val.addrs = boost::lexical_cast<std::string>(i.first);

				            val.is_alive = i.second.is_alive();

				@@ -52,14 +52,14 @@ void set_failure_detector(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				    fd::get_up_endpoint_count.set(r, [](std::unique_ptr<request> req) {

				        return gms::get_up_endpoint_count().then([](int res) {

				    fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {

				        return gms::get_up_endpoint_count(g).then([](int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    fd::get_down_endpoint_count.set(r, [](std::unique_ptr<request> req) {

				        return gms::get_down_endpoint_count().then([](int res) {

				    fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {

				        return gms::get_down_endpoint_count(g).then([](int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				@@ -70,8 +70,8 @@ void set_failure_detector(http_context& ctx, routes& r) {

				        });

				    });

				    fd::get_simple_states.set(r, [] (std::unique_ptr<request> req) {

				        return gms::get_simple_states().then([](const std::map<sstring, sstring>& map) {

				    fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {

				        return gms::get_simple_states(g).then([](const std::map<sstring, sstring>& map) {

				            return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(map));

				        });

				    });

				@@ -83,8 +83,8 @@ void set_failure_detector(http_context& ctx, routes& r) {

				        });

				    });

				    fd::get_endpoint_state.set(r, [](std::unique_ptr<request> req) {

				        return gms::get_endpoint_state(req->param["addr"]).then([](const sstring& state) {

				    fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {

				        return get_endpoint_state(g, req->param["addr"]).then([](const sstring& state) {

				            return make_ready_future<json::json_return_type>(state);

				        });

				    });

									
										12

api/failure_detector.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -23,8 +23,14 @@

				#include "api.hh"

				namespace api {

				namespace gms {

				void set_failure_detector(http_context& ctx, routes& r);

				class gossiper;

				}

				namespace api {

				void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g);

				}

									
										37

api/gossiper.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -26,43 +26,50 @@

				namespace api {

				using namespace json;

				void set_gossiper(http_context& ctx, routes& r) {

				    httpd::gossiper_json::get_down_endpoint.set(r, [] (const_req req) {

				        auto res = gms::get_local_gossiper().get_unreachable_members();

				void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {

				    httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {

				        auto res = g.get_unreachable_members();

				        return container_to_vec(res);

				    });

				    httpd::gossiper_json::get_live_endpoint.set(r, [] (const_req req) {

				        auto res = gms::get_local_gossiper().get_live_members();

				    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (const_req req) {

				        auto res = g.get_live_members();

				        return container_to_vec(res);

				    });

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [] (const_req req) {

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {

				        gms::inet_address ep(req.param["addr"]);

				        return gms::get_local_gossiper().get_endpoint_downtime(ep);

				        return g.get_endpoint_downtime(ep);

				    });

				    httpd::gossiper_json::get_current_generation_number.set(r, [] (std::unique_ptr<request> req) {

				    httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<request> req) {

				        gms::inet_address ep(req->param["addr"]);

				        return gms::get_local_gossiper().get_current_generation_number(ep).then([] (int res) {

				        return g.get_current_generation_number(ep).then([] (int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    httpd::gossiper_json::get_current_heart_beat_version.set(r, [] (std::unique_ptr<request> req) {

				    httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<request> req) {

				        gms::inet_address ep(req->param["addr"]);

				        return gms::get_local_gossiper().get_current_heart_beat_version(ep).then([] (int res) {

				        return g.get_current_heart_beat_version(ep).then([] (int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    httpd::gossiper_json::assassinate_endpoint.set(r, [](std::unique_ptr<request> req) {

				    httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<request> req) {

				        if (req->get_query_param("unsafe") != "True") {

				            return gms::get_local_gossiper().assassinate_endpoint(req->param["addr"]).then([] {

				            return g.assassinate_endpoint(req->param["addr"]).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        }

				        return gms::get_local_gossiper().unsafe_assassinate_endpoint(req->param["addr"]).then([] {

				        return g.unsafe_assassinate_endpoint(req->param["addr"]).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<request> req) {

				        gms::inet_address ep(req->param["addr"]);

				        return g.force_remove_endpoint(ep).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

									
										12

api/gossiper.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -23,8 +23,14 @@

				#include "api.hh"

				namespace api {

				namespace gms {

				void set_gossiper(http_context& ctx, routes& r);

				class gossiper;

				}

				namespace api {

				void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g);

				}

									
										93

api/hinted_handoff.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -19,15 +19,93 @@

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <algorithm>

				#include <vector>

				#include "hinted_handoff.hh"

				#include "api/api-doc/hinted_handoff.json.hh"

				#include "gms/inet_address.hh"

				#include "gms/gossiper.hh"

				#include "service/storage_proxy.hh"

				namespace api {

				using namespace json;

				namespace hh = httpd::hinted_handoff_json;

				void set_hinted_handoff(http_context& ctx, routes& r) {

				void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g) {

				    hh::create_hints_sync_point.set(r, [&ctx, &g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto parse_hosts_list = [&g] (sstring arg) {

				            std::vector<sstring> hosts_str = split(arg, ",");

				            std::vector<gms::inet_address> hosts;

				            hosts.reserve(hosts_str.size());

				            if (hosts_str.empty()) {

				                // No target_hosts specified means that we should wait for hints for all nodes to be sent

				                const auto members_set = g.get_live_members();

				                std::copy(members_set.begin(), members_set.end(), std::back_inserter(hosts));

				            } else {

				                for (const auto& host_str : hosts_str) {

				                    try {

				                        gms::inet_address host;

				                        host = gms::inet_address(host_str);

				                        hosts.push_back(host);

				                    } catch (std::exception& e) {

				                        throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));

				                    }

				                }

				            }

				            return hosts;

				        };

				        std::vector<gms::inet_address> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));

				        return ctx.sp.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {

				            return json::json_return_type(sync_point.encode());

				        });

				    });

				    hh::get_hints_sync_point.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        db::hints::sync_point sync_point;

				        const sstring encoded = req->get_query_param("id");

				        try {

				            sync_point = db::hints::sync_point::decode(encoded);

				        } catch (std::exception& e) {

				            throw httpd::bad_param_exception(format("Failed to parse the sync point description {}: {}", encoded, e.what()));

				        }

				        lowres_clock::time_point deadline;

				        const sstring timeout_str = req->get_query_param("timeout");

				        try {

				            deadline = [&] {

				                if (timeout_str.empty()) {

				                    // Empty string - don't wait at all, just check the status

				                    return lowres_clock::time_point::min();

				                } else {

				                    const auto timeout = std::stoll(timeout_str);

				                    if (timeout >= 0) {

				                        // Wait until the point is reached, or until `timeout` seconds elapse

				                        return lowres_clock::now() + std::chrono::seconds(timeout);

				                    } else {

				                        // Negative value indicates infinite timeout

				                        return lowres_clock::time_point::max();

				                    }

				                }

				            } ();

				        } catch (std::exception& e) {

				            throw httpd::bad_param_exception(format("Failed to parse the timeout parameter {}: {}", timeout_str, e.what()));

				        }

				        using return_type = hh::ns_get_hints_sync_point::get_hints_sync_point_return_type;

				        using return_type_wrapper = hh::ns_get_hints_sync_point::return_type_wrapper;

				        return ctx.sp.local().wait_for_hint_sync_point(std::move(sync_point), deadline).then([] {

				            return json::json_return_type(return_type_wrapper(return_type::DONE));

				        }).handle_exception_type([] (const timed_out_error&) {

				            return json::json_return_type(return_type_wrapper(return_type::IN_PROGRESS));

				        });

				    });

				    hh::list_endpoints_pending_hints.set(r, [] (std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				@@ -71,5 +149,16 @@ void set_hinted_handoff(http_context& ctx, routes& r) {

				    });

				}

				void unset_hinted_handoff(http_context& ctx, routes& r) {

				    hh::create_hints_sync_point.unset(r);

				    hh::get_hints_sync_point.unset(r);

				    hh::list_endpoints_pending_hints.unset(r);

				    hh::truncate_all_hints.unset(r);

				    hh::schedule_hint_delivery.unset(r);

				    hh::pause_hints_delivery.unset(r);

				    hh::get_create_hint_count.unset(r);

				    hh::get_not_stored_hints_count.unset(r);

				}

				}
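The `timeout` handling in `get_hints_sync_point` above maps three kinds of input to a deadline. The following is a minimal standalone sketch of that mapping, not the Scylla code itself: `sketch_clock` (a `std::chrono::steady_clock` alias standing in for `lowres_clock`) and the helper name `deadline_from_timeout` are assumptions made for illustration, and the parse error is reported with `std::invalid_argument` instead of `httpd::bad_param_exception`.

```cpp
#include <chrono>
#include <stdexcept>
#include <string>

// Assumption: stand-in for seastar::lowres_clock so the sketch builds without Seastar.
using sketch_clock = std::chrono::steady_clock;

// Hypothetical helper mirroring the lambda in the diff:
//   ""           -> time_point::min(): do not wait, only check the status
//   "N", N >= 0  -> now + N seconds
//   "N", N < 0   -> time_point::max(): wait without a time limit
// Throws std::invalid_argument if the text cannot be parsed as an integer.
sketch_clock::time_point deadline_from_timeout(const std::string& timeout_str,
                                               sketch_clock::time_point now) {
    if (timeout_str.empty()) {
        return sketch_clock::time_point::min();
    }
    long long timeout = 0;
    try {
        timeout = std::stoll(timeout_str);
    } catch (const std::exception& e) {
        throw std::invalid_argument(
            "Failed to parse the timeout parameter " + timeout_str + ": " + e.what());
    }
    if (timeout < 0) {
        return sketch_clock::time_point::max();
    }
    return now + std::chrono::seconds(timeout);
}
```

Passing `now` as a parameter is a choice made for this sketch so the mapping can be checked deterministically; the handler in the diff calls `lowres_clock::now()` directly.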

									
										13

api/hinted_handoff.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -23,8 +23,15 @@

				#include "api.hh"

				namespace api {

				namespace gms {

				void set_hinted_handoff(http_context& ctx, routes& r);

				class gossiper;

				}

				namespace api {

				void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g);

				void unset_hinted_handoff(http_context& ctx, routes& r);

				}

									
										3

api/lsa.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -26,6 +26,7 @@

				#include <seastar/http/exception.hh>

				#include "utils/logalloc.hh"

				#include "log.hh"

				#include "database.hh"

				namespace api {

									
										2

api/lsa.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										7

api/messaging_service.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -96,6 +96,10 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging

				        return c.get_stats().sent_messages;

				    }));

				    get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {

				        return c.get_stats().replied;

				    }));

				    get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {

				        // We don't have the same drop message mechanism

				        // as origin has.

				@@ -155,6 +159,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging

				void unset_messaging_service(http_context& ctx, routes& r) {

				    get_timeout_messages.unset(r);

				    get_sent_messages.unset(r);

				    get_replied_messages.unset(r);

				    get_dropped_messages.unset(r);

				    get_exception_messages.unset(r);

				    get_pending_messages.unset(r);

									
										2

api/messaging_service.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										8

api/storage_proxy.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -193,7 +193,7 @@ sum_timer_stats_storage_proxy(distributed<proxy>& d,

				    });

				}

				void set_storage_proxy(http_context& ctx, routes& r) {

				void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_service>& ss) {

				    sp::get_total_hints.set(r, [](std::unique_ptr<request> req)  {

				        //TBD

				        unimplemented();

				@@ -363,8 +363,8 @@ void set_storage_proxy(http_context& ctx, routes& r) {

				        return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_background);

				    });

				    sp::get_schema_versions.set(r, [](std::unique_ptr<request> req)  {

				        return service::get_local_storage_service().describe_schema_versions().then([] (auto result) {

				    sp::get_schema_versions.set(r, [&ss](std::unique_ptr<request> req)  {

				        return ss.local().describe_schema_versions().then([] (auto result) {

				            std::vector<sp::mapper_list> res;

				            for (auto e : result) {

				                sp::mapper_list entry;

									
										7

api/storage_proxy.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -21,10 +21,13 @@

				#pragma once

				#include <seastar/core/sharded.hh>

				#include "api.hh"

				namespace service { class storage_service; }

				namespace api {

				void set_storage_proxy(http_context& ctx, routes& r);

				void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_service>& ss);

				}

									
										467

api/storage_service.cc
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -23,22 +23,28 @@

				#include "api/api-doc/storage_service.json.hh"

				#include "db/config.hh"

				#include "db/schema_tables.hh"

				#include <optional>

				#include "utils/hash.hh"

				#include <sstream>

				#include <time.h>

				#include <algorithm>

				#include <boost/range/adaptor/map.hpp>

				#include <boost/range/adaptor/filtered.hpp>

				#include <boost/algorithm/string/trim_all.hpp>

				#include <boost/algorithm/string/case_conv.hpp>

				#include <boost/functional/hash.hpp>

				#include "service/storage_service.hh"

				#include "service/load_meter.hh"

				#include "db/commitlog/commitlog.hh"

				#include "gms/gossiper.hh"

				#include "db/system_keyspace.hh"

				#include "seastar/http/exception.hh"

				#include <seastar/core/coroutine.hh>

				#include "repair/repair.hh"

				#include "locator/snitch_base.hh"

				#include "column_family.hh"

				#include "log.hh"

				#include "release.hh"

				#include "sstables/compaction_manager.hh"

				#include "compaction/compaction_manager.hh"

				#include "sstables/sstables.hh"

				#include "database.hh"

				#include "db/extensions.hh"

				@@ -46,6 +52,13 @@

				#include "transport/controller.hh"

				#include "thrift/controller.hh"

				#include "locator/token_metadata.hh"

				#include "cdc/generation_service.hh"

				#include "service/storage_proxy.hh"

				#include "locator/abstract_replication_strategy.hh"

				#include "sstables_loader.hh"

				#include "db/view/view_builder.hh"

				extern logging::logger apilog;

				namespace api {

				@@ -71,7 +84,7 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp

				    r.rpc_endpoints = d._rpc_endpoints;

				    for (auto det : d._endpoint_details) {

				        ss::endpoint_detail ed;

				        ed.host = det._host;

				        ed.host = boost::lexical_cast<std::string>(det._host);

				        ed.datacenter = det._datacenter;

				        if (det._rack != "") {

				            ed.rack = det._rack;

				@@ -94,12 +107,59 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {

				    };

				}

				future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {

				    namespace cf = httpd::column_family_json;

				    return q.scatter().then([&q, legacy_request] {

				        return sleep(q.duration()).then([&q, legacy_request] {

				            return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {

				                apilog.debug("toppartitions query: processing results");

				                cf::toppartitions_query_results results;

				                results.read_cardinality = topk_results.read.size();

				                results.write_cardinality = topk_results.write.size();

				                for (auto& d: topk_results.read.top(q.list_size())) {

				                    cf::toppartitions_record r;

				                    r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);

				                    r.count = d.count;

				                    r.error = d.error;

				                    results.read.push(r);

				                }

				                for (auto& d: topk_results.write.top(q.list_size())) {

				                    cf::toppartitions_record r;

				                    r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);

				                    r.count = d.count;

				                    r.error = d.error;

				                    results.write.push(r);

				                }

				                return make_ready_future<json::json_return_type>(results);

				            });

				        });

				    });

				}

				future<json::json_return_type> set_tables_autocompaction(http_context& ctx, service::storage_service& ss, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {

				    if (tables.empty()) {

				        tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());

				    }

				    return service::get_local_storage_service().set_tables_autocompaction(keyspace, tables, enabled).then([]{

				    apilog.info("set_tables_autocompaction: enabled={} keyspace={} tables={}", enabled, keyspace, tables);

				    return do_with(keyspace, std::move(tables), [&ctx, enabled] (const sstring &keyspace, const std::vector<sstring>& tables) {

				        return ctx.db.invoke_on(0, [&ctx, &keyspace, &tables, enabled] (database& db) {

				            auto g = database::autocompaction_toggle_guard(db);

				            return ctx.db.invoke_on_all([&keyspace, &tables, enabled] (database& db) {

				                return parallel_for_each(tables, [&db, &keyspace, enabled] (const sstring& table) {

				                    column_family& cf = db.find_column_family(keyspace, table);

				                    if (enabled) {

				                        cf.enable_auto_compaction();

				                    } else {

				                        cf.disable_auto_compaction();

				                    }

				                    return make_ready_future<>();

				                });

				            }).finally([g = std::move(g)] {});

				        });

				    }).then([] {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				}

				@@ -156,10 +216,10 @@ void unset_rpc_controller(http_context& ctx, routes& r) {

				    ss::is_rpc_server_running.unset(r);

				}

				void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {

				    ss::repair_async.set(r, [&ctx, &ms](std::unique_ptr<request> req) {

				void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {

				    ss::repair_async.set(r, [&ctx, &repair](std::unique_ptr<request> req) {

				        static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",

				                "jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",

				                "jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",

				                "startToken", "endToken" };

				        std::unordered_map<sstring, sstring> options_map;

				        for (auto o : options) {

				@@ -173,7 +233,7 @@ void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>&

				        // returns immediately, not waiting for the repair to finish. The user

				        // then has other mechanisms to track the ongoing repair's progress,

				        // or stop it.

				        return repair_start(ctx.db, ms, validate_keyspace(ctx, req->param),

				        return repair_start(repair, validate_keyspace(ctx, req->param),

				                options_map).then([] (int i) {

				                    return make_ready_future<json::json_return_type>(i);

				                });

				@@ -225,20 +285,20 @@ void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>&

				            try {

				                res = fut.get0();

				            } catch (std::exception& e) {

				                return make_exception_future<json::json_return_type>(httpd::server_error_exception(e.what()));

				                return make_exception_future<json::json_return_type>(httpd::bad_param_exception(e.what()));

				            }

				            return make_ready_future<json::json_return_type>(json::json_return_type(res));

				        });

				    });

				    ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {

				        return repair_abort_all(service::get_local_storage_service().db()).then([] {

				    ss::force_terminate_all_repair_sessions.set(r, [&ctx](std::unique_ptr<request> req) {

				        return repair_abort_all(ctx.db).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {

				        return repair_abort_all(service::get_local_storage_service().db()).then([] {

				    ss::force_terminate_all_repair_sessions_new.set(r, [&ctx](std::unique_ptr<request> req) {

				        return repair_abort_all(ctx.db).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				@@ -254,9 +314,60 @@ void unset_repair(http_context& ctx, routes& r) {

				    ss::force_terminate_all_repair_sessions_new.unset(r);

				}

				void set_storage_service(http_context& ctx, routes& r) {

				void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader) {

				    ss::load_new_ss_tables.set(r, [&ctx, &sst_loader](std::unique_ptr<request> req) {

				        auto ks = validate_keyspace(ctx, req->param);

				        auto cf = req->get_query_param("cf");

				        auto stream = req->get_query_param("load_and_stream");

				        auto primary_replica = req->get_query_param("primary_replica_only");

				        boost::algorithm::to_lower(stream);

				        boost::algorithm::to_lower(primary_replica);

				        bool load_and_stream = stream == "true" || stream == "1";

				        bool primary_replica_only = primary_replica == "true" || primary_replica == "1";

				        // No need to add the keyspace, since all we want is to avoid always sending this to the same

				        // CPU. Even then I am being overzealous here. This is not something that happens all the time.

				        auto coordinator = std::hash<sstring>()(cf) % smp::count;

				        return sst_loader.invoke_on(coordinator,

				                [ks = std::move(ks), cf = std::move(cf),

				                load_and_stream, primary_replica_only] (sstables_loader& loader) {

				            return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only);

				        }).then_wrapped([] (auto&& f) {

				            if (f.failed()) {

				                auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());

				                return make_exception_future<json::json_return_type>(httpd::server_error_exception(msg));

				            }

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				}

				void unset_sstables_loader(http_context& ctx, routes& r) {

				    ss::load_new_ss_tables.unset(r);

				}

				void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb) {

				    ss::view_build_statuses.set(r, [&ctx, &vb] (std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        auto view = req->param["view"];

				        return vb.local().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));

				        });

				    });

				}

				void unset_view_builder(http_context& ctx, routes& r) {

				    ss::view_build_statuses.unset(r);

				}

				static future<json::json_return_type> describe_ring_as_json(sharded<service::storage_service>& ss, sstring keyspace) {

				    co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring(keyspace), token_range_endpoints_to_json));

				}

				void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs) {

				    ss::local_hostid.set(r, [](std::unique_ptr<request> req) {

				        return db::system_keyspace::get_local_host_id().then([](const utils::UUID& id) {

				        return db::system_keyspace::load_local_host_id().then([](const utils::UUID& id) {

				            return make_ready_future<json::json_return_type>(id.to_sstring());

				        });

				    });
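The `load_new_ss_tables` handler in the hunk above does two small things before dispatching: it reads `load_and_stream` and `primary_replica_only` as case-insensitive boolean flags, and it picks a coordinator shard by hashing the table name. A rough standalone sketch of both steps follows. The helper names `parse_bool_flag` and `pick_coordinator` are hypothetical, `std::string` stands in for `sstring`, and the `shard_count` parameter stands in for `smp::count`.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: lower-case the value, then accept "true" or "1".
// Anything else, including an empty string, counts as false.
bool parse_bool_flag(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "true" || value == "1";
}

// Hypothetical helper: map a table name to a shard index in [0, shard_count).
// shard_count stands in for seastar::smp::count and must be greater than zero.
std::size_t pick_coordinator(const std::string& table, std::size_t shard_count) {
    return std::hash<std::string>()(table) % shard_count;
}
```

As the comment in the diff notes, the hash only serves to avoid always sending the request to the same CPU; it is not meant to balance load precisely.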

				@@ -278,8 +389,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return ctx.db.local().commitlog()->active_config().commit_log_location;

				    });

				    ss::get_token_endpoint.set(r, [] (std::unique_ptr<request> req) {

				        return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_to_endpoint_map(), [](const auto& i) {

				    ss::get_token_endpoint.set(r, [&ss] (std::unique_ptr<request> req) {

				        return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_token_to_endpoint_map(), [](const auto& i) {

				            storage_service_json::mapper val;

				            val.key = boost::lexical_cast<std::string>(i.first);

				            val.value = boost::lexical_cast<std::string>(i.second);

				@@ -287,6 +398,56 @@ void set_storage_service(http_context& ctx, routes& r) {

				        }));

				    });

				    ss::toppartitions_generic.set(r, [&ctx] (std::unique_ptr<request> req) {

				        bool filters_provided = false;

				        std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};

				        if (req->query_parameters.contains("table_filters")) {

				            filters_provided = true;

				            auto filters = req->get_query_param("table_filters");

				            std::stringstream ss { filters };

				            std::string filter;

				            while (!filters.empty() && ss.good()) {

				                std::getline(ss, filter, ',');

				                table_filters.emplace(parse_fully_qualified_cf_name(filter));

				            }

				        }

				        std::unordered_set<sstring> keyspace_filters {};

				        if (req->query_parameters.contains("keyspace_filters")) {

				            filters_provided = true;

				            auto filters = req->get_query_param("keyspace_filters");

				            std::stringstream ss { filters };

				            std::string filter;

				            while (!filters.empty() && ss.good()) {

				                std::getline(ss, filter, ',');

				                keyspace_filters.emplace(std::move(filter));

				            }

				        }

				        // when the query is empty return immediately

				        if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {

				            apilog.debug("toppartitions query: processing results");

				            httpd::column_family_json::toppartitions_query_results results;

				            results.read_cardinality = 0;

				            results.write_cardinality = 0;

				            return make_ready_future<json::json_return_type>(results);

				        }

				        api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};

				        api::req_param<unsigned> capacity(*req, "capacity", 256);

				        api::req_param<unsigned> list_size(*req, "list_size", 10);

				        apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",

				            !table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.param, list_size.param, capacity.param);

				        return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {

				            return run_toppartitions_query(q, ctx);

				        });

				    });
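Review note: both filter loops in the handler above split a comma-separated query parameter with `std::getline`. A minimal standalone sketch of that pattern, lifted out of the handler, could look like the following. `split_filters` is a hypothetical name that does not appear in the patch, and `std::string` stands in for `sstring`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the patch): split a comma-separated
// query parameter into its items, mirroring the getline loop in the handler.
std::vector<std::string> split_filters(const std::string& filters) {
    std::vector<std::string> out;
    std::stringstream ss{filters};
    std::string filter;
    // Same guard as the handler: an empty parameter yields no items.
    while (!filters.empty() && ss.good()) {
        std::getline(ss, filter, ',');
        out.push_back(filter);
    }
    return out;
}
```

With this sketch, `split_filters("ks1.t1,ks2.t2")` yields two items and an empty parameter yields none. The empty case is the one the early-return check in the handler covers.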

				    ss::get_leaving_nodes.set(r, [&ctx](const_req req) {

				        return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());

				    });

				@@ -305,15 +466,15 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return container_to_vec(addr);

				    });

				    ss::get_release_version.set(r, [](const_req req) {

				        return service::get_local_storage_service().get_release_version();

				    ss::get_release_version.set(r, [&ss](const_req req) {

				        return ss.local().get_release_version();

				    });

				    ss::get_scylla_release_version.set(r, [](const_req req) {

				        return scylla_version();

				    });

				    ss::get_schema_version.set(r, [](const_req req) {

				        return service::get_local_storage_service().get_schema_version();

				    ss::get_schema_version.set(r, [&ss](const_req req) {

				        return ss.local().get_schema_version();

				    });

				    ss::get_all_data_file_locations.set(r, [&ctx](const_req req) {

				@@ -324,11 +485,11 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return ctx.db.local().get_config().saved_caches_directory();

				    });

				    ss::get_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<request> req) {

				    ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        std::vector<ss::maplist_mapper> res;

				        return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_range_to_address_map(keyspace),

				                [](const std::pair<dht::token_range, std::vector<gms::inet_address>>& entry){

				        return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_range_to_address_map(keyspace),

				                [](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){

				            ss::maplist_mapper m;

				            if (entry.first.start()) {

				                m.key.push(entry.first.start().value().value().to_sstring());

				@@ -355,13 +516,12 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				    ss::describe_any_ring.set(r, [&ctx](std::unique_ptr<request> req) {

				        return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(""), token_range_endpoints_to_json));

				    ss::describe_any_ring.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        return describe_ring_as_json(ss, "");

				    });

				    ss::describe_ring.set(r, [&ctx](std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().describe_ring(keyspace), token_range_endpoints_to_json));

				    ss::describe_ring.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        return describe_ring_as_json(ss, validate_keyspace(ctx, req->param));

				    });

				    ss::get_host_id_map.set(r, [&ctx](const_req req) {

				@@ -386,21 +546,24 @@ void set_storage_service(http_context& ctx, routes& r) {

				        });

				    });

				    ss::get_current_generation_number.set(r, [](std::unique_ptr<request> req) {

				    ss::get_current_generation_number.set(r, [&g](std::unique_ptr<request> req) {

				        gms::inet_address ep(utils::fb_utilities::get_broadcast_address());

				        return gms::get_local_gossiper().get_current_generation_number(ep).then([](int res) {

				        return g.get_current_generation_number(ep).then([](int res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    ss::get_natural_endpoints.set(r, [&ctx](const_req req) {

				    ss::get_natural_endpoints.set(r, [&ctx, &ss](const_req req) {

				        auto keyspace = validate_keyspace(ctx, req.param);

				        return container_to_vec(service::get_local_storage_service().get_natural_endpoints(keyspace, req.get_query_param("cf"),

				        return container_to_vec(ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"),

				                req.get_query_param("key")));

				    });

				    ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {

				    ss::cdc_streams_check_and_repair.set(r, [&ctx, &cdc_gs] (std::unique_ptr<request> req) {

				        if (!cdc_gs.local_is_initialized()) {

				            throw std::runtime_error("get_cdc_generation_service: not initialized yet");

				        }

				        return cdc_gs.local().check_and_repair_cdc_streams().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				@@ -411,40 +574,51 @@ void set_storage_service(http_context& ctx, routes& r) {

				        if (column_families.empty()) {

				            column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());

				        }

				        return ctx.db.invoke_on_all([keyspace, column_families] (database& db) {

				            std::vector<column_family*> column_families_vec;

				            for (auto cf : column_families) {

				                column_families_vec.push_back(&db.find_column_family(keyspace, cf));

				            }

				            return parallel_for_each(column_families_vec, [] (column_family* cf) {

				                    return cf->compact_all_sstables();

				        return ctx.db.invoke_on_all([keyspace, column_families] (database& db) -> future<> {

				            auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& cf_name) {

				                return db.find_uuid(keyspace, cf_name);

				            }));

				            // major compact smaller tables first, to increase chances of success if low on space.

				            std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {

				                return db.find_column_family(id).get_stats().live_disk_space_used;

				            });

				            // as a table can be dropped during loop below, let's find it before issuing major compaction request.

				            for (auto& id : table_ids) {

				                co_await db.find_column_family(id).compact_all_sstables();

				            }

				            co_return;

				        }).then([]{

				                return make_ready_future<json::json_return_type>(json_void());

				        });

				    });
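Review note: the new compaction loop orders tables by `live_disk_space_used` so that the smallest table is processed first. The sketch below shows that ordering in isolation. It is only an illustration under stated assumptions: `table_info` and `smallest_first` are hypothetical names, and it uses `std::sort` with a comparator instead of the `std::ranges::sort` projection in the patch, so it also builds with pre-C++20 compilers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a table and its live_disk_space_used statistic.
struct table_info {
    std::string name;
    std::uint64_t live_disk_space_used;
};

// Return table names ordered so that the smallest table comes first.
std::vector<std::string> smallest_first(std::vector<table_info> tables) {
    std::sort(tables.begin(), tables.end(), [] (const table_info& a, const table_info& b) {
        return a.live_disk_space_used < b.live_disk_space_used;
    });
    std::vector<std::string> names;
    for (const auto& t : tables) {
        names.push_back(t.name);
    }
    return names;
}
```

Processing in this order means the operations that need the least temporary disk space run first, which is the rationale the in-code comment gives.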

				    ss::force_keyspace_cleanup.set(r, [&ctx](std::unique_ptr<request> req) {

				    ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        auto column_families = split_cf(req->get_query_param("cf"));

				        if (column_families.empty()) {

				            column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());

				        }

				        return service::get_local_storage_service().is_cleanup_allowed(keyspace).then([&ctx, keyspace,

				        return ss.local().is_cleanup_allowed(keyspace).then([&ctx, keyspace,

				                column_families = std::move(column_families)] (bool is_cleanup_allowed) mutable {

				            if (!is_cleanup_allowed) {

				                return make_exception_future<json::json_return_type>(

				                        std::runtime_error("Can not perform cleanup operation when topology changes"));

				            }

				            return ctx.db.invoke_on_all([keyspace, column_families] (database& db) {

				                std::vector<column_family*> column_families_vec;

				                auto& cm = db.get_compaction_manager();

				                for (auto cf : column_families) {

				                    column_families_vec.push_back(&db.find_column_family(keyspace, cf));

				                }

				                return parallel_for_each(column_families_vec, [&cm, &db] (column_family* cf) {

				                    return cm.perform_cleanup(db, cf);

				            return ctx.db.invoke_on_all([keyspace, column_families] (database& db) -> future<> {

				                auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& table_name) {

				                    return db.find_uuid(keyspace, table_name);

				                }));

				                // cleanup smaller tables first, to increase chances of success if low on space.

				                std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {

				                    return db.find_column_family(id).get_stats().live_disk_space_used;

				                });

				                auto& cm = db.get_compaction_manager();

				                // as a table can be dropped during loop below, let's find it before issuing the cleanup request.

				                for (auto& id : table_ids) {

				                    table& t = db.find_column_family(id);

				                    co_await cm.perform_cleanup(db, &t);

				                }

				                co_return;

				            }).then([]{

				                return make_ready_future<json::json_return_type>(0);

				            });

				@@ -481,34 +655,49 @@ void set_storage_service(http_context& ctx, routes& r) {

				    });

				    ss::decommission.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().decommission().then([] {

				    ss::decommission.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().decommission().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::move.set(r, [] (std::unique_ptr<request> req) {

				    ss::move.set(r, [&ss] (std::unique_ptr<request> req) {

				        auto new_token = req->get_query_param("new_token");

				        return service::get_local_storage_service().move(new_token).then([] {

				        return ss.local().move(new_token).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::remove_node.set(r, [](std::unique_ptr<request> req) {

				    ss::remove_node.set(r, [&ss](std::unique_ptr<request> req) {

				        auto host_id = req->get_query_param("host_id");

				        return service::get_local_storage_service().removenode(host_id).then([] {

				        std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");

				        auto ignore_nodes = std::list<gms::inet_address>();

				        for (std::string n : ignore_nodes_strs) {

				            try {

				                std::replace(n.begin(), n.end(), '\"', ' ');

				                std::replace(n.begin(), n.end(), '\'', ' ');

				                boost::trim_all(n);

				                if (!n.empty()) {

				                    auto node = gms::inet_address(n);

				                    ignore_nodes.push_back(node);

				                }

				            } catch (...) {

				                throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));

				            }

				        }

				        return ss.local().removenode(host_id, std::move(ignore_nodes)).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });
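Review note: the `ignore_nodes` handling above replaces quote characters with spaces, trims each entry, and skips empty ones before building addresses. A rough standalone sketch of that normalisation follows. `parse_ignore_nodes` is a hypothetical name, plain strings stand in for `gms::inet_address`, and a simple end-trim stands in for `boost::trim_all` (which also collapses inner whitespace).

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: normalise a comma-separated list of node addresses.
// Quotes become spaces, surrounding whitespace is trimmed, and empty
// entries are dropped, roughly as in the remove_node handler.
std::vector<std::string> parse_ignore_nodes(const std::string& param) {
    std::vector<std::string> nodes;
    std::stringstream ss{param};
    std::string n;
    while (std::getline(ss, n, ',')) {
        std::replace(n.begin(), n.end(), '"', ' ');
        std::replace(n.begin(), n.end(), '\'', ' ');
        const auto first = n.find_first_not_of(" \t");
        if (first == std::string::npos) {
            continue; // entry was empty or only quotes/whitespace
        }
        const auto last = n.find_last_not_of(" \t");
        nodes.push_back(n.substr(first, last - first + 1));
    }
    return nodes;
}
```

The real handler additionally converts each entry to an address and reports a parse failure with the offending entry. That step is omitted here.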

				    ss::get_removal_status.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().get_removal_status().then([] (auto status) {

				    ss::get_removal_status.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().get_removal_status().then([] (auto status) {

				            return make_ready_future<json::json_return_type>(status);

				        });

				    });

				    ss::force_remove_completion.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().force_remove_completion().then([] {

				    ss::force_remove_completion.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().force_remove_completion().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				@@ -532,20 +721,20 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				    ss::get_operation_mode.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().get_operation_mode().then([] (auto mode) {

				    ss::get_operation_mode.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().get_operation_mode().then([] (auto mode) {

				            return make_ready_future<json::json_return_type>(mode);

				        });

				    });

				    ss::is_starting.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().is_starting().then([] (auto starting) {

				    ss::is_starting.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().is_starting().then([] (auto starting) {

				            return make_ready_future<json::json_return_type>(starting);

				        });

				    });

				    ss::get_drain_progress.set(r, [](std::unique_ptr<request> req) {

				        return service::get_storage_service().map_reduce(adder<service::storage_service::drain_progress>(), [] (auto& ss) {

				    ss::get_drain_progress.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.map_reduce(adder<service::storage_service::drain_progress>(), [] (auto& ss) {

				            return ss.get_drain_progress();

				        }).then([] (auto&& progress) {

				            auto progress_str = format("Drained {}/{} ColumnFamilies", progress.remaining_cfs, progress.total_cfs);

				@@ -553,8 +742,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        });

				    });

				    ss::drain.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().drain().then([] {

				    ss::drain.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().drain().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				@@ -585,20 +774,20 @@ void set_storage_service(http_context& ctx, routes& r) {

				        });

				    });

				    ss::stop_gossiping.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().stop_gossiping().then([] {

				    ss::stop_gossiping.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().stop_gossiping().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::start_gossiping.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().start_gossiping().then([] {

				    ss::start_gossiping.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().start_gossiping().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::is_gossip_running.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().is_gossip_running().then([] (bool running){

				    ss::is_gossip_running.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().is_gossip_running().then([] (bool running){

				            return make_ready_future<json::json_return_type>(running);

				        });

				    });

				@@ -610,8 +799,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				    ss::is_initialized.set(r, [](std::unique_ptr<request> req) {

				        return service::get_local_storage_service().is_initialized().then([] (bool initialized) {

				    ss::is_initialized.set(r, [&ss](std::unique_ptr<request> req) {

				        return ss.local().is_initialized().then([] (bool initialized) {

				            return make_ready_future<json::json_return_type>(initialized);

				        });

				    });

				@@ -620,8 +809,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				    ss::is_joined.set(r, [] (std::unique_ptr<request> req) {

				        return make_ready_future<json::json_return_type>(service::get_local_storage_service().is_joined());

				    ss::is_joined.set(r, [&ss] (std::unique_ptr<request> req) {

				        return make_ready_future<json::json_return_type>(ss.local().is_joined());

				    });

				    ss::set_stream_throughput_mb_per_sec.set(r, [](std::unique_ptr<request> req) {

				@@ -649,10 +838,10 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				    ss::is_incremental_backups_enabled.set(r, [](std::unique_ptr<request> req) {

				    ss::is_incremental_backups_enabled.set(r, [&ctx](std::unique_ptr<request> req) {

				        // If this is issued in parallel with an ongoing change, we may see values not agreeing.

				        // Reissuing is asking for trouble, so we will just return true upon seeing any true value.

				        return service::get_local_storage_service().db().map_reduce(adder<bool>(), [] (database& db) {

				        return ctx.db.map_reduce(adder<bool>(), [] (database& db) {

				            for (auto& pair: db.get_keyspaces()) {

				                auto& ks = pair.second;

				                if (ks.incremental_backups_enabled()) {

				@@ -665,10 +854,10 @@ void set_storage_service(http_context& ctx, routes& r) {

				        });

				    });

				    ss::set_incremental_backups_enabled.set(r, [](std::unique_ptr<request> req) {

				    ss::set_incremental_backups_enabled.set(r, [&ctx](std::unique_ptr<request> req) {

				        auto val_str = req->get_query_param("value");

				        bool value = (val_str == "True") || (val_str == "true") || (val_str == "1");

				        return service::get_local_storage_service().db().invoke_on_all([value] (database& db) {

				        return ctx.db.invoke_on_all([value] (database& db) {

				            db.set_enable_incremental_backups(value);

				            // Change both KS and CF, so they are in sync

				@@ -686,9 +875,9 @@ void set_storage_service(http_context& ctx, routes& r) {

				        });

				    });

				    ss::rebuild.set(r, [](std::unique_ptr<request> req) {

				    ss::rebuild.set(r, [&ss](std::unique_ptr<request> req) {

				        auto source_dc = req->get_query_param("source_dc");

				        return service::get_local_storage_service().rebuild(std::move(source_dc)).then([] {

				        return ss.local().rebuild(std::move(source_dc)).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				@@ -713,23 +902,6 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				    ss::load_new_ss_tables.set(r, [&ctx](std::unique_ptr<request> req) {

				        auto ks = validate_keyspace(ctx, req->param);

				        auto cf = req->get_query_param("cf");

				        // No need to add the keyspace, since all we want is to avoid always sending this to the same

				        // CPU. Even then I am being overzealous here. This is not something that happens all the time.

				        auto coordinator = std::hash<sstring>()(cf) % smp::count;

				        return service::get_storage_service().invoke_on(coordinator, [ks = std::move(ks), cf = std::move(cf)] (service::storage_service& s) {

				            return s.load_new_sstables(ks, cf);

				        }).then_wrapped([] (auto&& f) {

				            if (f.failed()) {

				                auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());

				                return make_exception_future<json::json_return_type>(httpd::server_error_exception(msg));

				            }

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::sample_key_range.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				@@ -740,7 +912,7 @@ void set_storage_service(http_context& ctx, routes& r) {

				    ss::reset_local_schema.set(r, [](std::unique_ptr<request> req) {

				        // FIXME: We should truncate schema tables if more than one node in the cluster.

				        auto& sp = service::get_storage_proxy();

				        auto& fs = service::get_local_storage_service().features();

				        auto& fs = sp.local().features();

				        return db::schema_tables::recalculate_schema_version(sp, fs).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				@@ -776,6 +948,7 @@ void set_storage_service(http_context& ctx, routes& r) {

				        res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();

				        res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;

				        res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();

				        res.fast = tracing::tracing::get_local_tracing_instance().ignore_trace_events_enabled();

				        return res;

				    });

				@@ -783,8 +956,9 @@ void set_storage_service(http_context& ctx, routes& r) {

				        auto enable = req->get_query_param("enable");

				        auto ttl = req->get_query_param("ttl");

				        auto threshold = req->get_query_param("threshold");

				        auto fast = req->get_query_param("fast");

				        try {

				            return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {

				            return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {

				                if (threshold != "") {

				                    local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));

				                }

				@@ -794,6 +968,9 @@ void set_storage_service(http_context& ctx, routes& r) {

				                if (enable != "") {

				                    local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);

				                }

				                if (fast != "") {

				                    local_tracing.set_ignore_trace_events(strcasecmp(fast.c_str(), "true") == 0);

				                }

				            }).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				@@ -802,18 +979,18 @@ void set_storage_service(http_context& ctx, routes& r) {

				        }

				    });

				    ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				    ss::enable_auto_compaction.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        auto tables = split_cf(req->get_query_param("cf"));

				        return set_tables_autocompaction(ctx, keyspace, tables, true);

				        return set_tables_autocompaction(ctx, ss.local(), keyspace, tables, true);

				    });

				    ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {

				    ss::disable_auto_compaction.set(r, [&ctx, &ss](std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        auto tables = split_cf(req->get_query_param("cf"));

				        return set_tables_autocompaction(ctx, keyspace, tables, false);

				        return set_tables_autocompaction(ctx, ss.local(), keyspace, tables, false);

				    });

				    ss::deliver_hints.set(r, [](std::unique_ptr<request> req) {

				@@ -823,12 +1000,12 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				      });

				    ss::get_cluster_name.set(r, [](const_req req) {

				        return gms::get_local_gossiper().get_cluster_name();

				    ss::get_cluster_name.set(r, [&g](const_req req) {

				        return g.get_cluster_name();

				    });

				    ss::get_partitioner_name.set(r, [](const_req req) {

				        return gms::get_local_gossiper().get_partitioner_name();

				    ss::get_partitioner_name.set(r, [&g](const_req req) {

				        return g.get_partitioner_name();

				    });

				    ss::get_tombstone_warn_threshold.set(r, [](std::unique_ptr<request> req) {

				@@ -881,8 +1058,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return get_cf_stats(ctx, &column_family_stats::live_disk_space_used);

				    });

				    ss::get_exceptions.set(r, [](const_req req) {

				        return service::get_local_storage_service().get_exception_count();

				    ss::get_exceptions.set(r, [&ss](const_req req) {

				        return ss.local().get_exception_count();

				    });

				    ss::get_total_hints_in_progress.set(r, [](std::unique_ptr<request> req) {

				@@ -897,30 +1074,21 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(0);

				    });

				    ss::get_ownership.set(r, [] (std::unique_ptr<request> req) {

				        return service::get_local_storage_service().get_ownership().then([] (auto&& ownership) {

				    ss::get_ownership.set(r, [&ss] (std::unique_ptr<request> req) {

				        return ss.local().get_ownership().then([] (auto&& ownership) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));

				        });

				    });

				    ss::get_effective_ownership.set(r, [&ctx] (std::unique_ptr<request> req) {

				    ss::get_effective_ownership.set(r, [&ctx, &ss] (std::unique_ptr<request> req) {

				        auto keyspace_name = req->param["keyspace"] == "null" ? "" : validate_keyspace(ctx, req->param);

				        return service::get_local_storage_service().effective_ownership(keyspace_name).then([] (auto&& ownership) {

				        return ss.local().effective_ownership(keyspace_name).then([] (auto&& ownership) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));

				        });

				    });

				    ss::view_build_statuses.set(r, [&ctx] (std::unique_ptr<request> req) {

				        auto keyspace = validate_keyspace(ctx, req->param);

				        auto view = req->param["view"];

				        return service::get_local_storage_service().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));

				        });

				    });

				    ss::sstable_info.set(r, [&ctx] (std::unique_ptr<request> req) {

				        auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;

				        auto cf = api::req_param<sstring>(*req, "cf", {}).value;

				@@ -930,7 +1098,7 @@ void set_storage_service(http_context& ctx, routes& r) {

				        using table_sstables_list = std::vector<ss::table_sstables>;

				        return do_with(table_sstables_list{}, [ks, cf, &ctx](table_sstables_list& dst) {

				            return service::get_local_storage_service().db().map_reduce([&dst](table_sstables_list&& res) {

				            return ctx.db.map_reduce([&dst](table_sstables_list&& res) {

				                for (auto&& t : res) {

				                    auto i = std::find_if(dst.begin(), dst.end(), [&t](const ss::table_sstables& t2) {

				                        return t.keyspace() == t2.keyspace() && t.table() == t2.table();

				@@ -963,7 +1131,7 @@ void set_storage_service(http_context& ctx, routes& r) {

				                        tst.keyspace = schema->ks_name();

				                        tst.table = schema->cf_name();

				                        for (auto sstable : *t->get_sstables_including_compacted_undeleted()) {

				                        for (auto sstables = t->get_sstables_including_compacted_undeleted(); auto sstable : *sstables) {

				                            auto ts = db_clock::to_time_t(sstable->data_file_write_time());

				                            ::tm t;

				                            ::gmtime_r(&ts, &t);

				@@ -1046,7 +1214,6 @@ void set_storage_service(http_context& ctx, routes& r) {

				            });

				        });

				    });

				}

				void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl) {

				@@ -1088,14 +1255,17 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				    });

				    ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) {

				        apilog.debug("take_snapshot: {}", req->query_parameters);

				        auto tag = req->get_query_param("tag");

				        auto column_families = split(req->get_query_param("cf"), ",");

				        auto sfopt = req->get_query_param("sf");

				        auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);

				        std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");

				        auto resp = make_ready_future<>();

				        if (column_families.empty()) {

				            resp = snap_ctl.local().take_snapshot(tag, keynames);

				            resp = snap_ctl.local().take_snapshot(tag, keynames, sf);

				        } else {

				            if (keynames.empty()) {

				                throw httpd::bad_param_exception("The keyspace of column families must be specified");

				@@ -1103,7 +1273,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				            if (keynames.size() > 1) {

				                throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");

				            }

				            resp = snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag);

				            resp = snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);

				        }

				        return resp.then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				@@ -1127,7 +1297,28 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				    });

				    ss::scrub.set(r, wrap_ks_cf(ctx, [&snap_ctl] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {

				        const auto skip_corrupted = req_param<bool>(*req, "skip_corrupted", false);

				        auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				        const sstring scrub_mode_str = req_param<sstring>(*req, "scrub_mode", "");

				        if (scrub_mode_str == "") {

				            const auto skip_corrupted = req_param<bool>(*req, "skip_corrupted", false);

				            if (skip_corrupted) {

				                scrub_mode = sstables::compaction_type_options::scrub::mode::skip;

				            }

				        } else {

				            if (scrub_mode_str == "ABORT") {

				                scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				            } else if (scrub_mode_str == "SKIP") {

				                scrub_mode = sstables::compaction_type_options::scrub::mode::skip;

				            } else if (scrub_mode_str == "SEGREGATE") {

				                scrub_mode = sstables::compaction_type_options::scrub::mode::segregate;

				            } else if (scrub_mode_str == "VALIDATE") {

				                scrub_mode = sstables::compaction_type_options::scrub::mode::validate;

				            } else {

				                throw std::invalid_argument(fmt::format("Unknown argument for 'scrub_mode' parameter: {}", scrub_mode_str));

				            }

				        }

				        auto f = make_ready_future<>();

				        if (!req_param<bool>(*req, "disable_snapshot", false)) {

				@@ -1137,12 +1328,12 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				            });

				        }

				        return f.then([&ctx, keyspace, column_families, skip_corrupted] {

				        return f.then([&ctx, keyspace, column_families, scrub_mode] {

				            return ctx.db.invoke_on_all([=] (database& db) {

				                return do_for_each(column_families, [=, &db](sstring cfname) {

				                    auto& cm = db.get_compaction_manager();

				                    auto& cf = db.find_column_family(keyspace, cfname);

				                    return cm.perform_sstable_scrub(&cf, skip_corrupted);

				                    return cm.perform_sstable_scrub(&cf, scrub_mode);

				                });

				            });

				        }).then([]{
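The `scrub_mode` handling in the hunk above keeps the legacy `skip_corrupted` flag working when `scrub_mode` is not supplied. A minimal standalone sketch of that resolution logic follows. The `scrub_mode` enum and `resolve_scrub_mode()` are hypothetical stand-ins written for illustration, not names from the ScyllaDB tree.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for sstables::compaction_type_options::scrub::mode.
enum class scrub_mode { abort, skip, segregate, validate };

// Resolve the effective scrub mode from the two query parameters.
// An empty scrub_mode_str means the parameter was not supplied; in that
// case the legacy skip_corrupted flag decides between abort and skip.
// When scrub_mode_str is supplied, skip_corrupted is ignored.
inline scrub_mode resolve_scrub_mode(const std::string& scrub_mode_str, bool skip_corrupted) {
    if (scrub_mode_str.empty()) {
        return skip_corrupted ? scrub_mode::skip : scrub_mode::abort;
    }
    if (scrub_mode_str == "ABORT") {
        return scrub_mode::abort;
    }
    if (scrub_mode_str == "SKIP") {
        return scrub_mode::skip;
    }
    if (scrub_mode_str == "SEGREGATE") {
        return scrub_mode::segregate;
    }
    if (scrub_mode_str == "VALIDATE") {
        return scrub_mode::validate;
    }
    throw std::invalid_argument("Unknown argument for 'scrub_mode' parameter: " + scrub_mode_str);
}
```

Under this sketch, `resolve_scrub_mode("", true)` yields `scrub_mode::skip`, while any explicit mode string takes precedence over `skip_corrupted`. The comparison is case-sensitive, as in the hunk.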

									
										28

api/storage_service.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -23,16 +23,35 @@

				#include <seastar/core/sharded.hh>

				#include "api.hh"

				#include "db/data_listeners.hh"

				namespace cql_transport { class controller; }

				class thrift_controller;

				namespace db { class snapshot_ctl; }

				namespace db {

				class snapshot_ctl;

				namespace view {

				class view_builder;

				}

				}

				namespace netw { class messaging_service; }

				class repair_service;

				namespace cdc { class generation_service; }

				class sstables_loader;

				namespace gms {

				class gossiper;

				}

				namespace api {

				void set_storage_service(http_context& ctx, routes& r);

				void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms);

				void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs);

				void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);

				void unset_sstables_loader(http_context& ctx, routes& r);

				void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb);

				void unset_view_builder(http_context& ctx, routes& r);

				void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair);

				void unset_repair(http_context& ctx, routes& r);

				void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl);

				void unset_transport_controller(http_context& ctx, routes& r);

				@@ -40,5 +59,6 @@ void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);

				void unset_rpc_controller(http_context& ctx, routes& r);

				void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);

				void unset_snapshot(http_context& ctx, routes& r);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

				}

									
										2

api/stream_manager.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										2

api/stream_manager.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										15

api/system.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -25,6 +25,9 @@

				#include <seastar/core/reactor.hh>

				#include <seastar/http/exception.hh>

				#include "log.hh"

				#include "database.hh"

				extern logging::logger apilog;

				namespace api {

				@@ -70,6 +73,16 @@ void set_system(http_context& ctx, routes& r) {

				        }

				        return json::json_void();

				    });

				    hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {

				        apilog.info("Dropping sstable caches");

				        return ctx.db.invoke_on_all([] (database& db) {

				            return db.drop_caches();

				        }).then([] {

				            apilog.info("Caches dropped");

				            return json::json_return_type(json::json_void());

				        });

				    });

				}

				}

									
										2

api/system.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

									
										208

atomic_cell.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2018 ScyllaDB

				 * Copyright (C) 2018-present ScyllaDB

				 */

				/*

				@@ -24,142 +24,125 @@

				#include "counters.hh"

				#include "types.hh"

				/// LSA mirator for cells with irrelevant type

				///

				///

				const data::type_imr_descriptor& no_type_imr_descriptor() {

				    static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());

				    return state;

				}

				atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {

				    auto& imr_data = no_type_imr_descriptor();

				    return atomic_cell(

				            imr_data.type_info(),

				            imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_dead(timestamp, deletion_time);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, single_fragment_range(value));

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {

				    return atomic_cell_type::make_live(timestamp, fragment_range(value));

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, value);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)

				{

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, value);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,

				                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,

				                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {

				    return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,

				                             gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, value, expiry, ttl);

				}

				atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

				                                   gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)

				{

				    auto& imr_data = type.imr_state();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live(timestamp, value, expiry, ttl);

				}

				atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {

				    auto& imr_data = no_type_imr_descriptor();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())

				    );

				    return atomic_cell_type::make_live_counter_update(timestamp, value);

				}

				atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {

				    auto& imr_data = no_type_imr_descriptor();

				    return atomic_cell(

				        imr_data.type_info(),

				        imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())

				    );

				}

				static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)

				{

				    using imr_object_type = imr::utils::object<data::cell::structure>;

				    // If the cell doesn't own any memory it is trivial and can be copied with

				    // memcpy.

				    auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);

				    if (!f.template get<data::cell::tags::external_data>()) {

				        data::cell::context ctx(f, imr_data.type_info());

				        // XXX: We may be better off storing the total cell size in memory. Measure!

				        auto size = data::cell::structure::serialized_object_size(ptr, ctx);

				        return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {

				            std::copy_n(ptr, size, dst);

				        }, &imr_data.lsa_migrator());

				    }

				    return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());

				    return atomic_cell_type::make_live_uninitialized(timestamp, size);

				}

				atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)

				    : atomic_cell(type.imr_state().type_info(),

				                  copy_cell(type.imr_state(), other._view.raw_pointer()))

				{ }

				    : _data(other._view) {

				    set_view(_data);

				}

				// Based on:

				//  - org.apache.cassandra.db.AbstractCell#reconcile()

				//  - org.apache.cassandra.db.BufferExpiringCell#reconcile()

				//  - org.apache.cassandra.db.BufferDeletedCell#reconcile()

				std::strong_ordering

				compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {

				    if (left.timestamp() != right.timestamp()) {

				        return left.timestamp() <=> right.timestamp();

				    }

				    if (left.is_live() != right.is_live()) {

				        return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;

				    }

				    if (left.is_live()) {

				        auto c = compare_unsigned(left.value(), right.value()) <=> 0;

				        if (c != 0) {

				            return c;

				        }

				        if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {

				            // prefer expiring cells.

				            return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;

				        }

				        if (left.is_live_and_has_ttl()) {

				            if (left.expiry() != right.expiry()) {

				                return left.expiry() <=> right.expiry();

				            } else {

				                // prefer the cell that was written later,

				                // so it survives longer after it expires, until purged.

				                return right.ttl() <=> left.ttl();

				            }

				        }

				    } else {

				        // Both are deleted

				        // Origin compares big-endian serialized deletion time. That's because it

				        // delegates to AbstractCell.reconcile() which compares values after

				        // comparing timestamps, which in case of deleted cells will hold

				        // serialized expiry.

				        return (uint64_t) left.deletion_time().time_since_epoch().count()

				                <=> (uint64_t) right.deletion_time().time_since_epoch().count();

				    }

				    return std::strong_ordering::equal;

				}
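The tie-breaking order in `compare_atomic_cell_for_merge()` can be easier to follow on a simplified model. The sketch below reimplements the same sequence of checks on a hypothetical `cell` struct with plain integer fields. It is not ScyllaDB's `atomic_cell_view`, and it returns an `int` (negative, zero, positive) instead of `std::strong_ordering` so that it also builds as C++17. A positive result means the left cell is preferred.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified cell used only for illustration.
struct cell {
    int64_t timestamp = 0;
    bool live = false;
    std::vector<uint8_t> value;   // meaningful for live cells only
    bool has_ttl = false;
    int64_t expiry = 0;           // meaningful when has_ttl is true
    int64_t ttl = 0;              // meaningful when has_ttl is true
    int64_t deletion_time = 0;    // meaningful for dead cells only
};

inline cell live_cell(int64_t ts, std::vector<uint8_t> v) {
    cell c;
    c.timestamp = ts;
    c.live = true;
    c.value = std::move(v);
    return c;
}

inline cell expiring_cell(int64_t ts, std::vector<uint8_t> v, int64_t expiry, int64_t ttl) {
    cell c = live_cell(ts, std::move(v));
    c.has_ttl = true;
    c.expiry = expiry;
    c.ttl = ttl;
    return c;
}

inline cell dead_cell(int64_t ts, int64_t deletion_time) {
    cell c;
    c.timestamp = ts;
    c.deletion_time = deletion_time;
    return c;
}

template <typename T>
int three_way(const T& a, const T& b) {
    return a < b ? -1 : (b < a ? 1 : 0);
}

// Same order of checks as the function above: timestamp, liveness,
// value bytes, presence of TTL, expiry, then TTL (reversed).
inline int compare_for_merge(const cell& l, const cell& r) {
    if (l.timestamp != r.timestamp) {
        return three_way(l.timestamp, r.timestamp);
    }
    if (l.live != r.live) {
        return l.live ? -1 : 1;   // a dead cell beats a live one
    }
    if (l.live) {
        // std::vector<uint8_t> compares lexicographically on unsigned bytes.
        if (int c = three_way(l.value, r.value)) {
            return c;
        }
        if (l.has_ttl != r.has_ttl) {
            return l.has_ttl ? 1 : -1;   // prefer expiring cells
        }
        if (l.has_ttl) {
            if (l.expiry != r.expiry) {
                return three_way(l.expiry, r.expiry);
            }
            return three_way(r.ttl, l.ttl);   // smaller TTL means written later
        }
        return 0;
    }
    // Both dead: compare deletion times as unsigned values.
    return three_way(static_cast<uint64_t>(l.deletion_time),
                     static_cast<uint64_t>(r.deletion_time));
}
```

For example, with equal timestamps `compare_for_merge(live_cell(1, {1}), dead_cell(1, 100))` is negative, so the tombstone is kept.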

				atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {

				    if (!_data.get()) {

				    if (_data.empty()) {

				        return atomic_cell_or_collection();

				    }

				    auto& imr_data = type.imr_state();

				    return atomic_cell_or_collection(

				        copy_cell(imr_data, _data.get())

				    );

				    return atomic_cell_or_collection(managed_bytes(_data));

				}

				atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)

				    : _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))

				    : _data(acv._view)

				{

				}

				bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const

				{

				    auto ptr_a = _data.get();

				    auto ptr_b = other._data.get();

				    if (!ptr_a || !ptr_b) {

				        return !ptr_a && !ptr_b;

				    if (_data.empty() || other._data.empty()) {

				        return _data.empty() && other._data.empty();

				    }

				    if (type.is_atomic()) {

				        auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);

				        auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);

				        auto a = atomic_cell_view::from_bytes(type, _data);

				        auto b = atomic_cell_view::from_bytes(type, other._data);

				        if (a.timestamp() != b.timestamp()) {

				            return false;

				        }

				@@ -191,44 +174,24 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c

				size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const

				{

				    if (!_data.get()) {

				        return 0;

				    }

				    auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());

				    auto view = data::cell::structure::make_view(_data.get(), ctx);

				    auto flags = view.get<data::cell::tags::flags>();

				    size_t external_value_size = 0;

				    if (flags.get<data::cell::tags::external_data>()) {

				        if (flags.get<data::cell::tags::collection>()) {

				            external_value_size = as_collection_mutation().data.size_bytes();

				        } else {

				            auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);

				            external_value_size = cell_view.value_size();

				        }

				        // Add overhead of chunk headers. The last one is a special case.

				        external_value_size += (external_value_size - 1) / data::cell::effective_external_chunk_length * data::cell::external_chunk_overhead;

				        external_value_size += data::cell::external_last_chunk_overhead;

				    }

				    return data::cell::structure::serialized_object_size(_data.get(), ctx)

				        + imr_object_type::size_overhead + external_value_size;

				    return _data.external_memory_usage();

				}

				std::ostream&

				operator<<(std::ostream& os, const atomic_cell_view& acv) {

				    if (acv.is_live()) {

				        return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",

				        fmt::print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",

				            acv.is_counter_update()

				                    ? "counter_update_value=" + to_sstring(acv.counter_update_value())

				                    : to_hex(acv.value().linearize()),

				                    : to_hex(to_bytes(acv.value())),

				            acv.timestamp(),

				            acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,

				            acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);

				    } else {

				        return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",

				        fmt::print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",

				            acv.timestamp(), acv.deletion_time().time_since_epoch().count());

				    }

				    return os;

				}

				std::ostream&

				@@ -247,22 +210,22 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {

				                cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();

				            } else {

				                cell_value_string_builder << "shards: ";

				                counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {

				                    cell_value_string_builder << ::join(", ", ccv.shards());

				                });

				                auto ccv = counter_cell_view(acv);

				                cell_value_string_builder << ::join(", ", ccv.shards());

				            }

				        } else {

				            cell_value_string_builder << type.to_string(acv.value().linearize());

				            cell_value_string_builder << type.to_string(to_bytes(acv.value()));

				        }

				        return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",

				        fmt::print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",

				            cell_value_string_builder.str(),

				            acv.timestamp(),

				            acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,

				            acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);

				    } else {

				        return fmt_print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",

				        fmt::print(os, "atomic_cell{{DEAD,ts={:d},deletion_time={:d}}}",

				            acv.timestamp(), acv.deletion_time().time_since_epoch().count());

				    }

				    return os;

				}

				std::ostream&

				@@ -271,12 +234,11 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {

				}

				std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {

				    if (!p._cell._data.get()) {

				    if (p._cell._data.empty()) {

				        return os << "{ null atomic_cell_or_collection }";

				    }

				    using dc = data::cell;

				    os << "{ ";

				    if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {

				    if (p._cdef.type->is_multi_cell()) {

				        os << "collection ";

				        auto cmv = p._cell.as_collection_mutation();

				        os << collection_mutation_view::printer(*p._cdef.type, cmv);

									
										289

atomic_cell.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -27,11 +27,10 @@

				#include "gc_clock.hh"

				#include "utils/managed_bytes.hh"

				#include <seastar/net//byteorder.hh>

				#include <seastar/util/bool_class.hh>

				#include <cstdint>

				#include <iosfwd>

				#include "data/cell.hh"

				#include "data/schema_info.hh"

				#include "imr/utils.hh"

				#include <concepts>

				#include "utils/fragmented_temporary_buffer.hh"

				#include "serializer.hh"

				@@ -40,41 +39,191 @@ class abstract_type;

				class collection_type_impl;

				class atomic_cell_or_collection;

				using atomic_cell_value_view = data::value_view;

				using atomic_cell_value_mutable_view = data::value_mutable_view;

				using atomic_cell_value = managed_bytes;

				template <mutable_view is_mutable>

				using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;

				using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;

				using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;

				template <typename T>

				requires std::is_trivial_v<T>

				static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {

				    auto out_view = managed_bytes_mutable_view(out);

				    out_view.remove_prefix(offset);

				    write<T>(out_view, val);

				}

				template <typename T>

				requires std::is_trivial_v<T>

				static void set_field(atomic_cell_value& out, unsigned offset, T val) {

				    auto out_view = atomic_cell_value_mutable_view(out);

				    set_field(out_view, offset, val);

				}

				template <FragmentRange Buffer>

				static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {

				    auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());

				    for (auto frag : value) {

				        write_fragmented(v, single_fragmented_view(frag));

				    }

				}

				template <typename T, FragmentedView Input>

				requires std::is_trivial_v<T>

				static T get_field(Input in, unsigned offset = 0) {

				    in.remove_prefix(offset);

				    return read_simple<T>(in);

				}

				/*

				 * Represents atomic cell layout. Works on serialized form.

				 *

				 * Layout:

				 *

				 *  <live>  := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>

				 *  <dead>  := <int8_t:    0><int64_t:timestamp><int64_t:deletion_time>

				 */

				class atomic_cell_type final {

				private:

				    static constexpr int8_t LIVE_FLAG = 0x01;

				    static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells

				    static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.

				    static constexpr unsigned flags_size = 1;

				    static constexpr unsigned timestamp_offset = flags_size;

				    static constexpr unsigned timestamp_size = 8;

				    static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;

				    static constexpr unsigned expiry_size = 8;

				    static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;

				    static constexpr unsigned deletion_time_size = 8;

				    static constexpr unsigned ttl_offset = expiry_offset + expiry_size;

				    static constexpr unsigned ttl_size = 4;

				    friend class counter_cell_builder;

				private:

				    static bool is_counter_update(atomic_cell_value_view cell) {

				        return cell.front() & COUNTER_UPDATE_FLAG;

				    }

				    static bool is_live(atomic_cell_value_view cell) {

				        return cell.front() & LIVE_FLAG;

				    }

				    static bool is_live_and_has_ttl(atomic_cell_value_view cell) {

				        return cell.front() & EXPIRY_FLAG;

				    }

				    static bool is_dead(atomic_cell_value_view cell) {

				        return !is_live(cell);

				    }

				    // Can be called on live and dead cells

				    static api::timestamp_type timestamp(atomic_cell_value_view cell) {

				        return get_field<api::timestamp_type>(cell, timestamp_offset);

				    }

				    static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {

				        set_field(cell, timestamp_offset, ts);

				    }

				    // Can be called on live cells only

				private:

				    template <mutable_view is_mutable>

				    static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {

				        auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);

				        auto value_offset = flags_size + timestamp_size + expiry_field_size;

				        cell.remove_prefix(value_offset);

				        return cell;

				    }

				public:

				    static atomic_cell_value_view value(managed_bytes_view cell) {

				        return do_get_value(cell);

				    }

				    static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {

				        return do_get_value(cell);

				    }

				    // Can be called on live counter update cells only

				    static int64_t counter_update_value(atomic_cell_value_view cell) {

				        return get_field<int64_t>(cell, flags_size + timestamp_size);

				    }

				    // Can be called only when is_dead() is true.

				    static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {

				        assert(is_dead(cell));

				        return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));

				    }

				    // Can be called only when is_live_and_has_ttl() is true.

				    static gc_clock::time_point expiry(atomic_cell_value_view cell) {

				        assert(is_live_and_has_ttl(cell));

				        auto expiry = get_field<int64_t>(cell, expiry_offset);

				        return gc_clock::time_point(gc_clock::duration(expiry));

				    }

				    // Can be called only when is_live_and_has_ttl() is true.

				    static gc_clock::duration ttl(atomic_cell_value_view cell) {

				        assert(is_live_and_has_ttl(cell));

				        return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));

				    }

				    static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {

				        managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);

				        b[0] = 0;

				        set_field(b, timestamp_offset, timestamp);

				        set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));

				        return b;

				    }

				    template <FragmentRange Buffer>

				    static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {

				        auto value_offset = flags_size + timestamp_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());

				        b[0] = LIVE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        set_value(b, value_offset, value);

				        return b;

				    }

				    static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {

				        auto value_offset = flags_size + timestamp_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));

				        b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        set_field(b, value_offset, value);

				        return b;

				    }

				    template <FragmentRange Buffer>

				    static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {

				        auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());

				        b[0] = EXPIRY_FLAG | LIVE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));

				        set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));

				        set_value(b, value_offset, value);

				        return b;

				    }

				    static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {

				        auto value_offset = flags_size + timestamp_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + size);

				        b[0] = LIVE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        return b;

				    }

				    template <mutable_view is_mutable>

				    friend class basic_atomic_cell_view;

				    friend class atomic_cell;

				};
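The serialized layout handled by `atomic_cell_type` can be illustrated with a small standalone encoder. The sketch below uses the flag values and field offsets from the class above, but it works on a plain `std::vector<uint8_t>` instead of `managed_bytes`. The helper names (`put_be`, `get_be`, `encode_live`, `value_offset`) are hypothetical, and big-endian byte order is an assumption made for this example only; the hunk itself does not show which byte order `write<T>` uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Flag values and offsets copied from atomic_cell_type.
constexpr uint8_t LIVE_FLAG = 0x01;
constexpr uint8_t EXPIRY_FLAG = 0x02;
constexpr std::size_t flags_size = 1;
constexpr std::size_t timestamp_offset = flags_size;
constexpr std::size_t timestamp_size = 8;
constexpr std::size_t expiry_offset = timestamp_offset + timestamp_size;
constexpr std::size_t expiry_size = 8;
constexpr std::size_t ttl_offset = expiry_offset + expiry_size;
constexpr std::size_t ttl_size = 4;

// Hypothetical helper: store val at offset, most significant byte first
// (byte order is an assumption of this sketch).
template <typename T>
void put_be(std::vector<uint8_t>& buf, std::size_t offset, T val) {
    assert(offset + sizeof(T) <= buf.size());
    auto u = static_cast<std::make_unsigned_t<T>>(val);
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        buf[offset + i] = static_cast<uint8_t>(u >> (8 * (sizeof(T) - 1 - i)));
    }
}

// Hypothetical helper: inverse of put_be.
template <typename T>
T get_be(const std::vector<uint8_t>& buf, std::size_t offset) {
    assert(offset + sizeof(T) <= buf.size());
    using U = std::make_unsigned_t<T>;
    U u = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        u = static_cast<U>((u << 8) | buf[offset + i]);
    }
    return static_cast<T>(u);
}

// <live> without TTL: <flags><timestamp><value>
inline std::vector<uint8_t> encode_live(int64_t timestamp, const std::vector<uint8_t>& value) {
    const std::size_t value_off = flags_size + timestamp_size;
    std::vector<uint8_t> b(value_off + value.size());
    b[0] = LIVE_FLAG;
    put_be<int64_t>(b, timestamp_offset, timestamp);
    for (std::size_t i = 0; i < value.size(); ++i) {
        b[value_off + i] = value[i];
    }
    return b;
}

// <live> with TTL: <flags><timestamp><expiry><ttl><value>
inline std::vector<uint8_t> encode_live(int64_t timestamp, const std::vector<uint8_t>& value,
                                        int64_t expiry, int32_t ttl) {
    const std::size_t value_off = flags_size + timestamp_size + expiry_size + ttl_size;
    std::vector<uint8_t> b(value_off + value.size());
    b[0] = EXPIRY_FLAG | LIVE_FLAG;
    put_be<int64_t>(b, timestamp_offset, timestamp);
    put_be<int64_t>(b, expiry_offset, expiry);
    put_be<int32_t>(b, ttl_offset, ttl);
    for (std::size_t i = 0; i < value.size(); ++i) {
        b[value_off + i] = value[i];
    }
    return b;
}

// Same arithmetic as do_get_value(): the value starts after the optional
// expiry and ttl fields, whose presence is signalled by EXPIRY_FLAG.
inline std::size_t value_offset(const std::vector<uint8_t>& cell) {
    const std::size_t expiry_field_size = (cell.front() & EXPIRY_FLAG) ? expiry_size + ttl_size : 0;
    return flags_size + timestamp_size + expiry_field_size;
}
```

With these offsets, a live cell without TTL carries 9 bytes of header before the value, and an expiring cell carries 21.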

				/// View of an atomic cell

				template<mutable_view is_mutable>

				class basic_atomic_cell_view {

				protected:

				    data::cell::basic_atomic_cell_view<is_mutable> _view;

				    friend class atomic_cell;

				public:

				    using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;

				    managed_bytes_basic_view<is_mutable> _view;

					friend class atomic_cell;

				protected:

				    explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)

				        : _view(std::move(v)) { }

				    basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)

				        : _view(data::cell::make_atomic_cell_view(ti, ptr))

				    { }

				    void set_view(managed_bytes_basic_view<is_mutable> v) {

				        _view = v;

				    }

				    basic_atomic_cell_view() = default;

				    explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }

				    friend class atomic_cell_or_collection;

				public:

				    operator basic_atomic_cell_view<mutable_view::no>() const noexcept {

				        return basic_atomic_cell_view<mutable_view::no>(_view);

				    }

				    void swap(basic_atomic_cell_view& other) noexcept {

				        using std::swap;

				        swap(_view, other._view);

				    }

				    bool is_counter_update() const {

				        return _view.is_counter_update();

				        return atomic_cell_type::is_counter_update(_view);

				    }

				    bool is_live() const {

				        return _view.is_live();

				        return atomic_cell_type::is_live(_view);

				    }

				    bool is_live(tombstone t, bool is_counter) const {

				        return is_live() && !is_covered_by(t, is_counter);

				@@ -83,73 +232,69 @@ public:

				        return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);

				    }

				    bool is_live_and_has_ttl() const {

				        return _view.is_expiring();

				        return atomic_cell_type::is_live_and_has_ttl(_view);

				    }

				    bool is_dead(gc_clock::time_point now) const {

				        return !is_live() || has_expired(now);

				        return atomic_cell_type::is_dead(_view) || has_expired(now);

				    }

				    bool is_covered_by(tombstone t, bool is_counter) const {

				        return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);

				    }

				    // Can be called on live and dead cells

				    api::timestamp_type timestamp() const {

				        return _view.timestamp();

				        return atomic_cell_type::timestamp(_view);

				    }

				    void set_timestamp(api::timestamp_type ts) {

				        _view.set_timestamp(ts);

				        atomic_cell_type::set_timestamp(_view, ts);

				    }

				    // Can be called on live cells only

				    data::basic_value_view<is_mutable> value() const {

				        return _view.value();

				    atomic_cell_value_basic_view<is_mutable> value() const {

				        return atomic_cell_type::value(_view);

				    }

				    // Can be called on live cells only

				    size_t value_size() const {

				        return _view.value_size();

				    }

				    bool is_value_fragmented() const {

				        return _view.is_value_fragmented();

				        return atomic_cell_type::value(_view).size();

				    }

				    // Can be called on live counter update cells only

				    int64_t counter_update_value() const {

				        return _view.counter_update_value();

				        return atomic_cell_type::counter_update_value(_view);

				    }

				    // Can be called only when is_dead(gc_clock::time_point)

				    gc_clock::time_point deletion_time() const {

				        return !is_live() ? _view.deletion_time() : expiry() - ttl();

				        return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();

				    }

				    // Can be called only when is_live_and_has_ttl()

				    gc_clock::time_point expiry() const {

				        return _view.expiry();

				        return atomic_cell_type::expiry(_view);

				    }

				    // Can be called only when is_live_and_has_ttl()

				    gc_clock::duration ttl() const {

				        return _view.ttl();

				        return atomic_cell_type::ttl(_view);

				    }

				    // Can be called on live and dead cells

				    bool has_expired(gc_clock::time_point now) const {

				        return is_live_and_has_ttl() && expiry() <= now;

				    }

				    bytes_view serialize() const {

				        return _view.serialize();

				    managed_bytes_view serialize() const {

				        return _view;

				    }

				};

				class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {

				    atomic_cell_view(const data::type_info& ti, const uint8_t* data)

				        : basic_atomic_cell_view<mutable_view::no>(ti, data) {}

				    atomic_cell_view(managed_bytes_view v)

				        : basic_atomic_cell_view(v) {}

				    template<mutable_view is_mutable>

				    atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)

				        : basic_atomic_cell_view<mutable_view::no>(view) { }

				    atomic_cell_view(basic_atomic_cell_view<is_mutable> view)

				        : basic_atomic_cell_view<mutable_view::no>(view) {}

				    friend class atomic_cell;

				public:

				    static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {

				        return atomic_cell_view(ti, data.get());

				    static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {

				        return atomic_cell_view(v);

				    }

				    static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {

				        return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));

				    static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {

				        return atomic_cell_view(managed_bytes_view(v));

				    }

				    friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);

				@@ -164,11 +309,11 @@ public:

				};

				class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {

				    atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)

				        : basic_atomic_cell_view<mutable_view::yes>(ti, data) {}

				    atomic_cell_mutable_view(managed_bytes_mutable_view data)

				        : basic_atomic_cell_view(data) {}

				public:

				    static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {

				        return atomic_cell_mutable_view(ti, data.get());

				    static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {

				        return atomic_cell_mutable_view(v);

				    }

				    friend class atomic_cell;

				@@ -177,26 +322,31 @@ public:

				using atomic_cell_ref = atomic_cell_mutable_view;

				class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {

				    using imr_object_type =  imr::utils::object<data::cell::structure>;

				    imr_object_type _data;

				    atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)

				        : basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}

				    managed_bytes _data;

				    atomic_cell(managed_bytes b) : _data(std::move(b))  {

				        set_view(_data);

				    }

				public:

				    class collection_member_tag;

				    using collection_member = bool_class<collection_member_tag>;

				    atomic_cell(atomic_cell&&) = default;

				    atomic_cell& operator=(const atomic_cell&) = delete;

				    atomic_cell& operator=(atomic_cell&&) = default;

				    void swap(atomic_cell& other) noexcept {

				        basic_atomic_cell_view<mutable_view::yes>::swap(other);

				        _data.swap(other._data);

				    atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {

				        set_view(_data);

				    }

				    operator atomic_cell_view() const { return atomic_cell_view(_view); }

				    atomic_cell& operator=(const atomic_cell&) = delete;

				    atomic_cell& operator=(atomic_cell&& o) {

				        _data = std::move(o._data);

				        set_view(_data);

				        return *this;

				    }

				    operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }

				    atomic_cell(const abstract_type& t, atomic_cell_view other);

				    static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);

				    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,

				                                 collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,

				                                 collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,

				                                 collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

				@@ -208,6 +358,8 @@ public:

				    static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);

				    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,

				        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,

				        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,

				        gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);

				    static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

				@@ -224,6 +376,13 @@ public:

				            return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);

				        }

				    }

				    static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const managed_bytes_view& value, ttl_opt ttl, collection_member cm = collection_member::no) {

				        if (!ttl) {

				            return make_live(type, timestamp, value, cm);

				        } else {

				            return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);

				        }

				    }

				    static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);

				    friend class atomic_cell_or_collection;

				    friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);

				@@ -237,7 +396,7 @@ public:

				class column_definition;

				int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);

				std::strong_ordering compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);

				void merge_column(const abstract_type& def,

				        atomic_cell_or_collection& old,

				        const atomic_cell_or_collection& neww);

									
										6

atomic_cell_hash.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -52,9 +52,7 @@ struct appending_hash<atomic_cell_view> {

				        feed_hash(h, cell.timestamp());

				        if (cell.is_live()) {

				            if (cdef.is_counter()) {

				                counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {

				                    ::feed_hash(h, ccv);

				                });

				                ::feed_hash(h, counter_cell_view(cell));

				                return;

				            }

				            if (cell.is_live_and_has_ttl()) {

									
										36

atomic_cell_or_collection.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 ScyllaDB

				 * Copyright (C) 2015-present ScyllaDB

				 */

				/*

				@@ -24,22 +24,15 @@

				#include "atomic_cell.hh"

				#include "collection_mutation.hh"

				#include "schema.hh"

				#include "hashing.hh"

				#include "imr/utils.hh"

				// A variant type that can hold either an atomic_cell, or a serialized collection.

				// Which type is stored is determined by the schema.

				// Has an "empty" state.

				// Objects moved-from are left in an empty state.

				class atomic_cell_or_collection final {

				    // FIXME: This has made us lose small-buffer optimisation. Unfortunately,

				    // due to the changed cell format it would be less effective now, anyway.

				    // Measure the actual impact because any attempts to fix this will become

				    // irrelevant once rows are converted to the IMR as well, so maybe we can

				    // live with this like that.

				    using imr_object_type = imr::utils::object<data::cell::structure>;

				    imr_object_type _data;

				    managed_bytes _data;

				private:

				    atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}

				    atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}

				public:

				    atomic_cell_or_collection() = default;

				    atomic_cell_or_collection(atomic_cell_or_collection&&) = default;

				@@ -49,20 +42,16 @@ public:

				    atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}

				    atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);

				    static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }

				    atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }

				    atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }

				    atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }

				    atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }

				    atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }

				    atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }

				    atomic_cell_or_collection copy(const abstract_type&) const;

				    explicit operator bool() const {

				        return bool(_data);

				        return !_data.empty();

				    }

				    static constexpr bool can_use_mutable_view() {

				        return true;

				    }

				    void swap(atomic_cell_or_collection& other) noexcept {

				        _data.swap(other._data);

				    }

				    static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }

				    collection_mutation_view as_collection_mutation() const;

				    bytes_view serialize() const;

				@@ -82,12 +71,3 @@ public:

				    };

				    friend std::ostream& operator<<(std::ostream&, const printer&);

				};

				namespace std {

				inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept

				{

				    a.swap(b);

				}

				}

									
										2

auth/allow_all_authenticator.cc
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

									
										2

auth/allow_all_authenticator.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

									
										2

auth/allow_all_authorizer.cc
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

									
										2

auth/allow_all_authorizer.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

									
										2

auth/authenticated_user.cc
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

									
										2

auth/authenticated_user.hh
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

									
										2

auth/authentication_options.cc
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2018 ScyllaDB

				 * Copyright (C) 2018-present ScyllaDB

				 */

				/*

									
										2

auth/authentication_options.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2018 ScyllaDB

				 * Copyright (C) 2018-present ScyllaDB

				 */

				/*

									
										2

auth/authenticator.cc
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

									
										5

auth/authenticator.hh
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				@@ -47,7 +47,6 @@

				#include <stdexcept>

				#include <unordered_map>

				#include <boost/any.hpp>

				#include <seastar/core/enum.hh>

				#include <seastar/core/future.hh>

				#include <seastar/core/sstring.hh>

				@@ -75,6 +74,8 @@ class authenticated_user;

				///

				class authenticator {

				public:

				    using ptr_type = std::unique_ptr<authenticator>;

				    ///

				    /// The name of the key to be used for the user-name part of password authentication with \ref authenticate.

				    ///

									
										4

auth/authorizer.hh
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				@@ -91,6 +91,8 @@ public:

				///

				class authorizer {

				public:

				    using ptr_type = std::unique_ptr<authorizer>;

				    virtual ~authorizer() = default;

				    virtual future<> start() = 0;

									
										20

auth/common.cc
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

				@@ -45,15 +45,13 @@ static logging::logger auth_log("auth");

				// Func must support being invoked more than once.

				future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func) {

				    struct empty_state { };

				    return delay_until_system_ready(as).then([&as, func = std::move(func)] () mutable {

				        return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {

				            return func().then_wrapped([] (auto&& f) -> std::optional<empty_state> {

				                if (f.failed()) {

				                    auth_log.debug("Auth task failed with error, rescheduling: {}", f.get_exception());

				                    return { };

				                }

				                return { empty_state() };

				            });

				    return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {

				        return func().then_wrapped([] (auto&& f) -> std::optional<empty_state> {

				            if (f.failed()) {

				                auth_log.debug("Auth task failed with error, rescheduling: {}", f.get_exception());

				                return { };

				            }

				            return { empty_state() };

				        });

				    }).discard_result();

				}

				@@ -82,7 +80,7 @@ static future<> create_metadata_table_if_missing_impl(

				    b.set_uuid(uuid);

				    schema_ptr table = b.build();

				    return ignore_existing([&mm, table = std::move(table)] () {

				        return mm.announce_new_column_family(table, false);

				        return mm.announce_new_column_family(table);

				    });

				}

									
										6

auth/common.hh
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 * Copyright (C) 2017-present ScyllaDB

				 */

				/*

				@@ -70,10 +70,6 @@ future<> once_among_shards(Task&& f) {

				    return make_ready_future<>();

				}

				inline future<> delay_until_system_ready(seastar::abort_source& as) {

				    return sleep_abortable(15s, as);

				}

				// Func must support being invoked more than once.

				future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);

									
										17

auth/default_authorizer.cc
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				@@ -62,6 +62,7 @@ extern "C" {

				#include "exceptions/exceptions.hh"

				#include "log.hh"

				#include "database.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				@@ -134,13 +135,13 @@ future<> default_authorizer::migrate_legacy_metadata() const {

				}

				future<> default_authorizer::start() {

				    static const sstring create_table = sprint(

				            "CREATE TABLE %s.%s ("

				            "%s text,"

				            "%s text,"

				            "%s set<text>,"

				            "PRIMARY KEY(%s, %s)"

				            ") WITH gc_grace_seconds=%d",

				    static const sstring create_table = fmt::format(

				            "CREATE TABLE {}.{} ("

				            "{} text,"

				            "{} text,"

				            "{} set<text>,"

				            "PRIMARY KEY({}, {})"

				            ") WITH gc_grace_seconds={}",

				            meta::AUTH_KS,

				            PERMISSIONS_CF,

				            ROLE_NAME,

									
										9

auth/default_authorizer.hh
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				@@ -46,9 +46,14 @@

				#include <seastar/core/abort_source.hh>

				#include "auth/authorizer.hh"

				#include "cql3/query_processor.hh"

				#include "service/migration_manager.hh"

				namespace cql3 {

				class query_processor;

				} // namespace cql3

				namespace auth {

				class default_authorizer : public authorizer {

									
										60

auth/password_authenticator.cc
				@@ -17,7 +17,7 @@

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 * Copyright (C) 2016-present ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				@@ -59,6 +59,7 @@

				#include "service/migration_manager.hh"

				#include "utils/class_registrator.hh"

				#include "database.hh"

				#include "cql3/query_processor.hh"

				namespace auth {

				@@ -66,7 +67,6 @@ constexpr std::string_view password_authenticator_name("org.apache.cassandra.aut

				// name of the hash column.

				static constexpr std::string_view SALTED_HASH = "salted_hash";

				static constexpr std::string_view OPTIONS = "options";

				static constexpr std::string_view DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;

				static const sstring DEFAULT_USER_PASSWORD = sstring(meta::DEFAULT_SUPERUSER_NAME);

				@@ -204,11 +204,11 @@ bool password_authenticator::require_authentication() const {

				}

				authentication_option_set password_authenticator::supported_options() const {

				    return authentication_option_set{authentication_option::password, authentication_option::options};

				    return authentication_option_set{authentication_option::password};

				}

				authentication_option_set password_authenticator::alterable_options() const {

				    return authentication_option_set{authentication_option::password, authentication_option::options};

				    return authentication_option_set{authentication_option::password};

				}

				future<authenticated_user> password_authenticator::authenticate(

				@@ -263,46 +263,21 @@ future<authenticated_user> password_authenticator::authenticate(

				    });

				}

				future<> password_authenticator::maybe_update_custom_options(std::string_view role_name, const authentication_options& options) const {

				    static const sstring query = format("UPDATE {} SET {} = ? WHERE {} = ?",

				            meta::roles_table::qualified_name,

				            OPTIONS,

				            meta::roles_table::role_col_name);

				    if (!options.options) {

				        return make_ready_future<>();

				    }

				    std::vector<std::pair<data_value, data_value>> entries;

				    for (const auto& entry : *options.options) {

				        entries.push_back({data_value(entry.first), data_value(entry.second)});

				    }

				    auto map_value = make_map_value(map_type_impl::get_instance(utf8_type, utf8_type, false), entries);

				    return _qp.execute_internal(

				            query,

				            consistency_for_user(role_name),

				            internal_distributed_query_state(),

				            {std::move(map_value), sstring(role_name)}).discard_result();

				}

				future<> password_authenticator::create(std::string_view role_name, const authentication_options& options) const {

				    if (!options.password) {

				        return maybe_update_custom_options(role_name, options);

				        return make_ready_future<>();

				    }

				    return _qp.execute_internal(

				            update_row_query(),

				            consistency_for_user(role_name),

				            internal_distributed_query_state(),

				            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result().then([this, role_name, &options] {

				                return maybe_update_custom_options(role_name, options);

				            });

				            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();

				}

				future<> password_authenticator::alter(std::string_view role_name, const authentication_options& options) const {

				    if (!options.password) {

				        return maybe_update_custom_options(role_name, options);

				        return make_ready_future<>();

				    }

				    static const sstring query = format("UPDATE {} SET {} = ? WHERE {} = ?",

				@@ -314,9 +289,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent

				            query,

				            consistency_for_user(role_name),

				            internal_distributed_query_state(),

				            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result().then([this, role_name, &options] {

				                return maybe_update_custom_options(role_name, options);

				            }).discard_result();

				            {passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();

				}

				future<> password_authenticator::drop(std::string_view name) const {

				@@ -332,22 +305,7 @@ future<> password_authenticator::drop(std::string_view name) const {

				}

				future<custom_options> password_authenticator::query_custom_options(std::string_view role_name) const {

				    static const sstring query = format("SELECT {} FROM {} WHERE {} = ?",

				            OPTIONS,

				            meta::roles_table::qualified_name,

				            meta::roles_table::role_col_name);

				    return _qp.execute_internal(

				            query, consistency_for_user(role_name),

				            internal_distributed_query_state(),

				            {sstring(role_name)}).then([](::shared_ptr<cql3::untyped_result_set> rs) {

				        custom_options opts;

				        const auto& row = rs->one();

				        if (row.has(OPTIONS)) {

				            row.get_map_data<sstring, sstring>(OPTIONS, std::inserter(opts, opts.end()), utf8_type, utf8_type);

				        }

				        return opts;

				    });

				    return make_ready_future<custom_options>();

				}

				const resource_set& password_authenticator::protected_resources() const {

Compare commits

4532 Commits add_alter_ ... branch-4.6

28 .github/CODEOWNERS vendored
29 .github/workflows/docs-pages@v2.yaml vendored Normal file
25 .github/workflows/docs-pr@v1.yaml vendored Normal file
4 .gitignore vendored
2 .gitmodules vendored
124 CMakeLists.txt
21 CONTRIBUTING.md
28 HACKING.md
4 NOTICE.txt
11 README.md
76 SCYLLA-VERSION-GEN
2 abseil
2 absl-flat_hash_map.cc
2 absl-flat_hash_map.hh
67 alternator/auth.cc
10 alternator/auth.hh
241 alternator/conditions.cc
3 alternator/conditions.hh
128 alternator/controller.cc Normal file
82 alternator/controller.hh Normal file
18 alternator/error.hh
966 alternator/executor.cc
106 alternator/executor.hh
193 alternator/expressions.cc
5 alternator/expressions.g
2 alternator/expressions.hh
17 alternator/expressions_types.hh
5 alternator/rmw_operation.hh
44 alternator/serialization.cc
8 alternator/serialization.hh
279 alternator/server.cc
26 alternator/server.hh
5 alternator/stats.cc
3 alternator/stats.hh
134 alternator/streams.cc
2 alternator/tags_extension.hh
113 alternator/ttl.cc Normal file
10 api/api-doc/column_family.json
24 api/api-doc/gossiper.json
55 api/api-doc/hinted_handoff.json
4 api/api-doc/messaging_service.json
130 api/api-doc/storage_service.json
16 api/api-doc/system.json
81 api/api.cc
5 api/api.hh
63 api/api_init.hh
2 api/cache_service.cc
2 api/cache_service.hh
2 api/collectd.cc
2 api/collectd.hh
133 api/column_family.cc
7 api/column_family.hh
2 api/commitlog.cc
2 api/commitlog.hh
16 api/compaction_manager.cc
2 api/compaction_manager.hh
15 api/config.cc
4 api/config.hh
2 api/endpoint_snitch.cc
2 api/endpoint_snitch.hh
2 api/error_injection.cc
2 api/error_injection.hh
24 api/failure_detector.cc
12 api/failure_detector.hh
37 api/gossiper.cc
12 api/gossiper.hh
93 api/hinted_handoff.cc
13 api/hinted_handoff.hh
3 api/lsa.cc
2 api/lsa.hh
7 api/messaging_service.cc
2 api/messaging_service.hh
8 api/storage_proxy.cc
7 api/storage_proxy.hh
467 api/storage_service.cc
28 api/storage_service.hh
2 api/stream_manager.cc
2 api/stream_manager.hh

4532 Commits

add_alter_ ... branch-4.6

28 .github/CODEOWNERS vendored

29 .github/workflows/docs-pages@v2.yaml vendored Normal file

25 .github/workflows/docs-pr@v1.yaml vendored Normal file

4 .gitignore vendored

2 .gitmodules vendored

124 CMakeLists.txt

21 CONTRIBUTING.md

28 HACKING.md

4 NOTICE.txt

11 README.md

76 SCYLLA-VERSION-GEN

2 abseil

2 absl-flat_hash_map.cc

2 absl-flat_hash_map.hh

67 alternator/auth.cc

10 alternator/auth.hh

241 alternator/conditions.cc

3 alternator/conditions.hh

128 alternator/controller.cc Normal file

82 alternator/controller.hh Normal file

18 alternator/error.hh

966 alternator/executor.cc

106 alternator/executor.hh

193 alternator/expressions.cc

5 alternator/expressions.g

2 alternator/expressions.hh

17 alternator/expressions_types.hh

5 alternator/rmw_operation.hh

44 alternator/serialization.cc

8 alternator/serialization.hh

279 alternator/server.cc

26 alternator/server.hh

5 alternator/stats.cc

3 alternator/stats.hh

134 alternator/streams.cc

2 alternator/tags_extension.hh

113 alternator/ttl.cc Normal file

10 api/api-doc/column_family.json

24 api/api-doc/gossiper.json

55 api/api-doc/hinted_handoff.json

4 api/api-doc/messaging_service.json

130 api/api-doc/storage_service.json

16 api/api-doc/system.json

81 api/api.cc

5 api/api.hh

63 api/api_init.hh

2 api/cache_service.cc

2 api/cache_service.hh

2 api/collectd.cc

2 api/collectd.hh

133 api/column_family.cc

7 api/column_family.hh

2 api/commitlog.cc

2 api/commitlog.hh

16 api/compaction_manager.cc

2 api/compaction_manager.hh

15 api/config.cc

4 api/config.hh

2 api/endpoint_snitch.cc

2 api/endpoint_snitch.hh

2 api/error_injection.cc

2 api/error_injection.hh

24 api/failure_detector.cc

12 api/failure_detector.hh

37 api/gossiper.cc

12 api/gossiper.hh

93 api/hinted_handoff.cc

13 api/hinted_handoff.hh

3 api/lsa.cc

2 api/lsa.hh

7 api/messaging_service.cc

2 api/messaging_service.hh

8 api/storage_proxy.cc

7 api/storage_proxy.hh

467 api/storage_service.cc

28 api/storage_service.hh

2 api/stream_manager.cc

2 api/stream_manager.hh

15 api/system.cc