scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Michał Radwański	ebf38eaead	flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO. In functions such as upgrade_to_v2 (excerpt below), if the constructor of transforming_reader throws, r needs to be destroyed, however it hasn't been closed. However, if a reader didn't start any operations, it is safe to destruct such a reader. This issue can potentially manifest itself in many more readers and might be hard to track down. This commit adds a bool indicating whether a close is anticipated, thus avoiding errors in the destructor. Code excerpt: flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) { class transforming_reader : public flat_mutation_reader_v2::impl { // ... }; return make_flat_mutation_reader_v2<transforming_reader>(std::move(r)); } Fixes #9065. Fixes #11491 (cherry picked from commit `9ada63a9cb`)	2022-09-21 10:25:18 +03:00
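The idea in the commit above can be illustrated with a minimal sketch. It assumes a hypothetical `sketch_reader` class, which is not Scylla's `flat_mutation_reader`; the member names are invented for illustration. A flag records whether the reader has started any operation, and only then is `close()` required before destruction.

```cpp
#include <cassert>

// Hypothetical stand-in for a reader; not Scylla's flat_mutation_reader.
class sketch_reader {
    // Set once the reader initiates any operation (assumed name).
    bool _close_required = false;
public:
    // Pretend to start IO; from now on close() must be called.
    void fill_buffer() { _close_required = true; }

    // Releases whatever the reader started and clears the obligation.
    void close() { _close_required = false; }

    // True when destroying the reader without close() is harmless.
    bool safe_to_destroy() const { return !_close_required; }

    ~sketch_reader() {
        // A reader that never started IO passes this check unclosed.
        assert(safe_to_destroy() && "reader destroyed without close()");
    }
};
```

With this shape, a reader that is moved into a constructor which then throws can be destroyed safely, as long as it never started an operation.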
Beni Peled	1c82766f33	release: prepare for 5.0.4 scylla-5.0.4	2022-09-21 09:16:13 +03:00
Piotr Sarna	e1f78c33b4	Merge 'Fix mutation commutativity with shadowable tombstone' from Tomasz Grabiec This series fixes lack of mutation associativity which manifests as sporadic failures in row_cache_test.cc::test_concurrent_reads_and_eviction due to differences in mutations applied and read. No known production impact. Refs https://github.com/scylladb/scylladb/issues/11307 Closes #11312 * github.com:scylladb/scylladb: test: mutation_test: Add explicit test for mutation commutativity test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones db: mutation_partition: Drop unnecessary maybe_shadow() db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone mutation_partition: row: make row marker shadowing symmetric (cherry picked from commit `484004e766`)	2022-09-20 23:21:06 +02:00
Tomasz Grabiec	0634b5f734	test: row_cache: Use more narrow key range to stress overlapping reads more This makes catching issues related to concurrent access of same or adjacent entries more likely. For example, catches #11239. Closes #11260 (cherry picked from commit `8ee5b69f80`)	2022-09-20 23:20:43 +02:00
Avi Kivity	6f020b26e1	Merge 'Backport 3 fixes for the evictable reader v2' from Botond Dénes This pull request backports 3 important fixes from `adc08d0ab9`. Said 3 commits fixed important bugs in the v2 variant of the evictable reader, but were not backported because they were part of a large series doing v2 conversion in general. This means that 5.0 was left with a buggy evictable reader v2, which is used by repair. So far in the wild we've seen one bug manifest itself: the evictable reader getting stuck, spinning in a tight loop in `evictable_reader_v2::do_fill_buffer()`, in turn making repair get stuck too. Fixes: #11223 Closes #11540 * github.com:scylladb/scylladb: test/boost/mutation_reader_test: add v2 specific evictable reader tests evictable_reader_v2: terminate active range tombstones on reader recreation evictable_reader_v2: restore handling of non-monotonically increasing positions evictable_reader_v2: simplify handling of reader recreation	2022-09-20 13:42:10 +03:00
Pavel Emelyanov	7f8dcc5657	messaging_service: Fix gossiper verb group When configuring tcp-nodelay unconditionally, messaging service thinks gossiper uses group index 1, though it had changed some time ago and now those verbs belong to group 0. fixes: #11465 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `2c74062962`)	2022-09-19 10:31:58 +03:00
Botond Dénes	20451760fe	tools/scylla-sstable: fix description template Quote '{' and '}' used in CQL example, so format doesn't try to interpret it. Fixes: #11571 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220221140652.173015-1-bdenes@scylladb.com> (cherry picked from commit `10880fb0a7`)	2022-09-19 06:54:25 +03:00
Michał Chojnowski	51b031d04e	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys. 
Fixes #11202 (cherry picked from commit `cdb3e71045`)	2022-09-18 13:29:35 +03:00
Botond Dénes	82d1446ca9	test/boost/mutation_reader_test: add v2 specific evictable reader tests One is a reincarnation of the recently removed test_multishard_combining_reader_non_strictly_monotonic_positions. The latter was actually targeting the evictable reader but through the multishard reader, probably for historic reasons (evictable reader was part of the multishard reader family). The other one checks that active range tombstones changes are properly terminated when the partition ends abruptly after recreating the reader. (cherry picked from commit `014a23bf2a`)	2022-09-15 13:51:13 +03:00
Botond Dénes	e0acb0766d	evictable_reader_v2: terminate active range tombstones on reader recreation Reader recreation messes with the continuity of the mutation fragment stream because it breaks snapshot isolation. We cannot guarantee that a range tombstone or even the partition started before will continue after too. So we have to make sure to wrap up all loose threads when recreating the reader. We already close uncontinued partitions. This commit also takes care of closing any range tombstone started by unconditionally emitting a null range tombstone. This is legal to do, even if no range tombstone was in effect. (cherry picked from commit `9e48237b86`)	2022-09-14 19:15:50 +03:00
Botond Dénes	4f26d489a0	evictable_reader_v2: restore handling of non-monotonically increasing positions We thought that unlike v1, v2 will not need this. But it does. Handled similarly to how v1 did it: we ensure each buffer represents forward progress, when the last fragment in the buffer is a range tombstone change: * Ensure the content of the buffer represents progress w.r.t. _next_position_in_partition, thus ensuring the next time we recreate the reader it will continue from a later position. * Continue reading until the next (peeked) fragment has a strictly larger position. The code is just much nicer because it uses coroutines. (cherry picked from commit `6db08ddeb2`)	2022-09-14 19:15:49 +03:00
Botond Dénes	43cbc5c836	evictable_reader_v2: simplify handling of reader recreation The evictable reader has a handful of flags dictating what to do after the reader is recreated: what to validate, what to drop, etc. We actually need a single flag telling us if the reader was recreated or not, all other things can be derived from existing fields. This patch does exactly that. Furthermore it folds do_fill_buffer() into fill_buffer() and replaces the awkward to use `should_drop_fragment()` with `examine_first_fragments()`, which does a much better job of encapsulating all validation and fragment dropping logic. This code reorganization also fixes two bugs introduced by the v2 conversion: * The loop in `do_fill_buffer()` could become infinite in certain circumstances due to a difference between the v1 and v2 versions of `is_end_of_stream()`. * The position of the first non-dropped fragment was not validated (this was integrated into the range tombstone trimming which was thrown out by the conversion). (cherry picked from commit `498d03836b`)	2022-09-14 19:15:49 +03:00
Nadav Har'El	f0c521efdf	alternator: clean error shutdown in case of TLS misconfiguration The way our boot-time service "controllers" are written, if a controller's start_server() finds an error and throws, it cannot rely on the caller (main.cc) to call stop_server(), and must clean up resources already created (e.g., sharded services) before returning or risk crashes on assertion failures. This patch fixes such a mistake in Alternator's initialization. As noted in issue #10025, if the Alternator TLS configuration is broken - especially the certificate or key files are missing - Scylla would crash on an assertion failure, instead of reporting the error as expected. Before this patch such a misconfiguration will result in the unintelligible: <alternator::server>::~sharded() [Service = alternator::server]: Assertion `_instances.empty()' failed. Aborting on shard 0. After this patch we get the right error message: ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed: std::_Nested_exception<std::runtime_error> (Failed to set up Alternator TLS credentials): std::_Nested_exception<std::runtime_error> (Could not read certificate file conf/scylla.crt): std::filesystem::__cxx11:: filesystem_error (error system:2, filesystem error: open failed: No such file or directory [conf/scylla.crt]) Arguably this error message is a bit ugly, so I opened https://github.com/scylladb/seastar/issues/1029, but at least it says exactly what the error is. Fixes #10025 Fixes #11520 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220321133323.3150939-1-nyh@scylladb.com> (cherry picked from commit `7f89c8b3e3`)	2022-09-11 14:43:18 +03:00
Beni Peled	b9a61c8e9a	release: prepare for 5.0.3 scylla-5.0.3	2022-09-07 11:16:52 +03:00
Karol Baryła	32aa1e5287	transport/server.cc: Return correct size of decompressed lz4 buffer An incorrect size is returned from the function, which could lead to crashes or undefined behavior. Fix by erroring out in these cases. Fixes #11476 (cherry picked from commit `1c2eef384d`)	2022-09-07 10:58:42 +03:00
Nadav Har'El	da6a126d79	cross-tree: fix header file self-sufficiency Scylla's coding standard requires that each header is self-sufficient, i.e., it includes whatever other headers it needs - so it can be included without having to include any other header before it. We have a test for this, "ninja dev-headers", but it isn't run very frequently, and it turns out our code deviated from this requirement in a few places. This patch fixes those places, and after it "ninja dev-headers" succeeds again. This is needed because our CI runs "ninja dev-headers". Fixes #10995 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11457	2022-09-06 15:45:34 +03:00
Avi Kivity	d07e902983	Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later. Fixes: https://github.com/scylladb/scylladb/issues/11264 Closes #11273 * github.com:scylladb/scylladb: querier: querier_cache: remove now unused evict_all_for_table() database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table() reader_concurrency_semaphore: add evict_inactive_reads_for_table() (cherry picked from commit `afa7960926`)	2022-09-02 11:39:43 +03:00
Piotr Sarna	3c0fc42f84	cql3: fix misleading error message for service level timeouts The error message incorrectly stated that the timeout value cannot be longer than 24h, but it can - the actual restriction is that the value cannot be expressed in units like days or months, which was done in order to significantly simplify the parsing routines (and the fact that timeouts counted in days are not expected to be common). Fixes #10286 Closes #10294 (cherry picked from commit `85e95a8cc3`)	2022-09-01 20:34:12 +03:00
Piotr Grabowski	964ccf9192	type_json: support integers in scientific format Add support for specifying integers in scientific format (for example 1.234e8) in INSERT JSON statement: INSERT INTO table JSON '{"int_column": 1e7}'; Inserting a floating-point number ending with .0 is allowed, as the fractional part is zero. Non-zero fractional part (for example 12.34) is disallowed. A new test is added to test all those behaviors. Before the JSON parsing library was switched to RapidJSON from JsonCpp, this statement used to work correctly, because JsonCpp transparently casts double to integer value. This behavior differs from Cassandra, which disallows those types of numbers (1e7, 123.0 and 12.34). Fix typo in if condition: "if (value.GetUint64())" to "if (value.IsUint64())". Fixes #10100 (cherry picked from commit `efe7456f0a`)	2022-09-01 16:03:49 +03:00
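The acceptance rule described above (1e7 and 123.0 accepted, 12.34 rejected) could be sketched as follows. This is an illustration only: the helper name `integer_from_json_double` is hypothetical and the commit does not say how Scylla implements the check. The sketch accepts a double only when it is finite, has a zero fractional part, and fits in a 64-bit signed integer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical helper: converts a JSON number parsed as a double into an
// integer, but only when no information would be lost.
inline std::optional<int64_t> integer_from_json_double(double d) {
    // Reject NaN, infinities, and values with a non-zero fractional part.
    if (!std::isfinite(d) || std::trunc(d) != d) {
        return std::nullopt;
    }
    // 2^63 is exactly representable as a double; the valid range is [-2^63, 2^63).
    constexpr double two63 = 9223372036854775808.0;
    if (d < -two63 || d >= two63) {
        return std::nullopt;
    }
    return static_cast<int64_t>(d);
}
```

The range check comes before the cast because converting an out-of-range double to an integer is undefined behavior in C++.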
Avi Kivity	dfdc128faf	Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec Scenario: cache = [ row(pos=2, continuous=false), row(pos=after(2), dummy=true) ] Scanning read starts, starts populating [-inf, before(2)] from sstables. row(pos=2) is evicted. cache = [ row(pos=after(2), dummy=true) ] Scanning read finishes reading from sstables. Refreshes cache cursor via partition_snapshot_row_cursor::maybe_refresh(), which calls partition_snapshot_row_cursor::advance_to() because iterators are invalidated. This advances the cursor to after(2). no_clustering_row_between(2, after(2)) returns true, so advance_to() returns true, and maybe_refresh() returns true. This is interpreted by the cache reader as "the cursor has not moved forward", so it marks the range as complete, without emitting the row with pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads will also miss the row. The bug is in advance_to(), which is using no_clustering_row_between(a, b) to determine its result, which by definition excludes the starting key. Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction with reduced key range in the random_mutation_generator (1024 -> 16). Fixes #11239 Closes #11240 * github.com:scylladb/scylladb: test: mvcc: Fix illegal use of maybe_refresh() tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range() tests: row_cache_test: Introduce one_shot mode to throttle row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy	2022-08-11 18:36:44 +02:00
Yaron Kaikov	299122e78d	release: prepare for 5.0.2 scylla-5.0.2	2022-08-07 16:15:02 +03:00
Avi Kivity	23a34d7e42	Merge 'Backport: Fix map subscript crashes when map or subscript is null' from Nadav Har'El This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0. Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but nevertheless was fairly straightforward - just copy the exact same checking code to its right place, and keep the exact same tests to see we indeed fixed the bug. Refs #10535. The original cover letter from https://github.com/scylladb/scylla/pull/10420: In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically. In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL. However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan. 
Fixes https://github.com/scylladb/scylla/issues/10361 Fixes https://github.com/scylladb/scylla/issues/10399 Fixes https://github.com/scylladb/scylla/pull/10401 Closes #11142 * github.com:scylladb/scylla: test/cql-pytest: reproducer for CONTAINS NULL bug expressions: don't dereference invalid map subscript in filter expressions: fix invalid dereference in map subscript evaluation test/cql-pytest: improve tests for map subscripts and nulls	2022-07-28 15:31:28 +03:00
Nadav Har'El	67a2f3aa67	test/cql-pytest: reproducer for CONTAINS NULL bug This is a reproducer for issue #10359 that "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions should not match any set, but currently do match non-empty or all sets. The tests currently fail on Scylla, so marked xfail. They also fail on Cassandra because Cassandra considers such a request an error, which we consider a mistake (see #4776) - so the tests are marked "cassandra_bug". Refs #10359. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220412130914.823646-1-nyh@scylladb.com> (cherry picked from commit `ae0e1574dc`)	2022-07-27 20:03:30 +03:00
Nadav Har'El	66e8cf8cea	expressions: don't dereference invalid map subscript in filter If we have the filter expression "WHERE m[?] = 2", the existing code simply assumed that the subscript is an object of the right type. However, while it should indeed be the right type (we already have code that verifies that), there are two more options: It can also be a NULL, or an UNSET_VALUE. Either of these cases causes the existing code to dereference a non-object as an object, leading to bizarre errors (as in issue #10361) or even crashes (as in issue #10399). Cassandra returns an invalid request error in these cases: "Unsupported unset map key for column m" or "Unsupported null map key for column m". We decided to do things differently: * For NULL, we consider m[NULL] to result in NULL - instead of an error. This behavior is more consistent with other expressions that contain null - for example NULL[2] and NULL<2 both result in NULL as well. Moreover, if in the future we allow more complex expressions, such as m[a] (where a is a column), we can find the subscript to be null for some rows and non-null for other rows - and throwing an "invalid query" in the middle of the filtering doesn't make sense. * For UNSET_VALUE, we do consider this an error like Cassandra, and use the same error message as Cassandra. However, the current implementation checks for this error only when the expression is evaluated - not before. It means that if the scan is empty before the filtering, the error will not be reported and we'll silently return an empty result set. We currently consider this ok, but we can also change this in the future by binding the expression only once (today we do it on every evaluation) and validating it once after this binding. Fixes #10361 Fixes #10399 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `fbb2a41246`)	2022-07-27 19:56:17 +03:00
Nadav Har'El	35b66c844c	expressions: fix invalid dereference in map subscript evaluation When we have a filter such as "WHERE m[2] = 3" (where m is a map column), if a row had a null value for m, our expression evaluation code incorrectly dereferenced an unset optional, and continued processing the result of this dereference which resulted in undefined behavior - sometimes we were lucky enough to get "marshaling error" but other times Scylla crashed. The fix is trivial - just check before dereferencing the optional value of the map. We return null in that case, which means that we consider the result of null[2] to be null. I think this is a reasonable approach and fits our overall approach of making null dominate expressions (e.g., the value of "null < 2" is also null). The test test_filtering.py::test_filtering_null_map_with_subscript, which used to frequently fail with marshaling errors or crashes, now passes every time so its "xfail" mark is removed. Fixes #10417 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `808a93d29b`)	2022-07-27 19:50:24 +03:00
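The "null dominates" rule described in the two map-subscript fixes above can be sketched with `std::optional`. This is a simplified illustration, not Scylla's expression evaluator: the `subscript` function and the `int_map` alias are hypothetical, and returning null for a key that is absent from the map is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <optional>

using int_map = std::map<int, int>;  // stand-in for a CQL map<int, int> value

// Evaluates m[key]; an empty optional models a CQL NULL.
inline std::optional<int> subscript(const std::optional<int_map>& m,
                                    const std::optional<int>& key) {
    // Check before dereferencing: NULL[2] and m[NULL] both yield NULL.
    if (!m || !key) {
        return std::nullopt;
    }
    auto it = m->find(*key);
    if (it == m->end()) {
        return std::nullopt;  // assumption: a missing key also yields NULL
    }
    return it->second;
}
```

The bug class fixed by these commits corresponds to writing `*m` or `*key` without the preceding check, which is undefined behavior on an empty optional.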
Nadav Har'El	9e7a1340b9	test/cql-pytest: improve tests for map subscripts and nulls The test test_null.py::test_map_subscript_null turned out to reproduce multiple bugs related to using map subscripts in filtering expressions. One was issue #10361 (m[null] resulted in a bizarre error) or #10399 (m[null] resulted in a crash), and a different issue was #10401 (m[2] resulted in a bizarre error or a crash if m itself was null). Moreover, the same test uncovered different bugs depending how it was run - alone or with other tests - because it was using a shared table. In this patch we introduce two separate tests in test_filtering.py which are designed to reproduce these separate bugs instead of mixing them into one test. The new tests also cover a few more corners which the previous test (which focused on nulls) missed - such as UNSET_VALUE. The two new tests (and the old test_map_subscript_null) pass on Cassandra so still assume that the Cassandra behavior - that m[null] should be an error - is the correct behavior. We may want to change the desired behavior (e.g., to decide that m[null] be null, not an error), and change the tests accordingly later - but for now the tests follow Cassandra's behavior exactly, and pass on Cassandra and fail on Scylla (so are marked xfail). The bugs reproduced by these tests involve randomness or reading uninitialized memory, so these tests sometimes pass, sometimes fail, and sometimes even crash (as reported in #10399 and #10401). So to reproduce these bugs run the tests multiple times. For example: test/cql-pytest/run --count 100 --runxfail test_filtering.py::test_filtering_null_map_with_subscript Refs #10361 Refs #10399 Refs #10401 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `189b8845fe`)	2022-07-27 19:28:17 +03:00
Benny Halevy	d5a0750ef3	multishard_mutation_query: do_query: stop ctx if lookup_readers fails lookup_readers might fail after populating some readers and those better be closed before returning the exception. Fixes #10351 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10425 (cherry picked from commit `055141fc2e`)	2022-07-25 14:52:44 +03:00
Benny Halevy	618c483c73	sstables: time_series_sstable_set: insert: make exception safe Need to erase the shared sstable from _sstables if insertion to _sstables_reversed fails. Fixes #10787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `cd68b04fbf`)	2022-07-25 14:21:45 +03:00
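The rollback pattern in the commit above (erase from the first container if insertion into the second fails) can be sketched as follows. The `two_index_sketch` type and its `fail_second` test hook are hypothetical; the real `time_series_sstable_set` stores shared sstables, not integers.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <stdexcept>

// Hypothetical container that keeps the same elements in two orderings.
struct two_index_sketch {
    std::set<int> forward;
    std::set<int, std::greater<int>> reversed;

    // fail_second simulates an exception (e.g. allocation failure)
    // thrown while inserting into the second container.
    void insert(int v, bool fail_second = false) {
        auto [it, inserted] = forward.insert(v);
        try {
            if (fail_second) {
                throw std::runtime_error("simulated insertion failure");
            }
            reversed.insert(v);
        } catch (...) {
            // Undo the first insertion so both containers stay in sync.
            if (inserted) {
                forward.erase(it);
            }
            throw;
        }
    }
};
```

After a failed `insert`, both containers hold exactly the elements they held before the call, which is the strong exception-safety guarantee.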
Tomasz Grabiec	f10fd1bc12	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries. Fixes #10801 Refs #10793 (cherry picked from commit `0e78ad50ea`) Closes #10802 (cherry picked from commit `3bec1cc19f`)	2022-07-25 14:19:48 +03:00
Tomasz Grabiec	1891f10141	memtable: Fix missing range tombstones during reads under certain rare conditions There is a bug introduced in `e74c3c8` (4.6.0) which makes memtable reader skip a range tombstone for a certain pattern of deletions and under certain sequence of events. _rt_stream contains the result of deoverlapping range tombstones which had the same position, which were sipped from all the versions. The result of deoverlapping may produce a range tombstone which starts later, at the same position as a more recent tombstone which has not been sipped from the partition version yet. If we consume the old range tombstone from _rt_stream and then refresh the iterators, the refresh will skip over the newer tombstone. The fix is to drop the logic which drains _rt_stream so that _rt_stream is always merged with partition versions. For the problem to trigger, there have to be multiple MVCC versions (at least 2) which contain deletions of the following form: [a, c] @ t0 [a, b) @ t1, [b, d] @ t2 c > b The proper sequence for such versions is (assuming d > c): [a, b) @ t1, [b, d] @ t2 Due to the bug, the reader will produce: [a, b) @ t1, [b, c] @ t0 The reader also needs to be preempted right before processing [b, d] @ t2 and iterators need to get invalidated so that lsa_partition_reader::do_refresh_state() is called and it skips over [b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it does emit the proper range tombstone, it's possible that it will violate fragment order in the stream if _rt_stream accumulated remainders (possible with 3 MVCC versions). The problem goes away once MVCC versions merge. Fixes #10913 Fixes #10830 Closes #10914 (cherry picked from commit `a6aef60b93`)	2022-07-19 19:33:51 +03:00
Pavel Emelyanov	b177dacd36	Update seastar submodule (auto-increase latency goal fixes) * seastar dbf79189...9a7ba6d5 (3): > io: Adjust IO latency goal on fair-queue level > reactor: Check IOPS/bandwidth and increase latency goal > Revert "io_queue: Auto-increase the io-latency goal" refs: #10927 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> scylla-5.0.1	2022-07-19 13:06:43 +03:00
Yaron Kaikov	283a722923	release: prepare for 5.0.1	2022-07-19 06:39:11 +03:00
Pavel Emelyanov	522d0a81e7	azure_snitch: Do nothing on non-io-cpu All snitch drivers are supposed to snitch info on some shard and replicate the dc/rack info across others. All, but azure really do so. The azure one gets dc/rack on all shards, which's excessive but not terrible, but when all shards start to replicate their data to all the others, this may lead to use-after-frees. fixes: #10494 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `c6d0bc87d0`)	2022-07-17 14:13:25 +03:00
Pavel Emelyanov	cd13911db4	Merge 'Scrub compaction: prevent mishandling of range tombstone changes' from Botond With v2 having individual bounds of range tombstone as separate fragments, out-of-order fragments become more difficult to handle, especially in the presence of active range tombstone. Scrub in both SKIP and SEGREGATE mode closes the partition on seeing the first invalid fragment (SEGREGATE re-opens it immediately). If there is an active range tombstone, scrub now also has to take care of closing said tombstone when closing the partition. In a normal stream it could just use the last position-in-partition to create a closing bound. But when out-of-order fragments are on the table this is not possible: the closing bound may be found later in the stream, with a position smaller than that of the current position-in-partition. To prevent extending range tombstone changes like that, Scrub now aborts the compaction on the first invalid fragment seen inside an active range tombstone. Fixing a v2 stream with range tombstone changes is definitely possible, but non-trivial, so we defer it until there is demand for it. This series also makes the mutation fragment stream validator check for open range tombstones on partition-end and adds a comprehensive test-suite for the validator. Fixes: #10168 Tests: unit(dev) * scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git: compaction/compaction: abort scrub when attempting to rectify stream with active tombstone test/boost/mutation_test: add test for mutation_fragment_stream_validator mutation_fragment_stream_validator: validate range tombstone changes (cherry picked from commit `edd0481b38`)	2022-07-14 18:49:13 +03:00
Nadav Har'El	32423ebc38	Merge 'Handle errors during snapshot' from Benny Halevy This series refactors `table::snapshot` and moves the responsibility to flush the table before taking the snapshot to the caller. `flush_on_all` and `snapshot_on_all` helpers are added to replica::database (by making it a peering_sharded_service) and upper layers, including api and snapshot-ctl now call it instead of calling cf.snapshot directly. With that, errors are handled in table::snapshot and propagated back to the callers. Failure to allocate the `snapshot_manager` object is fatal, similar to failure to allocate a continuation, since we can't coordinate across the shards without it. Test: unit(dev), rest_api(debug) * github.com:scylladb/scylla: table: snapshot: handle errors table: snapshot: get rid of skip_flush param database: truncate: skip flush when taking snapshot test: rest_api: storage_service: verify_snapshot_details: add truncate database: snapshot_on_all: flush before snapshot if needed table: make snapshot method private database: add snapshot_on_all snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot api: storage_service: increase visibility of snapshot ops in the log api: storage_service: coroutinize take_snapshot and del_snapshot api: storage_service: take_snapshot: improve api help messages test: rest_api: storage_service: add test_storage_service_snapshot database: add flush_on_all variants test: rest_api: add test_storage_service_flush (cherry picked from commit `2c39c4c284`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10975	2022-07-12 15:24:24 +03:00
Pavel Emelyanov	97054ee691	view: Fix trace-state pointer use after move It's moved into .mutate_locally() but it is captured and used in its continuation. It works well just because moved-from pointer looks like nullptr and all the tracing code checks for it to be non-such. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/ (CI job failed on post-actions thus it's red) Fixes #11015 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220711134152.30346-1-xemul@scylladb.com> (cherry picked from commit `5526738794`)	2022-07-12 14:20:57 +03:00
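The use-after-move shape described above can be reproduced in miniature. The sketch below uses `std::shared_ptr<int>` as a hypothetical stand-in for the trace-state pointer, and `consume`, `buggy`, and `fixed` are invented names; it does not use Seastar or Scylla types. A moved-from `std::shared_ptr` is guaranteed to be empty, which is why the buggy variant silently sees a null pointer instead of crashing.

```cpp
#include <cassert>
#include <memory>
#include <utility>

using trace_ptr = std::shared_ptr<int>;  // hypothetical trace-state stand-in

// Takes ownership of the pointer, as a callee receiving it by value would.
inline void consume(trace_ptr p) { (void)p; }

// Buggy shape: the continuation reads `tr` after it has been moved from.
inline bool buggy(trace_ptr tr) {
    consume(std::move(tr));
    auto continuation = [&tr] { return tr != nullptr; };
    return continuation();  // false: the pointer is already empty
}

// Fixed shape: the callee gets a copy, the continuation keeps the original.
inline bool fixed(trace_ptr tr) {
    consume(tr);
    auto continuation = [tr = std::move(tr)] { return tr != nullptr; };
    return continuation();  // true: the pointer is still valid
}
```

In the buggy variant any tracing done in the continuation is skipped without an error, which matches the commit's remark that the code "works" only because the tracing helpers check for a null pointer.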
Piotr Sarna	34085c364f	view: exclude using static columns in the view filter The code which applied view filtering (i.e. a condition placed on a view column, e.g. "WHERE v = 42") erroneously used a wildcard selection, which also assumes that static columns are needed, if the base table contains any such columns. The filtering code currently assumes that no such columns are fetched, so the selection is amended to only ask for regular columns (primary key columns are sent anyway, because they are enabled via slice options, so no need to ask for them explicitly). Fixes #10851 Closes #10855 (cherry picked from commit `bc3a635c42`)	2022-07-11 17:06:55 +03:00
Takuya ASADA	323521f4c8	install.sh: install files with correct permission in strict umask setting To avoid failing to run scripts as a non-root user, we need to set permissions explicitly on executables. Fixes #10752 Closes #10840 (cherry picked from commit `13caac7ae6`)	2022-07-10 16:46:30 +03:00
Asias He	1ad59d6a7b	repair: Do not flush hints and batchlog if tombstone_gc_mode is not repair The flush of hints and batchlog is needed only for tables with tombstone_gc_mode set to repair mode. We should skip the flush if the tombstone_gc_mode is not repair mode. Fixes #10004 Closes #10124 (cherry picked from commit `ec59f7a079`)	2022-07-04 10:31:51 +03:00
Nadav Har'El	d3045df9c9	Merge 'types: fix is_string for reversed types' from Piotr Sarna Checking if the type is string is subtly broken for reversed types, and these types will not be recognized as strings, even though they are. As a result, if somebody creates a column with DESC order and then tries to use operator LIKE on it, it will fail because the type would not be recognized as a string. Fixes #10183 Closes #10181 * github.com:scylladb/scylla: test: add a case for LIKE operator on a descending order column types: fix is_string for reversed types (cherry picked from commit `733672fc54`)	2022-07-03 17:59:33 +03:00
Benny Halevy	be48b7aa8b	compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group It was assumed that offstrategy compaction is always triggered by streaming/repair where it would inherit the caller's scheduling group. However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see how the expiration of this timer will inherit anything from streaming/repair. Also, since `d309a86`, offstrategy compaction may be triggered by the api where it will run in the default scheduling group. The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`. Fixes #10151 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com> (cherry picked from commit `0764e511bb`)	2022-07-03 14:28:47 +03:00
Takuya ASADA	3c4688bcfa	scylla_coredump_setup: support new format of Storage field The Storage field of "coredumpctl info" changed at systemd-v248: it now adds "(present)" at the end of the line when the coredump file is available. Fixes #10669 Closes #10714 (cherry picked from commit `ad2344a864`)	2022-07-03 13:55:18 +03:00
Nadav Har'El	cc22021876	alternator: forbid empty AttributesToGet In DynamoDB one can retrieve only a subset of the attributes using the AttributesToGet or ProjectionExpression parameters to read requests. Neither allows an empty list of attributes - if you don't want any attributes, you should use Select=COUNT instead. Currently we correctly refuse an empty ProjectionExpression - and have a test for it: test_projection_expression.py::test_projection_expression_toplevel_syntax However, Alternator is missing the same empty-forbidding logic for AttributesToGet. An empty AttributesToGet is currently allowed, and basically says "retrieve everything", which is sort of unexpected. So this patch adds the missing logic, and the missing test (actually two tests for the same thing - one using GetItem and the other Query). Fixes #10332 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220405113700.9768-1-nyh@scylladb.com> (cherry picked from commit `9c1ebdceea`)	2022-07-03 13:35:50 +03:00
Yaron Kaikov	c9e79cb4a3	release: prepare for 5.0.0 scylla-5.0.0	2022-06-28 15:51:29 +03:00
Yaron Kaikov	f28542a71e	release: prepare for 5.0.rc8 scylla-5.0.rc8	2022-06-12 14:44:47 +03:00
Pavel Emelyanov	527a75a4c0	Update seastar submodule (Calculate max IO lengths as lengths) * seastar 8b2c13b3...dbf79189 (1): > Merge 'Calculate max IO lengths as lengths' io_queue: Type alias for internal::io_direction_and_length io_queue, fair_group: Throw instead of assert io_queue: Keep max lengths on board io_queue: Toss request_fq_ticket() io_queue: Introduce make_ticket() helper io_queue: Remove max_ticket_size io_queue: Make make_ticket() non-brancy io_queue: Add devid to group creation log tests: cstress(release) fixes: #10704	2022-06-09 21:15:21 +03:00
Avi Kivity	df00f8fcfb	Update seastar submodule (json crash in describe_ring) * seastar 7a430a0830...8b2c13b346 (1): > Merge 'stream_range_as_array: always close output stream' from Benny Halevy Fixes #10592.	2022-06-08 16:48:28 +03:00
Yaron Kaikov	41a00c744f	release: prepare for 5.0.rc7 scylla-5.0.rc7	2022-06-02 15:13:59 +03:00
Avi Kivity	2d7b6cd702	messaging: do isolate default tenants In `10dd08c9` ("messaging_service: supply and interpret rpc isolation_cookies", 4.2), we added a mechanism to perform rpc calls in remote scheduling groups based on the connection identity (rather than the verb), so that connection processing itself can run in the correct group (not just verb processing), and so that one verb can run in different groups according to need. In `16d8cdadc` ("messaging_service: introduce the tenant concept", 4.2), we changed the way isolation cookies are sent: scheduling_group messaging_service::scheduling_group_for_verb(messaging_verb verb) const { return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group; @@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge if (must_compress) { opts.compressor_factory = &compressor_factory; } opts.tcp_nodelay = must_tcp_nodelay; opts.reuseaddr = true; - opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + // We send cookies only for non-default statement tenant clients. + if (idx > 3) { + opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie; + } This effectively disables the mechanism for the default tenant. As a result some verbs will be executed in whatever group the messaging service listener was started in. This used to be the main group, but in `554ab03` ("main: Run init_server and join_cluster inside maintenance scheduling group", 4.5), this was changed to the maintenance group. As a result normal reads/writes now compete with maintenance operations, raising their latency significantly. Fix by sending the isolation cookie for all connections. With this, a 2-node cassandra-stress load sees its 99th percentile increase by just 3ms during repair, compared to 10ms+ before. Fixes #9505. Closes #10673 (cherry picked from commit `c83393e819`)	2022-06-01 17:20:30 +03:00
Avi Kivity	ff79228178	Merge 'Allow trigger off strategy compaction early for node operations' from Asias He This patch set adds two commits to allow triggering off-strategy compaction early for node operations. 1) repair: Repair table by table internally This patch changes the way a repair job walks through tables and ranges if multiple tables and ranges are requested by users. Before: ``` for range in ranges for table in tables repair(range, table) ``` After: ``` for table in tables for range in ranges repair(range, table) ``` The motivation for this change is to allow off-strategy compaction to trigger early, as soon as a table is finished. This reduces the number of temporary sstables on disk. For example, if there are 50 tables and 256 ranges to repair, each range will generate one sstable. Before this change, there will be 50 * 256 sstables on disk before off-strategy compaction triggers. After this change, once a table is finished, off-strategy compaction can compact the 256 sstables. As a result, this would reduce the number of sstables by 50X. This is very useful for repair based node operations since multiple ranges and tables can be requested in a single repair job. Refs: #10462 2) repair: Trigger off strategy compaction after all ranges of a table is repaired When the repair reason is not repair, which means the repair reason is node operations (bootstrap, replace and so on), a single repair job contains all the ranges of a table that need to be repaired. To trigger off-strategy compaction early and reduce the number of temporary sstable files on disk, we can trigger the compaction as soon as a table is finished. Refs: #10462 Closes #10551 * github.com:scylladb/scylla: repair: Trigger off strategy compaction after all ranges of a table is repaired repair: Repair table by table internally (cherry picked from commit `e65b3ed50a`)	2022-06-01 14:17:01 +03:00
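
The loop reordering described in the last commit above (`ff79228178`) can be illustrated with a small simulation. This is a hedged sketch, not Scylla code: the function `peak_temporary_sstables` and its parameters are hypothetical, and the model assumes, as the commit message does, that each `repair(range, table)` call leaves one temporary sstable on disk, that the old ordering triggers off-strategy compaction only once the whole job is done, and that the new ordering triggers it as soon as a table's last range is repaired.

```python
def peak_temporary_sstables(n_tables, n_ranges, table_by_table):
    """Return the peak number of temporary sstables on disk during one repair job.

    Illustrative model only (not Scylla's implementation):
    - every repair(range, table) call writes one temporary sstable;
    - with table_by_table=True, off-strategy compaction removes a table's
      temporary sstables as soon as all of its ranges are repaired;
    - with table_by_table=False, nothing is compacted until the job ends.
    """
    if table_by_table:
        # After: for table in tables: for range in ranges
        order = [(t, r) for t in range(n_tables) for r in range(n_ranges)]
    else:
        # Before: for range in ranges: for table in tables
        order = [(t, r) for r in range(n_ranges) for t in range(n_tables)]

    pending = [0] * n_tables   # temporary sstables currently on disk, per table
    repaired = [0] * n_tables  # ranges already repaired, per table
    on_disk = 0                # running total of temporary sstables
    peak = 0

    for table, _range in order:
        pending[table] += 1
        repaired[table] += 1
        on_disk += 1
        peak = max(peak, on_disk)
        if table_by_table and repaired[table] == n_ranges:
            # The table is finished: off-strategy compaction consumes its sstables.
            on_disk -= pending[table]
            pending[table] = 0

    return peak


if __name__ == "__main__":
    print(peak_temporary_sstables(50, 256, table_by_table=False))
    print(peak_temporary_sstables(50, 256, table_by_table=True))
```

Under these assumptions, 50 tables and 256 ranges give a peak of 12800 temporary sstables with the old ordering and 256 with the new one, which is the 50X reduction the commit message quotes.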

30087 Commits