When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but that is
pointless: if the partition has no more data, we can just drop the
header, as we won't need it on the next page.
The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.
Fixes: https://github.com/scylladb/scylladb/issues/9482
Closes #11912
* github.com:scylladb/scylladb:
test/cql-pytest: add regression test for "IDL frame truncated" error
mutation_compactor: detach_state(): make it no-op if partition was exhausted
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash. This wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.
Tests: 1. The fixed issue scenario from github.
2. Unit tests in release mode.
Fixes #11774
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>
Closes #11777
(cherry picked from commit ab7429b77d)
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.
maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.
So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.
This patch also adds two tests that reproduce the bug and verify its
fix:
1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
that maybe_quote() quotes reserved keywords like "to", but doesn't
quote unreserved keywords like "int".
2. Add a test reproducing issue #9450 - creating a materialized view
whose key column is a keyword. This new test passes on Cassandra,
failed on Scylla before this patch, and passes after this patch.
It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:
1. Try hard not to introduce new reserved keywords. Instead, introduce
unreserved keywords. We've been doing this even before recognizing
this maybe_quote() future-compatibility problem.
2. In the next patch we will introduce quote() - which unconditionally
quotes identifier names, even if lowercase. These quoted names will
be uglier for lowercase names - but will be safe from future
introduction of new keywords. So we can consider switching some or
all uses of maybe_quote() to quote().
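The difference between maybe_quote() and quote() can be illustrated with
the following minimal sketch. It is not the code of this patch: the patch
asks the CQL parser whether a name is a valid identifier, while the sketch
assumes a small hard-coded (and incomplete) keyword set and assumes that a
plain identifier matches [a-z][a-z0-9_]*. The keyword list is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical, deliberately incomplete list of reserved keywords.
// The patch consults the CQL parser instead of keeping such a list.
static const std::unordered_set<std::string_view> reserved_keywords = {
    "to", "where", "select", "from", "table",
};

// Unconditionally quote an identifier, doubling embedded double quotes.
std::string quote(std::string_view name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') {
            out += '"';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Quote only when the name cannot be used verbatim as an identifier.
std::string maybe_quote(std::string_view name) {
    auto is_plain_char = [] (unsigned char c) {
        return std::islower(c) || std::isdigit(c) || c == '_';
    };
    const bool plain = !name.empty()
            && std::islower(static_cast<unsigned char>(name.front()))
            && std::all_of(name.begin(), name.end(), is_plain_char);
    if (plain && reserved_keywords.count(name) == 0) {
        return std::string(name);
    }
    return quote(name);
}
```

In this sketch "int" is left alone because it is absent from the reserved
set, mirroring how unreserved keywords do not need quoting.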
Fixes #9450
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
(cherry picked from commit 5d2f694a90)
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings of Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.
Fixes #11470
Closes #11693
(cherry picked from commit 636e14cc77)
The problem was an incompatibility with Cassandra, which accepts a bool
given as a string in the `fromJson()` UDF. The remaining difference
between Cassandra and Scylla is that Scylla accepts whitespace around
the word in the string and Cassandra doesn't. Both are case-insensitive.
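The accepted inputs can be sketched as follows. This is an illustration
only, not the patched code; the function name parse_json_bool_string() is
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: interpret a JSON string as a boolean.
// Returns std::nullopt if the string is neither "true" nor "false".
std::optional<bool> parse_json_bool_string(std::string_view s) {
    auto is_space = [] (unsigned char c) { return std::isspace(c) != 0; };
    // Trim surrounding whitespace (accepted by Scylla, not by Cassandra).
    while (!s.empty() && is_space(s.front())) {
        s.remove_prefix(1);
    }
    while (!s.empty() && is_space(s.back())) {
        s.remove_suffix(1);
    }
    // Compare case-insensitively.
    std::string lower(s);
    std::transform(lower.begin(), lower.end(), lower.begin(),
            [] (unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lower == "true") {
        return true;
    }
    if (lower == "false") {
        return false;
    }
    return std::nullopt;
}
```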
Fixes: #7915
(cherry picked from commit 1902dbc9ff)
The EC2 instance metadata service can be busy, so let's retry connecting
at an interval, just like we do in scylla-machine-image.
Fixes #10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #11688
(cherry picked from commit 6b246dc119)
(cherry picked from commit e2809674d2)
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.
This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
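A heavily simplified sketch of the new contract follows. The types
toy_compactor and detached_state are hypothetical stand-ins, and the
sketch assumes the compactor learns about exhaustion through
consume_partition_end(); the real compactor tracks considerably more
state (for example the active range tombstone).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the fragments that have to be re-fed to a
// newly constructed compactor.
struct detached_state {
    std::string partition_start;
    std::optional<std::string> static_row;
};

class toy_compactor {
    std::optional<detached_state> _current; // engaged once a partition was started
    bool _exhausted = true;                 // no more data for the current partition
public:
    void consume_partition_start(std::string key) {
        _current = detached_state{std::move(key), std::nullopt};
        _exhausted = false;
    }
    void consume_static_row(std::string row) {
        assert(_current);
        _current->static_row = std::move(row);
    }
    void consume_partition_end() {
        _exhausted = true;
    }
    // Disengaged when the partition was exhausted: there is nothing to
    // push back, so the caller cannot re-emit a header without an end.
    std::optional<detached_state> detach_state() const {
        if (!_current || _exhausted) {
            return std::nullopt;
        }
        return _current;
    }
};
```

A caller pushes fragments back into the reader only when the returned
optional is engaged.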
(cherry picked from commit 70b4158ce0)
As described in issue #11801, we saw cases in Alternator, when a GSI has both partition and sort keys which were non-key attributes in the base, where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
In this series we fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes #11801
Closes #11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
(cherry picked from commit e981bd4f21)
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.
Fixes #10668.
Closes #10678
(cherry picked from commit ef7643d504)
When being stopped, the compaction manager may hit ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to log a
warning and proceed as if nothing happened.
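The intended behaviour can be sketched synchronously as below. This is
an assumption-laden illustration: the real code is future-based, and the
helper name stop_tolerating_enospc() and the log text are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <iostream>
#include <system_error>

// Hypothetical helper: run a stop routine and tolerate ENOSPC.
// Returns true if ENOSPC was hit (and only logged), false on a clean stop.
// Any other error still propagates to the caller.
bool stop_tolerating_enospc(const std::function<void()>& stop_fn) {
    try {
        stop_fn();
    } catch (const std::system_error& e) {
        if (e.code() != std::errc::no_space_on_device) {
            throw; // anything else is still a real failure
        }
        std::cerr << "warning: ENOSPC while stopping, ignoring: "
                  << e.what() << '\n';
        return true;
    }
    return false;
}
```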
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it a future-returning method and set up the _stop_future in its
only caller. This makes the next patch much simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
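The up-front check can be sketched as follows. The function name and the
error text are hypothetical, and the upper bound of 1.0 is an assumption
that is not stated above; only the minimal rate 6.71e-5 comes from the
description.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Minimal supported false-positive rate, as quoted above.
constexpr double min_supported_bloom_filter_fp_chance = 6.71e-5;

// Hypothetical validation run when the table is created or altered,
// long before any memtable flush needs to build a filter.
void validate_bloom_filter_fp_chance(double fp_chance) {
    // The upper bound of 1.0 is an assumption of this sketch.
    if (fp_chance < min_supported_bloom_filter_fp_chance || fp_chance > 1.0) {
        throw std::invalid_argument(
                "bloom_filter_fp_chance must be between "
                + std::to_string(min_supported_bloom_filter_fp_chance)
                + " and 1.0, got " + std::to_string(fp_chance));
    }
}
```

Rejecting the value at schema-change time turns a crash during flush into
an ordinary request error.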
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes #11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11576
(cherry picked from commit 4c93a694b7)
When updating an updateable value via CQL the new value comes as a
string that's then boost::lexical_cast-ed to the desired value. If the
cast throws, the respective exception is printed in the logs, which is
very likely uncalled for.
fixes: #10394
tests: manual
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220503142942.8145-1-xemul@scylladb.com>
(cherry picked from commit 063d26bc9e)
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.
The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.
So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.
Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.
Fixes #11222
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11298
(cherry picked from commit 941c719a23)
The generator first set the marker and then applied tombstones.
The marker was set like this:
row.marker() = random_row_marker();
Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.
However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.
This could generate rows with markers uncompacted with shadowable tombstones.
This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.
Fix by making sure there are no key clashes.
Closes #11663
(cherry picked from commit 5268f0f837)
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.
But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.
(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views. Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)
Closes #10760.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(cherry picked from commit aab4cd850c)
Fixes #10760.
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exist only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate names.
So in this patch we add a test for this issue, and fix it.
Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.
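The duplicate-name check described above could look roughly like this.
The function name and the error text are hypothetical; the sketch only
shows the idea of collecting GSI and LSI names into one set.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical check: index names must be unique across GSIs and LSIs.
void check_unique_index_names(const std::vector<std::string>& gsi_names,
                              const std::vector<std::string>& lsi_names) {
    // The views point into the vectors, which outlive this set.
    std::unordered_set<std::string_view> seen;
    auto add = [&seen] (std::string_view name) {
        if (!seen.insert(name).second) {
            throw std::invalid_argument(
                    "Duplicate IndexName '" + std::string(name) + "'");
        }
    };
    for (const auto& name : gsi_names) {
        add(name);
    }
    for (const auto& name : lsi_names) {
        add(name);
    }
}
```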
Fixes #10789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8866c326de)
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving an exponentially increasing
amount of space. Also, to avoid large allocations, it switches the
container to chunked_vector.
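The reservation strategy can be sketched as below. The sketch uses
std::vector rather than chunked_vector, and the helper names are
hypothetical. Space is still reserved before every push, but the
capacity is doubled whenever it runs out, so only a logarithmic number
of reallocations happens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: make room for one more element, so that the
// following push_back() does not need to allocate. Doubles the capacity
// instead of asking for exactly size() + 1.
template <typename T>
void reserve_for_one_more(std::vector<T>& v) {
    if (v.size() == v.capacity()) {
        v.reserve(std::max<std::size_t>(1, v.capacity() * 2));
    }
}

// Counts how many times the buffer was reallocated while appending
// n entries to an initially empty undo log.
std::size_t count_reallocations(std::size_t n) {
    std::vector<int> undo_log;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const auto before = undo_log.capacity();
        reserve_for_one_more(undo_log);
        if (undo_log.capacity() != before) {
            ++reallocations;
        }
        undo_log.push_back(static_cast<int>(i));
    }
    return reallocations;
}
```

Each reallocation copies all existing entries, so keeping their count
logarithmic makes the total copying work linear in the number of entries.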
Fixes #11211
Closes #11215
(cherry picked from commit 7f80602b01)
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421
Closes #11422
(cherry picked from commit be9d1c4df4)
Some tests want to generate a fixed amount of random partitions, make
their life easier.
(cherry picked from commit 98f3d516a2)
Ref #11421 (prerequisite)
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.
(Would info be more appropriate here than warning?)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10556
Fixes #10636
(cherry picked from commit 00ed4ac74c)
Currently, if a table is dropped during streaming, the streaming would
fail with no_such_column_family error.
Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.
This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.
This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.
This patch also fixes a possible use-after-free issue by not passing
the table reference object around.
Fixes #10395
Closes #10396
(cherry picked from commit 953af38281)
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed even though it
hasn't been closed. If a reader didn't start any operations, however, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.
Code excerpt:
    flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
        class transforming_reader : public flat_mutation_reader_v2::impl {
            // ...
        };
        return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
    }
Fixes #9065.
Fixes #11491
(cherry picked from commit 9ada63a9cb)
from Tomasz Grabiec
This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307
Closes #11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
(cherry picked from commit 484004e766)
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.
Closes #11260
(cherry picked from commit 8ee5b69f80)
This pull request backports 3 important fixes from adc08d0ab9. Said 3 commits fixed important bugs in the v2 variant of the evictable reader, but were not backported because they were part of a large series doing v2 conversion in general. This means that 5.0 was left with a buggy evictable reader v2, which is used by repair. So far in the wild we've seen one bug manifest itself: the evictable reader getting stuck, spinning in a tight loop in `evictable_reader_v2::do_fill_buffer()`, in turn making repair get stuck too.
Fixes: #11223
Closes #11540
* github.com:scylladb/scylladb:
test/boost/mutation_reader_test: add v2 specific evictable reader tests
evictable_reader_v2: terminate active range tombstones on reader recreation
evictable_reader_v2: restore handling of non-monotonically increasing positions
evictable_reader_v2: simplify handling of reader recreation
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
(cherry picked from commit cdb3e71045)
One is a reincarnation of the recently removed
test_multishard_combining_reader_non_strictly_monotonic_positions. The
latter was actually targeting the evictable reader but through the
multishard reader, probably for historic reasons (evictable reader was
part of the multishard reader family).
The other one checks that active range tombstone changes are properly
terminated when the partition ends abruptly after recreating the reader.
(cherry picked from commit 014a23bf2a)
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone or even the partition started before will continue after
too. So we have to make sure to wrap up all loose threads when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.
(cherry picked from commit 9e48237b86)
We thought that unlike v1, v2 will not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
_next_position_in_partition, thus ensuring the next time we recreate
the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
larger position.
The code is just much nicer because it uses coroutines.
(cherry picked from commit 6db08ddeb2)
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not, all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward to use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
circumstances due to a difference between the v1 and v2 versions of
`is_end_of_stream()`.
* The position of the first non-dropped fragment was not validated
(this was integrated into the range tombstone trimming which was
thrown out by the conversion).
(cherry picked from commit 498d03836b)
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot rely on
the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.
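The rule above can be sketched with toy types. toy_service and
toy_controller are hypothetical stand-ins: the destructor assertion
imitates the `_instances.empty()' check of a sharded service, and the
tls_config_ok flag stands for any setup step that may throw.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a sharded service: it must be stopped
// before it is destroyed.
class toy_service {
    bool _running = false;
public:
    ~toy_service() { assert(!_running); }
    void start() { _running = true; }
    void stop() { _running = false; }
    bool running() const { return _running; }
};

class toy_controller {
    toy_service& _server;
public:
    explicit toy_controller(toy_service& server) : _server(server) {}

    void start_server(bool tls_config_ok) {
        _server.start();
        try {
            if (!tls_config_ok) {
                throw std::runtime_error("Failed to set up TLS credentials");
            }
        } catch (...) {
            // Nobody will call stop_server() for us after a failed
            // start, so undo what was already started before rethrowing.
            _server.stop();
            throw;
        }
    }

    void stop_server() { _server.stop(); }
};
```

With this shape the caller only sees the original exception, and the
service can be destroyed safely.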
This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:
<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.
After this patch we get the right error message:
ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])
Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.
Fixes #10025
Fixes #11520
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
(cherry picked from commit 7f89c8b3e3)
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.
Fixes #11476
(cherry picked from commit 1c2eef384d)
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.
We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.
This is needed because our CI runs "ninja dev-headers".
Fixes #10995
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11457
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore will not end up accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes #11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
(cherry picked from commit afa7960926)
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).
Fixes #10286
Closes #10294
(cherry picked from commit 85e95a8cc3)
Add support for specifying integers in scientific format (for example
1.234e8) in the INSERT JSON statement:
INSERT INTO table JSON '{"int_column": 1e7}';
Inserting a floating-point number ending with .0 is allowed, as
the fractional part is zero. Non-zero fractional part (for example
12.34) is disallowed. A new test is added to test all those behaviors.
Before the JSON parsing library was switched to RapidJSON from JsonCpp,
this statement used to work correctly, because JsonCpp transparently
casts double to integer value.
This behavior differs from Cassandra, which disallows those types of
numbers (1e7, 123.0 and 12.34).
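The acceptance rule described above (zero fractional part allowed,
non-zero fractional part rejected) can be sketched as follows. The
function name is hypothetical, the sketch does not use RapidJSON, and
the 64-bit range check is an assumption about the target column type.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical helper: convert a JSON number that was parsed as a double
// into an integer, but only if no information would be lost.
std::optional<std::int64_t> double_to_integer(double d) {
    // 2^63 is exactly representable as a double; int64 covers [-2^63, 2^63).
    constexpr double two_to_63 = 9223372036854775808.0;
    if (!std::isfinite(d) || d < -two_to_63 || d >= two_to_63) {
        return std::nullopt;
    }
    if (std::trunc(d) != d) {
        return std::nullopt; // non-zero fractional part, e.g. 12.34
    }
    return static_cast<std::int64_t>(d);
}
```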
Fix typo in if condition: "if (value.GetUint64())" to
"if (value.IsUint64())".
Fixes #10100
(cherry picked from commit efe7456f0a)
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes #11239
Closes #11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0.
Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but nevertheless was fairly straightforward - just copy the exact same checking code to its right place, and keep the exact same tests to see we indeed fixed the bug.
Refs #10535.
The original cover letter from https://github.com/scylladb/scylla/pull/10420:
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically.
In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL.
However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes https://github.com/scylladb/scylla/issues/10361
Fixes https://github.com/scylladb/scylla/issues/10399
Fixes https://github.com/scylladb/scylla/pull/10401
Closes #11142
* github.com:scylladb/scylla:
test/cql-pytest: reproducer for CONTAINS NULL bug
expressions: don't dereference invalid map subscript in filter
expressions: fix invalid dereference in map subscript evaluation
test/cql-pytest: improve tests for map subscripts and nulls
This is a reproducer for issue #10359 that a "CONTAINS NULL" and
"CONTAINS KEY NULL" restrictions should not match any set, but currently
do match non-empty or all sets.
The tests currently fail on Scylla, so marked xfail. They also fail on
Cassandra because Cassandra considers such a request an error, which
we consider a mistake (see #4776) - so the tests are marked "cassandra_bug".
Refs #10359.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412130914.823646-1-nyh@scylladb.com>
(cherry picked from commit ae0e1574dc)
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).
Cassandra returns an invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:
* For NULL, we consider m[NULL] to result in NULL - instead of an error.
This behavior is more consistent with other expressions that contain
null - for example NULL[2] and NULL<2 both result in NULL as well.
Moreover, if in the future we allow more complex expressions, such
as m[a] (where a is a column), we can find the subscript to be null
for some rows and non-null for other rows - and throwing an "invalid
query" in the middle of the filtering doesn't make sense.
* For UNSET_VALUE, we do consider this an error like Cassandra, and use
the same error message as Cassandra. However, the current implementation
checks for this error only when the expression is evaluated - not
before. It means that if the scan is empty before the filtering, the
error will not be reported and we'll silently return an empty result
set. We currently consider this ok, but we can also change this in the
future by binding the expression only once (today we do it on every
evaluation) and validating it once after this binding.
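The NULL handling chosen above can be modelled with std::optional, where
a disengaged optional plays the role of NULL. This is only a sketch: the
names are hypothetical, UNSET_VALUE is not modelled, and treating a
missing key as NULL is an assumption of the sketch.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

using cql_map = std::map<std::string, int>;

// m[key], where both the map and the key may be NULL. NULL "wins":
// m[NULL] and NULL[key] both evaluate to NULL instead of being
// dereferenced.
std::optional<int> subscript(const std::optional<cql_map>& m,
                             const std::optional<std::string>& key) {
    if (!m || !key) {
        return std::nullopt;
    }
    auto it = m->find(*key);
    if (it == m->end()) {
        return std::nullopt; // assumption: a missing key also yields NULL
    }
    return it->second;
}

// The filter "m[key] = expected": a NULL left-hand side never matches.
bool subscript_equals(const std::optional<cql_map>& m,
                      const std::optional<std::string>& key,
                      int expected) {
    const auto value = subscript(m, key);
    return value.has_value() && *value == expected;
}
```

Because the result is a value rather than an error, a scan can keep going
when the subscript happens to be NULL for some rows only.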
Fixes #10361
Fixes #10399
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit fbb2a41246)