The only user is the tests of downgrade_to_v1(), which uses it through
mutation source. To avoid any new users popping up, we make it a private
method of the latter. In the process the pass-through optimization is
dropped, it is not needed for tests anyway.
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
The only user is row level repair: it is replaced with
downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has
lots of downgrade_to_v1() calls, we will deal with these later all at
once.
Another use is the empty mutation source, this is trivially converted to
use the v2 variant.
The conversion is shallow: the meat of the logic remains v1, fragments
are converted to v2 right before being pushed into the buffer. This
approach is simple, surgical and is still better then a full
upgrade_to_v2().
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
The underlying mutation representation is still v1, so the
implementation still has to do conversion. This happens right above the
lsa reader component.
Reads (part of operations) running concurrent to `drop_column_family()`
can create querier cache entries while we wait for them to finish in
`await_pending_ops()`. Move the cache entry eviction to after this, to
ensure such entries are also cleaned up before destroying the table
object.
This moves the `_querier_cache.evict_all_for_table()` from
`database::remove()` to `database::drop_column_family()`. With that the
former doesn't have to return `future<>` anymore. While at it (changing
the signature) also rename `column_family` -> `table`.
Also add a regression unit test.
Said method was already coroutinized, but only halfway, possibly because
of the difficulty in expressing `finally()` with coroutines. We now have
`coroutines::as_future()` which makes this easier, so finish the job.
These bring in wasm.hh (though they really shouldn't) and make
everyone suffer. Forward declare instead and add missing includes
where needed.
Closes#10444
Minor fixlets to make `ninja dev-headers` pass.
Closes#10445
* github.com:scylladb/scylla:
readers/from_mutations_v2.hh: make self-contained
data_dictionary/storage_options.hh: make self-contained
Reduce #include load by standardizing on std::any.
In keys.cc, we just drop the unneeded include.
One instance of boost::any remains in config_file, due to a tie-in with
other boost components.
Closes#10441
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respecive
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10430
DynamoDB limits partition-key length to 2048 bytes and sort-key length
to 1024 bytes. Alternator currently has no such limits officially, but
if a user tries a key length of over 64 KB, the result will be an
"internal server error" as Alternator runs into Scylla's low-level key
length limit of 64 KB.
In this patch we add (mostly xfailing) tests confirming all the above
observations. The tests include extensive comments on what they are
testing and why. Some of these tests (specifically, the ones checking
what happens above 64 KB) should pass once Alternator is fixed. Other
tests - requiring that the limits be exactly what they are in DynamoDB -
may either not pass or change in the future, depending on what we decide
the limits should be in Alternator.
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10438
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.
After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().
Fixes#10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
fsync_directory() is broken because it's unconditionally performing
fsync on parent directory, not on the directory that it was called
with. To fix, let's remove wrong parent_path() usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
dirname() is confusing because if it's called on a directory, parent
path is retrieved. By renaming it to parent_path(), it's clearer
what the function will do exactly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.
Closes#10433
On acaf0bb we applied out() just for perftune.py because we had issue #10390
with this script.
But the issue can happen with other commands too, let's apply it to all
commands which uses capture_output.
related #10390Closes#10414
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10419
lookup_readers might fail after populating some readers
and those better be closed before returning the exception.
Fixes#10351
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10425
"
The method in question performs node bootstrap in several different
modes
(regular, replacing, rnbo) and several subsequent if-else branches just
duplicate each-other. This set merges them making the code easier to
read.
"
* 'br-less-branchy-bootstrap' of https://github.com/xemul/scylla:
storage_service: Remove pointless check in replace-bootstrap
storage_service: Generalize wait for range setup
storage_service: Merge common if-else branches in bootstrap
storage_service: Move tables bootstrap-ON upwards
The doc is being updated to reflect the changes in the commit
d8833de3bb ("Redefine Compaction Backlog to tame
compaction aggressiveness").
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.
Fixes#10379
Currently an exception is thrown in the apply stage
when the schema is not synced, but it is too late
since returning an error doesn't pinpoint which code
path was using an unsync'ed schema so move the check
earlier, before _apply_stage is called.
We need to make sure the schema is synced earlier
when the mutation is applied so call on_internal_error
to generate a backtrace in testing and still throw
an error in production.
Typically storage_proxy::mutate_locally implicitly
ensures the schema is synced by making a global_schema_ptr
for it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220424110057.3957597-1-bhalevy@scylladb.com>
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues #10361, #10399, #10401. The existing test `test_null.py::test_map_subscript_null` reproduced all these bugs sporadically.
In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then **define** both `m[NULL]` and `NULL[2]` to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., `NULL < 2` is also defined as resulting in NULL.
However, this decision differs from Cassandra, where `m[NULL]` is considered an error but `NULL[2]` is allowed. We believe that making `m[NULL]` be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicate expressions like `m[a]`, where the column `a` can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
Fixes#10361Fixes#10399Fixes#10401Closes#10420
* github.com:scylladb/scylla:
expressions: don't dereference invalid map subscript in filter
expressions: fix invalid dereference in map subscript evaluation
test/cql-pytest: improve tests for map subscripts and nulls
Currently, if a table is dropped during streaming, the streaming would
fail with no_such_column_family error.
Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.
This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.
This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.
This patch also fixes a possible user after free issue by not passing
the table reference object around.
Fixes#10395Closes#10396
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).
Cassandra returns a invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:
* For NULL, we consider m[NULL] to result in NULL - instead of an error.
This behavior is more consistent with other expressions that contain
null - for example NULL[2] and NULL<2 both result in NULL as well.
Moreover, if in the future we allow more complex expressions, such
as m[a] (where a is a column), we can find the subscript to be null
for some rows and non-null for other rows - and throwing an "invalid
query" in the middle of the filtering doesn't make sense.
* For UNSET_VALUE, we do consider this an error like Cassandra, and use
the same error message as Cassandra. However, the current implementation
checks for this error only when the expression is evaluated - not
before. It means that if the scan is empty before the filtering, the
error will not be reported and we'll silently return an empty result
set. We currently consider this ok, but we can also change this in the
future by binding the expression only once (today we do it on every
evaluation) and validating it once after this binding.
Fixes#10361Fixes#10399
Signed-off-by: Nadav Har'El <nyh@scylladb.com>