Currently, each data sync repair task is started (and hence run) twice.
Thus, when two running operations happen within a time frame long
enough, the following situation may occur:
- the first run finishes
- after some time (ttl) the task is unregistered from the task manager
- the second run finishes and attempts to finish the task which does
not exist anymore
- memory access causes a segfault.
The second call to start is deleted. A check is added
to the start method to ensure that each task is started at most once.
Fixes: #12089Closes#12090
In ad3d2ee47d, we replaced `bool` as an expression element
(representing a boolean constant) with `constant`. But a comment
and a concept continue to mention it.
Remove the comment and the concept fragment.
Closes#12119
Recent changes in topology restricted the get_dc/get_rack calls. Older
code was trying to locate the endpoint in gossiper, then in system
keyspace cache and if the endpoint was not found in both -- returned
"default" location.
New code generates internal error in this case. This approach already
helped to spot several BUGs in code that had been eventually fixed, but
echoes of that change still pop up.
This patch relaxes the "missing endpoint" case by printing a warning in
logs and returning back the "default" location like old code did.
tests: update_cluster_layout_tests.py::*
hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_rebalance
bootstrap_test.py::TestBootstrap::test_decommissioned_wiped_node_can_join
bootstrap_test.py::TestBootstrap::test_failed_bootstap_wiped_node_can_join
materialized_views_test.py::TestMaterializedViews::test_decommission_node_during_mv_insert_4_nodes
refs: #11900
refs: #12054fixes: #11870
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12067
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.
If we want to reuse the IP address, we don't allocate one using the host
registry. This required certain refactors: moving the code responsible
for allocation of IPs outside `ScyllaServer`, into `ScyllaCluster`.
Add two tests, but they are now skipped: one of them is failing (unability
for new node to join group 0) and both suffer from a hardcoded 60-second sleep
in Scylla.
Closes#12032
* github.com:scylladb/scylladb:
test/topology: simple node replace tests (currently disabled)
test/pylib: scylla_cluster: support node replace operation
test/pylib: scylla_cluster: move members initialization to constructor
test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer
test/pylib: scylla_cluster: refactor create_server parameters to a struct
test.py: stop/uninstall clusters instead of servers when cleaning up
test/pylib: artifact_registry: replace `Awaitable` type with `Coroutine`
test.py: prepare for adding extra config from test when creating servers
test/pylib: manager_client: convert `add_server` to use `put_json`
test/pylib: rest_client: allow returning JSON data from `put_json`
test/pylib: scylla_cluster: don't import from manager_client
This reverts commit e2fe8559ca. I
ran all the release mode tests on aarch64 with it reverted, and
it passes. So it looks like whatever problems we had with it
were fixed.
Closes#12072
As a part of CQL rewrite we want to be able to perform filtering by calling `evaluate()` on an expression and checking if it evaluates to `true`. Currently trying to do that for a binary operator would result in an error.
Right now checking if a binary operation like `col1 = 123` is true is done using `is_satisfied_by`, which is able to check if a binary operation evaluates to true for a small set of predefined cases.
Eventually once the grammar is relaxed we will be able to write expressions like: `(col1 < col2) = (1 > ?)`, which doesn't fit with what `is_satisfied_by` is supposed to do.
Additionally expressions like `1 = NULL` should evaluate to `NULL`, not `true` or `false`. `is_satsified_by` is not able to express that properly.
The proper way to go is implementing `evaluate(binary_operator)`, which takes a binary operation and returns what the result of it would be.
Implementing `prepare_expression` for `binary_operator` requires us to be able to evaluate it first. In the next PR I will add support for `prepare_expression`.
Closes#12052
* github.com:scylladb/scylladb:
cql-pytest: enable two unset value tests that pass now
cql-pytest: reduce unset value error message
cql3: expr: change unset value error messages to lowercase
cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
cql3: expr: remove needless braces around switch cases
cql3: move evaluation IS_NOT NULL to a separate function
expr_test: test evaluating LIKE binary_operator
expr_test: test evaluating IS_NOT binary_operator
expr_test: test evaluating CONTAINS_KEY binary_operator
expr_test: test evaluating CONTAINS binary_operator
expr_test: test evaluating IN binary_operator
expr_test: test evaluating GTE binary_operator
expr_test: test evaluating GT binary_operator
expr_test: test evaluating LTE binary_operator
expr_test: test evaluating LT binary_operator
expr_test: test evaluating NEQ binary_operator
expr_test: test evaluating EQ binary_operator
cql3: expr properly handle null in is_one_of()
cql3: expr properly handle null in like()
cql3: expr properly handle null in contains_key()
cql3: expr properly handle null in contains()
cql3: expr: properly handle null in limits()
cql3: expr: remove unneeded overload of limits()
cql3: expr: properly handle null in equality operators
cql3: expr: remove unneeded overload of equal()
cql3: expr: use evaluate(binary_operator) in is_satisfied_by
cql3: expr: handle IS NOT NULL when evaluating binary_operator
cql3: expr: make it possible to evaluate binary_operator
cql3: expr: accept expression as lhs argument to like()
cql3: expr: accept expression as lhs in contains_key
cql3: expr: accept expression as lhs argument to contains()
The Alternator TTL expiration scanner scans an entire table using many
small pages. If any of those pages time out for some reason (e.g., an
overload situation), we currently consider the entire scan to have failed
and wait for the next scan period (which by default is 24 hours) when
we start the scan from scratch (at a random position). There is a risk
that if these timeouts are common enough to occur once or more per
scan, the result is that we double or more the effective expiration lag.
A better solution, done in this patch, is to retry from the same position
if a single page timed out - immediately (or almost immediately, we add
a one-second sleep).
Fixes#11737
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12092
scylla-driver causes dtests to fail randomly (likely
due to incorrect handling of the USE statement). Revert
it.
* tools/java 73422ee114...1c06006447 (2):
> Revert "Add Scylla Cloud serverless support"
> Revert "Switch cqlsh to use scylla-driver"
Recently, clang started complaining about std::unexpected_handler being
deprecated:
```
In file included from utils/exceptions.cc:18:
./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations]
std::unexpected_handler unexpectedHandler;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here
typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) ();
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED'
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED'
^
1 warning generated.
```
According to cppreference.com, it was deprecated in C++11 and removed in
C++17 (!).
This commit gets rid of the warning by inlining the
std::unexpected_handler typedef, which is defined as a pointer a
function with 0 arguments, returning void.
Fixes: #12022Closes#12074
On rare occassions a SELECT on a DROPpped table throws
cassandra.ReadFailure instead of cassandra.InvalidRequest. This could
not be reproduced locally.
Catch both exceptions as the table is not present anyway and it's
correctly marked as a failure.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#12027
Global index page caching, as introduced in 4.6
(078a6e422b and 9f957f1cf9) has proven to be misdesigned,
because it poses a risk of catastrophic performance regressions in
common workloads by flooding the cache with useless index entries.
Because of that risk, it should be disabled by default.
Refs #11202Fixes#11889Closes#11890
In system_keyspace::get_repair_history value of repair_uuid
is got from row as tasks::task_id.
tasks::task_id is represented by an abstract_type specific
for utils::UUID. Thus, since their typeids differ, bad_cast
is thrown.
repair_uuid is got from row as utils::UUID and then cast.
Since no longer needed, data_type_for<tasks::task_id> is deleted.
Fixes: #11966Closes#12062
While implementing evaluate(binary_operator)
missing checks for unset value were added
for comparisons in filtering code.
Because of that some tests for unset value
started passing.
There are still other tests for unset value
that are failing because Scylla doesn't
have all the checks that it should.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When unset value appears in an invalid place
both Cassandra and Scylla throw an error.
The tests were written with Cassandra
and thus the expected error messages were
exactly the same as produced by Cassandra.
Scylla produces different error messages,
but both databases return messages with
the text 'unset value'.
Reduce the expected message text
from the whole message to something
that contains 'unset value'.
It would be hard to mimic Cassandra's
error messages in Scylla. There is no
point in spending time on that.
Instead it's better to modify the tests
so that they are able to work with
both Cassandra and Scylla.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The messages used to contain UNSET_VALUE
in capital letters, but the tests
expect messages with 'unset value'.
Change the message so that it can
match the expected error text in tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add two node replace tests using the freshly added infrastructure.
One test replaces a node while using a different IP. It is disabled
because the replace operation has an unconditional 60-seconds sleep
(it doesn't depend on the ring_delay setting for some reason). The sleep
needs to be fixed before we can enable this test.
The other test replaces while reusing the replaced node's IP.
Additionally to the sleep, the test fails because the node cannot join
group 0; it's stuck in an infinite loop of trying to join:
```
INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found no local group 0. Discovering...
INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader b7047f7e-03e6-4797-a723-24054201f91d
INFO 2022-11-18 15:56:19,934 [shard 0] raft_group0 - Server 8de951fd-a528-4a82-ac54-592ea269537f is starting group 0 with id 25d2b050-6751-11ed-b534-c3c40c275dd3
WARN 2022-11-18 15:56:20,935 [shard 0] raft_group0 - failed to modify config at peer b7047f7e-03e6-4797-a723-24054201f91d: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO 2022-11-18 15:56:21,937 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN 2022-11-18 15:56:22,937 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO 2022-11-18 15:56:23,938 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN 2022-11-18 15:56:24,939 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
```
and so on.
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.
If we want to reuse the IP address, we don't allocate one using the host
registry.
Since now multiple servers can have the same IP, introduce a
`leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing
the cluster - to make sure we don't `release_host` the same host twice.
Previously some members had to be initialized in `install` because
that's when we first knew the IP address.
Now we know the IP address during construction, which allows us to make
the code a bit shorter and simpler, and establish invariants: some
members (such as `self.config`) are now valid for the entire lifetime of
the server object.
`install()` is reduced to performing only side effects (creating
directories, writing config files), all calculation is done inside the
constructor.
`ScyllaServer`s were constructed without IP addresses. They leased an IP
address from `HostRegistry` and released them in `uninstall`.
This responsibility was now moved into `ScyllaCluster`, which leases an
IP address for a server before constructing it, and passes it to the
constructor. It releases the addresses of its serverswhen uninstalling
itself.
This will allow the cluster to reuse the IP address of an existing
server in that cluster when adding a new server which wants to replace
the existing one. Instead of leasing a new address, it will pass
the existing IP address to the new server's constructor.
The refactor is also nice in that it establishes an invariant for
`ScyllaServer`, simplifying reasoning about the class: now it has
an `ip_addr` field at all times.
`host_registry` was moved from `ScyllaServer` to `ScyllaCluster`.
`ScyllaCluster` constructor takes a function `create_server` which
itself takes 3 parameters now. Soon it will take a 4th. The list of
parameters is repeated at the constructor definition and the call site
of the constructor, with many parameters it begins being tiresome.
Refactor the list of parameters to a `NamedTuple`.
`self.artifacts` was calling `ScyllaServer.stop` and
`ScyllaServer.uninstall`. Now it calls `ScyllaCluster.stop` and
`ScyllaCluster.uninstall`, which underneath stops/uninstalls
servers in this cluster.
We must be a bit more careful now in case installing/starting a
server inside a cluster fails: there are no server cleanup artifacts,
and a server is added to cluster's `running` map only after
`install_and_start` finishes (until that happens,
`ScyllaCluster.stop/uninstall` won't catch this server).
So handle failures explicitly in `install_and_start`.
This commit does not logically change how the tests are running - every
started server belongs to some cluster, so it will be cleaned up
- but it's an important refactor.
It will allow us to move IP address (de)allocation code outside
`ScyllaServer`, into `ScyllaCluster`, which in turn will allow us to
implement node replace operation for the case where we want to reuse
the replaced node's IP.
Also, `ScyllaCluster.uninstall` was unused before this change, now it's
used.
Currently, TTL is listed as one of the experimental features: https://docs.scylladb.com/stable/alternator/compatibility.html#experimental-api-features
This PR moves the feature description from the Experimental Features section to a separate section.
I've also added some links and improved the formatting.
@tzach I've relied on your release notes for RC1.
Refs: https://github.com/scylladb/scylladb/issues/5060Closes#11997
* github.com:scylladb/scylladb:
Update docs/alternator/compatibility.md
doc: update the link to Enabling Experimental Features
doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph
doc: update the links to the Enabling Experimental Features section
doc: add the link to the Enabling Experimental Features section
doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.
This patch removes this requirement and in essence GAs this feature.
Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.
This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.
Fixes#12037
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12049
The `cleanup_before_exit` method of `ArtifactRegistry` calls `close()`
on artifacts. mypy complains that `Awaitable` has no such method. In
fact, the `artifact` objects that we pass to `ArtifactRegistry`
(obtained by calling `async def` functions) do have a `close()` method,
and they are a particular case of `Awaitable`s, but in general not
all `Awaitable`s have `close()`.
Replace `Awaitable` with one of its subtypes: `Coroutine`. `Coroutine`s
have a `close()` method, and `async def` functions return objects of
this type. mypy no longer complains.
[PR](https://github.com/scylladb/scylladb/pull/9314) fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour of
`modification_statement::create_clustering_ranges` to return an
empty range in this case, since `possible_lhs_values` it
uses explicitly returns `empty_value_set` if it evaluates `rhs`
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
`modification_statement::process_where_clause`. So the only
problem was `modification_statement::execute_with_condition`
was not expecting an empty `clustering_range` in case of
a null clustering key.
Also this patch contains a fix for the problem with wrong
column name in Scylla error messages. If `INSERT` or `DELETE`
statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.
The problem occurs if the query contains a column with the list type,
otherwise
`statement_restrictions::process_clustering_columns_restrictions`
checks that all the components of the key are specified.
Closes#12047
* github.com:scylladb/scylladb:
cql: refactor, inline modification_statement::validate_primary_key_restrictions
cql: DELETE with null value for IN parameter should be forbidden
cql: add column name to the error message in case of null primary key component
cql: batch statement, inserting a row with a null key column should be forbidden
cql: wrong column name in error messages
modification_statement: fix LWT insert crash if clustering key is null
Release 5.1. introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE should include links to that page and section.
Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.
Related: https://github.com/scylladb/scylladb/pull/9810Closes#12070
* github.com:scylladb/scylladb:
doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list
doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections
Don't call get_datacenter(ep) without checking
first has_endpoint(ep) since the former may abort
on internal error if the endpoint is not listed
in topology.
Refs #11870
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12054
If a DELETE statement contains an IN operator and the
parameter value for it is NULL, this should also trigger
an error. This is in line with how Cassandra
behaves in this case.
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.
Fixes: #12060
If INSERT or DELETE statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.
The problem occurs if the query contains a column with the list type,
otherwise
statement_restrictions::process_clustering_columns_restrictions
checks that all the components of the key are specified.
Fixes: #12046
Returns an unordered set of datacenter names
to be used by network_topology_replication_strategy
and for ks_prop_defs.
The set is kept in sync with _dc_endpoints.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12023
Since we moved all IaaS code to scylla-machine-image, we nolonger need
AMI variable on sysconfig file or --ami parameter on setup scripts,
and also never used /etc/scylla/ami_disabled.
So let's drop all of them from Scylla core core.
Related with scylladb/scylla-machine-image#61
Closes#12043
When started the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir method works. It's shorter and simpler to pass these booleans into method directly, all the more so there's another flag that's already passed like this.
Closes#12005
* github.com:scylladb/scylladb:
sstable_directory: Move all RAII booleans onto flags
sstable_directory: Convert sort-sstables argument to flags struct
sstable_directory: Drop default filter
Scylla doesn't support combining restrictions
on token with other restrictions on partition key columns.
Some pieces of code depend on the assumption
that such combinations are allowed.
In case they were allowed in the future
these functions would silently start
returning wrong results, and we would
return invalid rows.
Add a test that will start failing once
this restriction is removed. It will
warn the developer to change the
functions that used to depend
on the assumption.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>