Compare commits


381 Commits

Author SHA1 Message Date
Yaniv Kaul
4b7ef8cbc3 Update create-cluster-multidc.rst - remove unneeded period
2022-11-28 18:10:01 +02:00
Avi Kivity
0da66371a5 storage_proxy: coroutinize inner continuation of create_hint_sync_point()
It is part of a coroutine::parallel_for_each(), which is safe for lambda coroutines.

Closes #12057
2022-11-28 11:30:00 +02:00
Avi Kivity
d12d42d1a6 Revert "configure: temporarily disable wasm support for aarch64"
This reverts commit e2fe8559ca. I
ran all the release mode tests on aarch64 with it reverted, and
it passes. So it looks like whatever problems we had with it
were fixed.

Closes #12072
2022-11-28 11:30:00 +02:00
Nadav Har'El
99a72a9676 Merge 'cql3: expr: make it possible to evaluate expr::binary_operator' from Jan Ciołek
As part of the CQL rewrite we want to be able to perform filtering by calling `evaluate()` on an expression and checking if it evaluates to `true`. Currently, trying to do that for a binary operator results in an error.

Right now checking if a binary operation like `col1 = 123` is true is done using `is_satisfied_by`, which is able to check if a binary operation evaluates to true for a small set of predefined cases.

Eventually once the grammar is relaxed we will be able to write expressions like: `(col1 < col2) = (1 > ?)`, which doesn't fit with what `is_satisfied_by` is supposed to do.
Additionally, expressions like `1 = NULL` should evaluate to `NULL`, not `true` or `false`. `is_satisfied_by` is not able to express that properly.

The proper way to go is implementing `evaluate(binary_operator)`, which takes a binary operation and returns what the result of it would be.

Implementing `prepare_expression` for `binary_operator` requires us to be able to evaluate it first. In the next PR I will add support for `prepare_expression`.

Closes #12052

* github.com:scylladb/scylladb:
  cql-pytest: enable two unset value tests that pass now
  cql-pytest: reduce unset value error message
  cql3: expr: change unset value error messages to lowercase
  cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
  cql3: expr: remove needless braces around switch cases
  cql3: move evaluation IS_NOT NULL to a separate function
  expr_test: test evaluating LIKE binary_operator
  expr_test: test evaluating IS_NOT binary_operator
  expr_test: test evaluating CONTAINS_KEY binary_operator
  expr_test: test evaluating CONTAINS binary_operator
  expr_test: test evaluating IN binary_operator
  expr_test: test evaluating GTE binary_operator
  expr_test: test evaluating GT binary_operator
  expr_test: test evaluating LTE binary_operator
  expr_test: test evaluating LT binary_operator
  expr_test: test evaluating NEQ binary_operator
  expr_test: test evaluating EQ binary_operator
  cql3: expr properly handle null in is_one_of()
  cql3: expr properly handle null in like()
  cql3: expr properly handle null in contains_key()
  cql3: expr properly handle null in contains()
  cql3: expr: properly handle null in limits()
  cql3: expr: remove unneeded overload of limits()
  cql3: expr: properly handle null in equality operators
  cql3: expr: remove unneeded overload of equal()
  cql3: expr: use evaluate(binary_operator) in is_satisfied_by
  cql3: expr: handle IS NOT NULL when evaluating binary_operator
  cql3: expr: make it possible to evaluate binary_operator
  cql3: expr: accept expression as lhs argument to like()
  cql3: expr: accept expression as lhs in contains_key
  cql3: expr: accept expression as lhs argument to contains()
2022-11-28 11:30:00 +02:00
Nadav Har'El
1e59c3f9ef alternator: if TTL scan times out, continue immediately
The Alternator TTL expiration scanner scans an entire table using many
small pages. If any of those pages time out for some reason (e.g., an
overload situation), we currently consider the entire scan to have failed
and wait for the next scan period (which by default is 24 hours) when
we start the scan from scratch (at a random position). There is a risk
that if these timeouts are common enough to occur once or more per
scan, the result is that we double or more the effective expiration lag.

A better solution, done in this patch, is to retry from the same position
if a single page timed out - immediately (or almost immediately, we add
a one-second sleep).

Fixes #11737

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12092
2022-11-28 11:30:00 +02:00
Avi Kivity
45a57bf22d Update tools/java submodule (revert scylla-driver)
scylla-driver causes dtests to fail randomly (likely
due to incorrect handling of the USE statement). Revert
it.

* tools/java 73422ee114...1c06006447 (2):
  > Revert "Add Scylla Cloud serverless support"
  > Revert "Switch cqlsh to use scylla-driver"
2022-11-28 11:29:08 +02:00
Kefu Chai
af011aaba1 utils/variant_element: simplify is_variant_element with right fold
for better readability than the recursive approach.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #12091
2022-11-27 16:34:34 +02:00
Avi Kivity
78222ea171 Update tools/java submodule (cqlsh system_distributed_everywhere is a system keyspace)
* tools/java 874e2d529b...73422ee114 (1):
  > Mark "system_distributed_everywhere" as system ks
2022-11-27 15:37:57 +02:00
Aleksandra Martyniuk
9a3d114349 tasks: move methods from task_manager to source file
Methods from tasks::task_manager and its nested classes are moved
to the source file.

Closes #12064
2022-11-27 15:09:28 +02:00
Piotr Dulikowski
22fbf2567c utils/abi: don't use the deprecated std::unexpected_handler
Recently, clang started complaining about std::unexpected_handler being
deprecated:

```
In file included from utils/exceptions.cc:18:
./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations]
    std::unexpected_handler unexpectedHandler;
         ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here
  typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) ();
                 ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED'
                               ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED'
                                             ^
1 warning generated.
```

According to cppreference.com, it was deprecated in C++11 and removed in
C++17 (!).

This commit gets rid of the warning by inlining the
std::unexpected_handler typedef, which is defined as a pointer to a
function taking no arguments and returning void.

Fixes: #12022

Closes #12074
2022-11-27 12:25:20 +02:00
Alejo Sanchez
5ff4b8b5f8 pytest: catch rare exception for random tables test
On rare occasions a SELECT on a DROPped table throws
cassandra.ReadFailure instead of cassandra.InvalidRequest. This could
not be reproduced locally.

Catch both exceptions as the table is not present anyway and it's
correctly marked as a failure.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12027
2022-11-27 10:26:55 +02:00
Michał Chojnowski
a75e4e1b23 db: config: disable global index page caching by default
Global index page caching, as introduced in 4.6
(078a6e422b and 9f957f1cf9) has proven to be misdesigned,
because it poses a risk of catastrophic performance regressions in
common workloads by flooding the cache with useless index entries.
Because of that risk, it should be disabled by default.

Refs #11202
Fixes #11889

Closes #11890
2022-11-26 14:27:26 +02:00
Anna Stuchlik
d5f676106e doc: remove the LWT page from the index of Enterprise features
Closes #12076
2022-11-24 21:59:05 +02:00
Aleksandra Martyniuk
dcc17037c7 repair: fix bad cast in tasks::task_id parsing
In system_keyspace::get_repair_history, the value of repair_uuid
is read from the row as tasks::task_id.
tasks::task_id is represented by an abstract_type specific
to utils::UUID. Thus, since their typeids differ, bad_cast
is thrown.

Now repair_uuid is read from the row as utils::UUID and then cast.
Since it is no longer needed, data_type_for<tasks::task_id> is deleted.

Fixes: #11966

Closes #12062
2022-11-24 19:37:44 +02:00
Jan Ciolek
77c7d8b8f6 cql-pytest: enable two unset value tests that pass now
While implementing evaluate(binary_operator),
missing checks for unset values were added
for comparisons in the filtering code.

Because of that some tests for unset value
started passing.

There are still other tests for unset value
that are failing because Scylla doesn't
have all the checks that it should.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:07:17 +01:00
Jan Ciolek
5bc0bc6531 cql-pytest: reduce unset value error message
When unset value appears in an invalid place
both Cassandra and Scylla throw an error.

The tests were written with Cassandra
and thus the expected error messages were
exactly the same as produced by Cassandra.

Scylla produces different error messages,
but both databases return messages with
the text 'unset value'.

Reduce the expected message text
from the whole message to something
that contains 'unset value'.

It would be hard to mimic Cassandra's
error messages in Scylla. There is no
point in spending time on that.
Instead it's better to modify the tests
so that they are able to work with
both Cassandra and Scylla.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:04:07 +01:00
Jan Ciolek
08f40a116d cql3: expr: change unset value error messages to lowercase
The messages used to contain UNSET_VALUE
in capital letters, but the tests
expect messages with 'unset value'.

Change the message so that it can
match the expected error text in tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:02:44 +01:00
Avi Kivity
29a4b662f8 Merge 'doc: document the Alternator TTL feature as GA' from Anna Stuchlik
Currently, TTL is listed as one of the experimental features: https://docs.scylladb.com/stable/alternator/compatibility.html#experimental-api-features

This PR moves the feature description from the Experimental Features section to a separate section.
I've also added some links and improved the formatting.

@tzach I've relied on your release notes for RC1.

Refs: https://github.com/scylladb/scylladb/issues/5060

Closes #11997

* github.com:scylladb/scylladb:
  Update docs/alternator/compatibility.md
  doc: update the link to Enabling Experimental Features
  doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph
  doc: update the links to the Enabling Experimental Features section
  doc: add the link to the Enabling Experimental Features section
  doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section
2022-11-24 17:22:05 +02:00
Nadav Har'El
2dedb5ea75 alternator: make Alternator TTL feature no longer "experimental"
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.

This patch removes this requirement and in essence GAs this feature.

Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.

This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.

Fixes #12037

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12049
2022-11-24 17:21:39 +02:00
Tzach Livyatan
e96d31d654 docs: Add Authentication and Authorization as a prerequisite for Auditing.
Closes #12058
2022-11-24 17:21:23 +02:00
Nadav Har'El
c6bb64ab0e Merge 'Fix LWT insert crash if clustering key is null' from Gusev Petr
[PR](https://github.com/scylladb/scylladb/pull/9314) fixed a similar issue with regular insert statements
but missed the LWT code path.

It's expected behaviour of
`modification_statement::create_clustering_ranges` to return an
empty range in this case, since `possible_lhs_values` it
uses explicitly returns `empty_value_set` if it evaluates `rhs`
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
`modification_statement::process_where_clause`. So the only
problem was `modification_statement::execute_with_condition`
was not expecting an empty `clustering_range` in case of
a null clustering key.

Also this patch contains a fix for the problem with wrong
column name in Scylla error messages. If `INSERT` or `DELETE`
statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.

The problem occurs if the query contains a column with the list type,
otherwise
`statement_restrictions::process_clustering_columns_restrictions`
checks that all the components of the key are specified.

Closes #12047

* github.com:scylladb/scylladb:
  cql: refactor, inline modification_statement::validate_primary_key_restrictions
  cql: DELETE with null value for IN parameter should be forbidden
  cql: add column name to the error message in case of null primary key component
  cql: batch statement, inserting a row with a null key column should be forbidden
  cql: wrong column name in error messages
  modification_statement: fix LWT insert crash if clustering key is null
2022-11-24 16:15:27 +02:00
Nadav Har'El
6e9f739f19 Merge 'doc: add the links to the per-partition rate limit extension ' from Anna Stuchlik
Release 5.1 introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE sections should include links to that page and section.

Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.

Related: https://github.com/scylladb/scylladb/pull/9810

Closes #12070

* github.com:scylladb/scylladb:
  doc: move the info about per-partition rate limit for the ALTER TABLE statement from the paragraph to the list
  doc: add the links to the per-partition rate limit extension to the CREATE TABLE and ALTER TABLE sections
2022-11-24 16:03:30 +02:00
Anna Stuchlik
8049670772 doc: move the info about per-partition rate limit for the ALTER TABLE statement from the paragraph to the list 2022-11-24 14:42:11 +01:00
Anna Stuchlik
57a58b17a8 doc: enable publishing the documentation for version 5.1
Closes #12059
2022-11-24 13:55:25 +02:00
Benny Halevy
243dc2efce hints: host_filter: check topology::has_endpoint if enabled_selectively
Don't call get_datacenter(ep) without checking
first has_endpoint(ep) since the former may abort
on internal error if the endpoint is not listed
in topology.

Refs #11870

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12054
2022-11-24 14:33:06 +03:00
Anna Stuchlik
f158d31e24 doc: add the links to the per-partition rate limit extension to the CREATE TABLE and ALTER TABLE sections 2022-11-24 11:26:33 +01:00
Petr Gusev
b95305ae2b cql: refactor, inline modification_statement::validate_primary_key_restrictions
The function didn't add much value, just forwarded to _restrictions.
Removed it and called _restrictions->validate_primary_key directly.
2022-11-23 21:56:12 +04:00
Petr Gusev
f9936bb0cb cql: DELETE with null value for IN parameter should be forbidden
If a DELETE statement contains an IN operator and the
parameter value for it is NULL, this should also trigger
an error. This is in line with how Cassandra
behaves in this case.
2022-11-23 21:39:23 +04:00
Petr Gusev
c123f94110 cql: add column name to the error message in case of null primary key component
It's more user-friendly and the error message
corresponds to what Cassandra provides in this case.
2022-11-23 21:39:23 +04:00
Petr Gusev
7730c4718e cql: batch statement, inserting a row with a null key column should be forbidden
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.

Fixes: #12060
2022-11-23 21:39:23 +04:00
Petr Gusev
89a5397d7c cql: wrong column name in error messages
If INSERT or DELETE statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.

The problem occurs if the query contains a column with the list type,
otherwise
statement_restrictions::process_clustering_columns_restrictions
checks that all the components of the key are specified.

Fixes: #12046
2022-11-23 21:39:16 +04:00
Benny Halevy
996eac9569 topology: add get_datacenters
Returns an unordered set of datacenter names
to be used by network_topology_replication_strategy
and for ks_prop_defs.

The set is kept in sync with _dc_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12023
2022-11-23 18:39:36 +02:00
Takuya ASADA
9acdd3af23 dist: drop deprecated AMI parameters on setup scripts
Since we moved all IaaS code to scylla-machine-image, we no longer need
the AMI variable in the sysconfig file or the --ami parameter in the setup
scripts, and /etc/scylla/ami_disabled was never used either.
So let's drop all of them from Scylla core.

Related with scylladb/scylla-machine-image#61

Closes #12043
2022-11-23 17:56:13 +02:00
Avi Kivity
7c66fdcad1 Merge 'Simplify sstable_directory configuration' from Pavel Emelyanov
At startup, the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir() method works. It's shorter and simpler to pass these booleans into the method directly, all the more so since another flag is already passed like this.

Closes #12005

* github.com:scylladb/scylladb:
  sstable_directory: Move all RAII booleans onto flags
  sstable_directory: Convert sort-sstables argument to flags struct
  sstable_directory: Drop default filter
2022-11-23 16:16:04 +02:00
Avi Kivity
70bfa708f5 storage_proxy: coroutinize change_hints_host_filter()
Trivial straight-line code, no performance implications.

Closes #12056
2022-11-23 15:34:24 +02:00
Jan Ciolek
84501851eb cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
Scylla doesn't support combining restrictions
on token with other restrictions on partition key columns.

Some pieces of code depend on the assumption
that such combinations are rejected.
If they were allowed in the future,
these functions would silently start
returning wrong results, and we would
return invalid rows.

Add a test that will start failing once
this restriction is removed. It will
warn the developer to change the
functions that used to depend
on the assumption.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 13:09:22 +01:00
Botond Dénes
602dfdaf98 Merge 'Task manager top level repair tasks' from Aleksandra Martyniuk
The PR introduces top level repair tasks representing repair and node operations
performed with repair. The actions performed as a part of these operations are
moved to corresponding tasks' run methods.

Also a small change to repair module is added.

Closes #11869

* github.com:scylladb/scylladb:
  repair: define run for data_sync_repair_task_impl
  repair: add data_sync_repair_task_impl
  tasks: repair: add noexcept to task impl constructor
  repair: define run for user_requested_repair_task_impl
  repair: add user_requested_repair_task_impl
  repair: allow direct access to max_repair_memory_per_range
2022-11-23 14:02:30 +02:00
Jan Ciolek
338af848a8 cql3: expr: remove needless braces around switch cases
Originally I put braces around the cases because
there were local variables that I didn't want
to be shadowed.

Now there are no variables so the braces
can be removed without any problems.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:30 +01:00
Jan Ciolek
e8a46d34c2 cql3: move evaluation IS_NOT NULL to a separate function
When evaluating a binary operation with
operations like EQUAL, LESS_THAN, IN
the logic of the operation is put
in a separate function to keep things clean.

IS_NOT NULL is the only exception,
it has its evaluate implementation
right in the evaluate(binary_operator)
function.

It would be cleaner to have it in
a separate dedicated function,
so it's moved to one.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:30 +01:00
Jan Ciolek
b6cf6e6777 expr_test: test evaluating LIKE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
6774272fd6 expr_test: test evaluating IS_NOT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
e6c78bb6c2 expr_test: test evaluating CONTAINS_KEY binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
4f250609ab expr_test: test evaluating CONTAINS binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
3ca04cfcc2 expr_test: test evaluating IN binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
41f452b73f expr_test: test evaluating GTE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
1fe9a9ce2a expr_test: test evaluating GT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
ef2a77a3e0 expr_test: test evaluating LTE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
3cbb2d44e8 expr_test: test evaluating LT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
9feee70710 expr_test: test evaluating NEQ binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
e77dba0b0b expr_test: test evaluating EQ binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
63a89776a1 cql3: expr properly handle null in is_one_of()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
214dab9c77 cql3: expr properly handle null in like()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
2ce9c95a9d cql3: expr properly handle null in contains_key()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
336ad61aa3 cql3: expr properly handle null in contains()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
e2223be1ec cql3: expr: properly handle null in limits()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
d1abf2e168 cql3: expr: remove unneeded overload of limits()
There is a more general version of limits()
which takes expressions as both the lhs and rhs
arguments.

There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.

Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benefit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:25 +01:00
Jan Ciolek
0609a425e6 cql3: expr: properly handle null in equality operators
Expressions like:
123 = NULL
NULL = 123
NULL = NULL
NULL != 123

should be tolerated, but evaluate to NULL.
The current code assumes that a binary operator
can only evaluate to a boolean - true or false.

Now a binary operator can also evaluate to NULL.
This should happen in cases when one of the
operator's sides is NULL.

A special class is introduced to represent a value
that can be one of three things: true, false or null.
It's better than using std::optional<bool>,
because optional has implicit conversions to bool
that could cause confusion and bugs.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:22 +01:00
Aleksandra Martyniuk
a3016e652f repair: define run for data_sync_repair_task_impl
Operations performed as part of the data sync repair are moved
to the data_sync_repair_task_impl run method.
2022-11-23 10:44:19 +01:00
Aleksandra Martyniuk
42239c8fed repair: add data_sync_repair_task_impl
Create a task spanning the whole node operation. Tasks of that type
are stored on shard 0.
2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk
9e108a2490 tasks: repair: add noexcept to task impl constructor
Add noexcept to constructor of tasks::task_manager::task::impl
and inheriting classes.
2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk
4a4e9c12df repair: define run for user_requested_repair_task_impl
Operations performed as part of a user-requested repair are
moved to the user_requested_repair_task_impl run method.
2022-11-23 10:19:51 +01:00
Aleksandra Martyniuk
3800b771fc repair: add user_requested_repair_task_impl
Create a task spanning the whole user-requested repair.
Tasks of that type are stored on shard 0.
2022-11-23 10:11:09 +01:00
Aleksandra Martyniuk
0256ede089 repair: allow direct access to max_repair_memory_per_range
The access specifier of the constexpr value max_repair_memory_per_range
in repair_module is changed to public, and its getter is deleted.
2022-11-23 10:11:09 +01:00
Anna Stuchlik
16e2b9acd4 Update docs/alternator/compatibility.md
Co-authored-by: Daniel Lohse <info@asapdesign.de>
2022-11-23 09:51:04 +01:00
Avi Kivity
d7310fd083 gdb: messaging: print tls servers too
Many systems have most traffic on tls servers, so print them.

Closes #12053
2022-11-23 07:59:02 +02:00
Avi Kivity
aec9faddb1 Merge 'storage_proxy: use erm topology' from Benny Halevy
When processing a query, we keep a pointer to an effective_replication_map.
In a couple of places we used the latest topology instead of the one held by the effective_replication_map
that the query uses, which might lead to inconsistencies if, for example, a node is removed from the topology by a decommission that happens concurrently with the query.

This change gets the topology& from the e_r_m in those cases.

Fixes #12050

Closes #12051

* github.com:scylladb/scylladb:
  storage_proxy: pass topology& to sort_endpoints_by_proximity
  storage_proxy: pass topology& to is_worth_merging_for_range_query
2022-11-22 20:04:41 +02:00
Botond Dénes
49ec7caf27 mutation_fragment_stream_validator: avoid allocation when stream is correct
Currently the ctor of said class always allocates as it copies the
provided name string and it creates a new name via format().
We want to avoid this, now that the validator is used on the read path.
So defer creating the formatted name to when we actually want to log
something, which is either when log level is debug or when an error is
found. We don't care about performance in either case, but we do care
about it on the happy path.
Further to the above, provide a constructor for string literal names and
when this is used, don't copy the name string, just save a view to it.

Refs: #11174

Closes #12042
2022-11-22 19:19:18 +02:00
Nadav Har'El
ce7c1a6c52 Merge 'alternator: fix wrong 'where' condition for GSI range key' from Marcin Maliszkiewicz
Contains the fixes requested in the issue (and some tiny extras), together with an analysis of why they don't affect users (see commit messages).

Fixes [ #11800](https://github.com/scylladb/scylladb/issues/11800)

Closes #11926

* github.com:scylladb/scylladb:
  alternator: add maybe_quote to secondary indexes 'where' condition
  test/alternator: correct xfail reason for test_gsi_backfill_empty_string
  test/alternator: correct indentation in test_lsi_describe
  alternator: fix wrong 'where' condition for GSI range key
2022-11-22 17:46:52 +02:00
Pavel Emelyanov
22133a3949 sstable_directory: Move all RAII booleans onto flags
There's a bunch of booleans that control the behavior of sstable
directory scanning. Currently they are described as verbose
bool_class<>-es and are passed in at sstable_directory construction time.

However, they are not used outside of the .process_sstable_dir() method,
and moving them onto the recently added flags struct makes the code much
shorter (29 insertions(+), 121 deletions(-)).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:30:00 +03:00
Pavel Emelyanov
7ca5e143d7 sstable_directory: Convert sort-sstables argument to flags struct
The sstable_directory::process_sstable_dir() accepts a boolean to
control its behavior when collecting sstables. Turn this boolean into a
structure of flags. The intention is to extend this flags set in the
future (next patch).

This boolean is true all the time, but one place sets it explicitly in a
"verbose" manner, like this:

        bool sort_sstables_according_to_owner = false;
        process_sstable_dir(directory, sort_sstables_according_to_owner).get();

After this change the local variable is not needed anymore: using
designated initializers solves the verbosity in a nicer manner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:19:23 +03:00
Pavel Emelyanov
7c7017d726 sstable_directory: Drop default filter
It's used as default argument for .reshape() method, but callers specify
it explicitly. At the same time the filter is simple enough and is only
used in one place so that the caller can just use explicit lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:19:23 +03:00
Jan Ciolek
6be142e3a0 cql3: expr: remove unneeded overload of equal()
There is a more general version of equal()
which takes expressions as both the lhs and rhs
arguments.

There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.

Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benefit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-22 14:28:10 +01:00
Benny Halevy
731a74c71f storage_proxy: pass topology& to sort_endpoints_by_proximity
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:02:40 +02:00
Benny Halevy
ab3fc1e069 storage_proxy: pass topology& to is_worth_merging_for_range_query
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:01:58 +02:00
Petr Gusev
0d443dfd16 modification_statement: fix LWT insert crash if clustering key is null
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.

It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.

Fixes: #11954
2022-11-22 16:45:16 +04:00
Marcin Maliszkiewicz
2bf2ffd3ed alternator: add maybe_quote to secondary indexes 'where' condition
This bug doesn't affect anything; the reason is described in the commit:
'alternator: fix wrong 'where' condition for GSI range key'.

But it's theoretically correct to escape those key names, and
the difference can be observed via CQL's describe table. Before
the patch the 'where' condition is missing one double quote in the variable
name, making it mismatched with the corresponding column name.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
4389baf0d9 test/alternator: correct xfail reason for test_gsi_backfill_empty_string
Previously cited issue is closed already.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
59eca20af1 test/alternator: correct indentation in test_lsi_describe
Otherwise the assert is not executed inside the loop. It's also unclear why the lsi variable can be bound
to anything; in testing it was pointing to the last element in lsis...
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
d6d20134de alternator: fix wrong 'where' condition for GSI range key
This bug doesn't manifest in a visible way to the user.

Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported
so we don't need to consider what could happen for empty values of index range key.
After the index is added, the only interesting value a user can set is omitting
the value (null or empty are not allowed, see test_gsi_empty_value and
test_gsi_null_value).

In practice, regardless of the 'where' condition, the underlying materialized
view code skips row updates with missing keys, as per this comment:
'If one of the key columns is missing, set has_new_row = false
meaning that after the update there will be no view row'.

That's why the added test passes both before and after the patch,
but it's still useful to include it to exercise those code paths.

Fixes #11800
2022-11-22 11:08:23 +01:00
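The quoting behaviour these alternator patches refer to can be sketched as follows (a hypothetical Python re-implementation for illustration only; the real maybe_quote is part of Scylla's C++ CQL utilities, and CQL keyword handling is omitted here):

```python
import re

def maybe_quote(name):
    # Plain identifiers (lower-case letters, digits, underscores, not
    # starting with a digit) can stay unquoted; everything else is
    # wrapped in double quotes, with embedded double quotes doubled,
    # so that the name matches the corresponding column name exactly.
    if re.fullmatch(r'[a-z_][a-z0-9_]*', name):
        return name
    return '"' + name.replace('"', '""') + '"'
```

With this rule, a mixed-case range-key name is emitted with both double quotes present, which is the mismatch the original 'where' condition had.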
Nadav Har'El
ff617c6950 cql-pytest: translate a few small Cassandra tests
This patch includes a translation of several additional small test files
from Cassandra's CQL unit test directory cql3/validation/operations.

All tests included here pass on both Cassandra and Scylla, so they did
not discover any new Scylla bugs, but can be useful in the future as
regression tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12045
2022-11-22 07:54:13 +02:00
Botond Dénes
f3eecb47f6 Merge 'Optimize cleanup compaction get ranges for invalidation' from Benny Halevy
Take advantage of the facts that both the owned ranges
and the initial non_owned_ranges (derived from the set of sstables)
are deoverlapped and sorted by start token to turn
the calculation of the final non_owned_ranges from
quadratic to linear.

Fixes #11922

Closes #11903

* github.com:scylladb/scylladb:
  dht: optimize subtract_ranges
  compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation
  compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys
2022-11-22 06:45:01 +02:00
Jan Ciolek
a1407ef576 cql3: expr: use evaluate(binary_operator) in is_satisfied_by
is_satisfied_by has to check if a binary_operator is satisfied
by some values. It used to be impossible to evaluate
a binary_operator, so is_satisfied_by had code to check
if it's satisfied for a limited number of cases
occurring when filtering queries.

Now evaluate(binary_operator) has been implemented
and is_satisfied_by can use it to check if a binary_operator
evaluates to true.
This is cleaner and reduces code duplication.
Additionally, cql tests will exercise the new evaluate() implementation.

There is one special case with token().
When is_satisfied_by sees a restriction on token
it assumes that it's satisfied because it's
sure that these token restrictions were used
to generate partition ranges.

I had to leave this special case in because it's impossible
to evaluate(token). Once this is implemented I will remove
the special case because it's risky and prone to cause
bugs.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 20:40:06 +01:00
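The three-valued comparison semantics that evaluate(binary_operator) has to honour (a comparison against NULL evaluates to NULL, never to true or false) can be illustrated with a minimal sketch; the names and value representation below are illustrative only, not Scylla's actual C++ API:

```python
def evaluate_binop(op, lhs, rhs):
    # Minimal sketch of three-valued comparison logic: any comparison
    # involving NULL (represented here as None) evaluates to NULL
    # rather than True/False.
    if lhs is None or rhs is None:
        return None
    ops = {
        '=':  lambda a, b: a == b,
        '!=': lambda a, b: a != b,
        '<':  lambda a, b: a < b,
        '<=': lambda a, b: a <= b,
        '>':  lambda a, b: a > b,
        '>=': lambda a, b: a >= b,
    }
    return ops[op](lhs, rhs)
```

This is exactly what the boolean-only is_satisfied_by cannot express, since it must answer True or False.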
Jan Ciolek
9c4889ecc3 cql3: expr: handle IS NOT NULL when evaluating binary_operator
The code to evaluate binary operators
was copied from is_satisfied_by.
is_satisfied_by wasn't able to evaluate
IS NOT NULL restrictions, so when such a restriction
was encountered it threw an exception.

Implement proper handling for IS NOT NULL binary operators.

The switch ensures that all variants of oper_t are handled,
otherwise there would be a compilation error.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 20:40:00 +01:00
Avi Kivity
bf2e54ff85 Merge 'Move deletion log code to sstable_directory.cc' from Pavel Emelyanov
In order to support different storage kinds for sstable files (e.g. s3), we need to localize all the places that manipulate files on a POSIX filesystem so that a custom storage could implement them in its own way. This set moves the deletion log manipulations to sstable_directory.cc, which already "knows" that it works over a directory.

Closes #12020

* github.com:scylladb/scylladb:
  sstables: Delete log file in replay_pending_delete_log()
  sstables: Move deletion log manipulations to sstable_directory.cc
  sstables: Open-code delete_sstables() call
  sstables: Use fs::path in replay_pending_delete_log()
  sstables: Indentation fix after previous patch
  sstables: Coroutinize replay_pending_delete_log
  sstables: Read pending delete log with one line helper
  sstables: Dont write pending log with file_writer
2022-11-21 21:22:59 +02:00
Jan Ciolek
b4cc92216b cql3: expr: make it possible to evaluate binary_operator
evaluate() takes an expression and evaluates it
to a constant value. It wasn't possible to evaluate
binary operators before, so support for them is added.

The code is based on is_satisfied_by,
which is currently used to check
whether a binary operator evaluates
to true or false.

It looks like is_satisfied_by and evaluate()
do pretty much the same thing; one could be
implemented using the other.
In the future they might get merged
into a single function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 17:48:23 +01:00
Jan Ciolek
8d81eaa68f cql3: expr: accept expression as lhs argument to like()
like() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow evaluating a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:33:18 +01:00
Jan Ciolek
b1a12686dc cql3: expr: accept expression as lhs in contains_key
contains_key() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow evaluating a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:33:02 +01:00
Jan Ciolek
79cd9cd956 cql3: expr: accept expression as lhs argument to contains()
contains() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow evaluating a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:32:44 +01:00
Benny Halevy
57ff3f240f dht: optimize subtract_ranges
Take advantage of the fact that both ranges and
ranges_to_subtract are deoverlapped and sorted by
to reduce the calculation complexity from
quadratic to linear.

Fixes #11922

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:48:28 +02:00
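The single-pass approach this commit describes can be sketched on plain half-open intervals (an illustrative sketch only; the real code operates on dht token_ranges in C++, and both inputs are assumed deoverlapped and sorted by start):

```python
def subtract_ranges(ranges, to_subtract):
    # Subtract sorted, non-overlapping [start, end) ranges in a single
    # pass. The index into to_subtract never rewinds past entries that
    # ended before the current range, hence O(n + m) instead of the
    # quadratic all-pairs comparison.
    result = []
    i = 0
    for start, end in ranges:
        # Skip subtrahends that end before this range begins.
        while i < len(to_subtract) and to_subtract[i][1] <= start:
            i += 1
        j = i
        cur = start
        while j < len(to_subtract) and to_subtract[j][0] < end:
            s, e = to_subtract[j]
            if s > cur:
                result.append((cur, s))  # keep the uncovered gap
            cur = max(cur, e)
            j += 1
        if cur < end:
            result.append((cur, end))  # keep the uncovered tail
    return result
```

A subtrahend spanning two input ranges is handled correctly because it is not skipped until its end falls behind the next range's start.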
Benny Halevy
8b81635d95 compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation
The algorithm is generic and can be used elsewhere.

Add a unit test for the function before it gets
optimized in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:48:26 +02:00
Benny Halevy
7c6f60ae72 compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys
Currently, the function is inefficient in two ways:
1. unnecessary copy of first/last keys to automatic variables
2. redecorating the partition keys with the schema passed to
   needs_cleanup.

We can just use the tokens from the sstable first/last decorated keys.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:44:32 +02:00
Pavel Emelyanov
2f9b7931af sstables: Delete log file in replay_pending_delete_log()
It's natural that the replayer cleans up after itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:16:22 +03:00
Pavel Emelyanov
bdc47b7717 sstables: Move deletion log manipulations to sstable_directory.cc
The deletion log concept uses the fact that files are on a POSIX
filesystem. Support for another storage type will have to reimplement
this place, so keep the FS-specific code in the _directory.cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:16:21 +03:00
Pavel Emelyanov
865c51c6cf sstables: Open-code delete_sstables() call
It's not used by any other code, and to be used it requires the caller to
transform TOC file names by prepending the sstable directory to them. Things
get shorter and simpler by merging the helper code into the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
a61c96a627 sstables: Use fs::path in replay_pending_delete_log()
It's called by code that has fs::path at hand and internally uses
helpers that need fs::path too, so there's no need to convert it back and forth.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
f5684bcaf0 sstables: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
85a73ca9c6 sstables: Coroutinize replay_pending_delete_log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
6f3fd94162 sstables: Read pending delete log with one line helper
Seastar has recently gained one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
2dedf4d03a sstables: Dont write pending log with file_writer
It's a wrapper over output_stream with offset tracking, and the tracking
is not needed to generate a log file. As a bonus of switching back we
get the stream.write(sstring) sugar.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:24 +03:00
Botond Dénes
2d4439a739 Merge 'doc: add a troubleshooting article about the missing configuration files' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11598

This PR adds the troubleshooting article submitted by @syuu1228 in the deprecated _scylla-docs_ repo, with https://github.com/scylladb/scylla-docs/pull/4152.
I copied and reorganized the content and rewrote it a little according to the RST guidelines so that the page renders correctly.

@syuu1228 Could you review this PR to make sure that my changes didn't distort the original meaning?

Closes #11626

* github.com:scylladb/scylladb:
  doc: apply the feedback to improve clarity
  doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB
  doc: add the new page to the toctree
  doc: add a troubleshooting article about the missing configuration files
2022-11-21 12:02:31 +02:00
Nadav Har'El
757d2a4c02 test/alternator: un-xfail a test which passes on modern Python
We had an xfailing test that reproduced a case where Alternator tried
to report an error when the request was too long, but the boto library
didn't see this error and threw a "Broken Pipe" error instead. It turns
out that this wasn't a Scylla bug but rather a bug in urllib3, which
overzealously reported a "Broken Pipe" instead of trying to read the
server's response. It turns out this issue was already fixed in
   https://github.com/urllib3/urllib3/pull/1524

and now, on modern installations, the test that used to fail now passes
and reports "XPASS".

So in this patch we remove the "xfail" tag, and skip the test if
running an old version of urllib3.

Fixes #8195

Closes #12038
2022-11-21 08:10:10 +02:00
Botond Dénes
ffc3697f2f Merge 'storage_service api: handle dropped tables' from Benny Halevy
Gracefully skip tables that were removed in the background.

Fixes #12007

Closes #12013

* github.com:scylladb/scylladb:
  api: storage_service: fixup indentation
  api: storage_service: add run_on_existing_tables
  api: storage_service: add parse_table_infos
  api: storage_service: log errors from compaction related handlers
  api: storage_service: coroutinize compaction related handlers
2022-11-21 07:56:27 +02:00
Avi Kivity
994603171b Merge 'Add validator to the mutation compactor' from Botond Dénes
Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written.
This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too.
This series fixes some benign-looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends is depends entirely on how the consumer reacts to it, so no such bug is actually benign.

Fixes: https://github.com/scylladb/scylladb/issues/11174

Closes #11532

* github.com:scylladb/scylladb:
  mutation_compactor: add validator
  mutation_fragment_stream_validator: add a 'none' validation level
  test/boost/mutation_query_test: test_partition_limit: sort input data
  querier: consume_page(): use partition_start as the sentinel value
  treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
  treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
  position_in_partition: add for_partition_{start,end}()
2022-11-20 20:33:26 +02:00
Avi Kivity
779b01106d Merge 'cql3: expr: add unit tests for prepare_expression' from Jan Ciołek
Adds unit tests for the function `expr::prepare_expression`.

Three minor bugs were found by these tests, all fixed in this PR.
1. When preparing a map, the type for tuple constructor was taken from an unprepared tuple, which has `nullptr` as its type.
2. Preparing an empty nonfrozen list or set resulted in `null`, but preparing a map didn't. Fixed this inconsistency.
3. Preparing a `bind_variable` with `nullptr` receiver was allowed. The `bind_variable` ended up with a `nullptr` type, which is incorrect. Changed it to throw an exception.

Closes #11941

* github.com:scylladb/scylladb:
  test preparing expr::usertype_constructor
  expr_test: test that prepare_expression checks style_type of collection_constructor
  expr_test: test preparing expr::collection_constructor for map
  prepare_expr: make preparing nonfrozen empty maps return null
  prepare_expr: fix a bug in map_prepare_expression
  expr_test: test preparing expr::collection_constructor for set
  expr_test: test preparing expr::collection_constructor for list
  expr_test: test preparing expr::tuple_constructor
  expr_test: test preparing expr::untyped_constant
  expr_test_utils: add make_bigint_raw/const
  expr_test_utils: add make_tinyint_raw/const
  expr_test: test preparing expr::bind_variable
  cql3: prepare_expr: forbid preparing bind_variable without a receiver
  expr_test: test preparing expr::null
  expr_test: test preparing expr::cast
  expr_test_utils: add make_receiver
  expr_test_utils: add make_smallint_raw/const
  expr_test: test preparing expr::token
  expr_test: test preparing expr::subscript
  expr_test: test preparing expr::column_value
  expr_test: test preparing expr::unresolved_identifier
  expr_test_utils: mock data_dictionary::database
2022-11-20 20:03:54 +02:00
Nadav Har'El
2ba8b8d625 test/cql-pytest: remove "xfail" from passing test testIndexOnFrozenCollectionOfUDT
We had a test that used to fail because of issue #8745. But this issue
was already fixed, and we forgot to remove the "xfail" marker. The test
now passes, so let's remove the xfail marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12039
2022-11-20 19:54:59 +02:00
Avi Kivity
40f61db120 Merge 'docs: describe the Raft upgrade and recovery procedures' from Kamil Braun
Add new guide for upgrading 5.1 to 5.2.

In this new upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.

In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
  successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
  situation.

Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.

Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
  use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
  once they exist

Closes #11910

* github.com:scylladb/scylladb:
  docs: describe the Raft upgrade and recovery procedures
  docs: add upgrade guide 5.1 -> 5.2
2022-11-20 19:00:23 +02:00
Avi Kivity
15ee8cfc05 Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism however had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added, without the fiber acting on it. Since it still held onto the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, yet still had waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released in the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds a detection to the inactive/waiter contradiction.

Fixes: #11923
Refs: #11770

Closes #12026

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
  reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
  reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
2022-11-20 18:51:34 +02:00
Avi Kivity
895d721d5e Merge 'scylla-sstable: data-dump improvements' from Botond Dénes
This series contains a mixed bag of improvements to  `scylla sstable dump-data`. These improvements are mostly aimed at making the json output clearer, getting rid of any ambiguities.

Closes #12030

* github.com:scylladb/scylladb:
  tools/scylla-sstable: traverse sstables in argument order
  tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements
  tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>"
  tools/scylla-sstable: dump-data/json: use more uniform format for collections
  tools/scylla-sstable: dump-data/json: make cells easier to parse
2022-11-20 17:02:27 +02:00
Avi Kivity
2f9c53fbe4 Merge 'test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address' from Kamil Braun
Since recently the framework uses a separate set of unique IDs to
identify servers, but the log file and workdir is still named using the
last part of the IP address.

This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.

Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).

So use the ID instead to name the workdir and log file.

Also, when starting a test case, print the used cluster. This will make
it easier to map server IDs to their IP addresses when browsing through
the test logs.

Closes #12018

* github.com:scylladb/scylladb:
  test/pylib: manager_client: print used cluster when starting test case
  test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address
2022-11-20 16:56:19 +02:00
Avi Kivity
14218d82d6 Update tools/java submodule (serverless)
* tools/java caf754f243...874e2d529b (2):
  > Add Scylla Cloud serverless support
  > Switch cqlsh to use scylla-driver
2022-11-20 16:41:36 +02:00
Tomasz Grabiec
c8e983b4aa test: flat_mutation_reader_assertions: Use fatal BOOST_REQUIRE_EQUAL instead of BOOST_CHECK_EQUAL
BOOST_CHECK_EQUAL is a weaker form of assertion, it reports an error
and will cause the test case to fail but continues. This makes the
test harder to debug because there's no obvious way to catch the
failure in GDB and the test output is also flooded with things which
happen after the failed assertion.

Message-Id: <20221119171855.2240225-1-tgrabiec@scylladb.com>
2022-11-20 16:14:26 +02:00
Nadav Har'El
2d2034ea28 Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek
When filtering with a multi-column restriction present, all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.

Fixes: #6200
Fixes: #12014

Closes #12031

* github.com:scylladb/scylladb:
  cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
  boost/restrictions-test: uncomment part of the test that passes now
  cql-pytest: enable test for filtering combined multi column and regular column restrictions
  cql3: don't ignore other restrictions when a multi column restriction is present during filtering
2022-11-20 11:50:38 +02:00
Benny Halevy
ec5707a4a8 api: storage_service: fixup indentation 2022-11-20 09:14:45 +02:00
Benny Halevy
cc63719782 api: storage_service: add run_on_existing_tables
Gracefully skip tables that were removed
in the background.

Fixes #12007

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:14:29 +02:00
Benny Halevy
9ef9b9d1d9 api: storage_service: add parse_table_infos
The table UUIDs are the same on all shards
so we might as well get them on shard 0
(as we already do) and reuse them on other shards.

It is more efficient and accurate to look up the table
eventually on the shard using its uuid rather than
its name. If the table was dropped and recreated
using the same name in the background, the new
table will have a new uuid, and so the api function
does not apply to it anymore.

A following change will handle the no_such_column_family
cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:14:21 +02:00
Benny Halevy
9b4a9b2772 api: storage_service: log errors from compaction related handlers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:03:25 +02:00
Benny Halevy
a47f96bc05 api: storage_service: coroutinize compaction related handlers
Before we improve the parsing of table lists
and the handling of no_such_column_family
errors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:03:25 +02:00
Jan Ciolek
286f182a8c cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
In issue #12014 a user encountered an instance of #6200.
When filtering a WHERE clause which contained
both multi-column and regular restrictions,
the regular restrictions were ignored.

Add a test which reproduces the issue
using a reproducer provided by the user.

This problem is tested in another similar test,
but this one reproduces the issue in the exact
way it was found by the user.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:27:42 +01:00
Jan Ciolek
63fb2612c3 boost/restrictions-test: uncomment part of the test that passes now
A part of the test was commented out due to #6200.
Now #6200 has been fixed and it can be uncommented.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:32 +01:00
Jan Ciolek
99e1032e34 cql-pytest: enable test for filtering combined multi column and regular column restrictions
The test test_multi_column_restrictions_and_filtering was marked as xfail
because issue #6200 wasn't fixed. Now that filtering
multi-column and other restrictions together has been fixed,
the test passes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:32 +01:00
Jan Ciolek
b974d4adfb cql3: don't ignore other restrictions when a multi column restriction is present during filtering
When filtering with a multi-column restriction present, all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`

would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi column restrictions were detected,
the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions
are not satisfied. When they are satisfied the other
restrictions are checked as well to ensure all
of them are satisfied.

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column
and regular columns and this approach was correct.

Fixes: #6200
Fixes: #12014

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:16 +01:00
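The shape of the fix can be sketched like this (illustrative Python only; the actual logic lives in cql3/selection/selection.cc and operates on prepared restrictions):

```python
def row_passes_filter(row, multi_column_restrictions, other_restrictions):
    # Before the fix: once the multi-column restrictions were found
    # satisfied, the code returned immediately, silently ignoring
    # other_restrictions.
    for restriction in multi_column_restrictions:
        if not restriction(row):
            return False  # returning early on failure is still correct
    # After the fix: fall through and check every remaining restriction.
    return all(restriction(row) for restriction in other_restrictions)
```

For the example query, a row with (ck1, ck2) < (0, 0) but regular_col != 0 is now correctly filtered out instead of being returned.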
Botond Dénes
30597f17ed tools/scylla-sstable: traverse sstables in argument order
In the order the user passed them on the command-line.
2022-11-18 15:58:37 +02:00
Botond Dénes
e337b25aa9 tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements
The usage of clustering_fragments is a typo; the output contains clustering_elements.
2022-11-18 15:58:36 +02:00
Botond Dénes
c39408b394 tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>"
The currently used "<unknown>" marker for invalid values/types is
indistinguishable from a normal value in some cases. Use the much more
distinct and unique json Null instead.
2022-11-18 15:58:36 +02:00
Botond Dénes
1dfceb5716 tools/scylla-sstable: dump-data/json: use more uniform format for collections
Instead of trying to be clever and switching the output on the type of
collection, use the same format always: a list of objects, where each
object has a key and a value attribute containing the respective
collection item's key and value. This makes processing much easier for
machines (and for humans too, since the previous system wasn't working well).
2022-11-18 15:58:36 +02:00
Botond Dénes
f89acc8df7 tools/scylla-sstable: dump-data/json: make cells easier to parse
There are several slightly different cell types in scylla: regular
cells, collection cells (frozen and non-frozen) and counter cells
(update and shards). In C++ code the type of the cell is always
available for code wishing to make out exactly what kind of cell a cell
is. In the JSON output of the dump-data this is currently really hard to
do as there is not enough information to disambiguate all the different
cell types. We wish to make the JSON output self-sufficient so in this
patch we introduce a "type" field which contains one of:
* regular
* counter-update
* counter-shards
* frozen-collection
* collection

Furthermore, we bring the different types closer by also printing the
counter shards under the 'value' key, not under the 'shards' key as
before. The separate 'shards' key is no longer needed to disambiguate.
The documentation and the write operation are also updated to reflect the
changes.
2022-11-18 15:58:36 +02:00
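A consumer of the new output can now disambiguate cells from the "type" field alone, e.g. (a sketch; any JSON layout beyond the fields named in the commit is an assumption here):

```python
CELL_TYPES = {'regular', 'counter-update', 'counter-shards',
              'frozen-collection', 'collection'}

def cell_kind(cell):
    # With the self-describing format, the "type" field is all that is
    # needed to tell the cell kinds apart; previously this required
    # guessing from the shape of the value.
    kind = cell['type']
    if kind not in CELL_TYPES:
        raise ValueError('unknown cell type: ' + kind)
    return kind
```

Counter shards arrive under the same 'value' key as every other cell type, so no per-type key probing is needed.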
Petr Gusev
41629e97de test.py: handle --markers parameter
Some tests may take longer than a few seconds to run. We want to
mark such tests in some way, so that we can run them selectively.
This patch proposes to use pytest markers for this. The markers
from the test.py command line are passed to pytest
as is via the -m parameter.

By default, the marker filter is not applied and all tests
will be run without exception. To exclude e.g. slow tests
you can write --markers 'not slow'.

The --markers parameter is currently only supported
by Python tests, other tests ignore it. We intend to
support this parameter for other types of tests in the future.

Another possible improvement is not to run suites for which
all tests have been filtered out by markers. The markers are
currently handled by pytest, which means that the logic in
test.py (e.g., running a scylla test cluster) will be run
for such suites.

Closes #11713
2022-11-18 12:36:20 +01:00
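The forwarding described above amounts to something like the following (an illustrative sketch; the function name and argument handling are hypothetical, not test.py's actual code):

```python
def build_pytest_args(markers, pytest_args):
    # When no --markers filter is given, all tests run without
    # exception; otherwise the marker expression is passed through
    # verbatim via pytest's -m option.
    args = list(pytest_args)
    if markers is not None:
        args += ['-m', markers]
    return args
```

So `./test.py --markers 'not slow'` would end up invoking pytest with `-m 'not slow'`, excluding tests marked slow.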
Avi Kivity
7da12c64bc Revert "Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity""
This reverts commit 22f13e7ca3, and reinstates
commit df8e1da8b2 ("Merge 'cql3: select_statement:
coroutinize indexed_table_select_statement::do_execute_base_query()' from
Avi Kivity"). The original commit was reverted due to failures in debug
mode on aarch64, but after commit 224a2877b9
("build: disable -Og in debug mode to avoid coroutine asan breakage"), it
works again.

Closes #12021
2022-11-18 12:44:00 +02:00
Kamil Braun
d7649a86c4 Merge 'Build up to support of dynamic IP address changes in Raft' from Konstantin Osipov
We plan to stop storing IP addresses in Raft configuration, and instead
use the information disseminated through gossip to locate Raft peers.

Implement patches that are building up to that:
* improve Raft API of configuration change notifications
* disseminate raft host id in Gossip
* avoid using Raft addresses from Raft configuration, and instead
  consistently use the translation layer between raft server id <-> IP
  address

Closes #11953

* github.com:scylladb/scylladb:
  raft: persist the initial raft address map
  raft: (upgrade) do not use IP addresses from Raft config
  raft: (and gossip) begin gossiping raft server ids
  raft: change the API of conf change notifications
2022-11-18 11:38:19 +01:00
Botond Dénes
437fcdeeda Merge 'Make use of enum_set in directory lister' from Pavel Emelyanov
The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent.

Closes #12017

* github.com:scylladb/scylladb:
  lister: Make lister::dir_entry_types an enum_set
  database: Avoid useless local variable
2022-11-18 12:15:26 +02:00
Botond Dénes
b39ca29b3c reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
The semaphore should admit readers as soon as it can. So at any point in
time there should be either no waiters, or the semaphore shouldn't be
able to admit new reads. Otherwise something went wrong. Detect this
when queuing up reads and dump the diagnostics if detected.
Even though tests should ensure this never happens, recently we've
seen a race between eviction and enqueuing producing such situations.
This is very hard to write tests for, so add built-in detection and
protection instead. Detecting this is very cheap anyway.
2022-11-18 11:35:47 +02:00
Botond Dénes
ca7014ddb8 reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
Said method protects against concurrent (or rather recursive)
calls to itself by setting a flag `_evicting` and returning early if
this flag is set. The eviction loop however has at least one preemption
point between deciding there is nothing more to evict and resetting said
flag. This window provides an opportunity for new inactive reads or waiters
to be queued without the loop noticing, while any other concurrent
invocation at that time is denied from reacting too.
Eliminate this by using repeat() instead of do_until() and setting
`_evicting = false` the moment the loop's run condition becomes false.
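The shape of the fix can be illustrated with a synchronous stand-in (the real code is an asynchronous seastar repeat() loop with preemption points; all names here are illustrative, not the actual semaphore code): the flag is cleared inside the loop, with no gap between deciding there is nothing left to evict and returning.

```cpp
#include <queue>

// Toy model of the pattern: a background drain loop guarded by `_evicting`.
struct background_evictor {
    bool _evicting = false;
    std::queue<int> _inactive;   // stand-in for the inactive-reads list
    int evicted = 0;

    void evict_in_the_background() {
        if (_evicting) {
            return;              // another invocation is already draining
        }
        _evicting = true;
        // repeat()-style loop: re-check the condition every iteration and
        // drop the flag the moment the condition becomes false.
        while (true) {
            if (_inactive.empty()) {
                _evicting = false;   // cleared with no gap before returning
                return;
            }
            _inactive.pop();
            ++evicted;
        }
    }
};
```

In the real asynchronous version each iteration may yield, which is exactly why clearing the flag after the loop (as do_until() forces) opened the blind spot.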
2022-11-18 11:35:47 +02:00
Botond Dénes
892f52c683 reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
Currently this method detaches the inactive read from the handle and
notifies the permit, calls the notify handler if any and does some stat
bookkeeping. Extend it to do a complete detach: unlink the entry from
the inactive reads list and also cancel the ttl timer.
After this, all that is left to the caller is to destroy the entry.
This will prevent any recursive eviction from causing assertion failure.
Although recursive eviction shouldn't happen, it shouldn't trigger an
assert.
2022-11-18 11:35:43 +02:00
Pavel Emelyanov
a44ca06906 Merge 'token_metadata: Do not use topology info for is_member check' from Asias He
Since commit a980f94 (token_metadata: impl: keep the set of normal token owners as a member), we have a set, _normal_token_owners, which contains all the nodes in the ring.

We can use _normal_token_owners to check if a node is part of the ring directly instead of going through the _topology indirectly.

Fixes #11935

Closes #11936

* github.com:scylladb/scylladb:
  token_metadata: Rename is_member to is_normal_token_owner
  token_metadata: Add docs for is_member
  token_metadata: Do not use topology info for is_member check
  token_metadata: Check node is part of the topology instead of the ring
2022-11-18 11:54:07 +03:00
Asias He
4571fcf9e7 token_metadata: Rename is_member to is_normal_token_owner
The name is_normal_token_owner is clearer than is_member:
it reflects what the function really checks.
2022-11-18 09:29:20 +08:00
Asias He
965097cde5 token_metadata: Add docs for is_member
Make it clear that is_member checks whether a node is part of the token
ring and nothing else.
2022-11-18 09:28:56 +08:00
Asias He
a495b71858 token_metadata: Do not use topology info for is_member check
Since commit a980f94 (token_metadata: impl: keep the set of normal token
owners as a member), we have a set, _normal_token_owners, which contains
all the nodes in the ring.

We can use _normal_token_owners to check if a node is part of the ring
directly instead of going through the _topology indirectly.
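A minimal sketch of the resulting check, with illustrative stand-in types (endpoints as plain strings rather than Scylla's endpoint type):

```cpp
#include <string>
#include <unordered_set>

// Once the set of normal token owners is kept as a member, ring membership
// becomes a direct set lookup instead of an indirect topology query.
struct token_metadata_sketch {
    std::unordered_set<std::string> _normal_token_owners;

    bool is_normal_token_owner(const std::string& endpoint) const {
        return _normal_token_owners.count(endpoint) > 0;
    }
};
```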

Fixes #11935
2022-11-18 09:28:56 +08:00
Asias He
f2ca790883 token_metadata: Check node is part of the topology instead of the ring
update_normal_tokens is the way to add a new node into the ring. We
should not require a new node to already be in the ring to be able to
add it to the ring. The current code works accidentally because
is_member actually checks if a node is in the topology.

We should use _topology.has_endpoint to check if a node is part of the
topology explicitly.
2022-11-18 09:28:56 +08:00
Jan Ciolek
77d68153f1 test preparing expr::usertype_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:10 +01:00
Jan Ciolek
eb92fb4289 expr_test: test that prepare_expression checks style_type of collection_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:10 +01:00
Jan Ciolek
77c63a6b92 expr_test: test preparing expr::collection_constructor for map
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:09 +01:00
Jan Ciolek
db67ade778 prepare_expr: make preparing nonfrozen empty maps return null
In Scylla and Cassandra, inserting an empty non-frozen collection
is interpreted as inserting a null value.

list_prepare_expression and set_prepare_expression
have an if which handles this behavior, but there
wasn't one in map_prepare_expression.

As a result, preparing an empty list or set would yield null,
but preparing an empty map wouldn't. This is inconsistent;
it's better to return null for every empty non-frozen
collection.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:09 +01:00
Jan Ciolek
da71f9b50b prepare_expr: fix a bug in map_prepare_expression
map_prepare_expression takes a collection_constructor
of unprepared items and prepares it.

Elements of a map collection_constructor are tuples (key and value).

map_prepare_expression creates a prepared collection_constructor
by preparing each tuple and adding it to the result.

During this preparation it needs to set the type of the tuple.
There was a bug here - it took the type from unprepared
tuple_constructor and assigned it to the prepared one.
An unprepared tuple_constructor doesn't have a type,
so it ended up assigning nullptr.

Instead, it should create a tuple_type_impl instance
by looking at the types of the map's keys and values,
and use this tuple_type_impl as the type of the prepared tuples.
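A sketch of the shape of the fix, using stand-in types rather than Scylla's actual type system (type_impl, make_type and prepared_map_element_type are illustrative names):

```cpp
#include <memory>
#include <string>
#include <utility>

struct type_impl { std::string name; };
using type_ptr = std::shared_ptr<const type_impl>;

type_ptr make_type(std::string name) {
    return std::make_shared<type_impl>(type_impl{std::move(name)});
}

// The fix: derive the prepared tuple's type from the map's key and value
// types instead of copying the (absent, i.e. nullptr) type off the
// unprepared tuple_constructor.
type_ptr prepared_map_element_type(const type_ptr& key_type,
                                   const type_ptr& value_type) {
    return make_type("tuple<" + key_type->name + ", " + value_type->name + ">");
}
```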

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:35:04 +01:00
Jan Ciolek
a656fdfe9a expr_test: test preparing expr::collection_constructor for set
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
76f587cfe7 expr_test: test preparing expr::collection_constructor for list
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
44b55e6caf expr_test: test preparing expr::tuple_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
265100a638 expr_test: test preparing expr::untyped_constant
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
f6b9100cd2 expr_test_utils: add make_bigint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
f9ff131f86 expr_test_utils: add make_tinyint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Jan Ciolek
76b6161386 expr_test: test preparing expr::bind_variable
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Jan Ciolek
4882724066 cql3: prepare_expr: forbid preparing bind_variable without a receiver
prepare_expression treats the receiver as an optional argument:
it can be set to nullptr, and the preparation should
still succeed when it's possible to infer the type of an expression.

Preparing a bind_variable, however, requires the receiver to be present,
because the variable doesn't contain any information about the type
of the bound value.

Add a check that the receiver is present.
Allowing a bind_variable to be prepared without
a receiver was a bug.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Avi Kivity
2779a171fc Merge 'Do not run aborted tasks' from Aleksandra Martyniuk
task_manager::task::impl contains an abort source, which can
be used to check whether the task is aborted, and an abort method,
which aborts the task (request_abort on the abort_source) and all
its descendants recursively.

When the start method is called after the task was aborted,
its state is set to failed and the task does not run.

Fixes: #11995

Closes #11996

* github.com:scylladb/scylladb:
  tasks: do not run tasks that are aborted
  tasks: delete unused variable
  tasks: add abort_source to task_manager::task::impl
2022-11-17 19:42:46 +02:00
Pavel Emelyanov
a396c27efc Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780

Closes #11942

* github.com:scylladb/scylladb:
  message: messaging_service: check for known topology before calling is_same_dc/rack
  test: reenable test_topology::test_decommission_node_add_column
  test/pylib: util: configurable period in wait_for
  message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
  message: messaging_service: topology independent connection settings for GOSSIP verbs
2022-11-17 20:14:32 +03:00
Jan Ciolek
42e01cc67f expr_test: test preparing expr::null
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:05 +01:00
Jan Ciolek
45b3fca71c expr_test: test preparing expr::cast
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:05 +01:00
Jan Ciolek
498c9bfa0d expr_test_utils: add make_receiver
Add a convenience function which creates receivers.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
6873a21fbd expr_test_utils: add make_smallint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
488056acb7 expr_test: test preparing expr::token
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
7958f77a40 expr_test: test preparing expr::subscript
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
569bd61c6c expr_test: test preparing expr::column_value
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
26174e29c6 expr_test: test preparing expr::unresolved_identifier
It's interesting that prepare_expression
for column identifiers doesn't require a receiver.
I hope this won't break validation in the future.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
c719a923bb expr_test_utils: mock data_dictionary::database
Add a function which creates a mock instance
of data_dictionary::database.

prepare_expression requires a data_dictionary::database
as an argument, so unit tests for it need something
to pass there. make_data_dictionary_database can
be used to create an instance that is sufficient for tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:00 +01:00
Kamil Braun
8e8c32befe test/pylib: manager_client: print used cluster when starting test case
It will be easier to map server IDs to their IP addresses when browsing
through the test logs.
2022-11-17 17:14:23 +01:00
Pavel Emelyanov
bc62ca46d4 lister: Make lister::dir_entry_types an enum_set
This type is currently an unordered_set, but only consists of at most
two elements. Making it an enum_set renders it into a size_t variable
and better describes the intention.
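For illustration, a toy enum_set along these lines (not seastar's actual implementation) packs the membership flags into a single size_t:

```cpp
#include <cstddef>
#include <initializer_list>

enum class dir_entry_type : size_t { regular = 0, directory = 1 };

// Bitmask over the enum: replaces an unordered_set<dir_entry_type> with
// one machine word while keeping the same membership queries.
struct dir_entry_types {
    size_t _mask = 0;

    dir_entry_types(std::initializer_list<dir_entry_type> types) {
        for (auto t : types) {
            _mask |= size_t(1) << static_cast<size_t>(t);
        }
    }

    bool contains(dir_entry_type t) const {
        return (_mask >> static_cast<size_t>(t)) & 1;
    }
};
```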

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-17 19:01:45 +03:00
Pavel Emelyanov
c6021b57a1 database: Avoid useless local variable
It's used to run lister::scan_dir() with directory_entry_type::directory
only, but for that it is copied around in lambda captures. It's simpler
just to use the value directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-17 19:00:49 +03:00
Kamil Braun
b83234d8aa test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address
The framework recently started using a separate set of unique IDs to
identify servers, but the log file and workdir are still named using the
last part of the IP address.

This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.

Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).
2022-11-17 16:55:12 +01:00
Anna Stuchlik
f7f03e38ee doc: update the link to Enabling Experimental Features 2022-11-17 15:44:46 +01:00
Anna Stuchlik
02cea98f55 doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph 2022-11-17 15:05:00 +01:00
Anna Stuchlik
ce88c61785 doc: update the links to the Enabling Experimental Features section 2022-11-17 14:59:34 +01:00
Avi Kivity
76be6402ed Merge 'repair: harden effective replication map' from Benny Halevy
As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization.

This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation.

While at it, the series includes other cleanups in this area in repair and view that are not fixes, as the calls happen in synchronous functions that do not yield.

Fixes #11993

Closes #11994

* github.com:scylladb/scylladb:
  repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
  repair: pass effective_replication_map down to repair_info
  repair: coroutinize sync_data_using_repair
  repair: futurize do_repair_start
  effective_replication_map: add global_effective_replication_map
  shared_token_metadata: get_lock is const
  repair: sync_data_using_repair: require to run on shard 0
  repair: require all node operations to be called on shard 0
  repair: repair_info: keep effective_replication_map
  repair: do_repair_start: use keyspace erm to get keyspace local ranges
  repair: do_repair_start: use keyspace erm for get_primary_ranges
  repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
  repair: do_repair_start: check_in_shutdown first
  repair: get_db().local() where needed
  repair: get topology from erm/token_metadata_ptr
  view: get_view_natural_endpoint: get topology from erm
2022-11-17 13:29:02 +02:00
Konstantin Osipov
262566216b raft: persist the initial raft address map 2022-11-17 14:26:36 +03:00
Konstantin Osipov
b35af73fdf raft: (upgrade) do not use IP addresses from Raft config
Always use raft address map to obtain the IP addresses
of upgrade peers. Right now the map is populated
from Raft configuration, so it's an equivalent transformation,
but in the future the raft address map will be populated from other
sources (discovery and gossip), hence the upgrade logic will change as well.

Do not proceed with the upgrade if an address is
missing from the map, since it means we failed to contact a raft member.
2022-11-17 14:26:31 +03:00
Pavel Emelyanov
2add9ba292 Merge 'Refactor topology out of token_metadata' from Benny Halevy
This series moves the topology code from locator/token_metadata.{cc,hh} out to locator/topology.{cc,hh}
and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987

While at it, the token_metadata functions are turned into coroutines
and topology copy constructor is deleted.  The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology.

Closes #12001

* github.com:scylladb/scylladb:
  locator: refactor topology out of token_metadata
  locator: add types.hh
  topology: delete copy constructor
  token_metadata: coroutinize clone functions
2022-11-17 13:55:34 +03:00
Aleksandra Martyniuk
7ead1a7857 compaction: request abort only once in compaction_data::stop
compaction_manager::task (and thus compaction_data) can be stopped
for many different reasons. Thus, abort can be requested more
than once on the compaction_data abort source, causing a crash.

To prevent this, before each request_abort() we check whether an abort
was already requested.
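The guard can be sketched as follows, with a stand-in for seastar's abort_source (in this toy model a second request_abort() throws, mirroring the crash described above; names are illustrative):

```cpp
#include <stdexcept>

struct abort_source_sketch {
    bool _aborted = false;

    void request_abort() {
        if (_aborted) {
            throw std::logic_error("abort requested twice");
        }
        _aborted = true;
    }

    bool abort_requested() const { return _aborted; }
};

// The fix: stop may be reached multiple times for different reasons,
// so guard request_abort() behind abort_requested().
void stop_compaction(abort_source_sketch& as) {
    if (!as.abort_requested()) {
        as.request_abort();
    }
}
```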

Closes #12004
2022-11-17 12:44:59 +02:00
Benny Halevy
1e2741d2fe abstract_replication_strategy: recognized_options: return unordered_set
An unordered_set is more efficient and there is no need
to return an ordered set for this purpose.

This change facilitates a follow-up change of adding
topology::get_datacenters(), returning an unordered_set
of datacenter names.

Refs #11987

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12003
2022-11-17 11:27:05 +02:00
Botond Dénes
e925c41f02 utils/gs/barrett.hh: aarch64: s/brarett/barrett/
Fix a typo introduced by the recent patch fixing the spelling of
Barrett. The patch introduced a typo in the aarch64 version of the code,
which wasn't found by promotion, as that only builds on X86_64.

Closes #12006
2022-11-17 11:09:59 +02:00
Konstantin Osipov
051dceeaff raft: (and gossip) begin gossiping raft server ids
We plan to use gossip data to educate Raft RPC about IP addresses
of raft peers. Add raft server ids to application state, so
that when we get a notification about a gossip peer we can
identify which raft server id this notification is for,
specifically, we can find what IP address stands for this server
id, and, whenever the IP address changes, we can update Raft
address map with the new address.

By the same token, at boot time, we now have to start Gossip
before Raft, since Raft won't be able to send any messages
without gossip data about IP addresses.
2022-11-17 12:07:31 +03:00
Konstantin Osipov
990c7a209f raft: change the API of conf change notifications
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.

For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
2022-11-17 12:07:31 +03:00
Benny Halevy
53fdf75cf9 repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
Now that it is available in repair_info.

Fixes #11993

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:30 +02:00
Benny Halevy
b69be61f41 repair: pass effective_replication_map down to repair_info
And make sure the token_metadata ring version is same as the
reference one (from the erm on shard 0), when starting the
repair on each shard.

Refs #11993

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:29 +02:00
Benny Halevy
c47d36b53d repair: coroutinize sync_data_using_repair
Prepare for the next patch that will co_await
make_global_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:04 +02:00
Benny Halevy
58b1c17f5d repair: futurize do_repair_start
Turn it into a coroutine to prepare for the next patch
that will co_await make_global_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:04 +02:00
Benny Halevy
4b9269b7e2 effective_replication_map: add global_effective_replication_map
Add a class to hold a coherent view of a keyspace's
effective replication map on all shards.

To be used in a following patch to pass the sharded
keyspace e_r_m instances to repair.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:01 +02:00
Avi Kivity
b8b78959fb build: switch to packaged libdeflate rather than a submodule
Now that our toolchain is based on Fedora 37, we can rely on its
libdeflate rather than have to carry our own in a submodule.

Frozen toolchain is regenerated. As a side effect clang is updated
from 15.0.0 to 15.0.4.

Closes #12000
2022-11-17 08:01:00 +02:00
Benny Halevy
2c677e294b shared_token_metadata: get_lock is const
The lock is acquired using a function that
doesn't modify the shared_token_metadata object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
d6b2124903 repair: sync_data_using_repair: require to run on shard 0
And with that do_sync_data_using_repair can be folded into
sync_data_using_repair.

This will simplify using the effective_replication_map
throughout the operation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
0c56c75cf8 repair: require all node operations to be called on shard 0
To simplify use of the effective_replication_map / token_metadata_ptr
throughout the operation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
64b0756adc repair: repair_info: keep effective_replication_map
Sampled when repair info is constructed.
To be used throughout the repair process.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
c7d753cd44 repair: do_repair_start: use keyspace erm to get keyspace local ranges
Rather than calling db.get_keyspace_local_ranges that
looks up the keyspace and its erm again.

We want all the information derived from the erm to
be based on the same source.

The function is synchronous, so this change doesn't
fix anything; it just cleans up the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
aaf74776c2 repair: do_repair_start: use keyspace erm for get_primary_ranges
Ensure that the primary ranges are in sync with the
keyspace erm.

The function is synchronous so this change doesn't fix anything,
it just cleans up the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
9200e6b005 repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
Ensure the erm and topology are in sync.

The function is synchronous so this change doesn't fix
anything, just cleans up the code.

Fix mistake in comment while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:57:56 +02:00
Benny Halevy
59dc2567fd repair: do_repair_start: check_in_shutdown first
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
881eb0df83 repair: get_db().local() where needed
In several places we get the sharded database using get_db()
and then we only use db.local().  Simplify the code by keeping
a reference only to the local database upfront.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
c22c4c8527 repair: get topology from erm/token_metadata_ptr
We want the topology to be synchronized with the respective
effective_replication_map / token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
94f2e95a2f view: get_view_natural_endpoint: get topology from erm
Get the topology from the effective replication map rather
than from the storage_proxy to ensure it's synchronized
with the natural endpoints.

Since there's currently no preemption between the two calls
there is no issue, so this is merely a cleanup
and not supposed to fix anything.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Nadav Har'El
e393639114 test/cql-pytest: reproducer for crash in LWT with null key
This patch adds a reproducer for issue #11954: Attempting an
"IF NOT EXISTS" (LWT) write with a null key crashes Scylla,
instead of producing a simple error message (like happens
without the "IF NOT EXISTS" after #7852 was fixed).

The test passed on Cassandra, but crashes Scylla. Because of this
crash, we can't just mark the test "xfail" and it's temporarily
marked "skip" instead.

Refs #11954.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11982
2022-11-17 07:31:13 +02:00
Benny Halevy
d0bd305d16 locator: refactor topology out of token_metadata
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 21:55:54 +02:00
Benny Halevy
297a4de4e4 locator: add types.hh
To export low-level types that are used by other modules
for the locator interfaces.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 21:53:05 +02:00
Kamil Braun
0c9cb5c5bf Merge 'raft: wait for the next tick before retrying' from Gusev Petr
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at an "inappropriate" time and result in an exception. There
are two reasons for this: the leader is changing, or, in the case of
`modify_config`, another `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could lead to a tight loop.

The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.
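The client-side contract can be sketched like this (only the `transient_error` name comes from the patch; the delay hook, signature, and retry cap are illustrative):

```cpp
struct transient_error {};

// Retry an operation, sleeping one "tick" between attempts instead of
// looping tightly, up to max_attempts before rethrowing.
template <typename Op, typename Sleep>
auto retry_with_delay(Op op, Sleep sleep_one_tick, int max_attempts) {
    for (int attempt = 1; ; ++attempt) {
        try {
            return op();
        } catch (const transient_error&) {
            if (attempt == max_attempts) {
                throw;           // give up after the last attempt
            }
            sleep_one_tick();    // wait for the next tick before retrying
        }
    }
}
```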

Fixes: #11564

Closes #11769

* github.com:scylladb/scylladb:
  raft: refactor: remove duplicate code on retries delays
  raft: use wait_for_next_tick in read_barrier
  raft: wait for the next tick before retrying
2022-11-16 18:20:54 +01:00
Aleksandra Martyniuk
4250bd9458 tasks: do not run tasks that are aborted
Currently in start() method a task is run even if it was already
aborted.

When start() is called on an aborted task, its state is set to
task_manager::task_state::failed and it doesn't run.
2022-11-16 18:09:41 +01:00
Aleksandra Martyniuk
ebffca7ea5 tasks: delete unused variable 2022-11-16 18:07:57 +01:00
Aleksandra Martyniuk
752edc2205 tasks: add abort_source to task_manager::task::impl
task_manager::task can be aborted with impl's abort_source.
By default abort request is propagated to all task's descendants.
2022-11-16 18:07:11 +01:00
Avi Kivity
c4f069c6fc Update seastar submodule
* seastar 153223a188...4f4cc00660 (10):
  > Merge 'Avoid using namespace internal' from Pavel Emelyanov
  > Merge 'De-futurize IO class update calls' from Pavel Emelyanov
  > abort_source: subscribe(): remove noexcept qualifier
  > Merge 'Add Prometheus filtering capabilities by label' from Amnon Heiman
  > fsqual: stop causing memory leak error on LeakSanitizer
  > metrics.cc: Do not merge empty histogram
  > Update tutorial.md
  > README-DPDK.md: document --cflags option
  > build: install liburing.pc using stow
  > core/polymorphic_temporary_buffer: include <seastar/core/memory.hh>

Closes #11991
2022-11-16 17:59:33 +02:00
Avi Kivity
3497891cf9 utils: spell "barrett" correctly
As P. T. Barnum famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.

Closes #11989
2022-11-16 16:30:38 +02:00
Benny Halevy
0c94ffcc85 topology: delete copy constructor
Topology is copied only from token_metadata_impl::clone_only_token_map
which copies the token_metadata_impl with yielding to prevent reactor
stalls.  This should apply to topology as well, so
add a clone_gently function for cloning the topology
from token_metadata_impl::clone_only_token_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 15:27:28 +02:00
Benny Halevy
4f4fc7fe22 token_metadata: coroutinize clone functions
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 15:27:28 +02:00
Kamil Braun
a83789160d message: messaging_service: check for known topology before calling is_same_dc/rack
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.

When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).

Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.

Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from the preferred IP
cache if the endpoint resides in the same DC. If there is no entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
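A sketch of the hardened lookup described above, with illustrative stand-in types (endpoints as strings, the cache as a plain map; not the actual messaging_service code):

```cpp
#include <string>
#include <unordered_map>

// Consult the same-dc preferred-IP cache only when the peer's topology is
// actually known; otherwise fall back to the external IP, just as when the
// cache has no entry for the peer.
std::string get_preferred_ip(const std::string& peer,
                             bool topology_known,
                             bool same_dc,
                             const std::unordered_map<std::string, std::string>& cache) {
    if (topology_known && same_dc) {
        auto it = cache.find(peer);
        if (it != cache.end()) {
            return it->second;   // preferred (internal) address
        }
    }
    return peer;                 // unknown topology or no entry: external IP
}
```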
2022-11-16 14:01:50 +01:00
Kamil Braun
9b2449d3ea test: reenable test_topology::test_decommission_node_add_column
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.

Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
2022-11-16 14:01:50 +01:00
Kamil Braun
0f49813312 test/pylib: util: configurable period in wait_for 2022-11-16 14:01:50 +01:00
Kamil Braun
1bd2471c19 message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780
2022-11-16 14:01:50 +01:00
Kamil Braun
840be34b5f message: messaging_service: topology independent connection settings for GOSSIP verbs
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).

The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.

However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix #11780.

To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).

This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.
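A minimal sketch of the decision described above, with illustrative names (`peer_topology`, `settings_for_data`, `settings_for_gossip` are not ScyllaDB's actual API): topology-dependent connections stay unencrypted only for a peer known to share our dc/rack, while the gossip connection's settings ignore topology entirely.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative types, not ScyllaDB's API.
struct peer_topology {
    std::string dc;
    std::string rack;
};

struct conn_settings {
    bool encrypt;
};

// Topology-dependent logic (data connections): unencrypted only when the
// peer is known to be in the same dc/rack.
conn_settings settings_for_data(const std::optional<peer_topology>& peer,
                                const peer_topology& self,
                                bool encryption_enabled) {
    if (!encryption_enabled) {
        return {false};
    }
    bool same = peer && peer->dc == self.dc && peer->rack == self.rack;
    return {!same};
}

// Gossip connection: settings independent of topology knowledge, so the
// verbs are always deliverable even before topology is known.
conn_settings settings_for_gossip(bool encryption_enabled) {
    return {encryption_enabled};
}
```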

Fixes #11992.

Inspired by xemul/scylla@45d48f3d02.
2022-11-16 13:58:07 +01:00
Anna Stuchlik
01c9846bb6 doc: add the link to the Enabling Experimental Features section 2022-11-16 13:24:45 +01:00
Anna Stuchlik
f1b2f44aad doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section 2022-11-16 13:23:07 +01:00
Nadav Har'El
2f2f01b045 materialized views: fix view writes after base table schema change
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.

This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.

This patch also includes a reproducing test that failed before this
patch, and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.

Fixes #10026
Fixes #11542
2022-11-16 13:58:21 +02:00
Nadav Har'El
7cbb0b98bb Merge 'doc: document user defined functions (UDFs)' from Anna Stuchlik
This PR is V2 of the [PR created by @psarna](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.

Fixes https://github.com/scylladb/scylladb/issues/11378

Closes #11984

* github.com:scylladb/scylladb:
  doc: label user-defined functions as Experimental
  doc: restore the note for the Count function (removed by mistake)
  doc: document user defined functions (UDFs)
2022-11-16 13:09:47 +02:00
Botond Dénes
cbf9be9715 Merge 'Avoid 0.0.0.0 (and :0) as preferred IP' from Pavel Emelyanov
Although the docs discourage using INADDR_ANY as the listen address, this is not prevented in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents that from happening and also adds a sanity check so this value is not used if it somehow sneaks in.
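The sanity check amounts to something like the following (hypothetical helper name; the real check lives in messaging_service):

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>

// Hypothetical helper illustrating the sanity check: INADDR_ANY (0.0.0.0)
// must never be accepted as a preferred IP.
bool acceptable_preferred_ip(const in_addr& addr) {
    return addr.s_addr != htonl(INADDR_ANY);
}
```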

Closes #11846

* github.com:scylladb/scylladb:
  messaging_service: Deny putting INADDR_ANY as preferred ip
  messaging_service: Toss preferred ip cache management
  gossiping_property_file_snitch: Don't gossip INADDR_ANY preferred IP
  gossiping_property_file_snitch: Make _listen_address optional
2022-11-16 08:30:42 +02:00
Avi Kivity
43d3e91e56 tools: toolchain: prepare: use real bash associative array
When we translate from docker/go arch names to the kernel arch
names, we use an associative array hack using computed variable
names "${!variable_name}". But it turns out bash has real
associative arrays, introduced with "declare -A". Use them to make
the code a little clearer.

Closes #11985
2022-11-16 08:17:47 +02:00
Botond Dénes
e90d0811d0 Merge 'doc: update ScyllaDB requirements - supported CPUs and AWS i4g instances' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4144

Closes #11226

* github.com:scylladb/scylladb:
  Update docs/getting-started/system-requirements.rst
  doc: specify the recommended AWS instance types
  doc: replace the tables with a generic description of support for Im4gn and Is4gen instances
  doc: add support for AWS i4g instances
  doc: extend the list of supported CPUs
2022-11-16 08:15:00 +02:00
Botond Dénes
bd1fcbc38f Merge 'Introduce reverse vector_deserializer.' from Michał Radwański
As indicated in #11816, we'd like to enable deserializing vectors in reverse.
The forward deserialization is achieved by reading from an input_stream. The
input stream is internally a singly linked list with complicated logic. To
allow going through it in reverse, when creating the reverse vector
deserializer we instead scan the stream and store substreams for all the
places that are a starting point for the next element. The iterator itself
just deserializes elements from the remembered substreams, this time in reverse.
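The two-pass idea can be sketched like this (simplified to fixed-size elements in a flat buffer; the real code records input_stream substreams, not offsets):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative sketch, not ScyllaDB's serializer API: a forward scan records
// the start position of every element, then elements are decoded
// back-to-front from the remembered positions.
std::vector<uint32_t> deserialize_reversed(const std::vector<uint8_t>& buf,
                                           size_t n) {
    // Forward pass: remember where each (fixed-size) element begins.
    std::vector<size_t> starts;
    for (size_t i = 0; i < n; ++i) {
        starts.push_back(i * sizeof(uint32_t));
    }
    // Reverse pass: decode each element from its remembered position.
    std::vector<uint32_t> out;
    for (auto it = starts.rbegin(); it != starts.rend(); ++it) {
        uint32_t v;
        std::memcpy(&v, buf.data() + *it, sizeof(v));
        out.push_back(v);
    }
    return out;
}
```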

Fixes #11816

Closes #11956

* github.com:scylladb/scylladb:
  test/boost/serialization_test.cc: add test for reverse vector deserializer
  serializer_impl.hh: add reverse vector serializer
  serializer_impl: remove unneeded generic parameter
2022-11-16 07:37:24 +02:00
Anna Stuchlik
cdb6557f23 doc: label user-defined functions as Experimental 2022-11-15 21:22:01 +01:00
Avi Kivity
d85f731478 build: update toolchain to Fedora 37 with clang 15
'cargo' instantiation now overrides internal git client with
cli client due to unbounded memory usage [1].

[1] https://github.com/rust-lang/cargo/issues/10583#issuecomment-1129997984
2022-11-15 16:48:09 +00:00
Anna Stuchlik
1f1d88d04e doc: restore the note for the Count function (removed by mistake) 2022-11-15 17:41:22 +01:00
Anna Stuchlik
dbb19f55fb doc: document user defined functions (UDFs) 2022-11-15 17:33:05 +01:00
Nadav Har'El
e4dba6a830 test/cql-pytest: add test for when MV requires IS NOT NULL
As noted in issue #11979, Scylla inconsistently (and unlike Cassandra)
requires "IS NOT NULL" on some but not all materialized-view key
columns. Specifically, Scylla does not require "IS NOT NULL" on the
base's partition key, while Cassandra does.

This patch is a test which demonstrates this inconsistency. It currently
passes on Cassandra and fails on Scylla, so is marked xfail.

Refs #11979

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11980
2022-11-15 14:21:48 +01:00
Asias He
16bd9ec8b1 gossip: Improve get_live_token_owners and get_unreachable_token_owners
The get_live_token_owners returns the nodes that are part of the ring
and live.

The get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.

The token_metadata::get_all_endpoints returns nodes that are part of the
ring.

The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and to call is_alive to check if
the node is up or down, so that the correctness does not depend on
any derived information.

This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.

Fixes #10296
Fixes #11928

Closes #11952
2022-11-15 14:21:48 +01:00
Botond Dénes
21489c9f9c Merge 'doc: add the "Scylladb Enterprise" label to the Enterprise-only features' from Anna Stuchlik
This PR is a follow-up to https://github.com/scylladb/scylladb/pull/11918.

With this PR:
- The "ScyllaDB Enterprise" label is added to all the features that are only available in ScyllaDB Enterprise.
- The previous Enterprise-only note is removed (it was included in multiple files as _/rst_include/enterprise-only-note.rst_ - this file is removed as it is no longer used anywhere in the docs).
- "Scylla Enterprise" was removed from `versionadded` because now it's clear that the feature was added for Enterprise.

Closes #11975

* github.com:scylladb/scylladb:
  doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore
  doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features
2022-11-15 14:21:48 +01:00
Botond Dénes
34f29c8d67 Merge 'Use with_sstable_directory() helper in tests' from Pavel Emelyanov
The helper is already widely used; one (last) test case can benefit from using it too.

Closes #11978

* github.com:scylladb/scylladb:
  test: Indentation fix after previous patch
  test: Use with_sstable_directory() helper
2022-11-15 14:21:48 +01:00
Nadav Har'El
8a4ab87e44 Merge 'utils: crc: generate crc barrett fold tables at compile time' from Avi Kivity
We use Barrett tables (misspelled in the code unfortunately) to fold
crc computations of multiple buffers into a single crc. This is important
because it turns out to be faster to compute crc of three different buffers
in parallel rather than compute the crc of one large buffer, since the crc
instruction has latency 3.

Currently, we have a separate code generation step to compute the
fold tables. The step generates a new C++ source files with the tables.
But modern C++ allows us to do this computation at compile time, avoiding
the code generation step. This simplifies the build.

This series does that. There is some complication in that the code uses
compiler intrinsics for the computation, and these are not constexpr friendly.
So we first introduce constexpr-friendly alternatives and use them.

To prove the transformation is correct, I compared the generated code from
before the series and from just before the last step (where we use constexpr
evaluation but still retain the generated file) and saw no difference in the values.

Note that constexpr is not strictly needed - we could have run the code in the
global variables' initializer. But that would cause a crash if we run on a pre-clmul
machine, and is not as fun.

Closes #11957

* github.com:scylladb/scylladb:
  test: crc: add unit tests for constexpr clmul and barrett fold
  utils: crc combine table: generate at compile time
  utils: barrett: inline functions in header
  utils: crc combine table: generate tables at compile time
  utils: crc combine table: extract table generation into a constexpr function
  utils: crc combine table: extract "pow table" code into constexpr function
  utils: crc combine table: store tables std::array rather than C array
  utils: barrett: make the barrett reduction constexpr friendly
  utils: clmul: add 64-bit constexpr clmul
  utils: barrett: extract barrett reduction constants
  utils: barrett: reorder functions
  utils: make clmul() constexpr
2022-11-15 14:21:48 +01:00
Petr Gusev
ae3e0e3627 raft: refactor: remove duplicate code on retry delays
Introduce a templated function do_on_leader_with_retries,
use it in add_entries/modify_config/read_barrier. The
function implements the basic logic of retries with aborts
and leader changes handling, adds a delay between
iterations to protect against tight loops.
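A rough sketch of the shape of such a helper (names and signatures are illustrative; the real do_on_leader_with_retries is a coroutine with abort and leader-change handling):

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical error type; retriable failures are signaled with it.
struct transient_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Retry an operation, sleeping between iterations to avoid a tight loop.
template <typename Op, typename Delay>
auto do_with_retries(Op op, Delay delay_between_iterations, int max_attempts) {
    for (int attempt = 1;; ++attempt) {
        try {
            return op();
        } catch (const transient_error&) {
            if (attempt == max_attempts) {
                throw; // out of retries, propagate
            }
            delay_between_iterations(); // protect against a tight retry loop
        }
    }
}
```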
2022-11-15 13:18:53 +04:00
Petr Gusev
15cc1667d0 raft: use wait_for_next_tick in read_barrier
Replaced the yield on transport_error
with wait_for_next_tick. Added delays for retries, similar
to add_entry/modify_config: we postpone the next
call attempt if we haven't received new information
about the current leader.
2022-11-15 12:31:49 +04:00
Petr Gusev
5e15c3c9bd raft: wait for the next tick before retrying
When modify_config or add_entry is forwarded
to the leader, it may reach the node at
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in the case of modify_config, another
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could lead to a tight loop.

The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.

We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We accept a possibly excessive delay if the newly elected leader
is the same as the previous one; this is supposed to be rare.

Fixes: #11564
2022-11-15 11:49:26 +04:00
Pavel Emelyanov
8dcd9d98d6 test: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-14 20:11:01 +03:00
Pavel Emelyanov
c9128e9791 test: Use with_sstable_directory() helper
It's already used everywhere, but one test case wires up the
sstable_directory by hand. Fix it too, but keep in mind that the caller
fn stops the directory early.

(indentation is deliberately left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-14 20:11:01 +03:00
Michał Radwański
32c60b44c5 test/boost/serialization_test.cc: add test for reverse vector
deserializer

This test is just a copy-pasted version of the forward serializer test.
2022-11-14 16:06:24 +01:00
Michał Radwański
dce67f42f8 serializer_impl.hh: add reverse vector serializer
Currently, when we want to deserialize a mutation in reverse, we unfreeze
it and consume from the end. This new reverse vector deserializer
goes through the input stream remembering substreams that contain a given
output range member, and while traversing from the back, deserializes
each substream.
2022-11-14 16:06:24 +01:00
Anna Stuchlik
e36bd208cc doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore 2022-11-14 15:20:51 +01:00
Anna Stuchlik
36324fe748 doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features 2022-11-14 15:16:51 +01:00
Takuya ASADA
da6c472db9 install.sh: Skip systemd existence check when --without-systemd
When --without-systemd is specified, install.sh should skip the systemd
existence check.

Fixes #11898

Closes #11934
2022-11-14 14:07:46 +02:00
Benny Halevy
ff5527deb1 topology: copy _sort_by_proximity in copy constructor
Fixes #11962

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11965
2022-11-14 13:59:56 +03:00
Pavel Emelyanov
bd48fdaad5 Merge 'handle_state_normal: do not update topology of removed endpoint' from Benny Halevy
Currently, when replacing a node ip, keeping the old host,
we might end up with the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.

Then, later on, on_change sees that:
```
    if (get_token_metadata().is_member(endpoint)) {
        co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Closes #11930

* github.com:scylladb/scylladb:
  storage_service, system_keyspace: add debugging around system.peers update
  storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
2022-11-14 13:58:28 +03:00
Botond Dénes
8e38551d93 Merge 'Allow each compaction group to have its own compaction backlog tracker' from Raphael "Raph" Carvalho
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction_group will be allowed to
create its own tracker and manage it by itself.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
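The aggregation can be sketched as follows (illustrative helper, not the actual table code):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// The table backlog is the sum over all compaction-group backlogs, with the
// normalization factor applied once to the sum rather than to each group.
double table_backlog(const std::vector<double>& group_backlogs,
                     double normalization_factor) {
    double sum = std::accumulate(group_backlogs.begin(), group_backlogs.end(), 0.0);
    return sum / normalization_factor;
}
```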

Closes #11762

* github.com:scylladb/scylladb:
  replica: Allow one compaction_backlog_tracker for each compaction_group
  compaction: Make compaction_state available for compaction tasks being stopped
  compaction: Implement move assignment for compaction_backlog_tracker
  compaction: Fix compaction_backlog_tracker move ctor
  compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
  compaction: kill undefined get_unimplemented_backlog_tracker()
  replica: Refactor table::set_compaction_strategy for multiple groups
  Fix exception safety when transferring ongoing charges to new backlog tracker
  replica: move_sstables_from_staging: Use tracker from group owning the SSTable
  replica: Move table::backlog_tracker_adjust_charges() to compaction_group
  replica: table::discard_sstables: Use compaction_group's backlog tracker
  replica: Disable backlog tracker in compaction_group::stop()
  replica: database_sstable_write_monitor: use compaction_group's backlog tracker
  replica: Move table::do_add_sstable() to compaction_group
  test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
  compaction/table_state: Introduce get_backlog_tracker()
2022-11-14 07:05:28 +02:00
Avi Kivity
b8cb34b928 test: crc: add unit tests for constexpr clmul and barrett fold
Check that the constexpr variants indeed match the runtime variants.

I verified manually that exactly one computation in each test is
executed at run time (and is compared against a constant).
2022-11-13 16:22:29 +02:00
Avi Kivity
70217b5109 utils: crc combine table: generate at compile time
By now the crc combine tables are generated at compile time,
but still in a separate code generation step. We now eliminate
the code generation step and instead link the global variables
directly into the main executable. The global variables have
been conveniently named exactly as the code generation step
names them, so we don't need to touch any users.
2022-11-12 17:26:45 +02:00
Avi Kivity
164e991181 utils: barrett: inline functions in header
Avoid duplicate definitions if the same header is used from more than
one place, as it will soon be.
2022-11-12 17:26:08 +02:00
Avi Kivity
a4f06773da utils: crc combine table: generate tables at compile time
Move the tables into global constinit variables that are
generated at compile time. Note the code that creates
the generated crc32_combine_table.cc is still called; it
transforms compile-time generated tables into a C++ source
that contains the same values, as literals.

If we generate a diff between gen/utils/gz/crc_combine_table.cc
before this series and after this patch, we see the only change
in the file is the type of the variable (which changed to
std::array), proving our constexpr code is correct.
2022-11-12 17:16:59 +02:00
Avi Kivity
a229fdc41e utils: crc combine table: extract table generation into a constexpr function
Move the code to a constexpr function, so we can later generate the tables at
compile time. Note that although the function is constexpr, it is still
evaluated at runtime, since the calling function (main()) isn't constexpr
itself.
2022-11-12 17:13:52 +02:00
Avi Kivity
d42bec59bb utils: crc combine table: extract "pow table" code into constexpr function
A "pow table" is used to generate the Barrett fold tables. Extract its
code into a constexpr function so we can later generate the fold tables
at compile time.
2022-11-12 17:11:44 +02:00
Avi Kivity
6e34014b64 utils: crc combine table: store tables std::array rather than C array
C arrays cannot be returned from functions and therefore aren't suitable
for constexpr processing. std::array<> is a regular value and so is
constexpr friendly.
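For example (illustrative table, not the actual crc combine table):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A C array cannot be returned by value, so it cannot be the result of a
// constexpr generator function. std::array is a regular value type, so it can.
constexpr std::array<uint32_t, 4> make_table() {
    std::array<uint32_t, 4> t{};
    for (uint32_t i = 0; i < t.size(); ++i) {
        t[i] = i * i;
    }
    return t;
}

constexpr auto table = make_table(); // evaluated at compile time
static_assert(table[3] == 9);
```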
2022-11-12 17:09:02 +02:00
Avi Kivity
1e9252f79a utils: barrett: make the barrett reduction constexpr friendly
Dispatch to intrinsics or constexpr based on evaluation context.
2022-11-12 17:04:44 +02:00
Avi Kivity
0bd90b5465 utils: clmul: add 64-bit constexpr clmul
This is used when generating the Barrett reduction tables, and also when
applying the Barrett reduction at runtime, so we need it to be constexpr
friendly.
2022-11-12 17:04:05 +02:00
Avi Kivity
c376c539b8 utils: barrett: extract barrett reduction constants
The constants are repeated across x86_64 and aarch64, so extract
them into a common definition.
2022-11-12 17:00:17 +02:00
Avi Kivity
2fdf81af7b utils: barrett: reorder functions
Reorder functions in dependency order rather than forward
declaring them. This makes them more constexpr-friendly.
2022-11-12 16:52:41 +02:00
Avi Kivity
8aa59a897e utils: make clmul() constexpr
clmul() is a pure function and so should already be constexpr,
but it uses intrinsics that aren't defined as constexpr and
so the compiler can't really compute it at compile time.

Fix by defining a constexpr variant and dispatching based
on whether we're being constant-evaluated or not.

The implementation is simple, but in any case proof that it
is correct will be provided later on.
2022-11-12 16:49:43 +02:00
Raphael S. Carvalho
b88acffd66 replica: Allow one compaction_backlog_tracker for each compaction_group
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.

On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
compaction_group_sstable_set_updater.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Raphael S. Carvalho
d862dd815c compaction: Make compaction_state available for compaction tasks being stopped
compaction_backlog_tracker will be managed by compaction_manager, in the
per table state. As compaction tasks can access the tracker throughout
its lifetime, remove() can only deregister the state once we're done
stopping all tasks which map to that state.
remove() extracted the state upfront, then performed the stop, to
prevent new tasks from being registered and left behind. But we can
avoid the leak of new tasks by only closing the gate, which waits
for all tasks (which are stopped a step earlier) and once closed,
prevents new tasks from being registered.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Raphael S. Carvalho
0a152a2670 compaction: Implement move assignment for compaction_backlog_tracker
That's needed for std::optional to work with it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:49 -03:00
Raphael S. Carvalho
fe305cefd0 compaction: Fix compaction_backlog_tracker move ctor
Luckily it's not used anywhere. The default move ctor was picked, but
it won't clear _manager of the old object, meaning that its destructor
will incorrectly deregister the tracker from
compaction_backlog_manager.
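The bug class can be reduced to a few lines (illustrative types, not the actual tracker):

```cpp
#include <cassert>
#include <utility>

// A defaulted move constructor would copy the _manager pointer, so the
// moved-from object's destructor would incorrectly deregister the live
// tracker. The fix is a move constructor that nulls out the source.
struct manager {
    int registered = 0;
};

struct tracker {
    manager* _manager = nullptr;

    explicit tracker(manager* m) : _manager(m) { ++m->registered; }
    tracker(tracker&& other) noexcept : _manager(other._manager) {
        other._manager = nullptr; // prevent double deregistration
    }
    ~tracker() {
        if (_manager) {
            --_manager->registered; // deregister only if still owning
        }
    }
};
```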

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
8e1e30842d compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
A step closer towards a separate backlog tracker for each compaction group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
fedafd76eb compaction: kill undefined get_unimplemented_backlog_tracker()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
90991bda69 replica: Refactor table::set_compaction_strategy for multiple groups
Refactoring the function for it to accommodate multiple compaction
groups.

To still provide strong exception guarantees, preparation and
execution of changes will be separated.

Once multiple groups are supported, each group will be prepared
first, and the noexcept execution will be done as a last step.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
244efddb22 Fix exception safety when transferring ongoing charges to new backlog tracker
When setting a new strategy, the charges of the old tracker are transferred
to the new one.

The problem is that we're not reverting changes if an exception is
triggered before the new strategy is successfully set.

To fix this exception safety issue, let's copy the charges instead
of moving them. If an exception is triggered, the old tracker is still
the one in use and remains intact.
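A minimal sketch of the copy-then-commit idea (illustrative types, not the actual tracker code):

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>
#include <vector>

// Illustrative tracker holding ongoing charges.
struct tracker {
    std::vector<double> charges;
};

// The charges are copied into the candidate first; if anything throws before
// the final commit, the old tracker still holds the original charges.
void set_new_tracker(tracker& current, tracker candidate, bool fail_midway) {
    candidate.charges = current.charges; // copy, do not move
    if (fail_midway) {
        throw std::runtime_error("strategy change failed");
    }
    current = std::move(candidate); // commit only on success
}
```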

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
d1e2dbc592 replica: move_sstables_from_staging: Use tracker from group owning the SSTable
When moving SSTables from staging directory, we'll conditionally add
them to backlog tracker. As each group has its own tracker, a given
sstable will be added to the tracker of the group that owns it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
9031dc3199 replica: Move table::backlog_tracker_adjust_charges() to compaction_group
Procedures that call this function happen to be in compaction_group,
so let's move it to the group. This simplifies the change where the
procedure retrieves the tracker from the group itself.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
116459b69e replica: table::discard_sstables: Use compaction_group's backlog tracker
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
b2d8545b15 replica: Disable backlog tracker in compaction_group::stop()
As we're moving backlog tracker to compaction group, we need to
stop the tracker there too. We're moving it a step earlier in
table::stop(), before sstables are cleared, but that's okay
because it's still done after the group was deregistered
from compaction manager, meaning no compactions are running.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
91b0d772e2 replica: database_sstable_write_monitor: use compaction_group's backlog tracker
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
f37a05b559 replica: Move table::do_add_sstable() to compaction_group
All callers of do_add_sstable() live in compaction_group, so it
should be moved into compaction_group too. It also makes it easier
for the function to retrieve the backlog tracker from the group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
835927a2ad test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
Important for decoupling backlog tracker from table's compaction
strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
1ec0ef18a5 compaction/table_state: Introduce get_backlog_tracker()
This interface will be helpful for allowing replica::table, unit
tests and sstables::compaction to access the compaction group's tracker
which will be managed by the compaction manager, once we complete
the decoupling work.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Nadav Har'El
ff87624fb4 test/cql-pytest: add another regression test for reversed-type bug
In commit 544ef2caf3 we fixed a bug where
a reversed clustering-key order caused problems using a secondary index
because of incorrect type comparison. That commit also included a
regression test for this fix.

However, that fix was incomplete, and improved later in commit
c8653d1321. That later fix was labeled
"better safe than sorry", and did not include a test demonstrating
any actual bug, so unsurprisingly we never backported that second
fix to any older branches.

Recently we discovered that missing the second patch does cause real
problems, and this patch includes a test which fails when the first
patch is in, but the second patch isn't (and passes when both patches
are in, and also passes on Cassandra).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11943
2022-11-11 11:01:22 +02:00
Botond Dénes
302917f63d mutation_compactor: add validator
The mutation compactor is used on most read-paths we have, so adding a
validator to it gives us good coverage; in particular, it gives us full
coverage of queries and compaction.
The validator validates mutation token (and mutation fragment kind)
monotonicity as that is quite cheap, while it is enough to catch the
most common problems. As we already have a validator on the compaction
path (in the sstable writer), the validator is disabled when the
mutation compactor is instantiated for compaction.
We should probably make this configurable at some point. The addition
of this validator should prevent the worst of the fragment reordering
bugs from affecting reads.
2022-11-11 10:26:05 +02:00
Botond Dénes
5c245b4a5e mutation_fragment_stream_validator: add a 'none' validation level
Which, as its name suggests, makes the validating filter not validate
anything at all. This validation level effectively makes it as if the
validator was not there at all.
2022-11-11 09:58:44 +02:00
Botond Dénes
a4b58f5261 test/boost/mutation_query_test: test_partition_limit: sort input data
The test's input data is currently out-of-order, violating a fundamental
invariant of data always being sorted. This doesn't cause any problems
right now, but soon it will. Sort the data to avoid that.
2022-11-11 09:58:44 +02:00
Botond Dénes
2c551bb7ce querier: consume_page(): use partition_start as the sentinel value
Said method calls `compact_mutation_state::start_new_page()` which
requires the kind of the next fragment in the reader. When there is no
fragment (reader is at EOS), we use partition-end. This was a poor
choice: if the reader is at EOS, partition-kind was the last fragment
kind, if the stream were to continue the next fragment would be a
partition-start.
2022-11-11 09:58:18 +02:00
Botond Dénes
0bcfc9d522 treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
We just added a convenience static factory method for partition end,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Botond Dénes
6a002953e9 position_in_partition: add for_partition_{start,end}() 2022-11-11 09:58:18 +02:00
Kamil Braun
4a2ec888d5 Merge 'test.py: use internal id to manage servers' from Alecco
Instead of using assigned IP addresses, use a local integer ID for
managing servers. An IP address can be reused by a different server.

While there, get the host ID (UUID). This can also be reused with `node
replace`, so it's not good enough for tracking.

Closes #11747

* github.com:scylladb/scylladb:
  test.py: use internal id to manage servers
  test.py: rename hostname to ip_addr
  test.py: get host id
  test.py: use REST api client in ScyllaCluster
  test.py: remove unnecessary reference to web app
  test.py: requests without aiohttp ClientSession
2022-11-10 17:12:16 +01:00
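The motivation above can be sketched in a few lines of Python. This is not test.py's actual `ScyllaCluster` class, just a hypothetical illustration of why a monotonically increasing local id never aliases two servers, even when an IP address is reused:

```python
# Hypothetical sketch: key servers by a local integer id, not by IP.
import itertools

class Cluster:
    def __init__(self):
        self._next_id = itertools.count(1)
        # server_id -> {'ip_addr': ..., 'host_id': ...}
        self.servers = {}

    def add_server(self, ip_addr, host_id=None):
        sid = next(self._next_id)
        self.servers[sid] = {'ip_addr': ip_addr, 'host_id': host_id}
        return sid

cluster = Cluster()
a = cluster.add_server('127.0.0.1')
cluster.servers.pop(a)              # server stopped; its IP may be reused
b = cluster.add_server('127.0.0.1')
assert a != b                       # the internal id still distinguishes them
```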
Kamil Braun
1cc68b262e docs: describe the Raft upgrade and recovery procedures
In the 5.1 -> 5.2 upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.

In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
  successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
  situation.

Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.

Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
  use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
  once they exist
2022-11-10 17:08:57 +01:00
Kamil Braun
3dab07ec11 docs: add upgrade guide 5.1 -> 5.2
It's a copy-paste from the 5.0 -> 5.1 guide with substitutions:
s/5.1/5.2,
s/5.0/5.1

The metric update guide is not written, I left a TODO.

Also I didn't include the guide in
docs/upgrade/upgrade-opensource/index.rst, since 5.2 is not released
yet.

The guide can be accessed by manually following the link:
/upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/
2022-11-10 16:49:14 +01:00
Alejo Sanchez
700054abee test.py: use internal id to manage servers
Instead of using assigned IP addresses, use an internal server id.

Define types to distinguish local server id, host ID (UUID), and IP
address.

This is needed to test servers changing IP address and for node replace
(host UUID).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
1e38f5478c test.py: rename hostname to ip_addr
The code explicitly manages an IP as a string; make that explicit in the
variable name.

Define its type, and test whether it is set on the instance instead of
using an empty string as a placeholder.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
f478eb52a3 test.py: get host id
When initializing a ScyllaServer, try to get the host id instead of only
checking that the REST API is up.

Use the existing aiohttp session from ScyllaCluster.

In case of an HTTP error, check that the status was not an internal error (500+).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
78663dda72 test.py: use REST api client in ScyllaCluster
Move the REST api client to ScyllaCluster. This will allow the cluster
to query its own servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
75ea345611 test.py: remove unnecessary reference to web app
The aiohttp.web.Application only needs to be passed through, so don't
store a reference in the ScyllaCluster object.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
a5316b0c6b test.py: requests without aiohttp ClientSession
Simplify REST helper by doing requests without a session.

Reusing an aiohttp.ClientSession causes knock-on effects on
`rest_api/test_task_manager` due to handling exceptions outside of an
async with block.

Requests for cluster management and the Scylla REST API don't need a
session anyway.

Raise HTTPError with status code, text reason, params, and json.

In ScyllaCluster.install_and_start() instead of adding one more custom
exception, just catch all exceptions as they will be re-raised later.

While there avoid code duplication and improve sanity, type checking,
and lint score.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Botond Dénes
21bc37603a Merge 'utils: config_src: add set_value_on_all_shards functions' from Benny Halevy
Currently, when we set a single value, we need to call
broadcast_to_all_shards to let observers on all shards get notified of
the new value.

However, the latter broadcasts all values to all shards, which is
terribly inefficient.

Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.

Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.

Refs #7316

Closes #11893

* github.com:scylladb/scylladb:
  tests: check ttl on different shards
  utils: config_src: add set_value_on_all_shards functions
  utils: config_file: add config_source::API
2022-11-10 07:16:39 +02:00
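The difference between the two paths can be sketched as follows. The real code is Seastar C++ on sharded `config_file` objects; the Python below is only a stand-in for the idea that updating one key per shard is much cheaper than re-broadcasting the whole config:

```python
# Stand-in for per-shard config state (hypothetical, not the real classes).
class ShardConfig:
    def __init__(self):
        self.values = {}

SHARDS = [ShardConfig() for _ in range(4)]

def broadcast_to_all_shards(config):
    # Old, heavyweight path: pushes every value to every shard.
    for shard in SHARDS:
        shard.values.update(config)

def set_value_on_all_shards(key, value):
    # New, targeted path: pushes only the one changed value.
    for shard in SHARDS:
        shard.values[key] = value

set_value_on_all_shards('task_ttl', 30)
assert all(s.values['task_ttl'] == 30 for s in SHARDS)
```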
Botond Dénes
3aff59f189 Merge 'staging sstables: filter tokens for view update generation' from Benny Halevy
This mini-series introduces dht::tokens_filter and uses it for consuming staging sstables in the view_update_generator.

The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges.

Refs #9559

Closes #11932

* github.com:scylladb/scylladb:
  db: view_update_generator: always clean up staging sstables
  compaction: extract incremental_owned_ranges_checker out to dht
2022-11-10 07:00:51 +02:00
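The filtering idea above can be illustrated with a minimal sketch. The real implementation is `dht::incremental_owned_ranges_checker` in C++ operating on token ranges from `get_keyspace_local_ranges`; the ranges and tokens below are hypothetical:

```python
# Hypothetical (start, end] token ranges owned by the current node.
OWNED_RANGES = [(0, 100), (500, 600)]

def owned(token):
    """Return True if the token falls inside a locally owned range."""
    return any(start < token <= end for start, end in OWNED_RANGES)

tokens = [50, 250, 550, 700]
# Only tokens inside the owned ranges survive the filter.
assert [t for t in tokens if owned(t)] == [50, 550]
```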
Avi Kivity
9b6ab5db4a Update seastar submodule
* seastar e0dabb361f...153223a188 (8):
  > build: compile dpdk with -fpie (position independent executable)
  > Merge 'io_request: remove ctor overloads of io_request and s/io_request/const io_request/' from Kefu Chai
  > iostream: remove unused function
  > smp: destroy_smp_service_group: verify smp_service_group id
  > core/circular_buffer: refactor loop in circular_buffer::erase()
  > Merge 'Outline reactor::add_task() and sanitize reactor::shuffle() methods' from Pavel Emelyanov
  > Add NOLINT for cert-err58-cpp
  > tests: Fix false-positive use-after-free detection

Closes #11940
2022-11-09 23:36:50 +02:00
Aleksandra Martyniuk
b0ed4d1f0f tests: check ttl on different shards
The test checking that the ttl is properly set is extended to check
whether the ttl value is changed on a non-zero shard.
2022-11-09 16:58:46 +02:00
Botond Dénes
725e5b119d Revert "replica: Pick new generation for SSTables being moved from staging dir"
This reverts commit ba6186a47f.

Said commit violates the widely held assumption that sstable generations
can be used as sstable identity. One known problem caused by this is a
potential out-of-order partition emitted when reading from sstables
(#11843). We now also have a better fix for #11789 (the bug this commit
was meant to fix): 4aa0b16852. So we can revert without regressions.

Fixes: #11843

Closes #11886
2022-11-09 16:35:31 +02:00
Eliran Sinvani
ab7429b77d cql: Fix crash upon use of the word empty for service level name
Wrong access to an uninitialized token instead of the actual generated
string caused the parser to crash. This wasn't detected by the ANTLR3
compiler because all the temporary variables defined in the ANTLR3
statements are global in the generated code. This essentially caused a
null dereference.

Tests: 1. The fixed issue scenario from github.
       2. Unit tests in release mode.

Fixes #11774

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>

Closes #11777
2022-11-09 15:58:57 +02:00
Anna Stuchlik
d2e54f7097 Merge branch 'master' into anna-requirements-arm-aws 2022-11-09 14:39:00 +01:00
Anna Stuchlik
8375304d9b Update docs/getting-started/system-requirements.rst
Co-authored-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2022-11-09 14:37:34 +01:00
Benny Halevy
38d8777d42 storage_service, system_keyspace: add debugging around system.peers update
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:47 +02:00
Benny Halevy
5401b6055c storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
Currently, when replacing a node's IP while keeping the old host, we
might end up with the old endpoint in system.peers if it is inserted
back into the topology by `handle_state_normal` when on_join is called
with the old endpoint.

Then, later on, on_change sees that:
```
        if (get_token_metadata().is_member(endpoint)) {
            co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:22 +02:00
Benny Halevy
1a183047c0 utils: config_src: add set_value_on_all_shards functions
Currently, when we set a single value, we need to call
broadcast_to_all_shards to let observers on all shards get notified of
the new value.

However, the latter broadcasts all values to all shards, which is
terribly inefficient.

Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.

Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.

Refs #7316

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 11:55:14 +02:00
Benny Halevy
e83f42ec70 utils: config_file: add config_source::API
For task_manager test api.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 11:53:20 +02:00
Botond Dénes
94db2123b9 Update tools/java submodule
* tools/java 583261fc0e...caf754f243 (1):
  > build: remove JavaScript snippets in ant build file
2022-11-09 07:59:04 +02:00
Benny Halevy
10f8f13b90 db: view_update_generator: always clean up staging sstables
Since they are currently not cleaned up by cleanup compaction, filter
their tokens, processing only tokens owned by the current node (based on
the keyspace replication strategy).

Refs #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 07:38:22 +02:00
Benny Halevy
fd3e66b0cc compaction: extract incremental_owned_ranges_checker out to dht
It is currently used by cleanup_compaction partition filter.
Factor it out so it can be used to filter staging sstables in
the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 07:32:56 +02:00
Gleb Natapov
2100a8f4ca service: raft: demote configuration change error to warning since it is retried anyway
Message-Id: <Y2ohbFtljmd5MNw0@scylladb.com>
2022-11-09 00:09:39 +01:00
Avi Kivity
04ecf4ee18 Update tools/java submodule (cassandra-stress fails with node down)
* tools/java 87672be28e...583261fc0e (1):
  > cassandra-stress: pass all hosts straight to the driver
2022-11-08 14:58:14 +02:00
Botond Dénes
7f69cccbdf scylla-gdb.py: $downcast_vptr(): add multiple inheritance support
When a class inherits from multiple virtual base classes, a pointer to
an instance of this class via one of its base classes might point
somewhere into the middle of the object, not at its beginning.
Therefore, the simple method currently employed by $downcast_vptr() of
casting the provided pointer to the type extracted from the vtable name
fails. Instead, when this situation is detected (observable because the
symbol name of the partial vtable is not at an offset of +16, but
larger), $downcast_vptr() will iterate over the base classes, adjusting
the pointer with their offsets, hoping to find the true start of the
object. In the one instance I tested, this method worked well.
At the very least, the method will now yield a null pointer when it
fails, instead of a badly casted object with corrupt content (which the
developer might or might not attribute to the bad cast).

Closes #11892
2022-11-08 14:51:26 +02:00
Michał Chojnowski
3e0c7a6e9f test: sstable_datafile_test: eliminate a use of std::regex to prevent stack overflow
This usage of std::regex overflows the seastar::thread stack size (128 KiB),
causing memory corruption. Fix that.

Closes #11911
2022-11-08 14:41:34 +02:00
Botond Dénes
2037d7f9cd Merge 'doc: add the "ScyllaDB Enterprise" label to highlight the Enterprise-only features' from Anna Stuchlik
This PR adds the "ScyllaDB Enterprise" label to highlight the Enterprise-only features on the following pages:
- Encryption at Rest - the label indicates that the entire page is about an Enterprise-only feature.
- Compaction - the labels indicate the sections that are Enterprise-only.

There are more occurrences across the docs that require a similar update. I'll update them in another PR if this PR is approved.

Closes #11918

* github.com:scylladb/scylladb:
  doc: fix the links to resolve the warnings
  doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box
  doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box
2022-11-08 09:53:48 +02:00
Raphael S. Carvalho
a57724e711 Make off-strategy compaction wait for view building completion
Prior to off-strategy compaction, streaming / repair would place
staging files into main sstable set, and wait for view building
completion before they could be selected for regular compaction.

The reason for that is that view building relies on the table providing
a mutation source without data in staging files. Had regular compaction
mixed staging data with non-staging data, the table would have a hard
time providing the required mutation source.

After off-strategy compaction, staging files can be compacted in
parallel to view building. If off-strategy completes first, it will
place the output into the main sstable set. So a parallel view building
(on sstables used for off-strategy) may potentially get a mutation
source containing staging data from the off-strategy output. That will
mislead the view builder, as it won't be able to detect changes to data
in the main directory.

To fix it, we'll do what we did before: filter out staging files from
compaction, and trigger the operation only after we're done with view
building. We piggyback on the off-strategy timer so that off-strategy
still runs only at the end of the node operation, to reduce the number
of compaction rounds on the data introduced by repair / streaming.

Fixes #11882.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11919
2022-11-08 08:53:58 +02:00
Botond Dénes
243fcb96f0 Update tools/python3 submodule
* tools/python3 bf6e892...773070e (1):
  > create-relocatable-package: harden against missing files
2022-11-08 08:43:30 +02:00
Avi Kivity
46690bcb32 build: harden create-relocatable-package.py against changes in libthread-db.so name
create-relocatable-package.py collects shared libraries used by
executables for packaging. It also adds libthread-db.so to make
debugging possible. However, the name it uses has changed in glibc, so
packaging fails on Fedora 37.

Switch to the version-agnostic name, libthread-db.so. This happens to be
a symlink, so resolve it.

Closes #11917
2022-11-08 08:41:22 +02:00
Takuya ASADA
acc408c976 scylla_setup: fix incorrect type definition on --online-discard option
The --online-discard option is defined as a string parameter, since it
doesn't specify "action=", yet it has a boolean default value
(default=True). This breaks "provisioning in a similar environment",
since the code assumed a boolean value would come from
"action='store_true'", but it doesn't.

We should change the type of the option to int, and also specify
"choices=[0, 1]", just like --io-setup does.

Fixes #11700

Closes #11831
2022-11-08 08:40:44 +02:00
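The fix described above can be reproduced with argparse directly. The option name mirrors scylla_setup, but the parser here is a minimal sketch, not the script's actual code:

```python
import argparse

parser = argparse.ArgumentParser()
# Buggy variant (commented out): no type, boolean default — any value the
# user passes arrives as a string, and the string "0" is truthy.
# parser.add_argument('--online-discard', default=True)

# Fixed variant: explicit int type restricted to 0/1, like --io-setup.
parser.add_argument('--online-discard', type=int, choices=[0, 1], default=1)

args = parser.parse_args(['--online-discard', '0'])
assert args.online_discard == 0 and isinstance(args.online_discard, int)
```

With the buggy variant, `--online-discard 0` would have produced the string `"0"`, which later code treating the value as a boolean would misread as "enabled".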
Avi Kivity
3d345609d8 config: disable "mc" format sstables for new data
"md" format was introduced in 4.3, in 3530e80ce1, two years ago.
Disable the option to create new sstables with the "mc" format.

Closes #11265
2022-11-08 08:36:27 +02:00
Anna Stuchlik
0eaafced9d doc: fix the links to resolve the warnings 2022-11-07 19:15:21 +01:00
Anna Stuchlik
b57e0cfb7c doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box 2022-11-07 18:54:35 +01:00
Anna Stuchlik
9f3fcb3fa0 doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box 2022-11-07 18:48:37 +01:00
Tomasz Grabiec
a9063f9582 Merge 'service/raft: failure detector: ping raft::server_ids, not gms::inet_addresses' from Kamil Braun
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).

Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.

To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).

This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.

To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
  an IP address, because now the stored `_alive_set` already contains
  `raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
  `gms::inet_address`. To send the ping message, we need to translate
  this to IP address; we do it by the `raft_address_map` pointer
  introduced in an earlier commit.

Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.

Closes #11759

* github.com:scylladb/scylladb:
  direct_failure_detector: get rid of complex `endpoint_id` translations
  service/raft: ping `raft::server_id`s, not `gms::inet_address`es
  service/raft: store `raft_address_map` reference in `direct_fd_pinger`
  gms: gossiper: move `direct_fd_pinger` out to a separate service
  gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
2022-11-07 16:42:35 +01:00
Botond Dénes
2b572d94f5 Merge 'doc: improve the documentation landing page ' from Anna Stuchlik
This PR introduces the following changes to the documentation landing page:

- The " New to ScyllaDB? Start here!" box is added.
- The "Connect your application to Scylla" box is removed.
- Some wording has been improved.
- "Scylla" has been replaced with "ScyllaDB".

Closes #11896

* github.com:scylladb/scylladb:
  Update docs/index.rst
  doc: replace Scylla with ScyllaDB on the landing page
  doc: improve the wording on the landing page
  doc: add the link to the ScyllaDB Basics page to the documentation landing page
2022-11-07 16:18:59 +02:00
Avi Kivity
91f2cd5ac4 test: lib: exception_predicate: use boost::regex instead of std::regex
std::regex was observed to overflow stack on aarch64 in debug mode. Use
boost::regex until the libstdc++ bug[1] is fixed.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582

Closes #11888
2022-11-07 14:03:25 +02:00
Kamil Braun
0c7ff0d2cb docs: a single 5.0 -> 5.1 upgrade guide
There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the
same is true for other version pairs, but I digress) for different
environments:
- "ScyllaDB Image for EC2, GCP, and Azure"
- Ubuntu
- Debian
- RHEL/CentOS

The Ubuntu and Debian pages used a common template:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
with different variable substitutions.

The "Image" page used a similar template, with some extra content in the
middle:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```

The RHEL/CentOS page used a different template:
```
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst
```

This was an unmaintainable mess. Most of the content was "the same" for
each of these options. The only content that must actually be different
is the part with package installation instructions (e.g. calls to `yum`
vs `apt-get`). The rest of the content was logically the same - the
differences were mistakes, typos, and updates/fixes to the text that
were made in some of these docs but not others.

In this commit I prepare a single page that covers the upgrade and
rollback procedures for each of these options. The section dependent on
the system was implemented using Sphinx Tabs.

I also fixed and changed some parts:

- In the "Gracefully stop the node" section:
Ubuntu/Debian/Images pages had:

```rst
.. code:: sh

   sudo service scylla-server stop
```

RHEL/CentOS pages had:
```rst
.. code:: sh

.. include:: /rst_include/scylla-commands-stop-index.rst
```

the stop-index file contained this:
```rst
.. tabs::

   .. group-tab:: Supported OS

      .. code-block:: shell

         sudo systemctl stop scylla-server

   .. group-tab:: Docker

      .. code-block:: shell

         docker exec -it some-scylla supervisorctl stop scylla

      (without stopping *some-scylla* container)
```

So the RHEL/CentOS version had two tabs: one for Scylla installed
directly on the system, one for Scylla running in Docker - which is
interesting, because nothing anywhere else in the upgrade documents
mentions Docker.  Furthermore, the RHEL/CentOS version used `systemctl`
while the ubuntu/debian/images version used `service` to stop/start
scylla-server.  Both work on modern systems.

The Docker option is completely out of place - the rest of the upgrade
procedure does not mention Docker. So I decided it doesn't make sense to
include it. Docker documentation could be added later if we actually
decide to write upgrade documentation when using Docker...  Between
`systemctl` and `service` I went with `service` as it's a bit
higher-level.

- Similar change for "Start the node" section, and corresponding
  stop/start sections in the Rollback procedure.

- To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb
  repo" in the Debian/Ubuntu tabs, I provide two separate links: to
  Debian and Ubuntu repos.

- the link to rollback procedure in the RPM guide (in 'Download and
  install the new release' section) pointed to rollback procedure from
  3.0 to 3.1 guide... Fixed to point to the current page's rollback
  procedure.

- in the rollback procedure steps summary, the RPM version missed the
  "Restore system tables" step.

- in the rollback procedure, the repository links were pointing to the
  new versions, while they should point to the old versions.

There are some other pre-existing problems I noticed that need fixing:

- EC2/GCP/Azure option has no corresponding coverage in the rollback
  section (Download and install the old release) as it has in the
  upgrade section. There is no guide for rolling back 3rd party and OS
  packages, only Scylla. I left a TODO in a comment.
- the repository links assume certain Debian and Ubuntu versions (Debian
  10 and Ubuntu 20), but there are more available options (e.g. Ubuntu
  22). Not sure how to deal with this problem. Maybe a separate section
  with links? Or just a generic link without choice of platform/version?

Closes #11891
2022-11-07 14:02:08 +02:00
Avi Kivity
9fa1783892 Merge 'cleanup compaction: flush memtable' from Benny Halevy
Flush the memtable before cleaning up the table so not to leave any disowned tokens in the memtable
as they might be resurrected if left in the memtable.

Fixes #1239

Closes #11902

* github.com:scylladb/scylladb:
  table: perform_cleanup_compaction: flush memtable
  table: add perform_cleanup_compaction
  api: storage_service: add logging for compaction operations et al
2022-11-07 13:18:12 +02:00
Anna Stuchlik
c8455abb71 Update docs/index.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-11-07 10:25:24 +01:00
AdamStawarz
6bc455ebea Update tombstones-flush.rst
change syntax:

nodetool compact <keyspace>.<mytable>;
to
nodetool compact <keyspace> <mytable>;

Closes #11904
2022-11-07 11:19:26 +02:00
Avi Kivity
224a2877b9 build: disable -Og in debug mode to avoid coroutine asan breakage
Coroutines and asan don't mix well on aarch64. This was seen in
22f13e7ca3 (" Revert "Merge 'cql3: select_statement: coroutinize
indexed_table_select_statement::do_execute_base_query()' from Avi
Kivity"") where a routine coroutinization was reverted due to failures
on aarch64 debug mode.

In clang 15 this is even worse, the existing code starts failing.
However, if we disable optimization (-O0 rather than -Og), things
begin to work again. In fact we can reinstate the patch reverted
above even with clang 12.

Fix (or rather workaround) the problem by avoiding -Og on aarch64
debug mode. There's the lingering fear that release mode is
miscompiled too, but all the tests pass on clang 15 in release mode
so it appears related to asan.

Closes #11894
2022-11-07 10:55:13 +02:00
Benny Halevy
eb3a94e2bc table: perform_cleanup_compaction: flush memtable
We don't explicitly clean up the memtable, while it might hold tokens
disowned by the current node.

Flush the memtable before performing cleanup compaction to make sure all
tokens in the memtable are cleaned up.

Note that non-owned ranges are invalidated in the cache in
compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.

Fixes #1239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:40 +02:00
Benny Halevy
fc278be6c4 table: add perform_cleanup_compaction
Move the integration with compaction_manager from the api layer to the
table class, so it can also make sure the memtable is cleaned up in the
next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:33 +02:00
Benny Halevy
85523c45c0 api: storage_service: add logging for compaction operations et al
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:31 +02:00
Petr Gusev
44f48bea0f raft: test_remove_node_with_concurrent_ddl
The test runs remove_node command with background ddl workload.
It was written in an attempt to reproduce scylladb#11228 but seems to have
value on its own.

The if_exists parameter has been added to the add_table
and drop_table functions, since the driver could retry
the request sent to a removed node, but that request
might have already been completed.

Function wait_for_host_known waits until the information
about the node reaches the destination node. Since we add
new nodes at each iteration in main, this can take some time.

A number of abort-related options were added to SCYLLA_CMDLINE_OPTIONS,
as they simplify nailing down problems.

Closes #11734
2022-11-04 17:16:35 +01:00
David Garcia
26bc53771c docs: automatic previews configuration
Closes #11591
2022-11-04 15:44:22 +02:00
Kamil Braun
e086521c1a direct_failure_detector: get rid of complex endpoint_id translations
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pinger` interface is responsible for translating these IDs
to 'real' addresses.

Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.

In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.

In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
2022-11-04 09:38:08 +01:00
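The simplification described above — `raft_id.uuid()` one way, `raft::server_id{ep_id}` the other — can be sketched as a trivial round-trip. The class below is a Python stand-in for the C++ `raft::server_id`, used only to illustrate why no sharded mapping table is needed once both sides are UUIDs:

```python
import uuid

class ServerId:
    """Hypothetical stand-in for raft::server_id (a UUID wrapper)."""
    def __init__(self, u):
        self._u = u

    def uuid(self):
        # raft_id.uuid(): a Raft ID *is* its endpoint_id.
        return self._u

    def __eq__(self, other):
        return self._u == other._u

raft_id = ServerId(uuid.uuid4())
endpoint_id = raft_id.uuid()             # endpoint_id from a Raft ID
assert ServerId(endpoint_id) == raft_id  # raft::server_id{ep_id} round-trips
```

Because the translation is a pure bijection, there is no state to keep in sync across shards — exactly the data structures this commit removes.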
Kamil Braun
bdeef77f20 service/raft: ping raft::server_ids, not gms::inet_addresses
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).

Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.

To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).

This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.

To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
  an IP address, because now the stored `_alive_set` already contains
  `raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
  `gms::inet_address`. To send the ping message, we need to translate
  this to IP address; we do it by the `raft_address_map` pointer
  introduced in an earlier commit.

Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
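The translate-at-the-last-moment behaviour described above can be sketched in a few lines (an illustrative Python model; `AddressMap`, `ping` and the sample IDs are hypothetical stand-ins for `raft_address_map` and the real C++ pinger):

```python
# Toy model of pinging by raft::server_id, translating to an IP address
# only at send time. All names are illustrative stand-ins for the C++
# classes described above, not the actual API.

class AddressMap:
    """Maps server IDs to their last known IP address."""
    def __init__(self):
        self._map = {}

    def set(self, server_id, ip):
        self._map[server_id] = ip

    def find(self, server_id):
        return self._map.get(server_id)  # None if no translation exists


def ping(addr_map, server_id, send_ping_to_ip):
    """Ping a server identified by its immutable raft::server_id.

    The ID -> IP translation happens just before sending; a missing
    translation is treated like a network failure (the ping fails).
    """
    ip = addr_map.find(server_id)
    if ip is None:
        return False  # no route to the server: consider the ping failed
    return send_ping_to_ip(ip)
```

The failure detector itself never sees an IP address; only the final send step does.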
2022-11-04 09:38:08 +01:00
Kamil Braun
ac70a05c7e service/raft: store raft_address_map reference in direct_fd_pinger
The pinger will use the map to translate `raft::server_id`s to
`gms::inet_address`es when pinging.
2022-11-04 09:38:08 +01:00
Kamil Braun
2c20f2ab9d gms: gossiper: move direct_fd_pinger out to a separate service
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
2022-11-04 09:38:08 +01:00
Kamil Braun
e9a4263e14 gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
`gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them
is to maintain a mapping between `gms::inet_address`es and
`direct_failure_detector::pinger::endpoint_id`s, another is to cache the
last known gossiper's generation number to use it for sending gossip
echo messages. The latter is the only gossiper-specific thing in this
class.

We want to move `direct_fd_pinger` outside `gossiper`. To do that, split
the gossiper-specific thing -- the generation number management -- into
a smaller class, `echo_pinger`.

`echo_pinger` is a top-level class (not a nested one like
`direct_fd_pinger` was), so we can forward-declare it and pass
references to it without including the gms/gossiper.hh header.
2022-11-04 09:38:08 +01:00
Avi Kivity
768d77d31b Update seastar submodule
* seastar f32ed00954...e0dabb361f (12):
  > sstring: define formatter
  > file: Dont violate API layering
  > Add compile_commands.json to gitignore
  > Merge 'Add an allocation failure metric' from Travis Downs
  > Use const test objects
  > Ragel chunk parser: compilation err, unused var
  > build: do not expose Valgrind in SeastarTargets.cmake
  > defer: mark deferred_* with [[nodiscard]]
  > Log selected reactor backend during startup
  > http: mark str with [[maybe_unused]]
  > Merge 'reactor: open fd without O_NONBLOCK when using io_uring backend' from Kefu Chai
  > reactor: add accept and connect to io_uring backend

Closes #11895
2022-11-04 09:27:56 +04:00
Anna Stuchlik
fb01565a15 doc: replace Scylla with ScyllaDB on the landing page 2022-11-03 17:42:49 +01:00
Anna Stuchlik
7410ab0132 doc: improve the wording on the landing page 2022-11-03 17:38:14 +01:00
Anna Stuchlik
ab5e48261b doc: add the link to the ScyllaDB Basics page to the documentation landing page 2022-11-03 17:31:03 +01:00
Pavel Emelyanov
efbfcdb97e Merge 'Replicate raft_address_map non-expiring entries to other shards' from Kamil Braun
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
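The resulting scheme can be illustrated with a small model (a toy sketch with hypothetical names, not the actual `raft_address_map` API): non-expiring entries may only be added on shard 0 and are copied to every shard, while expiring entries stay local to the shard that created them.

```python
# Illustrative model of the replication scheme described above.
# ShardedAddressMap and its methods are hypothetical names.

class ShardedAddressMap:
    def __init__(self, n_shards):
        self._shards = [dict() for _ in range(n_shards)]

    def add_expiring(self, shard, server_id, ip):
        # Expiring entries are shard-local; no cross-shard bookkeeping.
        self._shards[shard][server_id] = ip

    def add_non_expiring(self, shard, server_id, ip):
        if shard != 0:
            raise ValueError("non-expiring entries may only be added on shard 0")
        # Replicate to every shard, so the failure detector and future
        # Raft groups can translate the ID anywhere.
        for s in self._shards:
            s[server_id] = ip

    def find(self, shard, server_id):
        return self._shards[shard].get(server_id)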

Closes #11791

* github.com:scylladb/scylladb:
  test/raft: raft_address_map_test: add replication test
  service/raft: raft_address_map: replicate non-expiring entries to other shards
  service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
  service/raft: turn raft_address_map into a service
2022-11-03 18:34:42 +03:00
Avi Kivity
ca2010144e test: loading_cache_test: fix use-after-free in test_loading_cache_remove_leaves_no_old_entries_behind
We capture `key` by reference, but it is used in another continuation.

Capture it by value, and avoid the default capture specification.

Found by clang 15 + asan + aarch64.

Closes #11884
2022-11-03 17:23:40 +02:00
Avi Kivity
0c3967cf5e Merge 'scylla-gdb.py: improve scylla-fiber' from Botond Dénes
The main theme of this patchset is improving `scylla-fiber`, with some assorted unrelated improvements tagging along.
In lieu of explicit support in seastar for mapping continuation chains in memory (there is some, but it uses function calls), scylla fiber uses a quite crude method: it scans task objects for outbound references to other task objects to find waiter tasks, and scans inbound references from other tasks to find waited-on tasks. This works well for most objects, but there are some problematic ones:
* `seastar::thread_context`: the waited-on task (`seastar::(anonymous namespace)::thread_wake_task`) is allocated on the thread's stack which is not in the object itself. Scylla fiber now scans the stack bottom-up to find this task.
* `seastar::smp_message_queue::async_work_item`: the waited on task lives on another shard. Scylla fiber now digs out the remote shard from the work item and continues the search on the remote shard.
* `seastar::when_all_state`: the waited-on task is a member of the same object, tripping loop detection and terminating the search. Scylla fiber now uses the `_continuation` member explicitly to look for the next links.

Other minor improvements were also done, like including the shard of the task in the printout.
Example demonstrating all the new additions:
```
(gdb) scylla fiber 0x000060002d650200
Stopping because loop is detected: task 0x000061c00385fb60 was seen before.
[shard 28] #-13 (task*) 0x000061c00385fba0 0x00000000003b5b00 vtable for seastar::internal::when_all_state_component<seastar::future<void> > + 16
[shard 28] #-12 (task*) 0x000061c00385fb60 0x0000000000417010 vtable for seastar::internal::when_all_state<seastar::internal::identity_futures_tuple<seastar::future<void>, seastar::future<void> >, seastar::future<void>, seastar::future<void> > + 16
[shard 28] #-11 (task*) 0x000061c009f16420 0x0000000000419830 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_6futureISt5tupleIJNS4_IvEES6_EEE14discard_resultEvEUlDpOT_E_ZNS8_14then_impl_nrvoISC_S6_EET0_OT_EUlOS3_RSC_ONS_12future_stateIS7_EEE_S7_EE + 16
[shard 28] #-10 (task*) 0x000061c0098e9e00 0x0000000000447440 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>::run_and_dispose()::{lambda(auto:1)#1}, seastar::future<void>::then_wrapped_nrvo<void, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> >(seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-9 (task*) 0x000060000858dcd0 0x0000000000449d68 vtable for seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> + 16
[shard  0] #-8 (task*) 0x0000600050c39f60 0x00000000007abe98 vtable for seastar::parallel_for_each_state + 16
[shard  0] #-7 (task*) 0x000060000a59c1c0 0x0000000000449f60 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::sharded<cql_transport::cql_server>::stop()::{lambda(seastar::future<void>)#2}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#2}>({lambda(seastar::future<void>)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-6 (task*) 0x000060000a59c400 0x0000000000449ea0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, cql_transport::controller::do_stop_server()::{lambda(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&)#1}::operator()(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&) const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-5 (task*) 0x0000600009d86cc0 0x0000000000449c00 vtable for seastar::internal::do_with_state<std::tuple<std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > > >, seastar::future<void> > + 16
[shard  0] #-4 (task*) 0x00006000019ffe20 0x00000000007ab368 vtable for seastar::(anonymous namespace)::thread_wake_task + 16
[shard  0] #-3 (task*) 0x00006000085ad080 0x0000000000809e18 vtable for seastar::thread_context + 16
[shard  0] #-2 (task*) 0x0000600009c04100 0x00000000006067f8 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS6_E_clES7_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSC_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSD_DpOSG_EUlvE0_ZNS_6futureIvE14then_impl_nrvoIST_SV_EET0_SQ_EUlOS3_RST_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #-1 (task*) 0x000060000a59c080 0x0000000000606ae8 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS9_E_clESA_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSF_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSG_DpOSJ_EUlvE1_Lb0EEEZNS5_17then_wrapped_nrvoIS5_SX_EENSD_ISG_E4typeEOT0_EUlOS3_RSX_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #0  (task*) 0x000060002d650200 0x0000000000606378 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<service::storage_service::run_with_api_lock<service::storage_service::drain()::{lambda(service::storage_service&)#1}>(seastar::basic_sstring<char, unsigned int, 15u, true>, service::storage_service::drain()::{lambda(service::storage_service&)#1}&&)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(service::storage_service&)#1}>({lambda(service::storage_service&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(service::storage_service&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #1  (task*) 0x000060000bc40540 0x0000000000606d48 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_3smp9submit_toIZNS_7shardedIN7service15storage_serviceEE9invoke_onIZNSB_17run_with_api_lockIZNSB_5drainEvEUlRSB_E_EEDaNS_13basic_sstringIcjLj15ELb1EEEOT_EUlSF_E_JES5_EET1_jNS_21smp_submit_to_optionsESK_DpOT0_EUlvE_EENS_8futurizeINSt9result_ofIFSJ_vEE4typeEE4typeEjSN_SK_EUlvE_Lb0EEEZNS5_17then_wrapped_nrvoIS5_S10_EENSS_ISJ_E4typeEOT0_EUlOS3_RS10_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #2  (task*) 0x000060000332afc0 0x00000000006cb1c8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<seastar::json::json_return_type>, api::set_storage_service(api::http_context&, seastar::httpd::routes&)::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >) const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}, {lambda()#1}<seastar::json::json_return_type> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::json::json_return_type>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #3  (task*) 0x000060000a1af700 0x0000000000812208 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::function_handler(std::function<seastar::future<seastar::json::json_return_type> (std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)> const&)::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >) const::{lambda(seastar::json::json_return_type&&)#1}, seastar::future<seastar::json::json_return_type>::then_impl_nrvo<seastar::json::json_return_type&&, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > >(seastar::json::json_return_type&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, seastar::json::json_return_type&, seastar::future_state<seastar::json::json_return_type>&&)#1}, seastar::json::json_return_type> + 16
[shard  0] #4  (task*) 0x0000600009d86440 0x0000000000812228 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::handle(seastar::basic_sstring<char, unsigned int, 15u, true> const&, std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future>({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, {lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #5  (task*) 0x0000600009dba0c0 0x0000000000812f48 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::handle_exception<std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&>(std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&)::{lambda(auto:1&&)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_wrapped_nrvo<seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, {lambda(auto:1&&)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #6  (task*) 0x0000600026783ae0 0x00000000008118b0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<bool>, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}<bool> >({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<bool>&&, {lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #7  (task*) 0x000060000a4089c0 0x0000000000811790 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read_one()::{lambda()#1}::operator()()::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(std::default_delete<std::unique_ptr>)#1}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(bool)#2}, seastar::future<bool>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}, {lambda(std::default_delete<std::unique_ptr>)#1}<void> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&, seastar::future_state<bool>&&)#1}, bool> + 16
[shard  0] #8  (task*) 0x000060000a5b16e0 0x0000000000811430 vtable for seastar::internal::do_until_state<seastar::httpd::connection::read()::{lambda()#1}, seastar::httpd::connection::read()::{lambda()#2}> + 16
[shard  0] #9  (task*) 0x000060000aec1080 0x00000000008116d0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read()::{lambda(seastar::future<void>)#3}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#3}>({lambda(seastar::future<void>)#3}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#3}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #10 (task*) 0x000060000b7d2900 0x0000000000811950 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::httpd::connection::read()::{lambda()#4}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::httpd::connection::read()::{lambda()#4}>(seastar::httpd::connection::read()::{lambda()#4}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::httpd::connection::read()::{lambda()#4}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16

Found no further pointers to task objects.
If you think there should be more, run `scylla fiber 0x000060002d650200 --verbose` to learn more.
Note that continuation across user-created seastar::promise<> objects are not detected by scylla-fiber.
```
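The outbound-reference scan described above boils down to treating a task object's memory as a sequence of word-sized values and checking each against the set of known task addresses. A simplified Python sketch (the real logic lives in scylla-gdb.py and reads inferior memory through the gdb API; the function name here is illustrative):

```python
# Sketch of scanning a task object's raw bytes for pointers to other
# known task objects. Purely illustrative, not the scylla-gdb.py code.
import struct

def scan_for_task_refs(raw_bytes, known_task_addrs, word_size=8):
    """Return addresses of known tasks referenced by this object."""
    refs = []
    fmt = "<Q" if word_size == 8 else "<I"
    for off in range(0, len(raw_bytes) - word_size + 1, word_size):
        (value,) = struct.unpack_from(fmt, raw_bytes, off)
        if value in known_task_addrs:
            refs.append(value)
    return refs
```

Inbound references work the same way, scanning other candidate objects for the address of the task at hand.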

Closes #11822

* github.com:scylladb/scylladb:
  scylla-gdb.py: collection_element: add support for boost::intrusive::list
  scylla-gdb.py: optional_printer: eliminate infinite loop
  scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects
  scylla-gdb.py: scylla-fiber: reject self-references when probing pointers
  scylla-gdb.py: scylla-fiber: add starting task to known tasks
  scylla-gdb.py: scylla-fiber: add support for walking over when_all
  scylla-gdb.py: add when_all_state to task type whitelist
  scylla-gdb.py: scylla-fiber: also print shard of tasks
  scylla-gdb.py: scylla-fiber: unify task printing
  scylla-gdb.py: scylla fiber: add support for walking over shards
  scylla-gdb.py: scylla fiber: add support for walking over seastar threads
  scylla-gdb.py: scylla-ptr: keep current thread context
  scylla-gdb.py: improve scylla column_families
  scylla-gdb.py: scylla_sstables.filename(): fix generation formatting
  scylla-gdb.py: improve schema_ptr
  scylla-gdb.py: scylla memory: restore compatibility with <= 5.1
2022-11-03 13:52:31 +02:00
Kamil Braun
2049962e11 Fix version numbers in upgrade page title
Closes #11878
2022-11-03 10:06:25 +02:00
Takuya ASADA
45789004a3 install-dependencies.sh: update node_exporter to 1.4.0
To fix CVE-2022-24675, we need a binary compiled with golang >= 1.18.1.
The only released version compiled with golang >= 1.18.1 is node_exporter
1.4.0, so we need to update to it.

See scylladb/scylla-enterprise#2317

Closes #11400

[avi: regenerated frozen toolchain]

Closes #11879
2022-11-03 10:15:22 +04:00
Yaron Kaikov
20110bdab4 configure.py: remove un-used tar files creation
Starting from https://github.com/scylladb/scylla-pkg/pull/3035 we no
longer upload the old tar.gz artifacts to S3 or use them in downstream
jobs.

Hence, there is no point in building those tar.gz files anymore.

Closes #11865
2022-11-02 17:44:09 +02:00
Anna Stuchlik
d1f7cc99bc doc: fix the external links to the ScyllaDB University lesson about TTL
Closes #11876
2022-11-02 15:05:43 +02:00
Nadav Har'El
59fa8fe903 Merge 'doc: add the information about AArch64 support to Requirements' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/864

This PR:
- updates the introduction to add information about AArch64 and rewrite the content.
- replaces "Scylla" with "ScyllaDB".

Closes #11778

* github.com:scylladb/scylladb:
  Update docs/getting-started/system-requirements.rst
  doc: fix the link to the OS Support page
  doc: replace Scylla with ScyllaDB
  doc: update the info about supported architecture and rewrite the introduction
2022-11-02 11:18:20 +02:00
Anna Stuchlik
ea799ad8fd Update docs/getting-started/system-requirements.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-11-02 09:56:56 +01:00
Kamil Braun
db6cc035ed test/raft: raft_address_map_test: add replication test 2022-10-31 09:17:12 +01:00
Kamil Braun
7d84007fd5 service/raft: raft_address_map: replicate non-expiring entries to other shards
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` from `raft_address_map` just before sending
  the message. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
2022-10-31 09:17:12 +01:00
Kamil Braun
acacbad465 service/raft: raft_address_map: assert when entry is missing in drop_expired_entries 2022-10-31 09:17:12 +01:00
Kamil Braun
159bb32309 service/raft: turn raft_address_map into a service 2022-10-31 09:17:10 +01:00
Botond Dénes
63a90cfb6c scylla-gdb.py: collection_element: add support for boost::intrusive::list 2022-10-31 08:18:20 +02:00
Botond Dénes
2fa1864174 scylla-gdb.py: optional_printer: eliminate infinite loop
Currently, to_string() recursively calls itself for engaged optionals,
looping forever. Eliminate the recursion. Also, use the std_optional
wrapper instead of accessing std::optional internals directly.
2022-10-31 08:18:20 +02:00
Botond Dénes
77b2555a04 scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects
Scylla fiber uses a crude method of scanning inbound and outbound
references to/from other task objects of recognized types. This method
cannot detect user-instantiated promise<> objects. Add a note about this
to the printout, so users are aware of the limitation.
2022-10-31 08:18:20 +02:00
Botond Dénes
2276565a2e scylla-gdb.py: scylla-fiber: reject self-references when probing pointers
A self-reference is never the pointer we are looking for when searching
for other tasks referencing us. Reject such references outright when
scanning.
2022-10-31 08:18:20 +02:00
Botond Dénes
f4365dd7f5 scylla-gdb.py: scylla-fiber: add starting task to known tasks
We collect already seen tasks in a set to be able to detect perceived
task loops and stop when one is seen. Initialize this set with the
starting task, so if it forms a loop, we won't repeat it in the trace
before cutting the loop.
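The fix can be sketched as follows (illustrative Python; `next_task` is a hypothetical stand-in for following a waiter pointer):

```python
# Sketch of the loop-detection fix: the seen-set starts out containing
# the starting task, so a chain that loops back to its start stops
# immediately instead of repeating the loop once in the trace.

def walk_fiber(start, next_task):
    seen = {start}          # seed with the starting task (the fix)
    chain = [start]
    task = next_task(start)
    while task is not None:
        if task in seen:    # perceived task loop: stop here
            break
        seen.add(task)
        chain.append(task)
        task = next_task(task)
    return chain
```

Without the seeding, a chain `a -> b -> c -> a` would print `a` a second time before the loop is cut.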
2022-10-31 08:18:20 +02:00
Botond Dénes
48bbf2e467 scylla-gdb.py: scylla-fiber: add support for walking over when_all 2022-10-31 08:18:20 +02:00
Botond Dénes
cb8f02e24b scylla-gdb.py: add when_all_state to task type whitelist 2022-10-31 08:18:20 +02:00
Botond Dénes
62621abc44 scylla-gdb.py: scylla-fiber: also print shard of tasks
Now that scylla-fiber can cross shards, it is important to display the
shard each task in the chain lives on.
2022-10-31 08:18:19 +02:00
Botond Dénes
c21c80f711 scylla-gdb.py: scylla-fiber: unify task printing
Currently there are two loops and a separate line printing the starting
task, all duplicating the formatting logic. Define a method for it and
use it in all 3 places instead.
2022-10-31 08:18:19 +02:00
Botond Dénes
c103280bfd scylla-gdb.py: scylla fiber: add support for walking over shards
Currently, shard boundaries can be crossed in only one direction: when
looking for waiters on a task, but not in the other direction (looking
for waited-on tasks). This patch fixes that.
2022-10-31 08:18:19 +02:00
Botond Dénes
437f888ba0 scylla-gdb.py: scylla fiber: add support for walking over seastar threads
Currently, seastar threads end any attempt to follow waited-on futures.
Seastar threads need special handling because they allocate the wake-up
task on their stack. This patch adds that special handling.
2022-10-31 08:18:19 +02:00
Botond Dénes
fcc63965ed scylla-gdb.py: scylla-ptr: keep current thread context
scylla_ptr.analyze() switches to the thread the analyzed object lives
on, but forgets to switch back. This was very annoying as any commands
using it (which is a bunch of them) were prone to suddenly and
unexpectedly switching threads.
This patch makes sure that the original thread context is switched back
to after analyzing the pointer.
2022-10-31 08:18:19 +02:00
Botond Dénes
91516c1d68 scylla-gdb.py: improve scylla column_families
Rename to scylla tables. Less typing and more up-to-date.
By default it now only lists tables from the local shard. Added flag -a,
which brings back the old behaviour (lists on all shards).
Added -u (only list user tables) and -k (list tables of the provided
keyspace only) filtering options.
2022-10-31 08:18:19 +02:00
Botond Dénes
1d3d613b76 scylla-gdb.py: scylla_sstables.filename(): fix generation formatting
Generation was recently converted from an integer to an object. Update
the filename formatting, while keeping backward compatibility.
2022-10-31 08:18:19 +02:00
Botond Dénes
c869f54742 scylla-gdb.py: improve schema_ptr
Add __getitem__(), so members can be accessed.
Strip " from ks_name and cf_name.
Add is_system().
2022-10-31 08:18:19 +02:00
Botond Dénes
66832af233 scylla-gdb.py: scylla memory: restore compatibility with <= 5.1
Recent reworks around dirty memory manager broke backward compatibility
of the scylla memory command (and possibly others). This patch restores
it.
2022-10-31 08:18:19 +02:00
Pavel Emelyanov
7b193ab0a5 messaging_service: Deny putting INADD_ANY as preferred ip
Even though the previous patch makes scylla not gossip this as
internal_ip, an extra sanity check may still be useful. E.g., older
versions of scylla may still do it, or this address may be loaded from
system_keyspace.

refs: #11502

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
aa7a759ac9 messaging_service: Toss preferred ip cache management
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace, and move the connection reset there. This is mainly to
prepare for the next patch, but it also makes the code a bit shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
91b460f1c4 gossiping_property_file_snitch: Don't gossip INADDR_ANY preferred IP
Gossiping 0.0.0.0 as the preferred IP may break the peer, as it will
"interpret" this address as <myself>, which is not what the peer expects.
However, g.p.f.s. uses the --listen-address argument as the internal IP,
and it's not prohibited to configure it to be 0.0.0.0.

It's better not to gossip the INTERNAL_IP property at all if the listen
address is such.

fixes: #11502

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
99579bd186 gossiping_property_file_snitch: Make _listen_address optional
As preparation for the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:15:26 +03:00
Michał Radwański
36508bf5e9 serializer_impl: remove unneeded generic parameter
The input stream used in vector_deserializer doesn't need to be generic,
as only one implementation is used.
2022-10-24 17:21:38 +02:00
Anna Stuchlik
9f7536d549 doc: fix the link to the OS Support page 2022-10-13 15:36:51 +02:00
Anna Stuchlik
1fd1ce042a doc: replace Scylla with ScyllaDB 2022-10-13 15:21:46 +02:00
Anna Stuchlik
81ce7a88de doc: update the info about supported architecture and rewrite the introduction 2022-10-13 15:18:29 +02:00
Anna Stuchlik
3950a1cac8 doc: apply the feedback to improve clarity 2022-10-03 11:14:51 +02:00
Anna Stuchlik
46f0e99884 doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB 2022-09-23 11:46:15 +02:00
Anna Stuchlik
af2a85b191 doc: add the new page to the toctree 2022-09-23 11:37:38 +02:00
Anna Stuchlik
b034e2856e doc: add a troubleshooting article about the missing configuration files 2022-09-23 11:17:18 +02:00
Anna Stuchlik
260f85643d doc: specify the recommended AWS instance types 2022-08-08 14:35:54 +02:00
Anna Stuchlik
2c69a8f458 doc: replace the tables with a generic description of support for Im4gn and Is4gen instances 2022-08-08 14:17:59 +02:00
Anna Stuchlik
ceaf0c41bd doc: add support for AWS i4g instances 2022-08-05 17:18:44 +02:00
Anna Stuchlik
7711436577 doc: extend the list of supported CPUs 2022-08-05 16:55:40 +02:00
245 changed files with 8587 additions and 3527 deletions


@@ -0,0 +1,17 @@
name: "Docs / Amplify enhanced"
on: issue_comment
jobs:
build:
runs-on: ubuntu-latest
if: ${{ github.event.issue.pull_request }}
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Amplify enhanced
env:
TOKEN: ${{ secrets.GITHUB_TOKEN }}
uses: scylladb/sphinx-scylladb-theme/.github/actions/amplify-enhanced@master

.gitmodules vendored

@@ -6,9 +6,6 @@
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate
[submodule "abseil"]
path = abseil
url = ../abseil-cpp


@@ -34,6 +34,7 @@
#include "expressions.hh"
#include "conditions.hh"
#include "cql3/constants.hh"
#include "cql3/util.hh"
#include <optional>
#include "utils/overloaded_functor.hh"
#include <seastar/json/json_elements.hh>
@@ -927,9 +928,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
if (!range_key.empty() && range_key != view_hash_key && range_key != view_range_key) {
add_column(view_builder, range_key, attribute_definitions, column_kind::clustering_key);
}
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_hash_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -984,9 +986,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// Note above we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
// for virtual columns when we support Projection.
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_range_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -3642,7 +3645,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
if (exclusive_start_key) {
partition_key pk = pk_from_json(*exclusive_start_key, schema);
auto pos = position_in_partition(position_in_partition::partition_start_tag_t());
auto pos = position_in_partition::for_partition_start();
if (schema->clustering_key_size() > 0) {
pos = pos_from_json(*exclusive_start_key, schema);
}


@@ -279,7 +279,7 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
}
if (ck.is_empty()) {
return position_in_partition(position_in_partition::partition_start_tag_t());
return position_in_partition::for_partition_start();
}
return position_in_partition::for_key(std::move(ck));
}


@@ -8,6 +8,7 @@
#include <chrono>
#include <cstdint>
#include <exception>
#include <optional>
#include <seastar/core/sstring.hh>
#include <seastar/core/coroutine.hh>
@@ -17,6 +18,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include <boost/multiprecision/cpp_int.hpp>
#include "exceptions/exceptions.hh"
#include "gms/gossiper.hh"
#include "gms/inet_address.hh"
#include "inet_address_vectors.hh"
@@ -548,13 +550,26 @@ static future<> scan_table_ranges(
co_return;
}
auto units = co_await get_units(page_sem, 1);
// We don't to limit page size in number of rows because there is a
// builtin limit of the page's size in bytes. Setting this limit to 1
// is useful for debugging the paging code with moderate-size data.
// We don't need to limit page size in number of rows because there is
// a builtin limit of the page's size in bytes. Setting this limit to
// 1 is useful for debugging the paging code with moderate-size data.
uint32_t limit = std::numeric_limits<uint32_t>::max();
// FIXME: which timeout?
// FIXME: if read times out, need to retry it.
std::unique_ptr<cql3::result_set> rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
// Read a page, and if that times out, try again after a small sleep.
// If we didn't catch the timeout exception, it would cause the scan
// be aborted and only be restarted at the next scanning period.
std::unique_ptr<cql3::result_set> rs;
for (;;) {
try {
// FIXME: which timeout?
rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
break;
} catch(exceptions::read_timeout_exception&) {
tlogger.warn("expiration scanner read timed out, will retry: {}",
std::current_exception());
}
// If we didn't break out of this loop, add a minimal sleep
co_await seastar::sleep(1s);
}
auto rows = rs->rows();
auto meta = rs->get_metadata().get_names();
std::optional<unsigned> expiration_column;

amplify.yml Normal file

@@ -0,0 +1,15 @@
version: 1
applications:
- frontend:
phases:
build:
commands:
- make setupenv
- make dirhtml
artifacts:
baseDirectory: _build/dirhtml
files:
- '**/*'
cache:
paths: []
appRoot: docs


@@ -49,6 +49,14 @@
extern logging::logger apilog;
namespace std {
std::ostream& operator<<(std::ostream& os, const api::table_info& ti) {
return os << "table{name=" << ti.name << ", id=" << ti.id << "}";
}
} // namespace std
namespace api {
const locator::token_metadata& http_context::get_token_metadata() {
@@ -100,6 +108,55 @@ std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, con
return parse_tables(ks_name, ctx, it->second);
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, sstring value) {
std::vector<table_info> res;
try {
if (value.empty()) {
const auto& cf_meta_data = ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data();
res.reserve(cf_meta_data.size());
for (const auto& [name, schema] : cf_meta_data) {
res.emplace_back(table_info{name, schema->id()});
}
} else {
std::vector<sstring> names = split(value, ",");
res.reserve(names.size());
const auto& db = ctx.db.local();
for (const auto& table_name : names) {
res.emplace_back(table_info{table_name, db.find_uuid(ks_name, table_name)});
}
}
} catch (const replica::no_such_keyspace& e) {
throw bad_param_exception(e.what());
} catch (const replica::no_such_column_family& e) {
throw bad_param_exception(e.what());
}
return res;
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
return parse_table_infos(ks_name, ctx, it != query_params.end() ? it->second : "");
}
// Run on all tables, skipping dropped tables
future<> run_on_existing_tables(sstring op, replica::database& db, std::string_view keyspace, const std::vector<table_info> local_tables, std::function<future<> (replica::table&)> func) {
std::exception_ptr ex;
for (const auto& ti : local_tables) {
apilog.debug("Starting {} on {}.{}", op, keyspace, ti);
try {
co_await func(db.find_column_family(ti.id));
} catch (const replica::no_such_column_family& e) {
apilog.warn("Skipping {} of {}.{}: {}", op, keyspace, ti, e.what());
} catch (...) {
ex = std::current_exception();
apilog.error("Failed {} of {}.{}: {}", op, keyspace, ti, ex);
}
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
}
}
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
ss::token_range r;
r.start_token = d._start_token;
@@ -118,16 +175,13 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<sstring>)>;
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<table_info>)>;
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return f(ctx, std::move(req), std::move(keyspace), std::move(column_families));
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
@@ -609,93 +663,112 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<table_id>>(column_families | boost::adaptors::transformed([&] (auto& cf_name) {
return db.find_uuid(keyspace, cf_name);
}));
// major compact smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const table_id& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.debug("force_keyspace_compaction: keyspace={} tables={}", keyspace, table_infos);
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto local_tables = table_infos;
// major compact smaller tables first, to increase chances of success if low on space.
std::ranges::sort(local_tables, std::less<>(), [&] (const table_info& ti) {
try {
return db.find_column_family(ti.id).get_stats().live_disk_space_used;
} catch (const replica::no_such_column_family& e) {
return int64_t(-1);
}
});
co_await run_on_existing_tables("force_keyspace_compaction", db, keyspace, local_tables, [] (replica::table& t) {
return t.compact_all_sstables();
});
});
// as a table can be dropped during loop below, let's find it before issuing major compaction request.
for (auto& id : table_ids) {
co_await db.find_column_family(id).compact_all_sstables();
}
co_return;
}).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
} catch (...) {
apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json_void();
});
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
return ss.local().is_cleanup_allowed(keyspace).then([&ctx, keyspace,
column_families = std::move(column_families)] (bool is_cleanup_allowed) mutable {
if (!is_cleanup_allowed) {
return make_exception_future<json::json_return_type>(
std::runtime_error("Can not perform cleanup operation when topology changes"));
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<table_id>>(column_families | boost::adaptors::transformed([&] (auto& table_name) {
return db.find_uuid(keyspace, table_name);
}));
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto local_tables = table_infos;
// cleanup smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const table_id& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
std::ranges::sort(local_tables, std::less<>(), [&] (const table_info& ti) {
try {
return db.find_column_family(ti.id).get_stats().live_disk_space_used;
} catch (const replica::no_such_column_family& e) {
return int64_t(-1);
}
});
auto& cm = db.get_compaction_manager();
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
// as a table can be dropped during loop below, let's find it before issuing the cleanup request.
for (auto& id : table_ids) {
replica::table& t = db.find_column_family(id);
co_await cm.perform_cleanup(owned_ranges_ptr, t.as_table_state());
}
co_return;
}).then([]{
return make_ready_future<json::json_return_type>(0);
co_await run_on_existing_tables("force_keyspace_cleanup", db, keyspace, local_tables, [&] (replica::table& t) {
return t.perform_cleanup_compaction(owned_ranges_ptr);
});
});
});
} catch (...) {
apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(0);
});
ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> tables) -> future<json::json_return_type> {
co_return co_await ctx.db.map_reduce0([&keyspace, &tables] (replica::database& db) -> future<bool> {
bool needed = false;
for (const auto& table : tables) {
auto& t = db.find_column_family(keyspace, table);
needed |= co_await t.perform_offstrategy_compaction();
}
co_return needed;
}, false, std::plus<bool>());
ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);
bool res = false;
try {
res = co_await ctx.db.map_reduce0([&] (replica::database& db) -> future<bool> {
bool needed = false;
co_await run_on_existing_tables("perform_keyspace_offstrategy_compaction", db, keyspace, table_infos, [&needed] (replica::table& t) -> future<> {
needed |= co_await t.perform_offstrategy_compaction();
});
co_return needed;
}, false, std::plus<bool>());
} catch (...) {
apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(res);
}));
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
return ctx.db.invoke_on_all([=] (replica::database& db) {
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_upgrade(owned_ranges_ptr, cf.as_table_state(), exclude_current_version);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
co_await run_on_existing_tables("upgrade_sstables", db, keyspace, table_infos, [&] (replica::table& t) {
return t.get_compaction_manager().perform_sstable_upgrade(owned_ranges_ptr, t.as_table_state(), exclude_current_version);
});
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
});
} catch (...) {
apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(0);
}));
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto& db = ctx.db;
if (column_families.empty()) {
co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);
@@ -707,6 +780,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::decommission.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("decommission");
return ss.local().decommission().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -722,6 +796,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::remove_node.set(r, [&ss](std::unique_ptr<request> req) {
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
for (std::string n : ignore_nodes_strs) {
try {
@@ -797,6 +872,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::drain.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("drain");
return ss.local().drain().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -820,12 +896,14 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::stop_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("stop_gossiping");
return ss.local().stop_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("start_gossiping");
return ss.local().start_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -928,6 +1006,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::rebuild.set(r, [&ss](std::unique_ptr<request> req) {
auto source_dc = req->get_query_param("source_dc");
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -964,6 +1043,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& sp = service::get_storage_proxy();
auto& fs = sp.local().features();
apilog.info("reset_local_schema");
return db::schema_tables::recalculate_schema_version(sys_ks, sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -971,6 +1051,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
auto probability = req->get_query_param("probability");
apilog.info("set_trace_probability: probability={}", probability);
return futurize_invoke([probability] {
double real_prob = std::stod(probability.c_str());
return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {
@@ -1008,6 +1089,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
apilog.info("set_slow_query: enable={} ttl={} threshold={} fast={}", enable, ttl, threshold, fast);
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
if (threshold != "") {
@@ -1034,6 +1116,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
@@ -1041,6 +1124,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
@@ -1366,7 +1450,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<request> req) {
ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto rp = req_params({
{"keyspace", {mandatory::yes}},
{"cf", {""}},
@@ -1402,10 +1487,9 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
}
}
auto f = make_ready_future<>();
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
co_await coroutine::parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
// We always pass here db::snapshot_ctl::snap_views::no since:
// 1. When scrubbing particular tables, there's no need to auto-snapshot their views.
// 2. When scrubbing the whole keyspace, column_families will contain both base tables and views.
@@ -1434,28 +1518,25 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
return stats;
};
return f.then([&ctx, keyspace, column_families, opts, &reduce_compaction_stats] {
return ctx.db.map_reduce0([=] (replica::database& db) {
return map_reduce(column_families, [=, &db] (sstring cfname) {
try {
auto opt_stats = co_await db.map_reduce0([&] (replica::database& db) {
return map_reduce(column_families, [&] (sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_scrub(cf.as_table_state(), opts);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}).then_wrapped([] (auto f) {
if (f.failed()) {
auto ex = f.get_exception();
if (try_catch<sstables::compaction_aborted_exception>(ex)) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::aborted));
} else {
return make_exception_future<json::json_return_type>(std::move(ex));
}
} else if (f.get()->validation_errors) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::validation_errors));
} else {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::successful));
if (opt_stats && opt_stats->validation_errors) {
co_return json::json_return_type(static_cast<int>(scrub_status::validation_errors));
}
});
} catch (const sstables::compaction_aborted_exception&) {
co_return json::json_return_type(static_cast<int>(scrub_status::aborted));
} catch (...) {
apilog.error("scrub keyspace={} tables={} failed: {}", keyspace, column_families, std::current_exception());
throw;
}
co_return json::json_return_type(static_cast<int>(scrub_status::successful));
});
}


@@ -8,6 +8,8 @@
#pragma once
#include <iostream>
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
@@ -41,8 +43,22 @@ sstring validate_keyspace(http_context& ctx, const parameters& param);
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
// Returns an empty vector if no parameter was found.
// If the parameter is found and empty, returns a list of all table names in the keyspace.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
struct table_info {
sstring name;
table_id id;
};
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
// Returns a vector of all table infos given by the parameter, or
// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ls);
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, routes& r);
@@ -58,4 +74,10 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}
} // namespace api
namespace std {
std::ostream& operator<<(std::ostream& os, const api::table_info& ti);
} // namespace std


@@ -99,7 +99,7 @@ void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg) {
tmt::get_and_update_ttl.set(r, [&ctx, &cfg] (std::unique_ptr<request> req) -> future<json::json_return_type> {
uint32_t ttl = cfg.task_ttl_seconds();
cfg.task_ttl_seconds.set(boost::lexical_cast<uint32_t>(req->query_parameters["ttl"]));
co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);
co_return json::json_return_type(ttl);
});
}


@@ -28,6 +28,7 @@
#include <seastar/util/closeable.hh>
#include <seastar/core/shared_ptr.hh>
#include "dht/i_partitioner.hh"
#include "sstables/sstables.hh"
#include "sstables/sstable_writer.hh"
#include "sstables/progress_monitor.hh"
@@ -41,6 +42,7 @@
#include "mutation_compactor.hh"
#include "leveled_manifest.hh"
#include "dht/token.hh"
#include "dht/partition_filter.hh"
#include "mutation_writer/shard_based_splitting_writer.hh"
#include "mutation_writer/partition_based_splitting_writer.hh"
#include "mutation_source_metadata.hh"
@@ -220,13 +222,13 @@ public:
~compaction_write_monitor() {
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
}
virtual void on_write_started(const sstables::writer_offset_tracker& tracker) override {
_tracker = &tracker;
_table_s.get_compaction_strategy().get_backlog_tracker().register_partially_written_sstable(_sst, *this);
_table_s.get_backlog_tracker().register_partially_written_sstable(_sst, *this);
}
virtual void on_data_write_completed() override {
@@ -351,7 +353,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
public:
virtual void on_read_started(const sstables::reader_position_tracker& tracker) override {
_tracker = &tracker;
_table_s.get_compaction_strategy().get_backlog_tracker().register_compacting_sstable(_sst, *this);
_table_s.get_backlog_tracker().register_compacting_sstable(_sst, *this);
}
virtual void on_read_completed() override {
@@ -370,7 +372,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
void remove_sstable() {
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
_sst = {};
}
@@ -382,7 +384,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
// We failed to finish handling this SSTable, so we have to update the backlog_tracker
// about it.
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
}
@@ -948,7 +950,7 @@ void compacted_fragments_writer::consume_new_partition(const dht::decorated_key&
.dk = dk,
.tombstone = tombstone(),
.current_emitted_tombstone = tombstone(),
.last_pos = position_in_partition(position_in_partition::partition_start_tag_t()),
.last_pos = position_in_partition::for_partition_start(),
.is_splitting_partition = false
};
do_consume_new_partition(dk);
@@ -1173,30 +1175,8 @@ private:
};
class cleanup_compaction final : public regular_compaction {
class incremental_owned_ranges_checker {
const dht::token_range_vector& _sorted_owned_ranges;
mutable dht::token_range_vector::const_iterator _it;
public:
incremental_owned_ranges_checker(const dht::token_range_vector& sorted_owned_ranges)
: _sorted_owned_ranges(sorted_owned_ranges)
, _it(_sorted_owned_ranges.begin()) {
}
// Must be called with increasing token values.
bool belongs_to_current_node(const dht::token& t) const {
// While token T is after a range Rn, advance the iterator.
// The iterator stops at a range that either overlaps with T (if T belongs to this node),
// or lies entirely after T (if T doesn't belong to this node).
while (_it != _sorted_owned_ranges.end() && _it->after(t, dht::token_comparator())) {
_it++;
}
return _it != _sorted_owned_ranges.end() && _it->contains(t, dht::token_comparator());
}
};
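The helper removed above (its replacement lives in `dht::incremental_owned_ranges_checker`) amortizes membership checks over sorted, non-overlapping ranges by only ever advancing its iterator, so a full scan of increasing tokens costs O(ranges + queries). A minimal standalone sketch of the same idea over closed integer intervals (names hypothetical, not the Scylla types):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Ranges are sorted, non-overlapping, closed intervals; queries must arrive
// in increasing order, so the iterator only moves forward.
class incremental_checker {
    const std::vector<std::pair<int, int>>& _ranges;
    std::vector<std::pair<int, int>>::const_iterator _it;
public:
    explicit incremental_checker(const std::vector<std::pair<int, int>>& ranges)
        : _ranges(ranges), _it(_ranges.begin()) {}

    // Must be called with increasing values of t.
    bool belongs(int t) {
        // Skip ranges that end before t; stop at the first range that
        // either contains t or starts after it.
        while (_it != _ranges.end() && _it->second < t) {
            ++_it;
        }
        return _it != _ranges.end() && _it->first <= t && t <= _it->second;
    }
};
```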
owned_ranges_ptr _owned_ranges;
incremental_owned_ranges_checker _owned_ranges_checker;
mutable dht::incremental_owned_ranges_checker _owned_ranges_checker;
private:
// Called in a seastar thread
dht::partition_range_vector
@@ -1209,21 +1189,8 @@ private:
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}));
// optimize set of potentially overlapping ranges by deoverlapping them.
non_owned_ranges = dht::partition_range::deoverlap(std::move(non_owned_ranges), dht::ring_position_comparator(*_schema));
// subtract *each* owned range from the partition range of *each* sstable,
// such that we'll be left only with a set of non-owned ranges.
for (auto& owned_range : owned_ranges) {
dht::partition_range_vector new_non_owned_ranges;
for (auto& non_owned_range : non_owned_ranges) {
auto ret = non_owned_range.subtract(owned_range, dht::ring_position_comparator(*_schema));
new_non_owned_ranges.insert(new_non_owned_ranges.end(), ret.begin(), ret.end());
seastar::thread::maybe_yield();
}
non_owned_ranges = std::move(new_non_owned_ranges);
}
return non_owned_ranges;
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
}
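The nested loop being replaced computes the non-owned ranges by subtracting every owned range from every sstable range, yielding between iterations; the patch delegates that to `dht::subtract_ranges()`. A sketch of the same subtraction over half-open integer intervals (toy types, not the Scylla API):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using interval = std::pair<int, int>; // half-open [first, second)

// Subtract one interval from another; the result is 0, 1 or 2 pieces.
static std::vector<interval> subtract_one(interval from, interval sub) {
    std::vector<interval> out;
    if (sub.second <= from.first || sub.first >= from.second) {
        out.push_back(from);                      // no overlap
        return out;
    }
    if (from.first < sub.first) {
        out.push_back({from.first, sub.first});   // piece before sub
    }
    if (sub.second < from.second) {
        out.push_back({sub.second, from.second}); // piece after sub
    }
    return out;
}

// Subtract every interval in `owned` from every interval in `ranges`,
// mirroring the nested loop shape shown above.
static std::vector<interval> subtract_all(std::vector<interval> ranges,
                                          const std::vector<interval>& owned) {
    for (const auto& o : owned) {
        std::vector<interval> next;
        for (const auto& r : ranges) {
            auto pieces = subtract_one(r, o);
            next.insert(next.end(), pieces.begin(), pieces.end());
        }
        ranges = std::move(next);
    }
    return ranges;
}
```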
protected:
virtual compaction_completion_desc


@@ -80,8 +80,10 @@ struct compaction_data {
}
void stop(sstring reason) {
stop_requested = std::move(reason);
abort.request_abort();
if (!abort.abort_requested()) {
stop_requested = std::move(reason);
abort.request_abort();
}
}
};
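The change guards `stop()` so that only the first call records a reason and requests the abort; repeated calls no longer overwrite `stop_requested`. A toy model of the guarded pattern (simplified types, no Seastar):

```cpp
#include <cassert>
#include <string>

// Only the first stop() call records a reason and flips the abort flag;
// later calls are no-ops, preserving the original stop reason.
struct stopper {
    std::string stop_requested;
    bool aborted = false;

    void stop(std::string reason) {
        if (!aborted) {
            stop_requested = std::move(reason);
            aborted = true;
        }
    }
};
```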


@@ -66,7 +66,8 @@ public:
};
compaction_backlog_tracker(std::unique_ptr<impl> impl) : _impl(std::move(impl)) {}
compaction_backlog_tracker(compaction_backlog_tracker&&) = default;
compaction_backlog_tracker(compaction_backlog_tracker&&);
compaction_backlog_tracker& operator=(compaction_backlog_tracker&&) noexcept;
compaction_backlog_tracker(const compaction_backlog_tracker&) = delete;
~compaction_backlog_tracker();
@@ -74,7 +75,7 @@ public:
void replace_sstables(const std::vector<sstables::shared_sstable>& old_ssts, const std::vector<sstables::shared_sstable>& new_ssts);
void register_partially_written_sstable(sstables::shared_sstable sst, backlog_write_progress_manager& wp);
void register_compacting_sstable(sstables::shared_sstable sst, backlog_read_progress_manager& rp);
void transfer_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges = true);
void copy_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges = true) const;
void revert_charges(sstables::shared_sstable sst);
void disable() {


@@ -1097,7 +1097,12 @@ private:
compaction::table_state& t = *_compacting_table;
const auto& maintenance_sstables = t.maintenance_sstable_set();
const auto old_sstables = boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_sstables.all());
// Filter out sstables that require view building, to avoid a race between off-strategy
// and view building. Refs: #11882
const auto old_sstables = boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_sstables.all()
| boost::adaptors::filtered([] (const sstables::shared_sstable& sst) {
return !sst->requires_view_building();
}));
std::vector<sstables::shared_sstable> reshape_candidates = old_sstables;
std::vector<sstables::shared_sstable> sstables_to_remove;
std::unordered_set<sstables::shared_sstable> new_unused_sstables;
@@ -1470,10 +1475,8 @@ private:
bool needs_cleanup(const sstables::shared_sstable& sst,
const dht::token_range_vector& sorted_owned_ranges,
schema_ptr s) {
auto first = sst->get_first_partition_key();
auto last = sst->get_last_partition_key();
auto first_token = dht::get_token(*s, first);
auto last_token = dht::get_token(*s, last);
auto first_token = sst->get_first_decorated_key().token();
auto last_token = sst->get_last_decorated_key().token();
dht::token_range sst_token_range = dht::token_range::make(first_token, last_token);
auto r = std::lower_bound(sorted_owned_ranges.begin(), sorted_owned_ranges.end(), first_token,
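`needs_cleanup()` now compares tokens taken directly from the decorated keys and uses `std::lower_bound` over the sorted owned ranges to find the first candidate range. A sketch of that containment check over closed integer ranges (hypothetical names, not the Scylla types):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using range = std::pair<int, int>; // closed [first, second]

// Returns true when [first, last] is NOT fully contained in a single owned
// range, i.e. the sstable still holds keys this node does not own.
static bool needs_cleanup(int first, int last, const std::vector<range>& sorted_owned) {
    // First owned range whose upper bound is >= first.
    auto r = std::lower_bound(sorted_owned.begin(), sorted_owned.end(), first,
            [] (const range& rng, int token) { return rng.second < token; });
    return r == sorted_owned.end() || first < r->first || last > r->second;
}
```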
@@ -1573,8 +1576,13 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst
}, can_purge_tombstones::no);
}
compaction_manager::compaction_state::compaction_state(table_state& t)
: backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())
{
}
void compaction_manager::add(compaction::table_state& t) {
auto [_, inserted] = _compaction_state.insert({&t, compaction_state{}});
auto [_, inserted] = _compaction_state.insert({&t, compaction_state(t)});
if (!inserted) {
auto s = t.schema();
on_internal_error(cmlog, format("compaction_state for table {}.{} [{}] already exists", s->ks_name(), s->cf_name(), fmt::ptr(&t)));
@@ -1582,22 +1590,21 @@ void compaction_manager::add(compaction::table_state& t) {
}
future<> compaction_manager::remove(compaction::table_state& t) noexcept {
auto handle = _compaction_state.extract(&t);
auto& c_state = get_compaction_state(&t);
if (!handle.empty()) {
auto& c_state = handle.mapped();
// We need to guarantee that a task being stopped will not retry to compact
// a table being removed.
// The requirement above is provided by stop_ongoing_compactions().
_postponed.erase(&t);
// We need to guarantee that a task being stopped will not retry to compact
// a table being removed.
// The requirement above is provided by stop_ongoing_compactions().
_postponed.erase(&t);
// Wait for all compaction tasks running under gate to terminate
// and prevent new tasks from entering the gate.
co_await seastar::when_all_succeed(stop_ongoing_compactions("table removal", &t), c_state.gate.close()).discard_result();
// Wait for the termination of an ongoing compaction on table T, if any.
co_await stop_ongoing_compactions("table removal", &t);
c_state.backlog_tracker.disable();
_compaction_state.erase(&t);
// Wait for all functions running under gate to terminate.
co_await c_state.gate.close();
}
#ifdef DEBUG
auto found = false;
sstring msg;
@@ -1756,7 +1763,7 @@ void compaction_backlog_tracker::register_compacting_sstable(sstables::shared_ss
}
}
void compaction_backlog_tracker::transfer_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges) {
void compaction_backlog_tracker::copy_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges) const {
for (auto&& w : _ongoing_writes) {
new_bt.register_partially_written_sstable(w.first, *w.second);
}
@@ -1766,8 +1773,6 @@ void compaction_backlog_tracker::transfer_ongoing_charges(compaction_backlog_tra
new_bt.register_compacting_sstable(w.first, *w.second);
}
}
_ongoing_writes = {};
_ongoing_compactions = {};
}
void compaction_backlog_tracker::revert_charges(sstables::shared_sstable sst) {
@@ -1775,6 +1780,26 @@ void compaction_backlog_tracker::revert_charges(sstables::shared_sstable sst) {
_ongoing_compactions.erase(sst);
}
compaction_backlog_tracker::compaction_backlog_tracker(compaction_backlog_tracker&& other)
: _impl(std::move(other._impl))
, _ongoing_writes(std::move(other._ongoing_writes))
, _ongoing_compactions(std::move(other._ongoing_compactions))
, _manager(std::exchange(other._manager, nullptr)) {
}
compaction_backlog_tracker&
compaction_backlog_tracker::operator=(compaction_backlog_tracker&& x) noexcept {
if (this != &x) {
if (auto manager = std::exchange(_manager, x._manager)) {
manager->remove_backlog_tracker(this);
}
_impl = std::move(x._impl);
_ongoing_writes = std::move(x._ongoing_writes);
_ongoing_compactions = std::move(x._ongoing_compactions);
}
return *this;
}
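The new move operations hand over the manager registration with `std::exchange`: the move constructor steals the source's manager pointer, and move-assignment first deregisters the tracker being overwritten. A self-contained toy of that ownership dance (all names hypothetical; the toy also re-points the manager's set at the new address so it stays consistent on its own):

```cpp
#include <cassert>
#include <set>
#include <utility>

struct toy_manager;

// Toy tracker: registered with at most one manager; moves must keep the
// manager's view consistent (no dangling pointers, no double registration).
struct toy_tracker {
    toy_manager* _manager = nullptr;

    toy_tracker() = default;
    toy_tracker(toy_tracker&& other)
        : _manager(std::exchange(other._manager, nullptr)) {}
    toy_tracker& operator=(toy_tracker&& other);
    ~toy_tracker();
};

struct toy_manager {
    std::set<toy_tracker*> trackers;
    void add(toy_tracker& t) { trackers.insert(&t); t._manager = this; }
    void remove(toy_tracker* t) { trackers.erase(t); }
};

toy_tracker& toy_tracker::operator=(toy_tracker&& other) {
    if (this != &other) {
        // Deregister the tracker being overwritten, then steal the source's
        // manager pointer, as in the patched operator=.
        if (auto* old = std::exchange(_manager, std::exchange(other._manager, nullptr))) {
            old->remove(this);
        }
        if (_manager) {
            // Extra bookkeeping so this toy stays self-consistent.
            _manager->remove(&other);
            _manager->trackers.insert(this);
        }
    }
    return *this;
}

toy_tracker::~toy_tracker() {
    if (_manager) {
        _manager->remove(this);
    }
}
```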
compaction_backlog_tracker::~compaction_backlog_tracker() {
if (_manager) {
_manager->remove_backlog_tracker(this);
@@ -1812,3 +1837,14 @@ compaction_backlog_manager::~compaction_backlog_manager() {
tracker->_manager = nullptr;
}
}
void compaction_manager::register_backlog_tracker(compaction::table_state& t, compaction_backlog_tracker new_backlog_tracker) {
auto& cs = get_compaction_state(&t);
cs.backlog_tracker = std::move(new_backlog_tracker);
register_backlog_tracker(cs.backlog_tracker);
}
compaction_backlog_tracker& compaction_manager::get_backlog_tracker(compaction::table_state& t) {
auto& cs = get_compaction_state(&t);
return cs.backlog_tracker;
}


@@ -83,7 +83,9 @@ private:
// Signaled whenever a compaction task completes.
condition_variable compaction_done;
compaction_state() = default;
compaction_backlog_tracker backlog_tracker;
explicit compaction_state(table_state& t);
compaction_state(compaction_state&&) = default;
~compaction_state();
@@ -524,6 +526,9 @@ public:
void register_backlog_tracker(compaction_backlog_tracker& backlog_tracker) {
_backlog_manager.register_backlog_tracker(backlog_tracker);
}
void register_backlog_tracker(compaction::table_state& t, compaction_backlog_tracker new_backlog_tracker);
compaction_backlog_tracker& get_backlog_tracker(compaction::table_state& t);
static sstables::compaction_data create_compaction_data();


@@ -427,14 +427,6 @@ struct null_backlog_tracker final : public compaction_backlog_tracker::impl {
virtual void replace_sstables(std::vector<sstables::shared_sstable> old_ssts, std::vector<sstables::shared_sstable> new_ssts) override {}
};
// Just so that if we have more than one CF with NullStrategy, we don't create a lot
// of objects to iterate over for no reason
// Still thread local because of make_unique. But this will disappear soon
static thread_local compaction_backlog_tracker null_backlog_tracker(std::make_unique<null_backlog_tracker>());
compaction_backlog_tracker& get_null_backlog_tracker() {
return null_backlog_tracker;
}
//
// Null compaction strategy is the default compaction strategy.
// As the name implies, it does nothing.
@@ -453,8 +445,8 @@ public:
return compaction_strategy_type::null;
}
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return get_null_backlog_tracker();
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override {
return std::make_unique<null_backlog_tracker>();
}
};
@@ -462,11 +454,14 @@ leveled_compaction_strategy::leveled_compaction_strategy(const std::map<sstring,
: compaction_strategy_impl(options)
, _max_sstable_size_in_mb(calculate_max_sstable_size_in_mb(compaction_strategy_impl::get_value(options, SSTABLE_SIZE_OPTION)))
, _stcs_options(options)
, _backlog_tracker(std::make_unique<leveled_compaction_backlog_tracker>(_max_sstable_size_in_mb, _stcs_options))
{
_compaction_counter.resize(leveled_manifest::MAX_LEVELS);
}
std::unique_ptr<compaction_backlog_tracker::impl> leveled_compaction_strategy::make_backlog_tracker() {
return std::make_unique<leveled_compaction_backlog_tracker>(_max_sstable_size_in_mb, _stcs_options);
}
int32_t
leveled_compaction_strategy::calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const {
using namespace cql3::statements;
@@ -486,7 +481,6 @@ time_window_compaction_strategy::time_window_compaction_strategy(const std::map<
: compaction_strategy_impl(options)
, _options(options)
, _stcs_options(options)
, _backlog_tracker(std::make_unique<time_window_backlog_tracker>(_options, _stcs_options))
{
if (!options.contains(TOMBSTONE_COMPACTION_INTERVAL_OPTION) && !options.contains(TOMBSTONE_THRESHOLD_OPTION)) {
_disable_tombstone_compaction = true;
@@ -497,6 +491,10 @@ time_window_compaction_strategy::time_window_compaction_strategy(const std::map<
_use_clustering_key_filter = true;
}
std::unique_ptr<compaction_backlog_tracker::impl> time_window_compaction_strategy::make_backlog_tracker() {
return std::make_unique<time_window_backlog_tracker>(_options, _stcs_options);
}
} // namespace sstables
std::vector<sstables::shared_sstable>
@@ -640,7 +638,6 @@ namespace sstables {
date_tiered_compaction_strategy::date_tiered_compaction_strategy(const std::map<sstring, sstring>& options)
: compaction_strategy_impl(options)
, _manifest(options)
, _backlog_tracker(std::make_unique<unimplemented_backlog_tracker>())
{
clogger.warn("DateTieredCompactionStrategy is deprecated. Usually cases for which it is used are better handled by TimeWindowCompactionStrategy."
" Please change your compaction strategy to TWCS as DTCS will be retired in the near future");
@@ -685,17 +682,23 @@ compaction_descriptor date_tiered_compaction_strategy::get_sstables_for_compacti
return sstables::compaction_descriptor({ *it }, service::get_local_compaction_priority());
}
std::unique_ptr<compaction_backlog_tracker::impl> date_tiered_compaction_strategy::make_backlog_tracker() {
return std::make_unique<unimplemented_backlog_tracker>();
}
size_tiered_compaction_strategy::size_tiered_compaction_strategy(const std::map<sstring, sstring>& options)
: compaction_strategy_impl(options)
, _options(options)
, _backlog_tracker(std::make_unique<size_tiered_backlog_tracker>(_options))
{}
size_tiered_compaction_strategy::size_tiered_compaction_strategy(const size_tiered_compaction_strategy_options& options)
: _options(options)
, _backlog_tracker(std::make_unique<size_tiered_backlog_tracker>(_options))
{}
std::unique_ptr<compaction_backlog_tracker::impl> size_tiered_compaction_strategy::make_backlog_tracker() {
return std::make_unique<size_tiered_backlog_tracker>(_options);
}
compaction_strategy::compaction_strategy(::shared_ptr<compaction_strategy_impl> impl)
: _compaction_strategy_impl(std::move(impl)) {}
compaction_strategy::compaction_strategy() = default;
@@ -736,8 +739,8 @@ bool compaction_strategy::use_clustering_key_filter() const {
return _compaction_strategy_impl->use_clustering_key_filter();
}
compaction_backlog_tracker& compaction_strategy::get_backlog_tracker() {
return _compaction_strategy_impl->get_backlog_tracker();
compaction_backlog_tracker compaction_strategy::make_backlog_tracker() {
return compaction_backlog_tracker(_compaction_strategy_impl->make_backlog_tracker());
}
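This hunk is the pivot of the series: strategies stop owning a shared `compaction_backlog_tracker` reference and instead expose `make_backlog_tracker()`, a factory returning a fresh impl that the wrapper owns via `unique_ptr`, so each table gets its own tracker. A toy sketch of that factory/pimpl shape (hypothetical names):

```cpp
#include <cassert>
#include <memory>

// The tracker wraps a strategy-provided impl; each call to the factory
// yields an independent instance instead of a shared singleton.
struct tracker {
    struct impl {
        virtual ~impl() = default;
        virtual double backlog() const = 0;
    };
    explicit tracker(std::unique_ptr<impl> i) : _impl(std::move(i)) {}
    double backlog() const { return _impl->backlog(); }
private:
    std::unique_ptr<impl> _impl;
};

struct null_impl final : tracker::impl {
    double backlog() const override { return 0.0; }
};

struct strategy {
    virtual ~strategy() = default;
    // Factory replacing the old get_backlog_tracker() reference accessor.
    virtual std::unique_ptr<tracker::impl> make_backlog_tracker() {
        return std::make_unique<null_impl>();
    }
};
```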
sstables::compaction_descriptor

View File

@@ -106,7 +106,7 @@ public:
sstable_set make_sstable_set(schema_ptr schema) const;
compaction_backlog_tracker& get_backlog_tracker();
compaction_backlog_tracker make_backlog_tracker();
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);


@@ -22,8 +22,6 @@ class strategy_control;
namespace sstables {
compaction_backlog_tracker& get_unimplemented_backlog_tracker();
class sstable_set_impl;
class resharding_descriptor;
@@ -70,7 +68,7 @@ public:
// droppable tombstone histogram and gc_before.
bool worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const tombstone_gc_state& gc_state);
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() = 0;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);


@@ -259,7 +259,6 @@ namespace sstables {
class date_tiered_compaction_strategy : public compaction_strategy_impl {
date_tiered_manifest _manifest;
compaction_backlog_tracker _backlog_tracker;
public:
date_tiered_compaction_strategy(const std::map<sstring, sstring>& options);
virtual compaction_descriptor get_sstables_for_compaction(table_state& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidates) override;
@@ -272,9 +271,7 @@ public:
return compaction_strategy_type::date_tiered;
}
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
};
}


@@ -35,7 +35,6 @@ class leveled_compaction_strategy : public compaction_strategy_impl {
std::optional<std::vector<std::optional<dht::decorated_key>>> _last_compacted_keys;
std::vector<int> _compaction_counter;
size_tiered_compaction_strategy_options _stcs_options;
compaction_backlog_tracker _backlog_tracker;
int32_t calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const;
public:
static unsigned ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size);
@@ -64,9 +63,7 @@ public:
}
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, const ::io_priority_class& iop, reshape_mode mode) override;
};


@@ -82,7 +82,6 @@ public:
class size_tiered_compaction_strategy : public compaction_strategy_impl {
size_tiered_compaction_strategy_options _options;
compaction_backlog_tracker _backlog_tracker;
// Return a list of pair of shared_sstable and its respective size.
static std::vector<std::pair<sstables::shared_sstable, uint64_t>> create_sstable_and_length_pairs(const std::vector<sstables::shared_sstable>& sstables);
@@ -128,9 +127,7 @@ public:
most_interesting_bucket(const std::vector<sstables::shared_sstable>& candidates, int min_threshold, int max_threshold,
size_tiered_compaction_strategy_options options = {});
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, const ::io_priority_class& iop, reshape_mode mode) override;


@@ -15,6 +15,7 @@
#include "compaction_descriptor.hh"
class reader_permit;
class compaction_backlog_tracker;
namespace sstables {
class compaction_strategy;
@@ -43,6 +44,7 @@ public:
virtual future<> on_compaction_completion(sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
virtual const tombstone_gc_state& get_tombstone_gc_state() const noexcept = 0;
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
};
}


@@ -73,7 +73,6 @@ class time_window_compaction_strategy : public compaction_strategy_impl {
// Keep track of all recent active windows that still need to be compacted into a single SSTable
std::unordered_set<timestamp_type> _recent_active_windows;
size_tiered_compaction_strategy_options _stcs_options;
compaction_backlog_tracker _backlog_tracker;
public:
// The maximum amount of buckets we segregate data into when writing into sstables.
// To prevent an explosion in the number of sstables we cap it.
@@ -156,9 +155,7 @@ public:
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) override;


@@ -289,7 +289,8 @@ modes = {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*40,
'optimization-level': 'g',
# -fasan -Og breaks some coroutines on aarch64, use -O0 instead
'optimization-level': ('0' if platform.machine() == 'aarch64' else 'g'),
'per_src_extra_cxxflags': {},
'cmake_build_type': 'Debug',
'can_have_debug_info': True,
@@ -909,6 +910,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/config_file.cc',
'utils/multiprecision_int.cc',
'utils/gz/crc_combine.cc',
'utils/gz/crc_combine_table.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -943,6 +945,7 @@ scylla_core = (['message/messaging_service.cc',
'locator/ec2_snitch.cc',
'locator/ec2_multi_region_snitch.cc',
'locator/gce_snitch.cc',
'locator/topology.cc',
'service/client_state.cc',
'service/storage_service.cc',
'service/misc_services.cc',
@@ -1323,8 +1326,6 @@ deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
'test/lib/log.cc',
'service/raft/discovery.cc'] + scylla_raft_dependencies
deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']
warnings = [
'-Wall',
@@ -1413,12 +1414,8 @@ if not has_wasmtime:
has_wasmtime = os.path.isfile('/usr/lib64/libwasmtime.a') and os.path.isdir('/usr/local/include/wasmtime')
if has_wasmtime:
if platform.machine() == 'aarch64':
print("wasmtime is temporarily not supported on aarch64. Ref: issue #9387")
has_wasmtime = False
else:
for mode in modes:
modes[mode]['cxxflags'] += ' -DSCYLLA_ENABLE_WASMTIME'
for mode in modes:
modes[mode]['cxxflags'] += ' -DSCYLLA_ENABLE_WASMTIME'
else:
print("wasmtime not found - WASM support will not be enabled in this build")
@@ -1604,8 +1601,6 @@ if args.target != '':
seastar_cflags += ' -march=' + args.target
seastar_ldflags = args.user_ldflags
libdeflate_cflags = seastar_cflags
# cmake likes to separate things with semicolons
def semicolon_separated(*flags):
# original flags may be space separated, so convert to string still
@@ -1739,6 +1734,7 @@ libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-l
maybe_static(True, '-lzstd'),
maybe_static(args.staticboost, '-lboost_date_time -lboost_regex -licuuc -licui18n'),
'-lxxhash',
'-ldeflate',
])
if has_wasmtime:
print("Found wasmtime dependency, linking with libwasmtime")
@@ -1949,11 +1945,8 @@ with open(buildfile, 'w') as f:
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
objs.extend(['$builddir/' + mode + '/' + artifact for artifact in [
'libdeflate/libdeflate.a',
] + [
'abseil/' + x for x in abseil_libs
]])
objs.append('$builddir/' + mode + '/gen/utils/gz/crc_combine_table.o')
if binary in tests:
local_libs = '$seastar_libs_{} $libs'.format(mode)
if binary in pure_boost_tests:
@@ -2002,12 +1995,6 @@ with open(buildfile, 'w') as f:
rust_libs[staticlib] = src
else:
raise Exception('No rule for ' + src)
compiles['$builddir/' + mode + '/gen/utils/gz/crc_combine_table.o'] = '$builddir/' + mode + '/gen/utils/gz/crc_combine_table.cc'
compiles['$builddir/' + mode + '/utils/gz/gen_crc_combine_table.o'] = 'utils/gz/gen_crc_combine_table.cc'
f.write('build {}: run {}\n'.format('$builddir/' + mode + '/gen/utils/gz/crc_combine_table.cc',
'$builddir/' + mode + '/utils/gz/gen_crc_combine_table'))
f.write('build {}: link_build.{} {}\n'.format('$builddir/' + mode + '/utils/gz/gen_crc_combine_table', mode,
'$builddir/' + mode + '/utils/gz/gen_crc_combine_table.o'))
f.write(' libs = $seastar_libs_{}\n'.format(mode))
f.write(
'build {mode}-objects: phony {objs}\n'.format(
@@ -2139,24 +2126,16 @@ with open(buildfile, 'w') as f:
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/dist/{mode}/debian: debbuild $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian dist-server-compat-{mode} dist-server-compat-arch-{mode}\n')
f.write(f'build dist-server-compat-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz\n')
f.write(f'build dist-server-compat-arch-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz\n')
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
f.write(f'build dist-server-debuginfo-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz dist-jmx-rpm dist-jmx-deb dist-jmx-compat\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz dist-tools-rpm dist-tools-deb dist-tools-compat\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb dist-python3-compat dist-python3-compat-arch\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz dist-unified-compat-{mode} dist-unified-compat-arch-{mode}\n')
f.write(f'build dist-unified-compat-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz\n')
f.write(f'build dist-unified-compat-arch-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz dist-jmx-rpm dist-jmx-deb\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz dist-tools-rpm dist-tools-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write('rule libdeflate.{mode}\n'.format(**locals()))
f.write(' command = make -C libdeflate BUILD_DIR=../$builddir/{mode}/libdeflate/ CFLAGS="{libdeflate_cflags}" CC={args.cc} ../$builddir/{mode}/libdeflate//libdeflate.a\n'.format(**locals()))
f.write('build $builddir/{mode}/libdeflate/libdeflate.a: libdeflate.{mode}\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
for lib in abseil_libs:
f.write('build $builddir/{mode}/abseil/{lib}: ninja $builddir/{mode}/abseil/build.ninja\n'.format(**locals()))
@@ -2179,17 +2158,13 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent(f'''\
build dist-unified-tar: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz' for mode in default_modes])}
build dist-unified-compat: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz' for mode in default_modes])}
build dist-unified-compat-arch: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz' for mode in default_modes])}
build dist-unified: phony dist-unified-tar dist-unified-compat dist-unified-compat-arch
build dist-unified: phony dist-unified-tar
build dist-server-deb: phony {' '.join(['$builddir/dist/{mode}/debian'.format(mode=mode) for mode in build_modes])}
build dist-server-rpm: phony {' '.join(['$builddir/dist/{mode}/redhat'.format(mode=mode) for mode in build_modes])}
build dist-server-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-server-debuginfo: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-server-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-server-compat-arch: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-server: phony dist-server-tar dist-server-debuginfo dist-server-compat dist-server-compat-arch dist-server-rpm dist-server-deb
build dist-server: phony dist-server-tar dist-server-debuginfo dist-server-rpm dist-server-deb
rule build-submodule-reloc
command = cd $reloc_dir && ./reloc/build_reloc.sh --version $$(<../../build/SCYLLA-PRODUCT-FILE)-$$(sed 's/-/~/' <../../build/SCYLLA-VERSION-FILE)-$$(<../../build/SCYLLA-RELEASE-FILE) --nodeps $args
@@ -2207,8 +2182,7 @@ with open(buildfile, 'w') as f:
dir = tools/jmx
artifact = $builddir/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-jmx-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-jmx-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-jmx: phony dist-jmx-tar dist-jmx-compat dist-jmx-rpm dist-jmx-deb
build dist-jmx: phony dist-jmx-tar dist-jmx-rpm dist-jmx-deb
build tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
reloc_dir = tools/java
@@ -2219,8 +2193,7 @@ with open(buildfile, 'w') as f:
dir = tools/java
artifact = $builddir/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-tools-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-tools: phony dist-tools-tar dist-tools-compat dist-tools-rpm dist-tools-deb
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
build tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
reloc_dir = tools/python3
@@ -2232,14 +2205,10 @@ with open(buildfile, 'w') as f:
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-python3-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-python3-compat-arch: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{arch}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-python3: phony dist-python3-tar dist-python3-compat dist-python3-compat-arch dist-python3-rpm dist-python3-deb
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb
build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-jmx-rpm dist-tools-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-jmx-tar dist-tools-tar
build dist-compat: phony dist-unified-compat dist-server-compat dist-python3-compat
build dist-compat-arch: phony dist-unified-compat-arch dist-server-compat-arch dist-python3-compat-arch
build dist: phony dist-unified dist-server dist-python3 dist-jmx dist-tools
'''))

@@ -1419,7 +1419,7 @@ serviceLevelOrRoleName returns [sstring name]
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower); }
| t=STRING_LITERAL { $name = sstring($t.text); }
| t=QUOTED_NAME { $name = sstring($t.text); }
| k=unreserved_keyword { $name = sstring($t.text);
| k=unreserved_keyword { $name = k;
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower);}
| QMARK {add_recognition_error("Bind variables cannot be used for service levels or role names");}
;

@@ -216,36 +216,95 @@ get_value(const subscript& s, const evaluation_inputs& inputs) {
}
}
// This class represents a value that can be one of three things:
// false, true or null.
// It could be represented by std::optional<bool>, but optional
// converts implicitly to bool, which invites mistakes:
// (bool)(std::make_optional<bool>(false)) returns true,
// even though the represented value is `false`.
// To avoid such problems this class is introduced,
// along with the is_true() method, which can be used
// to check whether the held value is indeed `true`.
class bool_or_null {
std::optional<bool> value;
public:
bool_or_null(bool val) : value(val) {}
bool_or_null(null_value) : value(std::nullopt) {}
static bool_or_null null() {
return bool_or_null(null_value{});
}
bool has_value() const {
return value.has_value();
}
bool is_null() const {
return !has_value();
}
const bool& get_value() const {
return *value;
}
const bool is_true() const {
return has_value() && get_value();
}
};
/// True iff lhs's value equals rhs.
bool equal(const expression& lhs, const managed_bytes_opt& rhs, const evaluation_inputs& inputs) {
if (!rhs) {
return false;
bool_or_null equal(const expression& lhs, const managed_bytes_opt& rhs_bytes, const evaluation_inputs& inputs) {
raw_value lhs_value = evaluate(lhs, inputs);
if (lhs_value.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value found on left-hand side of an equality operator");
}
const auto value = evaluate(lhs, inputs).to_managed_bytes_opt();
if (!value) {
return false;
if (lhs_value.is_null() || !rhs_bytes.has_value()) {
return bool_or_null::null();
}
return type_of(lhs)->equal(managed_bytes_view(*value), managed_bytes_view(*rhs));
managed_bytes lhs_bytes = std::move(lhs_value).to_managed_bytes();
return type_of(lhs)->equal(managed_bytes_view(lhs_bytes), managed_bytes_view(*rhs_bytes));
}
/// Convenience overload for expression.
bool equal(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
return equal(lhs, evaluate(rhs, inputs).to_managed_bytes_opt(), inputs);
}
static std::optional<std::pair<managed_bytes, managed_bytes>> evaluate_binop_sides(const expression& lhs,
const expression& rhs,
const oper_t op,
const evaluation_inputs& inputs) {
raw_value lhs_value = evaluate(lhs, inputs);
raw_value rhs_value = evaluate(rhs, inputs);
/// True iff columns' values equal t.
bool equal(const tuple_constructor& columns_tuple_lhs, const expression& t_rhs, const evaluation_inputs& inputs) {
const cql3::raw_value tup = evaluate(t_rhs, inputs);
const auto& rhs = get_tuple_elements(tup, *type_of(t_rhs));
if (rhs.size() != columns_tuple_lhs.elements.size()) {
if (lhs_value.is_unset_value()) {
throw exceptions::invalid_request_exception(
format("tuple equality size mismatch: {} elements on left-hand side, {} on right",
columns_tuple_lhs.elements.size(), rhs.size()));
format("unset value found on left-hand side of a binary operator with operation {}", op));
}
return boost::equal(columns_tuple_lhs.elements, rhs,
[&] (const expression& lhs, const managed_bytes_opt& b) {
return equal(lhs, b, inputs);
});
if (rhs_value.is_unset_value()) {
throw exceptions::invalid_request_exception(
format("unset value found on right-hand side of a binary operator with operation {}", op));
}
if (lhs_value.is_null() || rhs_value.is_null()) {
return std::nullopt;
}
managed_bytes lhs_bytes = std::move(lhs_value).to_managed_bytes();
managed_bytes rhs_bytes = std::move(rhs_value).to_managed_bytes();
return std::pair(std::move(lhs_bytes), std::move(rhs_bytes));
}
bool_or_null equal(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::EQ, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
auto [lhs_bytes, rhs_bytes] = std::move(*sides_bytes);
return type_of(lhs)->equal(managed_bytes_view(lhs_bytes), managed_bytes_view(rhs_bytes));
}
bool_or_null not_equal(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::NEQ, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
auto [lhs_bytes, rhs_bytes] = std::move(*sides_bytes);
return !type_of(lhs)->equal(managed_bytes_view(lhs_bytes), managed_bytes_view(rhs_bytes));
}
/// True iff lhs is limited by rhs in the manner prescribed by op.
@@ -270,127 +329,77 @@ bool limits(managed_bytes_view lhs, oper_t op, managed_bytes_view rhs, const abs
}
/// True iff the column value is limited by rhs in the manner prescribed by op.
bool limits(const expression& col, oper_t op, const expression& rhs, const evaluation_inputs& inputs) {
bool_or_null limits(const expression& lhs, oper_t op, const expression& rhs, const evaluation_inputs& inputs) {
if (!is_slice(op)) { // For EQ or NEQ, use equal().
throw std::logic_error("limits() called on non-slice op");
}
auto lhs = evaluate(col, inputs).to_managed_bytes_opt();
if (!lhs) {
return false;
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, op, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
const auto b = evaluate(rhs, inputs).to_managed_bytes_opt();
return b ? limits(*lhs, op, *b, type_of(col)->without_reversed()) : false;
}
auto [lhs_bytes, rhs_bytes] = std::move(*sides_bytes);
/// True iff the column values are limited by t in the manner prescribed by op.
bool limits(const tuple_constructor& columns_tuple, const oper_t op, const expression& e,
const evaluation_inputs& inputs) {
if (!is_slice(op)) { // For EQ or NEQ, use equal().
throw std::logic_error("limits() called on non-slice op");
}
const cql3::raw_value tup = evaluate(e, inputs);
const auto& rhs = get_tuple_elements(tup, *type_of(e));
if (rhs.size() != columns_tuple.elements.size()) {
throw exceptions::invalid_request_exception(
format("tuple comparison size mismatch: {} elements on left-hand side, {} on right",
columns_tuple.elements.size(), rhs.size()));
}
for (size_t i = 0; i < rhs.size(); ++i) {
auto& cv = columns_tuple.elements[i];
auto lhs = evaluate(cv, inputs).to_managed_bytes_opt();
if (!lhs || !rhs[i]) {
// CQL dictates that columns_tuple.elements[i] is a clustering column and non-null, but
// let's not rely on grammar constraints that can be later relaxed.
//
// NULL = always fails comparison
return false;
}
const auto cmp = type_of(cv)->without_reversed().compare(
*lhs,
*rhs[i]);
// If the components aren't equal, then we just learned the LHS/RHS order.
if (cmp < 0) {
if (op == oper_t::LT || op == oper_t::LTE) {
return true;
} else if (op == oper_t::GT || op == oper_t::GTE) {
return false;
} else {
throw std::logic_error("Unknown slice operator");
}
} else if (cmp > 0) {
if (op == oper_t::LT || op == oper_t::LTE) {
return false;
} else if (op == oper_t::GT || op == oper_t::GTE) {
return true;
} else {
throw std::logic_error("Unknown slice operator");
}
}
// Otherwise, we don't know the LHS/RHS order, so check the next component.
}
// Getting here means LHS == RHS.
return op == oper_t::LTE || op == oper_t::GTE;
}
/// True iff collection (list, set, or map) contains value.
bool contains(const data_value& collection, const raw_value_view& value) {
if (!value) {
// CONTAINS NULL should evaluate to NULL/false
return false;
}
auto col_type = static_pointer_cast<const collection_type_impl>(collection.type());
auto&& element_type = col_type->is_set() ? col_type->name_comparator() : col_type->value_comparator();
return value.with_linearized([&] (bytes_view val) {
auto exists_in = [&](auto&& range) {
auto found = std::find_if(range.begin(), range.end(), [&] (auto&& element) {
return element_type->compare(element.serialize_nonnull(), val) == 0;
});
return found != range.end();
};
if (col_type->is_list()) {
return exists_in(value_cast<list_type_impl::native_type>(collection));
} else if (col_type->is_set()) {
return exists_in(value_cast<set_type_impl::native_type>(collection));
} else if (col_type->is_map()) {
auto data_map = value_cast<map_type_impl::native_type>(collection);
using entry = std::pair<data_value, data_value>;
return exists_in(data_map | transformed([] (const entry& e) { return e.second; }));
} else {
throw std::logic_error("unsupported collection type in a CONTAINS expression");
}
});
return limits(lhs_bytes, op, rhs_bytes, type_of(lhs)->without_reversed());
}
/// True iff a column is a collection containing value.
bool contains(const column_value& col, const raw_value_view& value, const evaluation_inputs& inputs) {
const auto collection = get_value(col, inputs);
if (collection) {
return contains(col.col->type->deserialize(managed_bytes_view(*collection)), value);
bool_or_null contains(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::CONTAINS, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
const abstract_type& lhs_type = type_of(lhs)->without_reversed();
data_value lhs_collection = lhs_type.deserialize(managed_bytes_view(sides_bytes->first));
const collection_type_impl* collection_type = dynamic_cast<const collection_type_impl*>(&lhs_type);
data_type element_type =
collection_type->is_set() ? collection_type->name_comparator() : collection_type->value_comparator();
auto exists_in = [&](auto&& range) {
auto found = std::find_if(range.begin(), range.end(), [&](auto&& element) {
return element_type->compare(managed_bytes_view(element.serialize_nonnull()), sides_bytes->second) == 0;
});
return found != range.end();
};
if (collection_type->is_list()) {
return exists_in(value_cast<list_type_impl::native_type>(lhs_collection));
} else if (collection_type->is_set()) {
return exists_in(value_cast<set_type_impl::native_type>(lhs_collection));
} else if (collection_type->is_map()) {
auto data_map = value_cast<map_type_impl::native_type>(lhs_collection);
using entry = std::pair<data_value, data_value>;
return exists_in(data_map | transformed([](const entry& e) { return e.second; }));
} else {
return false;
on_internal_error(expr_logger, "unsupported collection type in a CONTAINS expression");
}
}
/// True iff a column is a map containing \p key.
bool contains_key(const column_value& col, cql3::raw_value_view key, const evaluation_inputs& inputs) {
if (!key) {
// CONTAINS_KEY NULL should evaluate to NULL/false
return false;
bool_or_null contains_key(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::CONTAINS_KEY, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
auto type = col.col->type;
const auto collection = get_value(col, inputs);
if (!collection) {
return false;
auto [lhs_bytes, rhs_bytes] = std::move(*sides_bytes);
data_type lhs_type = type_of(lhs);
const map_type_impl::native_type data_map =
value_cast<map_type_impl::native_type>(lhs_type->deserialize(managed_bytes_view(lhs_bytes)));
data_type key_type = static_pointer_cast<const collection_type_impl>(lhs_type)->name_comparator();
for (const std::pair<data_value, data_value>& map_element : data_map) {
bytes serialized_element_key = map_element.first.serialize_nonnull();
if (key_type->compare(managed_bytes_view(rhs_bytes), managed_bytes_view(bytes_view(serialized_element_key))) ==
0) {
return true;
};
}
const auto data_map = value_cast<map_type_impl::native_type>(type->deserialize(managed_bytes_view(*collection)));
auto key_type = static_pointer_cast<const collection_type_impl>(type)->name_comparator();
auto found = key.with_linearized([&] (bytes_view k_bv) {
using entry = std::pair<data_value, data_value>;
return std::find_if(data_map.begin(), data_map.end(), [&] (const entry& element) {
return key_type->compare(element.first.serialize_nonnull(), k_bv) == 0;
});
});
return found != data_map.end();
return false;
}
/// Fetches the next cell value from iter and returns its (possibly null) value.
@@ -439,44 +448,62 @@ std::vector<managed_bytes_opt> get_non_pk_values(const selection& selection, con
namespace {
/// True iff cv matches the CQL LIKE pattern.
bool like(const column_value& cv, const raw_value_view& pattern, const evaluation_inputs& inputs) {
if (!cv.col->type->is_string()) {
bool_or_null like(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
data_type lhs_type = type_of(lhs)->underlying_type();
if (!lhs_type->is_string()) {
expression::printer lhs_printer {
.expr_to_print = lhs,
.debug_mode = false
};
throw exceptions::invalid_request_exception(
format("LIKE is allowed only on string types, which {} is not", cv.col->name_as_text()));
format("LIKE is allowed only on string types, which {} is not", lhs_printer));
}
auto value = get_value(cv, inputs);
// TODO: reuse matchers.
if (pattern && value) {
return value->with_linearized([&pattern] (bytes_view linearized_value) {
return pattern.with_linearized([linearized_value] (bytes_view linearized_pattern) {
return like_matcher(linearized_pattern)(linearized_value);
});
});
} else {
return false;
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::LIKE, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
auto [lhs_managed_bytes, rhs_managed_bytes] = std::move(*sides_bytes);
bytes lhs_bytes = to_bytes(lhs_managed_bytes);
bytes rhs_bytes = to_bytes(rhs_managed_bytes);
return like_matcher(bytes_view(rhs_bytes))(bytes_view(lhs_bytes));
}
/// True iff the column value is in the set defined by rhs.
bool is_one_of(const expression& col, const expression& rhs, const evaluation_inputs& inputs) {
const cql3::raw_value in_list = evaluate(rhs, inputs);
if (in_list.is_null()) {
return false;
bool_or_null is_one_of(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
std::optional<std::pair<managed_bytes, managed_bytes>> sides_bytes =
evaluate_binop_sides(lhs, rhs, oper_t::IN, inputs);
if (!sides_bytes.has_value()) {
return bool_or_null::null();
}
auto [lhs_bytes, rhs_bytes] = std::move(*sides_bytes);
return boost::algorithm::any_of(get_list_elements(in_list), [&] (const managed_bytes_opt& b) {
return equal(col, b, inputs);
});
expression lhs_constant = constant(raw_value::make_value(std::move(lhs_bytes)), type_of(lhs));
utils::chunked_vector<managed_bytes> list_elems = get_list_elements(raw_value::make_value(std::move(rhs_bytes)));
for (const managed_bytes& elem : list_elems) {
if (equal(lhs_constant, elem, evaluation_inputs{}).is_true()) {
return true;
}
}
return false;
}
/// True iff the tuple of column values is in the set defined by rhs.
bool is_one_of(const tuple_constructor& tuple, const expression& rhs, const evaluation_inputs& inputs) {
cql3::raw_value in_list = evaluate(rhs, inputs);
return boost::algorithm::any_of(get_list_of_tuples_elements(in_list, *type_of(rhs)), [&] (const std::vector<managed_bytes_opt>& el) {
return boost::equal(tuple.elements, el, [&] (const expression& c, const managed_bytes_opt& b) {
return equal(c, b, inputs);
});
});
bool is_not_null(const expression& lhs, const expression& rhs, const evaluation_inputs& inputs) {
cql3::raw_value lhs_val = evaluate(lhs, inputs);
if (lhs_val.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value found on left-hand side of IS NOT operator");
}
cql3::raw_value rhs_val = evaluate(rhs, inputs);
if (rhs_val.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value found on right-hand side of IS NOT operator");
}
if (!rhs_val.is_null()) {
throw exceptions::invalid_request_exception("IS NOT operator accepts only NULL as its right side");
}
return !lhs_val.is_null();
}
const value_set empty_value_set = value_list{};
@@ -511,105 +538,30 @@ value_set intersection(value_set a, value_set b, const abstract_type* type) {
}
bool is_satisfied_by(const binary_operator& opr, const evaluation_inputs& inputs) {
return expr::visit(overloaded_functor{
[&] (const column_value& col) {
if (opr.op == oper_t::EQ) {
return equal(col, opr.rhs, inputs);
} else if (opr.op == oper_t::NEQ) {
return !equal(col, opr.rhs, inputs);
} else if (is_slice(opr.op)) {
return limits(col, opr.op, opr.rhs, inputs);
} else if (opr.op == oper_t::CONTAINS) {
cql3::raw_value val = evaluate(opr.rhs, inputs);
return contains(col, val.view(), inputs);
} else if (opr.op == oper_t::CONTAINS_KEY) {
cql3::raw_value val = evaluate(opr.rhs, inputs);
return contains_key(col, val.view(), inputs);
} else if (opr.op == oper_t::LIKE) {
cql3::raw_value val = evaluate(opr.rhs, inputs);
return like(col, val.view(), inputs);
} else if (opr.op == oper_t::IN) {
return is_one_of(col, opr.rhs, inputs);
} else {
throw exceptions::unsupported_operation_exception(format("Unhandled binary_operator: {}", opr));
}
},
[&] (const subscript& sub) {
if (opr.op == oper_t::EQ) {
return equal(sub, opr.rhs, inputs);
} else if (opr.op == oper_t::NEQ) {
return !equal(sub, opr.rhs, inputs);
} else if (is_slice(opr.op)) {
return limits(sub, opr.op, opr.rhs, inputs);
} else if (opr.op == oper_t::CONTAINS) {
throw exceptions::unsupported_operation_exception("CONTAINS lhs is subscripted");
} else if (opr.op == oper_t::CONTAINS_KEY) {
throw exceptions::unsupported_operation_exception("CONTAINS KEY lhs is subscripted");
} else if (opr.op == oper_t::LIKE) {
throw exceptions::unsupported_operation_exception("LIKE lhs is subscripted");
} else if (opr.op == oper_t::IN) {
return is_one_of(sub, opr.rhs, inputs);
} else {
throw exceptions::unsupported_operation_exception(format("Unhandled binary_operator: {}", opr));
}
},
[&] (const tuple_constructor& cvs) {
if (opr.op == oper_t::EQ) {
return equal(cvs, opr.rhs, inputs);
} else if (is_slice(opr.op)) {
return limits(cvs, opr.op, opr.rhs, inputs);
} else if (opr.op == oper_t::IN) {
return is_one_of(cvs, opr.rhs, inputs);
} else {
throw exceptions::unsupported_operation_exception(
format("Unhandled multi-column binary_operator: {}", opr));
}
},
[] (const token& tok) -> bool {
// The RHS value was already used to ensure we fetch only rows in the specified
// token range. It is impossible for any fetched row not to match now.
return true;
},
[] (const constant&) -> bool {
on_internal_error(expr_logger, "is_satisfied_by: A constant cannot serve as the LHS of a binary expression");
},
[] (const conjunction&) -> bool {
on_internal_error(expr_logger, "is_satisfied_by: a conjunction cannot serve as the LHS of a binary expression");
},
[] (const binary_operator&) -> bool {
on_internal_error(expr_logger, "is_satisfied_by: binary operators cannot be nested");
},
[] (const unresolved_identifier&) -> bool {
on_internal_error(expr_logger, "is_satisfied_by: an unresolved identifier cannot serve as the LHS of a binary expression");
},
[] (const column_mutation_attribute&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: column_mutation_attribute cannot serve as the LHS of a binary expression");
},
[] (const function_call&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: function_call cannot serve as the LHS of a binary expression");
},
[] (const cast&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: cast cannot serve as the LHS of a binary expression");
},
[] (const field_selection&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: field_selection cannot serve as the LHS of a binary expression");
},
[] (const null&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: null cannot serve as the LHS of a binary expression");
},
[] (const bind_variable&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: bind_variable cannot serve as the LHS of a binary expression");
},
[] (const untyped_constant&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: untyped_constant cannot serve as the LHS of a binary expression");
},
[] (const collection_constructor&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: collection_constructor cannot serve as the LHS of a binary expression");
},
[] (const usertype_constructor&) -> bool {
on_internal_error(expr_logger, "is_satisified_by: usertype_constructor cannot serve as the LHS of a binary expression");
},
}, opr.lhs);
if (is<token>(opr.lhs)) {
// The RHS value was already used to ensure we fetch only rows in the specified
// token range. It is impossible for any fetched row not to match now.
// When token restrictions are present we forbid all other restrictions on the partition key.
// This means that the partition range is defined solely by restrictions on the token.
// When is_satisfied_by is used by filtering we can be sure that the token restrictions
// are fulfilled. In the future it will be possible to evaluate() a token,
// and we will be able to get rid of this risky if.
return true;
}
raw_value binop_eval_result = evaluate(opr, inputs);
if (binop_eval_result.is_null()) {
return false;
}
if (binop_eval_result.is_unset_value()) {
on_internal_error(expr_logger, format("is_satisfied_by: binary operator evaluated to unset value: {}", opr));
}
if (binop_eval_result.is_empty_value()) {
on_internal_error(expr_logger, format("is_satisfied_by: binary operator evaluated to EMPTY_VALUE: {}", opr));
}
return binop_eval_result.view().deserialize<bool>(*boolean_type);
}
} // anonymous namespace
@@ -1723,10 +1675,54 @@ std::optional<bool> get_bool_value(const constant& constant_val) {
return constant_val.view().deserialize<bool>(*boolean_type);
}
cql3::raw_value evaluate(const binary_operator& binop, const evaluation_inputs& inputs) {
if (binop.order == comparison_order::clustering) {
throw exceptions::invalid_request_exception("Can't evaluate a binary operator with SCYLLA_CLUSTERING_BOUND");
}
bool_or_null binop_result(false);
switch (binop.op) {
case oper_t::EQ:
binop_result = equal(binop.lhs, binop.rhs, inputs);
break;
case oper_t::NEQ:
binop_result = not_equal(binop.lhs, binop.rhs, inputs);
break;
case oper_t::LT:
case oper_t::LTE:
case oper_t::GT:
case oper_t::GTE:
binop_result = limits(binop.lhs, binop.op, binop.rhs, inputs);
break;
case oper_t::CONTAINS:
binop_result = contains(binop.lhs, binop.rhs, inputs);
break;
case oper_t::CONTAINS_KEY:
binop_result = contains_key(binop.lhs, binop.rhs, inputs);
break;
case oper_t::LIKE:
binop_result = like(binop.lhs, binop.rhs, inputs);
break;
case oper_t::IN:
binop_result = is_one_of(binop.lhs, binop.rhs, inputs);
break;
case oper_t::IS_NOT:
binop_result = is_not_null(binop.lhs, binop.rhs, inputs);
break;
};
if (binop_result.is_null()) {
return raw_value::make_null();
}
return raw_value::make_value(boolean_type->decompose(binop_result.get_value()));
}
cql3::raw_value evaluate(const expression& e, const evaluation_inputs& inputs) {
return expr::visit(overloaded_functor {
[](const binary_operator&) -> cql3::raw_value {
on_internal_error(expr_logger, "Can't evaluate a binary_operator");
[&](const binary_operator& binop) -> cql3::raw_value {
return evaluate(binop, inputs);
},
[](const conjunction&) -> cql3::raw_value {
on_internal_error(expr_logger, "Can't evaluate a conjunction");

@@ -246,6 +246,21 @@ map_prepare_expression(const collection_constructor& c, data_dictionary::databas
auto key_spec = maps::key_spec_of(*receiver);
auto value_spec = maps::value_spec_of(*receiver);
const map_type_impl* map_type = dynamic_cast<const map_type_impl*>(&receiver->type->without_reversed());
if (map_type == nullptr) {
on_internal_error(expr_logger,
format("map_prepare_expression bad non-map receiver type: {}", receiver->type->name()));
}
data_type map_element_tuple_type = tuple_type_impl::get_instance({map_type->get_keys_type(), map_type->get_values_type()});
// In Cassandra, an empty (unfrozen) map/set/list is equivalent to the column being null. In
// other words a non-frozen collection only exists if it has elements. Return nullptr right
// away to simplify predicate evaluation. See also
// https://issues.apache.org/jira/browse/CASSANDRA-5141
if (map_type->is_multi_cell() && c.elements.empty()) {
return constant::make_null(receiver->type);
}
std::vector<expression> values;
values.reserve(c.elements.size());
bool all_terminal = true;
@@ -264,7 +279,7 @@ map_prepare_expression(const collection_constructor& c, data_dictionary::databas
values.emplace_back(tuple_constructor {
.elements = {std::move(k), std::move(v)},
.type = entry_tuple.type
.type = map_element_tuple_type
});
}
@@ -687,9 +702,13 @@ bind_variable_test_assignment(const bind_variable& bv, data_dictionary::database
}
static
bind_variable
std::optional<bind_variable>
bind_variable_prepare_expression(const bind_variable& bv, data_dictionary::database db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver)
{
if (!receiver) {
return std::nullopt;
}
return bind_variable {
.bind_index = bv.bind_index,
.receiver = receiver

@@ -777,6 +777,19 @@ bool statement_restrictions::has_unrestricted_clustering_columns() const {
return clustering_columns_restrictions_size() < _schema->clustering_key_size();
}
const column_definition& statement_restrictions::unrestricted_column(column_kind kind) const {
const auto& restrictions = get_restrictions(kind);
const auto sorted_cols = expr::get_sorted_column_defs(restrictions);
for (size_t i = 0, count = _schema->columns_count(kind); i < count; ++i) {
if (i >= sorted_cols.size() || sorted_cols[i]->component_index() != i) {
return _schema->column_at(kind, i);
}
}
on_internal_error(rlogger, format(
"no missing columns with kind {} found in expression {}",
to_sstring(kind), restrictions));
};
bool statement_restrictions::clustering_columns_restrictions_have_supporting_index(
const secondary_index::secondary_index_manager& index_manager,
expr::allow_local_index allow_local) const {
@@ -1929,15 +1942,28 @@ sstring statement_restrictions::to_string() const {
return _where ? expr::to_string(*_where) : "";
}
static bool has_eq_null(const query_options& options, const expression& expr) {
return find_binop(expr, [&] (const binary_operator& binop) {
return binop.op == oper_t::EQ && evaluate(binop.rhs, options).is_null();
});
static void validate_primary_key_restrictions(const query_options& options, const std::vector<expr::expression>& restrictions) {
for (const auto& r: restrictions) {
for_each_expression<binary_operator>(r, [&](const binary_operator& binop) {
if (binop.op != oper_t::EQ && binop.op != oper_t::IN) {
return;
}
const auto* c = as_if<column_value>(&binop.lhs);
if (!c) {
return;
}
if (evaluate(binop.rhs, options).is_null()) {
throw exceptions::invalid_request_exception(format("Invalid null value in condition for column {}",
c->col->name_as_text()));
}
});
}
}
bool statement_restrictions::range_or_slice_eq_null(const query_options& options) const {
return boost::algorithm::any_of(_partition_range_restrictions, std::bind_front(has_eq_null, std::cref(options)))
|| boost::algorithm::any_of(_clustering_prefix_restrictions, std::bind_front(has_eq_null, std::cref(options)));
void statement_restrictions::validate_primary_key(const query_options& options) const {
validate_primary_key_restrictions(options, _partition_range_restrictions);
validate_primary_key_restrictions(options, _clustering_prefix_restrictions);
}
} // namespace restrictions
} // namespace cql3

@@ -240,6 +240,15 @@ public:
* @return <code>true</code> if the clustering key has some unrestricted components, <code>false</code> otherwise.
*/
bool has_unrestricted_clustering_columns() const;
/**
* Returns the first unrestricted column for restrictions of the specified kind.
* It's an error to call this function if there are no such columns.
*
* @param kind supported values are column_kind::partition_key and column_kind::clustering_key;
* @return the <code>column_definition</code> for the unrestricted column.
*/
const column_definition& unrestricted_column(column_kind kind) const;
private:
void add_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering, bool for_view);
void add_is_not_restriction(const expr::binary_operator& restr, schema_ptr schema, bool for_view);
@@ -525,8 +534,8 @@ public:
sstring to_string() const;
/// True iff the partition range or slice is empty specifically due to a =NULL restriction.
bool range_or_slice_eq_null(const query_options& options) const;
/// Checks that the primary key restrictions don't contain null values, throws invalid_request_exception otherwise.
void validate_primary_key(const query_options& options) const;
};
}

@@ -435,7 +435,7 @@ bool result_set_builder::restrictions_filter::do_filter(const selection& selecti
clustering_key_prefix ckey = clustering_key_prefix::from_exploded(clustering_key);
// FIXME: push to upper layer so it happens once per row
auto static_and_regular_columns = expr::get_non_pk_values(selection, static_row, row);
return expr::is_satisfied_by(
bool multi_col_clustering_satisfied = expr::is_satisfied_by(
clustering_columns_restrictions,
expr::evaluation_inputs{
.partition_key = &partition_key,
@@ -444,6 +444,9 @@ bool result_set_builder::restrictions_filter::do_filter(const selection& selecti
.selection = &selection,
.options = &_options,
});
if (!multi_col_clustering_satisfied) {
return false;
}
}
auto static_row_iterator = static_row.iterator();

@@ -261,6 +261,10 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (options.getSerialConsistency() == null)
throw new InvalidRequestException("Invalid empty serial consistency level");
#endif
for (size_t i = 0; i < _statements.size(); ++i) {
_statements[i].statement->restrictions().validate_primary_key(options.for_statement(i));
}
if (_has_conditions) {
++_stats.cas_batches;
_stats.statements_in_cas_batches += _statements.size();

View File

@@ -61,8 +61,8 @@ static std::map<sstring, sstring> prepare_options(
}
}
for (const auto& dc : tm.get_topology().get_datacenter_endpoints()) {
options.emplace(dc.first, rf);
for (const auto& dc : tm.get_topology().get_datacenters()) {
options.emplace(dc, rf);
}
}

View File

@@ -112,9 +112,6 @@ future<> modification_statement::check_access(query_processor& qp, const service
future<std::vector<mutation>>
modification_statement::get_mutations(query_processor& qp, const query_options& options, db::timeout_clock::time_point timeout, bool local, int64_t now, service::query_state& qs) const {
if (_restrictions->range_or_slice_eq_null(options)) { // See #7852 and #9290.
throw exceptions::invalid_request_exception("Invalid null value in condition for a key column");
}
auto cl = options.get_consistency();
auto json_cache = maybe_prepare_json_cache(options);
auto keys = build_partition_keys(options, json_cache);
@@ -263,6 +260,8 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
inc_cql_stats(qs.get_client_state().is_internal());
_restrictions->validate_primary_key(options);
if (has_conditions()) {
return execute_with_condition(qp, qs, options);
}
@@ -418,24 +417,23 @@ modification_statement::process_where_clause(data_dictionary::database db, expr:
// Those tables don't have clustering columns so we wouldn't reach this code, thus
// the check seems redundant.
if (require_full_clustering_key()) {
auto& col = s->column_at(column_kind::clustering_key, _restrictions->clustering_columns_restrictions_size());
throw exceptions::invalid_request_exception(format("Missing mandatory PRIMARY KEY part {}", col.name_as_text()));
throw exceptions::invalid_request_exception(format("Missing mandatory PRIMARY KEY part {}",
_restrictions->unrestricted_column(column_kind::clustering_key).name_as_text()));
}
// In general, we can't modify specific columns if not all clustering columns have been specified.
// However, if we modify only static columns, it's fine since we won't really use the prefix anyway.
if (!has_slice(ck_restrictions)) {
auto& col = s->column_at(column_kind::clustering_key, _restrictions->clustering_columns_restrictions_size());
for (auto&& op : _column_operations) {
if (!op->column.is_static()) {
throw exceptions::invalid_request_exception(format("Primary key column '{}' must be specified in order to modify column '{}'",
col.name_as_text(), op->column.name_as_text()));
_restrictions->unrestricted_column(column_kind::clustering_key).name_as_text(), op->column.name_as_text()));
}
}
}
}
if (_restrictions->has_partition_key_unrestricted_components()) {
auto& col = s->column_at(column_kind::partition_key, _restrictions->partition_key_restrictions_size());
throw exceptions::invalid_request_exception(format("Missing mandatory PRIMARY KEY part {}", col.name_as_text()));
throw exceptions::invalid_request_exception(format("Missing mandatory PRIMARY KEY part {}",
_restrictions->unrestricted_column(column_kind::partition_key).name_as_text()));
}
if (has_conditions()) {
validate_where_clause_for_conditions();

View File

@@ -655,68 +655,58 @@ indexed_table_select_statement::do_execute_base_query(
auto cmd = prepare_command_for_base_query(qp, options, state, now, bool(paging_state));
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
struct base_query_state {
query::result_merger merger;
std::vector<primary_key> primary_keys;
std::vector<primary_key>::iterator current_primary_key;
size_t previous_result_size = 0;
size_t next_iteration_size = 0;
base_query_state(uint64_t row_limit, std::vector<primary_key>&& keys)
: merger(row_limit, query::max_partitions)
, primary_keys(std::move(keys))
, current_primary_key(primary_keys.begin())
{}
base_query_state(base_query_state&&) = default;
base_query_state(const base_query_state&) = delete;
};
query::result_merger merger(cmd->get_row_limit(), query::max_partitions);
std::vector<primary_key> keys = std::move(primary_keys);
std::vector<primary_key>::iterator key_it(keys.begin());
size_t previous_result_size = 0;
size_t next_iteration_size = 0;
base_query_state query_state{cmd->get_row_limit(), std::move(primary_keys)};
const bool is_paged = bool(paging_state);
return do_with(std::move(query_state), [this, is_paged, &qp, &state, &options, cmd, timeout] (auto&& query_state) {
auto &merger = query_state.merger;
auto &keys = query_state.primary_keys;
auto &key_it = query_state.current_primary_key;
auto &previous_result_size = query_state.previous_result_size;
auto &next_iteration_size = query_state.next_iteration_size;
return utils::result_repeat([this, is_paged, &previous_result_size, &next_iteration_size, &keys, &key_it, &merger, &qp, &state, &options, cmd, timeout]() {
// Starting with 1 key, we check if the result was a short read, and if not,
// we continue exponentially, asking for 2x more keys than before
auto already_done = std::distance(keys.begin(), key_it);
// If the previous result already provided 1MB worth of data,
// stop increasing the number of fetched partitions
if (previous_result_size < query::result_memory_limiter::maximum_result_size) {
next_iteration_size = already_done + 1;
}
next_iteration_size = std::min<size_t>({next_iteration_size, keys.size() - already_done, max_base_table_query_concurrency});
auto key_it_end = key_it + next_iteration_size;
auto command = ::make_lw_shared<query::read_command>(*cmd);
while (key_it != keys.end()) {
// Starting with 1 key, we check if the result was a short read, and if not,
// we continue exponentially, asking for 2x more keys than before
auto already_done = std::distance(keys.begin(), key_it);
// If the previous result already provided 1MB worth of data,
// stop increasing the number of fetched partitions
if (previous_result_size < query::result_memory_limiter::maximum_result_size) {
next_iteration_size = already_done + 1;
}
next_iteration_size = std::min<size_t>({next_iteration_size, keys.size() - already_done, max_base_table_query_concurrency});
auto key_it_end = key_it + next_iteration_size;
auto command = ::make_lw_shared<query::read_command>(*cmd);
query::result_merger oneshot_merger(cmd->get_row_limit(), query::max_partitions);
return utils::result_map_reduce(key_it, key_it_end, [this, &qp, &state, &options, cmd, timeout] (auto& key) {
auto command = ::make_lw_shared<query::read_command>(*cmd);
// for each partition, read just one clustering row (TODO: can
// get all needed rows of one partition at once.)
command->slice._row_ranges.clear();
if (key.clustering) {
command->slice._row_ranges.push_back(query::clustering_range::make_singular(key.clustering));
}
return qp.proxy().query_result(_schema, command, {dht::partition_range::make_singular(key.partition)}, options.get_consistency(), {timeout, state.get_permit(), state.get_client_state(), state.get_trace_state()})
.then(utils::result_wrap([] (service::storage_proxy::coordinator_query_result qr) -> coordinator_result<foreign_ptr<lw_shared_ptr<query::result>>> {
return std::move(qr.query_result);
}));
}, std::move(oneshot_merger)).then(utils::result_wrap([is_paged, &previous_result_size, &key_it, key_it_end = std::move(key_it_end), &keys, &merger] (foreign_ptr<lw_shared_ptr<query::result>> result) -> coordinator_result<stop_iteration> {
auto is_short_read = result->is_short_read();
// Results larger than 1MB should be shipped to the client immediately
const bool page_limit_reached = is_paged && result->buf().size() >= query::result_memory_limiter::maximum_result_size;
previous_result_size = result->buf().size();
merger(std::move(result));
key_it = key_it_end;
return stop_iteration(is_short_read || key_it == keys.end() || page_limit_reached);
}));
}).then(utils::result_wrap([&merger, cmd] () mutable {
return make_ready_future<coordinator_result<value_type>>(value_type(merger.get(), std::move(cmd)));
}));
});
query::result_merger oneshot_merger(cmd->get_row_limit(), query::max_partitions);
coordinator_result<foreign_ptr<lw_shared_ptr<query::result>>> rresult = co_await utils::result_map_reduce(key_it, key_it_end, coroutine::lambda([&] (auto& key)
-> future<coordinator_result<foreign_ptr<lw_shared_ptr<query::result>>>> {
auto command = ::make_lw_shared<query::read_command>(*cmd);
// for each partition, read just one clustering row (TODO: can
// get all needed rows of one partition at once.)
command->slice._row_ranges.clear();
if (key.clustering) {
command->slice._row_ranges.push_back(query::clustering_range::make_singular(key.clustering));
}
coordinator_result<service::storage_proxy::coordinator_query_result> rqr
= co_await qp.proxy().query_result(_schema, command, {dht::partition_range::make_singular(key.partition)}, options.get_consistency(), {timeout, state.get_permit(), state.get_client_state(), state.get_trace_state()});
if (!rqr.has_value()) {
co_return std::move(rqr).as_failure();
}
co_return std::move(rqr.value().query_result);
}), std::move(oneshot_merger));
if (!rresult.has_value()) {
co_return std::move(rresult).as_failure();
}
auto& result = rresult.value();
auto is_short_read = result->is_short_read();
// Results larger than 1MB should be shipped to the client immediately
const bool page_limit_reached = is_paged && result->buf().size() >= query::result_memory_limiter::maximum_result_size;
previous_result_size = result->buf().size();
merger(std::move(result));
key_it = key_it_end;
if (is_short_read || page_limit_reached) {
break;
}
}
co_return value_type(merger.get(), std::move(cmd));
}
future<shared_ptr<cql_transport::messages::result_message>>
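The concurrency-growth policy in the loop above (start with one key, double the batch each iteration, stop growing once a result reaches the memory limit, and clamp to the remaining keys and the maximum concurrency) can be sketched in isolation. This is a hypothetical standalone helper, not the actual Scylla API; `limit_bytes` and `max_concurrency` stand in for `query::result_memory_limiter::maximum_result_size` and `max_base_table_query_concurrency`.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the batch-size policy: while previous results stay under the
// memory limit, the next batch is already_done + 1 (which doubles the
// per-iteration size: 1, 2, 4, ...); otherwise the batch stops growing.
// The result is clamped to the keys left and the maximum concurrency.
std::size_t next_batch_size(std::size_t already_done,
                            std::size_t prev_batch_size,
                            std::size_t prev_result_bytes,
                            std::size_t keys_left,
                            std::size_t max_concurrency,
                            std::size_t limit_bytes) {
    std::size_t next = prev_batch_size;
    if (prev_result_bytes < limit_bytes) {
        next = already_done + 1;
    }
    return std::min({next, keys_left, max_concurrency});
}
```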

View File

@@ -824,7 +824,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, view_building(this, "view_building", value_status::Used, true, "Enable view building; should only be set to false when the node is experiencing issues due to view building")
, enable_sstables_mc_format(this, "enable_sstables_mc_format", value_status::Unused, true, "Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, enable_sstables_md_format(this, "enable_sstables_md_format", value_status::Unused, true, "Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, sstable_format(this, "sstable_format", value_status::Used, "me", "Default sstable file format", {"mc", "md", "me"})
, sstable_format(this, "sstable_format", value_status::Used, "me", "Default sstable file format", {"md", "me"})
, enable_dangerous_direct_import_of_cassandra_counters(this, "enable_dangerous_direct_import_of_cassandra_counters", value_status::Used, false, "Only turn this option on if you want to import tables from Cassandra containing counters, and you are SURE that no counters in that table were created in a version earlier than Cassandra 2.1."
" It is not enough to have since upgraded to newer versions of Cassandra. If you EVER used a version earlier than 2.1 in the cluster where these SSTables come from, DO NOT TURN ON THIS OPTION! You will corrupt your data. You have been warned.")
, enable_shard_aware_drivers(this, "enable_shard_aware_drivers", value_status::Used, true, "Enable native transport drivers to use connection-per-shard for better performance")
@@ -907,7 +907,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, force_schema_commit_log(this, "force_schema_commit_log", value_status::Used, false,
"Use separate schema commit log unconditionally rather than after restart following discovery of cluster-wide support for it.")
, task_ttl_seconds(this, "task_ttl_in_seconds", liveness::LiveUpdate, value_status::Used, 10, "Time for which information about a finished task stays in memory.")
, cache_index_pages(this, "cache_index_pages", liveness::LiveUpdate, value_status::Used, true,
, cache_index_pages(this, "cache_index_pages", liveness::LiveUpdate, value_status::Used, false,
"Keep SSTable index pages in the global cache after a SSTable read. Expected to improve performance for workloads with big partitions, but may degrade performance for workloads with small partitions.")
, default_log_level(this, "default_log_level", value_status::Used)
, logger_log_level(this, "logger_log_level", value_status::Used)
@@ -1065,7 +1065,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"udf", feature::UDF},
{"cdc", feature::UNUSED},
{"alternator-streams", feature::ALTERNATOR_STREAMS},
{"alternator-ttl", feature::ALTERNATOR_TTL},
{"alternator-ttl", feature::UNUSED },
{"raft", feature::RAFT},
{"broadcast-tables", feature::BROADCAST_TABLES},
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},

View File

@@ -84,7 +84,7 @@ struct experimental_features_t {
// NOTE: RAFT and BROADCAST_TABLES features are not enabled via `experimental` umbrella flag.
// These options should be enabled explicitly.
// RAFT feature has to be enabled if BROADCAST_TABLES is enabled.
enum class feature { UNUSED, UDF, ALTERNATOR_STREAMS, ALTERNATOR_TTL, RAFT,
enum class feature { UNUSED, UDF, ALTERNATOR_STREAMS, RAFT,
BROADCAST_TABLES, KEYSPACE_STORAGE_OPTIONS };
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();

View File

@@ -33,7 +33,7 @@ bool host_filter::can_hint_for(const locator::topology& topo, gms::inet_address
case enabled_kind::enabled_for_all:
return true;
case enabled_kind::enabled_selectively:
return _dcs.contains(topo.get_datacenter(ep));
return topo.has_endpoint(ep, locator::topology::pending::yes) && _dcs.contains(topo.get_datacenter(ep));
case enabled_kind::disabled_for_all:
return false;
}

View File

@@ -96,7 +96,7 @@ void manager::register_metrics(const sstring& group_name) {
future<> manager::start(shared_ptr<service::storage_proxy> proxy_ptr, shared_ptr<gms::gossiper> gossiper_ptr) {
_proxy_anchor = std::move(proxy_ptr);
_gossiper_anchor = std::move(gossiper_ptr);
return lister::scan_dir(_hints_dir, { directory_entry_type::directory }, [this] (fs::path datadir, directory_entry de) {
return lister::scan_dir(_hints_dir, lister::dir_entry_types::of<directory_entry_type::directory>(), [this] (fs::path datadir, directory_entry de) {
ep_key_type ep = ep_key_type(de.name);
if (!check_dc_for(ep)) {
return make_ready_future<>();
@@ -558,7 +558,7 @@ bool manager::end_point_hints_manager::sender::can_send() noexcept {
return true;
} else {
if (!_state.contains(state::ep_state_left_the_ring)) {
_state.set_if<state::ep_state_left_the_ring>(!_shard_manager.local_db().get_token_metadata().is_member(end_point_key()));
_state.set_if<state::ep_state_left_the_ring>(!_shard_manager.local_db().get_token_metadata().is_normal_token_owner(end_point_key()));
}
// send the hints out if the destination Node is part of the ring - we will send to all new replicas in this case
return _state.contains(state::ep_state_left_the_ring);
@@ -656,7 +656,7 @@ future<> manager::change_host_filter(host_filter filter) {
// Iterate over existing hint directories and see if we can enable an endpoint manager
// for some of them
return lister::scan_dir(_hints_dir, { directory_entry_type::directory }, [this] (fs::path datadir, directory_entry de) {
return lister::scan_dir(_hints_dir, lister::dir_entry_types::of<directory_entry_type::directory>(), [this] (fs::path datadir, directory_entry de) {
const ep_key_type ep = ep_key_type(de.name);
if (_ep_managers.contains(ep) || !_host_filter.can_hint_for(_proxy_anchor->get_token_metadata_ptr()->get_topology(), ep)) {
return make_ready_future<>();
@@ -1168,7 +1168,7 @@ void manager::end_point_hints_manager::sender::send_hints_maybe() noexcept {
}
static future<> scan_for_hints_dirs(const sstring& hints_directory, std::function<future<> (fs::path dir, directory_entry de, unsigned shard_id)> f) {
return lister::scan_dir(hints_directory, { directory_entry_type::directory }, [f = std::move(f)] (fs::path dir, directory_entry de) mutable {
return lister::scan_dir(hints_directory, lister::dir_entry_types::of<directory_entry_type::directory>(), [f = std::move(f)] (fs::path dir, directory_entry de) mutable {
unsigned shard_id;
try {
shard_id = std::stoi(de.name.c_str());
@@ -1188,10 +1188,10 @@ manager::hints_segments_map manager::get_current_hints_segments(const sstring& h
scan_for_hints_dirs(hints_directory, [&current_hints_segments] (fs::path dir, directory_entry de, unsigned shard_id) {
manager_logger.trace("shard_id = {}", shard_id);
// IPs level
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::directory }, [&current_hints_segments, shard_id] (fs::path dir, directory_entry de) {
return lister::scan_dir(dir / de.name.c_str(), lister::dir_entry_types::of<directory_entry_type::directory>(), [&current_hints_segments, shard_id] (fs::path dir, directory_entry de) {
manager_logger.trace("\tIP: {}", de.name);
// hints files
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::regular }, [&current_hints_segments, shard_id, ep_addr = de.name] (fs::path dir, directory_entry de) {
return lister::scan_dir(dir / de.name.c_str(), lister::dir_entry_types::of<directory_entry_type::regular>(), [&current_hints_segments, shard_id, ep_addr = de.name] (fs::path dir, directory_entry de) {
manager_logger.trace("\t\tfile: {}", de.name);
current_hints_segments[ep_addr][shard_id].emplace_back(dir / de.name.c_str());
return make_ready_future<>();
@@ -1305,7 +1305,7 @@ void manager::remove_irrelevant_shards_directories(const sstring& hints_director
scan_for_hints_dirs(hints_directory, [] (fs::path dir, directory_entry de, unsigned shard_id) {
if (shard_id >= smp::count) {
// IPs level
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::directory, directory_entry_type::regular }, lister::show_hidden::yes, [] (fs::path dir, directory_entry de) {
return lister::scan_dir(dir / de.name.c_str(), lister::dir_entry_types::full(), lister::show_hidden::yes, [] (fs::path dir, directory_entry de) {
return io_check(remove_file, (dir / de.name.c_str()).native());
}).then([shard_base_dir = dir, shard_entry = de] {
return io_check(remove_file, (shard_base_dir / shard_entry.name.c_str()).native());

View File

@@ -99,7 +99,7 @@ future<> space_watchdog::scan_one_ep_dir(fs::path path, manager& shard_manager,
if (!exists) {
return make_ready_future<>();
} else {
return lister::scan_dir(path, { directory_entry_type::regular }, [this, ep_key, &shard_manager] (fs::path dir, directory_entry de) {
return lister::scan_dir(path, lister::dir_entry_types::of<directory_entry_type::regular>(), [this, ep_key, &shard_manager] (fs::path dir, directory_entry de) {
// Put the current endpoint ID into state.eps_with_pending_hints when we see the second hints file in its directory
if (_files_count == 1) {
shard_manager.add_ep_with_pending_hints(ep_key);
@@ -138,7 +138,7 @@ void space_watchdog::on_timer() {
_total_size = 0;
for (manager& shard_manager : per_device_limits.managers) {
shard_manager.clear_eps_with_pending_hints();
lister::scan_dir(shard_manager.hints_dir(), {directory_entry_type::directory}, [this, &shard_manager] (fs::path dir, directory_entry de) {
lister::scan_dir(shard_manager.hints_dir(), lister::dir_entry_types::of<directory_entry_type::directory>(), [this, &shard_manager] (fs::path dir, directory_entry de) {
_files_count = 0;
// Let's scan per-end-point directories and enumerate hints files...
//

View File

@@ -355,6 +355,7 @@ schema_ptr system_keyspace::built_indexes() {
}
/*static*/ schema_ptr system_keyspace::peers() {
constexpr uint16_t schema_version_offset = 1; // raft_server_id
static thread_local auto peers = [] {
schema_builder builder(generate_legacy_id(NAME, PEERS), NAME, PEERS,
// partition key
@@ -372,6 +373,7 @@ schema_ptr system_keyspace::built_indexes() {
{"schema_version", uuid_type},
{"tokens", set_type_impl::get_instance(utf8_type, true)},
{"supported_features", utf8_type},
{"raft_server_id", uuid_type},
},
// static columns
{},
@@ -381,7 +383,7 @@ schema_ptr system_keyspace::built_indexes() {
"information about known peers in the cluster"
);
builder.set_gc_grace_seconds(0);
builder.with_version(generate_schema_version(builder.uuid()));
builder.with_version(generate_schema_version(builder.uuid(), schema_version_offset));
return builder.build(schema_builder::compact_storage::no);
}();
return peers;
@@ -1502,6 +1504,7 @@ future<> system_keyspace::update_tokens(gms::inet_address ep, const std::unorder
}
sstring req = format("INSERT INTO system.{} (peer, tokens) VALUES (?, ?)", PEERS);
slogger.debug("INSERT INTO system.{} (peer, tokens) VALUES ({}, {})", PEERS, ep, tokens);
auto set_type = set_type_impl::get_instance(utf8_type, true);
co_await execute_cql(req, ep.addr(), make_set_value(set_type, prepare_tokens(tokens))).discard_result();
co_await force_blocking_flush(PEERS);
@@ -1541,11 +1544,18 @@ future<std::unordered_map<gms::inet_address, locator::host_id>> system_keyspace:
}
future<std::vector<gms::inet_address>> system_keyspace::load_peers() {
auto res = co_await execute_cql(format("SELECT peer FROM system.{}", PEERS));
auto res = co_await execute_cql(format("SELECT peer, tokens FROM system.{}", PEERS));
assert(res);
std::vector<gms::inet_address> ret;
for (auto& row: *res) {
if (!row.has("tokens")) {
// Ignore rows that don't have tokens. Such rows may
// be introduced by code that persists parts of peer
// information (such as RAFT_ID) which may potentially
// race with deleting a peer (during node removal).
continue;
}
ret.emplace_back(row.get_as<net::inet_address>("peer"));
}
co_return ret;
@@ -1594,6 +1604,7 @@ future<> system_keyspace::update_peer_info(gms::inet_address ep, sstring column_
co_await update_cached_values(ep, column_name, value);
sstring req = format("INSERT INTO system.{} (peer, {}) VALUES (?, ?)", PEERS, column_name);
slogger.debug("INSERT INTO system.{} (peer, {}) VALUES ({}, {})", PEERS, column_name, ep, value);
co_await execute_cql(req, ep.addr(), value).discard_result();
}
// sets are not needed, since tokens are updated by another method
@@ -1645,6 +1656,7 @@ future<> system_keyspace::update_schema_version(table_schema_version version) {
*/
future<> system_keyspace::remove_endpoint(gms::inet_address ep) {
sstring req = format("DELETE FROM system.{} WHERE peer = ?", PEERS);
slogger.debug("DELETE FROM system.{} WHERE peer = {}", PEERS, ep);
co_await execute_cql(req, ep.addr()).discard_result();
co_await force_blocking_flush(PEERS);
}
@@ -1869,7 +1881,7 @@ public:
set_cell(cr, "host_id", hostid->uuid());
}
if (tm.is_member(endpoint)) {
if (tm.is_normal_token_owner(endpoint)) {
sstring dc = tm.get_topology().get_location(endpoint).dc;
set_cell(cr, "dc", dc);
}
@@ -2467,23 +2479,25 @@ class db_config_table final : public streaming_virtual_table {
return make_exception_future<>(virtual_table_update_exception("option source is not updateable"));
}
return smp::submit_to(0, [&cfg = _cfg, name = std::move(*name), value = std::move(*value)] () mutable {
return smp::submit_to(0, [&cfg = _cfg, name = std::move(*name), value = std::move(*value)] () mutable -> future<> {
for (auto& c_ref : cfg.values()) {
auto& c = c_ref.get();
if (c.name() == name) {
std::exception_ptr ex;
try {
if (c.set_value(value, utils::config_file::config_source::CQL)) {
return cfg.broadcast_to_all_shards();
if (co_await c.set_value_on_all_shards(value, utils::config_file::config_source::CQL)) {
co_return;
} else {
return make_exception_future<>(virtual_table_update_exception("option is not live-updateable"));
ex = std::make_exception_ptr(virtual_table_update_exception("option is not live-updateable"));
}
} catch (boost::bad_lexical_cast&) {
return make_exception_future<>(virtual_table_update_exception("cannot parse option value"));
ex = std::make_exception_ptr(virtual_table_update_exception("cannot parse option value"));
}
co_await coroutine::return_exception_ptr(std::move(ex));
}
}
return make_exception_future<>(virtual_table_update_exception("no such option"));
co_await coroutine::return_exception(virtual_table_update_exception("no such option"));
});
}
@@ -2881,7 +2895,7 @@ future<> system_keyspace::get_repair_history(::table_id table_id, repair_history
sstring req = format("SELECT * from system.{} WHERE table_uuid = {}", REPAIR_HISTORY, table_id);
co_await _qp.local().query_internal(req, [&f] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
repair_history_entry ent;
ent.id = row.get_as<tasks::task_id>("repair_uuid");
ent.id = tasks::task_id(row.get_as<utils::UUID>("repair_uuid"));
ent.table_uuid = ::table_id(row.get_as<utils::UUID>("table_uuid"));
ent.range_start = row.get_as<int64_t>("range_start");
ent.range_end = row.get_as<int64_t>("range_end");

View File

@@ -128,6 +128,9 @@ const column_definition* view_info::view_column(const column_definition& base_de
void view_info::set_base_info(db::view::base_info_ptr base_info) {
_base_info = std::move(base_info);
// Forget the cached objects which may refer to the base schema.
_select_statement = nullptr;
_partition_slice = std::nullopt;
}
// A constructor for a base info that can facilitate reads and writes from the materialized view.
@@ -1391,9 +1394,9 @@ static std::optional<gms::inet_address>
get_view_natural_endpoint(const sstring& keyspace_name,
const dht::token& base_token, const dht::token& view_token) {
auto &db = service::get_local_storage_proxy().local_db();
auto& topology = service::get_local_storage_proxy().get_token_metadata_ptr()->get_topology();
auto& ks = db.find_keyspace(keyspace_name);
auto erm = ks.get_effective_replication_map();
auto& topology = erm->get_token_metadata_ptr()->get_topology();
auto my_address = utils::fb_utilities::get_broadcast_address();
auto my_datacenter = topology.get_datacenter();
bool network_topology = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy());

View File

@@ -15,6 +15,7 @@
#include "sstables/sstables.hh"
#include "sstables/progress_monitor.hh"
#include "readers/evictable.hh"
#include "dht/partition_filter.hh"
static logging::logger vug_logger("view_update_generator");
@@ -158,7 +159,8 @@ future<> view_update_generator::start() {
::mutation_reader::forwarding::no);
inject_failure("view_update_generator_consume_staging_sstable");
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(s, std::move(permit), *t, sstables, _as, staging_sstable_reader_handle));
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(s, std::move(permit), *t, sstables, _as, staging_sstable_reader_handle),
dht::incremental_owned_ranges_checker::make_partition_filter(_db.get_keyspace_local_ranges(s->ks_name())));
staging_sstable_reader.close().get();
if (result == stop_iteration::yes) {
break;

View File

@@ -9,7 +9,9 @@
#include "i_partitioner.hh"
#include "sharder.hh"
#include <seastar/core/seastar.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "dht/token-sharding.hh"
#include "dht/partition_filter.hh"
#include "utils/class_registrator.hh"
#include "types.hh"
#include "utils/murmur_hash.hh"
@@ -362,4 +364,79 @@ split_range_to_shards(dht::partition_range pr, const schema& s) {
return ret;
}
flat_mutation_reader_v2::filter incremental_owned_ranges_checker::make_partition_filter(const dht::token_range_vector& sorted_owned_ranges) {
return [checker = incremental_owned_ranges_checker(sorted_owned_ranges)] (const dht::decorated_key& dk) mutable {
return checker.belongs_to_current_node(dk.token());
};
}
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& source_ranges, dht::partition_range_vector ranges_to_subtract) {
auto cmp = dht::ring_position_comparator(schema);
// optimize set of potentially overlapping ranges by deoverlapping them.
auto ranges = dht::partition_range::deoverlap(source_ranges, cmp);
dht::partition_range_vector res;
res.reserve(ranges.size() * 2);
auto range = ranges.begin();
auto range_end = ranges.end();
auto range_to_subtract = ranges_to_subtract.begin();
auto range_to_subtract_end = ranges_to_subtract.end();
while (range != range_end) {
if (range_to_subtract == range_to_subtract_end) {
// We're done with ranges_to_subtract
res.emplace_back(std::move(*range));
++range;
continue;
}
auto diff = range->subtract(*range_to_subtract, cmp);
auto size = diff.size();
switch (size) {
case 0:
// current range is fully covered by range_to_subtract, done with it
// range_to_subtract.start <= range.start &&
// range_to_subtract.end >= range.end
++range;
break;
case 1:
// Possible cases:
// a. range and range_to_subtract are disjoint (so diff == range)
// a.i range_to_subtract.end < range.start
// a.ii range_to_subtract.start > range.end
// b. range_to_subtract.start > range.start, so it removes the range suffix
// c. range_to_subtract.start < range.start, so it removes the range prefix
// Does range_to_subtract sort after range?
if (range_to_subtract->start() && (!range->start() || cmp(range_to_subtract->start()->value(), range->start()->value()) > 0)) {
// save range prefix in the result
// (note that diff[0] == range in the disjoint case)
res.emplace_back(std::move(diff[0]));
// done with current range
++range;
} else {
// set the current range to the remaining suffix
*range = std::move(diff[0]);
// done with current range_to_subtract
++range_to_subtract;
}
break;
case 2:
// range contains range_to_subtract
// save range prefix in the result
res.emplace_back(std::move(diff[0]));
// set the current range to the remaining suffix
*range = std::move(diff[1]);
// done with current range_to_subtract
++range_to_subtract;
break;
default:
assert(size <= 2);
}
co_await coroutine::maybe_yield();
}
co_return res;
}
}
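The case analysis in `subtract_ranges()` above can be illustrated with a simplified standalone sweep. This sketch uses closed integer intervals in place of `dht::partition_range` (a hypothetical simplification — the real code compares ring positions and handles open/unbounded ranges); both inputs are assumed sorted and deoverlapped, and the result is `ranges` minus every interval in `to_subtract`.

```cpp
#include <utility>
#include <vector>

using ival = std::pair<int, int>; // [first, second], inclusive

std::vector<ival> subtract(std::vector<ival> ranges, const std::vector<ival>& to_subtract) {
    std::vector<ival> res;
    auto r = ranges.begin();
    auto s = to_subtract.begin();
    while (r != ranges.end()) {
        if (s == to_subtract.end() || s->first > r->second) {
            // No subtrahend overlaps this range: keep it whole (case a).
            res.push_back(*r++);
        } else if (s->second < r->first) {
            // Subtrahend lies entirely before the range: advance it (case a.i).
            ++s;
        } else {
            // Overlap: emit any prefix left of the subtrahend (cases b/2)...
            if (r->first < s->first) {
                res.push_back({r->first, s->first - 1});
            }
            // ...then either keep the suffix or drop the range (cases c/0).
            if (s->second < r->second) {
                r->first = s->second + 1; // remaining suffix
                ++s;                      // this subtrahend is fully consumed
            } else {
                ++r;                      // the range is fully consumed
            }
        }
    }
    return res;
}
```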

View File

@@ -648,6 +648,11 @@ future<utils::chunked_vector<partition_range>> split_range_to_single_shard(const
std::unique_ptr<dht::i_partitioner> make_partitioner(sstring name);
// Returns a sorted and deoverlapped list of ranges that are
// the result of subtracting all ranges in ranges_to_subtract from ranges.
// ranges_to_subtract must be sorted and deoverlapped.
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& ranges, dht::partition_range_vector ranges_to_subtract);
} // dht
namespace std {

dht/partition_filter.hh Normal file
View File

@@ -0,0 +1,41 @@
/*
* Modified by ScyllaDB
* Copyright (C) 2015-present ScyllaDB
*/
/*
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#pragma once
#include "dht/i_partitioner.hh"
#include "readers/flat_mutation_reader_v2.hh"
namespace dht {
class incremental_owned_ranges_checker {
const dht::token_range_vector& _sorted_owned_ranges;
mutable dht::token_range_vector::const_iterator _it;
public:
incremental_owned_ranges_checker(const dht::token_range_vector& sorted_owned_ranges)
: _sorted_owned_ranges(sorted_owned_ranges)
, _it(_sorted_owned_ranges.begin()) {
}
// Must be called with increasing token values.
bool belongs_to_current_node(const dht::token& t) {
// While token T is after a range Rn, advance the iterator.
// The iterator stops at a range that either overlaps with T (if T belongs to this node),
// or at a range that is after T (if T doesn't belong to this node).
while (_it != _sorted_owned_ranges.end() && _it->after(t, dht::token_comparator())) {
_it++;
}
return _it != _sorted_owned_ranges.end() && _it->contains(t, dht::token_comparator());
}
static flat_mutation_reader_v2::filter make_partition_filter(const dht::token_range_vector& sorted_owned_ranges);
};
} // dht
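The idea behind `incremental_owned_ranges_checker` — queries visit tokens in ring order, so membership in a sorted range list can be answered by a cursor that only moves forward, making a whole scan linear rather than a binary search per token — can be sketched with a minimal analogue. Closed integer intervals stand in for `dht::token_range`; the class name is hypothetical.

```cpp
#include <utility>
#include <vector>

// Forward-only membership cursor over a sorted, deoverlapped range list.
// owned() must be called with non-decreasing values of t.
class owned_ranges_cursor {
    const std::vector<std::pair<int, int>>& _ranges;
    std::vector<std::pair<int, int>>::const_iterator _it;
public:
    explicit owned_ranges_cursor(const std::vector<std::pair<int, int>>& ranges)
        : _ranges(ranges), _it(_ranges.begin()) {}

    bool owned(int t) {
        // Skip ranges that end before t: they can never match again.
        while (_it != _ranges.end() && _it->second < t) {
            ++_it;
        }
        // The cursor now rests on the first range not entirely before t.
        return _it != _ranges.end() && _it->first <= t;
    }
};
```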

View File

@@ -7,6 +7,8 @@
*/
#pragma once
#include "utils/UUID.hh"
#include <seastar/core/sharded.hh>
using namespace seastar;
@@ -21,7 +23,7 @@ class pinger {
public:
// Opaque endpoint ID.
// A specific implementation of `pinger` maps those IDs to 'real' addresses.
using endpoint_id = unsigned;
using endpoint_id = utils::UUID;
// Send a message to `ep` and wait until it responds.
// The wait can be aborted using `as`.
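The hunk above switches `pinger::endpoint_id` from a plain `unsigned` to `utils::UUID`. A hypothetical Python analogue of the mapping a `pinger` implementation keeps from opaque endpoint IDs to 'real' addresses (the class and method names here are illustrative, not from the Scylla sources):

```python
import uuid

# Illustrative sketch: opaque UUID endpoint IDs mapped to real addresses,
# as the comment on pinger::endpoint_id describes.
class AddressMap:
    def __init__(self):
        self._addresses = {}

    def register(self, address):
        # Allocate a fresh opaque ID for a real address.
        ep_id = uuid.uuid4()
        self._addresses[ep_id] = address
        return ep_id

    def resolve(self, ep_id):
        return self._addresses[ep_id]

m = AddressMap()
ep = m.register("127.0.0.1:7000")
assert m.resolve(ep) == "127.0.0.1:7000"
```

A UUID keeps the ID opaque and collision-free across nodes, which a bare `unsigned` counter cannot guarantee.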


@@ -1,21 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2018-present ScyllaDB
#
#
# SPDX-License-Identifier: AGPL-3.0-or-later
import os
import sys
import argparse
# keep this script just for compatibility.
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Optimize boot parameter settings for Scylla.')
parser.add_argument('--ami', action='store_true', default=False,
help='setup AMI instance')
args = parser.parse_args()
sys.exit(0)


@@ -139,13 +139,8 @@ if __name__ == '__main__':
print('Requires root permission.')
sys.exit(1)
cfg = sysconfig_parser(sysconfdir_p() / 'scylla-server')
ami = cfg.get('AMI')
mode = cfg.get('NETWORK_MODE')
if ami == 'yes' and os.path.exists('/etc/scylla/ami_disabled'):
os.remove('/etc/scylla/ami_disabled')
sys.exit(1)
if mode == 'virtio':
tap = cfg.get('TAP')
user = cfg.get('USER')


@@ -214,7 +214,7 @@ if __name__ == '__main__':
help='skip raid setup')
parser.add_argument('--raid-level-5', action='store_true', default=False,
help='use RAID5 for RAID volume')
parser.add_argument('--online-discard', default=True,
parser.add_argument('--online-discard', default=1, choices=[0, 1], type=int,
help='Configure XFS to discard unused blocks as soon as files are deleted')
parser.add_argument('--nic',
help='specify NIC')
@@ -224,8 +224,6 @@ if __name__ == '__main__':
help='specify swapfile directory (ex: /)')
parser.add_argument('--swap-size', type=int,
help='specify swapfile size in GB')
parser.add_argument('--ami', action='store_true', default=False,
help='setup AMI instance')
parser.add_argument('--setup-nic-and-disks', action='store_true', default=False,
help='optimize NIC and disks')
parser.add_argument('--developer-mode', action='store_true', default=False,
@@ -242,8 +240,6 @@ if __name__ == '__main__':
if is_redhat_variant():
parser.add_argument('--no-selinux-setup', action='store_true', default=False,
help='skip selinux setup')
parser.add_argument('--no-bootparam-setup', action='store_true', default=False,
help='skip bootparam setup')
parser.add_argument('--no-ntp-setup', action='store_true',
default=default_no_ntp_setup,
help='skip ntp setup')
@@ -458,7 +454,7 @@ if __name__ == '__main__':
args.no_raid_setup = not raid_setup
if raid_setup:
level = '5' if raid_level_5 else '0'
run_setup_script('RAID', f'scylla_raid_setup --disks {disks} --enable-on-nextboot --raid-level={level} --online-discard={int(online_discard)}')
run_setup_script('RAID', f'scylla_raid_setup --disks {disks} --enable-on-nextboot --raid-level={level} --online-discard={online_discard}')
coredump_setup = interactive_ask_service('Do you want to enable coredumps?', 'Yes - sets up coredump to allow a post-mortem analysis of the Scylla state just prior to a crash. No - skips this step.', coredump_setup)
args.no_coredump_setup = not coredump_setup
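The `--online-discard` change above is worth spelling out: with the old `action='store_true', default=True` definition the flag could never be switched off from the command line, since both passing and omitting it yield `True`. Making it an explicit 0/1 argument fixes that; a minimal sketch of the new definition:

```python
import argparse

# With action='store_true' and default=True, passing --online-discard keeps
# the value True and omitting it also keeps it True -- there is no way to
# disable it. An explicit int argument restricted to {0, 1} lets callers
# choose either state and pass it through verbatim to scylla_raid_setup.
parser = argparse.ArgumentParser()
parser.add_argument('--online-discard', default=1, choices=[0, 1], type=int)

assert parser.parse_args([]).online_discard == 1
assert parser.parse_args(['--online-discard', '0']).online_discard == 0
```

This also explains the companion change to the f-string: `online_discard` is already an `int`, so the `int(...)` conversion is dropped.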


@@ -35,7 +35,6 @@ if __name__ == '__main__':
disable_writeback_cache = str2bool(cfg.get('DISABLE_WRITEBACK_CACHE'))
else:
disable_writeback_cache = 'no'
ami = str2bool(cfg.get('AMI'))
parser = argparse.ArgumentParser(description='Setting parameters on Scylla sysconfig file.')
parser.add_argument('--nic',
@@ -58,8 +57,6 @@ if __name__ == '__main__':
help='Set enforcing fastest available Linux clocksource')
parser.add_argument('--disable-writeback-cache', action='store_true', default=disable_writeback_cache,
help='Disable disk writeback cache')
parser.add_argument('--ami', action='store_true', default=ami,
help='AMI instance mode')
args = parser.parse_args()
if args.nic and not is_valid_nic(args.nic):
@@ -125,6 +122,4 @@ if __name__ == '__main__':
if cfg.has_option('DISABLE_WRITEBACK_CACHE') and str2bool(cfg.get('DISABLE_WRITEBACK_CACHE')) != args.disable_writeback_cache:
cfg.set('DISABLE_WRITEBACK_CACHE', bool2str(args.disable_writeback_cache))
if str2bool(cfg.get('AMI')) != args.ami:
cfg.set('AMI', bool2str(args.ami))
cfg.commit()
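The read-compare-set-commit pattern above is easy to sketch. `sysconfig_parser` itself is a Scylla helper; the stand-in below only handles simple `KEY=value` lines (no quoting or comments) and is illustrative, not the real implementation:

```python
import tempfile, os

# Minimal stand-in for the sysconfig read/modify/commit pattern used by
# the Scylla setup scripts: parse KEY=value lines, mutate in memory,
# write back on commit().
class SysconfigParser:
    def __init__(self, path):
        self.path = path
        self._values = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, _, value = line.partition('=')
                    self._values[key] = value

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value

    def commit(self):
        with open(self.path, 'w') as f:
            for key, value in self._values.items():
                f.write(f'{key}={value}\n')

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'scylla-server')
    with open(path, 'w') as f:
        f.write('DISABLE_WRITEBACK_CACHE=no\n')
    cfg = SysconfigParser(path)
    if cfg.get('DISABLE_WRITEBACK_CACHE') != 'yes':
        cfg.set('DISABLE_WRITEBACK_CACHE', 'yes')
    cfg.commit()
    assert 'DISABLE_WRITEBACK_CACHE=yes' in open(path).read()
```

Comparing before setting (as the real script does) avoids rewriting the file when nothing changed.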


@@ -43,8 +43,5 @@ SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --netw
## scylla arguments (for dpdk mode)
#SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack native --dpdk-pmd"
# setup as AMI instance
AMI=no
# Disable disk writeback cache
DISABLE_WRITEBACK_CACHE=no


@@ -144,6 +144,25 @@ This monitoring stack is different from DynamoDB's offering - but Scylla's
is significantly more powerful and gives the user better insights on
the internals of the database and its performance.
## Time To Live (TTL)
Like in DynamoDB, Alternator items which are set to expire at a certain
time will not disappear exactly at that time, but only after some delay.
DynamoDB guarantees that the expiration delay will be less than 48 hours
(though for small tables the delay is often much shorter).
In Alternator, the expiration delay is configurable - it can be set
with the `--alternator-ttl-period-in-seconds` configuration option.
The default is 24 hours.
One thing the implementation is missing is that expiration
events appear in the Streams API as normal deletions - without the
distinctive marker on deletions which are really expirations.
See <https://github.com/scylladb/scylla/issues/5060>.
---
## Experimental API features
Some DynamoDB API features are supported by Alternator, but considered
@@ -154,28 +173,11 @@ feature's implementation is still subject to change and upgrades may not be
possible if such a feature is used. For these reasons, experimental features
are not recommended for mission-critical uses, and they need to be
individually enabled with the "--experimental-features" configuration option.
See [Enabling Experimental Features](/operating-scylla/admin#enabling-experimental-features) for details.
In this release, the following DynamoDB API features are considered
experimental:
* DynamoDB's TTL (item expiration) feature is supported, but in this release
still considered experimental and needs to be enabled explicitly with the
`--experimental-features=alternator-ttl` configuration option.
The experimental implementation is mostly complete, but not throughly
tested or optimized.
Like in DynamoDB, Alternator items which are set to expire at a certain
time will not disappear exactly at that time, but only after some delay.
DynamoDB guarantees that the expiration delay will be less than 48 hours
(though for small tables the delay is often much shorter). In Alternator,
the expiration delay is configurable - it defaults to 24 hours but can
be set with the `--alternator-ttl-period-in-seconds` configuration option.
One thing that this implementation is still missing is that expiration
events appear in the Streams API as normal deletions - without the
distinctive marker on deletions which are really expirations.
<https://github.com/scylladb/scylla/issues/5060>
* The DynamoDB Streams API for capturing change is supported, but still
considered experimental so needs to be enabled explicitly with the
`--experimental-features=alternator-streams` configuration option.


@@ -4,95 +4,231 @@ Raft Consensus Algorithm in ScyllaDB
Introduction
--------------
ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB 5.0 is turning to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
ScyllaDB was originally designed, following Apache Cassandra, to use gossip for topology and schema updates and the Paxos consensus algorithm for
strong data consistency (:doc:`LWT </using-scylla/lwt>`). To achieve stronger consistency without performance penalty, ScyllaDB 5.x has turned to Raft - a consensus algorithm designed as an alternative to both gossip and Paxos.
Raft is a consensus algorithm that implements a distributed, consistent, replicated log across members (nodes). Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
Raft uses a heartbeat mechanism to trigger a leader election. All servers start as followers and remain in the follower state as long as they receive valid RPCs (heartbeats) from a leader or candidate. A leader sends periodic heartbeats to all followers to maintain its authority (leadership). Suppose a follower receives no communication over a period called the election timeout. In that case, it assumes there is no viable leader and begins an election to choose a new leader.
Leader selection is described in detail in the `raft paper <https://raft.github.io/raft.pdf>`_.
Leader selection is described in detail in the `Raft paper <https://raft.github.io/raft.pdf>`_.
Scylla 5.0 uses Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of a node or network failures.
ScyllaDB 5.x may use Raft to maintain schema updates in every node (see below). Any schema update, like ALTER, CREATE or DROP TABLE, is first committed as an entry in the replicated Raft log, and, once stored on most replicas, applied to all nodes **in the same order**, even in the face of node or network failures.
Following Scylla 5.x releases will use Raft to guarantee consistent topology updates similarly.
Following ScyllaDB 5.x releases will use Raft to guarantee consistent topology updates similarly.
.. _raft-quorum-requirement:
Quorum Requirement
-------------------
Raft requires at least a quorum of nodes in a cluster to be available. If multiple nodes fail
and the quorum is lost, the cluster is unavailable for schema updates. See :ref:`Handling Failures <raft-handliing-failures>`
Raft requires at least a quorum of nodes in a cluster to be available. If multiple nodes fail
and the quorum is lost, the cluster is unavailable for schema updates. See :ref:`Handling Failures <raft-handling-failures>`
for information on how to handle failures.
Upgrade Considerations for ScyllaDB 5.0 and Later
==================================================
Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one
Note that when you have a two-DC cluster with the same number of nodes in each DC, the cluster will lose the quorum if one
of the DCs is down.
**We recommend configuring three DCs per cluster to ensure that the cluster remains available and operational when one DC is down.**
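The arithmetic behind that recommendation is simple majority counting; a quick sketch of why a symmetric two-DC cluster cannot survive a DC outage while three DCs can:

```python
# Raft needs a strict majority of all nodes to make progress.
def has_quorum(total_nodes, live_nodes):
    return live_nodes > total_nodes // 2

# 2 DCs x 3 nodes: losing one DC leaves 3 of 6 -- no majority.
assert not has_quorum(6, 3)
# 3 DCs x 3 nodes: losing one DC leaves 6 of 9 -- majority preserved.
assert has_quorum(9, 6)
```
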
Enabling Raft
---------------
Enabling Raft in ScyllaDB 5.0
===============================
.. note::
In ScyllaDB 5.0:
* Raft is an experimental feature.
* Raft implementation only covers safe schema changes. See :ref:`Safe Schema Changes with Raft <raft-schema-changes>`.
If you are creating a new cluster, add ``raft`` to the list of experimental features in your ``scylla.yaml`` file:
.. code-block:: yaml
experimental_features:
- raft
If you upgrade to ScyllaDB 5.0 from an earlier version, perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>`
updating the ``scylla.yaml`` file for **each node** in the cluster to enable the experimental Raft feature:
.. code-block:: yaml
experimental_features:
- raft
When all the nodes in the cluster are updated and restarted, the cluster will begin to use Raft for schema changes.
.. warning::
Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.
Verifying that Raft Is Enabled
Enabling Raft in ScyllaDB 5.0 and 5.1
=====================================
.. warning::
In ScyllaDB 5.0 and 5.1, Raft is an experimental feature.
Use Raft only for testing and experimentation in clusters which can be thrown away.
It is not possible to enable Raft in an existing cluster in ScyllaDB 5.0 and 5.1.
In order to have a Raft-enabled cluster in these versions, you must create a new cluster with Raft enabled from the start.
.. warning::
**Do not** use Raft in production clusters in ScyllaDB 5.0 and 5.1. Such clusters won't be able to correctly upgrade to ScyllaDB 5.2.
When creating a new cluster, add ``raft`` to the list of experimental features in your ``scylla.yaml`` file:
.. code-block:: yaml
experimental_features:
- raft
.. _enabling-raft-existing-cluster:
Enabling Raft in ScyllaDB 5.2 and further
=========================================
.. TODO include enterprise versions in this documentation
.. note::
In ScyllaDB 5.2, Raft is Generally Available and can be safely used for consistent schema management.
In ScyllaDB 5.3 it will become enabled by default.
In further versions it will be mandatory.
ScyllaDB 5.2 and later come equipped with a procedure that can set up Raft-based consistent cluster management in an existing cluster. We refer to this as the **internal Raft upgrade procedure** (not to be confused with the :doc:`ScyllaDB version upgrade procedure </upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic>`).
.. warning::
Once enabled, Raft cannot be disabled on your cluster. The cluster nodes will fail to restart if you remove the Raft feature.
To enable Raft in an existing cluster in Scylla 5.2 and beyond:
* ensure that the schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and ensuring that the schema version is the same on all nodes,
* then perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>`, updating the ``scylla.yaml`` file for **each node** in the cluster before restarting it to enable the ``consistent_cluster_management`` flag:
.. code-block:: yaml
consistent_cluster_management: true
When all the nodes in the cluster are updated and restarted, the cluster will start the **internal Raft upgrade procedure**.
**You must then verify** that the internal Raft upgrade procedure has finished successfully. Refer to the :ref:`next section <verify-raft-procedure>`.
You can also enable the ``consistent_cluster_management`` flag while performing :doc:`rolling upgrade from 5.1 to 5.2 </upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/upgrade-guide-from-5.1-to-5.2-generic>`: update ``scylla.yaml`` before restarting each node. The internal Raft upgrade procedure will start as soon as the last node has been upgraded and restarted. As above, this requires :ref:`verifying <verify-raft-procedure>` that this internal procedure successfully finishes.
Finally, you can enable the ``consistent_cluster_management`` flag when creating a new cluster. This does not use the internal Raft upgrade procedure; instead, Raft is functioning in the cluster and managing schema right from the start.
Until all nodes are restarted with ``consistent_cluster_management: true``, it is still possible to turn this option back off. Once enabled on every node, it must remain turned on (or the node will refuse to restart).
.. _verify-raft-procedure:
Verifying that the internal Raft upgrade procedure finished successfully
========================================================================
.. versionadded:: 5.2
The internal Raft upgrade procedure starts as soon as every node in the cluster restarts with ``consistent_cluster_management`` flag enabled in ``scylla.yaml``.
.. TODO: update the above sentence once 5.3 and later are released.
The procedure requires **full cluster availability** to correctly set up the Raft algorithm; after the setup finishes, Raft can proceed with only a majority of nodes, but this initial setup is an exception.
An unlucky event, such as a hardware failure, may cause one of your nodes to fail. If this happens before the internal Raft upgrade procedure finishes, the procedure will get stuck and your intervention will be required.
To verify that the procedure has finished, check the log of every Scylla node (using ``journalctl _COMM=scylla``). Search for the following patterns:
* ``Starting internal upgrade-to-raft procedure`` denotes the start of the procedure,
* ``Raft upgrade finished`` denotes the end.
The following is an example of a log from a node which went through the procedure correctly. Some parts were truncated for brevity:
.. code-block:: console
features - Feature SUPPORTS_RAFT_CLUSTER_MANAGEMENT is enabled
raft_group0 - finish_setup_after_join: SUPPORTS_RAFT feature enabled. Starting internal upgrade-to-raft procedure.
raft_group0_upgrade - starting in `use_pre_raft_procedures` state.
raft_group0_upgrade - Waiting until everyone is ready to start upgrade...
raft_group0_upgrade - Joining group 0...
raft_group0 - server 624fa080-8c0e-4e3d-acf6-10af473639ca joined group 0 with group id 8f8a1870-5c4e-11ed-bb13-fe59693a23c9
raft_group0_upgrade - Waiting until every peer has joined Raft group 0...
raft_group0_upgrade - Every peer is a member of Raft group 0.
raft_group0_upgrade - Waiting for schema to synchronize across all nodes in group 0...
raft_group0_upgrade - synchronize_schema: my version: a37a3b1e-5251-3632-b6b4-a9468a279834
raft_group0_upgrade - synchronize_schema: schema mismatches: {}. 3 nodes had a matching version.
raft_group0_upgrade - synchronize_schema: finished.
raft_group0_upgrade - Entering synchronize state.
raft_group0_upgrade - Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.
raft_group0_upgrade - Waiting for all peers to enter synchronize state...
raft_group0_upgrade - All peers in synchronize state. Waiting for schema to synchronize...
raft_group0_upgrade - synchronize_schema: collecting schema versions from group 0 members...
raft_group0_upgrade - synchronize_schema: collected remote schema versions.
raft_group0_upgrade - synchronize_schema: my version: a37a3b1e-5251-3632-b6b4-a9468a279834
raft_group0_upgrade - synchronize_schema: schema mismatches: {}. 3 nodes had a matching version.
raft_group0_upgrade - synchronize_schema: finished.
raft_group0_upgrade - Schema synchronized.
raft_group0_upgrade - Raft upgrade finished.
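The start and end markers above can also be checked mechanically. A minimal sketch (illustrative, not a supported tool) that classifies a node's journal output by the two patterns named in the text:

```python
# Classify a node's upgrade progress from its journal text
# (as captured from `journalctl _COMM=scylla`).
def raft_upgrade_state(log_text):
    if 'Raft upgrade finished' in log_text:
        return 'finished'
    if 'Starting internal upgrade-to-raft procedure' in log_text:
        return 'in-progress'
    return 'not-started'

log = ("raft_group0 - finish_setup_after_join: SUPPORTS_RAFT feature enabled. "
       "Starting internal upgrade-to-raft procedure.\n"
       "raft_group0_upgrade - Raft upgrade finished.\n")
assert raft_upgrade_state(log) == 'finished'
```

Run such a check against every node: the procedure is complete only when each node reports `finished`.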
In a functioning cluster with good network connectivity the procedure should take no more than a few seconds.
Network issues may cause the procedure to take longer, but if all nodes are alive and the network is eventually functional (each pair of nodes is eventually connected), the procedure will eventually finish.
Note the following message, which appears in the log presented above:
.. code-block:: console
Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.
During the procedure, there is a brief window during which schema changes are disabled. This is when the schema change mechanism switches from the older, unsafe algorithm to the safe Raft-based algorithm. If everything runs smoothly, this window will be unnoticeable; the procedure is designed to minimize its length. However, if the procedure gets stuck, e.g. due to a network connectivity problem, ScyllaDB will return the following error when trying to perform a schema change during this window:
.. code-block:: console
Cannot perform schema or topology changes during this time; the cluster is currently upgrading to use Raft for schema operations.
If this error keeps happening, check the logs of your nodes to learn the state of the upgrade. The upgrade procedure may get stuck
if there was a node failure.
In the next example, one of the nodes had a power outage before the procedure could finish. The following shows a part of another node's logs:
.. code-block:: console
raft_group0_upgrade - Entering synchronize state.
raft_group0_upgrade - Schema changes are disabled in synchronize state. If a failure makes us unable to proceed, manual recovery will be required.
raft_group0_upgrade - Waiting for all peers to enter synchronize state...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.3 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...
...
raft_group0_upgrade - Raft upgrade procedure taking longer than expected. Please check if all nodes are live and the network is healthy. If the upgrade procedure does not progress even though the cluster is healthy, try performing a rolling restart of the cluster. If that doesn't help or some nodes are dead and irrecoverable, manual recovery may be required. Consult the relevant documentation.
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: node 127.90.69.1 not in synchronize state yet...
raft_group0_upgrade - wait_for_peers_to_enter_synchronize_state: retrying in a while...
.. TODO: the 'Consult the relevant documentation' message must be updated to point to this doc.
Note the following message:
.. code-block:: console
raft_group0_upgrade - Raft upgrade procedure taking longer than expected. Please check if all nodes are live and the network is healthy. If the upgrade procedure does not progress even though the cluster is healthy, try performing a rolling restart of the cluster. If that doesn't help or some nodes are dead and irrecoverable, manual recovery may be required. Consult the relevant documentation.
If the Raft upgrade procedure is stuck, this message will appear periodically in each node's logs.
The message suggests the initial course of action:
* Check if all nodes are alive.
* If a node is down but can be restarted, restart it.
* If all nodes are alive, ensure that the network is healthy: that every node is reachable from every other node.
* If all nodes are alive and the network is healthy, perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of the cluster.
One of the reasons why the procedure may get stuck is a pre-existing problem in schema definitions which causes schema to be unable to synchronize in the cluster. The procedure cannot proceed unless it ensures that schema is synchronized.
If **all nodes are alive and the network is healthy**, you performed a rolling restart, but the issue still persists, contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
If some nodes are **dead and irrecoverable**, you'll need to perform a manual recovery procedure. Consult :ref:`the section about Raft recovery <recover-raft-procedure>`.
Verifying that Raft is enabled
===============================
You can verify that Raft is enabled on your cluster in one of the following ways:
* Retrieve the list of supported features by running:
.. code-block:: sql
cqlsh> SELECT supported_features FROM system.local;
With Raft enabled, the list of supported features in the output includes ``SUPPORTS_RAFT_CLUSTER_MANAGEMENT``. For example:
.. code-block:: console
supported_features
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CDC,CDC_GENERATIONS_V2,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_TABLES_V3,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,UDA,UNBOUNDED_RANGE_TOMBSTONES,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
* Retrieve the list of experimental features by running:
.. code-block:: sql
cqlsh> SELECT value FROM system.config WHERE name = 'experimental_features'
on every node.
With Raft enabled, the list of experimental features in the output includes ``raft``.
.. versionadded:: 5.2
You can verify that Raft is enabled on your cluster by performing the following query on each node:
.. code-block:: sql
cqlsh> SELECT * FROM system.scylla_local WHERE key = 'group0_upgrade_state';
The query should return:
.. code-block:: console
:class: hide-copy-button
key | value
----------------------+--------------------------
group0_upgrade_state | use_post_raft_procedures
(1 rows)
If the query returns 0 rows, or ``value`` is ``synchronize`` or ``use_pre_raft_procedures``, it means that the cluster is in the middle of the internal Raft upgrade procedure; consult the :ref:`relevant section <verify-raft-procedure>`.
If ``value`` is ``recovery``, it means that the cluster is in the middle of the manual recovery procedure. The procedure must be finished. Consult :ref:`the section about Raft recovery <recover-raft-procedure>`.
If ``value`` is anything else, it might mean data corruption or a mistake when performing the manual recovery procedure. The value will be treated as if it was equal to ``recovery`` when the node is restarted.
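The decision table spelled out above for the value of ``group0_upgrade_state`` can be condensed into a small helper (an illustrative sketch, not a supported tool):

```python
# Interpret the value of group0_upgrade_state from system.scylla_local.
# None models the "query returns 0 rows" case described above.
def interpret_upgrade_state(value):
    if value is None or value in ('synchronize', 'use_pre_raft_procedures'):
        return 'internal Raft upgrade procedure still in progress'
    if value == 'use_post_raft_procedures':
        return 'Raft enabled'
    if value == 'recovery':
        return 'manual recovery procedure in progress'
    return 'unexpected value: treated as recovery on restart'

assert interpret_upgrade_state('use_post_raft_procedures') == 'Raft enabled'
assert interpret_upgrade_state(None).startswith('internal Raft upgrade')
assert interpret_upgrade_state('recovery') == 'manual recovery procedure in progress'
```
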
.. _raft-schema-changes:
@@ -100,23 +236,23 @@ Safe Schema Changes with Raft
-------------------------------
In ScyllaDB, schema is based on :doc:`Data Definition Language (DDL) </cql/ddl>`. In earlier ScyllaDB versions, schema changes were tracked via the gossip protocol, which might lead to schema conflicts if the updates are happening concurrently.
Implementing Raft eliminates schema conflicts and allows full automation of DDL changes under any conditions, as long as a quorum
Implementing Raft eliminates schema conflicts and allows full automation of DDL changes under any conditions, as long as a quorum
of nodes in the cluster is available. The following examples illustrate how Raft provides the solution to problems with schema changes.
* A network partition may lead to a split-brain case, where each subset of nodes has a different version of the schema.
With Raft, after a network split, the majority of the cluster can continue performing schema changes, while the minority needs to wait until it can rejoin the majority. Data manipulation statements on the minority can continue unaffected, provided the :ref:`quorum requirement <raft-quorum-requirement>` is satisfied.
* Two or more conflicting schema updates are happening at the same time. For example, two different columns with the same definition are simultaneously added to the cluster. There is no effective way to resolve the conflict - the cluster will employ the schema with the most recent timestamp, but changes related to the shadowed table will be lost.
* Two or more conflicting schema updates are happening at the same time. For example, two different columns with the same definition are simultaneously added to the cluster. There is no effective way to resolve the conflict - the cluster will employ the schema with the most recent timestamp, but changes related to the shadowed table will be lost.
With Raft, concurrent schema changes are safe.
With Raft, concurrent schema changes are safe.
In summary, Raft makes schema changes safe, but it requires that a quorum of nodes in the cluster is available.
.. _raft-handliing-failures:
.. _raft-handling-failures:
Handling Failures
------------------
@@ -141,10 +277,10 @@ Examples
- Try restarting the node. If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
* - 2 nodes
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Restart at least 1 of the 2 nodes that are down to regain quorum. If you can't recover at least 1 of the 2 nodes, contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
- Restart at least 1 of the 2 nodes that are down to regain quorum. If you can't recover at least 1 of the 2 nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
* - 1 datacenter
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- When the DC comes back online, restart the nodes. If the DC does not come back online and nodes are lost, :doc:`restore the latest cluster backup into a new cluster </operating-scylla/procedures/backup-restore/restore/>`. You can contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
- When the DC comes back online, restart the nodes. If the DC does not come back online and nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
.. list-table:: Cluster B: 2 datacenters, 6 nodes (3 nodes per DC)
@@ -159,10 +295,10 @@ Examples
- Try restarting the node(s). If the node is dead, :doc:`replace it with a new node </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
* - 3 nodes
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- Restart 1 of the 3 nodes that are down to regain quorum. If you can't recover at least 1 of the 3 failed nodes, contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
- Restart 1 of the 3 nodes that are down to regain quorum. If you can't recover at least 1 of the 3 failed nodes, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
* - 1DC
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- When the DCs come back online, restart the nodes. If the DC fails to come back online and the nodes are lost, :doc:`restore the latest cluster backup into a new cluster </operating-scylla/procedures/backup-restore/restore/>`. You can contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
- When the DCs come back online, restart the nodes. If the DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
.. list-table:: Cluster C: 3 datacenter, 9 nodes (3 nodes per DC)
@@ -175,13 +311,78 @@ Examples
* - 1-4 nodes
- Schema updates are possible and safe.
- Try restarting the nodes. If the nodes are dead, :doc:`replace them with new nodes </operating-scylla/procedures/cluster-management/replace-dead-node-or-more/>`.
* - 1 DC
* - 1 DC
- Schema updates are possible and safe.
- When the DC comes back online, try restarting the nodes in the cluster. If the nodes are dead, :doc:`add 3 new nodes in a new region </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
* - 2 DCs
- Cluster is not fully operational. The data is available for reads and writes, but schema changes are impossible.
- When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, :doc:`restore the latest cluster backup into a new cluster </operating-scylla/procedures/backup-restore/restore/>`. You can contact `ScyllaDB support <https://www.scylladb.com/product/support/>`_ for assistance.
- When the DCs come back online, restart the nodes. If at least one DC fails to come back online and the nodes are lost, consult the :ref:`manual Raft recovery section <recover-raft-procedure>`.
.. _recover-raft-procedure:
Raft manual recovery procedure
==============================
.. versionadded:: 5.2
The manual Raft recovery procedure applies to the following situations:
* :ref:`The internal Raft upgrade procedure <verify-raft-procedure>` got stuck because one of your nodes failed in the middle of the procedure and is irrecoverable,
* or the cluster was running Raft but a majority of nodes (e.g. 2 out of 3) failed and are irrecoverable. Raft cannot progress unless a majority of nodes is available.
.. warning::
Perform the manual recovery procedure **only** if you're dealing with **irrecoverable** nodes. If it is possible to restart your nodes, do that instead of manual recovery.
.. warning::
Before proceeding, make sure that the irrecoverable nodes are truly dead, and not, for example, temporarily partitioned away due to a network failure. If it is possible for the 'dead' nodes to come back to life, they might communicate and interfere with the recovery procedure and cause unpredictable problems.
If you have no means of ensuring that these irrecoverable nodes won't come back to life and communicate with the rest of the cluster, set up firewall rules or otherwise isolate your alive nodes to reject any communication attempts from these dead nodes.
During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove all faulty nodes (using the standard :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`), delete the internal Raft data, and restart the cluster. This will cause the cluster to perform the internal Raft upgrade procedure again, initializing the Raft algorithm from scratch. The manual recovery procedure is applicable both to clusters which were not running Raft in the past and then had Raft enabled, and to clusters which were bootstrapped using Raft.
.. warning::
Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while some nodes are already dead may lead to unavailability of data queries (if you haven't lost availability already). For example, if you're using the standard RF=3, CL=QUORUM setup, and you're recovering from a stuck upgrade procedure because one of your nodes is dead, restarting another node will cause temporary data query unavailability (until the node finishes restarting). Prepare your service for downtime before proceeding.
#. Perform the following query on **every alive node** in the cluster, using e.g. ``cqlsh``:
.. code-block:: cql
cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
#. Verify that all the nodes have entered ``RECOVERY`` mode when restarting; look for one of the following messages in their logs:
.. code-block:: console
group0_client - RECOVERY mode.
raft_group0 - setup_group0: Raft RECOVERY mode, skipping group 0 setup.
raft_group0_upgrade - RECOVERY mode. Not attempting upgrade.
#. Remove all your dead nodes using the :doc:`node removal procedure </operating-scylla/procedures/cluster-management/remove-node/>`.
#. Remove existing Raft cluster data by performing the following queries on **every alive node** in the cluster, using e.g. ``cqlsh``:
.. code-block:: cql
cqlsh> TRUNCATE TABLE system.discovery;
cqlsh> TRUNCATE TABLE system.group0_history;
cqlsh> DELETE value FROM system.scylla_local WHERE key = 'raft_group0_id';
#. Make sure that schema is synchronized in the cluster by executing :doc:`nodetool describecluster </operating-scylla/nodetool-commands/describecluster>` on each node and verifying that the schema version is the same on all nodes.
#. You can now leave ``RECOVERY`` mode. On **every alive node**, perform the following query:
.. code-block:: cql
cqlsh> DELETE FROM system.scylla_local WHERE key = 'group0_upgrade_state';
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of your alive nodes.
#. The Raft upgrade procedure will start anew. :ref:`Verify <verify-raft-procedure>` that it finishes successfully.
.. _raft-learn-more:

View File

@@ -13,7 +13,7 @@ sys.path.insert(0, os.path.abspath(".."))
# Build documentation for the following tags and branches
TAGS = []
BRANCHES = ["master"]
BRANCHES = ["master", "branch-5.1"]
# Set the latest version.
LATEST_VERSION = "master"
# Set which versions are not released yet.

View File

@@ -255,7 +255,9 @@ The following options only apply to IncrementalCompactionStrategy:
``space_amplification_goal`` (default: null)
.. versionadded:: 2020.1.6 Scylla Enterprise
:label-tip:`ScyllaDB Enterprise`
.. versionadded:: 2020.1.6
This is a threshold of the ratio of the sum of the sizes of the two largest tiers to the size of the largest tier,
above which ICS will automatically compact the second largest and largest tiers together to eliminate stale data that may have been overwritten, expired, or deleted.

View File

@@ -860,6 +860,18 @@ Other considerations:
- Adding new columns (see ``ALTER TABLE`` below) is a constant time operation. There is thus no need to try to
anticipate future usage when creating a table.
.. _ddl-per-parition-rate-limit:
Limiting the rate of requests per partition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can limit the read rates and writes rates into a partition by applying
a ScyllaDB CQL extension to the CREATE TABLE or ALTER TABLE statements.
See `Per-partition rate limit <https://docs.scylladb.com/stable/cql/cql-extensions.html#per-partition-rate-limit>`_
for details.
.. REMOVE IN FUTURE VERSIONS - Remove the URL above (temporary solution) and replace it with a relative link (once the solution is applied).
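For illustration, the limit is applied through the ``WITH`` clause of ``CREATE TABLE`` or ``ALTER TABLE``. The sketch below uses a hypothetical table ``ks.t``; confirm the exact option names against the extension page linked above:

.. code-block:: cql

    ALTER TABLE ks.t WITH per_partition_rate_limit = {
        'max_reads_per_second': 100,
        'max_writes_per_second': 50
    };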
.. _alter-table-statement:
ALTER TABLE
@@ -918,6 +930,7 @@ The ``ALTER TABLE`` statement can:
The same note applies to the set of ``compression`` sub-options.
- Change or add any of the ``Encryption options`` above.
- Change or add any of the :ref:`CDC options <cdc-options>` above.
- Change or add per-partition rate limits. See :ref:`Limiting the rate of requests per partition <ddl-per-parition-rate-limit>`.
.. warning:: Dropping a column assumes that the timestamps used for the value of this column are "real" timestamp in
microseconds. Using "real" timestamps in microseconds is the default and is **strongly** recommended, but as
@@ -927,7 +940,6 @@ The ``ALTER TABLE`` statement can:
.. warning:: Once a column is dropped, it is allowed to re-add a column with the same name as the dropped one
**unless** the type of the dropped column was a (non-frozen) collection (due to an internal technical limitation).
.. _drop-table-statement:
DROP TABLE

View File

@@ -142,7 +142,7 @@ You can read more about the ``TIMESTAMP`` retrieved by ``WRITETIME`` in the :ref
- ``TTL`` retrieves the remaining time to live (in *seconds*) for the value of the column, if it is set to expire, or ``null`` otherwise.
You can read more about TTL in the :doc:`documentation </cql/time-to-live>` and also in `this Scylla University lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`.
You can read more about TTL in the :doc:`documentation </cql/time-to-live>` and also in `this Scylla University lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`_.
.. _where-clause:
@@ -774,7 +774,7 @@ parameters:
the columns themselves. This means that any subsequent update of the column will also reset the TTL (to whatever TTL
is specified in that update). By default, values never expire. A TTL of 0 is equivalent to no TTL. If the table has a
default_time_to_live, a TTL of 0 will remove the TTL for the inserted or updated values. A TTL of ``null`` is equivalent
to inserting with a TTL of 0. You can read more about TTL in the :doc:`documentation </cql/time-to-live>` and also in `this Scylla University lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`.
to inserting with a TTL of 0. You can read more about TTL in the :doc:`documentation </cql/time-to-live>` and also in `this Scylla University lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`_.
- ``TIMEOUT``: specifies a timeout duration for the specific request.
Please refer to the :ref:`SELECT <using-timeout>` section for more information.
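The TTL rules above can be sketched as follows (``ks.t``, ``pk``, and ``v`` are hypothetical):

.. code-block:: cql

    -- expire the inserted value after one day (86400 seconds)
    INSERT INTO ks.t (pk, v) VALUES (1, 'a') USING TTL 86400;

    -- TTL 0 (or null) disables expiry, overriding any table-level
    -- default_time_to_live for the updated value
    UPDATE ks.t USING TTL 0 SET v = 'b' WHERE pk = 1;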

View File

@@ -21,7 +21,6 @@
.. _cql-functions:
.. Need some intro for UDF and native functions in general and point those to it.
.. _udfs:
.. _native-functions:
Functions
@@ -33,13 +32,15 @@ CQL supports two main categories of functions:
- The :ref:`aggregate functions <aggregate-functions>`, which are used to aggregate multiple rows of results from a
``SELECT`` statement.
.. In both cases, CQL provides a number of native "hard-coded" functions as well as the ability to create new user-defined
.. functions.
In both cases, CQL provides a number of native "hard-coded" functions as well as the ability to create new user-defined
functions.
.. .. note:: By default, the use of user-defined functions is disabled by default for security concerns (even when
.. enabled, the execution of user-defined functions is sandboxed and a "rogue" function should not be allowed to do
.. evil, but no sandbox is perfect so using user-defined functions is opt-in). See the ``enable_user_defined_functions``
.. in ``scylla.yaml`` to enable them.
.. note:: Although user-defined functions are sandboxed, protecting the system from a "rogue" function, user-defined functions are disabled by default for extra security.
See the ``enable_user_defined_functions`` in ``scylla.yaml`` to enable them.
Additionally, user-defined functions are still experimental and need to be explicitly enabled by adding ``udf`` to the list of
``experimental_features`` configuration options in ``scylla.yaml``, or turning on the ``experimental`` flag.
See :ref:`Enabling Experimental Features <yaml_enabling_experimental_features>` for details.
.. A function is identified by its name:
@@ -60,11 +61,11 @@ Native functions
Cast
````
Supported starting from Scylla version 2.1
Supported starting from ScyllaDB version 2.1
The ``cast`` function can be used to convert one native datatype to another.
The following table describes the conversions supported by the ``cast`` function. Scylla will silently ignore any cast converting a cast datatype into its own datatype.
The following table describes the conversions supported by the ``cast`` function. ScyllaDB will silently ignore any cast converting a datatype into its own datatype.
=============== =======================================================================================================
From To
@@ -228,6 +229,65 @@ A number of functions are provided to “convert” the native types into binary
takes a 64-bit ``blob`` argument and converts it to a ``bigint`` value. For example, ``bigintAsBlob(3)`` is
``0x0000000000000003`` and ``blobAsBigint(0x0000000000000003)`` is ``3``.
.. _udfs:
User-defined functions :label-caution:`Experimental`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User-defined functions (UDFs) execute user-provided code in ScyllaDB. Supported languages are currently Lua and WebAssembly.
UDFs are part of the ScyllaDB schema and are automatically propagated to all nodes in the cluster.
UDFs can be overloaded, so that multiple UDFs with different argument types can have the same function name, for example::
CREATE FUNCTION sample ( arg int ) ...;
CREATE FUNCTION sample ( arg text ) ...;
When calling a user-defined function, arguments can be literals or terms. Prepared statement placeholders can be used, too.
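Building on the overloaded ``sample`` functions above, calls might look like this (``ks.t`` and ``int_col`` are hypothetical names):

.. code-block:: cql

    SELECT sample(42) FROM ks.t;        -- literal argument
    SELECT sample(int_col) FROM ks.t;   -- term referencing a column
    SELECT sample(?) FROM ks.t;         -- prepared statement placeholder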
CREATE FUNCTION statement
`````````````````````````
Creating a new user-defined function uses the ``CREATE FUNCTION`` statement. For example::
CREATE OR REPLACE FUNCTION div(dividend double, divisor double)
RETURNS NULL ON NULL INPUT
RETURNS double
LANGUAGE LUA
AS 'return dividend/divisor;';
``CREATE FUNCTION`` with the optional ``OR REPLACE`` keywords creates either a function
or replaces an existing one with the same signature. A ``CREATE FUNCTION`` without ``OR REPLACE``
fails if a function with the same signature already exists. If the optional ``IF NOT EXISTS``
keywords are used, the function will be created only if another function with the same
signature does not exist. ``OR REPLACE`` and ``IF NOT EXISTS`` cannot be used together.
Behavior for null input values must be defined for each function:
* ``RETURNS NULL ON NULL INPUT`` declares that the function will always return null (without being executed) if any of the input arguments is null.
* ``CALLED ON NULL INPUT`` declares that the function will always be executed.
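As a sketch of the difference: the ``div`` function above is declared ``RETURNS NULL ON NULL INPUT``, so a null argument yields null without the Lua body running. A ``CALLED ON NULL INPUT`` variant must handle nulls itself (``div_or_zero`` is a hypothetical name):

.. code-block:: cql

    CREATE OR REPLACE FUNCTION div_or_zero(dividend double, divisor double)
        CALLED ON NULL INPUT
        RETURNS double
        LANGUAGE LUA
        AS 'if divisor == nil then return 0 end return dividend/divisor;';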
Function Signature
``````````````````
Signatures are used to distinguish individual functions. The signature consists of a fully-qualified function name of the form ``<keyspace>.<function_name>`` and a concatenated list of all the argument types.
Note that keyspace names, function names and argument types are subject to the default naming conventions and case-sensitivity rules.
Functions belong to a keyspace; if no keyspace is specified, the current keyspace is used. User-defined functions are not allowed in the system keyspaces.
DROP FUNCTION statement
```````````````````````
Dropping a function uses the ``DROP FUNCTION`` statement. For example::
DROP FUNCTION myfunction;
DROP FUNCTION mykeyspace.afunction;
DROP FUNCTION afunction ( int );
DROP FUNCTION afunction ( text );
You must specify the argument types of the function (its arguments signature) in the drop command if there are multiple overloaded functions with the same name but different signatures.
``DROP FUNCTION`` with the optional ``IF EXISTS`` keywords drops a function if it exists, but does not throw an error if it doesn't.
.. _aggregate-functions:
Aggregate functions
@@ -290,6 +350,59 @@ instance::
.. _user-defined-aggregates-functions:
User-defined aggregates (UDAs) :label-caution:`Experimental`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User-defined aggregates allow the creation of custom aggregate functions. User-defined aggregates can be used in a ``SELECT`` statement.
Each aggregate requires an initial state of type ``STYPE`` defined with the ``INITCOND`` value (default value: ``null``). The first argument of the state function must have type ``STYPE``. The remaining arguments of the state function must match the types of the user-defined aggregate arguments. The state function is called once for each row, and the value returned by the state function becomes the new state. After all rows are processed, the optional ``FINALFUNC`` is executed with the last state value as its argument.
The ``STYPE`` value is mandatory in order to distinguish possibly overloaded versions of the state and/or final function, since the overload can appear after creation of the aggregate.
A complete working example for user-defined aggregates (assuming that a keyspace has been selected using the ``USE`` statement)::
CREATE FUNCTION accumulate_len(acc tuple<bigint,bigint>, a text)
RETURNS NULL ON NULL INPUT
RETURNS tuple<bigint,bigint>
LANGUAGE lua as 'return {acc[1] + 1, acc[2] + #a}';
CREATE OR REPLACE FUNCTION present(res tuple<bigint,bigint>)
RETURNS NULL ON NULL INPUT
RETURNS text
LANGUAGE lua as
'return "The average string length is " .. res[2]/res[1] .. "!"';
CREATE OR REPLACE AGGREGATE avg_length(text)
SFUNC accumulate_len
STYPE tuple<bigint,bigint>
FINALFUNC present
INITCOND (0,0);
CREATE AGGREGATE statement
``````````````````````````
The ``CREATE AGGREGATE`` command with the optional ``OR REPLACE`` keywords creates either an aggregate or replaces an existing one with the same signature. A ``CREATE AGGREGATE`` without ``OR REPLACE`` fails if an aggregate with the same signature already exists. The ``CREATE AGGREGATE`` command with the optional ``IF NOT EXISTS`` keywords creates an aggregate if it does not already exist. The ``OR REPLACE`` and ``IF NOT EXISTS`` phrases cannot be used together.
The ``STYPE`` value defines the type of the state value and must be specified. The optional ``INITCOND`` defines the initial state value for the aggregate; the default value is null. A non-null ``INITCOND`` must be specified for state functions that are declared with ``RETURNS NULL ON NULL INPUT``.
The ``SFUNC`` value references an existing function to use as the state-modifying function. The first argument of the state function must have type ``STYPE``. The remaining arguments of the state function must match the types of the user-defined aggregate arguments. The state function is called once for each row, and the value returned by the state function becomes the new state. State is not updated for state functions declared with ``RETURNS NULL ON NULL INPUT`` and called with null. After all rows are processed, the optional ``FINALFUNC`` is executed with the last state value as its argument. It must take only one argument with type ``STYPE``, but the return type of the ``FINALFUNC`` may be a different type. A final function declared with ``RETURNS NULL ON NULL INPUT`` means that the aggregate's return value will be null if the last state is null.
If no ``FINALFUNC`` is defined, the overall return type of the aggregate function is ``STYPE``. If a ``FINALFUNC`` is defined, it is the return type of that function.
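For instance, a variant of the ``avg_length`` aggregate above that omits ``FINALFUNC`` would return its raw state (a hedged sketch; ``len_stats``, ``name``, and ``ks.t`` are hypothetical names):

.. code-block:: cql

    CREATE OR REPLACE AGGREGATE len_stats(text)
        SFUNC accumulate_len
        STYPE tuple<bigint,bigint>
        INITCOND (0,0);

    -- with no FINALFUNC, the result type is STYPE: tuple<bigint,bigint>
    SELECT len_stats(name) FROM ks.t;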
DROP AGGREGATE statement
````````````````````````
Dropping a user-defined aggregate function uses the ``DROP AGGREGATE`` statement. For example::
DROP AGGREGATE myAggregate;
DROP AGGREGATE myKeyspace.anAggregate;
DROP AGGREGATE someAggregate ( int );
DROP AGGREGATE someAggregate ( text );
The ``DROP AGGREGATE`` statement removes an aggregate created using ``CREATE AGGREGATE``. You must specify the argument types of the aggregate to drop if there are multiple overloaded aggregates with the same name but a different signature.
The ``DROP AGGREGATE`` command with the optional ``IF EXISTS`` keywords drops an aggregate if it exists, and does nothing if an aggregate with that signature does not exist.
.. include:: /rst_include/apache-cql-return-index.rst
.. include:: /rst_include/apache-copyrights.rst

View File

@@ -6,8 +6,10 @@ System Requirements
Supported Platforms
===================
ScyllaDB runs on 64-bit Linux. The x86_64 and AArch64 architectures are supported (AArch64 support includes AWS EC2 Graviton).
Scylla runs on 64-bit Linux. Here, you can find which :doc:`operating systems, distros, and versions </getting-started/os-support>` are supported.
See :doc:`OS Support by Platform and Version </getting-started/os-support>` for information about
supported operating systems, distros, and versions.
.. _system-requirements-hardware:
@@ -16,39 +18,44 @@ Hardware Requirements
It's recommended to have a balanced setup. If there are only 4-8 :term:`Logical Cores <Logical Core (lcore)>`, large disks or 10 Gbps networking may not be needed.
This works in the opposite direction as well.
Scylla can be used in many types of installation environments.
ScyllaDB can be used in many types of installation environments.
To see which system would best suit your workload requirements, use the `Scylla Sizing Calculator <https://price-calc.gh.scylladb.com/>`_ to customize Scylla for your usage.
To see which system would best suit your workload requirements, use the `ScyllaDB Sizing Calculator <https://price-calc.gh.scylladb.com/>`_ to customize ScyllaDB for your usage.
Core Requirements
-----------------
Scylla tries to maximize the resource usage of all system components. The shard-per-core approach allows linear scale-up with the number of cores. As you have more cores, it makes sense to balance the other resources, from memory to network.
ScyllaDB tries to maximize the resource usage of all system components. The shard-per-core approach allows linear scale-up with the number of cores. As you have more cores, it makes sense to balance the other resources, from memory to network.
CPU
^^^
Scylla requires modern Intel CPUs that support the SSE4.2 instruction set and will not boot without it.
The following CPUs are supported by Scylla:
* Intel core: Westmere or later (2010)
* Intel atom: Goldmont or later (2016)
* AMD low power: Jaguar or later (2013)
* AMD standard: Bulldozer or later (2011)
ScyllaDB requires modern Intel/AMD CPUs that support the SSE4.2 instruction set and will not boot without it.
In terms of the number of cores, any number will work since Scylla scales up with the number of cores.
ScyllaDB supports the following CPUs:
* Intel core: Westmere and later (2010)
* Intel atom: Goldmont and later (2016)
* AMD low power: Jaguar and later (2013)
* AMD standard: Bulldozer and later (2011)
* Apple M1 and M2
* Ampere Altra
* AWS Graviton, Graviton2, Graviton3
In terms of the number of cores, any number will work since ScyllaDB scales up with the number of cores.
A practical approach is to use a large number of cores as long as the hardware price remains reasonable.
Between 20 and 60 logical cores (including hyperthreading) is a recommended range; however, any number will fit.
When using virtual machines, containers, or the public cloud, remember that each virtual CPU is mapped to a single logical core, or thread.
Allow Scylla to run independently without any additional CPU intensive tasks on the same server/cores as Scylla.
Allow ScyllaDB to run independently without any additional CPU intensive tasks on the same server/cores as ScyllaDB.
.. _system-requirements-memory:
Memory Requirements
-------------------
The more memory available, the better Scylla performs, as Scylla uses all of the available memory for caching. The wider the rows are in the schema, the more memory will be required. 64 GB-256 GB is the recommended range for a medium to high workload. Memory requirements are calculated based on the number of :abbr:`lcores (logical cores)` you are using in your system.
The more memory available, the better ScyllaDB performs, as ScyllaDB uses all of the available memory for caching. The wider the rows are in the schema, the more memory will be required. 64 GB-256 GB is the recommended range for a medium to high workload. Memory requirements are calculated based on the number of :abbr:`lcores (logical cores)` you are using in your system.
* Recommended size: 16 GB or 2 GB per lcore (whichever is higher)
* Maximum: 1 TiB per lcore, up to 256 lcores
@@ -64,7 +71,7 @@ Disk Requirements
SSD
^^^
We highly recommend SSD and local disks. Scylla is built for a large volume of data and large storage per node.
We highly recommend SSD and local disks. ScyllaDB is built for a large volume of data and large storage per node.
You can use up to 100:1 Disk/RAM ratio, with 30:1 Disk/RAM ratio as a good rule of thumb; for example, 30 TB of storage requires 1 TB of RAM.
We recommend a RAID-0 setup and a replication factor of 3 within the local datacenter (RF=3) when there are multiple drives.
@@ -74,7 +81,7 @@ HDDs are supported but may become a bottleneck. Some workloads may work with HDD
Disk Space
^^^^^^^^^^
Scylla is flushing memtables to SSTable data files for persistent storage. SSTables are periodically compacted to improve performance by merging and rewriting data and discarding the old one. Depending on compaction strategy, disk space utilization temporarily increases during compaction. For this reason, you should leave an adequate amount of free disk space available on a node.
ScyllaDB flushes memtables to SSTable data files for persistent storage. SSTables are periodically compacted to improve performance by merging and rewriting data and discarding the old data. Depending on the compaction strategy, disk space utilization temporarily increases during compaction. For this reason, you should leave an adequate amount of free disk space available on a node.
Use the following table as a guideline for the minimum disk space requirements based on the compaction strategy:
====================================== =========== ============
@@ -89,7 +96,7 @@ Time-window Compaction Strategy (TWCS) 50% 70%
Incremental Compaction Strategy (ICS) 70% 80%
====================================== =========== ============
Use the default ICS (Scylla Enterprise) or STCS (Scylla Open Source) unless you'll have a clear understanding that another strategy is better for your use case. More on :doc:`choosing a Compaction Strategy </architecture/compaction/compaction-strategies>`.
Use the default ICS (ScyllaDB Enterprise) or STCS (ScyllaDB Open Source) unless you'll have a clear understanding that another strategy is better for your use case. More on :doc:`choosing a Compaction Strategy </architecture/compaction/compaction-strategies>`.
In order to maintain a high level of service availability, keep 20% to 50% of disk space free at all times!
.. _system-requirements-network:
@@ -97,7 +104,7 @@ In order to maintain a high level of service availability, keep 50% to 20% free
Network Requirements
====================
A network speed of 10 Gbps or more is recommended, especially for large nodes. To tune the interrupts and their queues, run the Scylla setup scripts.
A network speed of 10 Gbps or more is recommended, especially for large nodes. To tune the interrupts and their queues, run the ScyllaDB setup scripts.
Cloud Instance Recommendations
@@ -106,20 +113,25 @@ Cloud Instance Recommendations
Amazon Web Services (AWS)
--------------------------------
* The recommended instance types are :ref:`i3 <system-requirements-i3-instances>`, :ref:`i3en <system-requirements-i3en-instances>`, and :ref:`i4i <system-requirements-i4i-instances>`.
* We recommend using enhanced networking that exposes the physical network cards to the VM.
.. note::
Some of the ScyllaDB configuration features rely on querying instance metadata.
Disabling access to instance metadata will impact using EC2 snitches and tuning performance.
See `AWS - Configure the instance metadata options <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html>`_ for more information.
.. _system-requirements-i3-instances:
We highly recommend EC2 **I3** instances—High I/O. This family includes the High Storage Instances that provide very fast SSD-backed instance storage optimized for very high random I/O performance and provide high IOPS at a low cost. We recommend using enhanced networking that exposes the physical network cards to the VM.
i3 instances
^^^^^^^^^^^^
This family includes the High Storage Instances that provide very fast SSD-backed instance storage optimized for very high random I/O performance and provide high IOPS at a low cost. We recommend using enhanced networking that exposes the physical network cards to the VM.
i3 instances are designed for I/O intensive workloads and are equipped with super-efficient NVMe SSD storage. They can deliver up to 3.3 million IOPS.
i3 instances are great for low latency and high throughput. Compared to i2 instances, they provide storage that is less expensive and denser, along with the ability to deliver substantially more IOPS and more network bandwidth per CPU core.
=========================== =========== ============ =====================
Model vCPU Mem (GB) Storage (NVMe SSD)
=========================== =========== ============ =====================
@@ -140,13 +152,15 @@ i3.metal New in version 2.3 72 :sup:`*` 512 8 x 1.9 NVMe SSD
Source: `Amazon EC2 I3 Instances <https://aws.amazon.com/ec2/instance-types/i3/>`_
More on using Scylla with `i3.metal vs i3.16xlarge <https://www.scylladb.com/2018/06/21/impact-virtualization-database/>`_
More on using ScyllaDB with `i3.metal vs i3.16xlarge <https://www.scylladb.com/2018/06/21/impact-virtualization-database/>`_
.. _system-requirements-i3en-instances:
i3en instances
^^^^^^^^^^^^^^
i3en instances have up to 4x the networking bandwidth of i3 instances, enabling up to 100 Gbps of sustained network bandwidth.
i3en support is available for Scylla Enterprise 2019.1.1 and higher and Scylla Open Source 3.1 and higher.
i3en support is available for ScyllaDB Enterprise 2019.1.1 and higher and ScyllaDB Open Source 3.1 and higher.
=========================== =========== ============ =====================
@@ -177,12 +191,12 @@ All i3en instances have the following specs:
See `Amazon EC2 I3en Instances <https://aws.amazon.com/ec2/instance-types/i3en/>`_ for details.
.. _system-requirements-i4i-instances:
i4i instances
^^^^^^^^^^^^^^
i4i support is available for ScyllaDB Open Source 5.0 and later and ScyllaDB Enterprise 2021.1.10 and later.
=========================== =========== ============ =====================
Model vCPU Mem (GB) Storage (NVMe SSD)
=========================== =========== ============ =====================
@@ -203,7 +217,7 @@ i4i.32xlarge 128 1,024 8 x 3,750 GB
i4i.metal 128 1,024 8 x 3,750 GB
=========================== =========== ============ =====================
All i41 instances have the following specs:
All i4i instances have the following specs:
* 3.5 GHz all-core turbo Intel® Xeon® Scalable (Ice Lake) processors
* 40 Gbps bandwidth to EBS in the largest size and up to 10 Gbps in the four smallest sizes (twice that of i3 instances). Up to 75 Gbps networking bandwidth (three times more than i3 instances).
@@ -216,11 +230,15 @@ See `ScyllaDB on the New AWS EC2 I4i Instances: Twice the Throughput & Lower Lat
learn more about using ScyllaDB with i4i instances.
Im4gn and Is4gen instances
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ScyllaDB supports Arm-based Im4gn and Is4gen instances. See `Amazon EC2 Im4gn and Is4gen instances <https://aws.amazon.com/ec2/instance-types/i4g/>`_ for specification details.
Google Compute Engine (GCE)
-----------------------------------
Pick a zone where Haswell CPUs are found. Local SSD performance offers, according to Google, less than 1 ms of latency and up to 680,000 read IOPS and 360,000 write IOPS.
Image with NVMe disk interface is recommended, CentOS 7 for Scylla Enterprise 2020.1 and older, and Ubuntu 20 for 2021.1 and later.
An image with an NVMe disk interface is recommended: CentOS 7 for ScyllaDB Enterprise 2020.1 and older, and Ubuntu 20 for 2021.1 and later.
(`More info <https://cloud.google.com/compute/docs/disks/local-ssd>`_)
Recommended instance types are `n1-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines>`_ and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines>`_.

View File

@@ -25,29 +25,31 @@
<div class="grid-x grid-margin-x hs">
.. topic-box::
:title: New to ScyllaDB? Start here!
:link: https://cloud.docs.scylladb.com/stable/scylladb-basics/
:class: large-4
:anchor: ScyllaDB Basics
Learn the essentials of ScyllaDB.
.. topic-box::
:title: Let us manage your DB
:link: https://cloud.docs.scylladb.com
:class: large-4
:anchor: Get Started with Scylla Cloud
:anchor: ScyllaDB Cloud Documentation
Take advantage of Scylla Cloud, a fully-managed database-as-a-service.
Simplify application development with ScyllaDB Cloud - a fully managed database-as-a-service.
.. topic-box::
:title: Manage your own DB
:link: getting-started
:class: large-4
:anchor: Get Started with Scylla
:anchor: ScyllaDB Open Source and Enterprise Documentation
Provision and manage a Scylla cluster in your environment.
Deploy and manage your database in your own environment.
.. topic-box::
:title: Connect your application to Scylla
:link: using-scylla/drivers
:class: large-4
:anchor: Choose a Driver
Use high performance Scylla drivers to connect your application to a Scylla cluster.
.. raw:: html
@@ -57,14 +59,13 @@
<div class="topics-grid topics-grid--products">
<h2 class="topics-grid__title">Our Product List</h2>
<p class="topics-grid__text">To begin choose a product from the list below</p>
<h2 class="topics-grid__title">Our Products</h2>
<div class="grid-container full">
<div class="grid-x grid-margin-x">
.. topic-box::
:title: Scylla Enterprise
:title: ScyllaDB Enterprise
:link: getting-started
:image: /_static/img/mascots/scylla-enterprise.svg
:class: topic-box--product,large-3,small-6
@@ -72,7 +73,7 @@
ScyllaDB's most stable, high-performance, enterprise-grade NoSQL database.
.. topic-box::
:title: Scylla Open Source
:title: ScyllaDB Open Source
:link: getting-started
:image: /_static/img/mascots/scylla-opensource.svg
:class: topic-box--product,large-3,small-6
@@ -80,15 +81,15 @@
A high-performance NoSQL database with a close-to-the-hardware, shared-nothing approach.
.. topic-box::
:title: Scylla Cloud
:title: ScyllaDB Cloud
:link: https://cloud.docs.scylladb.com
:image: /_static/img/mascots/scylla-cloud.svg
:class: topic-box--product,large-3,small-6
A fully managed NoSQL database as a service powered by Scylla Enterprise.
A fully managed NoSQL database as a service powered by ScyllaDB Enterprise.
.. topic-box::
:title: Scylla Alternator
:title: ScyllaDB Alternator
:link: https://docs.scylladb.com/stable/alternator/alternator.html
:image: /_static/img/mascots/scylla-alternator.svg
:class: topic-box--product,large-3,small-6
@@ -96,23 +97,23 @@
Open source Amazon DynamoDB-compatible API.
.. topic-box::
:title: Scylla Monitoring Stack
:title: ScyllaDB Monitoring Stack
:link: https://monitoring.docs.scylladb.com
:image: /_static/img/mascots/scylla-monitor.svg
:class: topic-box--product,large-3,small-6
Complete open source monitoring solution for your Scylla clusters.
Complete open source monitoring solution for your ScyllaDB clusters.
.. topic-box::
:title: Scylla Manager
:title: ScyllaDB Manager
:link: https://manager.docs.scylladb.com
:image: /_static/img/mascots/scylla-manager.svg
:class: topic-box--product,large-3,small-6
Hassle-free Scylla NoSQL database management for scale-out clusters.
Hassle-free ScyllaDB NoSQL database management for scale-out clusters.
.. topic-box::
:title: Scylla Drivers
:title: ScyllaDB Drivers
:link: https://docs.scylladb.com/stable/using-scylla/drivers/
:image: /_static/img/mascots/scylla-drivers.svg
:class: topic-box--product,large-3,small-6
@@ -120,12 +121,12 @@
Shard-aware drivers for superior performance.
.. topic-box::
:title: Scylla Operator
:title: ScyllaDB Operator
:link: https://operator.docs.scylladb.com
:image: /_static/img/mascots/scylla-enterprise.svg
:class: topic-box--product,large-3,small-6
Easily run and manage your Scylla Cluster on Kubernetes.
Easily run and manage your ScyllaDB cluster on Kubernetes.
.. raw:: html
@@ -135,19 +136,19 @@
<div class="topics-grid">
<h2 class="topics-grid__title">Learn More About Scylla</h2>
<h2 class="topics-grid__title">Learn More About ScyllaDB</h2>
<p class="topics-grid__text"></p>
<div class="grid-container full">
<div class="grid-x grid-margin-x">
.. topic-box::
:title: Attend Scylla University
:title: Attend ScyllaDB University
:link: https://university.scylladb.com/
:image: /_static/img/mascots/scylla-university.png
:class: large-6,small-12
:anchor: Find a Class
| Register to take a *free* class at Scylla University.
| Register to take a *free* class at ScyllaDB University.
| There are several learning paths to choose from.
.. topic-box::
@@ -178,9 +179,9 @@
architecture/index
troubleshooting/index
kb/index
Scylla University <https://university.scylladb.com/>
ScyllaDB University <https://university.scylladb.com/>
faq
Contribute to Scylla <contribute>
Contribute to ScyllaDB <contribute>
glossary
alternator/alternator


@@ -29,7 +29,7 @@ There are two types of compactions:
* Major Compaction
A user triggers (using nodetool) a compaction over all SSTables, merging the individual tables according to the selected compaction strategy.
.. caution:: It is always best to allow Scylla to automatically run minor compactions. Major compactions can exhaust resources, increase operational costs, and take up valuable disk space. This requires you to have 50% more disk space than your data unless you are using `Incremental compaction strategy (ICS)`_.
.. caution:: It is always best to allow Scylla to automatically run minor compactions. Major compactions can exhaust resources, increase operational costs, and take up valuable disk space. This requires you to have 50% more disk space than your data unless you are using :ref:`Incremental compaction strategy (ICS) <incremental-compaction-strategy-ics>`.
View Compaction Statistics
--------------------------
@@ -43,7 +43,7 @@ A compaction strategy is what determines which of the SSTables will be compacted
* `Size-tiered compaction strategy (STCS)`_ - (default setting) triggered when the system has enough similarly sized SSTables.
* `Leveled compaction strategy (LCS)`_ - the system uses small, fixed-size (by default 160 MB) SSTables divided into different levels and lowers both Read and Space Amplification.
* `Incremental compaction strategy (ICS)`_ - Available for Enterprise customers, uses runs of sorted, fixed size (by default 1 GB) SSTables in a similar way that LCS does, organized into size-tiers, similar to STCS size-tiers. If you are an Enterprise customer ICS is an updated strategy meant to replace STCS. It has the same read and write amplification, but has lower space amplification due to the reduction of temporary space overhead is reduced to a constant manageable level.
* :ref:`Incremental compaction strategy (ICS) <incremental-compaction-strategy-ics>` - :label-tip:`ScyllaDB Enterprise` Uses runs of sorted, fixed-size (by default 1 GB) SSTables in a similar way to LCS, organized into size-tiers, similar to STCS size-tiers. If you are an Enterprise customer, ICS is an updated strategy meant to replace STCS. It has the same read and write amplification, but lower space amplification because the temporary space overhead is reduced to a constant, manageable level.
* `Time-window compaction strategy (TWCS)`_ - designed for time series data and puts data in time order. This strategy replaced Date-tiered compaction. TWCS uses STCS to prevent accumulating SSTables in a window not yet closed. When the window closes, TWCS works towards reducing the SSTables in a time window to one.
* `Date-tiered compaction strategy (DTCS)`_ - designed for time series data, but TWCS should be used instead.
@@ -116,12 +116,10 @@ Likewise, when :term:`bootstrapping<Bootstrap>` a new node, SSTables are streame
.. _incremental-compaction-strategy-ics:
Incremental Compaction Strategy (ICS)
-------------------------------------
Incremental Compaction Strategy (ICS) :label-tip:`ScyllaDB Enterprise`
------------------------------------------------------------------------
.. versionadded:: 2019.1.4 Scylla Enterprise
.. include:: /rst_include/enterprise-only-note.rst
.. versionadded:: 2019.1.4
One of the issues with Size-tiered compaction is that it needs temporary space because SSTables are not removed until they are fully compacted. ICS takes a different approach and splits each large SSTable into a run of sorted, fixed-size (by default 1 GB) SSTables (a.k.a. fragments) in the same way that LCS does, except it treats the entire run and not the individual SSTables as the sizing file for STCS. As the run-fragments are small, the SSTables compact quickly, allowing individual SSTables to be removed as soon as they are compacted. This approach uses low amounts of memory and temporary disk space.


@@ -42,7 +42,7 @@ Steps:
.. code-block:: sh
nodetool compact <keyspace>.<mytable>;
nodetool compact <keyspace> <mytable>;
5. Alter the table and change the grace period back to the original ``gc_grace_seconds`` value.


@@ -27,8 +27,7 @@ endpoint_snitch GossipingPropertyFileSnitch
**Important**
If the node has two physical network interfaces in a multi-datacenter installation.
Set ``listen_address`` to this node's private IP or hostname.
If the node has two physical network interfaces in a multi-datacenter installation, set ``listen_address`` to this node's private IP or hostname.
Set ``broadcast_address`` to the second IP or hostname (for communication between data centers).
Set ``listen_on_broadcast_address`` to true.
Open the storage_port or ssl_storage_port on the public IP firewall.


@@ -102,4 +102,4 @@ Cluster Management Procedures
Procedures for handling failures and practical examples of different scenarios.
* :ref:`Handling Failures<raft-handliing-failures>`
* :ref:`Handling Failures<raft-handling-failures>`


@@ -2,12 +2,18 @@
Scylla Auditing Guide
=====================
.. include:: /rst_include/enterprise-only-note.rst
:label-tip:`ScyllaDB Enterprise`
Auditing allows the administrator to monitor activities on a Scylla cluster, including queries and data changes.
The information is stored in a Syslog or a Scylla table.
Prerequisite
------------
Enable ScyllaDB :doc:`Authentication </operating-scylla/security/authentication>` and :doc:`Authorization </operating-scylla/security/enable-authorization>`.
Enabling Audit
---------------


@@ -2,11 +2,13 @@
Encryption at Rest
==================
:label-tip:`ScyllaDB Enterprise`
.. versionadded:: 2019.1.1 Scylla Enterprise
.. versionchanged:: 2019.1.3 Scylla Enterprise
.. versionadded:: 2019.1.1
.. versionchanged:: 2019.1.3
.. include:: /rst_include/enterprise-only-note.rst
Introduction
=============
Scylla Enterprise protects your sensitive data with data-at-rest encryption.
It protects the privacy of your user's data, reduces the risk of data breaches, and helps meet regulatory requirements.


@@ -7,9 +7,9 @@ LDAP Authentication
saslauthd
.. include:: /rst_include/enterprise-only-note.rst
:label-tip:`ScyllaDB Enterprise`
.. versionadded:: Scylla Enterprise 2021.1.2
.. versionadded:: 2021.1.2
Scylla supports user authentication via an LDAP server by leveraging the SaslauthdAuthenticator.
By configuring saslauthd correctly against your LDAP server, you enable Scylla to check users' credentials through it.


@@ -2,14 +2,14 @@
LDAP Authorization (Role Management)
=====================================
.. include:: /rst_include/enterprise-only-note.rst
:label-tip:`ScyllaDB Enterprise`
.. versionadded:: 2021.1.2
Scylla Enterprise customers can manage and authorize users' privileges via an :abbr:`LDAP (Lightweight Directory Access Protocol)` server.
LDAP is an open, vendor-neutral, industry-standard protocol for accessing and maintaining distributed user access control over a standard IP network.
If your users are already stored in an LDAP directory, you can now use the same LDAP server to regulate their roles in Scylla.
.. versionadded:: Scylla Enterprise 2021.1.2
Introduction
------------


@@ -1 +0,0 @@
.. note:: This feature is only available with Scylla Enterprise. If you are using Scylla Open Source, this feature will not be available.


@@ -1,6 +1,6 @@
======================
Troubleshooting Scylla
======================
=========================
Troubleshooting ScyllaDB
=========================
.. toctree::
:hidden:
@@ -8,6 +8,7 @@ Troubleshooting Scylla
support/index
startup/index
upgrade/index
cluster/index
modeling/index
storage/index
@@ -24,13 +25,14 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
:id: "getting-started"
:class: my-panel
* :doc:`Errors and Scylla Customer Support <support/index>`
* :doc:`Scylla Startup <startup/index>`
* :doc:`Scylla Cluster and Node <cluster/index>`
* :doc:`Errors and ScyllaDB Customer Support <support/index>`
* :doc:`ScyllaDB Startup <startup/index>`
* :doc:`ScyllaDB Cluster and Node <cluster/index>`
* :doc:`ScyllaDB Upgrade <upgrade/index>`
* :doc:`Data Modeling <modeling/index>`
* :doc:`Data Storage and SSTables <storage/index>`
* :doc:`CQL errors <CQL/index>`
* :doc:`Scylla Monitoring and Scylla Manager <monitor/index>`
* :doc:`ScyllaDB Monitoring and Scylla Manager <monitor/index>`
Also check out the `Monitoring lesson <https://university.scylladb.com/courses/scylla-operations/lessons/scylla-monitoring/>`_ on Scylla University, which covers how to troubleshoot different issues when running a Scylla cluster.


@@ -0,0 +1,79 @@
Inaccessible "/var/lib/scylla" and "/var/lib/systemd/coredump" after ScyllaDB upgrade
======================================================================================
Problem
^^^^^^^
When you reboot the machine after a ScyllaDB upgrade, you cannot access data directories under ``/var/lib/scylla``, and
coredump saves to ``rootfs``.
The problem may occur when you upgrade ScyllaDB Open Source 4.6 or later to a version of ScyllaDB Enterprise if
the ``/etc/systemd/system/var-lib-scylla.mount`` and ``/etc/systemd/system/var-lib-systemd-coredump.mount`` are
deleted by RPM.
To avoid losing the files, the upgrade procedure includes a step to back up the .mount files. The following
example shows the command to back up the files before the :doc:`upgrade from version 5.0 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/upgrade-guide-from-5.0-to-2022.1-rpm/>`:
.. code-block:: console
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ) /etc/systemd/system/{var-lib-scylla,var-lib-systemd-coredump}.mount; do sudo cp -v $conf $conf.backup-5.0; done
If you don't back up the .mount files before the upgrade, the files may be lost.
Solution
^^^^^^^^
If you didn't back up the .mount files before the upgrade and the files were deleted during the upgrade,
you need to restore them manually.
To restore ``/etc/systemd/system/var-lib-systemd-coredump.mount``, run the following:
.. code-block:: console
$ cat << EOS | sudo tee /etc/systemd/system/var-lib-systemd-coredump.mount
[Unit]
Description=Save coredump to scylla data directory
Conflicts=umount.target
Before=scylla-server.service
After=local-fs.target
DefaultDependencies=no
[Mount]
What=/var/lib/scylla/coredump
Where=/var/lib/systemd/coredump
Type=none
Options=bind
[Install]
WantedBy=multi-user.target
EOS
To restore ``/etc/systemd/system/var-lib-scylla.mount``, run the following (specifying your data disk):
.. code-block:: console
$ UUID=`blkid -s UUID -o value <specify your data disk, eg: /dev/md0>`
$ cat << EOS | sudo tee /etc/systemd/system/var-lib-scylla.mount
[Unit]
Description=Scylla data directory
Before=scylla-server.service
After=local-fs.target
DefaultDependencies=no
[Mount]
What=/dev/disk/by-uuid/$UUID
Where=/var/lib/scylla
Type=xfs
Options=noatime
[Install]
WantedBy=multi-user.target
EOS
After restoring .mount files, you need to enable them:
.. code-block:: console
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now var-lib-scylla.mount
$ sudo systemctl enable --now var-lib-systemd-coredump.mount
.. include:: /troubleshooting/_common/ts-return.rst


@@ -0,0 +1,16 @@
Upgrade
=================
.. toctree::
:hidden:
:maxdepth: 2
Inaccessible configuration files after ScyllaDB upgrade </troubleshooting/missing-dotmount-files>
.. panel-box::
:title: Upgrade Issues
:id: "getting-started"
:class: my-panel
* :doc:`Inaccessible "/var/lib/scylla" and "/var/lib/systemd/coredump" after ScyllaDB upgrade </troubleshooting/missing-dotmount-files>`


@@ -5,7 +5,8 @@ Upgrade ScyllaDB Open Source
.. toctree::
:hidden:
ScyllaDB 5.1 to 5.1 <upgrade-guide-from-5.0-to-5.1/index>
ScyllaDB 5.1 to 5.2 <upgrade-guide-from-5.1-to-5.2/index>
ScyllaDB 5.0 to 5.1 <upgrade-guide-from-5.0-to-5.1/index>
ScyllaDB 5.x maintenance release <upgrade-guide-from-5.x.y-to-5.x.z/index>
ScyllaDB 4.6 to 5.0 <upgrade-guide-from-4.6-to-5.0/index>
ScyllaDB 4.5 to 4.6 <upgrade-guide-from-4.5-to-4.6/index>
@@ -36,6 +37,7 @@ Upgrade ScyllaDB Open Source
Procedures for upgrading to a newer version of ScyllaDB Open Source.
* :doc:`Upgrade Guide - ScyllaDB 5.1 to 5.2 <upgrade-guide-from-5.1-to-5.2/index>`
* :doc:`Upgrade Guide - ScyllaDB 5.0 to 5.1 <upgrade-guide-from-5.0-to-5.1/index>`
* :doc:`Upgrade Guide - ScyllaDB 5.x maintenance releases <upgrade-guide-from-5.x.y-to-5.x.z/index>`
* :doc:`Upgrade Guide - ScyllaDB 4.6 to 5.0 <upgrade-guide-from-4.6-to-5.0/index>`


@@ -6,10 +6,7 @@ Upgrade Guide - ScyllaDB 5.0 to 5.1
:maxdepth: 2
:hidden:
ScyllaDB Image <upgrade-guide-from-5.0-to-5.1-image>
Red Hat Enterprise Linux and CentOS <upgrade-guide-from-5.0-to-5.1-rpm>
Ubuntu <upgrade-guide-from-5.0-to-5.1-ubuntu>
Debian <upgrade-guide-from-5.0-to-5.1-debian>
ScyllaDB <upgrade-guide-from-5.0-to-5.1-generic>
Metrics <metric-update-5.0-to-5.1>
.. panel-box::
@@ -20,8 +17,5 @@ Upgrade Guide - ScyllaDB 5.0 to 5.1
Upgrade guides are available for:
* :doc:`Upgrade ScyllaDB Image from 5.0.x to 5.1.y <upgrade-guide-from-5.0-to-5.1-image>`
* :doc:`Upgrade ScyllaDB from 5.0.x to 5.1.y on Red Hat Enterprise Linux and CentOS <upgrade-guide-from-5.0-to-5.1-rpm>`
* :doc:`Upgrade ScyllaDB from 5.0.x to 5.1.y on Ubuntu <upgrade-guide-from-5.0-to-5.1-ubuntu>`
* :doc:`Upgrade ScyllaDB from 5.0.x to 5.1.y on Debian <upgrade-guide-from-5.0-to-5.1-debian>`
* :doc:`Upgrade ScyllaDB from 5.0.x to 5.1.y <upgrade-guide-from-5.0-to-5.1-generic>`
* :doc:`ScyllaDB Metrics Update - Scylla 5.0 to 5.1 <metric-update-5.0-to-5.1>`


@@ -1,13 +0,0 @@
.. |OS| replace:: Debian
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SRC_VERSION| replace:: 5.0
.. |NEW_VERSION| replace:: 5.1
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |PKG_NAME| replace:: scylla
.. |SCYLLA_REPO| replace:: ScyllaDB deb repo
.. _SCYLLA_REPO: https://www.scylladb.com/download/?platform=debian-10&version=scylla-5.1
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - Scylla 5.0 to 5.1
.. _SCYLLA_METRICS: ../metric-update-5.0-to-5.1
.. |UPGRADE_NOTES| replace:: _
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian.rst


@@ -0,0 +1,359 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 5.0
.. |NEW_VERSION| replace:: 5.1
.. |DEBIAN_SRC_REPO| replace:: Debian
.. _DEBIAN_SRC_REPO: https://www.scylladb.com/download/?platform=debian-10&version=scylla-5.0
.. |UBUNTU_SRC_REPO| replace:: Ubuntu
.. _UBUNTU_SRC_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.0
.. |SCYLLA_DEB_SRC_REPO| replace:: ScyllaDB deb repo (|DEBIAN_SRC_REPO|_, |UBUNTU_SRC_REPO|_)
.. |SCYLLA_RPM_SRC_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_SRC_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-5.0
.. |DEBIAN_NEW_REPO| replace:: Debian
.. _DEBIAN_NEW_REPO: https://www.scylladb.com/download/?platform=debian-10&version=scylla-5.1
.. |UBUNTU_NEW_REPO| replace:: Ubuntu
.. _UBUNTU_NEW_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.1
.. |SCYLLA_DEB_NEW_REPO| replace:: ScyllaDB deb repo (|DEBIAN_NEW_REPO|_, |UBUNTU_NEW_REPO|_)
.. |SCYLLA_RPM_NEW_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_NEW_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-5.1
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SCYLLA_METRICS| replace:: Scylla Metrics Update - Scylla 5.0 to 5.1
.. _SCYLLA_METRICS: ../metric-update-5.0-to-5.1
=============================================================================
Upgrade Guide - |SCYLLA_NAME| |SRC_VERSION| to |NEW_VERSION|
=============================================================================
This document describes a step-by-step procedure for upgrading from |SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION|, and for rolling back to version |SRC_VERSION| if required.
This guide covers upgrading Scylla on Red Hat Enterprise Linux (RHEL) 7/8, CentOS 7/8, Debian 10, and Ubuntu 20.04. It also applies when using the official ScyllaDB image on EC2, GCP, or Azure; the image is based on Ubuntu 20.04.
See :doc:`OS Support by Platform and Version </getting-started/os-support>` for information about supported versions.
Upgrade Procedure
=================
A ScyllaDB upgrade is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes in the cluster, serially (i.e. one node at a time), you will:
* Check that the cluster's schema is synchronized
* Drain the node and backup the data
* Backup the configuration file
* Stop ScyllaDB
* Download and install new ScyllaDB packages
* Start ScyllaDB
* Validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the node you upgraded is up and running the new version.
**During** the rolling upgrade, it is highly recommended:
* Not to use the new |NEW_VERSION| features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending ScyllaDB Manager (only available for ScyllaDB Enterprise) scheduled or running repairs.
* Not to apply schema changes
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============
Check the cluster schema
-------------------------
Make sure that all nodes have the schema synchronized before upgrade. The upgrade procedure will fail if there is a schema disagreement between nodes.
.. code:: sh
nodetool describecluster
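This check can be scripted. The sketch below runs the same logic over a mock ``nodetool describecluster`` output (the cluster name, schema UUID, and IPs are hypothetical); it counts the version lines listed under "Schema versions:", and exactly one such line means all nodes agree:

```shell
# Mock 'nodetool describecluster' output; in practice, pipe the real
# command's output into the same awk pipeline instead.
out='Cluster Information:
    Name: my-cluster
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    Schema versions:
        5ab30ba3-8110-3d21-9a34-cba6e901a2a1: [10.0.0.1, 10.0.0.2, 10.0.0.3]'

# Count the version lines under "Schema versions:"; one line per schema
# version, so a count of 1 means the schema is synchronized.
versions=$(printf '%s\n' "$out" | awk '/Schema versions:/{f=1; next} f && /\[/{n++} END{print n+0}')
if [ "$versions" -eq 1 ]; then
    echo "schema in agreement"
else
    echo "schema disagreement across $versions versions"
fi
```

If more than one schema version is listed, wait for the nodes to converge before starting the upgrade.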
Drain the nodes and backup the data
-----------------------------------
Before any major procedure, like an upgrade, it is recommended to back up all the data to an external device. In Scylla, backup is done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following commands:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having that name under ``/var/lib/scylla`` to a backup device.
When the upgrade is completed on all nodes, remove the snapshot with the ``nodetool clearsnapshot -t <snapshot>`` command to prevent running out of space.
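As an illustration of the copy step, here is a minimal sketch that preserves the keyspace/table layout under the backup target. It runs entirely against a mock data directory created in temp space; on a real node, ``DATA_DIR`` would be ``/var/lib/scylla/data``, ``SNAPSHOT_TAG`` the tag printed by ``nodetool snapshot``, and ``BACKUP_DIR`` the mount point of your backup device.

```shell
# Mock layout standing in for /var/lib/scylla/data; the keyspace, table,
# and snapshot tag names below are hypothetical.
DATA_DIR=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)
SNAPSHOT_TAG="1669600000000"
mkdir -p "$DATA_DIR/ks1/table1-abc123/snapshots/$SNAPSHOT_TAG"
touch "$DATA_DIR/ks1/table1-abc123/snapshots/$SNAPSHOT_TAG/md-1-big-Data.db"

# Copy every snapshot directory carrying the tag, keeping the relative
# keyspace/table path under the backup target.
find "$DATA_DIR" -type d -path "*/snapshots/$SNAPSHOT_TAG" | while read -r snap; do
    dest="$BACKUP_DIR/${snap#"$DATA_DIR"/}"
    mkdir -p "$dest"
    cp -a "$snap/." "$dest/"
done
```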
Backup the configuration file
------------------------------
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-src
Gracefully stop the node
------------------------
.. code:: sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
.. tabs::
.. group-tab:: Debian/Ubuntu
Before upgrading, check which version you are running with ``dpkg -s scylla-server`` and take note of it; this is the version to reinstall if you need to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
Before upgrading, check which version you are running with ``rpm -qa | grep scylla-server`` and take note of it; this is the version to reinstall if you need to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_RPM_NEW_REPO|_ to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code:: sh
sudo yum clean all
sudo yum update scylla\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
Before upgrading, check which version you are running with ``dpkg -s scylla-server`` and take note of it; this is the version to reinstall if you need to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
There are two alternative upgrade procedures:
* :ref:`Upgrading ScyllaDB and simultaneously updating 3rd party and OS packages <upgrade-image-recommended-procedure>`. It is recommended if you are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04.
* :ref:`Upgrading ScyllaDB without updating any external packages <upgrade-image-upgrade-guide-regular-procedure>`.
.. _upgrade-image-recommended-procedure:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
.. code:: sh
cat scylla-packages-<version>-<arch>.txt | sudo xargs -n1 apt-get install -y
Where:
* ``<version>`` - The ScyllaDB version to which you are upgrading ( |NEW_VERSION| ).
* ``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
Example:
.. code:: sh
cat scylla-packages-5.1.2-x86_64.txt | sudo xargs -n1 apt-get install -y
.. note::
Alternatively, you can update the manifest file with the following command:
``sudo apt-get install $(awk '{print $1}' scylla-packages-<version>-<arch>.txt) -y``
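To make the ``xargs`` behavior concrete, here is a minimal sketch with a mock three-line manifest; ``apt-get install`` is prefixed with ``echo`` so the pipeline can run without a package manager. ``-n1`` makes ``xargs`` issue one install command per listed package:

```shell
# Mock manifest; a real one is the scylla-packages-<version>-<arch>.txt
# file described above, with one package name per line.
manifest=$(mktemp)
printf '%s\n' scylla-server scylla-tools scylla-jmx > "$manifest"

# 'echo' stands in for the real 'sudo apt-get install -y' invocation.
out=$(cat "$manifest" | xargs -n1 echo sudo apt-get install -y)
printf '%s\n' "$out"
```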
.. _upgrade-image-upgrade-guide-regular-procedure:
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
#. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in ``UN`` status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the ScyllaDB version. Validate that the version matches the one you upgraded to.
#. Check scylla-server log (by ``journalctl _COMM=scylla``) and ``/var/log/syslog`` to validate there are no new errors in the log.
#. Check again after two minutes, to validate no new issues are introduced.
Once you are sure the node upgrade was successful, move to the next node in the cluster.
See |SCYLLA_METRICS|_ for more information.
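The version check in the validation steps can be wrapped in a small comparison. In this sketch the ``curl`` call is left commented out and replaced with a mock reply so the logic stands alone; ``EXPECTED`` is whatever version you upgraded to:

```shell
EXPECTED="5.1.2"   # hypothetical target version

# On a live node you would fetch the version from the REST API:
# response=$(curl -s -X GET "http://localhost:10000/storage_service/scylla_release_version")
response='"5.1.2"'   # mock of the API's quoted-string reply

version=$(printf '%s' "$response" | tr -d '"')
if [ "$version" = "$EXPECTED" ]; then
    echo "upgrade OK: running $version"
else
    echo "version mismatch: expected $EXPECTED, got $version" >&2
fi
```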
Rollback Procedure
==================
.. include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from |SCYLLA_NAME| |NEW_VERSION|.x to |SRC_VERSION|.y. Apply this procedure if an upgrade from |SRC_VERSION| to |NEW_VERSION| failed before completing on all nodes. Use this procedure only for nodes you upgraded to |NEW_VERSION|.
ScyllaDB rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes you rollback to |SRC_VERSION|, serially (i.e. one node at a time), you will:
* Drain the node and stop Scylla
* Retrieve the old ScyllaDB packages
* Restore the configuration file
* Restore system tables
* Reload systemd configuration
* Restart ScyllaDB
* Validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the rollback was successful and the node is up and running the old version.
Rollback Steps
==============
Drain and gracefully stop the node
----------------------------------
.. code:: sh
nodetool drain
sudo service scylla-server stop
Download and install the old release
------------------------------------
..
TODO: downgrade for 3rd party packages in EC2/GCP/Azure - like in the upgrade section?
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
.. code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
#. Update the |SCYLLA_RPM_SRC_REPO|_ to |SRC_VERSION|.
#. Install:
.. code:: console
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\*tools-core
sudo yum downgrade scylla\* -y
sudo yum install scylla
.. group-tab:: EC2/GCP/Azure Ubuntu Image
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
.. code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer y to the first two questions.
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for reference.
.. code:: sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
sudo cp -r * /var/lib/scylla/data/keyspace_name/table_name-UUID/
sudo chown -R scylla:scylla /var/lib/scylla/data/keyspace_name/table_name-UUID/
Reload systemd configuration
----------------------------
You must reload the systemd configuration if any unit file has changed.
.. code:: sh
sudo systemctl daemon-reload
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.


@@ -1,16 +0,0 @@
.. |OS| replace:: EC2, GCP, and Azure
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SRC_VERSION| replace:: 5.0
.. |NEW_VERSION| replace:: 5.1
.. |SCYLLA_NAME| replace:: ScyllaDB Image
.. |PKG_NAME| replace:: scylla
.. |APT| replace:: ScyllaDB deb repo
.. _APT: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.1
.. |SCYLLA_REPO| replace:: ScyllaDB deb repo
.. _SCYLLA_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.1
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - Scylla 5.0 to 5.1
.. _SCYLLA_METRICS: ../metric-update-5.0-to-5.1
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst


@@ -1,10 +0,0 @@
.. |OS| replace:: Red Hat Enterprise Linux and CentOS
.. |SRC_VERSION| replace:: 5.0
.. |NEW_VERSION| replace:: 5.1
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |PKG_NAME| replace:: scylla
.. |SCYLLA_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-5.1
.. |SCYLLA_METRICS| replace:: Scylla Metrics Update - Scylla 5.0 to 5.1
.. _SCYLLA_METRICS: ../metric-update-5.0-to-5.1
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst


@@ -1,13 +0,0 @@
.. |OS| replace:: Ubuntu
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SRC_VERSION| replace:: 5.0
.. |NEW_VERSION| replace:: 5.1
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |PKG_NAME| replace:: scylla
.. |SCYLLA_REPO| replace:: ScyllaDB deb repo
.. _SCYLLA_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.1
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - Scylla 5.0 to 5.1
.. _SCYLLA_METRICS: ../metric-update-5.0-to-5.1
.. |UPGRADE_NOTES| replace:: _
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian.rst

View File

@@ -0,0 +1,21 @@
====================================
Upgrade Guide - ScyllaDB 5.1 to 5.2
====================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB <upgrade-guide-from-5.1-to-5.2-generic>
Metrics <metric-update-5.1-to-5.2>
.. panel-box::
:title: Upgrade Scylla
:id: "getting-started"
:class: my-panel
Upgrade guides are available for:
* :doc:`Upgrade ScyllaDB from 5.1.x to 5.2.y <upgrade-guide-from-5.1-to-5.2-generic>`
* :doc:`ScyllaDB Metrics Update - Scylla 5.1 to 5.2 <metric-update-5.1-to-5.2>`

View File

@@ -0,0 +1,20 @@
Scylla Metric Update - Scylla 5.1 to 5.2
========================================
.. toctree::
:maxdepth: 2
:hidden:
Scylla 5.2 Dashboards are available as part of the latest |mon_root|.
The following metrics are new in Scylla 5.2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - TODO
- TODO

View File

@@ -0,0 +1,411 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 5.1
.. |NEW_VERSION| replace:: 5.2
.. |DEBIAN_SRC_REPO| replace:: Debian
.. _DEBIAN_SRC_REPO: https://www.scylladb.com/download/?platform=debian-10&version=scylla-5.1
.. |UBUNTU_SRC_REPO| replace:: Ubuntu
.. _UBUNTU_SRC_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.1
.. |SCYLLA_DEB_SRC_REPO| replace:: ScyllaDB deb repo (|DEBIAN_SRC_REPO|_, |UBUNTU_SRC_REPO|_)
.. |SCYLLA_RPM_SRC_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_SRC_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-5.1
.. |DEBIAN_NEW_REPO| replace:: Debian
.. _DEBIAN_NEW_REPO: https://www.scylladb.com/download/?platform=debian-10&version=scylla-5.2
.. |UBUNTU_NEW_REPO| replace:: Ubuntu
.. _UBUNTU_NEW_REPO: https://www.scylladb.com/download/?platform=ubuntu-20.04&version=scylla-5.2
.. |SCYLLA_DEB_NEW_REPO| replace:: ScyllaDB deb repo (|DEBIAN_NEW_REPO|_, |UBUNTU_NEW_REPO|_)
.. |SCYLLA_RPM_NEW_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_NEW_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-5.2
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SCYLLA_METRICS| replace:: Scylla Metrics Update - Scylla 5.1 to 5.2
.. _SCYLLA_METRICS: ../metric-update-5.1-to-5.2
=============================================================================
Upgrade Guide - |SCYLLA_NAME| |SRC_VERSION| to |NEW_VERSION|
=============================================================================
This document is a step-by-step procedure for upgrading from |SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION|, and for rolling back to version |SRC_VERSION| if required.
This guide covers upgrading Scylla on Red Hat Enterprise Linux (RHEL) 7/8, CentOS 7/8, Debian 10, and Ubuntu 20.04. It also applies when using the ScyllaDB official image on EC2, GCP, or Azure; the image is based on Ubuntu 20.04.
See :doc:`OS Support by Platform and Version </getting-started/os-support>` for information about supported versions.
Upgrade Procedure
=================
A ScyllaDB upgrade is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes in the cluster, serially (i.e. one node at a time), you will:
* Check that the cluster's schema is synchronized
* Drain the node and backup the data
* Backup the configuration file
* Stop ScyllaDB
* Download and install new ScyllaDB packages
* (Optional) Enable consistent cluster management in the configuration file
* Start ScyllaDB
* Validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the node you upgraded is up and running the new version.
**During** the rolling upgrade, it is highly recommended:
* Not to use the new |NEW_VERSION| features
* Not to run administration functions, such as repair, refresh, or rebuild, and not to add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending scheduled or running ScyllaDB Manager repairs (ScyllaDB Manager is available only for ScyllaDB Enterprise).
* Not to apply schema changes
If you enabled consistent cluster management in each node's configuration file, then as soon as every node has been upgraded to the new version, the cluster will start a procedure which initializes the Raft algorithm for consistent cluster metadata management.
You must then :ref:`verify <validate-raft-setup>` that this procedure successfully finishes.
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============
Check the cluster schema
-------------------------
Make sure that all nodes have the schema synchronized before the upgrade. The upgrade procedure will fail if there is a schema disagreement between nodes.
.. code:: sh
nodetool describecluster
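The schema check passes when every node reports the same schema version. A minimal sketch of counting distinct schema versions in saved ``describecluster`` output with standard shell tools (the cluster name, UUID, and addresses below are illustrative, not real output):

```shell
# Save `nodetool describecluster` output, then count schema-version lines.
# A healthy cluster shows exactly one distinct schema version.
cat > /tmp/describecluster.out <<'EOF'
Cluster Information:
    Name: mycluster
    Snitch: org.apache.cassandra.locator.SimpleSnitch
    Schema versions:
        8a9b4d33-3a6c-3f89-a2f0-b3c9f5a7d210: [192.0.2.11, 192.0.2.12, 192.0.2.13]
EOF
# Each schema version is reported as a UUID followed by the nodes using it.
grep -cE '^[[:space:]]*[0-9a-f-]{36}:' /tmp/describecluster.out
```

If the count is greater than 1, wait for schema agreement before proceeding.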
Drain the nodes and backup the data
-----------------------------------
Before any major procedure, like an upgrade, it is recommended to backup all the data to an external device. In Scylla, backup is done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following command:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having that name under ``/var/lib/scylla`` to a backup device.
When the upgrade is completed on all nodes, remove the snapshot with the ``nodetool clearsnapshot -t <snapshot>`` command to prevent running out of space.
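Snapshots are stored per table under a ``snapshots/<tag>`` directory inside the data directory. A minimal sketch of locating every directory belonging to one snapshot tag, using an illustrative layout under ``/tmp`` (on a real node the root is ``/var/lib/scylla/data`` and the tag is the name ``nodetool snapshot`` printed):

```shell
# Build an illustrative data-directory layout: two tables sharing one snapshot tag.
DATA=/tmp/scylla-data-demo
mkdir -p "$DATA/ks1/t1-0000/snapshots/1669640000000"
mkdir -p "$DATA/ks1/t2-0000/snapshots/1669640000000"
# List every directory belonging to snapshot tag 1669640000000, e.g. before
# copying them to a backup device:
find "$DATA" -type d -path '*/snapshots/1669640000000'
```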
Backup the configuration file
------------------------------
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-src
Gracefully stop the node
------------------------
.. code:: sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
.. tabs::
.. group-tab:: Debian/Ubuntu
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. Take note of the current version; you will need to reinstall it if you decide to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
Before upgrading, check what version you are running now using ``rpm -qa | grep scylla-server``. Take note of the current version; you will need to reinstall it if you decide to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_RPM_NEW_REPO|_ to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code:: sh
sudo yum clean all
sudo yum update scylla\* -y
.. group-tab:: EC2/GCP/Azure Ubuntu Image
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. Take note of the current version; you will need to reinstall it if you decide to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
There are two alternative upgrade procedures:
* :ref:`Upgrading ScyllaDB and simultaneously updating 3rd party and OS packages <upgrade-image-recommended-procedure>`. It is recommended if you are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04.
* :ref:`Upgrading ScyllaDB without updating any external packages <upgrade-image-upgrade-guide-regular-procedure>`.
.. _upgrade-image-recommended-procedure:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
.. code:: sh
cat scylla-packages-<version>-<arch>.txt | sudo xargs -n1 apt-get install -y
Where:
* ``<version>`` - The ScyllaDB version to which you are upgrading ( |NEW_VERSION| ).
* ``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
Example:
.. code:: sh
cat scylla-packages-5.2.0-x86_64.txt | sudo xargs -n1 apt-get install -y
.. note::
Alternatively, you can update the manifest file with the following command:
``sudo apt-get install $(awk '{print $1}' scylla-packages-<version>-<arch>.txt) -y``
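Both forms of the command keep only the first whitespace-separated field of each manifest line and pass it to ``apt-get``. A minimal demonstration of the field extraction on an illustrative manifest (the package names and versions below are made up, not the real manifest contents):

```shell
# The manifest lists one package per line; the first field is the package name.
cat > /tmp/scylla-packages-demo.txt <<'EOF'
scylla-server 5.2.0-0.20230101
scylla-tools 5.2.0-0.20230101
EOF
# Print only the package names, as fed to `apt-get install` / `xargs`:
awk '{print $1}' /tmp/scylla-packages-demo.txt
```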
.. _upgrade-image-upgrade-guide-regular-procedure:
**To upgrade ScyllaDB:**
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla
Answer y to the first two questions.
(Optional) Enable consistent cluster management in the node's configuration file
--------------------------------------------------------------------------------
If you enable this option on every node, the cluster will enable Raft and use it to consistently manage cluster-wide metadata as soon as you finish upgrading every node to the new version.
Check the :doc:`Raft in ScyllaDB document </architecture/raft/>` to learn more.
.. TODO: include enterprise versions
In 5.2, Raft-based consistent cluster management is disabled by default.
In 5.3 it will be enabled by default, but you'll be able to disable it explicitly during upgrade if needed (assuming you haven't previously enabled it on every node).
In later versions, the option will be removed, and consistent cluster management will be enabled unconditionally.
The option can also be enabled after the cluster is upgraded to |NEW_VERSION| (see :ref:`Enabling Raft in existing cluster <enabling-raft-existing-cluster>`).
To enable the option, modify the ``scylla.yaml`` configuration file in ``/etc/scylla/`` and add the following:
.. code:: yaml
consistent_cluster_management: true
.. note:: Once you finish upgrading every node with `consistent_cluster_management` enabled, it won't be possible to turn the option back off.
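After editing ``scylla.yaml``, you can confirm the option is set before restarting the node. A minimal sketch, run here against an illustrative file under ``/tmp`` (on a real node, point it at ``/etc/scylla/scylla.yaml``):

```shell
# Illustrative configuration fragment:
cat > /tmp/scylla-demo.yaml <<'EOF'
cluster_name: mycluster
consistent_cluster_management: true
EOF
# Succeeds (and prints the line) only if the option is enabled at the top level:
grep -E '^consistent_cluster_management:[[:space:]]*true' /tmp/scylla-demo.yaml
```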
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
#. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in ``UN`` status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the ScyllaDB version. Validate that the version matches the one you upgraded to.
#. Check the scylla-server log (with ``journalctl _COMM=scylla``) and ``/var/log/syslog`` to validate that there are no new errors in the log.
#. Check again after two minutes to validate that no new issues have been introduced.
Once you are sure the node upgrade was successful, move to the next node in the cluster.
See |SCYLLA_METRICS|_ for more information.
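The REST endpoint used in the validation steps returns the release version as a JSON-quoted string. A minimal sketch of extracting the major.minor part for comparison against the version you upgraded to (the response below is illustrative; on a live node it comes from ``curl -s http://localhost:10000/storage_service/scylla_release_version``):

```shell
# Illustrative response from the REST API:
response='"5.2.0-0.20230101.abcdef"'
# Strip the JSON quotes and keep only major.minor:
echo "$response" | tr -d '"' | cut -d. -f1,2   # -> 5.2
```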
After upgrading every node
==========================
The following section applies only if you enabled the ``consistent_cluster_management`` option on every node when upgrading the cluster.
.. _validate-raft-setup:
Validate Raft setup
-------------------
Enabling ``consistent_cluster_management`` on every node during upgrade will cause the Scylla cluster to start an additional internal procedure as soon as every node is upgraded to the new version.
The goal of this procedure is to initialize data structures used by the Raft algorithm to consistently manage cluster-wide metadata such as table schemas.
Assuming you performed the rolling upgrade procedure correctly, in particular ensuring that the schema is synchronized at every step, and that there are no problems with cluster connectivity, this follow-up internal procedure should take no longer than a few seconds to finish.
However, the procedure requires **full cluster availability**. If an unlucky accident (e.g. a hardware problem) causes one of your nodes to fail before this procedure finishes, the procedure may get stuck. This may cause the cluster to end up in a state where schema change operations are unavailable.
Therefore, following the rolling upgrade, **you must verify** that this internal procedure has finished successfully by checking the logs of every Scylla node.
If the procedure gets stuck, manual intervention is required.
Refer to the following document for instructions on how to verify that the procedure was successful and how to proceed if it gets stuck: :ref:`Verifying that the internal Raft upgrade procedure finished successfully <verify-raft-procedure>`.
Rollback Procedure
==================
.. include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from |SCYLLA_NAME| |NEW_VERSION|.x to |SRC_VERSION|.y. Apply this procedure if an upgrade from |SRC_VERSION| to |NEW_VERSION| failed before completing on all nodes. Use this procedure only for nodes you upgraded to |NEW_VERSION|.
.. warning::
The rollback procedure can be applied **only** if some nodes have not been upgraded to |NEW_VERSION| yet.
As soon as the last node in the rolling upgrade procedure is started with |NEW_VERSION|, rollback becomes impossible.
At that point, the only way to restore a cluster to |SRC_VERSION| is by restoring it from backup.
ScyllaDB rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes you roll back to |SRC_VERSION|, serially (i.e. one node at a time), you will:
* Drain the node and stop Scylla
* Retrieve the old ScyllaDB packages
* Restore the configuration file
* Restore system tables
* Reload systemd configuration
* Restart ScyllaDB
* Validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the rollback was successful and the node is up and running the old version.
Rollback Steps
==============
Drain and gracefully stop the node
----------------------------------
.. code:: sh
nodetool drain
sudo service scylla-server stop
Download and install the old release
------------------------------------
..
TODO: downgrade for 3rd party packages in EC2/GCP/Azure - like in the upgrade section?
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
.. code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
#. Update the |SCYLLA_RPM_SRC_REPO|_ to |SRC_VERSION|.
#. Install:
.. code:: console
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\*tools-core
sudo yum downgrade scylla\* -y
sudo yum install scylla
.. group-tab:: EC2/GCP/Azure Ubuntu Image
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
.. code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer y to the first two questions.
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for reference.
.. code:: sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
sudo cp -r * /var/lib/scylla/data/keyspace_name/table_name-UUID/
sudo chown -R scylla:scylla /var/lib/scylla/data/keyspace_name/table_name-UUID/
Reload systemd configuration
----------------------------
If the systemd unit file has changed, you must reload the systemd configuration.
.. code:: sh
sudo systemctl daemon-reload
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.

View File

@@ -6,7 +6,6 @@ Scylla Enterprise Features
:maxdepth: 2
:hidden:
Lightweight Transactions </using-scylla/lwt/>
Workload Prioritization </using-scylla/workload-prioritization/>
In-memory tables </using-scylla/in-memory/>
Global Secondary Indexes </using-scylla/secondary-indexes/>

View File

@@ -2,10 +2,9 @@
Scylla in-memory tables
=========================
:label-tip:`ScyllaDB Enterprise`
.. versionadded:: 2018.1.7 Scylla Enterprise
.. include:: /rst_include/enterprise-only-note.rst
.. versionadded:: 2018.1.7
Overview
========

View File

@@ -2,9 +2,7 @@
Workload Prioritization
========================
.. include:: /rst_include/enterprise-only-note.rst
:label-tip:`ScyllaDB Enterprise`
In a typical database there are numerous workloads running at the same time.
Each workload type dictates a different acceptable level of latency and throughput.

View File

@@ -25,6 +25,7 @@ static const std::map<application_state, sstring> application_state_names = {
{application_state::REMOVAL_COORDINATOR, "REMOVAL_COORDINATOR"},
{application_state::INTERNAL_IP, "INTERNAL_IP"},
{application_state::RPC_ADDRESS, "RPC_ADDRESS"},
{application_state::RAFT_SERVER_ID, "RAFT_SERVER_ID"},
{application_state::SEVERITY, "SEVERITY"},
{application_state::NET_VERSION, "NET_VERSION"},
{application_state::HOST_ID, "HOST_ID"},

View File

@@ -38,8 +38,11 @@ enum class application_state {
IGNORE_MSB_BITS,
CDC_GENERATION_ID,
SNITCH_NAME,
// pad to allow adding new states to existing cluster
X10,
// RAFT ID is a server identifier which is maintained
// and gossiped in addition to HOST_ID because it's truly
// unique: any new node gets a new RAFT ID, while it may keep
// its existing HOST ID, e.g. if it's replacing an existing node.
RAFT_SERVER_ID,
};
std::ostream& operator<<(std::ostream& os, const application_state& m);

View File

@@ -60,9 +60,6 @@ feature_config feature_config_from_db_config(db::config& cfg, std::set<sstring>
if (!cfg.check_experimental(db::experimental_features_t::feature::ALTERNATOR_STREAMS)) {
fcfg._disabled_features.insert("ALTERNATOR_STREAMS"s);
}
if (!cfg.check_experimental(db::experimental_features_t::feature::ALTERNATOR_TTL)) {
fcfg._disabled_features.insert("ALTERNATOR_TTL"s);
}
if (!cfg.check_experimental(db::experimental_features_t::feature::RAFT)) {
fcfg._disabled_features.insert("SUPPORTS_RAFT_CLUSTER_MANAGEMENT"s);
}

View File

@@ -100,7 +100,7 @@ gossiper::gossiper(abort_source& as, feature_service& features, const locator::s
, _failure_detector_timeout_ms(cfg.failure_detector_timeout_in_ms)
, _force_gossip_generation(cfg.force_gossip_generation)
, _gcfg(std::move(gcfg))
, _direct_fd_pinger(*this) {
, _echo_pinger(*this) {
// Gossiper's stuff below runs only on CPU0
if (this_shard_id() != 0) {
return;
@@ -726,7 +726,7 @@ future<> gossiper::do_status_check() {
// check for dead state removal
auto expire_time = get_expire_time_for_endpoint(endpoint);
if (!is_alive && (now > expire_time)
&& (!get_token_metadata_ptr()->is_member(endpoint))) {
&& (!get_token_metadata_ptr()->is_normal_token_owner(endpoint))) {
logger.debug("time is expiring for endpoint : {} ({})", endpoint, expire_time.time_since_epoch().count());
co_await evict_from_membership(endpoint);
}
@@ -970,7 +970,7 @@ void gossiper::run() {
}).get();
}
_direct_fd_pinger.update_generation_number(_endpoint_state_map[get_broadcast_address()].get_heart_beat_state().get_generation()).get();
_echo_pinger.update_generation_number(_endpoint_state_map[get_broadcast_address()].get_heart_beat_state().get_generation()).get();
}).then_wrapped([this] (auto&& f) {
try {
f.get();
@@ -1020,10 +1020,10 @@ std::set<inet_address> gossiper::get_live_members() const {
std::set<inet_address> gossiper::get_live_token_owners() const {
std::set<inet_address> token_owners;
for (auto& member : get_live_members()) {
auto es = get_endpoint_state_for_endpoint_ptr(member);
if (es && !is_dead_state(*es) && get_token_metadata_ptr()->is_member(member)) {
token_owners.insert(member);
auto normal_token_owners = get_token_metadata_ptr()->get_all_endpoints();
for (auto& node: normal_token_owners) {
if (is_alive(node)) {
token_owners.insert(node);
}
}
return token_owners;
@@ -1031,10 +1031,10 @@ std::set<inet_address> gossiper::get_live_token_owners() const {
std::set<inet_address> gossiper::get_unreachable_token_owners() const {
std::set<inet_address> token_owners;
for (auto&& x : _unreachable_endpoints) {
auto& endpoint = x.first;
if (get_token_metadata_ptr()->is_member(endpoint)) {
token_owners.insert(endpoint);
auto normal_token_owners = get_token_metadata_ptr()->get_all_endpoints();
for (auto& node: normal_token_owners) {
if (!is_alive(node)) {
token_owners.insert(node);
}
}
return token_owners;
@@ -1300,7 +1300,7 @@ bool gossiper::is_gossip_only_member(inet_address endpoint) {
if (!es) {
return false;
}
return !is_dead_state(*es) && !get_token_metadata_ptr()->is_member(endpoint);
return !is_dead_state(*es) && !get_token_metadata_ptr()->is_normal_token_owner(endpoint);
}
clk::time_point gossiper::get_expire_time_for_endpoint(inet_address endpoint) const noexcept {
@@ -1852,7 +1852,7 @@ future<> gossiper::start_gossiping(int generation_nbr, std::map<application_stat
co_await container().invoke_on_all([] (gms::gossiper& g) {
g._failure_detector_loop_done = g.failure_detector_loop();
});
co_await _direct_fd_pinger.update_generation_number(generation_nbr);
co_await _echo_pinger.update_generation_number(generation_nbr);
}
future<std::unordered_map<gms::inet_address, int32_t>>
@@ -2538,68 +2538,18 @@ locator::token_metadata_ptr gossiper::get_token_metadata_ptr() const noexcept {
return _shared_token_metadata.get();
}
future<> gossiper::direct_fd_pinger::update_generation_number(int64_t n) {
future<> echo_pinger::update_generation_number(int64_t n) {
if (n <= _generation_number) {
return make_ready_future<>();
}
return _gossiper.container().invoke_on_all([n] (gossiper& g) {
g._direct_fd_pinger._generation_number = n;
g._echo_pinger._generation_number = n;
});
}
direct_failure_detector::pinger::endpoint_id gossiper::direct_fd_pinger::allocate_id(gms::inet_address addr) {
assert(this_shard_id() == 0);
auto it = _addr_to_id.find(addr);
if (it == _addr_to_id.end()) {
auto id = _next_allocated_id++;
_id_to_addr.emplace(id, addr);
it = _addr_to_id.emplace(addr, id).first;
logger.debug("gossiper::direct_fd_pinger: assigned endpoint ID {} to address {}", id, addr);
}
return it->second;
}
future<gms::inet_address> gossiper::direct_fd_pinger::get_address(direct_failure_detector::pinger::endpoint_id id) {
auto it = _id_to_addr.find(id);
if (it == _id_to_addr.end()) {
// Fetch the address from shard 0. By precondition it must be there.
auto addr = co_await _gossiper.container().invoke_on(0, [id] (gossiper& g) {
auto it = g._direct_fd_pinger._id_to_addr.find(id);
if (it == g._direct_fd_pinger._id_to_addr.end()) {
on_internal_error(logger, format("gossiper::direct_fd_pinger: endpoint id {} has no corresponding address", id));
}
return it->second;
});
it = _id_to_addr.emplace(id, addr).first;
}
co_return it->second;
}
future<bool> gossiper::direct_fd_pinger::ping(direct_failure_detector::pinger::endpoint_id id, abort_source& as) {
try {
co_await _gossiper._messaging.send_gossip_echo(netw::msg_addr(co_await get_address(id)), _generation_number, as);
} catch (seastar::rpc::closed_error&) {
co_return false;
}
co_return true;
future<> echo_pinger::ping(const gms::inet_address& addr, abort_source& as) {
return _gossiper._messaging.send_gossip_echo(netw::msg_addr(addr), _generation_number, as);
}
} // namespace gms
direct_failure_detector::clock::timepoint_t direct_fd_clock::now() noexcept {
return base::now().time_since_epoch().count();
}
future<> direct_fd_clock::sleep_until(direct_failure_detector::clock::timepoint_t tp, abort_source& as) {
auto t = base::time_point{base::duration{tp}};
auto n = base::now();
if (t <= n) {
return make_ready_future<>();
}
return sleep_abortable(t - n, as);
}

View File

@@ -82,6 +82,23 @@ struct gossip_config {
uint32_t skip_wait_for_gossip_to_settle = -1;
};
class gossiper;
// Caches the gossiper's generation number, which is required for sending gossip echo messages.
// Call `ping` to send a gossip echo message to the given address using the last known generation number.
// The generation number is updated by gossiper's loop and replicated to every shard.
class echo_pinger {
friend class gossiper;
gossiper& _gossiper;
int64_t _generation_number{0};
future<> update_generation_number(int64_t n);
echo_pinger(gossiper& g) : _gossiper(g) {}
public:
future<> ping(const gms::inet_address&, abort_source&);
};
/**
* This module is responsible for Gossiping information for the local endpoint. This abstraction
* maintains the list of live and dead endpoints. Periodically i.e. every 1 second this module
@@ -95,6 +112,7 @@ struct gossip_config {
* the Failure Detector.
*/
class gossiper : public seastar::async_sharded_service<gossiper>, public seastar::peering_sharded_service<gossiper> {
friend class echo_pinger;
public:
using clk = seastar::lowres_system_clock;
using ignore_features_of_local_node = bool_class<class ignore_features_of_local_node_tag>;
@@ -605,53 +623,13 @@ private:
future<> update_live_endpoints_version();
public:
// Implementation of `direct_failure_detector::pinger` which uses gossip echo messages for pinging.
// The gossip echo message must be provided this node's gossip generation number.
// It's an integer incremented when the node restarts or when the gossip subsystem restarts.
// We cache the generation number inside `direct_fd_pinger` on every shard and update it in the `gossiper` main loop.
//
// We also store a mapping between `direct_failure_detector::pinger::endpoint_id`s and `inet_address`es.
class direct_fd_pinger : public direct_failure_detector::pinger {
friend class gossiper;
gossiper& _gossiper;
// Only used on shard 0 by `allocate_id`.
direct_failure_detector::pinger::endpoint_id _next_allocated_id{0};
// The mappings are created on shard 0 and lazily replicated to other shards:
// when `ping` or `get_address` is called with an unknown ID on a different shard, it will fetch the ID from shard 0.
std::unordered_map<direct_failure_detector::pinger::endpoint_id, inet_address> _id_to_addr;
// Used to quickly check if given address already has an assigned ID.
// Used only on shard 0, not replicated.
std::unordered_map<inet_address, direct_failure_detector::pinger::endpoint_id> _addr_to_id;
// This node's gossip generation number, updated by gossiper's loop and replicated to every shard.
int64_t _generation_number{0};
future<> update_generation_number(int64_t n);
direct_fd_pinger(gossiper& g) : _gossiper(g) {}
public:
direct_fd_pinger(const direct_fd_pinger&) = delete;
// Allocate a new endpoint_id for `addr`, or if one already exists, return it.
// Call only on shard 0.
direct_failure_detector::pinger::endpoint_id allocate_id(gms::inet_address addr);
// Precondition: `id` was returned from `allocate_id` on shard 0 earlier.
future<gms::inet_address> get_address(direct_failure_detector::pinger::endpoint_id id);
future<bool> ping(direct_failure_detector::pinger::endpoint_id id, abort_source& as) override;
};
direct_fd_pinger& get_direct_fd_pinger() { return _direct_fd_pinger; }
echo_pinger& get_echo_pinger() { return _echo_pinger; }
private:
direct_fd_pinger _direct_fd_pinger;
echo_pinger _echo_pinger;
};
struct gossip_get_endpoint_states_request {
// Application states the sender requested
std::unordered_set<gms::application_state> application_states;
@@ -662,11 +640,3 @@ struct gossip_get_endpoint_states_response {
};
} // namespace gms
// XXX: find a better place to put this?
struct direct_fd_clock : public direct_failure_detector::clock {
using base = std::chrono::steady_clock;
direct_failure_detector::clock::timepoint_t now() noexcept override;
future<> sleep_until(direct_failure_detector::clock::timepoint_t tp, abort_source& as) override;
};

View File

@@ -151,6 +151,10 @@ public:
return versioned_value(host_id.to_sstring());
}
static versioned_value raft_server_id(const utils::UUID& id) {
return versioned_value(id.to_sstring());
}
static versioned_value tokens(const std::unordered_set<dht::token>& tokens) {
return versioned_value(make_full_token_string(tokens));
}

View File

@@ -80,6 +80,11 @@ struct not_a_leader {
raft::server_id leader;
};
struct transient_error {
sstring message;
raft::server_id leader;
};
struct commit_status_unknown {
};

View File

@@ -45,6 +45,7 @@ debian_base_packages=(
pigz
libunistring-dev
libzstd-dev
libdeflate-dev
)
fedora_packages=(
@@ -58,6 +59,7 @@ fedora_packages=(
jsoncpp-devel
rapidjson-devel
snappy-devel
libdeflate-devel
systemd-devel
git
python
@@ -170,11 +172,11 @@ arch_packages=(
thrift
)
NODE_EXPORTER_VERSION=1.3.1
NODE_EXPORTER_VERSION=1.4.0
declare -A NODE_EXPORTER_CHECKSUM=(
["x86_64"]=68f3802c2dd3980667e4ba65ea2e1fb03f4a4ba026cca375f15a0390ff850949
["aarch64"]=f19f35175f87d41545fa7d4657e834e3a37c1fe69f3bf56bc031a256117764e7
["s390x"]=a12802101a5ee1c74c91bdaa5403c00011ebdf36b83b617c903dbd356a978d03
["x86_64"]=e77ff1b0a824a4e13f82a35d98595fe526849c09e3480d0789a56b72242d2abc
["aarch64"]=0b20aa75385a42857a67ee5f6c7f67b229039a22a49c5c61c33f071356415b59
["s390x"]=a98e2aa5f9e557441190d233ba752c0cae28f3130c6a6742b038f3997d034065
)
declare -A NODE_EXPORTER_ARCH=(
["x86_64"]=amd64
@@ -314,7 +316,7 @@ elif [ "$ID" = "fedora" ]; then
pip3 install "$PIP_DEFAULT_ARGS" traceback-with-variables
pip3 install "$PIP_DEFAULT_ARGS" scylla-api-client
cargo install cxxbridge-cmd --root /usr/local
cargo --config net.git-fetch-with-cli=true install cxxbridge-cmd --root /usr/local
if [ -f "$(node_exporter_fullpath)" ] && node_exporter_checksum; then
echo "$(node_exporter_filename) already exists, skipping download"
else

View File

@@ -70,6 +70,7 @@ upgrade=false
supervisor=false
supervisor_log_to_stdout=false
without_systemd=false
skip_systemd_check=false
while [ $# -gt 0 ]; do
case "$1" in
@@ -99,6 +100,7 @@ while [ $# -gt 0 ]; do
;;
"--packaging")
packaging=true
skip_systemd_check=true
shift 1
;;
"--upgrade")
@@ -107,6 +109,7 @@ while [ $# -gt 0 ]; do
;;
"--supervisor")
supervisor=true
skip_systemd_check=true
shift 1
;;
"--supervisor-log-to-stdout")
@@ -115,6 +118,7 @@ while [ $# -gt 0 ]; do
;;
"--without-systemd")
without_systemd=true
skip_systemd_check=true
shift 1
;;
"--help")
@@ -246,7 +250,7 @@ supervisor_conf() {
fi
}
if ! $packaging && [ ! -d /run/systemd/system/ ] && ! $supervisor; then
if ! $skip_systemd_check && [ ! -d /run/systemd/system/ ]; then
echo "systemd is not detected, unsupported distribution."
exit 1
fi
@@ -566,7 +570,7 @@ if $nonroot; then
# nonroot install is also 'offline install'
touch $rprefix/SCYLLA-OFFLINE-FILE
touch $rprefix/SCYLLA-NONROOT-FILE
if ! $supervisor && ! $packaging && ! $without_systemd && check_usermode_support; then
if ! $skip_systemd_check && check_usermode_support; then
systemctl --user daemon-reload
fi
echo "Scylla non-root install completed."

Submodule libdeflate deleted from e7e54eab42

View File

@@ -12,6 +12,8 @@
#include <boost/range/algorithm/remove_if.hpp>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include "replica/database.hh"
#include "utils/stall_free.hh"
namespace locator {
@@ -227,7 +229,7 @@ abstract_replication_strategy::get_address_ranges(const token_metadata& tm) cons
future<std::unordered_multimap<inet_address, dht::token_range>>
abstract_replication_strategy::get_address_ranges(const token_metadata& tm, inet_address endpoint) const {
std::unordered_multimap<inet_address, dht::token_range> ret;
if (!tm.is_member(endpoint)) {
if (!tm.is_normal_token_owner(endpoint)) {
co_return ret;
}
bool is_everywhere_topology = get_type() == replication_strategy_type::everywhere_topology;
@@ -468,6 +470,44 @@ void effective_replication_map_factory::submit_background_work(future<> fut) {
});
}
future<> global_effective_replication_map::get_keyspace_erms(sharded<replica::database>& sharded_db, std::string_view keyspace_name) {
return sharded_db.invoke_on(0, [this, &sharded_db, keyspace_name] (replica::database& db) -> future<> {
// To ensure we get the same effective_replication_map
// on all shards, acquire the shared_token_metadata lock.
//
// As a sanity check compare the ring_version on each shard
// to the reference version on shard 0.
//
// This invariant is achieved by storage_service::mutate_token_metadata
// and storage_service::replicate_to_all_cores that first acquire the
// shared_token_metadata lock, then prepare a mutated token metadata
// that will have an incremented ring_version, use it to re-calculate
// all e_r_m:s and clone both on all shards, including the ring version,
// all under the lock.
auto lk = co_await db.get_shared_token_metadata().get_lock();
auto erm = db.find_keyspace(keyspace_name).get_effective_replication_map();
auto ring_version = erm->get_token_metadata().get_ring_version();
_erms[0] = make_foreign(std::move(erm));
co_await coroutine::parallel_for_each(boost::irange(1u, smp::count), [this, &sharded_db, keyspace_name, ring_version] (unsigned shard) -> future<> {
_erms[shard] = co_await sharded_db.invoke_on(shard, [keyspace_name, ring_version] (const replica::database& db) {
const auto& ks = db.find_keyspace(keyspace_name);
auto erm = ks.get_effective_replication_map();
auto local_ring_version = erm->get_token_metadata().get_ring_version();
if (local_ring_version != ring_version) {
on_internal_error(rslogger, format("Inconsistent effective_replication_map ring_version {}, expected {}", local_ring_version, ring_version));
}
return make_foreign(std::move(erm));
});
});
});
}
future<global_effective_replication_map> make_global_effective_replication_map(sharded<replica::database>& sharded_db, std::string_view keyspace_name) {
global_effective_replication_map ret;
co_await ret.get_keyspace_erms(sharded_db, keyspace_name);
co_return ret;
}
} // namespace locator
std::ostream& operator<<(std::ostream& os, locator::replication_strategy_type t) {


@@ -22,6 +22,7 @@
// forward declaration since replica/database.hh includes this file
namespace replica {
class database;
class keyspace;
}
@@ -101,7 +102,7 @@ public:
virtual inet_address_vector_replica_set get_natural_endpoints(const token& search_token, const effective_replication_map& erm) const;
virtual void validate_options() const = 0;
virtual std::optional<std::set<sstring>> recognized_options(const topology&) const = 0;
virtual std::optional<std::unordered_set<sstring>> recognized_options(const topology&) const = 0;
virtual size_t get_replication_factor(const token_metadata& tm) const = 0;
// Decide if the replication strategy allow removing the node being
// replaced from the natural endpoints when a node is being replaced in the
@@ -265,6 +266,33 @@ inline mutable_effective_replication_map_ptr make_effective_replication_map(abst
// Apply the replication strategy over the current configuration and the given token_metadata.
future<mutable_effective_replication_map_ptr> calculate_effective_replication_map(abstract_replication_strategy::ptr_type rs, token_metadata_ptr tmptr);
// Class to hold a coherent view of a keyspace
// effective replication map on all shards
class global_effective_replication_map {
std::vector<foreign_ptr<effective_replication_map_ptr>> _erms;
public:
global_effective_replication_map() : _erms(smp::count) {}
global_effective_replication_map(global_effective_replication_map&&) = default;
global_effective_replication_map& operator=(global_effective_replication_map&&) = default;
future<> get_keyspace_erms(sharded<replica::database>& sharded_db, std::string_view keyspace_name);
const effective_replication_map& get() const noexcept {
return *_erms[this_shard_id()];
}
const effective_replication_map& operator*() const noexcept {
return get();
}
const effective_replication_map* operator->() const noexcept {
return &get();
}
};
future<global_effective_replication_map> make_global_effective_replication_map(sharded<replica::database>& sharded_db, std::string_view keyspace_name);
} // namespace locator
std::ostream& operator<<(std::ostream& os, locator::replication_strategy_type);


@@ -24,7 +24,7 @@ public:
virtual void validate_options() const override { /* noop */ }
std::optional<std::set<sstring>> recognized_options(const topology&) const override {
std::optional<std::unordered_set<sstring>> recognized_options(const topology&) const override {
// We explicitly allow all options
return std::nullopt;
}


@@ -46,6 +46,10 @@ gossiping_property_file_snitch::gossiping_property_file_snitch(const snitch_conf
if (this_shard_id() == _file_reader_cpu_id) {
io_cpu_id() = _file_reader_cpu_id;
}
if (_listen_address->addr().is_addr_any()) {
logger().warn("Not gossiping INADDR_ANY as internal IP");
_listen_address.reset();
}
}
future<> gossiping_property_file_snitch::start() {
@@ -104,12 +108,15 @@ void gossiping_property_file_snitch::periodic_reader_callback() {
}
std::list<std::pair<gms::application_state, gms::versioned_value>> gossiping_property_file_snitch::get_app_states() const {
sstring ip = format("{}", _listen_address);
return {
std::list<std::pair<gms::application_state, gms::versioned_value>> ret = {
{gms::application_state::DC, gms::versioned_value::datacenter(_my_dc)},
{gms::application_state::RACK, gms::versioned_value::rack(_my_rack)},
{gms::application_state::INTERNAL_IP, gms::versioned_value::internal_ip(std::move(ip))},
};
if (_listen_address.has_value()) {
sstring ip = format("{}", *_listen_address);
ret.emplace_back(gms::application_state::INTERNAL_IP, gms::versioned_value::internal_ip(std::move(ip)));
}
return ret;
}
future<> gossiping_property_file_snitch::read_property_file() {


@@ -93,7 +93,7 @@ private:
unsigned _file_reader_cpu_id;
snitch_signal_t _reconfigured;
promise<> _io_is_stopped;
gms::inet_address _listen_address;
std::optional<gms::inet_address> _listen_address;
void reset_io_state() {
// Reset the promise to allow repeating


@@ -24,7 +24,7 @@ future<endpoint_set> local_strategy::calculate_natural_endpoints(const token& t,
void local_strategy::validate_options() const {
}
std::optional<std::set<sstring>> local_strategy::recognized_options(const topology&) const {
std::optional<std::unordered_set<sstring>> local_strategy::recognized_options(const topology&) const {
// LocalStrategy doesn't expect any options.
return {};
}

Some files were not shown because too many files have changed in this diff.