Commit Graph

31193 Commits

Author SHA1 Message Date
Avi Kivity
e6fe9cc683 Merge 'Reapply: "disable_auto_compaction: stop ongoing compactions"' from Eliran Sinvani
Reapply: "disable_auto_compaction: stop ongoing compactions"
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original where a call to the
compaction_manager constructor accidentally issued (`compaction_manager()`)
instead a call to retrieve a compaction manager reference
(`get_compaction_manager()`), we don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
bu was removed in 9066224cf4 later on,
instead, we just use the private table member _compaction_manager which refs
the compaction manager.

The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).

Test: unit (dev,debug)
      write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381

Fixes https://github.com/scylladb/scylla/issues/9313
Refs https://github.com/scylladb/scylla/issues/10146

For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.

Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compactions job are
running when nodetool disableautocompaction returns successfully.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10597

* github.com:scylladb/scylla:
  compaction_manager: Make invoking the empty constructor more explicit
  Reapply: "disable_auto_compaction: stop ongoing compactions"
2022-05-18 18:33:12 +03:00
Eliran Sinvani
c5e5692a01 compaction_manager: Make invoking the empty constructor more explicit
The compaction manager's empty constructor is supposed to be invoked
only in testing environment, however, it is easy to invoke it by mistake
from production code.
Here we add a more verbose constructor and making the default compaction
private, the verbose compiler need to be invoked with a tag
for_testing_tag, this will ensure that this constructor will be invoked
only when intended.
The unit tests were changed according to this new paradigm.

Tests: unit (dev)

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-18 14:57:10 +03:00
Eliran Sinvani
c138981286 Reapply: "disable_auto_compaction: stop ongoing compactions"
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original where a call to the
compaction_manager constructor accidentally issued (`compaction_manager()`)
instead a call to retrieve a compaction manager reference
(`get_compaction_manager()`), we don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
bu was removed in 9066224cf4 later on,
instead, we just use the private table member _compaction_manager which refs
the compaction manager.

The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).

Test: unit (dev,debug)
      write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381

Fixes #9313
Refs #10146

For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.

Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compactions job are
running when nodetool disableautocompaction returns successfully.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-18 14:57:10 +03:00
Botond Dénes
6fd1322bf3 test/boost/mutation_test: test_query_digest: use the same now everywhere
This test started failing sporadically of late. This failure is seen
quite often in CI tests but is very hard to reproduce locally. The
problem seems to be timing related, as the same seeds that fail in CI
don't fail locally. This patch is a speculative fix. The test has a
single time-related components: `gc_clock::now()`. This is invoked in 4
different places during a single iteration, giving ample opportunity for
off-by-one errors to appear. Although there is no solid proof for this
being the problem, this is a good candidate. This patch replaces all
those different invocations, with a single one per test: this value is
then propagated to all places that need it.

Fixes: #10554

Marking the patch as a fix for the issue, if the problem re-surfaces
after this patch we'll re-poen it.

Closes #10589
2022-05-17 16:25:04 +03:00
Takuya ASADA
883b97d8b2 dist/common/scripts: generate debug log when exception occurred
Using traceback_with_variables module, generate more detail traceback
with variables into debug log.
This will help fixing bugs which is hard to reproduce.

Closes #10472

[avi: regenerate frozen toolchain]
2022-05-17 13:18:27 +03:00
Raphael S. Carvalho
ca322fb7c2 compaction_manager: Quickly abort maintenance compaction waiting for its turn
Today, aborting a maintenance compaction like major, which is waiting for
its turn to run, can take lots of time because compaction manager will
only be able to bail out the task once it gets the "permit" from the
serialization mechanism, i.e. semaphore. Meaning that the command that
started the task will only complete after all this time waiting for
the "permit".

To allow a pending maintenance compaction to be quickly aborted, we
can use the abortable variant of get_units(). So when user submits an
abortion request, get_units() will be able to return earlier through
the abort exception.

Refs #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10581
2022-05-17 13:14:51 +03:00
Avi Kivity
a8f9ed56fb Merge 'cql3: Replace relation class with expression' from Jan Ciołek
Before this change parser used to output instances of the `relation` class, which were later converted to `restrction`.
`relation` took care of initial processing such as preparing and some validation checks.

This PR aims to remove the `relation` class and perform it's functionality using only `expression`.
This is a step towards removing the legacy classes and converting all AST analysis to work on `expressions`.

Closes #10409

* github.com:scylladb/scylla:
  cql3: Remove scalar from bind_variable_scalar_prepare_expression
  cql3: expr: Remove shape_type from bind_variable
  cql3: Remove prepare_expression_multi_column
  cql3: Remove relation class
  cql3: Add more tests for expr::printer
  cql3: Make parser output expression for relations
  cql3: expr: add printer for expression
  cql3: expr: expr::to_restriction: Handle token relations
  cql3: expr: expr::to_restriction: Handle multi column relations
  cql3: expr: Add expr::to_restriction for single column relations
  cql3: expr: Add prepare_binary_operator
  cql3: expr: Change how prepare_expression handles bind_variable
  clq3: expr: Add columns to expr::token struct
  cql3: expr: Modify list_prepare_expression to handle lists of IN values
  cql3: expr: Add expr::as_if for non-const expressions
2022-05-17 12:51:54 +03:00
Takuya ASADA
00ce34c29b scylla_prepare: describe error more correctly
Currently our error message on scylla_prepare says "Exception occurred
while creating perftune.yaml", even perftune.yaml is already generated,
and error occurred after that.
To describe error more correctly, add another error message after
perftune.yaml generated.

see scylladb/scylla-enterprise#2201

Closes #10575
2022-05-16 20:05:58 +03:00
cvybhu
21453ac9a4 cql3: Remove scalar from bind_variable_scalar_prepare_expression
There is now only one function to prepare bind_variable,
so we can remove 'scalar' from its name.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
9b49d27a8d cql3: expr: Remove shape_type from bind_variable
shape_type was used in prepare_expression to differentiate
between a few cases and create the correct receivers.
This was used by the relation class.

Now creating the correct receiver has been delegated to the caller
of prepare_expression and all bind_variables can be handled
in the same simple way.

shape_type is not needed anymore.

Not having it is better because it simplifies things.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
c0fc82d4be cql3: Remove prepare_expression_multi_column
This function was used by multi_column_relation.hh,
but now it isn't needed anymore.

The only way to prepare a bind_variable is now the standard prepare_expression.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
d85f680df3 cql3: Remove relation class
Functionality of the relation class has been replaced by
expr::to_restriction.

Relation and all classes deriving from it can now be removed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
575c4bd76b cql3: Add more tests for expr::printer
Now that parser outputs expressions
it's much easier to check whether
expression printer works correctly.

We can prepare a bunch of strings
which will be parsed and then printed
back to string.
Then we can compare those strings.

It's much easier than creating
expresions to print manually.

The only downside is that this tests
only unprepared version of expression,
so instead of column_value there will
be unresolved identifier, insted of constant
untyped_constant etc.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
51cdbdeacb cql3: Make parser output expression for relations
Parser used to output the where clause as a vector of relations,
but now we can change it to a vector of expressions.

Cql.g needs to be modified to output expressions instead
of relations.

The WHERE clause is kept in a few places in the code that
need to be changed to vector<expression>.

Finally relation->to_restriction is replaced by expr::to_restriction
and the expressions are converted to restrictions where required.

The relation class isn't used anywhere now and can be removed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
Michał Sala
f6bdc4d694 cql3: expr: add printer for expression
expression::printer is used to print CQL expressions
in a pretty way that allows them to be parsed back
to the same representation.

There is a bunch of things that need to be changed when
compared to the current implementation of opreatorr<<(expression)
to output something parsable.

column names should be printed without 'unresolved_identifier()'
and sometimes they need to be quoted to perserve case sensitivity.

I needed to write new code for printing constant values
because the current one did debug printing
(e.g. a set was printed as '1; 2; 3').

A list of IN values should be printed inside () intead of [],
but because it is internally represented as a list it is
by default printed with [].
To fix this a temporary tuple_constructor is created and printed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
c4f846dbc8 cql3: expr: expr::to_restriction: Handle token relations
Implement converting token relations to expressions.

The code is mostly tekken from functions in token_relation.hh,
because we are replicating functionliaty of the functions called
token_relation::new_XX_restrictions.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
5fc5012f9b cql3: expr: expr::to_restriction: Handle multi column relations
Implement converting multi column relations to expressions.

The code is mostly taken from functions in multi_column_relation.hh,
because we are replicating functionality of the functions called
multi_column_relation::new_XX_restriction.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
89950e02b5 cql3: expr: Add expr::to_restriction for single column relations
Add a function that will be used to convert expressions
received from the parser to restrictions.

Currently parser creates relations with expressions inside
and then those relations are converted to restrictions.

Once this function is implemented we will be able to skip
creating relations altogether and convert straight from
expression to restriction. This will allow us to remove
the relation class.

Further functionality will be implemented in the following commits.
This commit implements converting single column relations to expressions.

The code is mostly taken from functions in single_column_relation.hh,
because we are replicating functionality of the functions called
single_column_relation::new_XX_restriction.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:57 +02:00
cvybhu
3e5e5c4a17 cql3: expr: Add prepare_binary_operator
Add a function that allows to prepare
a binary_operator received from the parser.

It resolves columns on the LHS, calculates type of LHS,
and prepares RHS with the correct type.

It will be used by expr::to_restriction.

Some basic type checks are performed, but more throughout
checks will be required in expr::to_restriction to fully
validate a relation.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:15:37 +02:00
cvybhu
5dee55d433 cql3: expr: Change how prepare_expression handles bind_variable
The situation with preparing bind_variable is a bit strange,
there are four shapes of bind variables and receiver behaviour
is not in line with other types.

To prepare a bind_variable for a list of IN values for an int column
the current code requires us to pass a receiver of type int.
This is counterintuitive, to prepare a string we pass
a receiver with string type, so to prepare list<int> we should
pass a receiver of type list<int>, not just int.

This commit changes the behaviour in two ways:
- Shape of bind_variable doesn't matter anymore
- The bind_variable gets the receiver passed to prepare_expression,
  no more list<receiver> magic.

Other variants of bind_variable_x_prepare_expression are not removed yet
because they are needed by prepare_expression_mutlti_column.
They will be removed later, along with bind_variable::shape_type.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:15:36 +02:00
cvybhu
be6e741b6c clq3: expr: Add columns to expr::token struct
The expr::token struct is created when something
like token(p1, p2) occurs in the WHERE clause.

Currently expr::token doesn't keep columns passed
as arguemnts to the token function.

They weren't needed because token() validation
was done inside token_relation.

Now that we want to use only expressions
we need to have columns inside the token struct
and validate that those are the correct columns.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
cvybhu
b99aae7d41 cql3: expr: Modify list_prepare_expression to handle lists of IN values
The standard CQL list type doesn't allow for nulls inside the collection.

However lists of IN values are the exception where bind nullsare allowed,
for example in restrictions like: p IN (1, 2, null)

To be able to use list_prepare_expression with lists of IN values
a flag is added to specify whether nulls should be allowed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
cvybhu
2b5818697a cql3: expr: Add expr::as_if for non-const expressions
expr::as_if is our wrapper for std::get_if.

There was a version for const expression*,
but there weren't one for mutable expression*.

Add the mutable version,
it will be needed in the following commits.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
Nadav Har'El
a2b9a17927 test/cql-pytest: add test for default clustering order of SELECT
Our documentation for SELECT,
https://docs.scylladb.com/getting-started/dml/#ordering-results
says that:

  "The ORDER BY clause lets you select the order of the returned results.
   It takes as argument a list of column names along with the order for
   the column (ASC for ascendant and DESC for descendant,
   **omitting the order being equivalent to ASC**)."

The test in this patch confirms that the last emphasized line is not
accurate - The default order for SELECT is the default order of the table
being read - NOT always ascending order. If the table was created with
descending WITH CLUSTERING ORDER BY, then a SELECT not specifying
an ORDER BY will get this descending order by default.

The test passes on both Scylla and Cassandra, demonstrating that this
behavior is expected and correct - regardless of what our docs say.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220515115030.775813-1-nyh@scylladb.com>
2022-05-16 11:52:02 +02:00
Avi Kivity
45f75ef595 service/memory_limiter.hh: correct license
When split from service/storage_service.hh (4ca2ae13) it accidentally
changed license. Change it back (since it does not contain
Apache derived code, constrain it to AGPL-3.0-or-later).

Closes #10572
2022-05-16 10:01:06 +03:00
Benny Halevy
8a6f8c622d sstables: writer: pass bytes_ostream by reference
The bytes_stream param is passed by value from `write_promoted_index`
(since 0d8463aba5)
causing an uneeded copy.

This can lead to OOM if the promoted index is extremely large.

Pass the bytes_ostream by reference instead to prevent this copy.

Fixes #10569

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10570
2022-05-15 17:53:38 +03:00
Avi Kivity
528ab5a502 treewide: change metric calls from make_derive to make_counter
make_derive was recently deprecated in favor of make_counter, so
make the change throughput the codebase.

Closes #10564
2022-05-14 12:53:55 +02:00
Piotr Sarna
768b5f3f29 utils: mark loading_cache::shrink as noexcept
Current code already assumes (correctly), that shrink() does not
throw, otherwise we risk leaking memory allocated in get_ptr():

```
 ts_value_lru_entry* new_lru_entry = Alloc().template allocate_object<ts_value_lru_entry>();

 // Remove the least recently used items if map is too big.
 shrink();
```

Let's be explicit and mark shrink() and a few helper methods
that it uses as noexcept. Ultimately they are all noexcept anyway,
because polymorphic allocator's deallocation routines don't throw,
and neither do boost intrusive list iterators.

Closes #10565
2022-05-13 18:28:58 +03:00
Nadav Har'El
c51a41a885 test/cql-pytest: test for multiple restrictions on same column
It turns out that there is a difference between how Scylla and Cassandra
handle multiple restrictions on the same column - for example "WHERE
c = 0 and c >0". Cassandra treats all such cases as invalid queries,
whereas Scylla *allows* them.

This test demonstrates this difference (it is marked "scylla_only"
because it's a Scylla-only feature), and also verifies that the results
of such queries on Scylla are correct - i.e., if the two restrictions
conflict the result is empty, and if the two restrictions overlap,
the result can be non-empty.

The test passes, verifying that although Scylla differs from Cassandra
on this, its behavior is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220512165107.644932-1-nyh@scylladb.com>
2022-05-13 08:43:57 +03:00
Avi Kivity
0dd5f02022 build: enable ABSL_PROPAGATE_CXX_STD
Recently Abseil started to ask to enable ABSL_PROPAGATE_CXX_STD,
warning that it will do so itself in the future. Do so, and
specify that we use C++20 to avoid inconsistencies.

Closes #10563
2022-05-13 07:12:03 +02:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Botond Dénes
1f0d3d57eb Merge 'Convert (almost) all uses of flat_mutation_reader_assertions to v2' from Michael Livshin
"Almost" because 2 uses of the v1 asserter remain (as they are deliberate).

Closes #10518

* github.com:scylladb/scylla:
  tests: remove obsolete utility functions
  tests: less trivial flat_reader_assertions{,_v2} conversions
  tests: trivial flat_reader_assertions{,_v2} conversions
  flat_mutation_reader_assertions_v2: improve range tombstone support
2022-05-13 08:04:20 +03:00
Eliran Sinvani
8e8dc2c930 Revert "table: disable_auto_compaction: stop ongoing compactions"
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported, after
bisect it was found that the change that caused it was adding some
code to the table::disable_auto_compaction which stops ongoing
compactions and returning a future that resolves once all the  compaction
tasks for a table, if any, were terminated. It turns out that this function
is used only at startup (and in REST api calls which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. It is then reanabled after the resharding and loading
is done.
For still unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This regains the performance but also undo a change which eventually
should get in once we find the actual culprit.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10559

Reopens #9313.
2022-05-12 18:51:25 +03:00
Benny Halevy
333f6c5ec9 mutation_partition_view: do_accept_gently: keep clustering_row key on stack
We're hitting a unit test failure as in
https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/1010/artifact/testlog/aarch64_dev/frozen_mutation_test.test_writing_and_reading_gently.918.log
```
unknown location(0): fatal error: in "test_writing_and_reading_gently": std::_Nested_exception<std::runtime_error>: frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{00801806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000} of ks.cf
```

on aarch64 clang 12.0.1, in release and dev modes (but not debug).

This turned out to be a miscompilation
in `position_in_partition_view::for_key(cr.key())`
that returns a position_in_partition_view of the
clustering_key_prefix rvalue that cr.key() returns.

The latter is lost on aarch64 in release mode.
Keeping the key on the stack allows to safely pass
a view to it.

Fixes #10555

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-12 16:26:07 +03:00
Piotr Sarna
eb6f4cc839 Merge 'dependencies: add rust' from Wojciech Mitros
The main reason for adding rust dependency to scylla is the
wasmtime library, which is written in rust. Although there
exist c++ bindings, they don't expose all of its features,
so we want to do that ourselves using rust's cxx.

The patch also includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.

The usage of wasmtime has been slightly modified to avoid
duplicate symbol errors, but as a result of adding a Rust
dependency, it is going to be removed from `configure.py`
completely anyway

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #10341

* github.com:scylladb/scylla:
  docs: document rust
  tests: add rust example
2022-05-12 15:24:58 +02:00
Michael Livshin
00ed4ac74c batchlog_manager: warn when a batch fails to replay
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.

(Would info be more appropriate here than warning?)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10556
2022-05-12 13:34:03 +03:00
Wojciech Mitros
cb5d054a67 docs: document rust
Using Rust in Scylla is not intuitive, the doc explains the entire
process of adding new Rust source files to Scylla. What happens
during compilation is also explained.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-05-11 16:49:31 +02:00
Wojciech Mitros
4ad012cb6a tests: add rust example
The patch includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-05-11 16:49:31 +02:00
Benny Halevy
4a5842787e memtable_list: clear_and_add: let caller clear the old memtables
As a follow up on b8263e550a,
make clear_and_add synchronous yet again, and just return
the swapped list of memtables so that the caller (table::clear)
can clear them gently.

Refs https://github.com/scylladb/scylla/pull/10424#discussion_r867455056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10540
2022-05-11 14:46:30 +02:00
Botond Dénes
eeca9f24e8 Merge 'Docs: improve debugging.md' from Benny Halevy
This series update debugging.md with:
- add an example .gdbinit file
- update recommendation for finding the relocatable packages using a build-id on http://backtrace.scylladb.com/

Closes #10492

* github.com:scylladb/scylla:
  docs: debugging.md: update instructions regarding backtrace.scylladb.com
  docs: debugging.md: add a sample gdbinit file
2022-05-11 14:46:30 +02:00
Nadav Har'El
b2450886d7 Merge 'Debugging.md relocatable package updates' from Botond Dénes
Drop the section about non-relocatable packages. They are not a thing anymore.
Also tweaked the instructions for launching the toolchain container.

Closes #10539

* github.com:scylladb/scylla:
  docs/debugging.md: adjust instructions for using the toolchain
  docs/debugging.md: drop section about handling binaries from non-relocatable packages
2022-05-11 14:46:30 +02:00
Takuya ASADA
a9dfe5a8f4 scylla_sysconfig_setup: handle >=32CPUs correctly
Seems like 59adf05 has a bug, the regex pattern only handles first
32CPUs cpuset pattern, and ignores rest.
We should extend regex pattern to handle all CPUs.

Fixes #10523

Closes #10524
2022-05-11 14:46:30 +02:00
Nadav Har'El
043b1c7f89 Update seastar submodule. Unfortunately, also requires two changes
to Scylla itself to make it still compile - see below

* seastar 5e863627...96bb3a1b (18):
  > install-dependencies: add rocky as a supported distro
  > circleci: relax docker limits to allow running with new toolchain
  > core: memory: Add memory::free_memory() also in Debug mode
  > build: bump up zlib to 1.2.12
  > cmake: add FindValgrind.cmake
  > Merge 'seastar-addr2line: support sct syslogs' from Benny Halevy
  > rpc: lower log level for 'failed to connect' errors
  > scripts: Build validation
  > perftune.py: remove rx_queue_count from mode condition.
  > memory: add attributes to memalign for compatibility with glibc 2.35
  > condition-variable: Fix timeout "when" potentially not killing timer
  > Merge "tests: perf: measure coroutines performance" from Benny
  > Merge: Refine COUNTER metrics
  > Revert "Merge: Refine COUNTER metrics"
  > reactor: document intentional bitwise-on-bool op in smp_pollfn::poll()
  > Merge: Refine COUNTER metrics
  > SLES: additionally check irqbalance.service under /usr/lib
  > rpc_tester: job_cpu: mark virtual methods override

Changes to Scylla also included in this merge:

1. api: Don't export DERIVEs (Pavel Emelyanov)

Newer seastar doesn't have DERIVE metrics, but does have REAL_COUNTER
one. Teach the collectd getter the change.

(for the record: I don't understand how this endpoing works at all,
there's a HISTOGRAM metrics out there that would be attempted to get
exposed with the v.ui() call which's totally wrong)

2. test: use linux_perf_events.{cc,hh} from Seastar

Seastar now has linux_perf_events.{cc,hh}. Remove Scylla's version
of the same files and use Seastar's. Without this change, Scylla
fails to compile when some source files end up including both
versions and seeing double definitions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-11 14:46:30 +02:00
Piotr Sarna
209c2f5d99 sstables: define generation_type for sstables
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.

Closes #10533
2022-05-11 14:46:30 +02:00
Beni Peled
3abe4a2696 Adjust scripts/pull_github_pr.sh to check tests status
Closes #10263

Closes #10264
2022-05-11 14:46:30 +02:00
Botond Dénes
7501a075bd sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view)
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This path does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.

Closes #10531
2022-05-11 14:46:30 +02:00
Asias He
77b1db475c locator: Do not enforce public ip address for broadcast_rpc_address
Reported by Felipe Cardeneti:

- Create a 2-node Scylla cluster w/ Ec2MultiRegionSnitch
- Check system.peers table

Scylla (uses public address)
```
cqlsh> select peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------+---------------+--------------------------------------
18.216.98.219 |   us-east-2 | d9443741-a12e-4bbb-91ce-9931cece589c | 172.31.43.122 |   2c | 18.216.98.219 | 95c3fca5-c463-3aba-98c6-1c0b3fac5b58

(1 rows)
```

Cassandra (uses local address):
```
cqlsh> SELECT peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack       | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------------+---------------+--------------------------------------
52.15.104.255 |   us-east-2 | 42c0b717-775f-4998-a420-0388fe8b4e70 | 172.31.42.126 | us-east-2c | 172.31.42.126 | 2207c2a9-f598-3971-986b-2926e09e239d

(1 rows)
```

Config diff:
```
cassandra.yaml:rpc_address: 0.0.0.0
cassandra.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:rpc_address: 0.0.0.0
```

After this patch, if broadcast_rpc_address is unset, Ec2MultiRegionSnitch
will use the public ip address to set broadcast_rpc_address. If
broadcast_rpc_address is set, Ec2MultiRegionSnitch will not modify it.

Fixes #10236

Closes #10519
2022-05-11 14:46:30 +02:00
Tomasz Grabiec
f703e8ded5 Merge 'New failure detector for Raft' from Kamil Braun
We introduce a new service that performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live. rpcs are using the existing simulated
network and clock using the existing logical timers.

We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.

We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.

On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

---

v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c

v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`

v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
  which was timing out listeners (after the sleep). Instead:
    - rely on the loop before the sleep for timing out listeners
    - before calling ping(), check if there is a timed out listener,
      if so abandon the ping, immediately proceed to the timing-out-listeners
      loop, and then immediately proceed to the next iteration of the big `while`
      (without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
  exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
  signalling the condition variable, just do `_alive_start = 0` and signal
  the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
  `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
  Indeed, we want to wait for the stopping code (`destroy_worker()`)
  to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
  finishes so `notify_fiber()` can send the final `mark_dead`
  notifications for this endpoint. There was a race before where
  `notify_fiber()` could finish before it sent those notifications
  (because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
  of tasks by the Scylla reactor, the test could sometimes hang in debug
  mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
  exceptional discarded future in some cases

v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)

v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
    - _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
    - _alive_start is no longer an iterator to _listeners, but an index (size_t)
    - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
    - ping_fiber() signals _alive_start_changed when it changes _alive_start
    - notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
    - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
    - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
    - when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase

Closes #10437

* github.com:scylladb/scylla:
  service: raft: remove `raft_gossip_failure_detector`
  service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
  service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
  main: start direct failure detector service
  messaging_service: abortable version of `send_gossip_echo`
  message: abortable version of `send_message`
  test: raft: randomized_nemesis_test: remove old failure_detector
  test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
  test: raft: randomized_nemesis_test: ping all shards on each tick
  test: unit test for new failure detector service
  direct_failure_detector: introduce new failure detector service
2022-05-11 14:46:27 +02:00
Benny Halevy
4a40e34577 docs: debugging.md: update instructions regarding backtrace.scylladb.com
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-11 10:23:09 +03:00
Benny Halevy
97b002e13e docs: debugging.md: add a sample gdbinit file
This gdbinit contains recommended settings
commonly useful for debugging scylla core dumps.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-11 10:23:08 +03:00