Compare commits

..

3105 Commits

Author SHA1 Message Date
Yaniv Kaul
953fee7fb0 Update cql-extensions.md 2023-10-25 15:18:55 +03:00
Botond Dénes
6c90d166cc Merge 'build: cmake: avoid using a large amount of stack when compiling parser ' from Kefu Chai
this mirrors what we have in `configure.py`: build the CqlParser with `-O1`
and disable `-fsanitize-address-use-after-scope` when compiling CqlParser.cc,
in order to prevent the compiler from emitting code that uses a large amount
of stack space at runtime.

Closes scylladb/scylladb#15819

* github.com:scylladb/scylladb:
  build: cmake: avoid using a large amount of stack when compiling parser
  build: cmake: s/COMPILE_FLAGS/COMPILE_OPTIONS/
2023-10-24 16:19:51 +03:00
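A minimal sketch of how per-source-file options like the ones described above can be set in CMake (the file name and option list here are illustrative, not the exact ones in the ScyllaDB build):

```cmake
# Build the generated parser with -O1 and without ASan's use-after-scope
# check, which together keep the emitted code's stack usage low.
set_source_files_properties(CqlParser.cc
  PROPERTIES
    COMPILE_OPTIONS "-O1;-fno-sanitize-address-use-after-scope")
```

The `COMPILE_OPTIONS` source-file property takes a semicolon-separated list, which is why the options are quoted as one string.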
Nadav Har'El
4b80130b0b Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak
There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in:
- `system_distributed_keyspace::start`,
- `redis:create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (for the `system_traces` keyspace).

All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this:
- calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum,
- we must implement a retrying mechanism for every automatic `announce` if `group0_concurrent_modification` occurs to enable support for concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder.

This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME.

Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring.

Fixes #15437

Closes scylladb/scylladb#15594

* github.com:scylladb/scylladb:
  table_helper: announce twice in setup_keyspace
  table_helper: refactor setup_table
  redis: create_keyspace_if_not_exists_impl: fix indentation
  redis: announce once in create_keyspace_if_not_exists_impl
  db: system_distributed_keyspace: fix indentation
  db: system_distributed_keyspace: announce once in start
  tablet_allocator: update on_before_create_column_family
  migration_listener: add parameter to on_before_create_column_family
  alternator: executor: use new prepare_new_column_family_announcement
  alternator: executor: introduce create_keyspace_metadata
  migration_manager: add new prepare_new_column_family_announcement
2023-10-24 15:42:48 +03:00
David Garcia
a5519c7c1f docs: update cofig params design
Closes scylladb/scylladb#15827
2023-10-24 15:41:56 +03:00
Kefu Chai
f8104b92f8 build: cmake: detect rapidxml
we use rapidxml for parsing XML, so let's detect it before using it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15813
2023-10-24 15:12:04 +03:00
Kamil Braun
2a21029ff5 Merge 'make topology_coordinator::run noexcept' from Gleb
The topology coordinator should handle failures internally for as long as it
remains the coordinator. The raft state monitor is in no better position
to handle any errors thrown by it; all it can do is restart the coordinator.
This series makes topology_coordinator::run handle all the errors internally
and marks the function as noexcept, so that error-handling complexity does
not leak into the raft state monitor.

* 'gleb/15728-fix' of github.com:scylladb/scylla-dev:
  storage_service: raft topology: mark topology_coordinator::run function as noexcept
  storage_service: raft topology: do not throw error from fence_previous_coordinator()
2023-10-24 12:16:36 +02:00
Kefu Chai
4abcec9296 test: add __repr__ for MinIoServer and S3_Server
it is printed as part of the logging message when pytest passes it down
as a fixture. it would help with debugging an object_store test.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15817
2023-10-24 12:35:49 +03:00
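The pattern the commit above describes can be sketched as follows (the class body and attribute names are hypothetical stand-ins, not the actual MinIoServer fields):

```python
class MinIoServer:
    """Hypothetical stand-in for the pytest fixture wrapper."""

    def __init__(self, address: str, port: int) -> None:
        self.address = address
        self.port = port

    def __repr__(self) -> str:
        # Without __repr__, pytest's fixture logging would print an opaque
        # "<MinIoServer object at 0x...>", which is useless for debugging.
        return f"MinIoServer(address={self.address!r}, port={self.port})"

print(MinIoServer("127.0.0.1", 9000))
```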
Gleb Natapov
dcaaa74cd4 storage_service: raft topology: mark topology_coordinator::run function as noexcept
The function handled all exceptions internally. By making it noexcept we
make sure that the caller (raft_state_monitor_fiber) does not need to
handle any exceptions from the topology coordinator fiber.
2023-10-24 10:58:45 +03:00
Gleb Natapov
65bf5877e7 storage_service: raft topology: do not throw error from fence_previous_coordinator()
Throwing an error kills the topology coordinator monitor fiber. Instead we
retry the operation until it succeeds or the node loses its leadership.
This is fine because a quorum is needed for the operation to succeed, and if
the quorum is not available the node should relinquish its leadership.

Fixes #15728
2023-10-24 10:57:48 +03:00
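The retry-until-success-or-demotion pattern described above can be sketched like this (in Python for brevity; `fence` and `is_leader` are hypothetical stand-ins for the real C++ machinery):

```python
import time

def fence_previous_coordinator(fence, is_leader, delay: float = 0.0) -> bool:
    """Retry fence() until it succeeds or this node loses leadership.

    Returns True if fencing succeeded, False if leadership was lost first.
    Never propagates an exception to the caller, mirroring the noexcept
    contract described in the commit.
    """
    while is_leader():
        try:
            fence()
            return True
        except Exception:
            # A quorum may be temporarily unavailable; back off and retry.
            time.sleep(delay)
    return False

# Example: fence() fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky_fence():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("no quorum")

assert fence_previous_coordinator(flaky_fence, lambda: True) is True
assert attempts["n"] == 3
```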
Botond Dénes
0cba973972 Update tools/java submodule
* tools/java 3c09ab97...86a200e3 (1):
  > cassandra-stress: add storage options
2023-10-24 09:41:36 +03:00
Kefu Chai
9347b61d3b build: cmake: avoid using a large amount of stack when compiling parser
this mirrors what we have in `configure.py`: build the CqlParser with -O1
and disable sanitize-address-use-after-scope when compiling CqlParser.cc,
in order to prevent the compiler from emitting code that uses a large amount
of stack at runtime.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-24 12:40:20 +08:00
Kefu Chai
3da02e1bf4 build: cmake: s/COMPILE_FLAGS/COMPILE_OPTIONS/
according to
https://cmake.org/cmake/help/latest/prop_sf/COMPILE_FLAGS.html,
COMPILE_FLAGS has been superseded by COMPILE_OPTIONS. so let's
replace the former with the latter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-24 12:40:20 +08:00
Pavel Emelyanov
7c580b4bd4 Merge 'sstable: switch to uuid identifier for naming S3 sstable objects' from Kefu Chai
before this change, we created a new UUID for each new sstable managed by the s3_storage, and used the string representation of UUID defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for naming the objects stored on s3_storage. but this representation is not what we use for storing sstables on the local filesystem when the "uuid_sstable_identifiers_enabled" option is enabled; instead, we use a base36-based representation, which is shorter.

to be consistent with the naming of the sstables created for local filesystem, and more importantly, to simplify the interaction between the local copy of sstables and those stored on object storage, we should use the same string representation of the sstable identifier.

so, in this change:

1. instead of creating a new UUID, just reuse the generation of the sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the base36-based one (implemented by fmt::formatter<sstables::generation_type>)

Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#14406

* github.com:scylladb/scylladb:
  sstable: remove _remote_prefix from s3_storage
  sstable: switch to uuid identifier for naming S3 sstable objects
2023-10-23 21:05:13 +03:00
Pavel Emelyanov
d7031de538 Merge 'test/pylib: extract the env variable related functions out' from Kefu Chai
this series extracts the env-variable-related functions and removes unused `import`s for better readability.

Closes scylladb/scylladb#15796

* github.com:scylladb/scylladb:
  test/pylib: remove duplicated imports
  test/pylib: extract the env variable printing into MinIoServer
  test/pylib: extract _set_environ() out
2023-10-23 21:03:03 +03:00
Aleksandra Martyniuk
0c6a3f568a compaction: delete default_compaction_progress_monitor
default_compaction_progress_monitor returns a reference to a static
object. So, it should be read-only, but its users need to modify it.

Delete default_compaction_progress_monitor and use a dedicated
compaction_progress_monitor instance where one is needed.

Closes scylladb/scylladb#15800
2023-10-23 16:03:34 +03:00
Anna Stuchlik
55ee999f89 doc: enable publishing docs for branch-5.4
This commit enables publishing documentation
from branch-5.4. The docs will be published
as UNSTABLE (the warning about version 5.4
being unstable will be displayed).

Closes scylladb/scylladb#15762
2023-10-23 15:47:01 +03:00
Avi Kivity
ee9cc450d4 logalloc: report increases of reserves
The log-structured allocator maintains memory reserves so that
operations using log-structured allocator memory can have some
working memory and can allocate. The reserves start small and are
increased if allocation failures are encountered. Before starting
an operation, the allocator first frees memory to satisfy the reserves.

One problem is that if the reserves are set to a high value and
we encounter a stall, then, first, we have no idea what value
the reserves are set to, and second, we have no idea what operation
caused the reserves to be increased.

We fix this problem by promoting the log reports of reserve increases
from DEBUG level to INFO level and by attaching a stack trace to
those reports. This isn't optimal since the messages are used
for debugging, not for informing the user about anything important
for the operation of the node, but I see no other way to obtain
the information.

Ref #13930.

Closes scylladb/scylladb#15153
2023-10-23 13:37:50 +02:00
Tomasz Grabiec
4af585ec0e Merge 'row_cache: make_reader_opt(): make make_context() reentrant ' from Botond Dénes
Said method is called in an allocating section, which will re-try the enclosed lambda on allocation failure. `read_context()`, however, moves the permit parameter, so on the second and later calls the permit will be in a moved-from state, triggering a `nullptr` dereference and therefore a segfault.

We already have a unit test (`test_exception_safety_of_reads` in `row_cache_test.cc`) which was supposed to cover this, but:
* It only tests range scans, not single partition reads, which is a separate path.
* Turns out allocation failure tests are again silently broken (no error is injected at all). This is because `test/lib/memtable_snapshot_source.hh` creates a critical alloc section which accidentally covers the entire duration of tests using it.

Fixes: #15578

Closes scylladb/scylladb#15614

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: test_exception_safety_of_reads: also cover single-partition reads
  test/lib/memtable_snapshot_source: disable critical alloc section while waiting
  row_cache: make_reader_opt(): make make_context() reentrant
2023-10-23 11:22:13 +02:00
Raphael S. Carvalho
ea6c281b9f replica: Fix major compaction semantics by performing off-strategy first
The semantics of major compaction are that all data of a table will be compacted
together, so the user can expect, e.g., a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy
compaction, so the user might be left wondering why major compaction is not
having the expected effect.
To fix this, let's perform off-strategy compaction first, so data in the
maintenance set will be made available to major compaction. A similar
approach is taken for data in memtables: a flush is performed before major
compaction starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy compaction on behalf of the
table.

Fixes #11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15792
2023-10-23 11:32:03 +03:00
Nadav Har'El
e7dd0ec033 test/cql-pytest: reproduce incompatibility with same-name bind marks
This patch adds a reproducer for a minor incompatibility between Scylla's
and Cassandra's handling of a prepared statement when a bind marker with
the same name is used more than once, e.g.,
```
SELECT * FROM tbl WHERE p=:x AND c=:x
```
It turns out that Scylla tells the driver that there is only one bind
marker, :x, whereas Cassandra tells the driver that there are two bind
markers, both named :x. This makes no difference if the user passes
a map `{'x': 3}`, but if the user passes a tuple, Scylla accepts only
`(3,)` (assigning both bind markers the same value) and Cassandra
accepts only `(3,3)`.

The test added in this patch demonstrates this incompatibility.
It fails on Scylla, passes on Cassandra, and is marked "xfail".

Refs #15559

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#15564
2023-10-23 11:19:15 +03:00
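One way to see the incompatibility described above: count the named markers in the statement. Scylla reports one distinct marker, while Cassandra reports one marker per occurrence. A rough illustration (the regex is a simplification; real CQL parsing is more involved):

```python
import re

query = "SELECT * FROM tbl WHERE p=:x AND c=:x"
occurrences = re.findall(r":(\w+)", query)  # one entry per occurrence
distinct = set(occurrences)                 # one entry per marker name

# Cassandra-style: two markers, so a positional bind needs (3, 3).
assert len(occurrences) == 2
# Scylla-style: one marker, so a positional bind needs (3,).
assert len(distinct) == 1
```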
Aleksandra Martyniuk
a1271d2d5c repair: throw more detailed exception
The exception thrown from row_level_repair::run does not show the root
cause of a failure, making it harder to debug.

Add the internal exception's contents to the runtime_error message.

After the change the log will mention the real cause (last line), e.g.:

repair - repair[92db0739-584b-4097-b6e2-e71a66e40325]: 33 out of 132 ranges failed,
keyspace=system_distributed, tables={cdc_streams_descriptions_v2, cdc_generation_timestamps,
view_build_status, service_levels}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false,
failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_distributed,
cf=cdc_streams_descriptions_v2, range=(8720988750842579417,+inf))
(while cleaning up after seastar::abort_requested_exception (abort requested))

Closes scylladb/scylladb#15770
2023-10-23 11:15:25 +03:00
Botond Dénes
950a1ff22c Merge 'doc: improve the docs for handling failures' from Anna Stuchlik
This PR improves how handling failures is documented and made accessible to the user.
- The Handling Failures section is moved from Raft to Troubleshooting.
- Two new topics about failure are added to Troubleshooting with a link to the Handling Failures page (Failure to Add, Remove, or Replace a Node, Failure to Update the Schema).
- A note is added to the add/remove/replace node procedures to indicate that a quorum is required.

See individual commits for more details.

Fixes https://github.com/scylladb/scylladb/issues/13149

Closes scylladb/scylladb#15628

* github.com:scylladb/scylladb:
  doc: add a note about Raft
  doc: add the quorum requirement to procedures
  doc: add more failure info to Troubleshooting
  doc: move Handling Failures to Troubleshooting
2023-10-23 11:09:28 +03:00
Kefu Chai
5a17a02abb build: cmake: add -ffile-prefix-map option
this mirrors what we already have in configure.py.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15798
2023-10-23 10:26:21 +03:00
Botond Dénes
940c2d1138 Merge 'build: cmake: use add_compile_options() and add_link_options() when appropriate ' from Kefu Chai
instead of appending the options to the CMake variables, use the dedicated commands. simpler this way, and as a bonus the options are de-duplicated.

Closes scylladb/scylladb#15797

* github.com:scylladb/scylladb:
  build: cmake: use add_link_options() when appropriate
  build: cmake: use add_compile_options() when appropriate
2023-10-23 09:58:10 +03:00
Botond Dénes
c960c2cdbf Merge 'build: extract code fragments into functions' from Kefu Chai
this series is one of the steps to remove global statements in `configure.py`.

not only the script is more structured this way, this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system.

Refs #15379

Closes scylladb/scylladb#15780

* github.com:scylladb/scylladb:
  build: update modeval using a dict
  build: pass args.test_repeat and args.test_timeout explicitly
  build: pull in jsoncpp using "pkgs"
  build: build: extract code fragments into functions
2023-10-23 09:42:37 +03:00
Kefu Chai
0080b15939 build: cmake: use add_link_options() when appropriate
instead of appending to CMAKE_EXE_LINKER_FLAGS*, use
add_link_options() to add more options. CMAKE_EXE_LINKER_FLAGS*
is a string, and is typically set by the user, so it is better
left alone.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 12:06:42 +08:00
Kefu Chai
686adec52e build: cmake: use add_compile_options() when appropriate
instead of appending to CMAKE_CXX_FLAGS, use add_compile_options()
to add more options. CMAKE_CXX_FLAGS is a string, and is typically
set by the user. the options added by add_compile_options() are
placed before CMAKE_CXX_FLAGS, and so have lower priority.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 12:06:42 +08:00
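The replacement described in these two commits can be sketched as follows (the specific flags are illustrative, not the exact ScyllaDB ones):

```cmake
# Before: string concatenation into cache variables the user may also set.
#   set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Werror")
#   set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fuse-ld=lld")

# After: dedicated commands. Options are de-duplicated, and compile
# options land before CMAKE_CXX_FLAGS, so user-provided flags keep
# the last word on the command line.
add_compile_options(-Wall -Werror)
add_link_options(-fuse-ld=lld)
```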
Kefu Chai
8756838b16 test/pylib: remove duplicated imports
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:36:05 +08:00
Kefu Chai
6b84bc50c3 test/pylib: extract the env variable printing into MinIoServer
less repetition this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:36:05 +08:00
Kefu Chai
02cad8f85b test/pylib: extract _set_environ() out
will add _unset_environ() later. extracting this helper out helps
with readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:36:05 +08:00
Kefu Chai
b36cef6f1a sstable: remove _remote_prefix from s3_storage
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.

since `s3_storage::ensure_remote_prefix()` and
`system_keyspace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:08:22 +08:00
Kefu Chai
af8bc8ba63 sstable: switch to uuid identifier for naming S3 sstable objects
before this change, we created a new UUID for each new sstable managed
by the s3_storage, and used the string representation of UUID
defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for
naming the objects stored on s3_storage. but this representation is
not what we use for storing sstables on the local filesystem when
the "uuid_sstable_identifiers_enabled" option is enabled; instead,
we use a base36-based representation, which is shorter.

to be consistent with the naming of the sstables created for local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.

so, in this change:

1. instead of creating a new UUID, just reuse the generation of the
   sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
   we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
   by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the
   base36-based one (implemented by
   fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifiers` cluster feature if it is
   enabled in `test_env_config`, so that the sstable manager can
   create UUID-based generations for new sstables.
5. throw if the generation of an sstable is not UUID-based when
   accessing / manipulating an sstable with the S3 storage backend,
   as the backend now relies on this option. otherwise we'd have
   sstables with keys like s3://bucket/number/basename, which cannot
   serve as a unique id for an sstable if the bucket is shared across
   multiple tables.

Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-23 10:08:22 +08:00
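For illustration, the length difference between the two representations can be shown with a small base36 encoder (a sketch only; the output of Scylla's `fmt::formatter<sstables::generation_type>` differs in detail from plain base36):

```python
import uuid

def base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (digits + lowercase letters)."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(digits[r])
    return "".join(reversed(out))

u = uuid.UUID("0aa490de-7a85-46e2-8f90-38b8f496d53b")
encoded = base36(u.int)

# The RFC 4122 form is 36 characters (32 hex digits + 4 dashes);
# a 128-bit value fits in at most 25 base36 digits.
assert len(str(u)) == 36
assert len(encoded) <= 25
```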
Avi Kivity
f181ac033a Merge 'tools/nodetool: implement additional commands, part 2/N' from Botond Dénes
The following new commands are implemented:
* stop
* compactionhistory

All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.

Refs: https://github.com/scylladb/scylladb/issues/15588

Closes scylladb/scylladb#15649

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement compactionhistory command
  tools/scylla-nodetool: implement stop command
  mutation/json: extract generic streaming writer into utils/rjson.hh
  test/nodetool: rest_api_mock.py: add support for error responses
2023-10-21 00:11:42 +03:00
Botond Dénes
19fc01be23 Merge 'Sanitize API -> task_manager dependency' from Pavel Emelyanov
This is the continuation of 8c03eeb85d

Registering API handlers for services needs to

* get the service to handle requests via an argument, not from the http context (the http context, in turn, will no longer depend on anything)
* unset the handlers on stop so that the service is not used after it's stopped (and before API server is stopped)

This makes task manager handlers work this way

Closes scylladb/scylladb#15764

* github.com:scylladb/scylladb:
  api: Unset task_manager test API handlers
  api: Unset task_manager API handlers
  api: Remove ctx->task_manager dependency
  api: Use task_manager& argument in test API handlers
  api: Push sharded<task_manager>& down the test API set calls
  api: Use task_manager& argument in API handlers
  api: Push sharded<task_manager>& down the API set calls
2023-10-20 18:07:20 +03:00
Botond Dénes
4b57c2bf18 tools/scylla-nodetool: implement compactionhistory command 2023-10-20 10:55:38 -04:00
Botond Dénes
a212ddc5b1 tools/scylla-nodetool: implement stop command 2023-10-20 10:04:56 -04:00
Botond Dénes
9231454acd mutation/json: extract generic streaming writer into utils/rjson.hh
This writer is generally useful, not just for writing mutations as json.
Make it generally available as well.
2023-10-20 10:04:56 -04:00
Botond Dénes
6db2698786 test/nodetool: rest_api_mock.py: add support for error responses 2023-10-20 10:04:56 -04:00
Kefu Chai
9f62bfa961 build: update modeval using a dict
instead of updating `modes` with global statements, update it in
a function, for better readability and to reduce the statements in
global scope.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-20 21:37:07 +08:00
Botond Dénes
ad90bb8d87 replica/database: remove "streaming" from dirty memory metric description
We haven't had streaming memtables for a while now.

Closes scylladb/scylladb#15638
2023-10-20 13:09:57 +03:00
Kefu Chai
c240c70278 build: pass args.test_repeat and args.test_timeout explicitly
for better readability.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-20 16:53:16 +08:00
Kefu Chai
c2cd11a8b3 build: pull in jsoncpp using "pkgs"
this change adds the "jsoncpp" dependency using "pkgs". simpler this
way; it also helps to remove more global statements.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-20 16:53:16 +08:00
Kefu Chai
890113a9cf build: build: extract code fragments into functions
this change extracts `get_warnings_options()` out. it helps to
remove more global statements.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-20 16:53:16 +08:00
Patryk Jędrzejczak
fbcd667030 replica: keyspace::create_replication_strategy: remove a redundant parameter
The options parameter is redundant. We always use
`_metadata->strategy_options()` and
`keyspace::create_replication_strategy` already assumes that
`_metadata` is set by using its other fields.

Closes scylladb/scylladb#15776
2023-10-20 10:20:49 +03:00
Botond Dénes
460bc7d8e1 test/boost/row_cache_test: test_exception_safety_of_reads: also cover single-partition reads
The test currently only covers scans. Single partition reads have a
different code-path, make sure it is also covered.
2023-10-20 03:16:57 -04:00
Botond Dénes
ffefa623f4 test/lib/memtable_snapshot_source: disable critical alloc section while waiting
memtable_snapshot_source starts a background fiber in its constructor,
which compacts LSA memory in a loop. The loop's inside is covered with a
critical alloc section. It also contains a wait on a condition variable
and in its present form the critical section also covers the wait,
effectively turning off allocation failure injection for any test using
the memtable_snapshot_source.
This patch disables the critical alloc section while the loop waits on
the condition variable.
2023-10-20 03:16:57 -04:00
Botond Dénes
92966d935a row_cache: make_reader_opt(): make make_context() reentrant
Said lambda currently moves the permit parameter, so on the second and
later calls it will possibly run into use-after-move. This can happen if
the allocating section below fails and is re-tried.
2023-10-20 03:16:57 -04:00
Kefu Chai
11d7cadf0d install-dependencies.sh: drop java deps
the java related build dependencies are installed by

* tools/java/install-dependencies.sh
* tools/jmx/install-dependencies.sh

respectively. and the parent `install-dependencies.sh` always
invokes these scripts, so there is no need to repeat them in the
parent `install-dependencies.sh` anymore.

in addition to deduplicating the build deps, this change also helps
to reduce the size of the build dependencies: by default, `dnf`
installs the weak deps unless `--setopt=install_weak_deps=False`
is passed to it, so this change also helps to reduce the traffic
and footprint of the packages installed for building scylla.

see also 9dddad27bf

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15473
2023-10-20 09:43:28 +03:00
Kamil Braun
059d647ee5 test/pylib: scylla_cluster: protect ScyllaCluster.stop with a lock
test.py calls `uninstall()` and `stop()` concurrently from exit
artifacts, and `uninstall()` internally calls `stop()`. This leads to
premature releasing of IP addresses from `uninstall()` (returning IPs to
the pool) while the servers using those IPs are still stopping. Then a
server might obtain that IP from the pool and fail to start due to
"Address already in use".

Put a lock around the body of `stop()` to prevent that.

Fixes: scylladb/scylladb#15755

Closes scylladb/scylladb#15763
2023-10-20 09:30:37 +03:00
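The fix pattern described above, sketched with asyncio (the method bodies are illustrative stand-ins for the real cluster teardown):

```python
import asyncio

class ScyllaCluster:
    def __init__(self) -> None:
        self._stop_lock = asyncio.Lock()
        self.stopped = 0

    async def stop(self) -> None:
        # Serialize concurrent stop() calls, so IPs are only released
        # after the servers using them have actually stopped.
        async with self._stop_lock:
            if self.stopped:
                return
            await asyncio.sleep(0)  # stand-in for stopping the servers
            self.stopped += 1

    async def uninstall(self) -> None:
        await self.stop()  # uninstall() internally calls stop()

async def main() -> int:
    cluster = ScyllaCluster()
    # The racing callers from the commit: exit artifacts invoke
    # uninstall() and stop() concurrently.
    await asyncio.gather(cluster.uninstall(), cluster.stop())
    return cluster.stopped

assert asyncio.run(main()) == 1  # stop body ran exactly once
```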
Kefu Chai
80c656a08b types: use more readable error message when serializing non-ASCII string
before this change, we print

marshaling error: Value not compatible with type org.apache.cassandra.db.marshal.AsciiType: '...'

but the wording is not quite user-friendly: it is a mapping of the
underlying implementation, and a user would have difficulty understanding
"marshaling" and/or "org.apache.cassandra.db.marshal.AsciiType"
when reading this error message.

so, in this change

1. change the error message to:
     Invalid ASCII character in string literal: '...'
   which should be more straightforward, and easier to digest.
2. update the test accordingly

please note, the quoted non-ASCII string is preserved instead of
being printed in hex, as otherwise the user would not be able to map
it to their input.

Refs #14320
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15678
2023-10-20 09:25:44 +03:00
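The validation itself is simple; a sketch of the friendlier error in Python terms (the error wording is taken from the commit, the function name is hypothetical):

```python
def serialize_ascii(value: str) -> bytes:
    """Serialize a string as the CQL ascii type, rejecting non-ASCII input."""
    if not value.isascii():
        # Keep the offending string verbatim so the user can map the
        # error back to their input, rather than printing it in hex.
        raise ValueError(f"Invalid ASCII character in string literal: '{value}'")
    return value.encode("ascii")

print(serialize_ascii("hello"))  # b'hello'
```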
Pavel Emelyanov
0c69a312db Update seastar submodule
* seastar bab1625c...17183ed4 (73):
  > thread_pool: Reference reactor, not point to
  > sstring: inherit publicly from string_view formatter
  > circleci: use conditional steps
  > weak_ptr: include used header
  > build: disable the -Wunused-* warnings for checkheaders
  > resource: move variable into smaller lexical scope
  > resource: use structured binding when appropriate
  > httpd: Added server and client addresses to request structure
  > io_queue: do not dereference moved-away shared pointer
  > treewide: explicitly define ctor and assignment operator
  > memory: use `err` for the error string
  > doc: Add document describing all the math behind IO scheduler
  > io_queue: Add flow-rate based self slowdown backlink
  > io_queue: Make main throttler uncapped
  > io_queue: Add queue-wide metrics
  > io_queue: Introduce "flow monitor"
  > io_queue: Count total number of dispatched and completed requests so far
  > io_queue: Introduce io_group::io_latency_goal()
  > tests: test the vector overload for when_all_succeed
  > core: add a vector overload to when_all_succeed
  > loop: Fix iterator_range_estimate_vector_capacity for random iters
  > loop: Add test for iterator_range_estimate_vector_capacity
  > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
  > memory: s/throw()/noexcept/
  > build: enable -Wdeprecated compiler option
  > reactor: mark kernel_completion's dtor protected
  > tests: always wait for promise
  > http, json, net: define-generated copy ctor for polymorphic types
  > treewide: do not define constexpr static out-of-line
  > reactor: do not define dtor of kernel_completion
  > http/exception: stop using dynamic exception specification
  > metrics: replace vector with deque
  > metrics: change metadata vector to deque
  > utils/backtrace.hh: make simple_backtrace formattable
  > reactor: Unfriend disk_config_params
  > reactor: Move add_to_flush_poller() to internal namespace
  > reactor: Unfriend a bunch of sched group template calls
  > rpc_test: Test rpc send glitches
  > net: Implement batch flush support for existing sockets
  > iostream: Configure batch flushes if sink can do it
  > net: Added remote address accessors
  > circleci: update the image to CircleCI "standard" image
  > build: do not add header check target if no headers to check
  > build: pass target name to seastar_check_self_contained
  > build: detect glibc features using CMake
  > build: extract bits checking libc into CheckLibc.cmake
  > http/exception: add formatter for httpd::base_exception
  > http/client: Mark write_body() const
  > http/client: Introduce request::_bytes_written
  > http/client: Mark maybe_wait_for_continue() const
  > http/client: Mark send_request_head() const
  > http/client: Detach setup_request()
  > http/api_docs: copy in api_docs's copy constructor
  > script: do not inherit from object
  > scripts: addr2line: change StdinBacktraceIterator to a function
  > scripts: addr2line: use yield instead defining a class
  > tests: skip tests that require backtrace if execinfo.h is not found
  > backtrace: check for existence of execinfo.h
  > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
  > core: add sleep_abortable instantiation for manual_clock
  > tls: Return EPIPE exception when writing to shutdown socket
  > http/client: Don't cache connection if server advertises it
  > http/client: Mark connection as "keep in cache"
  > core: fix strerror_r usage from glibc extension
  > reactor: access sigevent.sigev_notify_thread_id with a macro
  > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
  > reactor: replace __mode_t with mode_t
  > reactor: change sys/poll.h to posix poll.h
  > rpc: Add unit test for per-domain metrics
  > rpc: Report client connections metrics
  > rpc: Count dead client stats
  > rpc: Add seastar::rpc::metrics
  > rpc: Make public queues length getters

io-scheduler fixes
refs: #15312
refs: #11805

http client fixes
refs: #13736
refs: #15509

rpc fixes
refs: #15462

Closes scylladb/scylladb#15774
2023-10-19 20:52:37 +03:00
Tomasz Grabiec
899ecaffcd test: tablets: Enable verbose logging in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode
To help diagnose #14746 where we experience timeouts due to connection
dropping.

Closes scylladb/scylladb#15773
2023-10-19 16:58:53 +03:00
Raphael S. Carvalho
fded314e46 sstables: Fix update of tombstone GC settings to have immediate effect
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, even though the user asked for a new one.
Another example is when you play with the value of the grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using the old GC settings are recompacted
or the node is restarted.
To fix this, we have to feed the latest schema into the sstable procedures used
for expiration purposes.

Fixes #15643.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15746
2023-10-19 16:27:59 +03:00
Kefu Chai
a6e68d8309 build: cmake: move message/* into message/CMakeLists.txt
messaging_service.cc depends on idl, but many source files in
scylla-main do not depend on idl, so let's

* move "message/*" into its own directory and add an inter-library
  dependency between it and the "idl" library.
* rename the target of "message" under test/manual to "message_test"
  to avoid the name collision

this should address the compilation failure of
```
FAILED: CMakeFiles/scylla-main.dir/message/messaging_service.cc.o
/usr/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere  -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -MF CMakeFiles/scylla-main.dir/message/messaging_service.cc.o.d -o CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -c /home/kefu/dev/scylladb/message/messaging_service.cc
/home/kefu/dev/scylladb/message/messaging_service.cc:81:10: fatal error: 'idl/join_node.dist.hh' file not found
         ^~~~~~~~~~~~~~~~~~~~~~~
```
where the compiler failed to find the included `idl/join_node.dist.hh`,
which is exposed by the idl library as part of its public interface.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15657
2023-10-19 13:33:29 +03:00
Botond Dénes
60145d9526 Merge 'build: extract code fragments into functions' from Kefu Chai
this series is one of the steps to remove global statements in `configure.py`.

not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to a CMake-based build system.

Refs #15379

Closes scylladb/scylladb#15668

* github.com:scylladb/scylladb:
  build: move check for NIX_CC into dynamic_linker_option()
  build: extract dynamic_linker_option(): out
  build: move `headers` into write_build_file()
2023-10-19 13:31:33 +03:00
Avi Kivity
39966e0eb1 Merge 'build: cmake: pass -dynamic-linker to ld' from Kefu Chai
to match the behavior of `configure.py`.

Closes scylladb/scylladb#15667

* github.com:scylladb/scylladb:
  build: cmake: pass -dynamic-linker to ld
  build: cmake: set CMAKE_EXE_LINKER_FLAGS in mode.common.cmake
2023-10-19 13:15:47 +03:00
Jan Ciolek
c256cca6f1 cql3/expr: add more comments in expression.hh
`expression` is a std::variant with 16 different variants
that represent different types of AST nodes.

Let's add documentation that explains what each of these
16 types represents. For people who are not familiar with expression
code it might not be clear what each of them does, so let's add
clear descriptions for all of them.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes scylladb/scylladb#15767
2023-10-19 10:56:38 +03:00
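As a rough illustration of the pattern this commit documents (the node types below are hypothetical stand-ins, not Scylla's actual 16 alternatives), a std::variant-based AST with a std::visit dispatcher might look like:

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical, much-simplified AST: the real cql3::expr::expression
// is a std::variant over 16 node types, each with its own fields.
struct column_ref { std::string name; };
struct constant   { int value; };

using expression = std::variant<column_ref, constant>;

// std::visit dispatches on whichever alternative is currently active.
inline std::string describe(const expression& e) {
    return std::visit([](const auto& node) -> std::string {
        using T = std::decay_t<decltype(node)>;
        if constexpr (std::is_same_v<T, column_ref>) {
            return "column:" + node.name;
        } else {
            return "constant:" + std::to_string(node.value);
        }
    }, e);
}
```

Per-alternative documentation, as the commit adds, is what tells a reader which of these visit branches applies to which CQL construct.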
Kefu Chai
b105be220b build: cmake: add join_node.idl.hh to CMake
we add a new verb in 7cbe5e3af8, so
let's update the CMake-based building system accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15658
2023-10-19 10:19:16 +03:00
Nikita Kurashkin
2a7932efa1 alternator: fix DeleteTable return values to match DynamoDB's
It seems that Scylla returns more values from the DeleteTable operation than DynamoDB does.
In this patch I added a table status check when generating the output.
If the table is being deleted, the KeySchema, AttributeDefinitions and CreationDateTime values won't be returned.
The test has also been modified to check that these attributes are not returned.

Fixes scylladb#14132

Closes scylladb/scylladb#15707
2023-10-19 09:34:16 +03:00
Pavel Emelyanov
ec94cc9538 Merge 'test: set use_uuid to true by default in sstables::test_env ' from Kefu Chai
this series

1. lets sstable tests that use test_env default to uuid-based sstable identifiers
2. lets the tests that require integer-based identifiers keep using them

this should enable us to perform the s3-related tests after enforcing the uuid-based identifier for the s3 backend; otherwise the s3-related tests would fail, as they also utilize `test_env`.

Closes scylladb/scylladb#14553

* github.com:scylladb/scylladb:
  test: set use_uuid to true by default in sstables::test_env
  test: enable test to set uuid_sstable_identifiers
2023-10-19 09:09:38 +03:00
Pavel Emelyanov
0981661f8b api: Unset task_manager test API handlers
So that the task_manager reference is not used when it shouldn't on stop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:56:24 +03:00
Pavel Emelyanov
2d543af78e api: Unset task_manager API handlers
So that the task_manager reference is not used when it shouldn't on stop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:56:01 +03:00
Pavel Emelyanov
0632ad50f3 api: Remove ctx->task_manager dependency
Now the task manager's API (and test API) use the argument and this
explicit dependency is no longer required

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:55:27 +03:00
Pavel Emelyanov
572c880d97 api: Use task_manager& argument in test API handlers
Now it's there and can be used. This will allow removing the
ctx->task_manager dependency soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:55:13 +03:00
Pavel Emelyanov
0396ce7977 api: Push sharded<task_manager>& down the test API set calls
This is to make it possible to use this reference instead of the ctx.tm
one by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:54:53 +03:00
Pavel Emelyanov
ef1d2b2c86 api: Use task_manager& argument in API handlers
Now it's there and can be used. This will allow removing the
ctx->task_manager dependency soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:54:24 +03:00
Pavel Emelyanov
14e10e7db4 api: Push sharded<task_manager>& down the API set calls
This is to make it possible to use this reference instead of the ctx.tm
one by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:52:46 +03:00
Avi Kivity
7d5e22b43b replica: memtable: don't forget memtable memory allocation statistics
A memtable object contains two logalloc::allocating_section members
that track memory allocation requirements during reads and writes.
Because these are local to the memtable, each time we seal a memtable
and create a new one, these statistics are forgotten. As a result
we may have to re-learn the typical size of reads and writes, incurring
a small performance penalty.

The solution is to move the allocating_section object to the memtable_list
container. The workload is the same across all memtables of the same
table, so we don't lose discrimination here.

The performance penalty may increase later if we log changes to
memory reserve thresholds (including a backtrace), so this change reduces the
odds of incurring such a penalty.

Closes scylladb/scylladb#15737
2023-10-18 17:43:33 +02:00
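The shape of this refactor can be sketched with hypothetical stand-in types (not the actual Seastar/Scylla classes): move the mutable tuning state from the per-instance object to the long-lived container, so newly created instances inherit what previous ones learned.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins: "allocating_section" keeps a learned reserve
// size; "memtable_list" owns it so sealing a memtable keeps the statistic.
struct allocating_section {
    std::size_t learned_reserve = 0; // grows as allocations learn the workload
};

struct memtable {
    allocating_section& section; // borrowed from the list, not owned
    explicit memtable(allocating_section& s) : section(s) {}
};

struct memtable_list {
    allocating_section read_section; // shared across memtable generations
    std::vector<std::unique_ptr<memtable>> active;

    memtable& add() {
        active.push_back(std::make_unique<memtable>(read_section));
        return *active.back();
    }
    void seal_active() { active.clear(); } // statistics survive sealing
};
```

Because all memtables of one table see the same workload, keeping one shared section per list loses no per-memtable discrimination, which is the trade-off the commit describes.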
Kefu Chai
c8cb70918b sstable: drop unused parse() overload for deletion_time
`deletion_time` is a part of the `partition_header`, which is in turn
a part of `partition`. and `data_file` is a sequence of `partition`.
`data_file` represents *-Data.db component of an SSTable.
see docs/architecture/sstable3/sstables-3-data-file-format.rst.
we always parse the data component via `flat_mutation_reader_v2`, which is in turn
implemented with mx/reader.cc or kl/reader.cc depending on
the version of SSTable to be read.

in other words, we decode `deletion_time` in mx/reader.cc or
kl/reader.cc, not in sstable.cc. so let's drop the overload
parse() for deletion_time. it's not necessary and more importantly,
confusing.

Refs #15116
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15756
2023-10-18 18:41:56 +03:00
Avi Kivity
f3dc01c85e Merge 'Enlight sstable_directory construction' from Pavel Emelyanov
Currently distributed_loader starts sharded<sstable_directory> with four sharded parameters. That's quite bulky and can be made much shorter.

Closes scylladb/scylladb#15653

* github.com:scylladb/scylladb:
  distributed_loader: Remove explicit sharded<erms>
  distributed_loader: Brush up start_subdir()
  sstable_directory: Add enlightened construction
  table: Add global_table_ptr::as_sharded_parameter()
2023-10-18 16:42:04 +03:00
Anna Stuchlik
274cf7a93a doc:remove upgrade guides for unsupported versions
This commit:
- Removes upgrade guides for versions older than 5.0.
  The oldest one is from version 4.6 to 5.0.
- Adds the redirections for the removed pages.

Closes scylladb/scylladb#15709
2023-10-18 15:12:26 +03:00
Kefu Chai
f69a44bb37 test/object_store: redirect to STDOUT and STDERR
pytest changes the test's sys.stdout and sys.stderr to the
captured fds when it captures the outputs of the test. so we
are not able to get the STDOUT_FILENO and STDERR_FILENO in C
by querying `sys.stdout.fileno()` and `sys.stderr.fileno()`.
their return values are not 1 and 2 anymore, unless pytest
is started with "-s".

so, to ensure that we always redirect the child process's
outputs to the log file, we need to use 1 and 2 for accessing
the well-known fds, which are the ones used by the child
process when it writes to stdout and stderr.

this change should address the problem that the log file is
always empty, unless "-s" is specified.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15560
2023-10-18 14:54:01 +03:00
Yaron Kaikov
b340bd6d9e release: prepare for 5.5.0-dev 2023-10-18 14:40:06 +03:00
Botond Dénes
f7e269ccb8 Merge 'Progress of compaction executors' from Aleksandra Martyniuk
compaction_read_monitor_generator is an existing mechanism
for monitoring progress of sstables reading during compaction.
In this change information gathered by compaction_read_monitor_generator
is utilized by task manager compaction tasks of the lowest level,
i.e. compaction executors, to calculate task progress.

compaction_read_monitor_generator has a flag, which decides whether
monitored changes will be registered by compaction_backlog_tracker.
This allows us to pass the generator to all compaction readers without
impacting the backlog.

Task executors have access to compaction_read_monitor_generator_wrapper,
which protects the internals of compaction_read_monitor_generator
and provides only the necessary functionality.

Closes scylladb/scylladb#14878

* github.com:scylladb/scylladb:
  compaction: add get_progress method to compaction_task_impl
  compaction: find total compaction size
  compaction: sstables: monitor validation scrub with compaction_read_generator
  compaction: keep compaction_progress_monitor in compaction_task_executor
  compaction: use read monitor generator for all compactions
  compaction: add compaction_progress_monitor
  compaction: add flag to compaction_read_monitor_generator
2023-10-18 12:19:51 +03:00
Kamil Braun
c1486fee40 Merge 'commitlog: drop truncation_records after replay' from Petr Gusev
This is a follow-up for #15279 and it fixes two problems.

First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed.

Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table.

It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP.

To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem.

Fixes #15354

Closes scylladb/scylladb#15532

* github.com:scylladb/scylladb:
  add test_commitlog
  system.truncated: Remove replay_position data from truncated on start
  main.cc: flush only local memtables when replaying schema commitlog
  main.cc: drop redundant supervisor::notify
  system_keyspace: flush if schema commitlog is not available
2023-10-18 11:14:31 +02:00
Gleb Natapov
f80fff3484 gossip: remove unused STATUS_LEAVING gossiper status
The status is no longer used. The function that referenced it was
removed by 5a96751534, and it had already been
unused for a while back then.

Message-Id: <ZS92mcGE9Ke5DfXB@scylladb.com>
2023-10-18 11:13:14 +02:00
Botond Dénes
7f81957437 Merge 'Initialize datadir for system and non-system keyspaces the same way' from Pavel Emelyanov
When populating the system keyspace, sstable_directory forgets to create the upload/ subdir in the tables' datadir because of the way it's invoked from the distributed loader. For non-system keyspaces, directories are created in table::init_storage(), which is self-contained and just creates the whole layout regardless of what already exists.

This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.

Test included.

fixes: #15708
closes: scylladb/scylla-manager#3603

Closes scylladb/scylladb#15723

* github.com:scylladb/scylladb:
  test: Add test for datadir/ layout
  sstable_directory: Indentation fix after previous patch
  db,sstables: Move storage init for system keyspace to table creation
2023-10-18 12:12:19 +03:00
David Garcia
51466dcb23 docs: add latest option to aws_images extension
rollback only latest

Closes scylladb/scylladb#15651
2023-10-18 11:43:21 +03:00
Kefu Chai
203f41dc99 sstable: improve descriptions of capped.*deletion_time
before this change, they read

> Was local deletion time capped at ...

and

> Was partition tombstone deletion time capped at ...

the "Was" part is confusing, and the first description is not
accurate enough, so let's improve them a little bit.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15108
2023-10-18 09:40:02 +03:00
Kefu Chai
9bc0a9f95e mutation: do not include unused header
the `utils::UUID` class is not used by the implementation of
`canonical_mutation`, so let's remove the include from this source file.

the `#include` was originally added in
5a353486c6, but that commit did not
add any code using UUID to this file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15731
2023-10-17 20:38:07 +03:00
Avi Kivity
dfffc022da Merge 'doc: doc: remove recommended image upgrade with OS from previous releases' from Anna Stuchlik
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).

The scope of this commit:

- Remove the information from the 5.0-to-5.1 upgrade guide and replace it with general info.
- Remove the information from the 4.6-to-5.0 upgrade guide and replace it with general info.
- Remove the information from the 5.x.y-to-5.x.z upgrade guide and replace it with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides):
     /upgrade/_common/upgrade-image-opensource.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst

This PR is a continuation of https://github.com/scylladb/scylladb/pull/15739.

**This PR must be backported to branch-5.2 and branch-5.1.**

Closes scylladb/scylladb#15740

* github.com:scylladb/scylladb:
  doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
  doc: remove wrong image upgrade info (4.6-to-5.0)
  doc: remove wrong image upgrade info (5.0-to-5.1)
2023-10-17 18:29:36 +03:00
Anna Stuchlik
9d9fe57efa doc: remove recommended image upgrade with OS
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).

The scope of this commit:
- Remove the information from the 5.1-to-5.2
  upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
  page.
- Remove outdated info (about previous releases)
  from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
  in the page tree.

Refs: https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#15739
2023-10-17 18:28:52 +03:00
Avi Kivity
f42eb4d1ce Merge 'Store and propagage GC timestamp markers from commitlog' from Calle Wilund
Fixes #14870

(Originally suggested by @avikivity). Use commit log stored GC clock min positions to narrow compaction GC bounds.
(Still requires augmented manual flush:es with extensive CL clearing to pass various dtest, but this does not affect "real" execution).

Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment the first time. Because GC clock is wall
clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment. This is then (in a rather ugly way) propagated to tombstone_gc_state to narrow the allowed GC bounds for
a CF, based on what is currently left in CL.

Note: this is a rather unoptimized version - no caching or anything. But even so, should not be excessively expensive,
esp. since various other code paths already cache the results.

Closes scylladb/scylladb#15060

* github.com:scylladb/scylladb:
  main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info
  tombstone_gc_state: Add optional callback to augment GC bounds
  commitlog: Add keeping track of approximate lowest GC clock for CF entries
  database: Force new commitlog segment on user initiated flush
  commitlog: Add helper to force new active segment
2023-10-17 18:27:43 +03:00
Anna Stuchlik
7718f76ecd doc: remove outdated info from Materialized Views
This commit removes outdated info from
the Materialized Views page:

- The reference to the outdated blog post.
- Irrelevant information about versions.

Fixes https://github.com/scylladb/scylladb/issues/15725

Closes scylladb/scylladb#15742
2023-10-17 18:26:54 +03:00
Anna Stuchlik
dd1207cabb doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides).

/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
2023-10-17 16:48:51 +02:00
Anna Stuchlik
526d543b95 doc: remove wrong image upgrade info (4.6-to-5.0)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733
2023-10-17 16:28:19 +02:00
Petr Gusev
a0aee54f2c add test_commitlog
Check that commitlog provides durability in case
of a node reboot:
* truncate table T, truncation_record RP=1000;
* clean shutdown node/reboot machine/restart node, now RP=~0
since segment ids count from boot time;
* write some data to T; crash/restart
* check data is retained
2023-10-17 18:16:50 +04:00
Calle Wilund
6fbd210679 system.truncated: Remove replay_position data from truncated on start
Once we've started clean, and all replaying is done, the replay
positions stored in the truncation records for the commit log are invalid.
We should exorcise them as soon as possible. Note that we cannot remove truncation data
completely though, since the time stamps stored are used by things like
batch log to determine if it should use or discard old batch data.
2023-10-17 18:16:48 +04:00
Petr Gusev
dde36b5d9d main.cc: flush only local memtables when replaying schema commitlog
Schema commitlog can be used only on shard 0, so it's redundant
to flush any other memtables.
2023-10-17 18:15:51 +04:00
Petr Gusev
54dd7cf1da main.cc: drop redundant supervisor::notify
Later in the code we have 'replaying schema commit log',
which duplicates this one. Also,
maybe_init_schema_commitlog may skip schema commitlog
initialization if the SCHEMA_COMMITLOG feature is
not yet supported by the cluster, so this notification
can be misleading.
2023-10-17 18:15:49 +04:00
Petr Gusev
c89ead55ff system_keyspace: flush if schema commitlog is not available
In PR #15279 we removed flushes when writing to a number
of tables from the system keyspace. This was made possible
by switching these tables to the schema commitlog.
Schema commitlog is enabled only when the SCHEMA_COMMITLOG
feature is supported by all nodes in the cluster. Before that
these tables will use the regular commitlog, which is not
durable because it uses db::commitlog::sync_mode::PERIODIC. This
means that we may lose data if a node crashes during upgrade
to the version with schema commitlog.

In this commit we fix this problem by restoring flushes
after writes to the tables if the schema commitlog
is not enabled yet.

The patch also contains a test that demonstrates the
problem. We need flush_schema_tables_after_modification
option since otherwise schema changes are not durable
and node fails after restart.
2023-10-17 18:14:27 +04:00
Anna Stuchlik
9852130c5b doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733
2023-10-17 16:04:16 +02:00
Kefu Chai
77b96f7748 main.cc: do not cast hours to milliseconds
there is no need to explicitly cast an instance of
std::chrono::hours to std::chrono::milliseconds to feed it to a
function which expects std::chrono::milliseconds. the constructor
of std::chrono::milliseconds is able to do this conversion and
create a new instance of std::chrono::milliseconds from another
std::chrono::duration<> instance.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15734
2023-10-17 17:02:45 +03:00
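The implicit conversion this commit relies on can be shown in a minimal sketch: std::chrono::hours converts to std::chrono::milliseconds losslessly, so a function taking milliseconds accepts hours with no explicit cast.

```cpp
#include <chrono>

// std::chrono::milliseconds is implicitly constructible from any
// duration whose conversion is lossless (hours -> ms is exact), so
// callers may pass std::chrono::hours directly.
inline std::chrono::milliseconds to_ms(std::chrono::milliseconds d) {
    return d;
}
```

Note the reverse direction (milliseconds to hours) would require an explicit duration_cast, since it can lose precision.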
Kamil Braun
7dcee7de02 test/pylib: implement expected_error for decommission and removenode
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.

Closes scylladb/scylladb#15650
2023-10-17 16:25:43 +03:00
Calle Wilund
3378c246f7 main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info
Fixes #14870 (yet another alternative solution)

(Originally suggested by @avikivity). Use stored GC clock min positions from CL
to narrow compaction GC bounds.

Note: not optimized with caches or anything at this point. They can easily be added,
though of course that is always somewhat risky.
2023-10-17 10:30:40 +00:00
Kefu Chai
031ff755ce test/sstable: verify sstables::parse_path()
check the behavior of sstables::parse_path().
for better test coverage of this function.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15659
2023-10-17 13:28:58 +03:00
Calle Wilund
43a7d83fd0 tombstone_gc_state: Add optional callback to augment GC bounds
Allows potentially narrowing of GC time bounds.
2023-10-17 10:26:41 +00:00
Calle Wilund
560d3c17f0 commitlog: Add keeping track of approximate lowest GC clock for CF entries
Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment
the first time. Because GC clock is wall clock time and only connected to TTL (not
cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment.

Includes of course a function to get the all-segment lowest per CF.
2023-10-17 10:26:41 +00:00
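A minimal sketch of this bookkeeping, with hypothetical names and types (the real code uses gc_clock::time_point and Scylla's table identifiers, not these stand-ins): track the lowest GC-clock value per CF in each segment, then reduce across live segments to get the all-segment lowest per CF.

```cpp
#include <algorithm>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the types described in the commit.
using table_id = int;
using gc_time = long; // stand-in for gc_clock::time_point

struct segment_gc_info {
    std::unordered_map<table_id, gc_time> lowest;
    // Called when a write for this CF is appended to the segment.
    void note_write(table_id cf, gc_time now) {
        auto [it, inserted] = lowest.try_emplace(cf, now);
        if (!inserted) {
            it->second = std::min(it->second, now);
        }
    }
};

// All-segment lowest GC time for one CF; empty means the CF has no
// data left in the commitlog, so GC is unconstrained by it.
inline std::optional<gc_time> lowest_across(
        const std::vector<segment_gc_info>& segments, table_id cf) {
    std::optional<gc_time> res;
    for (const auto& s : segments) {
        if (auto it = s.lowest.find(cf); it != s.lowest.end()) {
            res = res ? std::min(*res, it->second) : it->second;
        }
    }
    return res;
}
```

This per-CF lower bound is what tombstone_gc_state's callback (from the companion commit) could consult to narrow the allowed GC bounds.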
Calle Wilund
2429cf656c database: Force new commitlog segment on user initiated flush
Helper for tools - ensures that tests which use nodetool flush
to force data to sstables also remove as much as possible from
commitlog.
2023-10-17 10:26:40 +00:00
Calle Wilund
810d06946f commitlog: Add helper to force new active segment
When called, if active segment holds data, close and replace with pristine one.
2023-10-17 10:26:40 +00:00
Petr Gusev
39789b6527 main.cc: ARM build fix
This is a follow-up for #15720.

Closes scylladb/scylladb#15730
2023-10-17 13:17:32 +03:00
Takuya ASADA
58d94a54a3 scylla_raid_setup: fall back to other paths when UUID not available
On some environment such as VMware instance, /dev/disk/by-uuid/<UUID> is
not available, scylla_raid_setup will fail while mounting volume.

To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fallback to other paths like by-partuuid,
by-id, by-path or just using real device path like /dev/md0.

To get device path, and also to dumping device status when UUID is not
available, this will introduce UdevInfo class which communicate udev
using pyudev.

Related #11359

Closes scylladb/scylladb#13803
2023-10-17 12:24:58 +03:00
Tomasz Grabiec
0aef0f900b Merge 'truncation records refactorings' from Petr Gusev
This PR contains several refactorings related to truncation records handling in `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`, it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`

Closes scylladb/scylladb#15583

* github.com:scylladb/scylladb:
  system_keyspace: drop truncation_record
  system_keyspace: remove get_truncated_at method
  table: get_truncation_time: check _truncated_at is initialized
  database: add_column_family: initialize truncation_time for new tables
  database: add_column_family: rename readonly parameter to is_new
  system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
  commitlog_replayer: refactor commitlog_replayer::impl::init
  system_keyspace: drop redundant typedef
  system_keyspace: drop redundant save_truncation_record overload
  table: rename cache_truncation_record -> set_truncation_time
  system_keyspace: get_truncated_position -> get_truncated_positions
2023-10-17 10:55:30 +02:00
Raphael S. Carvalho
da04fea71e compaction: Fix key estimation per sstable to produce efficient filters
The estimation assumes that the size of other components is irrelevant
when estimating the number of partitions for each output sstable.
The sstables are split according to the data file size, therefore the
size of other files is irrelevant for the estimation.

With certain data models, like single-row partitions containing small
values, the index can be even larger than the data.
For example, if the index is as large as the data, the estimation
would say that 2x more sstables will be generated, and as a result
each sstable is underestimated to have 2x fewer keys.

Fix it by only accounting size of data file.

Fixes #15726.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15727
2023-10-17 11:21:11 +03:00
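The fixed estimation reduces to simple arithmetic; the function below is an illustrative sketch under assumed names (the ceiling-division split and parameter names are not the actual Scylla code):

```cpp
#include <cstdint>

// Output sstables are split by data-file size alone, so the
// per-sstable key estimate must divide by the output count derived
// from the *-Data.db size only, not from the total of all components.
inline uint64_t partitions_per_output_sstable(
        uint64_t total_partitions,
        uint64_t data_file_size,    // size of the *-Data.db input
        uint64_t max_output_size) { // target data size per output sstable
    uint64_t outputs = (data_file_size + max_output_size - 1) / max_output_size;
    if (outputs == 0) {
        outputs = 1;
    }
    return total_partitions / outputs;
}
```

Counting the index in the numerator (say, an index as large as the data) would double the output count and halve the per-sstable key estimate, producing the inefficient, undersized bloom filters the commit describes.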
Aleksandra Martyniuk
0ce9db2329 repair: throw abort_requested_exception when abort is requested
If abort is requested during bootstrap, the node should exit normally.
To achieve this, abort_requested_exception should be thrown, as main
handles it gracefully.

In data_sync_repair_task_impl::run, exceptions from all shards are
wrapped together into std::runtime_exception, so they aren't
handled as they are supposed to be.

Throw abort_requested_exception when shutdown was requested.
Throw abort_requested_exception also if repair::task_manager_module::is_aborted,
so that force_terminate_all_repair_sessions acts the same regardless
of the state of the repair.

To maintain consistency do the same for user_requested_repair_task_impl.

Fixes: #15710.

Closes scylladb/scylladb#15722
2023-10-17 10:08:06 +03:00
Kefu Chai
19e724822d test.py: pass self.suite.scylla_env to pytest process
before this change, test.py does not populate its suite's
`scylla_env` down to the forked pytest child process. this works
if the test does not care about the env variables in `scylla_env`,
but object_store is an exception, as it launches scylla instances
by itself. so, without the help of `scylla_env`, `run.find_scylla()`
always finds the newest file globbed by `build/*/scylla`. this is not
always what we expect. on the contrary, if we launch object_store's
pytest using `test.py`, there are good chances that object_store
ends up testing the wrong scylla executable if we have multiple
builds under `build/*/scylla`.

so, in this change, we populate `self.suite.scylla_env` down to
the child process created by `PythonTest`, so that all pytest
based tests can have access to their suite's env variables.
in addition to the 'SCYLLA' env variable, they also include
the env variables required by LLVM code coverage instrumentation.
this is also nice to have.

Fixes #15679
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15682
2023-10-17 09:27:12 +03:00
Petr Gusev
9b1dfad51c main.cc: disable stall detector for debug ARM builds
The stall detector uses glibc backtrace function to
collect backtraces, this causes ASAN failures on ARM.
For now we just disable the stall detector in this
configuration, the ticket about migrating
to libunwind: scylladb/seastar#1878

We increase the value of blocked_reactor_notify_ms to
make sure the stall detector never fires.

Fixes #15389
Fixes #15090

Closes scylladb/scylladb#15720
2023-10-16 21:57:35 +03:00
Pavel Emelyanov
d59cd662f8 test: Add test for datadir/ layout
The test checks that

- for non-system keyspace datadir and its staging/ and upload/ subdirs
  are created when the table is created _and_ that the directory is
  re-populated on boot in case it was explicitly removed

- for system non-virtual tables it checks that the same directory layout
  is created on boot

- for system virtual tables it checks that the directory layout doesn't
  exist

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-16 16:26:48 +03:00
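A rough Python sketch of the layout assertions described above — the helper and paths are made up for illustration, not the real test code:

```python
import pathlib
import tempfile

def table_layout_ok(datadir, keyspace, table):
    """True iff the table directory and its staging/ and upload/
    subdirectories all exist (hypothetical helper mirroring the
    assertions the test makes after boot)."""
    t = pathlib.Path(datadir) / keyspace / table
    return all(d.is_dir() for d in (t, t / "staging", t / "upload"))

# Mimic what storage init would create for a user table.
tmp = tempfile.mkdtemp()
tdir = pathlib.Path(tmp) / "ks" / "tbl"
(tdir / "staging").mkdir(parents=True)
(tdir / "upload").mkdir()

layout_present = table_layout_ok(tmp, "ks", "tbl")        # user table: layout exists
virtual_absent = not table_layout_ok(tmp, "system", "v")  # virtual table: no layout
```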
Pavel Emelyanov
c3b3e5b107 sstable_directory: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-16 16:26:37 +03:00
Pavel Emelyanov
059d7c795e db,sstables: Move storage init for system keyspace to table creation
User and system keyspaces are created and populated slightly
differently.

The system keyspace is created via system_keyspace::make(), which
eventually calls add_column_family(). Then it's populated via
init_system_keyspace(), which calls sstable_directory::prepare() which,
in turn, optionally creates directories in datadir/ or checks the
directory permissions if they already exist.

User keyspaces are created with the help of
add_column_family_and_make_directory() call which calls the
add_column_family() mentioned above _and_ calls table::init_storage() to
create directories. When it's populated with init_non_system_keyspaces()
it also calls sstable_directory::prepare() which notices that the
directory exists and then checks the permissions.

As a result, sstable_directory::prepare() initializes storage for system
keyspace only and there's a BUG (#15708) that the upload/ subdir is not
created.

This patch makes table::init_storage() create the directories for
_all_ keyspaces. The change only affects the system keyspace, by moving
the creation of directories from sstable_directory::prepare() into
system_keyspace::make().

Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-16 16:19:25 +03:00
Patryk Jędrzejczak
7810e8d860 table_helper: announce twice in setup_keyspace
We refactor table_helper::setup_keyspace so that it calls
migration_manager::announce at most twice. We achieve it by
announcing all tables at once.

The number of announcements should further be reduced to one, but
it requires a big refactor. The CQL code used in
parse_new_cf_statement assumes the keyspace has already been
created. We cannot have such an assumption if we want to announce
a keyspace and its tables together. However, we shouldn't touch
the CQL code as it would impact user requests, too.

One solution is using schema_builder instead of the CQL statements
to create tables in table_helper.

Another approach is removing table_helper completely. It is used
only for the system_traces keyspace, which Scylla creates
automatically. We could refactor the way Scylla handles this
keyspace and make table_helper unneeded.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
2b4e1e0f9c table_helper: refactor setup_table
In the following commit, we reduce migration_manager::announce
calls in table_helper::setup_keyspace by announcing all tables
together. To do it, we cannot use table_helper::setup_table
anymore, which announces a single table itself. However, the new
code still has to translate CQL statements, so we extract it to the
new parse_new_cf_statement function to avoid duplication.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
fad71029f0 redis: create_keyspace_if_not_exists_impl: fix indentation
Broken in the previous commit.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
a3044d1f46 redis: announce once in create_keyspace_if_not_exists_impl
We refactor create_keyspace_if_not_exists_impl so that it takes at
most one group 0 guard and calls migration_manager::announce at
most once.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
98d067e77d db: system_distributed_keyspace: fix indentation
Broken in the previous commit.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
5ebc0e8617 db: system_distributed_keyspace: announce once in start
We refactor system_distributed_keyspace::start so that it takes at
most one group 0 guard and calls migration_manager::announce at
most once.

We remove a catch expression together with the FIXME from
get_updated_service_levels (add_new_columns_if_missing before the
patch) because we cannot treat the service_levels update
differently anymore.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
449b4c79c2 tablet_allocator: update on_before_create_column_family
After adding the keyspace_metadata parameter to
migration_listener::on_before_create_column_family,
tablet_allocator doesn't need to load it from the database.

This change is necessary before merging migration_manager::announce
calls in the following commit.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
7653059369 migration_listener: add parameter to on_before_create_column_family
After adding the new prepare_new_column_family_announcement that
doesn't assume the existence of a keyspace, we also need to get
rid of the same assumption in all on_before_create_column_family
calls. After all, they may be initiated before creating the
keyspace. However, some listeners require keyspace_metadata, so we
pass it as a new parameter.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
96d9e768c4 alternator: executor: use new prepare_new_column_family_announcement
We can use the new prepare_new_column_family_announcement function
that doesn't assume the existence of the keyspace instead of the
previous work-around.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
fcd092473c alternator: executor: introduce create_keyspace_metadata
We need to store a new keyspace's keyspace_metadata as a local
variable in create_table_on_shard0. In the following commit, we
use it to call the new prepare_new_column_family_announcement
function.
2023-10-16 14:59:53 +02:00
Patryk Jędrzejczak
7e6017d62d migration_manager: add new prepare_new_column_family_announcement
In the following commits, we reduce the number of the
migration_manager::announce calls by merging some of them in a way
that logically makes sense. Some of these merges are similar --
we announce a new keyspace and its tables together. However,
we cannot use the current prepare_new_column_family_announcement
there because it assumes that the keyspace has already been created
(when it loads the keyspace from the database). Luckily, this
assumption is not necessary as this function only needs
keyspace_metadata. Instead of loading it from the database, we can
pass it as a parameter.
2023-10-16 14:59:53 +02:00
Aleksandra Martyniuk
198119f737 compaction: add get_progress method to compaction_task_impl
compaction_task_impl::get_progress is used by the lowest-level
compaction tasks, whose progress can be taken from
compaction_progress_monitor.
2023-10-12 17:16:05 +02:00
Aleksandra Martyniuk
39e96c6521 compaction: find total compaction size 2023-10-12 17:03:46 +02:00
Aleksandra Martyniuk
7b3e0ab1f2 compaction: sstables: monitor validation scrub with compaction_read_generator
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
its progress can be reached from the compaction task executor.

Track sstable scrub in validate mode with read monitors.
2023-10-12 17:03:46 +02:00
Aleksandra Martyniuk
3553556708 compaction: keep compaction_progress_monitor in compaction_task_executor
Keep compaction_progress_monitor in compaction_task_executor and pass a reference
to it further, so that the compaction progress could be retrieved out of it.
2023-10-12 17:03:46 +02:00
Aleksandra Martyniuk
37da5a0638 compaction: use read monitor generator for all compactions
Compaction read monitor generators are now used in all compaction types.
Classes which did not use _monitor_generator so far create it with
_use_backlog_tracker set to no, so as not to impact the backlog tracker.
2023-10-12 17:03:46 +02:00
Aleksandra Martyniuk
22bf3c03df compaction: add compaction_progress_monitor
In the following patches, compaction_read_monitor_generator will be used
to find the progress of compaction_task_executors. To avoid unnecessarily
prolonging the lifetime of the class and exposing its internals outside of
compaction.cc, compaction_progress_monitor is created.

Compaction class keeps a reference to the compaction_progress_monitor.
Inheriting classes which actually use compaction_read_monitor_generator
need to set it with the set_generator method.
2023-10-12 17:03:46 +02:00
Aleksandra Martyniuk
b852ad25bf compaction: add flag to compaction_read_monitor_generator
Following patches will use compaction_read_monitor_generator
to track progress of all types of compaction. Some of them should
not be registered in compaction_backlog_tracker.

_use_backlog_tracker flag, which is by default set to true, is
added to compaction_read_monitor_generator and passed to all
compaction_read_monitors created by this generator.
2023-10-12 17:03:46 +02:00
Wojciech Mitros
055f061706 test: handle fast execution of test_user_function_filtering
Currently, when the test is executed too quickly, the timestamp
inserted into the 'my_table' table might be the same as the
timestamp used in the SELECT statement for comparison. However,
the statement only selects rows where the inserted timestamp
is strictly lower than the current timestamp. As a result, when this
comparison fails, we may skip evaluating the following comparison,
which uses a user-defined function that is supposed to make the
statement fail with an error. Instead, the select statement
simply returns no rows and the test case fails.
To fix this, use the less-or-equal operator instead
of the strictly-less operator for comparing timestamps.

Fixes #15616

Closes scylladb/scylladb#15699
2023-10-12 17:04:43 +03:00
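The race can be reproduced with a toy comparison — assuming a row whose insertion timestamp equals the SELECT's cutoff, a strict `<` silently drops it while `<=` keeps it (names are illustrative):

```python
def rows_before(rows, cutoff, inclusive):
    """Select rows whose timestamp is before the cutoff. With a strict
    '<', a row inserted in the same clock tick as the cutoff is dropped,
    which is the race the test hit."""
    keep = (lambda ts: ts <= cutoff) if inclusive else (lambda ts: ts < cutoff)
    return [r for r in rows if keep(r["ts"])]

fast = [{"ts": 100}]                                # inserted in the same tick as the SELECT
strict = rows_before(fast, 100, inclusive=False)    # empty: the UDF branch never runs
fixed = rows_before(fast, 100, inclusive=True)      # the row is kept, as the test expects
```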
Tomasz Grabiec
accac7efd8 test: test_tablets.py: Enable verbose logging
This is in order to aid investigation of flakiness of the test, which
fails due to a timeout during scan after cluster restart in debug mode.

See #14746.

I enable trace-level logging for some scylla-side loggers and
inject logging of sent and received messages on the driver side.

Closes scylladb/scylladb#15696
2023-10-12 17:03:19 +03:00
Jan Ciolek
940e44f887 db/view: change log level of failed view updates to WARN
When a remote view update doesn't succeed there's a log message
saying "Error applying view update...".
This message had log level ERROR, but it's not really a hard error.
View updates can fail for a multitude of reasons, even during normal operation.
A failing view update isn't fatal; it will be saved as a view hint and retried later.

Let's change the log level to WARN. It's something that shouldn't happen too much,
but it's not a disaster either.
ERROR log level causes trouble in tests which assume that an ERROR level message
means that the test has failed.

Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784

For local view updates the log level stays at "ERROR", local view updates shouldn't fail.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes scylladb/scylladb#15640
2023-10-11 18:19:23 +03:00
Israel Fruchter
41c80929eb Update tools/cqlsh submodule
* tools/cqlsh 66ae7eac...426fa0ea (8):
  > Updated Scylla Driver[Issue scylladb/scylla-cqlsh#55]
  > copyutil: closing the local end of pipes after processes starts
  > setup.py: specify Cython language_level explicitly
  > setup.py: pass extensions as a list
  > setup.py: reindent block in else branch
  > setup.py: early return in get_extension()
  > reloc: install build==0.10.0
  > reloc: add --verbose option to build_reloc.sh

Fixes: https://github.com/scylladb/scylla-cqlsh/issues/37

Closes scylladb/scylladb#15685
2023-10-11 17:29:23 +03:00
Aleksandra Martyniuk
5a10bd44bf test_storage_service: use new_test_snapshot fixture
test_storage_service_keyspace_cleanup_with_no_owned_ranges
from test_storage_service.py creates snapshots with tags based
on the current time. Thus, if a test runs on the same node twice
within a short enough time interval, there may be a name collision
between the snapshots of the two runs, causing the second
run to fail on assertions.

Use new_test_snapshot fixture to drop snapshots after the test.

Delete my_snapshot_tags as it's no longer necessary.

Fixes: #15680.

Closes scylladb/scylladb#15683
2023-10-11 00:53:36 +03:00
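The idea of the fixture can be sketched as a context manager that generates a unique tag and always drops the snapshot on exit; `FakeNode` and its methods are stand-ins, not the real pylib API:

```python
import contextlib
import uuid

@contextlib.contextmanager
def new_test_snapshot(node):
    """Yield a unique snapshot tag and always drop the snapshot
    afterwards, so a rerun on the same node cannot collide with a
    leftover tag."""
    tag = f"test_snapshot_{uuid.uuid4().hex}"
    node.take_snapshot(tag)
    try:
        yield tag
    finally:
        node.drop_snapshot(tag)

class FakeNode:
    """Stand-in for a cluster node that tracks snapshot tags."""
    def __init__(self):
        self.snapshots = set()
    def take_snapshot(self, tag):
        assert tag not in self.snapshots, "tag collision"
        self.snapshots.add(tag)
    def drop_snapshot(self, tag):
        self.snapshots.discard(tag)

node = FakeNode()
with new_test_snapshot(node) as tag:
    in_use = tag in node.snapshots   # snapshot exists inside the test
left_over = bool(node.snapshots)     # nothing survives the test
```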
Avi Kivity
35849fc901 Revert "Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun"
This reverts commit 3d4398d1b2, reversing
changes made to 45dfce6632. The commit
causes some schema changes to be lost due to incorrect timestamps
in some mutations. More information is available in [1].

Reopens: scylladb/scylladb#7620
Reopens: scylladb/scylladb#13957

Fixes scylladb/scylladb#15530.

[1] https://github.com/scylladb/scylladb/pull/15687
2023-10-11 00:32:05 +03:00
Kamil Braun
05ede7a042 test/pylib: always return a response from put_json
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide the `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.

Fix this.

Closes scylladb/scylladb#15674
2023-10-09 14:35:04 +03:00
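A minimal sketch of the behavior being fixed — a `put_json` that returns the body only when `response_type` is given, so forgetting the parameter yields `None`; the `send` callable stands in for the real HTTP layer:

```python
import json

def put_json(send, resource, data, response_type=None):
    """Perform a PUT; parse and return the body only when
    response_type is given, otherwise return None even if the
    server responded (sketch of the described behavior)."""
    body = send("PUT", resource, json.dumps(data))
    if response_type is None:
        return None
    return json.loads(body) if response_type == "json" else body

# A fake transport whose server always answers {"ok": true}.
fake = lambda method, resource, payload: json.dumps({"ok": True})

silent = put_json(fake, "/cluster/server", {})                        # forgotten response_type
parsed = put_json(fake, "/cluster/server", {}, response_type="json")  # the fix
```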
Kefu Chai
e76a02abc5 build: move check for NIX_CC into dynamic_linker_option()
`employ_ld_trickery` is only used by `dynamic_linker_option()`, so
move it into this function.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-09 11:11:57 +08:00
Kefu Chai
e85fc9f8be build: extract dynamic_linker_option(): out
this change helps to remove more global statements.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-09 11:11:57 +08:00
Kefu Chai
21b61e8f0a build: move headers into write_build_file()
`headers` is only used in this function, so move it closer to where
it is used.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-09 11:11:57 +08:00
Kefu Chai
b3e5c8c348 build: cmake: pass -dynamic-linker to ld
to match the behavior of `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-09 11:07:13 +08:00
Kefu Chai
ce46f7b91b build: cmake: set CMAKE_EXE_LINKER_FLAGS in mode.common.cmake
so that CMakeLists.txt is less cluttered, as we will append
the `--dynamic-linker` option to the LDFLAGS.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-09 11:07:13 +08:00
Raphael S. Carvalho
4e6fe34501 tests: Synchronize boost logger for multithreaded tests in sstable_directory_test
The logger is not thread safe, so a multithreaded test can concurrently
write into the log, yielding unreadable XMLs.

Example:
boost/sstable_directory_test: failed to parse XML output '/scylladir/testlog/x86_64/release/xml/boost.sstable_directory_test.sstable_directory_shared_sstables_reshard_correctly.3.xunit.xml': not well-formed (invalid token): line 1, column 1351

The critical (today's unprotected) section is in boost/test/utils/xml_printer.hpp:
```
inline std::ostream&
operator<<( custom_printer<cdata> const& p, const_string value )
{
    *p << BOOST_TEST_L( "<![CDATA[" );
    print_escaped_cdata( *p, value );
    return  *p << BOOST_TEST_L( "]]>" );
}
```

The problem is not restricted to xml, but the unreadable xml file caused
the test to fail when trying to parse it, to present a summary.

New thread-safe variants of BOOST_REQUIRE and BOOST_REQUIRE_EQUAL are
introduced to help multithreaded tests. We'll start patching tests of
sstable_directory_test that will call BOOST_REQUIRE* from multiple
threads. Later, we can expand its usage to other tests.

Fixes #15654.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15655
2023-10-08 15:57:08 +03:00
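The fix itself is in C++/Boost.Test, but the underlying idea — serialize concurrent report writes behind a lock so interleaved bytes cannot corrupt the XML — can be sketched in Python:

```python
import io
import threading

class SyncWriter:
    """Serialize concurrent writes to one stream with a lock; without
    this, two threads reporting at once can interleave output and
    yield unparsable XML (analogue of the thread-safe BOOST_REQUIRE
    variants)."""
    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()
    def write(self, text):
        with self._lock:
            self._stream.write(text)

buf = io.StringIO()
log = SyncWriter(buf)
threads = [threading.Thread(target=log.write, args=(f"<case id='{i}'/>\n",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
lines = buf.getvalue().splitlines()  # each line is one intact record
```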
Kefu Chai
1efd0d9a92 test: set use_uuid to true by default in sstables::test_env
for better coverage of uuid-based sstable identifiers. since this
option is enabled by default, this also matches our tests with the
default behavior of scylladb.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-07 18:56:47 +08:00
Kefu Chai
50c8619ed9 test: enable test to set uuid_sstable_identifiers
some of the tests still rely on the integer-based sstable
identifier, so let's add a method to test_env so that the tests
relying on this can opt out. we will change the default setting
of sstables::test_env to use the uuid-based sstable identifier in the
next commit. this change does not change the existing behavior:
it just adds a new knob to test_env_config and lets the tests
relying on the integer-based identifier customize the
test_env_config to disable use_uuid.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-10-07 18:56:47 +08:00
Avi Kivity
765e193122 Merge 'db/hints: Modernize manager' from Dawid Mędrek
This PR is another step in refactoring the Hinted Handoff module. It aims at modernizing the code by moving to coroutines, using `std::ranges` instead of Boost's ranges where possible, and using other features coming with the newer C++ standards.

It also tries to make the code clearer and get rid of confusing elements, e.g. using shared pointers where they shouldn't be used or marking methods as virtual even though nothing derives from the class. It also prevents `manager.hh` from giving direct access to internal structures (`hint_endpoint_manager` in this case).

Refs #15358

Closes scylladb/scylladb#15631

* github.com:scylladb/scylladb:
  db/hints/manager: Reword comments about state
  db/hints/manager: Unfriend space_watchdog
  db/hints: Remove a redundant alias
  db/hints: Remove an unused namespace
  db/hints: Coroutinize change_host_filter()
  db/hints: Coroutinize drain_for()
  db/hints: Clean up can_hint_for()
  db/hints: Clean up store_hint()
  db/hints: Clean up too_many_in_flight_hints_for()
  db/hints: Refactor get_ep_manager()
  db/hints: Coroutinize wait_for_sync_point()
  db/hints: Use std::span in calculate_current_sync_point
  db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()
  db/hints: Clean up manager::forbid_hints()
  db/hints: Clean up manager::allow_hints()
  db/hints: Coroutinize compute_hints_dir_device_id()
  db/hints: Clean up manager::stop()
  db/hints: Clean up manager::start()
  db/hints/manager: Clean up the constructor
  db/hints: Remove boilerplate drain_lock()
  db/hints: Let drain_for() return a future
  db/hints: Remove ep_managers_end
  db/hints: Remove find_ep_manager
  db/hints: Use manager as API for hint_endpoint_manager
  db/hints: Don't mark have_ep_manager()'s definition as inline
  db/hints: Remove make_directory_initializer()
  db/hints/manager: Order constructors
  db/hints: Move ~manager() and mark it as noexcept
  db/hints: Use reference for storage proxy
  db/hints/manager: Explicitly delete copy constructor
  db/hints: Capitalize constants
  db/hints/manager: Hide declarations
  db/hints/manager: Move the defintions of static members to the header
  db/hints: Move make_dummy() to the header
  db/hints: Don't explicitly define ~directory_initializer()
  db/hints: Change the order of logging in ensure_created_and_verified()
  db/hints: Coroutinize ensure_rebalanced()
  db/hints: Coroutinize ensure_created_and_verified()
  db/hints: Improve formatting of directory_initializer::impl
  db/hints: Do not rely on the values of enums
  db/hints: Move the implementation of directory_initializer
  db/hints: Prefer nested namespaces
  db/hints: Remove an unused alias from manager.hh
  db/hints: Reorder includes in manager.hh and .cc
2023-10-06 17:20:33 +03:00
Anna Stuchlik
5d3584faa5 doc: add a note about Raft
This commit adds a note to specify
that the information on the Handling
Failures page only refers to clusters
with Raft enabled.
Also, the comment is included to remove
the note in future versions.
2023-10-06 16:04:43 +02:00
Pavel Emelyanov
e485c854b2 distributed_loader: Remove explicit sharded<erms>
The sharded replication map was needed to provide a sharded parameter
for the sstable directory. Now the directory gets it via the table
reference, and thus the erms parameter becomes unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-06 15:57:45 +03:00
Pavel Emelyanov
c2eb1ae543 distributed_loader: Brush up start_subdir()
Drop some local references to class members and line up the arguments
for starting the distributed sstable directory. Purely a cleanup patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-06 15:57:03 +03:00
Pavel Emelyanov
795dcf2ead sstable_directory: Add enlightened construction
The existing constructor is pretty heavyweight for the distributed
loader to use -- it needs to be passed 4 sharded parameters, which looks
pretty bulky in the text editor. However, 5 constructor arguments are
obtained directly from the table, so the dist. loader code with a global
table pointer at hand can pass _it_ as a sharded parameter and let the
sstable directory extract what it needs.

The sad news is that sstable_directory cannot be switched to just use a
table reference: the tools code doesn't have a table at hand, but needs
the facilities sstable_directory provides.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-06 15:54:51 +03:00
Pavel Emelyanov
e004469827 table: Add global_table_ptr::as_sharded_parameter()
The method returns seastar::sharded_parameter<> for the global table
that evaluates into a local table reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-06 15:53:57 +03:00
Botond Dénes
0ea0982590 Merge 'test/pylib: better code consistency, less boilerplate' from Kamil Braun
Refactor the code to be more consistent -- we often did the same thing in multiple ways depending on the endpoint, such as how we returned errors (some endpoints would return them through exceptions, others would wrap them into `aiohttp.web.Response`s). Choose the arguably least boilerplate'y way in each case.

Then reduce the boilerplate even further.

Thanks to these refactors, modifying the framework in the future will require less work and it will be more obvious which of the possible ways to modify it should be picked (i.e. consistent with the existing code.)

Closes scylladb/scylladb#15646

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: reduce `aiohttp` boilerplate
  test/pylib: always return data as JSON from endpoints
  test/pylib: scylla_cluster: catch `HTTPError` in topology change endpoints
  test/pylib: scylla_cluster: do sanity/precondition checks through asserts
  test/pylib: scylla_cluster: return errors through exceptions
  test/pylib: use JSON data to pass `expected_error` in `server_start`
  test/pylib: use PUT instead of GET for mutating endpoints
  test/pylib: rest_client: make `data` optional in `put_json`
  test/pylib: fix some type errors
2023-10-06 14:51:16 +03:00
Dawid Medrek
6fdca0d3a8 db/hints/manager: Reword comments about state
The current comments should be clearer to someone
not familiar with the module. This commit also makes
them abide by the limit of 120 characters per line.
2023-10-06 13:25:30 +02:00
Dawid Medrek
aa38ea3642 db/hints/manager: Unfriend space_watchdog
space_watchdog is a friend of shard hint manager just to
be able to execute one of its functions. This commit changes
that by unfriending the class and exposing the function.
2023-10-06 13:25:30 +02:00
Dawid Medrek
6cd0153954 db/hints: Remove a redundant alias 2023-10-06 13:25:30 +02:00
Dawid Medrek
ddc385bce0 db/hints: Remove an unused namespace 2023-10-06 13:25:30 +02:00
Dawid Medrek
76d414012b db/hints: Coroutinize change_host_filter() 2023-10-06 13:25:30 +02:00
Dawid Medrek
09eb30e6f1 db/hints: Coroutinize drain_for()
This commit turns the function into a coroutine
and makes the code less compact and more readable.
2023-10-06 13:25:30 +02:00
Dawid Medrek
907a572e24 db/hints: Clean up can_hint_for()
This commit gets rid of unnecessary additional calls to functions
and makes all lines abide by the limit of 120 characters.
2023-10-06 13:25:30 +02:00
Dawid Medrek
596e1f9859 db/hints: Clean up store_hint()
This commit makes the function abide by the limit
of 120 characters per line.
2023-10-06 13:25:30 +02:00
Dawid Medrek
8a43f94ca6 db/hints: Clean up too_many_in_flight_hints_for()
This commit makes the return statement more readable.
It also makes the comment abide by the limit of 120 characters per line.
2023-10-06 13:25:30 +02:00
Dawid Medrek
96a5906621 db/hints: Refactor get_ep_manager() 2023-10-06 13:25:30 +02:00
Dawid Medrek
8b591be3c3 db/hints: Coroutinize wait_for_sync_point()
This commit coroutinizes the function and adds
a comment explaining a non-trivial case.
2023-10-06 13:25:27 +02:00
Dawid Medrek
fee3aafd80 db/hints: Use std::span in calculate_current_sync_point
std::span is a lot more flexible than std::vector as it allows
for arbitrary contiguous ranges.
2023-10-06 12:36:05 +02:00
Dawid Medrek
64fd4d6323 db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints() 2023-10-06 12:26:55 +02:00
Dawid Medrek
58cd5c4167 db/hints: Clean up manager::forbid_hints() 2023-10-06 12:26:55 +02:00
Dawid Medrek
f8ed93f5bc db/hints: Clean up manager::allow_hints() 2023-10-06 12:26:52 +02:00
Dawid Medrek
bfe32bcf89 db/hints: Coroutinize compute_hints_dir_device_id() 2023-10-06 12:18:30 +02:00
Dawid Medrek
8f28eb6522 db/hints: Clean up manager::stop()
This commit gets rid of boilerplate in the function,
leverages a range pipe and explicit types to make
the code more readable, and changes the logs to
make it clearer what happens.
2023-10-06 12:18:30 +02:00
Dawid Medrek
a384caece0 db/hints: Clean up manager::start()
This commit coroutinizes the function and makes it less compact.
2023-10-06 12:18:30 +02:00
Dawid Medrek
2db97aaf81 db/hints/manager: Clean up the constructor
fmt::to_string should be preferred to seastar::format.
It's clearer and simpler. Besides that, this commit makes
the code abide by the limit of 120 characters per line.
2023-10-06 12:18:30 +02:00
Dawid Medrek
6c10a86791 db/hints: Remove boilerplate drain_lock() 2023-10-06 12:18:30 +02:00
Dawid Medrek
f1f35ba819 db/hints: Let drain_for() return a future
Currently, the function doesn't return anything.
However, if the future doesn't need to be awaited,
the caller can decide that. There is no reason
to make that decision in the function itself.
2023-10-06 12:18:25 +02:00
Dawid Medrek
79e1412f14 db/hints: Remove ep_managers_end
The methods are redundant and are effectively
code boilerplate.
2023-10-06 12:15:04 +02:00
Dawid Medrek
cfbacb29bb db/hints: Remove find_ep_manager
The methods are redundant and are effectively
code boilerplate.
2023-10-06 12:15:04 +02:00
Dawid Medrek
1c70a18fc7 db/hints: Use manager as API for hint_endpoint_manager
This commit makes with_file_update_mutex() a method of hint_endpoint_manager
and introduces db::hints::manager::with_file_update_mutex_for() for accessing
it from the outside. This way, hint_endpoint_manager is hidden and no one
needs to know about its existence.
2023-10-06 12:15:01 +02:00
Dawid Medrek
d068143b83 db/hints: Don't mark have_ep_manager()'s definition as inline
Doing that doesn't allow for external linkage, so
it's not accessible from other files.
2023-10-06 11:54:15 +02:00
Dawid Medrek
58249363bc db/hints: Remove make_directory_initializer()
The function is never used. It's not even implemented.
2023-10-06 11:54:15 +02:00
Dawid Medrek
f47a669f75 db/hints/manager: Order constructors
This commit orders constructors of db::hints::manager for readability.
2023-10-06 11:54:15 +02:00
Dawid Medrek
4663f72990 db/hints: Move ~manager() and mark it as noexcept
The destructor is trivial and there is no reason
to keep it in the source file. We mark it as noexcept too.
2023-10-06 11:54:15 +02:00
Dawid Medrek
18a2831186 db/hints: Use reference for storage proxy
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.

The same changes were applied to db::hints::resource_manager.
The rationale is the same.
2023-10-06 11:54:15 +02:00
Dawid Medrek
3c347cc196 db/hints/manager: Explicitly delete copy constructor
This commit explicitly deletes the copy constructor of
db::hints::manager and its copy assignment. They're not
used in the code, and they should not be.
2023-10-06 11:54:15 +02:00
Dawid Medrek
ee5a5c1661 db/hints: Capitalize constants
This is a common convention. Follow it for readability.
2023-10-06 11:54:15 +02:00
Dawid Medrek
fd30bac7b1 db/hints/manager: Hide declarations 2023-10-06 11:54:15 +02:00
Dawid Medrek
4b03cba1bf db/hints/manager: Move the definitions of static members to the header
If the variables are accessible from the outside, it makes
sense to also expose their initial values to the user.
This commit moves them to the header and marks them as inline.
2023-10-06 11:54:15 +02:00
Dawid Medrek
c3ab28f5e9 db/hints: Move make_dummy() to the header
The function is trivial. It can also be marked as noexcept.
2023-10-06 11:54:15 +02:00
Dawid Medrek
5e333f0a52 db/hints: Don't explicitly define ~directory_initializer()
The destructor is the default destructor, and it is safe
to drop it altogether.
2023-10-06 11:53:02 +02:00
Kamil Braun
bf17aa93e2 test/pylib: scylla_cluster: reduce aiohttp boilerplate
Make handlers return their results directly, without wrapping them into
`aiohttp.web.Response`s. Instead the wrapping is done in a generic way
when defining the routes.
2023-10-06 11:24:13 +02:00
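The shape of that refactor, sketched in Python with the Response modeled as a plain dict (the real code wraps into `aiohttp.web.Response`; handler and route names are illustrative):

```python
import json

def wrap_json(handler):
    """Decorate a plain handler (returning Python data) so the route
    layer does the Response wrapping once, instead of every handler
    building a Response by hand."""
    def route(request):
        result = handler(request)
        return {"status": 200,
                "content_type": "application/json",
                "body": json.dumps(result)}
    return route

def server_count(request):
    # The handler now returns data directly -- no Response boilerplate.
    return {"servers": 3}

resp = wrap_json(server_count)({})
```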
Kamil Braun
d3bc0d47e0 test/pylib: always return data as JSON from endpoints
Some endpoint handlers return JSON, some return text, some return empty
responses.

Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
2023-10-06 11:24:02 +02:00
Dawid Medrek
9f215d3cf1 db/hints: Change the order of logging in ensure_created_and_verified()
The new logging order seems to make more sense, i.e.
we first log that we're creating and validating directories,
and only then do we start doing that.
The previous order, in which those actions were reversed, didn't
match the log message, because the action was already
done by the time we informed the user of it.
2023-10-06 11:14:41 +02:00
Dawid Medrek
4ad3e8d37b db/hints: Coroutinize ensure_rebalanced() 2023-10-06 11:14:41 +02:00
Dawid Medrek
672cdb5c05 db/hints: Coroutinize ensure_created_and_verified() 2023-10-06 11:14:41 +02:00
Dawid Medrek
a5f14cb130 db/hints: Improve formatting of directory_initializer::impl
The implementation class has been divided into clear sections.
The indentation has also been adjusted to what is commonly
used in the codebase.
2023-10-06 11:14:41 +02:00
Dawid Medrek
500175d738 db/hints: Do not rely on the values of enums
These changes move away from relying on the specific
values of enum variants. The code based on arithmetic
on them is trivial, and there is no reason not to use operator==
and operator!= instead. This should make the code less error-prone
and easier to understand.
2023-10-06 11:14:41 +02:00
Dawid Medrek
d0b4d9f14f db/hints: Move the implementation of directory_initializer
This commit moves said code to the top of manager.cc
to match its position in the header file. That should
make navigation easier.
2023-10-06 11:14:41 +02:00
Dawid Medrek
b516fe1fc0 db/hints: Prefer nested namespaces
This reduces the amount of boilerplate.
2023-10-06 11:14:41 +02:00
Dawid Medrek
75a85b224b db/hints: Remove an unused alias from manager.hh 2023-10-06 11:14:41 +02:00
Dawid Medrek
fc80c57bec db/hints: Reorder includes in manager.hh and .cc
These changes improve the readability of the included headers.
2023-10-06 11:14:41 +02:00
Kamil Braun
2d4f157216 test/pylib: scylla_cluster: catch HTTPError in topology change endpoints
Exceptions flying from `RESTClient` (which used to communicate with
Scylla's REST API) are in fact not `RuntimeException`s, they are
`HTTPError`s (a type defined in the `rest_client` module). So they would
just fly through our catch branches, and the additional info (such as
log file path) would not be attached. Fix this.
2023-10-06 10:58:34 +02:00
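The bug pattern fixed above — catching a base class that the raised exception doesn't actually derive from, so the except branch never fires — can be sketched generically (`HTTPError` here is a stand-in for the type defined in the `rest_client` module; names are illustrative):

```python
class HTTPError(Exception):
    """Stand-in for the error type raised by RESTClient."""
    def __init__(self, message: str):
        super().__init__(message)
        self.message = message

def start_server():
    # simulates a topology-change endpoint failing inside RESTClient
    raise HTTPError("boom")

def with_context(log_path: str):
    try:
        start_server()
    except RuntimeError as e:   # never matches: HTTPError is not a RuntimeError
        raise RuntimeError(f"{e}, check log at {log_path}") from e
    except HTTPError as e:      # the fix: catch the type actually raised
        raise HTTPError(f"{e.message}, check log at {log_path}") from e
```

Without the second branch, the exception flies through and the additional info (such as the log file path) is never attached.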
Kamil Braun
e001463c4a test/pylib: scylla_cluster: do sanity/precondition checks through asserts
Some of them are already done that way, so turn the rest to asserts as
well for consistency and code terseness, instead of using the `if` +
`raise` pattern.
2023-10-06 10:58:34 +02:00
Kamil Braun
9072379863 test/pylib: scylla_cluster: return errors through exceptions
Since 2f84e820fd it is possible to return errors from
`ScyllaClusterManager` handlers through exceptions without losing the
contents of these exceptions (the contents arrive at `ManagerClient` and
the test can inspect them, unlike in the past where the client would get
a generic `InternalServerError`).

Change all handlers to return errors through exceptions (like some
already do) and get rid of the `ActionReturn` boilerplate.

When checking for `self.cluster`, do it through assertions, like most of
the handlers already do, instead of using the `if` + `raise` pattern.
2023-10-06 10:58:32 +02:00
Kamil Braun
f848d7b5c0 test/pylib: use JSON data to pass expected_error in server_start
Most other endpoints receive data through request body as JSON, this one
endpoint is an exception for some reason. Make it consistent with
others.
2023-10-06 10:55:45 +02:00
Kamil Braun
20ff2ae5e1 test/pylib: use PUT instead of GET for mutating endpoints
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.

The endpoints were inconsistently using verbs, like using GET for
endpoints that would have side effects. Use PUT for these.
2023-10-06 10:55:45 +02:00
Kamil Braun
69a6910a90 test/pylib: rest_client: make data optional in put_json 2023-10-06 10:55:45 +02:00
Kamil Braun
33463df7d2 test/pylib: fix some type errors 2023-10-06 10:55:45 +02:00
Raphael S. Carvalho
2a81b2e49a types: Avoid unneeded copy in simple_date_type_impl::from_sstring()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15645
2023-10-06 11:05:27 +03:00
Botond Dénes
8c03eeb85d Merge 'Sanitize hints API handlers and remove proxy from http context' from Pavel Emelyanov
This is the continuation of 3e74432dbf.

Registering API handlers for services needs to
 - happen next to the corresponding service's start
 - use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
 - get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)

Hints API handlers want to use proxy, but also reference gossiper and capture proxy via the http context. This PR fixes both and removes the http_context -> proxy dependency, as it is no longer needed

Closes scylladb/scylladb#15644

* github.com:scylladb/scylladb:
  api: Remove proxy reference from http context
  api,hints: Use proxy instead of ctx
  api,hints: Pass sharded<proxy>& instead of gossiper&
  api,hints: Fix indentation after previous patch
  api,hints: Move gossiper access to proxy
2023-10-06 11:04:27 +03:00
Avi Kivity
854188a486 Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large:

1. Large allocations.  Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. The coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s.
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.

This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table with RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of the query.

Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before case).

Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111

Closes scylladb/scylladb#15414

* github.com:scylladb/scylladb:
  test/topology_custom: add test_read_repair.py
  replica/mutation_dump: detect end-of-page in range-scans
  tools/scylla-sstable: write: abort parser thread if writing fails
  test/pylib: add REST methods to get node exe and workdir paths
  test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
  service/storage_proxy: add trace points for the actual read executor type
  service/storage_proxy: add trace points for read-repair
  storage_proxy: Add more trace-level logging to read-repair
  database: Fix accounting of small partitions in mutation query
  database, storage_proxy: Reconcile pages with no live rows incrementally
2023-10-05 22:39:34 +03:00
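The core idea of the fix above — cut a page as soon as the memory accounter's budget is exhausted, counting dead rows (tombstones) toward the budget even though they carry no live data — can be sketched with a toy pager (a simplified model; the real logic lives in the replica's mutation query path):

```python
def paginate(rows, page_budget):
    """Yield pages, closing each page once the accounted size reaches
    page_budget. Dead rows count toward the budget too, so a long run
    of tombstones no longer accumulates into one huge page."""
    page, used = [], 0
    for row in rows:
        page.append(row)
        used += row["size"]
        if used >= page_budget:
            yield page
            page, used = [], 0
    if page:
        yield page

# 100 dead rows followed by one live row: instead of one giant page,
# the reader emits many small pages and never exceeds the budget
rows = [{"live": False, "size": 10}] * 100 + [{"live": True, "size": 10}]
pages = list(paginate(rows, page_budget=100))
```

Pages containing only tombstones are what lets the coordinator keep the client connection alive instead of timing out while reconciling.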
Avi Kivity
197b7590df Update tools/jmx submodule
* tools/jmx d107758...8d15342 (2):
  > Revert "install-dependencies.sh: do not install weak dependencies"
  > install-dependencies.sh: do not install weak dependencies Especially for Java, we really do not need the tens of packages and MBs it adds, just because Java apps can be built and use sound and graphics and whatnot.
2023-10-05 22:36:54 +03:00
Avi Kivity
ee57f69b17 Update tools/java submodule
* tools/java 9dddad27bf...3c09ab97a9 (1):
  > nodetool: parse and forward -h|--host to nodetool
2023-10-05 22:35:58 +03:00
Michael Huang
75109e9519 cql3: Fix invalid JSON parsing for JSON objects with ASCII keys
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.

Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.

Fixes: #7949

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Closes scylladb/scylladb#15499
2023-10-05 22:26:08 +03:00
Avi Kivity
e600f35d1e Merge 'logalloc, reader_concurrency_semaphore: cooperate on OOM kills' from Botond Dénes
Consider the following code snippet:
```c++
future<> foo() {
    semaphore.consume(1024);
}

future<> bar() {
    return _allocating_section([&] {
        foo();
    });
}
```

If the consumed memory triggers the OOM kill limit, the semaphore will throw `std::bad_alloc`. The allocating section will catch this, bump std reserves and retry the lambda. Bumping the reserves will not do anything to prevent the next call to `consume()` from triggering the kill limit. So this cycle will repeat until std reserves are so large that ensuring the reserve fails. At this point LSA gives up and re-throws the `std::bad_alloc`. Beyond the useless time spent on code that is doomed to fail, this also results in expensive LSA compaction and eviction of the cache (while trying to ensure reserves).
Prevent this situation by throwing a distinct exception type which is derived from `std::bad_alloc`. Allocating section will not retry on seeing this exception.
A test reproducing the bug is also added.

Fixes: #15278

Closes scylladb/scylladb#15581

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
  utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
  reader_concurrency_semaphore: use utils::memory_limit_reached exception
  utils: add memory_limit_reached exception
2023-10-05 19:47:21 +03:00
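The fix above hinges on a distinct exception type derived from the generic allocation failure, so a retry loop can single it out. A Python analogue of the pattern (the real code uses a C++ type derived from `std::bad_alloc`; all names here are illustrative):

```python
class MemoryLimitReached(MemoryError):
    """Analogue of utils::memory_limit_reached: derived from the generic
    allocation failure, but signals a limit that retrying cannot fix."""

def allocating_section(fn, max_retries=3):
    # Analogue of logalloc's allocating section: retry plain allocation
    # failures (after "bumping reserves" in the real code), but give up
    # immediately on MemoryLimitReached -- more reserves would not help.
    for _ in range(max_retries):
        try:
            return fn()
        except MemoryLimitReached:
            raise          # doomed to fail again; don't waste work retrying
        except MemoryError:
            continue       # bump reserves and retry in the real code
    return fn()
```

The ordering of the `except` clauses matters: the subclass must be checked before the base, otherwise the kill-limit exception would be swallowed by the retry branch.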
Pavel Emelyanov
967faa97e4 proxy: Coroutinize start_hints_manager()
All the other calls managing hints are coroutinized

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15641
2023-10-05 16:16:27 +02:00
Pavel Emelyanov
162642ac18 api: Remove proxy reference from http context
Now it's unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:16:52 +03:00
Pavel Emelyanov
e76f23994c api,hints: Use proxy instead of ctx
Hints endpoints currently use the ctx.sp reference, but they have the direct
proxy reference at hand and should prefer it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:16:01 +03:00
Pavel Emelyanov
6ce7ec4a5e api,hints: Pass sharded<proxy>& instead of gossiper&
Proxy is the target service to handle hints API endpoints. Need to pass
it as argument to handlers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:15:28 +03:00
Pavel Emelyanov
5f521116a2 api,hints: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:15:20 +03:00
Pavel Emelyanov
53891dd9cc api,hints: Move gossiper access to proxy
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:14:26 +03:00
Botond Dénes
96787ec0a5 Merge 'Do not keep excessive info on sstables::entry_descriptor' from Pavel Emelyanov
The descriptor in question is used to parse an sstable's file path and return back the result. The parser, among "relevant" info, also parses the sstable directory and keyspace+table names. However, there is (almost) no code that needs those strings, and the need to construct the descriptor with them makes some places obscurely use empty strings.

The PR removes sstable's directory, keyspace and table names from descriptor and, while at it, relaxes the sstable directory code that makes descriptor out of a real sstable object by (!) parsing its Data file path back.

Closes scylladb/scylladb#15617

* github.com:scylladb/scylladb:
  sstables: Make descriptor from sstable without parsing
  sstables: Do not keep directory, keyspace and table names on descriptor
  sstables: Make tuple inside helper parser method
  sstables: Do not use ks.cf pair from descriptor
  sstables: Return tuple from parse_path() without ks.cf hints
  sstables: Rename make_descriptor() to parse_path()
2023-10-05 15:15:23 +03:00
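The shape of the refactoring above — the parser hands the ks.cf pair back as part of a tuple instead of storing it on the descriptor — can be sketched as follows. The path layout below is deliberately simplified for illustration (real sstable file names are more involved), and all names are hypothetical:

```python
import re
from typing import NamedTuple

class EntryDescriptor(NamedTuple):
    # after the cleanup, the descriptor keeps only component info,
    # not the directory or keyspace/table names
    generation: int
    component: str

# simplified path layout, for illustration only
PATH_RE = re.compile(
    r"(?P<ks>[^/]+)/(?P<cf>[^/]+)/me-(?P<gen>\d+)-big-(?P<comp>\w+)\.db$")

def parse_path(path: str):
    """Return (keyspace, table, descriptor): the ks.cf pair is handed
    back to the one caller that needs it (the sstables loader) instead
    of being stored on every descriptor."""
    m = PATH_RE.search(path)
    return m["ks"], m["cf"], EntryDescriptor(int(m["gen"]), m["comp"])
```

Callers that don't care about ks.cf simply ignore those tuple elements, which is what allows dropping the strings from the descriptor and shrinking it.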
Petr Gusev
a6087a10bd system_keyspace: drop truncation_record
This is a refactoring commit without observable
changes in behaviour.

The only usage was in get_truncation_records
method which can be inlined.
2023-10-05 15:19:59 +04:00
Petr Gusev
9d350e7532 system_keyspace: remove get_truncated_at method
The only usage is in batchlog_manager, and it
can be replaced with cf.get_truncation_time().

std::optional<std::reference_wrapper<canonical_mutation>>
is replaced with canonical_mutation* since it is
semantically the same but with less type boilerplate.
2023-10-05 15:19:59 +04:00
Petr Gusev
80fa5810a7 table: get_truncation_time: check _truncated_at is initialized 2023-10-05 15:19:59 +04:00
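The invariant built up in this commit series — the truncation time starts unset, gets initialized either at table creation or when loading truncation times, and the getter asserts it is initialized — can be sketched in Python (the real code is C++ in replica/table; names mirror the commit messages but the structure is illustrative):

```python
from typing import Optional

class Table:
    def __init__(self, is_new: bool, now: int = 0):
        # for a fresh table the truncation time is known immediately
        # (add_column_family); for a loaded table it is filled in later
        # by load_truncation_times()
        self._truncated_at: Optional[int] = now if is_new else None

    def set_truncation_time(self, t: int) -> None:
        self._truncated_at = t

    def get_truncation_time(self) -> int:
        # must have been initialized by one of the two paths above
        assert self._truncated_at is not None
        return self._truncated_at
```

Making the field optional turns a silently wrong default into an assertion failure, which is how the broken paxos_state.cc call path described below became detectable.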
Petr Gusev
59db2703cd database: add_column_family: initialize truncation_time for new tables 2023-10-05 15:19:59 +04:00
Petr Gusev
32a19fd61b database: add_column_family: rename readonly parameter to is_new
We want to make table::_truncated_at optional, so that in
get_truncation_time we can assert that it is initialized.
For existing tables this initialisation will happen in
load_truncation_times function, and for new tables we
want to initialize it in add_column_family like we do
with mark_ready_for_writes.

Now add_column_family function has parameter 'readonly', which is
set by the callers to false if we are creating a fresh new table
and not loading it from sstables. In this commit we rename this
parameter to is_new and invert the passed values.
This will allow us in the next commit to initialize _truncated_at field
for new tables.
2023-10-05 15:19:59 +04:00
Petr Gusev
b70bca71bc system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
load_truncation_times() now works only for
schema tables since the rest is not loaded
until distributed_loader::init_non_system_keyspaces.
An attempt to call cf.set_truncation_time
for a non-system table just throws an exception,
which is caught and logged at debug level.
This means that the call to cf.get_truncation_time in
paxos_state.cc has never worked as expected.

To fix that we move load_truncation_times()
closer to the point where the tables are loaded.
The function distributed_loader::populate_keyspace is
called for both system and non-system tables. Once
the tables are loaded, we use the 'truncated' table
to initialize _truncated_at field for them.

The truncation_time check for schema tables is also moved
into populate_keyspace since it seems like a more natural
place for it.
2023-10-05 15:19:52 +04:00
Pavel Emelyanov
d112098c08 sstables: Make descriptor from sstable without parsing
When loading an unshared remote sstable, sstable_directory needs to make a
descriptor out of a real sstable. For that it parses the sstable's Data
component path, which is pretty weird. It's simpler to make the descriptor
out of the sstable itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:01 +03:00
Pavel Emelyanov
96651e0ddb sstables: Do not keep directory, keyspace and table names on descriptor
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so they
just hard-code empty strings there (because they are really not used).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:01 +03:00
Pavel Emelyanov
6a601be1f3 sstables: Make tuple inside helper parser method
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:01 +03:00
Pavel Emelyanov
14ee59fb04 sstables: Do not use ks.cf pair from descriptor
There's only one place that needs the ks.cf pair from the parsed descriptor
-- the sstables loader from tools/. This code already has ks.cf from the
tuple returned after parsing and can use them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:01 +03:00
Pavel Emelyanov
62d71d398f sstables: Return tuple from parse_path() without ks.cf hints
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from the path and put it on the descriptor. This patch makes this
method return ks.cf so that later it will be possible to remove these
strings from the descriptor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 12:21:00 +03:00
Botond Dénes
498e3ec435 Merge 'Remove _schema field from sstable_set' from Piotr Jastrzębski
All `sstable_set_impl` subclasses/implementations already keep a `schema_ptr` so we can make `sstable_set_impl::make_incremental_selector` function return both the selector and the schema that's being used by it.

That way, we can use the returned schema in `sstable_set::make_incremental_selector` function instead of `sstable_set::_schema` field which makes the field unused and allows us to remove it alltogether and reduce the memory footprint of `sstable_set` objects.

Closes scylladb/scylladb#15570

* github.com:scylladb/scylladb:
  sstable_set: Remove unused _schema field
  sstable_set_impl: Return also schema from make_incremental_selector
2023-10-05 11:46:08 +03:00
Piotr Dulikowski
4340e46c66 storage_service: increase timeout during join procedure to 3 minutes
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this not always happens as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.

There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.

Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.

Fixes: #15600

Closes scylladb/scylladb#15618
2023-10-05 10:29:03 +02:00
Pavel Emelyanov
d56f9db121 sstables: Rename make_descriptor() to parse_path()
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.

Named "parse_..." makes it clear what the method is for.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 11:04:07 +03:00
Botond Dénes
36b00710c1 querier: add more information about the read on semaphore mismatch
Also rephrase the messages a bit so they are more uniform.
The goal of this change is to make semaphore mismatches easier to
diagnose, by including the table name and the permit name in the
printout.
While at it, add a test for semaphore mismatch; it didn't have one.

Refs: #15485

Closes scylladb/scylladb#15508
2023-10-05 10:27:53 +03:00
Botond Dénes
19ed3393b3 Merge 'Sanitize tracing start-stop calls' from Pavel Emelyanov
Tracing is one of the two global services left out there, with its starting and stopping being pretty hairy. In order to de-globalize it and keep its start-stop under control, the existing start-stop sequence is worth cleaning up. This PR

 * removes create_ , start_ and stop_ wrappers to un-hide the global tracing_instance thing
 * renames tracing::stop() to shutdown() as it's in fact shutdown
 * coroutinizes start/shutdown/stop while at it

Squeezed parts from #14156 that don't reorder start-stop calls

Closes scylladb/scylladb#15611

* github.com:scylladb/scylladb:
  main: Capture local tracing reference to stop tracing
  tracing: Pack testing code
  tracing: Remove stop_tracing() wrapper
  tracing: Remove start_tracing() wrapper
  tracing: Remove create_tracing() wrapper
  tracing: Make shutdown() re-entrable
  tracing: Coroutinize start/shutdown/stop
  tracing: Rename helper's stop() to shutdown()
2023-10-05 10:27:19 +03:00
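Two of the commits above — making `shutdown()` re-entrant and replacing a finally-chain with a deferred stop — follow a pattern that can be sketched in Python (the real code is C++ on top of seastar; the class and helper below are illustrative analogues):

```python
import contextlib

class Tracing:
    def __init__(self):
        self._down = False
        self.flushes = 0

    def shutdown(self):
        # re-entrant: a second call is a no-op instead of flushing twice
        if self._down:
            return
        self._down = True
        self.flushes += 1

@contextlib.contextmanager
def deferred_stop(svc):
    # analogue of a deferred stop: shutdown runs on scope exit,
    # replacing a hand-written finally-chain
    try:
        yield svc
    finally:
        svc.shutdown()
```

Because shutdown is idempotent, the deferred call is safe even when a test (or main) has already shut the service down explicitly.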
Avi Kivity
6d5823e8f5 Regenerate frozen toolchain for new Python driver
Update to scylla-driver 3.26.3.

Closes scylladb/scylladb#15629
2023-10-05 10:09:53 +03:00
Michał Chojnowski
330d221deb row_cache: when the constructor fails, clear _partitions in the right allocator
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.

Fix that.

Fixes #15632

Closes scylladb/scylladb#15633
2023-10-04 23:44:45 +02:00
Michał Chojnowski
83b71ed6b2 row_cache_test: fix test_exception_safety_of_update_from_memtable
The test does (among other things) the following:

1. Create a cache reader with buffer of size 1 and fill the buffer.
2. Update the cache.
3. Check that the reader produces the first mutation as seen before
the update (because the buffer fill should have snapshotted the first
mutation), and produces other mutation as seen after the update.

However, the test is not guaranteed to stop after the update succeeds.
Even during a successful update, an allocation might have failed
(and been retried by an allocation_section), which will cause the
body of with_allocation_failures to run again. On subsequent runs
the last check (the "3." above) fails, because the first mutation
is snapshotted already with the new version.

Fix that.

Closes scylladb/scylladb#15634
2023-10-04 23:42:03 +02:00
Tomasz Grabiec
1252d5bd7d Merge 'replica: Clean up storage of tablet on migration' from Raphael "Raph" Carvalho
When a tablet is migrated into a new home, we need to clean its storage (i.e. the compaction group) in the old home. This includes its presence in row cache, which can be shared by multiple tablets living in the same shard.

For exception safety, the following is done first in a "prepare phase" during cache invalidation.

1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) builds a list of all sstables, which represents all the storage of the tablet.

Then, once the cache is invalidated successfully, we clear the sstable sets of the group in the "execution phase", to prevent any background op from incorrectly picking them and also to allow for their deletion.

All the sstables of a tablet are deleted atomically, in order to guarantee that a failure midway won't cause data resurrection if the tablet happens to be migrated back into the old home.

Closes scylladb/scylladb#15524

* github.com:scylladb/scylladb:
  replica: Clean up storage of tablet on migration
  replica: Add async gate to compaction_group
  replica: Coroutinize compaction_group::stop()
  replica: Make compaction group flush noexcept
2023-10-04 23:41:32 +02:00
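The prepare/execute split described above — gather everything while the operation can still fail, and mutate state only after the fallible step succeeded — can be sketched generically (a toy model with hypothetical names; the real code operates on replica's compaction_group):

```python
class CompactionGroup:
    # minimal illustrative stand-in for the tablet's compaction group
    def __init__(self, sstables):
        self.sstables = list(sstables)
        self.compaction_stopped = False
        self.flushed = False

    def stop_compaction(self):
        self.compaction_stopped = True

    def flush_memtables(self):
        self.flushed = True

def cleanup_tablet_storage(group, invalidate_cache):
    """Two-phase cleanup. Prepare: stop compaction, flush, snapshot the
    sstable list -- nothing destructive yet. Execute: only after cache
    invalidation succeeded, drop the whole sstable set in one step so a
    failure midway cannot resurrect data if the tablet migrates back."""
    # prepare phase
    group.stop_compaction()
    group.flush_memtables()
    doomed = list(group.sstables)
    # fallible step: cache invalidation
    invalidate_cache()
    # execution phase: atomic removal of the whole set
    group.sstables.clear()
    return doomed
```

If invalidation throws, the group's sstable set is untouched, which is the exception-safety property the commit message describes.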
Piotr Jastrzebski
9edf6e4653 sstable_set: Remove unused _schema field
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
2023-10-04 18:50:23 +02:00
Piotr Jastrzebski
ce2be977a6 sstable_set_impl: Return also schema from make_incremental_selector
Define sstable_set_impl::selector_and_schema_t type as a tuple that
contains both a newly created selector and a schema that the selector
is using.

This will allow removal of _schema field from sstable_set class as
the only place it was used was make_incremental_selector.

Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
2023-10-04 18:40:05 +02:00
Raphael S. Carvalho
893ee68251 replica: Clean up storage of tablet on migration
When a tablet is migrated into a new home, we need to clean its
storage (i.e. the compaction group) in the old home.
This includes its presence in row cache, which can be shared by
multiple tablets living in the same shard.

For exception safety, the following is done first in a "prepare
phase" during cache invalidation.

1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) builds a list of all sstables, which represents all the
storage of the tablet.

Then, once the cache is invalidated successfully, we clear
the sstable sets of the group in the "execution phase",
to prevent any background op from incorrectly picking them
and also to allow for their deletion.

All the sstables of a tablet are deleted atomically, in order
to guarantee that a failure midway won't cause data resurrection
if the tablet happens to be migrated back into the old home.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-04 12:16:19 -03:00
Raphael S. Carvalho
e23f4cf8c9 main: delete dead initialization code for compaction
this is redundant code that should have been gone a long time ago.

the snippet (which lies above the code being deleted):
```
            db.invoke_on_all([] (replica::database& db) {
                db.get_tables_metadata().for_each_table([] (table_id, lw_shared_ptr<replica::table> table) {
                    replica::table& t = *table;
                    t.enable_auto_compaction();
                });
            }).get();
```
provides the same thing as this code being deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15597
2023-10-04 15:57:24 +03:00
Avi Kivity
d217c6f7c1 Merge 'tools/nodetool: implement additional commands, part 1/N' from Botond Dénes
The following new commands are implemented:
* disablebackup
* disablebinary
* disablegossip
* enablebackup
* enablebinary
* enablegossip
* gettraceprobability
* help
* settraceprobability
* statusbackup
* statusbinary
* statusgossip
* version

All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.

Refs: https://github.com/scylladb/scylladb/issues/15588

Closes scylladb/scylladb#15593

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement help operation
  tools/scylla-nodetool: implement the traceprobability commands
  tools/scylla-nodetool: implement the gossip commands
  tools/scylla-nodetool: implement the binary commands
  tools/scylla-nodetool: implement backup related commands
  tools/scylla-nodetool: implement version command
  test/nodetool: introduce utils.check_nodetool_fails_with()
  test/nodetool: return stdout of nodetool invokation
  test/nodetool/rest_api_mock.py: fix request param matching
  tools/scylla-nodetool: compact: remove --partition argument
  tools/scylla-nodetool: scylla_rest_client: add support delete method
  tools/scylla-nodetool: get rid of check_json_type()
  tools/scylla-nodetool: log more details for failed requests
  tools/scylla-*: use operation_option for positional options
  tools/utils: add support for operation aliases
2023-10-04 14:33:16 +03:00
Anna Stuchlik
eb5a9c535a doc: add the quorum requirement to procedures
This commit adds a note to the docs for
cluster management that a quorum is
required to add, remove, or replace a node,
and update the schema.
2023-10-04 13:16:21 +02:00
Anna Stuchlik
bf25b5fe76 doc: add more failure info to Troubleshooting
This commit adds new pages with reference to
Handling Node Failures to Troubleshooting.
The pages are:
- Failure to Add, Remove, or Replace a Node
  (in the Cluster section)
- Failure to Update the Schema
  (in the Data Modeling section)
2023-10-04 12:44:26 +02:00
Anna Stuchlik
8c4f9379d5 doc: move Handling Failures to Troubleshooting
This commit moves the content of the Handling
Failures section on the Raft page to the new
Handling Node Failures page in the Troubleshooting
section.

Background:
When Raft was experimental, the Handling Failures
section was only applicable to clusters
where Raft was explicitly enabled.
Now that Raft is the default, the information
about handling failures is relevant to
all users.
2023-10-04 12:23:33 +02:00
Botond Dénes
62cdc36a74 tools/scylla-nodetool: implement help operation
Nodetool considers "help" to be just another operation, so implement it
as such. The usual --help and --help <command> are also supported.
2023-10-04 05:27:09 -04:00
Botond Dénes
1efabca515 tools/scylla-nodetool: implement the traceprobability commands
gettraceprobability and settraceprobability
2023-10-04 05:27:09 -04:00
Botond Dénes
25d41f72c4 tools/scylla-nodetool: implement the gossip commands
disablegossip, enablegossip and statusgossip
2023-10-04 05:27:09 -04:00
Botond Dénes
5bc25dbebe tools/scylla-nodetool: implement the binary commands
disablebinary, enablebinary and statusbinary
2023-10-04 05:27:09 -04:00
Botond Dénes
2ac1705c90 tools/scylla-nodetool: implement backup related commands
disablebackup, enablebackup and statusbackup
2023-10-04 05:27:09 -04:00
Botond Dénes
91e62413c8 tools/scylla-nodetool: implement version command 2023-10-04 05:27:09 -04:00
Botond Dénes
5ad9b1424c test/nodetool: introduce utils.check_nodetool_fails_with()
Checking that nodetool fails with a given message turned out to be a
common pattern, so extract the logic for checking this into a method of
its own. Refactor the existing tests to use it, instead of the
hand-coded equivalent.
2023-10-04 05:27:09 -04:00
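A helper like the one introduced above can be sketched as follows (an illustrative signature, not necessarily the one in test/nodetool; assumed to run the command via subprocess):

```python
import subprocess

def check_nodetool_fails_with(cmd, expected_message):
    """Run a nodetool invocation that is expected to fail and assert
    that the expected error message appears in its output. Illustrative
    sketch of the extracted common pattern."""
    res = subprocess.run(cmd, capture_output=True, text=True)
    assert res.returncode != 0, f"{cmd} unexpectedly succeeded"
    output = res.stdout + res.stderr
    assert expected_message in output, (
        f"{expected_message!r} not found in {output!r}")
```

Tests then call the helper instead of repeating the run/returncode/message-grep boilerplate by hand.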
Botond Dénes
644d91fe95 test/nodetool: return stdout of nodetool invokation
So the test can inspect it.
2023-10-04 05:09:49 -04:00
Botond Dénes
dd62299355 test/nodetool/rest_api_mock.py: fix request param matching
Turns out expected request params were dropped on the floor, so any
expected param matched any actual params.
2023-10-04 05:09:41 -04:00
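The bug above — expected request params dropped on the floor, so any actual params matched — comes down to a comparison like this (a simplified model; the real mock lives in test/nodetool/rest_api_mock.py and its class layout may differ):

```python
class expected_request:
    def __init__(self, method, path, params=None):
        self.method = method
        self.path = path
        self.params = params or {}

    def matches(self, method, path, params):
        # before the fix, self.params was effectively ignored, so any
        # actual params matched; including them in the comparison is
        # the fix
        return (self.method, self.path, self.params) == (method, path, params)
```

Without the params comparison, a test expecting `?unit=ms` would silently accept a request sent with `?unit=s`.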
Botond Dénes
4f66e0208b tools/scylla-nodetool: compact: remove --partition argument
This argument is not recognized by the current nodetool either. It is
mentioned only in our documentation, but it should be removed from there
too.
2023-10-04 05:08:33 -04:00
Botond Dénes
2ddf28b8e5 tools/scylla-nodetool: scylla_rest_client: add support delete method 2023-10-04 05:07:03 -04:00
David Garcia
1121a4df04 docs: add groups to reference docs
fix: comment

Closes scylladb/scylladb#15592
2023-10-04 11:42:36 +03:00
Petr Gusev
9711bfde11 commitlog_replayer: refactor commitlog_replayer::impl::init
We don't need map_reduce here since get_truncated_positions returns
the same result on all shards.

We remove 'finally' semantics in this commit since it doesn't seem we
really need it. There is no code that relies on the state of this
data structure in case of exception. An exception will propagate
to scylla_main() and the program will just exit.
2023-10-03 17:11:40 +04:00
Petr Gusev
c94946d566 system_keyspace: drop redundant typedef 2023-10-03 17:11:40 +04:00
Petr Gusev
f7d2300cf9 system_keyspace: drop redundant save_truncation_record overload 2023-10-03 17:11:40 +04:00
Petr Gusev
da1e6751e9 table: rename cache_truncation_record -> set_truncation_time
This is a refactoring commit without observable
changes in behaviour.

There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
2023-10-03 17:11:35 +04:00
Botond Dénes
e0c8fee7db Merge 'doc: update the Cassandra compatibility information' from Anna Stuchlik
This PR updates the information on the ScyllaDB vs. Cassandra compatibility. It covers the information from https://github.com/scylladb/scylladb/issues/15563, but there could be more to fix.

@tzach @scylladb/scylla-maint Please review this PR and the page covering our compatibility with Cassandra and let me know if you see anything else that needs to be fixed.

I've added the updates with separate commits in case you want to backport some info (e.g. about AzureSnitch).

Fixes https://github.com/scylladb/scylladb/issues/15563

Closes scylladb/scylladb#15582

* github.com:scylladb/scylladb:
  doc: deprecate Thrift in Cassandra compatibility
  doc: remove row/key cache from Cassandra compatibility
  doc: add AzureSnitch to Cassandra compatibility
2023-10-03 13:31:27 +03:00
Botond Dénes
926da9eeb2 docs: nodetool compact: correct phrase about table arguments
The sentence says that if table args are provided, compaction will run
on all tables. This is ambiguous, so the sentence is rephrased to specify
that compaction will run only on the provided tables.

Closes scylladb/scylladb#15394
2023-10-03 10:31:03 +02:00
Pavel Emelyanov
c07905b074 main: Capture local tracing reference to stop tracing
It currently uses a global reference, but a local one has recently become available.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
65b7aa3387 tracing: Pack testing code
There's a finally-chain out there that stops tracing; now it can just
use the deferred stop call instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
4c74425780 tracing: Remove stop_tracing() wrapper
Now it's confusing, as it doesn't stop tracing, but rather shuts it down
on all shards. The only caller can be more descriptive without the
wrapper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
61381feaad tracing: Remove start_tracing() wrapper
Callers can do a one-line stop with the help of the invoke_on_all()
overload that wraps std::invoke.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
89c43f6677 tracing: Remove create_tracing() wrapper
It doesn't make callers' lives easier, but it hides the global tracing instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
ce5062eb13 tracing: Make shutdown() re-entrable
Today's shutdown() and its stop() peer are very restrictive in how
callers may use them. There is not much point in that; making shutdown()
re-entrant, as for other services, will allow relaxing the callers' code
here and in the next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:47 +03:00
Pavel Emelyanov
232de8b180 tracing: Coroutinize start/shutdown/stop
They are all simple enough to be in one patch.
Further patching is simpler in coroutinized form.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:46:46 +03:00
Pavel Emelyanov
8234235b94 tracing: Rename helper's stop() to shutdown()
Because it's called on shutdown, not on stop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-03 10:02:12 +03:00
Pavel Emelyanov
c4f1929eea s3: Abort multipart upload if finalize request fails
It may happen that wrapping up a multipart upload fails too. However,
before sending the request the driver clears the _upload_id field, thus
marking the whole process as "all is OK". So if the finalization
method fails and throws, the upload context remains on the server side
forever.

Fix this by keeping the _upload_id set, so even if finalization throws,
closing the uploader notices this and calls abort.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15521
2023-10-03 09:47:33 +03:00
Kamil Braun
b68d6ad5e9 api: storage_service: unset reload_raft_topology_state
Every endpoint needs to be unset. Oversight in
992f1327d3.

Closes scylladb/scylladb#15591
2023-10-03 09:12:12 +03:00
Botond Dénes
7dc77d03af tools/scylla-nodetool: get rid of check_json_type()
This check is redundant. Originally it was intended to work around
rapidjson using an assert by default to check that fields have the
expected type. But it turns out we already configure rapidjson to use a
plain exception in utils/rjson.hh, so check_json_type() is not needed
for graceful error handling.
2023-10-03 02:05:30 -04:00
Botond Dénes
fdecea5480 tools/scylla-nodetool: log more details for failed requests
Instead of the unhelpful "Unexpected reply status", log what the request
was and what the response status code is.
2023-10-03 02:05:30 -04:00
Botond Dénes
adb65e18a1 tools/scylla-*: use operation_option for positional options
Use operation_option to describe positional options. The structure used
before -- app_template::positional_option -- was not a good fit for
this, as it was designed to store a description that is immediately
passed to the boost::program_options subsystem and then discarded.
As such, it had a raw pointer member, which was expected to be
immediately wrapped by boost::shared_ptr<> by boost::program_options.
This produced memory leaks for tools, for options that ended up not
being used. To avoid this altogether, use operation_option, converting
to the app_template::positional_option at the last moment.
2023-10-03 02:05:30 -04:00
Botond Dénes
c252ff4f03 tools/utils: add support for operation aliases
Some operations may have additional names, beyond their "main" name. Add
support for this.
2023-10-03 02:05:30 -04:00
Botond Dénes
471e125592 Merge 'Use REST API client in object_store test' from Pavel Emelyanov
The test needs to call the flush-keyspace API endpoint and currently does so by hand, which is not very convenient.
Also, in the future there will be a need for _background_ API kicking; the currently used requests package cannot do that, while the pylib REST API can.

Closes scylladb/scylladb#15565

* github.com:scylladb/scylladb:
  test/object_store: Use REST client from pylib
  test/pylib: Add flush_keyspace() method to rest client
  test/object_store: Wrap yielded managed cluster
2023-10-03 08:50:55 +03:00
David Garcia
d543b96d18 docs: download iam csv files
docs: automate generation

docs: rm _data dir

fix: windows build

Closes scylladb/scylladb#15276
2023-10-02 12:28:56 +03:00
Botond Dénes
3e74432dbf Merge 'Sanitize storage_proxy API handlers' from Pavel Emelyanov
Registering API handlers for services needs to
- happen next to the corresponding service's start
- use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
- get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)

The storage proxy handlers don't follow any of these rules; this PR fixes them

Closes scylladb/scylladb#15584

* github.com:scylladb/scylladb:
  api: Make storage_proxy handlers use proxy argument
  api: Change some static helpers to use proxy instead of ctx
  api: Pass sharded<storage_proxy> reference to storage_proxy handlers
  api: Start (and stop) storage_proxy API earlier
  api: Remove storage_service argument from storage_proxy setup
  api: Move storage_proxy/ endpoint using storage_service
  api: Remove storage_proxy.hh from storage_service.cc
  main: Initialize API server early
2023-10-02 12:28:56 +03:00
Benny Halevy
6dc1ac768d cql-pytest/test_select_from_mutation_fragments: disable compaction on test_table
Use NullCompactionStrategy for the test_table fixture
rather than using the `no_autocompaction_context`.

Besides being simpler, regular compaction just gets in
the way of all tests that use `SELECT MUTATION_FRAGMENTS`.

The latter would be problematic when we start running cql-pytest
test cases in parallel rather than serially, since it
would inadvertently affect other test cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#15574
2023-10-02 10:28:59 +03:00
Michael Huang
1640f83fdc raft: Store snapshot update and truncate log atomically
In case the snapshot update fails, we don't truncate the commit log.

Fixes scylladb/scylladb#9603

Closes scylladb/scylladb#15540
2023-09-29 17:57:49 +02:00
Pavel Emelyanov
2603605cd5 api: Make storage_proxy handlers use proxy argument
And stop using proxy reference from http context. After a while the
proxy dependency will be removed from http context

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
fc4335387a api: Change some static helpers to use proxy instead of ctx
There are some helpers in storage_proxy.cc that get proxy reference from
passed http context argument. Next patch will stop using ctx for that
purpose, so prepare in advance by making the helpers use proxy reference
argument directly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
4910b4d5b7 api: Pass sharded<storage_proxy> reference to storage_proxy handlers
The goal is to make handlers use the proxy argument instead of keeping
the proxy as a dependency on the http context (other handlers are mostly
like that already)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
bbba691931 api: Start (and stop) storage_proxy API earlier
Handlers can be registered as soon as the service they use is started

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
7ef7b05397 api: Remove storage_service argument from storage_proxy setup
The code setting up storage_proxy/ endpoints no longer needs
storage_service and related decoration

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
0eea513663 api: Move storage_proxy/ endpoint using storage_service
The storage_proxy/get_schema_version is served by storage_service, so it
should be in storage_service.cc instead

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:08 +03:00
Pavel Emelyanov
b5eb474d95 api: Remove storage_proxy.hh from storage_service.cc
Proxy is not used in storage service handlers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:09:01 +03:00
Pavel Emelyanov
abf541cf29 main: Initialize API server early
Surprisingly, the dependency-less API server context is initialized
somewhere in the middle of main. By that time some "real" services have
already started and should have the ability to register their endpoints,
so the API context should be initialized way ahead. This patch places its
initialization next to the prometheus init.

One thing that's not nice here is that API port listening remains where
it was before the patch, so for the external ... observer API
initialization doesn't change. Likely the API should start listening for
connections early as well, but that's left for future patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:09:01 +03:00
Botond Dénes
ecceb554c3 Merge 'db/hints: Clean up hint_storage.cc' from Dawid Mędrek
This PR is the second step in refactoring the Hinted Handoff module. It cleans up the contents of the file `hint_storage.cc`. The biggest change is the transition from continuations to coroutines.

Refs #15358

Closes scylladb/scylladb#15496

* github.com:scylladb/scylladb:
  db/hints: Alias segment list in hint_storage.cc
  db/hints: Rename rebalance to rebalance_hints
  db/hints: Clean up rebalance() in hint_storage.cc
  db/hints: Coroutinize hint_storage.cc
  db/hints: Clean up remove_irrelevant_shards_directories() in hint_storage.cc
  db/hints: Clean up rebalance_segments() in hint_storage.cc
  db/hints: Clean up rebalance_segments_for() in hint_storage.cc
  db/hints: Clean up get_current_hints_segments() in hint_storage.cc
  db/hints: Rename scan_for_hints_dirs to scan_shard_hint_directories
  db/hints: Clean up scan_for_hints_dirs() in hint_storage.cc
  db/hints: Wrap hint_storage.cc in an anonymous namespace
2023-09-29 08:55:38 +03:00
Botond Dénes
5d8384eff0 Merge 'Fix test_fencing.py::test_fence_hints flakiness' from Kamil Braun
Add a REST API to reload Raft topology state without having to restart a node and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details provided in commit messages.

Refactor the test a bit while at it.

Fixes: #15285

Closes scylladb/scylladb#15523

* github.com:scylladb/scylladb:
  test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
  test: test_fencing.py: reload topology through REST API in `test_fence_hints`
  test: refactor test_fencing.py
  api: storage_service: add REST API to reload topology state
2023-09-28 16:30:23 +03:00
Benny Halevy
3709a43ccc cql-pytest.nodetool: no_autocompaction_context: support ks.tbl syntax
Allow disabling auto-compaction for given table(s)
using either the ks.table syntax or ks:table (as the
api suggests).

The first syntax would likely be more common since
the test tables we automatically create are named
as test_keyspace.test_table so we can pass that name
to `no_autocompaction_context` as is.

test_tools.system_scylla_local_sstable_prepared was
modified to disable auto-compaction only on
the `system.scylla_local` table rather than
the whole `system` keyspace, since it only relies
on this table. Plus, it helps test this change :)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#15575
2023-09-28 13:59:48 +03:00
Kamil Braun
53c01b121a test: test_fencing.py: enable hints_manager=trace logs in test_fence_hints
Enable TRACE level logging on the server that's supposed to send the
hints. Should make it easier to debug failures in the future, if any
happen again.
2023-09-28 11:59:17 +02:00
Kamil Braun
706734f76c test: test_fencing.py: reload topology through REST API in test_fence_hints
Restarting a node in order to reload topology may have side effects that
lead to test flakiness. While the node is shutting down, it gives up
leadership. Before it finishes shutting down, another node may become
Raft group 0 leader, then topology coordinator, then send a topology
command, triggering topology state reload on the shutting down node,
causing its topology version to get updated, allowing it to send a
successful hint before it shuts down and restarts. After it restarts, no
more hints will be sent, so the metrics condition we're waiting for (for
a hint to be sent) will never become true (metrics are not persisted
between restarts).

Instead of restarting, reload topology state through the new REST API.
This also makes the test a bit faster.

Fixes #15285
2023-09-28 11:59:17 +02:00
Kamil Braun
02dd297ba1 test: refactor test_fencing.py
- use `manager.get_cql()` to silence mypy (`manager.cql` is `Optional`)
- extract `metrics.lines_by_prefix('scylla_hints_manager_')` to a helper
  function
- when waiting for conditions on metrics, split the condition into
  safety and liveness part, and fail early if the safety part does not
  hold
- in `exactly_one_hint` send, don't check that `send_errors_metric` is
  `0` (it won't be after the next commit)
2023-09-28 11:59:17 +02:00
Kamil Braun
992f1327d3 api: storage_service: add REST API to reload topology state
Some tests may want to modify system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.

The API can also be called outside tests, it should not have any
observable effects unless the user modifies `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
2023-09-28 11:59:16 +02:00
Kamil Braun
060f2de14e Merge 'Cluster features on raft: new procedure for joining group 0' from Piotr Dulikowski
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check, which depends on information in group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
	- If the node is accepted, the coordinator transitions it to the `bootstrap`/`replace` state and transitions the topology to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
	- Otherwise, the node is moved to `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected and it shuts down.

The procedure is not retryable - if a node fails to do it from start to end and crashes in between, it will not be allowed to retry it with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add it again (so that a new host_id is generated).

More details about the procedure and the RPC are described in `topology-over-raft.md`.

Fixes: #15152

Closes scylladb/scylladb#15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0
2023-09-28 11:45:09 +02:00
Anna Stuchlik
f4d53978da doc: deprecate Thrift in Cassandra compatibility
This commit adds the information that Thrift is
deprecated (both in ScyllaDB and Cassandra) to
the Cassandra compatibility page.

Refs: https://github.com/scylladb/scylladb/issues/3811
2023-09-28 10:53:59 +02:00
Anna Stuchlik
d1f6832909 doc: remove row/key cache from Cassandra compatibility
This commit removes the misleading "row/key cache"
row from the Indexing and Caching table on
the Cassandra compatibility page.
2023-09-28 10:42:28 +02:00
Anna Stuchlik
9d4ad355c5 doc: add AzureSnitch to Cassandra compatibility
This commit adds AzureSnitch (together with a link
to the AzureSnitch description) to the Cassandra
compatibility page.
In addition, the Snitches table is fixed.
2023-09-28 10:37:14 +02:00
Pavel Emelyanov
0eb8d1b438 test/object_store: Use REST client from pylib
Test cases kick scylla by hand to force a keyspace flush (to have the
objects on the object store). Equip the wrapped cluster object with a
REST API class instance for convenience

The assertion for the 200 return status code is dropped; the REST client
does it behind the scenes

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-28 11:33:00 +03:00
Petr Gusev
1b2e0d0cc9 system_keyspace: get_truncated_position -> get_truncated_positions
This method can return many replay_positions, so
the plural form is more appropriate.
2023-09-28 12:25:40 +04:00
Pavel Emelyanov
4fdf12b1c7 test/pylib: Add flush_keyspace() method to rest client
Which does POST /storage_service/keyspace_flush/{ks}

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-28 11:19:04 +03:00
Pavel Emelyanov
9ce99a01d5 test/object_store: Wrap yielded managed cluster
Test cases use a temporary cluster object which is, in fact, a cql
cluster. In the future there will be a need to perform more actions on
it than just querying it with the cql client, so wrap the cluster in
an extendable object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-28 11:19:03 +03:00
Botond Dénes
08c0456b88 test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
Check that the cache reader reacts correctly to the semaphore's OOM-kill
attempt, letting the read fail instead of going berserk, trying to
reserve more and more memory until the reserve cannot be satisfied.
2023-09-28 04:12:52 -04:00
Raphael S. Carvalho
707ade21f8 replica: Add async gate to compaction_group
replica::table has the same gate for gating async operations, and
even synchronizes the stop of the table with in-flight writes that will
apply into memory.

compaction group gains the same gate, which will be used when
operations are confined to a single group. table's gate is kept
for table wide operations like query, truncate, etc.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-27 19:13:14 -03:00
Raphael S. Carvalho
57a0b46aa4 replica: Coroutinize compaction_group::stop()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-27 17:36:12 -03:00
Raphael S. Carvalho
de4db3ac19 replica: Make compaction group flush noexcept
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-27 17:36:12 -03:00
Dawid Medrek
a870eeb2ab db/hints: Alias segment list in hint_storage.cc
Naming the type should improve readability.
2023-09-27 18:49:08 +02:00
Dawid Medrek
aba85c9c98 db/hints: Rename rebalance to rebalance_hints
The new name conveys the idea clearly.
2023-09-27 18:49:08 +02:00
Dawid Medrek
64f4b825d3 db/hints: Clean up rebalance() in hint_storage.cc
This commit fixes indentation and formatting after
recent changes in the file.
2023-09-27 18:49:04 +02:00
Dawid Medrek
b662756256 db/hints: Coroutinize hint_storage.cc 2023-09-27 18:47:38 +02:00
Dawid Medrek
17e763a83a db/hints: Clean up remove_irrelevant_shards_directories() in hint_storage.cc
This commit makes the function abide by the limit of 120 characters
per line and stops unnecessarily calling c_str() on seastar::sstring.
2023-09-27 18:45:01 +02:00
Dawid Medrek
73d02cfcef db/hints: Clean up rebalance_segments() in hint_storage.cc
This commit makes the function less compact and turns overly
long lines into shorter ones to improve the readability of
the code.
2023-09-27 18:45:01 +02:00
Dawid Medrek
479f4d1ad3 db/hints: Clean up rebalance_segments_for() in hint_storage.cc
This commit makes the function less compact and abides by the limit
of 120 characters per line; that makes the code more readable.
We start using fmt::to_string instead of seastar::format("{:d}")
to convert integers to strings -- the new way is the preferred one.
The changes also name variables in a more descriptive way.
2023-09-27 18:45:01 +02:00
Dawid Medrek
a1df8dbf1c db/hints: Clean up get_current_hints_segments() in hint_storage.cc
This commit makes the function less compact and abides by the limit
of 120 characters per line. That makes the code more readable.
It also doesn't unnecessarily call c_str() on seastar::sstring.
2023-09-27 18:45:01 +02:00
Dawid Medrek
1fccd34dba db/hints: Rename scan_for_hints_dirs to scan_shard_hint_directories
The new name better conveys which directories the function should scan.
2023-09-27 18:45:01 +02:00
Dawid Medrek
8e94074b85 db/hints: Clean up scan_for_hints_dirs() in hint_storage.cc
There is no need to call c_str() on the name of the directory entry.
In fact, the used overload std::stoi() takes an std::string as its
argument. Providing seastar::sstring instead of const char* is more
efficient because we can allocate just the right amount of memory
and std::memcpy it, i.e. call std::string(const char*, std::size_t).
Using the overload std::string(const char*) would need to first
traverse the string to find the null byte.

This is a small change, all the more because paths don't tend to
be long, but it's some gain nonetheless.

The commit also inserts a few empty lines to make the code less
compact and improve readability as a result.
2023-09-27 18:45:01 +02:00
Dawid Medrek
7c68882578 db/hints: Wrap hint_storage.cc in an anonymous namespace
An anonymous namespace is a safer mechanism than the static
keyword. When adding a new piece of code, it's easy to
forget to add the static. In that case, that code
might get external linkage. However, when code is put
in an anonymous namespace (when it should not), the linker
will immediately detect it (in most cases), and
the programmer will be able to spot and fix their mistake
right away.
2023-09-27 18:41:41 +02:00
Botond Dénes
c0da6bcfb8 utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
Said method catches bad-allocs and retries the passed-in function after
raising the reserves. This does nothing to help the function succeed if
the bad alloc was thrown from the semaphore, because the kill limit was
reached. In this case the read should be left to fail and terminate.
Now that the semaphore is throwing utils::memory_limit_reached in this
case, we can distinguish this case and just re-throw the exception.
2023-09-27 10:28:00 -04:00
Botond Dénes
6829eaad39 reader_concurrency_semaphore: use utils::memory_limit_reached exception
When the kill limit is triggered.
2023-09-27 10:27:32 -04:00
Botond Dénes
721ffa319d utils: add memory_limit_reached exception
A distinct exception derived from std::bad_alloc, used in cases when
memory didn't really run out, but the process or task reached the memory
limit allotted to it. Using a distinct type for this case allows LSA
to correctly react to this case.
2023-09-27 10:26:41 -04:00
Piotr Dulikowski
2c17f81f44 tests: mark test_blocked_bootstrap as skipped
With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:

- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
  triggered which causes node 4 to fail after it joins group 0. Note
  that pre-JOIN_NODE handshake, this would only result in node 4 being
  added to group 0 config, but no modification to the group 0 state
  itself is being done - the joining node is supposed to write a request
  to join.
- Tries to replace node 1 again with node 5, which should succeed.

The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to group 0 config.
Because node 4 shuts down before advertising itself in gossip, the node
5 would get stuck.

The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.

It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
11ab7c3853 storage_service: do not check features in shadow round
The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bf5059e83c storage_service: remove raft_{bootstrap,replace}
The functionality of `raft_bootstrap` and `raft_replace` is handled by
the new handshake, so those functions can be removed.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
9a829ddf97 topology_coordinator: relax the check in enable_features
Currently, `enable_features` requires that there is no topology operation
in progress and there are no nodes waiting to be joined. Now, after the new
handshake is implemented, we can drop the second condition because nodes
in `none` state are not a part of group 0 yet.

Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating features to enable.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
3ee3699a9c raft_group0: insert replaced node info before server setup
Currently, information about replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about replaced node needs to be inserted
before the handshake happens.

This commit moves insertion of the replace node's data before the call
to `join_group0`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
41a22f6e3b storage_service: use join node rpc to join the cluster
Now, the storage_service uses new RPCs to join the cluster. A new
handshaker is implemented and passed to group0 in order to make it
happen.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
862b6e61a4 topology_coordinator: handle joining nodes
The topology coordinator is updated to perform verification of joining
nodes and to send `JOIN_NODE_RESPONSE` RPC back to the joining node.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
5ba2bfa015 topology_state_machine: add join_group0 state
Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.

This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bb40c2a8b8 storage_service: add join node RPC handlers 2023-09-27 15:53:13 +02:00
Botond Dénes
6d34f99202 Merge 'doc: replace the link to Cassandra compatibility information' from Anna Stuchlik
This PR replaces a link to a section of the ScyllaDB website with little information about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra compatibility is covered in detail.

In addition, it removes outdated or irrelevant information about versions from the Cassandra compatibility page.
Now that the documentation is versioned, we shouldn't add such information to the content.

Fixes https://github.com/scylladb/scylla-enterprise/issues/3454

Closes scylladb/scylladb#15562

* github.com:scylladb/scylladb:
  doc: remove outdated/irrelevant version info
  doc: replace the link to Cassandra compatibility
2023-09-27 16:43:28 +03:00
Anna Stuchlik
c19959e226 doc: remove outdated/irrelevant version info
This commit removes outdated or irrelevant
information about versions from the Cassandra
compatibility page.
Now that the documentation is versioned, we
shouldn't add such information to the content.
2023-09-27 14:08:07 +02:00
Kefu Chai
56f68bcf1b build: cmake: compare CMAKE_SYSTEM_PROCESSOR using STREQUAL operator
`if (.. EQUAL ..)` compares numbers, so if the LHS is not a number
the condition evaluates to false. This prevented us from setting
-march when building for aarch64 targets. And because the crc32
implementation in utils/ always uses the crypto extension
intrinsics, this also broke the build like:
```
In file included from /home/fedora/scylla/utils/gz/crc_combine.cc:40:
/home/fedora/scylla/utils/clmul.hh:60:12: error: always_inline function 'vmull_p64' requires target feature 'aes', but would be inlined into functi
on 'clmul_u32' that is compiled without support for 'aes'
    return vmull_p64(p1, p2);
           ^
```

So, in this change:

* compare the two strings using `STREQUAL`.
* document the reason why we need to set -march to the
  specified argument.
  See also http://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#g_t-march-and--mcpu-Feature-Modifiers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15553
2023-09-27 13:58:51 +02:00
Botond Dénes
508d469fef Merge 'build: extract code fragments into functions' from Kefu Chai
This series is one of the steps toward removing global statements in `configure.py`.

Not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to a CMake-based build system.

Refs #15379

Closes scylladb/scylladb#15552

* github.com:scylladb/scylladb:
  build: pass `args` explicitly
  build: remove `distro_extra_ldflags`
  build: remove `distro_extra_cflags`
  build: remove `distro_extra_cmake_args`
  build: pass variables explicitly
  build: do not mutate args.user_ldflags
  build: do not mutate args.user_ldflags
  build: use os.makedirs(exist_ok=True)
2023-09-27 13:58:51 +02:00
Anna Stuchlik
61d2730e6d doc: fix section headings that appear on page tree
Some "Additional Information" section headings
appear on the page tree in the left sidebar
because of their incorrect underline.

This commit fixes the problem by replacing title
underline with section underline.

Closes scylladb/scylladb#15550
2023-09-27 13:58:51 +02:00
Anna Stuchlik
53d5635dc3 doc: replace the link to Cassandra compatibility
This commit replaces a link to a section of
the ScyllaDB website with little information
about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra
compatibility is covered in detail.
2023-09-27 13:52:34 +02:00
Botond Dénes
2cc37eb89b Merge 'Sanitize storage_service API maintenance' from Pavel Emelyanov
Storage service API set/unset has two flaws.

First, unsetting never happens, so after the storage service is stopped its handlers become "local is not initialized" assertions and use-after-free landmines.

Second, setting up the storage service API carries gossiper and system keyspace references, thus duplicating the knowledge about storage service dependencies.

This PR fixes both by adding the storage service API unsetting and by making the handlers use _only_ storage service instance, not any externally provided references.

Closes scylladb/scylladb#15547

* github.com:scylladb/scylladb:
  main, api: Set/Unset storage_service API in proper place
  api/storage_service: Remove gossiper arg from API
  api/storage_service: Remove system keyspace arg from API
  api/storage_service: Get gossiper from storage service
  api/storage_service: Get token_metadata from storage service
2023-09-27 10:00:54 +03:00
Kefu Chai
67d0c596d3 build: pass args explicitly
Instead of relying on updating `globals()` with `args`, pass the
argument explicitly; this helps us understand the data dependencies
in this script better.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
66428220d7 build: remove distro_extra_ldflags
this variable is always empty, so drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
a9af6b71e7 build: remove distro_extra_cflags
this variable is always empty, so drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
c18e996d70 build: remove distro_extra_cmake_args
this variable is always empty, so drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
854ae62774 build: pass variables explicitly
Instead of using `globals()`, pass the used variables explicitly;
this helps us understand the data dependencies in this script better.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
e537962660 build: do not mutate args.user_ldflags
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_cflags`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
9251542761 build: do not mutate args.user_ldflags
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_ldflags`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:47 +08:00
Kefu Chai
fd9552de53 build: use os.makedirs(exist_ok=True)
Instead of checking for the existence of the directory first, use the
`exist_ok` parameter, which was introduced back in Python 3.2
and is already used elsewhere in this script.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-27 10:45:46 +08:00
Avi Kivity
301b0a989a Merge ' cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables' from Jan Ciołek
When presented with queries that use the same named bind variables twice, like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```

Scylla generated empty `partition_key_bind_indexes` (`pk_indexes`).
`pk_indexes` tell the driver which bind variables it should use to calculate the partition token for a query. Without it, the driver is unable to determine the token and it will send the query to a random node.

Scylla should generate pk_indexes which tell the driver that it can use bind variable with `bind_index = 0` to calculate the partition token for this query.

The problem was that `_target_columns` kept only a single target_column for each bind variable.
In the example above `:x` is compared with both `p` and `c`, but `_target_columns` would contain only one of them, and Scylla wasn't able to tell that this bind variable is compared with a partition key column.

To fix it, let's replace `_target_columns` with `_targets`. `_targets` keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.

A `cql-pytest` reproducer is added.

I also added some comments in `prepare_context.hh/cc` to make it easier to read.

Fixes: https://github.com/scylladb/scylladb/issues/15374

Closes scylladb/scylladb#15526

* github.com:scylladb/scylladb:
  cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
  cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
  cql3: improve readability of prepare_context
  cql-pytest: test generation of pk indexes during PREPARE
2023-09-26 19:47:04 +03:00
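A toy model of the fix (names and data layout are invented, not Scylla's actual types): keeping every (bind variable, target column) comparison lets pk_indexes be computed even when one variable targets several columns:

```python
def compute_pk_indexes(targets, partition_key):
    """targets: list of (bind_index, column) comparisons collected
    during prepare; partition_key: ordered pk column names."""
    indexes = []
    for col in partition_key:
        for bind_index, column in targets:
            if column == col:
                indexes.append(bind_index)
                break
        else:
            return []  # a pk column has no bind variable: token can't be computed
    return indexes

# SELECT p FROM table WHERE p = :x AND c = :x
# ':x' has bind_index 0 and is compared with both 'p' and 'c'.
targets = [(0, "p"), (0, "c")]
assert compute_pk_indexes(targets, ["p"]) == [0]

# If only the last comparison were kept (the old `_target_columns`
# behavior), the 'p' comparison would be lost and pk_indexes empty:
assert compute_pk_indexes([(0, "c")], ["p"]) == []
```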
Nadav Har'El
9dea20539d Merge 'Sanitize forward-service shutdown' from Pavel Emelyanov
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice; the forward service start-shutdown-stop sequence can be made "canonical" by moving the shutdown code into an abort source subscription. A similar thing was done for the view updates generator in 3b95f4f107

refs: #2737
refs: #4384

Closes scylladb/scylladb#15545

* github.com:scylladb/scylladb:
  forward_service: Remove .shutdown() method
  forward_service: Set _shutdown in abort-source subscription
  forward_service: Add abort_source to constructor
2023-09-26 18:36:52 +03:00
Kefu Chai
50c937439b reloc: strip.sh: always generate symbol list with posix format
We compare the symbol list of the stripped ELF file ($orig.stripped)
with that of the one including debugging symbols ($orig.debug) to get
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).

But we generate the symbol list of the stripped ELF file using the
sysv format, while generating the one from the unstripped file using
the posix format. The former always pads the symbol names with spaces
so that their length is at least that of the section name after we
split the fields with "|".

That's why the diff includes stuff we don't expect, and hence we
have tons of warnings like:

```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```

when using objcopy to filter the ELF file to keep only the
symbols we are interested in.

So, in this change:

* use the same format when dumping the symbols from the unstripped
  ELF file
* include the symbols in the text area -- the code -- by checking
  for "T" and "t" in the dumped symbols. This was achieved by
  matching the lines with "FUNC" before this change.
* include the symbols in the .init data section -- the global
  variables which are initialized at compile time. They can also
  be interesting when debugging an application.

Fixes #15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15514
2023-09-26 17:59:40 +03:00
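A simplified model of why the format mismatch broke the keep-list (symbol names are invented): the sysv listing pads names with trailing spaces, so membership tests against the posix listing never match:

```python
def keep_symbols(debug_syms, stripped_syms):
    # Keep only the symbols present in the debug build but
    # already stripped from the binary.
    stripped = set(stripped_syms)
    return [s for s in debug_syms if s not in stripped]

posix = ["main", "helper", "static_fn"]   # nm -f posix output
sysv = ["main      ", "helper    "]       # nm -f sysv pads names

# Mismatched formats: nothing matches, everything is "kept".
assert keep_symbols(posix, sysv) == ["main", "helper", "static_fn"]

# Same (posix) format on both sides gives the intended result.
assert keep_symbols(posix, [s.strip() for s in sysv]) == ["static_fn"]
```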
Alexander Turetskiy
024ba84637 cql3: SELECT CAST column names should match Cassandra's
When doing a SELECT CAST(b AS int), Cassandra returns a column named
cast(b as int). Currently, Scylla uses a different name -
system.castasint(b). For Cassandra compatibility, we should switch to
the same name.

fixes #14508

Closes scylladb/scylladb#14800
2023-09-26 17:26:14 +03:00
Aleksandra Martyniuk
f42be12f43 repair: release resources of shard_repair_task_impl
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.

In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.

Release shard_repair_task_impl resources immediately after shard
repair is finished.

Fixes: #15505.

Closes scylladb/scylladb#15506
2023-09-26 17:09:47 +03:00
Piotr Dulikowski
64668e325e raft: expose current_leader in raft::server
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.

If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.

If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
74b01730b4 storage_service: extract wait_for_live_nodes_timeout constant
Like in the non-raft topology path, during the new handshake, the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it will be
reused in the handshake code, to be introduced in later commits.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
4f82f9fe50 raft_group0: abstract out node joining handshake
Currently, the raft_group0 uses GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node was not a discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
c24daf7e88 storage_service: pass raft_topology_change_enabled on rpc init
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing the `raft_topology_change_enabled` to the function which
initializes the verbs (although there is _raft_topology_change_enabled
field already, it's only initialized on shard 0 later).
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
7cbe5e3af8 rpc: add new join handshake verbs
The `join_node_request` and `join_node_response` RPCs are added:

- `join_node_request` is sent from the joining node to any node in the
  cluster. It contains some initial parameters that will be verified by
  the receiving node, or the topology coordinator - notably, it contains
  a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
  joining node to tell it about the outcome of the verification.
2023-09-26 15:56:52 +02:00
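A hypothetical sketch of the request/response exchange described above (the field names and feature check are illustrative, not the actual IDL):

```python
from dataclasses import dataclass, field

@dataclass
class JoinNodeRequest:
    host_id: str
    supported_features: list = field(default_factory=list)

@dataclass
class JoinNodeResponse:
    accepted: bool
    reason: str = ""

def handle_join(request, cluster_features):
    # The coordinator rejects a node missing any cluster-enabled feature.
    missing = [f for f in cluster_features
               if f not in request.supported_features]
    if missing:
        return JoinNodeResponse(False, f"missing features: {missing}")
    return JoinNodeResponse(True)
```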
Piotr Dulikowski
dd4579637b docs: document the new join procedure 2023-09-26 15:56:52 +02:00
Piotr Dulikowski
caf1d4938e topology_state_machine: add supported_features to replica_state
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.

In order not to duplicate logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features and users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features is quite awkward now.

This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we now can
refer to node's supported features in a more natural way.

The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
51b0e4d44f storage_service: check destination host ID in raft verbs
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In that case, this triggers on_internal_error.

This commit adds protection to the existing verbs in storage_service:
they check whether the group 0 is running and whether the received
host_id matches the actual recipient's host_id.

None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
2023-09-26 15:56:51 +02:00
Piotr Dulikowski
0317705f5a group_state_machine: take reference to raft address map
It will be needed to translate host ids to addresses.
2023-09-26 15:46:25 +02:00
Piotr Dulikowski
193e8eba26 raft_group0: expose joined_group0
It will be needed in the next commit to check whether the group 0 server
has been started.
2023-09-26 15:46:25 +02:00
Piotr Jastrzebski
47917bcf22 filter: hash key once per sstable set not sstable
Before this commit the primary key was hashed for bloom filter check
for each sstable.
This commit makes the key be hashed once per sstable set and reused
for bloom filter lookups in all sstables in the set.

I tested this change using perf_simple_query with the following modifications:
1. Create more than one sstable to have an sstable set with more than one element
2. Try to prevent compactions (I wasn't 100% successful)
3. Use a key that's not present to avoid reading from disk

```
diff --git a/test/perf/perf_simple_query.cc b/test/perf/perf_simple_query.cc
index 26dbf1e99..6bd460df2 100644
--- a/test/perf/perf_simple_query.cc
+++ b/test/perf/perf_simple_query.cc
@@ -105,6 +105,8 @@ std::ostream& operator<<(std::ostream& os, const test_config& cfg) {

 static void create_partitions(cql_test_env& env, test_config& cfg) {
     std::cout << "Creating " << cfg.partitions << " partitions..." << std::endl;
+    // Create 10 sstables each with all the data
+    for (unsigned count = 0; count < 10; ++count) {
     for (unsigned sequence = 0; sequence < cfg.partitions; ++sequence) {
         if (cfg.counters) {
             execute_counter_update_for_key(env, make_key(sequence));
@@ -117,6 +119,7 @@ static void create_partitions(cql_test_env& env, test_config& cfg) {
         std::cout << "Flushing partitions..." << std::endl;
         env.db().invoke_on_all(&replica::database::flush_all_memtables).get();
     }
+    }
 }

 static int64_t make_random_seq(test_config& cfg) {
@@ -137,8 +140,18 @@ static std::vector<perf_result> test_read(cql_test_env& env, test_config& cfg) {
         query += " using timeout " + cfg.timeout;
     }
     auto id = env.prepare(query).get0();
-    return time_parallel([&env, &cfg, id] {
-            bytes key = make_random_key(cfg);
+    // Always use the same key that is not present
+    // to make sure we don't read from disk and make
+    // the benchmark CPU bounded.
+    int64_t key_value = 6;
+    bytes key(bytes::initialized_later(), 5*sizeof(key_value));
+    auto i = key.begin();
+    write<uint64_t>(i, key_value);
+    write<uint64_t>(i, key_value);
+    write<uint64_t>(i, key_value);
+    write<uint64_t>(i, key_value);
+    write<uint64_t>(i, key_value);
+    return time_parallel([&env, id, key] {
             return env.execute_prepared(id, {{cql3::raw_value::make_value(std::move(key))}}).discard_result();
         }, cfg.concurrency, cfg.duration_in_seconds, cfg.operations_per_shard, cfg.stop_on_error);
 }
@@ -423,6 +436,10 @@ static std::vector<perf_result> do_cql_test(cql_test_env& env, test_config& cfg)
                 .with_column("C2", bytes_type)
                 .with_column("C3", bytes_type)
                 .with_column("C4", bytes_type)
+		// Try to prevent compaction
+		// to keep the number of sstables high
+		.set_compaction_enabled(false)
+		.set_min_compaction_threshold(2000000000)
                 .build();
     }).get();

@@ -539,6 +556,11 @@ int scylla_simple_query_main(int argc, char** argv) {
             const auto enable_cache = app.configuration()["enable-cache"].as<bool>();
             std::cout << "enable-cache=" << enable_cache << '\n';
             db_cfg->enable_cache(enable_cache);
+	    // Try to prevent compaction
+	    // to keep the number of sstables high
+	    db_cfg->concurrent_compactors(1);
+	    db_cfg->compaction_enforce_min_threshold(true);
+	    db_cfg->compaction_throughput_mb_per_sec(1);

             cql_test_config cfg(db_cfg);
           return do_with_cql_env_thread([&app] (auto&& env) {
```

The following command showed a 2-3% improvement on my machine, but this
depends on the length of the key and the number of sstables in the set.

```
./build/release/scylla perf-simple-query --bypass-cache --flush -c 1
--random-seed=2068087418 --enable-cache false
```

Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>

Closes scylladb/scylladb#15538
2023-09-26 16:27:11 +03:00
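The idea can be sketched like this (a toy stand-in, not Scylla's actual bloom filter or hash function):

```python
import hashlib

class BloomStub:
    """Toy stand-in for an sstable's bloom filter."""
    def __init__(self, keys):
        self._digests = {hashlib.sha256(k).digest() for k in keys}

    def may_contain(self, digest):
        return digest in self._digests

def sstables_to_read(sstable_set, key):
    # Hash the partition key once per sstable set...
    digest = hashlib.sha256(key).digest()
    # ...and reuse the cached digest for every filter lookup in the set.
    return [t for t in sstable_set if t.may_contain(digest)]

tables = [BloomStub([b"a", b"b"]), BloomStub([b"b", b"c"])]
assert len(sstables_to_read(tables, b"b")) == 2
assert sstables_to_read(tables, b"missing") == []
```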
Botond Dénes
d5f095d5a4 Merge 'Make interaction of compaction strategy with sstable runs more robust and efficient' from Raphael "Raph" Carvalho
SSTable runs work hard to keep the disjointness invariant, therefore they're
expensive to build from scratch.
For every insertion, it keeps the elements sorted by their first key in
order to reject insertion of element that would introduce overlapping.

Additionally, an sstable run can grow to dozens (or hundreds) of
elements, so we can also make the interaction with compaction
strategies more efficient by not copying them when building a list of
candidates in the compaction manager, and less fragile by filtering
out any sstable runs that are not completely eligible for compaction.

Previously, ICS had to give up on using runs managed by sstable set due to
fragility of the interface (meaning runs are being built from scratch
on every call to the strategy, which is very inefficient, but that had to
be done for correctness), but now we can restore that.

Closes scylladb/scylladb#15440

* github.com:scylladb/scylladb:
  compaction: Switch to strategy_control::candidates() for regular compaction
  tests: Prepare sstable_compaction_test for change in compaction_strategy interface
  compaction: Allow strategy to retrieve candidates either as sstables or runs
  compaction: Make get_candidates() work with frozen_sstable_run too
  sstables: add sstable_run::run_identifier()
  sstables: tag sstable_run::insert() with nodiscard
  sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
  sstables: Simplify sstable_set interface to retrieve runs
2023-09-26 14:56:05 +03:00
Aleksandra Martyniuk
d799adc536 tasks: change task_manager::task::impl::is_internal()
Most of the time only the roots of a task tree should be non-internal.

Change the default implementation of is_internal and delete the
overrides consistent with it.

Closes scylladb/scylladb#15353
2023-09-26 14:49:49 +03:00
Avi Kivity
5804386ca6 Merge 'Don't mess with table directories in distributed loader' from Pavel Emelyanov
Distributed loader code still "knows" that table's datadir is a filesystem directory with some structure. For S3-backed sstables this still works because for S3 keyspaces scylla still creates and maintains empty directories in datadir. This set fixes the dist. loader assumptions about that and moves them into sstable directory's lister.

refs: #13020

Closes scylladb/scylladb#15542

* github.com:scylladb/scylladb:
  sstable_directory: Indentation fix after previous patch
  sstable_directory: Simplify filesystem prepare()
  distributed_loader: Remove get_path() method
  distributed_loader: Move directory touching to sstable_directory
  distributed_loader: Move directory existence checks to sstable_directory
  sstable_directory: Move prepare() core to lister
2023-09-26 14:48:23 +03:00
Kefu Chai
8066929960 build: cmake: use if (.. IN_LIST ..) when appropriate
for better readability, and do not create a CMAKE_BUILD_TYPE CACHE
entry if it is already set using `-D`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15543
2023-09-26 14:15:53 +03:00
Pavel Emelyanov
e022a76350 main, api: Set/Unset storage_service API in proper place
Currently the storage-service API handlers are set up in a "random"
place. It can happen earlier -- as soon as the storage service itself
is ready.

Also, although the storage service is stopped on shutdown, API
handlers continue to reference it, leading to potential
use-after-frees or "local is not initialized" assertions.

Fix both. Unsetting is pretty bulky; scylladb/seastar#1620 is to help.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:22:13 +03:00
Pavel Emelyanov
78a22c5ae3 api/storage_service: Remove gossiper arg from API
Now all handlers work purely on storage_service, and the gossiper
argument is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:46 +03:00
Pavel Emelyanov
8dc6e74138 api/storage_service: Remove system keyspace arg from API
It's not used nowadays

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:25 +03:00
Pavel Emelyanov
27eaff9d44 api/storage_service: Get gossiper from storage service
Some handlers in set_storage_service() have an implicit dependency on
the gossiper. It's not the API that should track it, but the storage
service itself, so get the gossiper from the service, not from the
external argument (it will be removed soon).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:20:27 +03:00
Pavel Emelyanov
4008ebb1b0 api/storage_service: Get token_metadata from storage service
The API handlers that live in set_storage_service() should be
self-contained and operate on the storage service only. That said,
they should get the token metadata, when needed, from the storage
service, not from somewhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:19:24 +03:00
Tomasz Grabiec
0f22e8d196 storage_service: Fixed missed notification on tablet metadata update
There can be 2 waiters now (coordinator and CDC generation publisher),
so signal() is not enough.

The change made in c416c9ff33 missed
updating this site.

Closes scylladb/scylladb#15527
2023-09-26 10:37:57 +02:00
Jan Ciolek
e5f0468761 cql/prepare_expr: fix wrong receiver in field_selection_test_assignment
When preparing a `field_selection`, we need to prepare the UDT value,
and then verify that it has this field.

`field_selection_test_assignment` prepares the UDT value using the same
receiver as the whole `field_selection`. This is wrong, this receiver
has the type of the field, and not the UDT.

It's impossible to create a receiver for the UDT. Many different UDTs
can produce an `int` value when the field `a` is selected.
Therefore the receiver should be `nullptr`.

No unit test is added, as this bug doesn't currently cause any issues.
Preparing a column value doesn't do any type checks, so nothing fails.
Still it's good to fix it, just to be correct.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes scylladb/scylladb#14788
2023-09-26 11:15:00 +03:00
Pavel Emelyanov
0e0f9a57c6 forward_service: Remove .shutdown() method
It's now empty and has no value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:39:22 +03:00
Pavel Emelyanov
a251b9893f forward_service: Set _shutdown in abort-source subscription
Currently the bit is set in the .shutdown() method, which is called
early on stop. After the patch the bit is set in the abort-source
subscription callback, which is also called early on stop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:34 +03:00
Pavel Emelyanov
b18c54f56c forward_service: Add abort_source to constructor
It will be used by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:26 +03:00
Raphael S. Carvalho
8997fe0625 compaction: Switch to strategy_control::candidates() for regular compaction
Now everything is prepared for the switch, let's do it.

Now let's wait for ICS to enjoy the set of changes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Raphael S. Carvalho
761a37022f tests: Prepare sstable_compaction_test for change in compaction_strategy interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Raphael S. Carvalho
02f1f24f27 compaction: Allow strategy to retrieve candidates either as sstables or runs
That's needed for upcoming changes that will allow ICS to efficiently
retrieve sstable runs.

Next patch will remove candidates from compaction_strategy's interface
to retrieve candidates using this one instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Raphael S. Carvalho
ff8510445d compaction: Make get_candidates() work with frozen_sstable_run too
This is done in preparation for ICS to retrieve candidates as
sstable runs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Raphael S. Carvalho
4b193c04dd sstables: add sstable_run::run_identifier()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Raphael S. Carvalho
8235889b8a sstables: tag sstable_run::insert() with nodiscard
sstable_run may reject insertion of an sstable if it would break
the disjoint invariant of the run, so it's important that the caller
is aware of it and can act on it, e.g. by generating a new run id for
the sstable so it can be inserted into another run. The tag is
important to avoid unknown problems in this area.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
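The contract can be sketched like this (a Python stand-in for the C++ interface, with keys simplified to integers):

```python
import bisect

class SSTableRun:
    """Keeps fragments sorted by first key; rejects overlapping inserts."""
    def __init__(self):
        self._frags = []  # sorted list of (first_key, last_key)

    def insert(self, first, last):
        # Returns False on overlap -- the caller must check the result,
        # mirroring the [[nodiscard]] tag on sstable_run::insert().
        i = bisect.bisect_left(self._frags, (first, last))
        if i > 0 and self._frags[i - 1][1] >= first:
            return False
        if i < len(self._frags) and self._frags[i][0] <= last:
            return False
        self._frags.insert(i, (first, last))
        return True

run = SSTableRun()
assert run.insert(0, 10)
assert run.insert(20, 30)
assert not run.insert(5, 25)   # overlaps both fragments: rejected
```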
Raphael S. Carvalho
0fe2630d70 sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
Users of all_sstable_runs() don't want to mutate the runs, but rather
work with their content. So let's avoid the copy and make the intention
explicit with the new frozen_sstable_run used as the return type
for the interface.

This will guarantee that ICS will be able to fetch uncompacting
runs efficiently.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:20 -03:00
Raphael S. Carvalho
9f6c3369d2 sstables: Simplify sstable_set interface to retrieve runs
This interface selects all runs that store at least one of the
sstables in the vector.

But that's very fragile, to the point that even ICS had to
stop using it. A better interface is to return all runs
managed by the set and allow compaction manager to do its
filtering.

We want to use it in ICS to avoid the overhead of rebuilding
sstable runs which may be expensive as sorting is performed
to guarantee the disjoint invariant.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:04:20 -03:00
Tomasz Grabiec
19ff4b730f storage_service: Avoid SIGSEGV when tablet cleanup is invoked on non-0 shard
We access group0, which is only set on shard 0.

Closes scylladb/scylladb#15469
2023-09-25 20:59:27 +03:00
Pavel Emelyanov
901bbf21e9 Merge 'build: extract code fragments into functions' from Kefu Chai
The script is more structured this way. This also allows us to quickly identify the parts which should/can be reused when migrating to a CMake-based build system.

Refs https://github.com/scylladb/scylladb/issues/15379

Closes scylladb/scylladb#15515

* github.com:scylladb/scylladb:
  build: extract get_os_ids() out
  build: extract find_ninja() out
  build: extract thrift_uses_boost_share_ptr() out
2023-09-25 20:57:59 +03:00
Botond Dénes
caeddb9c88 tools/utils: return a distinct error-code on unknown operation
Currently, the tools loosely follow this convention for error codes:
* return 1 if the error is with any of the command-line arguments
* return 2 on other errors

This patch changes the returned error-code on unknown operation/command
to 100 (instead of the previous 1). The intent is to allow any wrapper
script to determine that the tool failed because the operation is
unrecognized and not because of something else. In particular this
should enable us to write a wrapper script for scylla-nodetool, which
dispatches commands still un-implemented in scylla-nodetool, to the java
nodetool.
Note that the tool will still print an error message on an unknown
operation, so such a wrapper script would have to make sure not to let
this bleed through when it decides to forward the operation.

Closes scylladb/scylladb#15517
2023-09-25 20:56:44 +03:00
Pavel Emelyanov
99cbb6b733 sstable_directory: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 20:34:52 +03:00
Pavel Emelyanov
7ab03e33a2 sstable_directory: Simplify filesystem prepare()
When the FS lister gets prepared it

- checks if the directory exists
- creates it if it doesn't, or bails out if it's the quarantine one
- goes on and checks the directory's owner and mode

The last step is excessive if the directory didn't exist on entry and
was created.

Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 20:03:19 +03:00
Pavel Emelyanov
0232f939dc distributed_loader: Remove get_path() method
It's no longer used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 20:03:06 +03:00
Pavel Emelyanov
9c3e055d22 distributed_loader: Move directory touching to sstable_directory
This is continuation of the previous patch -- when populating a table,
creating directories should be (optionally) performed by the lister
backend, not by the generic loader.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 20:01:53 +03:00
Pavel Emelyanov
2678cc2ae8 distributed_loader: Move directory existence checks to sstable_directory
The loader code still "knows" that tables' sstables live in directories
on datadir filesystem, but that's not always so. So whether or not the
directory with sstables exists should be checked by sstable directory's
component lister, not the loader.

After this change a potentially missing quarantine directory will be
processed by the sstable directory with an empty result, but that's OK:
empty directories are already handled correctly, so it makes no
difference whether the directory lister produces no sstables because
it found no files or because it skipped scanning altogether.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 19:59:41 +03:00
Pavel Emelyanov
603f3ca042 sstable_directory: Move prepare() core to lister
Current sstable_directory::prepare() code checks the sstable directory
existence, which only makes sense for filesystem-backed sstables.
S3-backed don't (well -- won't) have any directories in datadir, so the
check should be moved into component lister.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-25 19:58:53 +03:00
Jan Ciolek
649b634c63 cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
Issue #15374 has been fixed, so these tests can be enabled.
Duplicate bind variable names are now handled correctly.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-09-25 17:19:07 +02:00
Jan Ciolek
3cff10f756 cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
When presented with queries that use the same named bind variables twice,
like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```

Scylla generated empty partition_key_bind_indexes (pk_indexes).
pk_indexes tell the driver which bind variables it should use to calculate the partition
token for a query. Without it, the driver is unable to determine the token and it will
send the query to a random node.

Scylla should generate pk_indexes which tell the driver that it can use bind variable
with bind_index = 0 to calculate the partition token for a query.

The problem was that _target_columns keeps only a single target_column for each bind variable.
In the example above :x is compared with both p and c, but _target_columns would contain
only one of them, and Scylla wasn't able to tell that this bind variable is compared with
a partition key column.

To fix it, let's replace _target_columns with _targets. _targets keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.

Fixes: #15374

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-09-25 17:18:53 +02:00
Jan Ciolek
a993ae31f8 cql3: improve readability of prepare_context
This commit adds a few comments and changes a few variable names
so that it's easier to figure out what the code does.

When I first started looking at this part of the code it wasn't
obvious what's going on - what are _specs, how are they different
from _target_columns? What happens when a variable doesn't have a name?

I hope that this change will make it easier to understand for future readers.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-09-25 17:18:53 +02:00
Anna Stuchlik
4afe2b9d9f doc: add RBNO to glossary
This commit adds Repair Based Node Operations
to the ScyllaDB glossary.

Fixes https://github.com/scylladb/scylladb/issues/11959

Closes scylladb/scylladb#15522
2023-09-25 18:16:53 +03:00
Jan Ciolek
f3ecd279f2 cql-pytest: test generation of pk indexes during PREPARE
Add some tests that check whether `pk indexes` are generated correctly.
When a driver asks to prepare a statement, Scylla's response includes
the metadata for this prepared statement.
In this metadata there's `pk indexes`, which tells the driver which
bind variable values it should use to calculate the partition token.

For a query like:
SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?

The correct pk_indexes would be [1, 0, 2], which means
"To calculate the token calculate Hash(bind_vars[1] | bind_vars[0] | bind_vars[2])".

More information is available in the specification:
1959502d8b/doc/native_protocol_v4.spec (L699-L707)

Two tests are marked as xfail because of #15374 - Scylla doesn't correctly handle using the same
named variable in multiple places. This will be fixed soon.

I couldn't find a good place for these tests, so I created a new file - test_prepare.py.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-09-25 17:12:17 +02:00
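As a rough illustration of the metadata these tests cover, here is a toy Python model of how pk_indexes could be derived. This is a hypothetical helper, not Scylla's actual implementation: for each partition-key column in key order, find a bind variable compared against it.

```python
def compute_pk_indexes(bind_targets, partition_key):
    """bind_targets[i] is the set of columns the i-th bind variable is
    compared against. Returns one bind index per partition-key column,
    in partition-key order, or [] if some PK column has no bind
    variable (the driver then can't compute the token)."""
    indexes = []
    for column in partition_key:
        for i, targets in enumerate(bind_targets):
            if column in targets:
                indexes.append(i)
                break
        else:
            return []  # a PK column is not bound: token is unknown
    return indexes
```

For `SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?` with PK `(p1, p2, p3)` this yields `[1, 0, 2]`; and for the duplicate-named-variable case `p = :x AND c = :x` from #15374, keeping *all* targets of each variable (as the fix does) yields `[0]` instead of the empty list the buggy single-target bookkeeping produced.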
Pavel Emelyanov
652153c291 Merge 'populate_keyspace: use datadir' from Benny Halevy
Currently the datadir is ignored.
Use it to construct the table's base path.

Fixes scylladb/scylladb#15418

Closes scylladb/scylladb#15480

* github.com:scylladb/scylladb:
  distributed_loader: populate_keyspace: access cf by ref
  distributed_loader: table_populator: use datadir for base_path
  distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
  distributed_loader: populate_keyspace: fixup indentation
  distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
  test: sstable_directory_test: add test_multiple_data_dirs
  table: init_storage: create upload and staging subdirs on all datadirs
2023-09-25 13:40:50 +03:00
Nadav Har'El
1a5debac5c test/cql-pytest: cleaner reproducer for spurious static row returned
Issue #10357 is about a SELECT with a filter on a regular column which
incorrectly returns a static row without regular columns set (so the
filter would not have matched). We already have four tests reproducing
this issue, but each of them is a small part of a large test translated
from Cassandra, making it hard to understand the scope of this bug.

So in this patch we add two new tests, one passing and one xfailing,
which clarify the scope of this bug. It turns out that the bug only
occurs when a partition has no clustering rows and only has a static
row. If the partition does have clustering rows - even if those don't
match the filter - the bug doesn't happen. The xfailing test is just
two statements long - a single INSERT and a single SELECT.

Refs #10357.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#15120
2023-09-25 11:01:22 +03:00
Raphael S. Carvalho
914cbc11cf reader_concurrency_semaphore: Fix stop() in face of evictable reads becoming inactive
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, and the inactive read registration path.

1) service level drop invokes stop of the reader concurrency semaphore,
which will wait for in-flight requests

2) it turns out it first stops the gate used for closing readers that
will become inactive.

3) proceeds to wait for in-flight reads by closing the reader permit gate.

4) one of the evictable reads takes the inactive read registration path,
and finds the gate for closing readers closed.

5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.

By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.

Fixes #15534.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15535
2023-09-25 08:55:50 +03:00
Nadav Har'El
be942c1bce Merge 'treewide: rename s3 credentials related variable and option names' from Kefu Chai
in this series, we rename s3 credential related variable and option names so they are more consistent with AWS's official documentation. this should help with maintainability.

Closes scylladb/scylladb#15529

* github.com:scylladb/scylladb:
  main.cc: rename aws option
  utils/s3/creds: rename aws_config member variables
2023-09-24 14:03:47 +03:00
Nadav Har'El
4e1e7568d8 Merge 'cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe' from Michał Jadwiszczak
So far generic describe (`DESC <name>`) followed Cassandra implementation and it only described keyspace/table/view/index.

This commit adds UDT/UDF/UDA to generic describe.

Fixes: #14170

Closes scylladb/scylladb#14334

* github.com:scylladb/scylladb:
  docs:cql: add information  about generic describe
  cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc
  cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
2023-09-24 13:03:04 +03:00
Kefu Chai
f3f31f0c65 main.cc: rename aws option
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/

rename them to more popular names, these names are also used by
boto's API. this should improve the readability and consistency.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-23 14:31:32 +08:00
Kefu Chai
ac3406e537 utils/s3/creds: rename aws_config member variables
- s/key/access_key_id/
- s/secret/secret_access_key/
- s/token/session_token/

so they are more aligned with the AWS document.
for instance, in
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#ConstructingTheAuthenticationHeader
AWSAccessKeyId is used in the "Authorization" header.

this would help with the readability and maintainability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-23 14:28:07 +08:00
Benny Halevy
7bd131d212 distributed_loader: populate_keyspace: access cf by ref
There is no need to hold on to the table's
shared ptr since it's held by the global table ptr
we got in the outer loop.

Simplify the code by just getting the local table reference
from `gtable`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:51:41 +03:00
Benny Halevy
a8e7981bb6 distributed_loader: table_populator: use datadir for base_path
Currently the datadir is ignored.
Use it to construct the table's base path.

Fixes scylladb/scylladb#15418

Note that scylla still doesn't work correctly
with multiple data directories due to #15510.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:51:39 +03:00
Benny Halevy
14da3e4218 distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
Currently, mark_ready_for_writes is called too early,
after the first data dir is processed, so the next
datadir hits an assert in `table::mark_ready_for_writes`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:50:53 +03:00
Benny Halevy
84510370e1 distributed_loader: populate_keyspace: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:50:52 +03:00
Benny Halevy
87d438b234 distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
It is more efficient to iterate over multiple data directories
in the inner loop rather than the outer loop.

Following patch will make use of the datadir in
table_populator.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:50:24 +03:00
Benny Halevy
2591f5f935 test: sstable_directory_test: add test_multiple_data_dirs
Add a basic regression test that starts the cql test env
with multiple data directories.

It fails without the previous patch:
table: init_storage: create upload and staging subdirs on all datadirs

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:24:54 +03:00
Benny Halevy
2937552e5b table: init_storage: create upload and staging subdirs on all datadirs
We need to have a complete directory structure for each table
and each configured datadir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:24:54 +03:00
David Garcia
762ca61ad9 docs: format db reference as list
docs: limit reference max_depth

docs: change reference description order

Closes scylladb/scylladb#15205
2023-09-22 19:25:01 +03:00
Kamil Braun
99d83808cc Merge 'test/topology_custom/test_select_from_mutation_fragments.py: use async api and clean-up' from Botond Dénes
Also, while at it, add copyright/license blurbs for tests that were missing it.

Closes scylladb/scylladb#15495

* github.com:scylladb/scylladb:
  test/topology_custom: add copyright/license blurb to tests
  test/topology_custom: test_select_from_mutation_fragments.py: use async query api
2023-09-22 10:59:48 +02:00
Botond Dénes
4acde0fb4b test/topology_custom: add test_read_repair.py 2023-09-22 02:53:15 -04:00
Botond Dénes
d007a0ec16 replica/mutation_dump: detect end-of-page in range-scans
The current read-loop fails to detect end-of-page, and if the query
result builder cuts the page, it will just proceed to the next
partition. This will result in distorted query results, as the result
builder will request for the consumption to stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
2023-09-22 02:53:15 -04:00
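The fix can be illustrated with a toy Python read-loop. This is a sketch under invented names, not the actual replica code; the result builder is modeled as a simple row budget:

```python
def read_one_page(partitions, rows_per_page):
    """Build a single page from (partition_key, rows) pairs. The buggy
    loop ignored the 'page cut' signal and scanned on into the next
    partition; the fixed loop stops as soon as the page is full."""
    page = []
    for pk, rows in partitions:
        for row in rows:
            page.append((pk, row))
            if len(page) == rows_per_page:
                return page, True  # page cut: stop, more data may remain
        # The end-of-page check above runs before we move on to the
        # next partition -- that is the behaviour the patch restores.
    return page, False
```

Without the check, the cut page would keep absorbing rows from the next partition, which is the "distorted query results" the commit message describes.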
Botond Dénes
e723fb3017 tools/scylla-sstable: write: abort parser thread if writing fails
Currently if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this will be a no-op, but on the error-path, this will
ensure that the parser thread doesn't hang.
2023-09-22 02:53:15 -04:00
Botond Dénes
70e26e5a10 test/pylib: add REST methods to get node exe and workdir paths 2023-09-22 02:53:15 -04:00
Botond Dénes
8bd5f67039 test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
To support the equivalent (roughly) of the following nodetool commands:
* nodetool refresh
* nodetool flush
* nodetool compact
2023-09-22 02:53:15 -04:00
Botond Dénes
d62a83683e service/storage_proxy: add trace points for the actual read executor type
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
2023-09-22 02:53:15 -04:00
Botond Dénes
d3aabf7896 service/storage_proxy: add trace points for read-repair
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
1bcac74976 storage_proxy: Add more trace-level logging to read-repair
Extremely helpful in debugging.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
8b7623f49e database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.

Fix by accounting the real size of accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
17c1cad4b4 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
are a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (a few ms of
variation can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-22 02:53:14 -04:00
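The core of the fix -- charging dead rows and tombstones against the page's memory budget too -- can be sketched in Python. The fragment kinds and sizes below are illustrative only, not Scylla's accounting units:

```python
def paginate(fragments, page_limit_bytes):
    """Split a stream of (kind, size) fragments into pages. Every
    fragment, live or dead, is charged to the accounter, so long runs
    of tombstones still cut pages instead of accumulating into one
    huge reconcilable_result."""
    pages, page, used = [], [], 0
    for kind, size in fragments:
        page.append(kind)
        used += size
        if used >= page_limit_bytes:  # memory accounter says: cut here
            pages.append(page)
            page, used = [], 0
    if page:
        pages.append(page)
    return pages
```

In the pre-patch behaviour only live rows counted, so a million dead rows would all land in one page; with the limit applied to every fragment the same stream is spread across many small pages, matching the 100-vs-2 page counts in the measurements above.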
Botond Dénes
91a8100b3f Merge 'Validate compaction strategy options in prepare' from Aleksandra Martyniuk
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to the Raft state machine being stopped.

Check table properties when create and alter statements are prepared.

Fixes: #14710.

Closes scylladb/scylladb#15091

* github.com:scylladb/scylladb:
  cql3: statements: delete execute override
  cql3: statements: call check_restricted_table_properties in prepare
  cql3: statements: pass data_dictionary::database to check_restricted_table_properties
2023-09-22 09:49:19 +03:00
Kefu Chai
be7363a621 build: extract get_os_ids() out
this helper is only used by pkgname(), so move it closer to its
sole caller.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-22 14:12:12 +08:00
Kefu Chai
0af50b2709 build: extract find_ninja() out
more structured this way. and the data dependency is more clear
with this change. this also allows us to quickly identify the parts
which should/can be reused when migrating to the CMake based building
system.

Refs #15379

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-22 13:23:45 +08:00
Kefu Chai
2e901bae2f build: extract thrift_uses_boost_share_ptr() out
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to CMake based
building system.

Refs #15379

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-22 13:23:45 +08:00
Michael Huang
a684e51e4d cql3: fix bad optional access when executing fromJson function
Fix fromJson(null) to return null, not an error as it did before this patch.
We use "null" as the default value when unwrapping optionals
to avoid bad optional access errors.

Fixes: scylladb#7912

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Closes scylladb/scylladb#15481
2023-09-21 20:18:49 +03:00
Avi Kivity
61440d20c3 Merge 'Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables, mark them for deletion on
failure, or unlink intermediary sstables used between reshape
rounds.

Fixes https://github.com/scylladb/scylladb/issues/14992.

Closes scylladb/scylladb#15400

* github.com:scylladb/scylladb:
  test: Verify that off-strategy can do incremental compaction
  compaction: Clear pending_replacement list when tombstone GC is disabled
  compaction: Enable incremental compaction on off-strategy
  compaction: Extend reshape type to allow for incremental compaction
  compaction: Move reshape_compaction in the source
  compaction: Enable incremental compaction only if replacer callback is engaged
2023-09-21 20:12:19 +03:00
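A toy model of the space-overhead difference, assuming disjoint inputs rewritten in order (the numbers and the one-input-at-a-time assumption are illustrative, not a claim about the actual implementation):

```python
def peak_extra_space(input_sizes, incremental):
    """Extra bytes held at the peak of a reshape. All-or-nothing keeps
    every input alive while all outputs are written (a full extra
    copy, i.e. 100% overhead); incremental compaction releases each
    exhausted input as soon as its data is sealed into an output, so
    in this toy model at most one input overlaps its output."""
    if not incremental:
        return sum(input_sizes)
    return max(input_sizes, default=0)
```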
Gleb Natapov
c94a9cf731 storage_service: raft topology: fence off write from old topology coordinator before starting a new one
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.

Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
2023-09-21 17:26:45 +02:00
Avi Kivity
1da6a939fe Merge 'Track memory usage of S3 object uploads' from Pavel Emelyanov
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5MB. When the necessary amount of bytes is accumulated, a part uploading fiber starts in the background. On flush the sink waits for all the fibers to complete and handles the failure of any.

Uploading parallelism is nowadays limited by means of the http client max-connections parameter. However, when a part uploading fiber waits for its connection it keeps the 5MB+ buffer in the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.

This PR adds a shard-wide limiter on the number of background buffers S3 clients (and their http clients) may use.

Closes scylladb/scylladb#15497

* github.com:scylladb/scylladb:
  s3::client: Track memory in client uploads
  code: Configure s3 clients' memory usage
  s3::client: Construct client with shared semaphore
  sstables::storage_manager: Introduce config
2023-09-21 18:24:42 +03:00
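The shard-wide limiter idea maps naturally onto a semaphore held for the buffer's whole lifetime. A minimal asyncio sketch follows; Scylla's actual code uses Seastar semaphores, and every name here is an assumption:

```python
import asyncio

async def upload_part(mem_sem, part, do_upload):
    """Hold memory units for the whole lifetime of the buffered part,
    so the number of parts sitting in memory waiting for a connection
    is bounded, not just the number actively uploading."""
    async with mem_sem:
        await do_upload(part)

async def upload_all(parts, do_upload, max_buffered=2):
    sem = asyncio.Semaphore(max_buffered)
    await asyncio.gather(*(upload_part(sem, p, do_upload) for p in parts))
```

The key point is that the semaphore is acquired *before* the part occupies memory waiting for an upload slot, which is what bounds the number of waiting parts.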
Botond Dénes
a0c5dee2aa utils/logalloc: introduce logalloc::bad_alloc
This new exception type inherits from std::bad_alloc and allows logalloc
code to add additional information about why the allocation failed. We
currently have 3 different throw sites for std::bad_alloc in logalloc.cc
and when investigating a coredump produced by --abort-on-lsa-bad-alloc,
it is impossible to determine, which throw-site activated last,
triggering the abort.
This patch fixes that by disambiguating the throw-sites and including it
in the error message printed, right before abort.

Refs: #15373

Closes scylladb/scylladb#15503
2023-09-21 17:43:53 +03:00
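The pattern -- a dedicated exception type that records which throw-site fired -- looks roughly like this in Python. This is only a sketch of the idea; the real change is C++, and the class and field names here are invented:

```python
class LsaBadAlloc(MemoryError):
    """Like logalloc::bad_alloc: derives from the generic allocation
    error but carries the throw-site, so the message printed right
    before an abort is unambiguous about which failure path fired."""
    def __init__(self, site, requested_bytes):
        super().__init__(
            f"LSA allocation failure at {site}: "
            f"failed to allocate {requested_bytes} bytes")
        self.site = site
        self.requested_bytes = requested_bytes
```

Because it subclasses the generic type, existing handlers keep working while post-mortem tooling gains the extra context.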
Raphael S. Carvalho
91efd878d7 test: Verify that off-strategy can do incremental compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-21 11:15:46 -03:00
Raphael S. Carvalho
9d92374b20 compaction: Clear pending_replacement list when tombstone GC is disabled
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.

Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It wasn't a problem until now,
when we want reshape to do incremental compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-21 11:15:46 -03:00
Raphael S. Carvalho
42050f13a0 compaction: Enable incremental compaction on off-strategy
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer has
to maintain new unused sstables, mark them for deletion on
failure, or unlink intermediary sstables used between reshape
rounds.

Fixes #14992.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-21 11:15:46 -03:00
Raphael S. Carvalho
db9ce9f35a compaction: Extend reshape type to allow for incremental compaction
That's done by inheriting regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-21 11:15:12 -03:00
Raphael S. Carvalho
33a0f42304 compaction: Move reshape_compaction in the source
That's in preparation to next change that will make reshape
inherit from regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-21 11:11:13 -03:00
Botond Dénes
3b95f4f107 Merge 'Sanitize view-update-generator start-stop sequence' from Pavel Emelyanov
The v.u.g. start stop is now spread over main() code heavily.

1. sharded<v.u.g.>.start() happens early enough to allow depending services register staging sstables on it
2. after the system is "more-or-less" alive the invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that while scylla is joining the cluster the generation doesn't happen
3. early on stop the v.u.g. is fully stopped

The 3rd step is pretty nasty. It may happen that v.u.g. is not stopped if scylla start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g.'s local instances, which is not symmetrical to how they start.

That said, this PR fixes the stopping sequence to happen later, i.e. -- being defer-scheduled right after sharded<v.u.g.> is started. Also it makes sure that terminating the background fiber happens as early as it does now. This is done compaction_manager-style -- the v.u.g. subscribes to the stop signal abort source and kicks the fiber to stop when it fires.

Closes scylladb/scylladb#15466

* github.com:scylladb/scylladb:
  view_update_generator: Stop for real later
  view_update_generator: Add logging to do_abort()
  view_update_generator: Move abort kicking to do_abort()
  view_update_generator: Add early abort subscription
2023-09-21 17:01:27 +03:00
Pavel Emelyanov
6e972f8505 repair: Shutdown repair on nodetool drain too
Currently repair shutdown only happens on stop, but it looks like
nodetool drain can call shutdown too, to abort no-longer-relevant repair
tasks if any. This also makes main()'s deferred shutdown/stop paths
a little cleaner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15438
2023-09-21 16:58:23 +03:00
Kefu Chai
2392b6a179 doc: start unordered list with an empty line
otherwise, sphinx would render them as a single block instead of
as an unordered list.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15504
2023-09-21 14:35:09 +03:00
Aleksandra Martyniuk
6c7eb7096e cql3: statements: delete execute override
Delete the overridden create_table_statement::execute as it only calls its
direct parent's (schema_altering_statement) execute method anyway.
2023-09-21 13:24:26 +02:00
Aleksandra Martyniuk
60fdc44bce cql3: statements: call check_restricted_table_properties in prepare
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to the Raft state machine being stopped.

Check table properties when create and alter statements are prepared.

The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
2023-09-21 13:21:51 +02:00
Aleksandra Martyniuk
ec98b182c8 cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Pass data_dictionary::database to check_restricted_table_properties
as an argument instead of query_processor, as the method will be called
from a context which does not have access to query processor.
2023-09-21 13:20:45 +02:00
Pavel Emelyanov
0ae0f75a04 view_update_generator: Stop for real later
Now the v.u.g.::stop() code waits for the generator background fiber to
stop for real. This can happen much later; all the necessary precautions
not to produce more work for the generator have been taken in do_abort().

This keeps the v.u.g. start-stop in one place, except for the call to
invoke_on_all(v.u.g.::start()) which will be handled separately.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:34:23 +03:00
Pavel Emelyanov
becd960ae8 view_update_generator: Add logging to do_abort()
Just tell the logs that the guy is aborting
refs: #10941

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:34:21 +03:00
Pavel Emelyanov
967ebacaa4 view_update_generator: Move abort kicking to do_abort()
When v.u.g. stops, it first aborts the generation background fiber by
requesting abort on the internal abort source and signalling the fiber
in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last
in main(), so this move doesn't change much -- when stop_signal fires,
it will kick the v.u.g.::do_abort() just a bit earlier, there's nothing
that would happen after it before real ::stop() is called that depends
on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:32:45 +03:00
Pavel Emelyanov
e34220ebb7 view_update_generator: Add early abort subscription
Subscribe v.u.g. to the main's stop_signal. For now a no-op callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:32:45 +03:00
Kefu Chai
0819788207 utils/s3: use structured binding when appropriate
and use `sstring::starts_with()`, for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15487
2023-09-21 13:26:49 +03:00
Kefu Chai
c364efb998 utils/s3: auth using AWS_SESSION_TOKEN
when accessing AWS resources, users can authenticate with long-term
security credentials, but they can also use temporary credentials. if the
latter are used, we have to pass a session token along with the keys.
see also https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
so, if we want to get authenticated programmatically, we need to
set the "x-amz-security-token" header,
see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#UsingTemporarySecurityCredentials

so, in this change, we

1. add another member named `token` in `s3::endpoint_config::aws_config`
   for storing "AWS_SESSION_TOKEN".
2. populate the setting from "object_storage.yaml" and
  "$AWS_SESSION_TOKEN" environment variable.
3. set "x-amz-security-token" header if
   `s3::endpoint_config::aws_config::token` is not empty.

this should allow us to test s3 client and s3 object store backend
with S3 bucket, with the temporary credentials.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15486
2023-09-21 13:26:11 +03:00
Botond Dénes
7f03ef07c8 Merge 'build: use default values of --with-* options' from Kefu Chai
in this series, we use the default values of the options specifying the paths to tools, for better readability and also to ease the migration to CMake.

Refs #15379

Closes scylladb/scylladb#15500

* github.com:scylladb/scylladb:
  build: do not check for args.ragel_exec
  build: set default value of --with-antlr3 option
2023-09-21 10:51:08 +03:00
Pavel Emelyanov
e6fe18ca55 s3: Handle piece flushing exception
When a piece is uploaded it's first flushed, then upload-copy is issued.
Both happen in the background, and if the piece flush call resolves with
an exception, the exception remains unhandled. That's OK, since the upload
finalization code checks whether some pieces didn't complete (for whatever
reason) and fails the whole upload; however, the ignored exception is
reported in the logs. Not nice.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15491
2023-09-21 10:39:04 +03:00
Botond Dénes
ac8005a102 Merge 'build: extract code fragments into functions' from Kefu Chai
more structured this way. this also allows us to quickly identify the parts which should/can be reused when migrating to a CMake-based build system.

Refs #15379

Closes scylladb/scylladb#15501

* github.com:scylladb/scylladb:
  build: extract check_for_lz4() out
  build: extract check_for_boost() out
  build: extract check_for_minimal_compiler_version() out
  build: extract write_build_file() out
2023-09-21 09:36:14 +03:00
Botond Dénes
f6575344df Merge 'Collect dangling object-store sstables' from Pavel Emelyanov
Sstables in transitional states are marked with the respective 'status' in the registry. Currently there are two such statuses -- 'creating' and 'removing' -- plus the 'sealed' status for sstables in use.

On boot the distributed loader tries to garbage collect the dangling sstables. For filesystem storage it's done with the help of temporary sstables' dirs and pending deletion logs. For s3-backed sstables, the garbage collection means fetching all non-sealed entries and removing the corresponding objects from the storage.

Test included (last patch)

fixes #13024

Closes scylladb/scylladb#15318

* github.com:scylladb/scylladb:
  test: Extend object_store test to validate GC works
  sstable_directory: Garbage collect S3 sstables on reboot
  sstable_directory: Pass storage to garbage_collect()
  sstable_directory: Create storage instance too
2023-09-21 09:15:00 +03:00
Benny Halevy
e8f720315d gossiper: run: hold background_gate when sending gossip in background
So it would be waited on in shutdown().

Although gossiper::run holds the `_callback_running` semaphore
which is acquired in `do_stop_gossiping`, the gossip messages
it initiates in the background are never waited on.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#15493
2023-09-21 08:54:35 +03:00
Kefu Chai
fe4caeb77f utils/s3/client: do not allocate rapidxml::xml_document on stack
as `rapidxml::xml_document` is quite large, let's allocate it on the
heap. otherwise GCC 13.2.1 warns us like:
```
utils/s3/client.cc: In function ‘seastar::sstring s3::parse_multipart_copy_upload_etag(seastar::sstring&)’:
utils/s3/client.cc:455:9: warning: stack usage is 66208 bytes [-Wstack-usage=]
  455 | sstring parse_multipart_copy_upload_etag(sstring& body) {
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15472
2023-09-21 08:51:08 +03:00
Kefu Chai
8802364b5b build: extract check_for_lz4() out
more structured this way. this also allows us to quickly identify
the parts which should/can be reused when migrating to a CMake-based
build system.

Refs #15379

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 11:11:00 +08:00
Kefu Chai
cb02a56421 build: extract check_for_boost() out
more structured this way. this also allows us to quickly identify
the parts which should/can be reused when migrating to a CMake-based
build system.

Refs #15379

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 11:05:31 +08:00
Kefu Chai
9996503f56 build: extract check_for_minimal_compiler_version() out
more structured this way. this also allows us to quickly identify
the parts which should/can be reused when migrating to a CMake-based
build system.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 11:05:31 +08:00
Kefu Chai
7236b81efc build: extract write_build_file() out
more structured this way. also, this will allow us to switch over
to the CMake building system.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 10:33:05 +08:00
Kefu Chai
f3d6e91287 build: do not check for args.ragel_exec
args.ragel_exec defaults to "ragel" already, so unless the user specifies
an empty ragel using `--with-ragel=""`, `args.ragel_exec` will never
evaluate to `False`, so drop this check.

Refs #15379

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 10:28:57 +08:00
Kefu Chai
4632609a1c build: set default value of --with-antlr3 option
so we don't need to check if this option is specified.
this option will also be used even after switching to CMake.

Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-21 10:28:57 +08:00
Pavel Emelyanov
fc5306c5e8 s3::client: Track memory in client uploads
When uploading an object part, the client spawns a background fiber that
keeps the data buffers alive in the http request's write_body() lambda
capture. This generates unbounded memory usage for the uploaded buffers,
which is not nice. Even though the s3 client is limited by the http
client's max-connections parallelism, waiting for an available connection
still happens with the buffers held in memory.

This patch makes the client claim the background memory from the
provided semaphore (which, in turn, sits on the shard-wide storage
manager instance). Once body writing is complete, the claimed units are
returned back to the semaphore allowing for more background writes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-20 17:50:29 +03:00
Pavel Emelyanov
182a5348d4 code: Configure s3 clients' memory usage
This sets the real limits on the memory semaphore.

- scylla sets it to 1% of total memory, 10 MB min, 100 MB max
- tests set it to 16 MB
- perf test sets it to all available memory

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-20 17:50:29 +03:00
Pavel Emelyanov
b299757884 s3::client: Construct client with shared semaphore
The semaphore will be used to cap memory consumption by the client. This
patch only makes sure a reference to a semaphore is passed as an argument
to the client's constructor, nothing more than that.

In the scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked, as this patch just pushes the needed argument all the
way around; the next patches will make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-20 17:50:07 +03:00
Pavel Emelyanov
f40b4e3e84 sstables::storage_manager: Introduce config
Just an empty config that's fed to storage_manager when constructed as a
preparation for further heavier patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-20 17:42:59 +03:00
Botond Dénes
f2df8cf484 test/topology_custom: add copyright/license blurb to tests
Most tests were missing this, fix it.
2023-09-20 10:41:31 -04:00
Botond Dénes
3e5fe6e0a6 test/topology_custom: test_select_from_mutation_fragments.py: use async query api
cql.execute_async() can now execute paged queries, use it instead of a
blocking API.
While at it, clean-up the test:
* remove unneeded wait on ring0 settle
* address flake8 concerns:
    - unused imports
    - unused variables
    - style
2023-09-20 10:41:31 -04:00
Botond Dénes
a56a4b6226 Merge 'compaction_backlog_tracker: do not allow moving registered trackers' from Benny Halevy
Currently, the moved-from object's manager pointer is moved into the
newly constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.

Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.

The move-assignment operator has a similar problem,
therefore it is deleted in this series, and the function is
renamed to `transfer_backlog` that just doesn't deal with the
moved-from registration.  This is safe since it's only used internally
by the compaction manager.

Fixes #15248

Closes scylladb/scylladb#15445

* github.com:scylladb/scylladb:
  compaction_state: store backlog_track in std::optional
  compaction_backlog_tracker: do not allow moving registered trackers
2023-09-20 16:41:10 +03:00
Kefu Chai
6fc171b9cf main: use fallback parameter when converting a YAML node
as yaml-cpp returns an invalid node when the node to be indexed
does not exist at all. but it allows us to provide a fallback value
which is returned when the node is not valid. so, let's just use
this helper for accessing a node which does not necessarily exist.

simpler this way

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15488
2023-09-20 16:01:45 +03:00
Nadav Har'El
f239849563 Merge 'doc: add a note that counters do not support TTL' from Anna Stuchlik
This PR adds the information that counters do not support data expiration with TTL, plus the link to the TTL page.

Fixes https://github.com/scylladb/scylladb/issues/15479

Closes scylladb/scylladb#15489

* github.com:scylladb/scylladb:
  doc: improve TTL limitation info on Counters page
  doc: add a note that counters do not support TTL
2023-09-20 15:49:44 +03:00
Anna Stuchlik
5073609366 doc: improve TTL limitation info on Counters page
This commit improves the information about
counters not supporting TTL on the Counters
page.
2023-09-20 14:38:35 +02:00
Anna Stuchlik
715b1a80c7 doc: add a note that counters do not support TTL
This commit adds the information that counters
do not support data expiration with TTL, plus
the link to the TTL page.

Fixes https://github.com/scylladb/scylladb/issues/15479
2023-09-20 13:28:33 +02:00
Benny Halevy
72a5ac9ce7 gossiper: get_or_create_endpoint_state: create empty endpoint_state
Currently, the endpoint address is set as the new
endpoint_state RPC_ADDRESS.  This is wrong since
it should be assigned with the `broadcast_rpc_address`
rather than `broadcast_address`.
This was introduced in b82c77ed9c

Instead just create an empty endpoint_state.
The RPC_ADDRESS (as well as HOST_ID) application states
are set later.

Fixes scylladb/scylladb#15458

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#15475
2023-09-20 13:20:44 +02:00
Avi Kivity
47a1dc8d01 Update seastar submodule
* seastar 576ee47d...bab1625c (13):
  > build: s/{dpdk_libs}/${dpdk_libs}/
  > build: build with dpdk v23.07
  > scripts: Fix escaping of regexes in addr2line
  > linux-aio: print more specific error when setup_aio fails
  > linux-aio: correct the error message raised when io_setup() fails
  > build: reenable -Warray-bound compiling option
  > build: error out if find_program() fails
  > build: enable systemtap only if it is available
  > build: check if libucontext is necessary for using ucontext functions
  > smp: reference correct variable when fetch_or()
  > build: use target_compile_definitions() for adding -D...
  > http/client: pass tls_options to tls::connect()
  > Merge 'build, process: avoid using stdout or stderr as C++ identifiers' from Kefu Chai

Frozen toolchain regenerated for new Seastar dependencies.

configure.py adjusted for new Seastar arch names.

Closes scylladb/scylladb#15476
2023-09-20 10:43:40 +02:00
Tomasz Grabiec
3d4398d1b2 Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.

Fixes: #7620
Fixes: #13957

Closes scylladb/scylladb#15331

* github.com:scylladb/scylladb:
  test: add test for group 0 schema versioning
  test/pylib: log_browsing: fix type hint
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  migration_manager: migration_request handler: assume `canonical_mutation` support
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
  schema_tables: refactor `scylla_tables(schema_features)`
  migration_manager: add `std::move` to avoid a copy
  schema_tables: remove default value for `reload` in `merge_schema`
  schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
  system_keyspace: fix outdated comment
2023-09-20 10:43:40 +02:00
Botond Dénes
45dfce6632 Merge 'compaction: change behaviour of compaction task executors' from Aleksandra Martyniuk
Compaction tasks executors serve two different purposes - as compaction
manager related entity they execute compaction operation and as task
manager related entity they track compaction status.

When one role depends on the other, as it currently is for
compaction_task_impl::done() and compaction_task_executor::compaction_done(),
requirements of both roles need to be satisfied at the same time in each
corner case. Such complexity leads to bugs.

To prevent it, compaction_task_impl::done() of executors no longer depends
on compaction_task_executor::compaction_done().

Fixes: #14912.

Closes scylladb/scylladb#15140

* github.com:scylladb/scylladb:
  compaction: warn about compaction_done()
  compaction: do not run stopped compaction
  compaction: modify lowest compaction tasks' run method
  compaction: pass do_throw_if_stopping to compaction_task_executor
2023-09-19 15:15:14 +03:00
Botond Dénes
844a0e426f Merge 'Mark counters with skip when empty' from Amnon Heiman
This series marks multiple high-cardinality counters with the skip_when_empty flag.
After this patch the following counters will not be reported if they were never used:
```
scylla_transport_cql_errors_total
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_completed_reads_local_node
scylla_transport_cql_errors_total
```
The CAS-related CQL operation counters are also marked.
Fixes #12751

Closes scylladb/scylladb#13558

* github.com:scylladb/scylladb:
  service/storage_proxy.cc: mark counters with skip_when_empty
  cql3/query_processor.cc: mark cas related metrics with skip_when_empty
  transport/server.cc: mark metric counter with skip_when_empty
2023-09-19 15:02:39 +03:00
Benny Halevy
7ca91d719c compaction_state: store backlog_track in std::optional
So that replacing it will destroy the previous tracker
and unregister it before assigning the new one and
then registering it.

This is safer than assigning it in place.

With that, the move assignment operator is no longer
used and can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-19 13:59:54 +03:00
Benny Halevy
4ad4b632b8 compaction_backlog_tracker: do not allow moving registered trackers
Currently, the moved-from object's manager pointer is moved into the
newly constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.

Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.

With that in mind, when move-assigning a compaction_backlog_tracker
the existing tracker can remain registered.

Fixes scylladb/scylladb#15248

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-19 13:24:36 +03:00
Benny Halevy
e784930dd7 storage_service: fix comment about when group0 is set
Since 8598cebb11
it is set earlier, before join_cluster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-ID: <20230919063951.1424924-1-bhalevy@scylladb.com>
2023-09-19 13:20:58 +03:00
Kefu Chai
ba002de263 build: enable more warnings
these options for disabling warnings are not necessary anymore, for
one of the following reasons:

* the code which caused the warnings was either fixed or removed
* the toolchain was updated, so the false alarms do not exist
  with the latest frozen toolchain.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15450
2023-09-19 13:02:34 +03:00
Kefu Chai
484d02da14 cql3: expr: do not use multi-line comment
do not use a multi-line comment. this silences the warning from GCC:
```
In file included from ./cql3/prepare_context.hh:19,
                 from ./cql3/statements/raw/parsed_statement.hh:14,
                 from build/debug/gen/cql3/CqlParser.hpp:62,
                 from build/debug/gen/cql3/CqlParser.cpp:44:
./cql3/expr/expression.hh:490:1: error: multi-line comment [-Werror=comment]
  490 | /// Custom formatter for an expression.  Supports multiple modes:\
      | ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15471
2023-09-19 12:00:09 +03:00
Kefu Chai
4b53a70d76 build: cmake: add tests target
this target mirrors the target named `{mode}e-test` in the
`build.ninja` build script created by `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15448
2023-09-19 11:20:02 +03:00
Kefu Chai
da7de887d6 build: cmake: bump the minimum required CMake version
because we should have a frozen toolchain built with fedora38, and f38
provides cmake v3.27.4, we can assume the availability of cmake v3.27.4
when building scylla with the toolchain.

in this change, the minimum required CMake version is changed to
3.27.

this also allows us to simplify the implementation of
`add_whole_archive()`, and remove the buggy branch for supporting
CMake < 3.24, as we should have used `${name}` in place of `auth` there.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15446
2023-09-19 10:57:57 +03:00
Botond Dénes
111cdce2e1 Merge 'db/hints: Modularize manager.hh' from Dawid Mędrek
This PR modularizes `manager.{hh, cc}` by dividing the files into separate smaller units. The changes improve overall readability of code and help reason about it. Each file has a specific purpose now.

This is the first step in refactoring the Hinted Handoff module.

Refs scylladb/scylla#15358

Closes scylladb/scylladb#15378

* github.com:scylladb/scylladb:
  db/hints: Remove unused aliases from manager.hh
  db/hints: Rename end_point_hints_manager
  db/hints: Rename sender to hint_sender
  db/hints: Move the rebalancing logic to hint_storage
  db/hints: Move the implementation of sender
  db/hints: Move the declaration of sender to hint_sender.hh
  db/hints: Move sender::replay_allowed() to the source file
  db/hints: Put end_point_hints_manager in internal namespace
  db/hints: Move the implementation of end_point_hints_manager
  db/hints: Move the declaration of end_point_hints_manager
  db/hints: Move definitions of functions using shard hint manager
  db/hints: Introduce hint_storage.hh
  db/hints: Extract the logger from manager.cc
  db/hints: Extract common types from manager.hh
2023-09-19 10:56:16 +03:00
Raphael S. Carvalho
6cc85068d7 compaction: Enable incremental compaction only if replacer callback is engaged
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-18 17:57:11 -03:00
Michael Huang
62a8a31be7 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428
2023-09-18 23:17:01 +03:00
Avi Kivity
ab6988c52f Merge "auth: do not grant permissions to creator without actually creating" from Wojciech Mitros
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:

The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.

Additionally, a test is added that reproduces both of the mentioned
cases.

CVE-2023-33972

Fixes #15467.

* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
  auth: do not grant permissions to creator without actually creating
  transport: add is_schema_change() method to result_message
2023-09-18 21:47:28 +03:00
Avi Kivity
16a72a81fc Merge 'build: cmake: add "dist-server-debuginfo" target' from Kefu Chai
this target mirrors the "dist-server-debuginfo-{mode}" target in the `build.ninja` created by `configure.py`.

Closes scylladb/scylladb#15441

* github.com:scylladb/scylladb:
  build: cmake: add "dist-server-debuginfo" target
  build: cmake: remove debian dep from relocatable pkg
2023-09-18 20:54:21 +03:00
Avi Kivity
146e49d0dd Merge 'Rewrap keyspace population loop' from Pavel Emelyanov
Populating of non-system keyspaces is now done by listing datadirs and assuming that each subdir found is a keyspace. For S3-backed keyspaces this is also true, but it's a bug (#13020). The loop needs to walk the list of known keyspaces instead, and try to find the keyspace storage later, based on the storage option.

Closes scylladb/scylladb#15436

* github.com:scylladb/scylladb:
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Generalize datadir parallelism loop
  distributed_loader: Provide keyspace ref to populate_keyspace
  distributed_loader: Walk list of keyspaces instead of directories
2023-09-18 20:51:01 +03:00
Kefu Chai
cf5400bc75 cql.g: always initialize returned values
always initialize returned values. the branches which
return these uninitialized values handle the
unmatched cases, so this change should not have any
impact on the behavior.

ANTLR3's C++ code generator does not assign any value
to the return value if it runs into a failure or
encounters an exception. for instance, the following
rule assigns the value of `isStatic` to
`isStaticColumn` only if nothing goes wrong.
```
cfisStatic returns [bool isStaticColumn]
    @init{
        bool isStatic = false;
    }
    : (K_STATIC { isStatic=true; })?
    {
        $isStaticColumn = isStatic;
    }
    ;
```

as shown in the generated C++ code:
```c++
                switch (alt118)
                {
            	case 1:
            	    // build/debug/gen/cql3/Cql.g:989:8: K_STATIC
            	    {
            	         this->matchToken(K_STATIC, &FOLLOW_K_STATIC_in_cfisStatic5870);
            	        if  (this->hasException())
            	        {
            	            goto rulecfisStaticEx;
            	        }
            	        if (this->hasFailed())
            	        {
            	            return isStaticColumn;
            	        }

            	        if ( this->get_backtracking()==0 )
            	        {
            	             isStaticColumn=isStatic;

            	        }

            	    }
            	    break;

                }
```

when `this->hasException()` or `this->hasFailed()`,
`isStaticColumn` is returned right away without being
initialized, because we don't assign any initial value
to it, nor do we customize the exception handling
for this rule.

and, the parser bails out when it smells something bad
after it tries to match the specified rule. also, the
parser is a stateful tokenizer, so its failure state is
carried by the parser itself. also, matchToken()
*could* fail when trying to find the matched token;
this is runtime behavior of the parser, which is why
the compiler cannot be certain that the error path won't
be taken.

anyway, let's always initialize the return values
explicitly instead of relying on default construction.
the return values whose type is a scoped enum are
zero-initialized, because their types don't provide an
"invalid" value.

this change should silence warnings like:

```
clang++ -MD -MT build/debug/gen/cql3/CqlParser.o -MF build/debug/gen/cql3/CqlParser.o.d -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1   -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen --std=gnu++20  -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere  -DBOOST_TEST_DYN_LINK   -DNOMINMAX -DNOMINMAX -fvisibility=hidden  -Wall -Werror -Wextra -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-c++11-narrowing -Wno-ignored-qualifiers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-implicit-int-float-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DFMT_DEPRECATED_OSTREAM -Wno-parentheses-equality -O1 -fno-sanitize-address-use-after-scope -c -o build/debug/gen/cql3/CqlParser.o build/debug/gen/cql3/CqlParser.cpp
build/debug/gen/cql3/CqlParser.cpp:26645:28: error: variable 'perm' is uninitialized when used here [-Werror,-Wuninitialized]
                    return perm;
                           ^~~~
build/debug/gen/cql3/CqlParser.cpp:26616:5: note: variable 'perm' is declared here
    auth::permission perm;
    ^
build/debug/gen/cql3/CqlParser.cpp:52577:28: error: variable 'op' is uninitialized when used here [-Werror,-Wuninitialized]
                    return op;
                           ^~
build/debug/gen/cql3/CqlParser.cpp:52518:5: note: variable 'op' is declared here
    oper_t op;
    ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15451
2023-09-18 16:45:50 +03:00
Kefu Chai
ece45c9f70 build: cmake: use find_program(.. REQUIRED) when appropriate
instead of checking the availability of a required program, let's
use the `REQUIRED` argument introduced by CMake 3.18, simpler this
way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15447
2023-09-18 16:35:46 +03:00
Kefu Chai
9de00c1c5a build: cmake: add node_ops
node_ops source files were extracted into the /node_ops directory in
d0d0ad7aa4, so let's update the build system accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15442
2023-09-18 16:27:02 +03:00
Kefu Chai
4d285590f0 utils/config_file: document config_file::value_status
add doxygen style comment to document `value_status` members.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15277
2023-09-18 16:20:06 +03:00
Benny Halevy
8a56050507 main: handle abort_requested_exception on startup
Handle abort_requested_exception exactly like
sleep_aborted, as an expected error when startup
is aborted mid-way.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#15443
2023-09-18 15:05:52 +03:00
Botond Dénes
f7557a4891 Merge 'updating presto integration page documentation' from Guy Shtub
null

Closes scylladb/scylladb#15342

* github.com:scylladb/scylladb:
  Update integration-presto.rst
  Update integration-presto.rst
  Update docs/using-scylla/integrations/integration-presto.rst
  updating presto integration page
2023-09-18 14:41:16 +03:00
Botond Dénes
edb50c27ec Merge 'Use sstable_state in sstables populator' from Pavel Emelyanov
Some time ago populating of tables from sstables was reworked to use sstable states instead of full paths (#12707). Since then a few places were left in the populator that still operate on the state-based subdirectory name. This PR collects most of those dangling ends.

refs: #13020

Closes scylladb/scylladb#15421

* github.com:scylladb/scylladb:
  distributed_loader: Print sstable state explicitly
  distributed_loader: Move check for the missing dir upper
  distributed_loader: Use state as _sstable_directories key
2023-09-18 14:38:49 +03:00
Kefu Chai
054beb6377 tests: tablets: do not compare signed integer with unsigned integer
when compiling the tests with -Wsign-compare, the compiler complains like:
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere  -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -MF test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o.d -o test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -c /home/kefu/dev/scylladb/test/boost/tablets_test.cc
/home/kefu/dev/scylladb/test/boost/tablets_test.cc:1335:53: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
            for (int log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
                                       ~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
```

in this case, it should be safe to use a signed int as the loop
variable to be compared with `tablet_count_bits`, but let's just
appease the compiler so we can enable the warning option project-wide
to prevent any potential issues caused by signed-unsigned comparison.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15449
2023-09-18 13:17:16 +02:00
Kamil Braun
bc6f7d1b20 Merge 'raft topology: add garbage collection for internal CDC generations table' from Patryk Jędrzejczak
We add garbage collection for the `CDC_GENERATIONS_V3` table to prevent
it from endlessly growing. This mechanism is especially needed because
we send the entire contents of `CDC_GENERATIONS_V3` as a part of the
group 0 snapshot.

The solution is to keep a clean-up candidate, which is one of the
already published CDC generations. The CDC generation publisher
introduced in #15281 continually uses this candidate to remove all
generations with timestamps not exceeding the candidate's and sets a new
candidate when needed.

We also add `test_cdc_generation_clearing.py` that verifies this new
mechanism.

Fixes #15323

Closes scylladb/scylladb#15413

* github.com:scylladb/scylladb:
  test: add test_cdc_generation_clearing
  raft topology: remove obsolete CDC generations
  raft topology: set CDC generation clean-up candidate
  topology_coordinator: refactor publish_oldest_cdc_generation
  system_keyspace: introduce decode_cdc_generation_id
  system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
2023-09-18 11:30:10 +02:00
Pavel Emelyanov
30959fc9b1 lsa, test: Extend memory footprint test with per-type total sizes
When the memory footprint test is over, it prints the total size taken by
row cache, memtable and sstables, as well as individual objects' sizes.
It's also nice to know the details of the row cache's individual objects.
This patch extends the printing with the total size of allocated object
types according to migrator_fn types.

Sample output:

    mutation footprint:
     - in cache:     11040928
     - in memtable:  9142424
     - in sstable:
       mc:   2160000
       md:   2160000
       me:   2160000
     - frozen:       540
     - canonical:    827
     - query result: 342

     sizeof(cache_entry) = 64
     sizeof(memtable_entry) = 64
     sizeof(bptree::node) = 288
     sizeof(bptree::data) = 72
     -- sizeof(decorated_key) = 32
     -- sizeof(mutation_partition) = 96
     -- -- sizeof(_static_row) = 8
     -- -- sizeof(_rows) = 24
     -- -- sizeof(_row_tombstones) = 40

     sizeof(rows_entry) = 144
     sizeof(evictable) = 24
     sizeof(deletable_row) = 72
     sizeof(row) = 16
     radix_tree::inner_node::node_sizes =  48 80 144 272 528 1040
     radix_tree::leaf_node::node_sizes =  120 216 416 816 3104
     sizeof(atomic_cell_or_collection) = 16
     btree::linear_node_size(1) = 24
     btree::inner_node_size = 216
     btree::leaf_node_size = 120
    LSA stats:
      N18compact_radix_tree4treeI13cell_and_hashjE9leaf_nodeE: 360
      N5bplus4dataIl15intrusive_arrayI11cache_entryEN3dht25raw_token_less_comparatorELm16ELNS_10key_searchE0ELNS_10with_debugE0EEE: 5040
      N5bplus4nodeIl15intrusive_arrayI11cache_entryEN3dht25raw_token_less_comparatorELm16ELNS_10key_searchE0ELNS_10with_debugE0EEE: 19296
      17partition_version: 952416
      N11intrusive_b4nodeI10rows_entryXadL_ZNS1_5_linkEEENS1_11tri_compareELm12ELm20ELNS_10key_searchE0ELNS_10with_debugE0EEE: 317472
      10rows_entry: 1429056
      12blob_storage: 254

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15434
2023-09-18 11:23:18 +02:00
Guy Shtub
5d833b2ee7 Update integration-presto.rst 2023-09-18 11:29:38 +03:00
Botond Dénes
bb7121a1fb Merge 'tools/scylla-nodetools: do not create unowned bpo::value ' from Kefu Chai
in other words, do not create a bpo::value without transferring it to an
option_description.

`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` from it right away,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.

before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we didn't always initialize a `bpo::option_description` from it --
we only did this on demand when the corresponding subcommand was
called.

so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked. hence LeakSanitizer warns us.

after this change, we create the option map as a static
local variable in a function, so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used, but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.

this change is quite similar to 374bed8c3d

Fixes https://github.com/scylladb/scylladb/issues/15429

Closes scylladb/scylladb#15430

* github.com:scylladb/scylladb:
  tools/scylla-nodetools: reindent
  tools/scylla-nodetools: do not create unowned bpo::value
2023-09-18 11:09:46 +03:00
Kefu Chai
a51b14d4c4 sstables/metadata_collector: drop unused functions
column_stats::update_local_deletion_time() is not used anywhere;
what is being used is
`column_stats::update_local_deletion_time_and_tombstone_histogram(time_point)`,
while `update_local_deletion_time_and_tombstone_histogram(int32_t)`
is only used internally by a single caller.

neither is `column_stats::update(const deletion_time&)` used.

so let's drop them, and merge
`update_local_deletion_time_and_tombstone_histogram(int32_t)`
into its caller.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15189
2023-09-18 10:18:56 +03:00
Botond Dénes
b97778e4b2 Merge 'create-relocatable-package.py: do not assume "build" build directory' from Kefu Chai
in this series, we do not assume the existence of the "build" build directory, and prefer using the version files located under the directory specified with the `--build-dir` option.

Refs #15241

Closes scylladb/scylladb#15402

* github.com:scylladb/scylladb:
  create-relocatable-package.py: prefer $build_dir/SCYLLA-RELEASE-FILE
  create-relocatable-package.py: create SCYLLA-RELOCATABLE-FILE with tempfile
2023-09-18 09:07:37 +03:00
Kefu Chai
a03dc92cb5 tools/scylla-nodetools: reindent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-18 13:57:37 +08:00
Kefu Chai
ed41c725f3 tools/scylla-nodetools: do not create unowned bpo::value
in other words, do not create a bpo::value without transferring it to an
option_description.

`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` from it right away,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.

before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we didn't always initialize a `bpo::option_description` from it --
we only did this on demand when the corresponding subcommand was
called.

so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked. hence LeakSanitizer warns us.

after this change, we create the option map as a static
local variable in a function, so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used, but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.

this change is quite similar to 374bed8c3d

Fixes #15429
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-18 13:57:37 +08:00
Kefu Chai
b350596656 docs: correct the code sample for checking service status
```console
$ journalctl --user start scylla-server -xe
Failed to add match 'start': Invalid argument
```

`journalctl` expects match filters as its positional arguments,
but apparently, `start` is not a filter. we could use `--unit`
to specify a unit though, like:

```console
$ journalctl --user --unit scylla-server.service -xe
```

but it would flood stdout with the logging messages printed
by scylla. this is not what a typical user expects. probably a better
user experience can be achieved using

```console
$ systemctl --user status scylla-server
```
which also prints the current status reported by the service, and
the command-line arguments. these are more informative in typical
use cases.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15390
2023-09-18 08:37:42 +03:00
Avi Kivity
67a0c865cf tools: toolchain: prepare: don't overwrite existing images
The docker/podman tooling is destructive: it will happily
overwrite images locally and on the server. If a maintainer
forgets to update tools/toolchain/image, this can result
in losing an older toolchain container image.

To prevent that, check that the image name is new.

Closes scylladb/scylladb#15397
2023-09-18 08:35:01 +03:00
Kefu Chai
a04fa0b41e conf: update commented out experimental_features
update the commented-out experimental_features to reflect the latest
experimental features:

- in 4f23eec4, "raft" was renamed to "consistent-topology-changes".
- in 2dedb5ea, "alternator-ttl" was moved out of experimental features.
- in 5b1421cc, "broadcast-tables" was added to experimental features.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15407
2023-09-18 08:31:01 +03:00
Guy Shtub
b8693636b8 Update integration-presto.rst
Removing link to forum, will be added as general footer
2023-09-18 06:50:11 +03:00
Guy Shtub
7d0691b348 Update docs/using-scylla/integrations/integration-presto.rst
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-09-18 06:46:02 +03:00
Kefu Chai
2a780553f8 build: cmake: add "dist-server-debuginfo" target
this target mirrors the "dist-server-debuginfo-{mode}" target in
the `build.ninja` created by `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-16 16:51:21 +08:00
Kefu Chai
38e697943f build: cmake: remove debian dep from relocatable pkg
`create-relocatable-package.py` does not use or include
`${CMAKE_CURRENT_BINARY_DIR}/debian`. so there is no
need to include this directory as a dependency.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-16 16:51:21 +08:00
Kamil Braun
5add0e1734 test: add test for group 0 schema versioning
Perform schema changes while mixing nodes in RECOVERY mode with nodes in
group 0 mode:
- schema changes originating from a RECOVERY node use
  digest-based schema versioning.
- schema changes originating from group 0
  nodes use persisted versions committed through group 0.

Verify that schema versions are in sync after each schema change, and
that each schema change results in a different version.

Also add a simple upgrade test, performing a schema change before we
enable Raft (which also enables the new versioning feature) in the
entire cluster, then once upgrade is finished.

One important upgrade test is missing, which we should add to dtest:
create a cluster in Raft mode but in a Scylla version that doesn't
understand GROUP0_SCHEMA_VERSIONING. Then start upgrading to a version
that has this patchset. Perform schema changes while the cluster is
mixed, both on non-upgraded and on upgraded nodes. Such a test is
especially important because we're adding a new column to the
`system.scylla_local` table (which we then redact from the schema
definition when we see that the feature is disabled).
2023-09-15 18:36:11 +02:00
Avi Kivity
4eb4ac4634 scripts: pull_gitgub_pr.sh: absolutize project reference
pull_gitgub_pr.sh adds a "Closes #xyz" tag so github can close
the pull request after next promotion. Convert it to an absolute
refefence (scylladb/scylladb#xyz) so the commit can be cherry-picked
into another repository without the reference dangling.

Closes #15424
2023-09-15 19:29:50 +03:00
Kefu Chai
1e6b2eb4c8 tools/scylla-nodetool: mark format string as constexpr
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable, but also initialized at
compile time, and can be used by constexpr expressions and functions.

this change is introduced to reduce the size of the change when moving
to compile-time format strings in the future. so far, seastar::format()
does not use compile-time format strings, but we have patches pending
review implementing this, and the author of this change has local
branches implementing the changes on the scylla side to support
compile-time format strings, which practically replaces most of the
`format()` calls with `seastar::format()`.

without this change, if we use compile-time format check, compiler fails
like:

```
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
            .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
                                           ^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
  FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
                                                                  ^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: note: in call to 'basic_format_string(description_template)'
            .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
                                           ^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:258:16: note: declared here
    const auto description_template =
               ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15432
2023-09-15 19:28:38 +03:00
Kefu Chai
6c75dc4be8 tools/scylla-nodetool: do not compare unsigned with int
change the loop variable to `int` to silence warnings like

```
/home/kefu/.local/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -I/home/kefu/dev/scylladb/build/cmake/gen -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere  -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -MF tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o.d -o tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -c /home/kefu/dev/scylladb/tools/scylla-nodetool.cc
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:215:28: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare]
    for (unsigned i = 0; i < argc; ++i) {
                         ~ ^ ~~~~
```

`i` is used as the index into a plain C-style array; it's perfectly fine
to use a signed integer as the index in this case. as per the C++ standard,

> The expression E1[E2] is identical (by definition) to *((E1)+(E2))

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15431
2023-09-15 19:28:14 +03:00
Kamil Braun
52903ef456 test/pylib: log_browsing: fix type hint 2023-09-15 17:58:54 +02:00
Kamil Braun
c2beee348a feature_service: enable GROUP0_SCHEMA_VERSIONING in Raft mode
As promised in earlier commits:
Fixes: #7620
Fixes: #13957

Also modify two test cases in `schema_change_test` which depend on
the digest calculation method in their checks. Details are explained in
the comments.
2023-09-15 17:54:36 +02:00
Pavel Emelyanov
e61f4e0abb distributed_loader: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-15 17:51:03 +03:00
Pavel Emelyanov
bb4ddbb996 distributed_loader: Generalize datadir parallelizm loop
Population of keyspaces happens first for system keyspaces, then for
non-system ones. Both methods iterate over config datadirs to populate
from all configured directories. This patch generalizes this loop into
the populate_keyspace() method.

(indentation is deliberately left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-15 17:49:53 +03:00
Pavel Emelyanov
0430ebf851 distributed_loader: Provide keyspace ref to populate_keyspace
The method in question tries to find the keyspace reference on the database
by the given keyspace name. However, one of the callers already has the
keyspace reference at hand and can just pass it. The other callers can
find the keyspace on their own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-15 17:49:03 +03:00
Pavel Emelyanov
e1262e46eb distributed_loader: Walk list of keyspaces instead of directories
When populating non-system keyspaces the dist. loader lists the
directories with keyspaces in datadirs, then tries to call
populate_keyspace() with the found name. If the keyspace in question is
not found on the database, a warning is printed and population
continues.

S3-backed keyspaces are nowadays populated with this process just
because of bug #13020 -- even such keyspaces still create empty
directories in datadirs. When the bug gets fixed, population would omit
such keyspaces. This patch prepares for this by making population walk the
known keyspaces from the database. BTW, population of system keyspaces
already works by iterating over the list of known keyspaces, not the
datadir subdirectories.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-15 17:44:46 +03:00
Kefu Chai
30ef69fcb2 docs/dev/object_store: add more samples
in the hope of lowering the bar to testing the object store.

* add a language specifier for better readability of the document,
  to highlight the config with YAML syntax
* add more specific comments on the AWS-related settings
* explain that the endpoint in the CREATE KEYSPACE statement
  should match the one defined by the YAML configuration.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15433
2023-09-15 17:35:17 +03:00
Kamil Braun
947c419421 schema_tables: don't delete version cell from scylla_tables mutations from group 0
As explained in the previous commit, we use the new
`committed_by_group0` flag attached to each row of a `scylla_tables`
mutation to decide whether the `version` cell needs to be deleted or
not.

The rest of #13957 is solved by pre-existing code -- if the `version`
column is present in the mutation, we don't calculate a hash for
`schema::version()`, but take the value from the column:

```
table_schema_version schema_mutations::digest(db::schema_features sf)
const {
    if (_scylla_tables) {
        auto rs = query::result_set(*_scylla_tables);
        if (!rs.empty()) {
            auto&& row = rs.row(0);
            auto val = row.get<utils::UUID>("version");
            if (val) {
                return table_schema_version(*val);
            }
        }
    }

    ...
```

The issue will therefore be fixed once we enable
`GROUP0_SCHEMA_VERSIONING`.
2023-09-15 14:32:52 +02:00
Kamil Braun
ce68ee0950 migration_manager: add committed_by_group0 flag to system.scylla_tables mutations
As described in #13957, when creating or altering a table in group 0
mode, we don't want each node to calculate `schema::version()`s
independently using a hash algorithm. Instead, we want to all nodes to
use a single version for that table, commited by the group 0 command.

There's even a column ready for this in `system.scylla_tables` --
`version`. This column is currently being set for system tables, but
it's not being used for user tables.

Similarly to what we did with global schema version in earlier commits,
the obvious thing to do would be to include a live cell for the `version`
column in the `system.scylla_tables` mutation when we perform the schema
change in Raft mode, and to include a tombstone when performing it
outside of Raft mode, for the RECOVERY case.

But it's not that simple because as it turns out, we're *already*
sending a `version` live cell (and also a tombstone, with timestamp
decremented by 1) in all `system.scylla_tables` mutations. But then we
delete that cell when doing schema merge (which begs the question
why were we sending it in the first place? but I digress):
```
        // We must force recalculation of schema version after the merge, since the resulting
        // schema may be a mix of the old and new schemas.
        delete_schema_version(mutation);
```
the above function removes the `version` cell from the mutation.

So we need another way of distinguishing the cases of schema change
originating from group 0 vs outside group 0 (e.g. RECOVERY).

The method I chose is to extend `system.scylla_tables` with a boolean
column, `committed_by_group0`, and extend schema mutations to set
this column.

In the next commit we'll decide whether or not the `version` cell should
be deleted based on the value of this new column.
2023-09-15 14:32:52 +02:00
Kamil Braun
59912ca3b0 schema_tables: use schema version from group 0 if present
As promised in the previous commit, if we persisted a schema version
through a group 0 command, use it after a schema merge instead of
calculating a digest.

Ref: #7620

The above issue will be fixed once we enable the
`GROUP0_SCHEMA_VERSIONING` feature.
2023-09-15 14:32:52 +02:00
Kamil Braun
7ab7588d59 migration_manager: store group0_schema_version in scylla_local during schema changes
We extend schema mutations with an additional mutation to the
`system.scylla_local` table which:
- in Raft mode, stores a UUID under the `group0_schema_version` key.
- outside Raft mode, stores a tombstone under that key.

As we will see in later commits, nodes will use this after applying
schema mutations. If the key is absent or has a tombstone, they'll
calculate the global schema digest on their own -- using the old way. If
the key is present, they'll take the schema version from there.

The Raft-mode schema version is equal to the group 0 state ID of this
schema command.

The tombstone is necessary for the case of performing a schema change in
RECOVERY mode. It will force a revert to the old digest-based way.

Note that extending schema mutations with a `system.scylla_local`
mutation is possible thanks to earlier commits which moved
`system.scylla_local` to schema commitlog, so all mutations in the
schema mutations vector still go to the same commitlog domain.
2023-09-15 14:32:45 +02:00
Kamil Braun
06c141f585 migration_manager: migration_request handler: assume canonical_mutation support
Support for `canonical_mutation`s was added way back in Scylla 3.2. The
migration request handler was checking whether the remote supports
`canonical_mutation`s to handle rolling upgrades, and if not, it would
use `frozen_mutation`s instead.

We no longer need that second branch, since we don't support skipping
versions during upgrades (certainly everything would burn if we tried a
3.2->5.4 upgrade).

Leave a sanity check but otherwise delete the other branch.
2023-09-15 14:29:45 +02:00
Pavel Emelyanov
cce2752b64 Merge 'node_ops: move node_ops related classes to node_ops/' from Aleksandra Martyniuk
Move node_ops related classes to node_ops/ so that they
are consistently grouped and can be accessed from
many modules.

Closes #15351

* github.com:scylladb/scylladb:
  node_ops: extract classes related to node operations
  node_ops: repair: move node_ops_id to node_ops directory
2023-09-15 15:12:00 +03:00
Kamil Braun
3ab244e6d9 system_keyspace: make get/set_scylla_local_param public
We'll use it outside `system_keyspace` code in later commit.
2023-09-15 13:04:04 +02:00
Kamil Braun
72cd457d53 feature_service: add GROUP0_SCHEMA_VERSIONING feature
This feature, when enabled, will modify how schema versions
are calculated and stored.

- In group 0 mode, schema versions are persisted by the group 0 command
  that performs the schema change, then reused by each node instead of
  being calculated as a digest (hash) by each node independently.
- In RECOVERY mode or before Raft upgrade procedure finishes, when we
  perform a schema change, we revert to the old digest-based way, taking
  into account the possibility of having performed group0-mode schema
  changes (that used persistent versions). As we will see in future
  commits, this will be done by storing additional flags and tombstones
  in system tables.

By "schema versions" we mean both the UUIDs returned from
`schema::version()` and the "global" schema version (the one we gossip
as `application_state::SCHEMA`).

For now, in this commit, the feature is always disabled. Once all
necessary code is setup in following commits, we will enable it together
with Raft.
2023-09-15 13:04:04 +02:00
Kamil Braun
dc4e20d835 schema_tables: refactor scylla_tables(schema_features)
The `scylla_tables` function gives a different schema definition
for the `system_schema.scylla_tables` table, depending on whether
certain schema features are enabled or not.

The way it was implemented, we had to write `θ(2^n)` amount
of code and comments to handle `n` features.

Refactor it so that the amount of code we have to write to handle `n`
features is `θ(n)`.
2023-09-15 13:04:04 +02:00
Kamil Braun
2d561eecbc migration_manager: add std::move to avoid a copy 2023-09-15 13:04:04 +02:00
Kamil Braun
4376854473 schema_tables: remove default value for reload in merge_schema
To avoid bugs like the one fixed in the previous commit.
2023-09-15 13:04:04 +02:00
Kamil Braun
48164e1d09 schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on a shard other than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.
2023-09-15 13:04:04 +02:00
Kamil Braun
9017b998ca system_keyspace: fix outdated comment 2023-09-15 13:04:04 +02:00
Anna Stuchlik
fb635dccaa doc: add info - support for FIPS-compliant systems
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.

This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.

Closes #15415
2023-09-15 11:08:34 +02:00
Patryk Jędrzejczak
840e1c5185 test: add test_cdc_generation_clearing
We add a test for the new CDC generation garbage collection
mechanism.
2023-09-15 09:28:32 +02:00
Patryk Jędrzejczak
0cc54e0da7 raft topology: remove obsolete CDC generations
We make the CDC generation publisher continually remove the
obsolete CDC generation data to prevent CDC_GENERATIONS_V3 from
endlessly growing. To achieve this, we use the clean-up candidate.
If it exists and can be safely removed, we remove it together with
all older CDC generations. We also mark the lack of a new
candidate. The next published CDC generation will become one.

Note this solution does not have any guarantee about "when"
it removes obsolete generations. Formally, it guarantees that
if there is a candidate that can be removed and the CDC generation
publisher attempts to remove it, all generations up to the
candidate are removed. In practice, when a new generation appears,
the publisher makes a new candidate or tries to remove an old
candidate, so obsolete generations can stay for a long time only
if no generation appears for a long time. But it is fine because
we only want to prevent CDC_GENERATIONS_V3 from growing too much.
Moreover, providing any guarantees would require a new wake-up
mechanism for the publisher, which would be hard to implement.
2023-09-15 09:26:58 +02:00
Patryk Jędrzejczak
e375e769b9 raft topology: set CDC generation clean-up candidate
We want to use the clean-up candidates to remove the obsolete CDC
generation data, but first, we need to set a suitable generation as
the candidate when there is none. Since CDC generations must
be published before we remove them, a generation that is being
published is a good candidate.
2023-09-15 09:23:59 +02:00
Patryk Jędrzejczak
b84e097c28 topology_coordinator: refactor publish_oldest_cdc_generation
In the following commits, we add a new task for the CDC generation
publisher -- clearing obsolete CDC generation data. This task
can be done together with the publishing under one group 0 guard.
We refactor publish_oldest_cdc_generation to make it possible.
Now, this function is more like a command builder. It takes guard
by const reference and updates the vector of mutations and the
reason string. The CDC generation publisher uses them directly to
update the topology at the end after finishing building the
command. This logic will be more visible after adding the clearing
task.
2023-09-15 09:04:23 +02:00
Dawid Medrek
fbbb9f879a db/hints: Remove unused aliases from manager.hh 2023-09-15 04:17:08 +02:00
Dawid Medrek
d46437a87b db/hints: Rename end_point_hints_manager
This commit renames `end_point_hints_manager` to `hint_endpoint_manager`
to be consistent with other names used in the module (they all start
with `hint_`).
2023-09-15 03:46:15 +02:00
Dawid Medrek
6d1eee448b db/hints: Rename sender to hint_sender
We rename the structure to highlight exactly what its purpose is.
2023-09-15 03:46:15 +02:00
Dawid Medrek
4ad0f8907c db/hints: Move the rebalancing logic to hint_storage
This commit continues modularizing manager.hh.
2023-09-15 03:46:15 +02:00
Dawid Medrek
999484466d db/hints: Move the implementation of sender
This commit continues modularizing manager.hh.
After moving the declaration of sender to a dedicated
header file, these changes move its implementation to
a separate source file.
2023-09-15 03:46:15 +02:00
Dawid Medrek
17aabf6b9a db/hints: Move the declaration of sender to hint_sender.hh
This commit is yet another step in modularizing manager.hh.
We move the declaration of sender to a dedicated file.
Its implementation will follow in a future commit.
2023-09-15 03:46:15 +02:00
Dawid Medrek
1a7262ed6e db/hints: Move sender::replay_allowed() to the source file
The premise of these changes is the fact that we cannot have
a cycle of #includes.

Because the declaration of `sender` is going to be moved to
a separate header file in a future commit, and because that
header file is going to be included in the file where
`end_point_hints_manager` is declared, we will need to rely
on `end_point_hints_manager` being an incomplete type there.

A consequence of that is that we cannot access any of
`end_point_hints_manager`'s methods.

This commit prepares the ground for it by moving
the definition of the function to the source file where
`end_point_hints_manager` will be a complete type.
2023-09-15 03:46:15 +02:00
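The constraint described here is the standard forward-declaration pattern; a single-file sketch with heavily simplified stand-in classes (the real db::hints types are far richer):

```cpp
#include <string>

// --- "hint_sender.hh": only a forward declaration is visible here, so
// no member of hint_endpoint_manager can be accessed in this header.
class hint_endpoint_manager;  // incomplete type

class sender {
public:
    explicit sender(hint_endpoint_manager& mgr) : _mgr(mgr) {}
    // Declared here, defined only where the manager is a complete type.
    bool replay_allowed() const;
private:
    hint_endpoint_manager& _mgr;  // references to incomplete types are fine
};

// --- "manager.hh": the full definition of the manager.
class hint_endpoint_manager {
public:
    bool replay_allowed() const { return true; }
};

// --- "hint_sender.cc": the type is complete here, so the previously
// inline method can finally call into the manager.
bool sender::replay_allowed() const {
    return _mgr.replay_allowed();
}
```
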
Dawid Medrek
ad2a36bd45 db/hints: Put end_point_hints_manager in internal namespace 2023-09-15 03:46:15 +02:00
Dawid Medrek
507054012d db/hints: Move the implementation of end_point_hints_manager
This commit continues moving end_point_hints_manager to its
dedicated files. After moving the declaration of the class,
these changes move the implementation.
2023-09-15 03:46:15 +02:00
Dawid Medrek
f72c423984 db/hints: Move the declaration of end_point_hints_manager
This commit is yet another step in modularizing manager.hh.
We move the declaration of the class to a dedicated header file.
The implementation will follow in a future commit.
2023-09-15 03:46:15 +02:00
Dawid Medrek
854cc0c939 db/hints: Move definitions of functions using shard hint manager
We move definitions of inline methods of end_point_hints_manager
and sender accessing shard hint manager to the source file,
effectively un-inlining them. We need to do that to prepare for
moving said structures out of manager.hh. This commit is yet
another step in modularizing manager.hh.
2023-09-15 03:45:57 +02:00
Dawid Medrek
db08a85f5d db/hints: Introduce hint_storage.hh
This commit moves types used by shard hint manager
and related to storing hints on disk to another file.
It is yet another step in modularizing manager.hh.
2023-09-15 02:28:10 +02:00
Dawid Medrek
4814b3b19a db/hints: Extract the logger from manager.cc
This commit extracts the logger used in manager.cc
to prepare the ground for modularization of manager.hh
into separate smaller files. We want to preserve
the logging behavior (at least for the time being),
which means new files should use the same logger.
These changes serve that purpose.
2023-09-15 02:24:20 +02:00
Dawid Medrek
efd6d1f57a db/hints: Extract common types from manager.hh
Currently, data structures used in manager.hh
use their own aliases for gms::inet_address.
It is clear they all should use the same type
and having different names for it only reduces
readability of the code. This commit introduces
a common alias -- endpoint_id -- and gets rid
of the other ones.

This commit is also the first step in modularizing
manager.hh by extracting common types to another
file.
2023-09-15 02:23:30 +02:00
Botond Dénes
b87660f90c tools/scylla-sstable: log where schema was obtained from
Currently, we only log anything about what was tried w.r.t. obtaining
the schema if it failed. Add a log message to the success path too, so
in case the wrong schema was successfully loaded, the user can find the
problem.
The log message is printed at debug level, so it doesn't disturb
output by default.

Fixes: #15384

Closes #15417
2023-09-14 23:09:30 +03:00
Botond Dénes
0f8b297d07 Merge 'build: cmake: add targets for building deb and rpm packages' from Kefu Chai
in this series,

- the build of the unstripped package is fixed, and
- the targets for building deb and rpm packages are added. these targets build deb and rpm packages from the unstripped package.

Closes #15403

* github.com:scylladb/scylladb:
  build: cmake: add targets for building deb and rpm packages
  build: cmake: correct the paths used when building unstripped pkg
2023-09-14 18:22:30 +03:00
Kefu Chai
60db7f8ae3 doc: do not suggest "-node xxx" when running c-s
cassandra-stress connects to "localhost" by default. that's exactly the
use case when we install scylla using the unified installer. so do not
suggest the "-node xxx" option. the "xxx" part is just confusing.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15411
2023-09-14 18:21:46 +03:00
Petr Gusev
6c3cc7d6e0 test_fence_hints: increase timeouts
We saw failures on CI in debug mode; probably the machine
running the test is shared, and we were starved of some resources.

Fix #15285

Closes #15388
2023-09-14 16:22:50 +02:00
Avi Kivity
d9a453e72e Merge 'Introduce a scylla-native nodetool' from Botond Dénes
This series introduces a scylla-native nodetool.  It is invokable via the main scylla executable like the other native tools we have. It uses Seastar's new `http::client` to connect to the specified node and execute the desired commands.
For now a single command is implemented: `nodetool compact`, invokable as `scylla nodetool compact`. Once all the boilerplate is added to create a new tool, implementing a single command is not too bad in terms of code bloat. Certainly not as clean as a python implementation would be, but good enough. The advantage of a C++ implementation is that all of us in the core team know C++ and that it is shipped right as part of the scylla executable.

Closes #14841

* github.com:scylladb/scylladb:
  test: add nodetool tests
  test.py: add ToolTestSuite and ToolTest
  tools/scylla-nodetool: implement compact operation
  tools/scylla-nodetool: implement basic scylla_rest_api_client
  tools: introduce scylla-nodetool
  utils: export dns_connection_factory from s3/client.cc to http.hh
  utils/s3/client: pass logger to dns_connection_factory in constructor
  tools/utils: tool_app_template::run_async(): also detect --help* as --help
2023-09-14 17:20:40 +03:00
Avi Kivity
a3d73bfba7 Merge 'Add support for decommission with tablets' from Tomasz Grabiec
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.

Topology changes now have an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If the load balancer is unable to find new tablet replicas, because the RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be aborted. There is no infrastructure for
aborts yet, so this is not implemented.

Closes #15197

* github.com:scylladb/scylladb:
  tablets, raft topology: Add support for decommission with tablets
  tablet_allocator: Compute load sketch lazily
  tablet_allocator: Set node id correctly
  tablet_allocator: Make migration_plan a class
  tablets: Implement cleanup step
  storage_service, tablets: Prevent stale RPCs from running beyond their stage
  locator: Introduce tablet_metadata_guard
  locator, replica: Add a way to wait for table's effective_replication_map change
  storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
  raft topology: Add break in the final case clause
  raft topology: Fix SIGSEGV when trace-level logging is enabled
  raft topology: Set node state in topology
  raft topology: Always set host id in topology
2023-09-14 17:16:23 +03:00
Kamil Braun
0564d000c6 Merge 'Validate compaction strategy options' from Aleksandra Martyniuk
When a column family's schema is changed, a new compaction
strategy type may be applied.

To make sure that it behaves as expected, the compaction
strategy needs to contain only the allowed options and values.
Methods that throw an exception on invalid options are added.

Fixes: #2336.

Closes #13956

* github.com:scylladb/scylladb:
  test: add test for compaction strategy validation
  compaction: unify exception messages
  compaction: cql3: validate options in check_restricted_table_properties
  compaction: validate options used in different compaction strategies
  compaction: validate common compaction strategy options
  compaction: split compaction_strategy_impl constructor
  compaction: validate size_tiered_compaction_strategy specific options
  compaction: validate time_window_compaction_strategy specific options
  compaction: add method to validate min and max threshold
  compaction: split size_tiered_compaction_strategy_options constructor
  compaction: make compaction strategy keys static constexpr
  compaction: use helpers in validate_* functions
  compaction: split time_window_compaction_strategy_options constructor
  compaction: add validate method to compaction_strategy_options
  time_window_compaction_strategy_options: make copy and move-able
  size_tiered_compaction_strategy_options: make copy and move-able
2023-09-14 16:11:52 +02:00
Pavel Emelyanov
4370e6c8d0 distributed_loader: Print sstable state explicitly
When populating from a particular directory, the populator code converts
the state to a subdir name, then prints the path. The conversion is pretty
much artificial; it's better to provide a printer for the state and print the
state explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-14 16:41:26 +03:00
Pavel Emelyanov
b19e6a68f8 distributed_loader: Move check for the missing dir upper
The quarantine directory can be missing from the datadir and that's OK. In
order to check that and skip population, the populator code uses two-step
logic -- first it checks if the directory exists and either puts the
sstable_directory object into the map or not. Later it checks the map and
decides whether to throw or not if the directory is missing.

Let's keep both check and throw in one place for brevity.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-14 16:39:56 +03:00
Pavel Emelyanov
74eef029e2 distributed_loader: Use state as _sstable_directories key
The populator maintains a map of path -> sstable_directory pairs, one for
each subdirectory for every sstable state. The "path" is in fact not
used by the logic as it's just a subdirectory name for the state and the
rest of the core operates on state. So it's good to make the map of
directories also be indexed by the state.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-14 16:38:03 +03:00
Benny Halevy
a5a22fe5b7 tools/scylla-sstable: load_sstables: handle load errors
Currently, exceptions thrown from `sst->load` are unhandled,
resulting in, e.g.:
```
ERROR 2023-09-12 08:02:58,124 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (SSTable /home/bhalevy/.dtest/dtest-dxg4xdxg/test/node1/data/ks/cf-a3009f20512911ee8000d81cd2da3fd7/me-3g9b_0e0x_39vtt1y2rcqrffz55j-big-Data.db uses org.apache.cassandra.dht.Murmur3Partitioner partitioner which is different than com.scylladb.dht.CDCPartitioner partitioner used by the database)
```

Log the errors and exit the tool with non-zero status
in this case.

Fixes #15359

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15376
2023-09-14 14:27:38 +03:00
Tomasz Grabiec
551cc0233d tablets, raft topology: Add support for decommission with tablets
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.

Topology changes now have an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.

If the load balancer is unable to find new tablet replicas, because the RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be paused so that admin becomes aware of the problem and
issues an abort on the topology change. There is no infrastructure for
aborts yet, so this is not implemented.
2023-09-14 13:05:49 +02:00
Tomasz Grabiec
8565af4dd3 tablet_allocator: Compute load sketch lazily
This allows any node to act as a target later.
2023-09-14 13:04:49 +02:00
Tomasz Grabiec
1c595ab7f4 tablet_allocator: Set node id correctly
It was unset and unused.
2023-09-14 13:04:49 +02:00
Tomasz Grabiec
389573543e tablet_allocator: Make migration_plan a class
It will be extended with more fields so that load balancer can
communicate more information to the coordinator.
2023-09-14 13:04:47 +02:00
Tomasz Grabiec
d5539e080d tablets: Implement cleanup step
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.

The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.

This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (the node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata, leading to an
abort due to the failed memtable flush.
2023-09-14 12:45:10 +02:00
Tomasz Grabiec
5cf035878d storage_service, tablets: Prevent stale RPCs from running beyond their stage
Example scenario:

  1. coordinator A sends RPC #1 to trigger streaming
  2. coordinator fails over to B
  3. coordinator B performs streaming successfully
  4. RPC #1 arrives and starts streaming
  5. coordinator B commits the transition to the post-streaming stage
  6. coordinator B executes global token metadata barrier

We end up with streaming running despite the fact that the current
coordinator moved on. Currently, this won't happen, because streaming
holds on to erm. But we want to change that (see #14995), so that it
does not block barriers for migrations of other tablets. The same
problem applies to tablet cleanup.

The fix is to use tablet_metadata_guard around such long-running
operations, which will keep hold of the erm so that in the above scenario
coordinator B will wait for it in step 6. The guard ensures that the erm
doesn't block other migrations because it switches to the latest erm
if it's compatible. If it's not, it signals the abort_source of the guard
so that such a stale operation aborts soon and the barrier in step 6
doesn't wait for long.
2023-09-14 12:45:10 +02:00
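A minimal sketch of the guard behavior described above, under the assumption that Seastar's abort_source and the effective_replication_map can be reduced to simple stand-ins:

```cpp
#include <atomic>
#include <memory>

// Stand-in for the real effective_replication_map.
struct effective_replication_map { int version; };

class tablet_metadata_guard {
    std::shared_ptr<effective_replication_map> _erm;
    std::atomic<bool> _aborted{false};  // stands in for abort_source
public:
    explicit tablet_metadata_guard(std::shared_ptr<effective_replication_map> erm)
        : _erm(std::move(erm)) {}

    // Called when topology metadata changes. If the new map is still
    // compatible with this tablet's operation, switch to it so we don't
    // keep an old map pinned (and don't block barriers for migrations of
    // other tablets); otherwise signal the stale operation to abort soon.
    void on_metadata_change(std::shared_ptr<effective_replication_map> latest,
                            bool compatible_with_this_tablet) {
        if (compatible_with_this_tablet) {
            _erm = std::move(latest);
        } else {
            _aborted.store(true);
        }
    }

    bool aborted() const { return _aborted.load(); }
    int pinned_version() const { return _erm->version; }
};
```
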
Tomasz Grabiec
6a62aca3a9 locator: Introduce tablet_metadata_guard
Will be used to synchronize long-running tablet operations with
topology coordinator.

It blocks barriers like erm_ptr, but refreshes if change is
irrelevant, so behaves as if the erm_ptr's scope was narrowed down to
a single tablet.
2023-09-14 12:45:10 +02:00
Patryk Jędrzejczak
c0fd42ead4 system_keyspace: introduce decode_cdc_generation_id
The decode_cdc_generations_ids function allows us to decode
a vector of CDC generation IDs. After adding cleanup_candidate
to CDC_GENERATIONS_V3, we need a similar function that decodes
a single ID.
2023-09-14 12:09:14 +02:00
Patryk Jędrzejczak
6db325fb69 system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
In the following commits, we implement garbage collection for
CDC_GENERATIONS_V3. The first step is introducing the clean-up
candidate. It will be continually updated by the CDC generation
publisher and used to remove obsolete data.
2023-09-14 12:09:10 +02:00
Tomasz Grabiec
532ec84210 locator, replica: Add a way to wait for table's effective_replication_map change 2023-09-14 12:08:54 +02:00
Tomasz Grabiec
2c6785dc8f storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
It will be shared with cleanup_tablet().

Minor changes:
  - ditch the redundant optional<> around shared_future<>
2023-09-14 12:08:52 +02:00
Tomasz Grabiec
e2c1f904c8 raft topology: Add break in the final case clause
To be safe in case we add more cases.
2023-09-14 12:07:59 +02:00
Tomasz Grabiec
97f3f496bd raft topology: Fix SIGSEGV when trace-level logging is enabled
rs.ring may be disengaged.
2023-09-14 12:07:59 +02:00
Tomasz Grabiec
a4c91a5ee7 raft topology: Set node state in topology
Will be examined by the load balancer.
2023-09-14 12:07:59 +02:00
Tomasz Grabiec
56e1a72c8f raft topology: Always set host id in topology
Before, it was updated only for normal nodes. We need it for
bootstrapping nodes too. Otherwise, algorithms, e.g. the load balancer,
will be confused by observing nodes in topology without host id set.

This will become a problem when load balancer is invoked concurrently
with bootstrap, which currently is not the case, but will be after
later patches.

We should maintain that all nodes in topology have a host id.
2023-09-14 12:07:59 +02:00
Botond Dénes
3e2d8ca94d test: add nodetool tests
Testing the new scylla nodetool tool.
The tests can be run against both implementations of nodetool: the
scylla-native one and the cassandra one. They all pass with both
implementations.
2023-09-14 05:25:14 -04:00
Botond Dénes
56f7b2f45d test.py: add ToolTestSuite and ToolTest
A test suite for python pytests that test tools and hence don't need a
scylla cluster setup for them.
2023-09-14 05:25:14 -04:00
Botond Dénes
60dc2e9303 tools/scylla-nodetool: implement compact operation
Equivalent of nodetool compact.
The following arguments are accepted:
* split-output,s (unused)
* user-defined (error is raised)
* start-token,st (unused)
* end-token,et (unused)
* partition (unused)

The partition argument is mentioned only in our doc; our nodetool
doesn't recognize it. I added it nevertheless (it is ignored).
Split-output doesn't work with our current nodetool; the option is
parsed, but an error is raised if it is used.
2023-09-14 05:25:14 -04:00
Botond Dénes
d67e22b876 tools/scylla-nodetool: implement basic scylla_rest_api_client
Add --host and --port parameters, parse and resolve these and
establish a connection to the provided host.
Add a simple get() and post() method, parsing the returned data as json.

Add the following compatibility arguments:
* password,pw
* password-file,pwf
* username,u
* print-port,pp

These are parsed and silently ignored, as they are specific to JMX and
aren't needed when connecting to the REST API.
Since boost program options supports neither multi-char short-form
switches nor the -short=value syntax, the argv has to be massaged
into a form which boost program options can digest. This is achieved by
substituting all incompatible option formats and syntax with the
equivalent boost program options compatible one.
This mechanism is also used to make sure -h is translated to --host, not
--help. The help message is unfortunately still ambiguous, displaying
both with -h. This will be addressed in a follow-up.
2023-09-14 05:25:14 -04:00
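The argv massaging could look roughly like this; the alias table and the function name are illustrative assumptions, not the tool's actual code:

```cpp
#include <string>
#include <utility>
#include <vector>

// Normalize JMX-nodetool style options into a form that
// boost::program_options can parse.
std::vector<std::string> massage_argv(const std::vector<std::string>& argv) {
    // Illustrative alias table, not the tool's full list.
    static const std::pair<std::string, std::string> aliases[] = {
        {"-pw", "--password"},
        {"-pwf", "--password-file"},
        {"-u", "--username"},
        {"-pp", "--print-port"},
        {"-h", "--host"},  // steer -h to --host, not --help
    };
    std::vector<std::string> out;
    for (const auto& arg : argv) {
        std::string replaced = arg;
        for (const auto& [short_form, long_form] : aliases) {
            if (arg == short_form) {
                replaced = long_form;
                break;
            }
            // handle the -short=value syntax too
            if (arg.rfind(short_form + "=", 0) == 0) {
                replaced = long_form + arg.substr(short_form.size());
                break;
            }
        }
        out.push_back(std::move(replaced));
    }
    return out;
}
```
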
Botond Dénes
eb1beca1b6 tools: introduce scylla-nodetool
This patch only introduces the bare skeleton of the tool, plus the wiring
into main.
No operations are added yet, they will be added in later patches.
2023-09-14 05:25:14 -04:00
Botond Dénes
bf2fad3c00 utils: export dns_connection_factory from s3/client.cc to http.hh
So others can use it too. Move headers only used by said class too.
2023-09-14 05:25:14 -04:00
Botond Dénes
17fd57390e utils/s3/client: pass logger to dns_connection_factory in constructor
We want to publish this class in a header so it can be used by others,
but it uses the s3 logger. We don't want future users to pollute the s3
logs, so allow users to pass their own loggers to the factory.
2023-09-14 05:25:14 -04:00
Botond Dénes
4dd373b8d3 tools/utils: tool_app_template::run_async(): also detect --help* as --help
Don't try to look up the current operation if the first argument is
--help*. This allows --help-seastar and --help-loggers to work.
2023-09-14 05:25:14 -04:00
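The check described amounts to a simple prefix test; a minimal sketch with a hypothetical function name:

```cpp
#include <string>

// Any argument starting with "--help" is treated as a help request, so
// --help-seastar and --help-loggers are not mistaken for operation names.
bool is_help_argument(const std::string& arg) {
    return arg.rfind("--help", 0) == 0;  // true iff arg starts with "--help"
}
```
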
Kamil Braun
47b18ae908 migration_manager: log when performing read barrier in get_schema_for_write
Will be useful for debugging problems with timing out queries if they
are caused by slow schema sync read barriers.

Ref: #15357

Closes #15396
2023-09-14 11:44:24 +03:00
Kamil Braun
bff9cedef9 Merge 'system_keyspace: remove flushes when writing to system tables' from Petr Gusev
There are several system tables with strict durability requirements.
This means that if we have written to such a table, we want to be sure
that the write won't be lost in case of node failure. We currently
accomplish this by accompanying each write to these tables with
`db.flush()` on all shards. This is expensive, since it causes all the
memtables to be written to sstables, which causes a lot of disk writes.
These overheads can become painful during node startup, when we write the
current boot state to `system.local`/`system.scylla_local` or during
topology change, when `update_peer_info`/`update_tokens` write to
`system.peers`.

In this series we remove flushes on writes to the `system.local`,
`system.peers`, `system.scylla_local` and `system.cdc_local` tables and
start using schema commitlog for durability.

Fixes: #15133

Closes #15279

* github.com:scylladb/scylladb:
  system_keyspace: switch CDC_LOCAL to schema commitlog
  system_keyspace: scylla_local: use schema commitlog
  database.cc: make _uses_schema_commitlog optional
  system_keyspace: drop load phases
  database.hh: add_column_family: add readonly parameter
  schema_tables: merge_tables_and_views: delay events until tables/views are created on all shards
  system_keyspace: switch system.peers to schema commitlog
  system_keyspace: switch system.local to schema commitlog
  main.cc: move schema commitlog replay earlier
  sstables_format_selector: extract listener
  sstables_format_selector: wrap when_enabled with seastar::async
  main.cc: inline and split system_keyspace.setup
  system_keyspace: refactor save_system_schema function
  system_keyspace: move initialize_virtual_tables into virtual_tables.hh
  system_keyspace: remove unused parameter
  config.cc: drop db::config::host_id
  main.cc:: extract local_info initialization into function
  schema.cc: check static_props for sanity
  system_keyspace: set null sharder when configuring schema commitlog
  system_keyspace: rename static variables
  system_keyspace: remove redundant wait_for_sync_to_commitlog
2023-09-14 10:39:20 +02:00
Kefu Chai
25457fca38 Update tools/cqlsh submodule
* tools/cqlsh 66ae7eac...e651e12e (6):
  > setup.py: specify Cython language_level explicitly
  > setup.py: pass extensions as a list
  > setup.py: reindent block in else branch
  > setup.py: early return in get_extension()
  > reloc: install build==0.10.0
  > reloc: add --verbose option to build_reloc.sh

Closes #15401
2023-09-14 10:30:07 +02:00
Kefu Chai
60c293ed7d doc/dev: correct the path to object_storage.yaml
we get the path of the object storage config like:

```c++
db::config::get_conf_sub("object_storage.yaml").native()
```
so, the default path should be $SCYLLA_CONF/object_storage.yaml.

in this change, it is corrected.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15406
2023-09-14 10:40:55 +03:00
Botond Dénes
cc16502691 Merge 'Add metrics to S3 client' from Pavel Emelyanov
The added metrics include:

- http client metrics, which include the number of connections, the number of active connections and the number of new connections made so far
- IO metrics that mimic those for traditional IO -- total number of object read/write ops, total number of get/put/uploaded bytes and individual IO request delay (round-trip, including body transfer time)

fixes: #13369

Closes #14494

* github.com:scylladb/scylladb:
  s3/client: Add IO stats metrics
  s3/client: Add HTTP client metrics
  s3/client: Split make_request()
  s3/client: Wrap http client with struct group_client
  s3/client: Move client::stats to namespace scope
  s3/client: Keep part size local variable
2023-09-14 09:49:08 +03:00
Kefu Chai
88a7bf2853 build: cmake: add targets for building deb and rpm packages
Refs #15241

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 13:00:04 +08:00
Kefu Chai
93faac0a0c build: cmake: correct the paths used when building unstripped pkg
in a0dcbb09c3, the newly introduced unstripped package does not build
at all. it was using the wrong paths. so, let's correct them.

Refs #15241

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 13:00:04 +08:00
Kefu Chai
16eea4569d create-relocatable-package.py: prefer $build_dir/SCYLLA-RELEASE-FILE
similar to d9dcda9dd5, we need to
use the version files located under $build_dir instead of "build".
so let's check the existence of $build_dir/SCYLLA-RELEASE-FILE,
and then fall back to the ones under "build".

Refs #15241

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 12:45:40 +08:00
Kefu Chai
6dc6b39609 create-relocatable-package.py: create SCYLLA-RELOCATABLE-FILE with tempfile
this change serves two purposes:

1. so we don't assume the existence of '$PWD/build' directory. we should
   not assume this, as the build directory could be any directory; it
   does not have to be "build".
2. we don't have to actually create a file under $build_dir. what we
   need is just an empty file. so tempfile serves this purpose just well.

Refs #15241

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 12:45:15 +08:00
Botond Dénes
fa88ed76a5 Merge 'build: cmake: add packaging support' from Kefu Chai
a new target "dist-unified" is added, so that CMake can build unified
package, which is a bundle of all subcomponents, like cqlsh, python3,
jmx and tools.

Fixes #15241

Closes #15398

* github.com:scylladb/scylladb:
  build: cmake: build unified package
  build: cmake: put stripped_dist_pkg under $build/dist
2023-09-14 07:02:44 +03:00
Yaniv Kaul
6c67c270c8 Update node exporter to v1.6.1
Fixes: https://github.com/scylladb/scylladb/issues/15044

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes #15045

[avi: toolchain regenerated; also pulls in clang-16.0.6-3]

Ref #15090

Closes #15392
2023-09-14 01:04:14 +03:00
Petr Gusev
082cd3bc8e system_keyspace: switch CDC_LOCAL to schema commitlog 2023-09-13 23:17:20 +04:00
Petr Gusev
a683cebb02 system_keyspace: scylla_local: use schema commitlog
We remove flush from set_scylla_local_param_as
since it's now redundant. We add it to
save_local_enabled_features as features need to
be available before schema commitlog replay.

We skip the flush if save_local_enabled_features
is called from topology_state_load when the features
are migrated to system.topology and we don't need
strict durability.
2023-09-13 23:17:20 +04:00
Petr Gusev
ce0ee32d5a database.cc: make _uses_schema_commitlog optional
This field on the null shard is properly initialized
in maybe_init_schema_commitlog function, until then
we can't make decisions based on its value. This problem
can happen e.g. if add_column_family function is called
with readonly=false before maybe_init_schema_commitlog.
It will call commitlog_for to pass the commitlog to
mark_ready_for_writes and commitlog_for reads _uses_schema_commitlog.

In this commit we add protection against this case - we
trigger internal_error if _uses_schema_commitlog is read
before it is initialized.

maybe_init_schema_commitlog() was added to cql_test_env
to make boost tests work with the new invariant.
2023-09-13 23:17:20 +04:00
Petr Gusev
beb29f094b system_keyspace: drop load phases
We want to switch the system.scylla_local table to the
schema commitlog, but the load phases hamper this - the schema
commitlog is initialized after phase1,
so a table which is using it would have to be moved to phase2,
but system.scylla_local contains features, and we need
them before schema commitlog initialization for the
SCHEMA_COMMITLOG feature.

In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added _readonly bool field
to the table class, it's initialized to true in table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.

In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant,
the table has already been made writable in
env.make_table_for_tests call.

The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
2023-09-13 23:17:20 +04:00
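The readonly invariant described above can be sketched as follows; the names are simplified stand-ins, and where the real code triggers internal_error this sketch throws instead:

```cpp
#include <stdexcept>
#include <string>

struct commitlog { std::string name; };

class table {
    bool _readonly = true;            // initialized to true in the constructor
    commitlog* _commitlog = nullptr;  // constructor always leaves this unset
public:
    // Selects the proper commitlog and marks the table writable; in the
    // real flow this happens only after the on_system_tables_loaded
    // notifications have fired.
    void mark_ready_for_writes(commitlog* log) {
        _commitlog = log;
        _readonly = false;
    }

    // Stands in for table::commitlog(): accessing the commitlog while the
    // table is still readonly is an error rather than a silent nullptr.
    commitlog* get_commitlog() const {
        if (_readonly) {
            throw std::logic_error("commitlog accessed while table is readonly");
        }
        return _commitlog;
    }
};
```
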
Petr Gusev
47ffc66c7f database.hh: add_column_family: add readonly parameter
Previously, creating a table or view in
schema_tables.cc/merge_tables_and_views was a two-step process:
first adding a column family (add_column_family function) and
then marking it as ready for writes (mark_table_as_writable).
There is a yield between these stages, which means
someone could see a table or view for which the
mark_table_as_writable method had not yet been called,
and start writing to it.

This problem was demonstrated by materialised view dtests.
A view is created on all nodes. On some nodes it will be created
earlier than on others and the view rebuild process will start
writing data to that view on other nodes, where mark_table_as_writable
has not yet been called.

In this patch we solve this problem by adding a readonly parameter
to the add_column_family method. When loading tables from disk,
this flag is set to true and the mark_table_as_writable
is called only after all sstables have been loaded.
When creating a new table, this flag is set to false,
mark_table_as_writable is called from inside add_column_family
and the new table becomes visible already as writable.
2023-09-13 23:17:20 +04:00
Petr Gusev
7e52014633 schema_tables: merge_tables_and_views: delay events until tables/views are created on all shards
db.get_notifier().create_view triggers view rebuild, this
process writes to the table on all shards and thus can
access partially created table, e.g the one where
mark_table_ready_for_writes was not yet called.
2023-09-13 23:17:20 +04:00
Petr Gusev
0e5f9ae9a4 system_keyspace: switch system.peers to schema commitlog
Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
2023-09-13 23:17:20 +04:00
Petr Gusev
7881ce1e09 system_keyspace: switch system.local to schema commitlog
Schema commitlog lives only on the zero shard,
so we need to turn on use_null_sharder option.

Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
2023-09-13 23:17:20 +04:00
Petr Gusev
cbfc512667 main.cc: move schema commitlog replay earlier
We want to switch system.local table to
schema commitlog, but this table is used
in host_id initialization (initialize_local_info),
so we need to replay schema commitlog before.

In this commit we gather all the actions
related to early system_keyspace initialization
in one place, before initialize_local_info_thread.

The calls to save_system_schema and recalculate_schema_version
are tied to legacy_schema_migrator::migrate and
initialize_virtual_tables calls, so they are done
separately after legacy_schema_migrator::migrate.
2023-09-13 23:17:11 +04:00
Petr Gusev
a0653590b5 sstables_format_selector: extract listener
In the following commits we want to move schema
commitlog replay earlier, but the current sstable
format should be selected before the replay.
The current sstable format is stored in system.scylla_local,
so we can't read it until system tables are loaded.
This problem is similar to the enabled_features.

To solve this we split sstables_format_selector in two
parts. The lower level part, sstables_format_selector,
knows only about database and system_keyspace. It
will be moved before system_keyspace initialization,
and the on_system_tables_loaded method will
be called on it when the system_keyspace has loaded its tables.

The higher level part, sstables_format_listener, is responsible
for subscribing to feature_services and the gossiper, and is started
later, at the same place as sstables_format_selector was before this commit.
2023-09-13 23:04:50 +04:00
Petr Gusev
7104fc8a7e sstables_format_selector: wrap when_enabled with seastar::async
The listener may fire immediately; we must be in a thread
context for this to work.

In the next commits we are going to move
enable_features_on_startup above
sstables_format_selector::start in scylla_main, so we
need to fix this beforehand.
2023-09-13 23:00:16 +04:00
Petr Gusev
2a0b228d17 main.cc: inline and split system_keyspace.setup
Our goal is to switch system.local table to schema
commitlog and stop doing flushes when we write to it.
This means it would be incorrect to read from this
table until schema commitlog is replayed.

On the other hand, we need truncation records
to be loaded before we start replaying schema
commitlog, since commitlog_replayer relies on them.

In this commit we inline the system_keyspace::setup
function and split its content into two parts. In
the first part, before schema commitlog replay,
we load truncation records. It's safe to load
them before schema commitlog replay since we intend
to keep the flushes on writes to the system.truncated
table. In the second part, after schema commitlog replay,
we do the rest of the job - build_bootstrap_info and
db::schema_tables::save_system_schema.

We decided to inline this function since there is
very low cohesion between the actions it's performing.
It's just simpler to reason about them individually.
2023-09-13 23:00:15 +04:00
Petr Gusev
f0bc9f2d93 system_keyspace: refactor save_system_schema function
This is a refactoring commit without observable changes
in behaviour.

Previously, there were two related functions in db::schema_tables:
save_system_keyspace_schema(qp) and save_system_schema(qp, ks).
The first called the second passing "system_schema" as
the second argument. Outside of the schema_tables module we
don't need two functions; we just need a way to say
'persist system schema objects in the appropriate tables/keyspaces'.
In this commit we change the function save_system_schema
to have this meaning. Internally it calls save_system_schema_to_keyspace
twice with "system_schema" and "system", since that's what we need
in the single call site of this function in system_keyspace::setup.
In subsequent commits we are going to move this call out of the
system_keyspace::setup.
2023-09-13 23:00:15 +04:00
Petr Gusev
e395086557 system_keyspace: move initialize_virtual_tables into virtual_tables.hh
This is a readability refactoring commit without observable changes
in behaviour.

initialize_virtual_tables logically belongs to the virtual_tables module,
and moving it allows making other functions in virtual_tables.cc
(register_virtual_tables, install_virtual_readers)
local to the module, which simplifies matters a bit.

all_virtual_tables() is not needed anymore; all the references to
registered virtual tables are now local to the virtual_tables module
and can just use the virtual_tables variable directly.
2023-09-13 23:00:15 +04:00
Petr Gusev
c4787a160b system_keyspace: remove unused parameter 2023-09-13 23:00:15 +04:00
Petr Gusev
b90011294d config.cc: drop db::config::host_id
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.

Some tests want a specific host_id, so we add it to cql_test_config
and use it in cql_test_env.

We can't pass host_id to sstables_manager by value since it's
initialized in the database constructor, and host_id is not loaded yet.
We also prefer not to add a dependency on shared_token_metadata,
since in that case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to the
sstables_manager constructor.
2023-09-13 23:00:15 +04:00
Petr Gusev
d15c961a2f main.cc:: extract local_info initialization into function
This is a refactoring commit without observable changes
in behaviour.

The scylla main function is huge and incomprehensible.
There are a lot of hidden dependencies between actions
that it performs, and it's too difficult to reason about
them.

In this commit, we've extracted a small part of it into
its own function. We're hoping that, moving forward,
the rest of the code can be modified in a similar manner.
2023-09-13 23:00:15 +04:00
Petr Gusev
c59dae9a73 schema.cc: check static_props for sanity
wait_for_sync_to_commitlog is redundant
for schema commitlog since all writes
to it automatically sync due to
db::commitlog::sync_mode::BATCH option.
2023-09-13 23:00:15 +04:00
Petr Gusev
a03fbc3781 system_keyspace: set null sharder when configuring schema commitlog
The schema commitlog lives only on the null shard; it
makes no sense to set use_schema_commitlog
without use_null_sharder.

We also extract the function enable_schema_commitlog which
sets all the needed properties.
2023-09-13 23:00:15 +04:00
Petr Gusev
d32191a353 system_keyspace: rename static variables
The name 'raft_tables' in the set_use_schema_commitlog
initialization was misleading. Other variables have
also been renamed for consistency.
2023-09-13 23:00:15 +04:00
Petr Gusev
cda49b06dc system_keyspace: remove redundant wait_for_sync_to_commitlog
Tables with schema commitlog already sync every
write, wait_for_sync_to_commitlog makes sense
only for the regular commitlog.

Technically there is nothing wrong with allowing
both options, but it's confusing. Being strict
and accurate about the meaning of the options
reduces the chance of errors due to misunderstanding.

This is preparation for the next commits, where
we will start generating an error if the combination
of options doesn't make sense.
2023-09-13 23:00:15 +04:00
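The stricter checking this series leads up to can be sketched as follows. This is an illustrative Python model, not ScyllaDB's actual C++ code; the option names come from the commit messages above, and the validation function name is invented:

```python
# Illustrative sketch of rejecting option combinations that don't make
# sense, per the commits above. Not ScyllaDB's actual implementation.
def check_schema_commitlog_props(use_schema_commitlog: bool,
                                 use_null_sharder: bool,
                                 wait_for_sync_to_commitlog: bool) -> None:
    if use_schema_commitlog and not use_null_sharder:
        # The schema commitlog lives only on shard 0.
        raise ValueError("use_schema_commitlog requires use_null_sharder")
    if use_schema_commitlog and wait_for_sync_to_commitlog:
        # Schema-commitlog writes already sync on every write (BATCH mode),
        # so this flag is redundant and likely signals a misunderstanding.
        raise ValueError("wait_for_sync_to_commitlog is redundant "
                         "with use_schema_commitlog")
```

Allowing the redundant combination would be harmless at runtime; failing loudly is a design choice that surfaces misconfigurations early.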
Kefu Chai
268f75c931 build: cmake: build unified package
a new target "dist-unified" is added, so that CMake can build the unified
package, which is a bundle of all the subcomponents, like cqlsh, python3,
jmx and tools.

Fixes #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 00:41:46 +08:00
Kefu Chai
f39129f93c build: cmake: put stripped_dist_pkg under $build/dist
more consistent this way, as other tarballs are also located under
this directory.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-14 00:39:06 +08:00
Tomasz Grabiec
c27d212f4b api, storage_service: Recalculate table digests on relocal_schema api call
Currently, the API call recalculates only the per-node schema version. To
work around issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has an impact on availability.

Use like this:

  curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema

Fixes #15380

Closes #15381
2023-09-13 18:27:57 +03:00
Avi Kivity
0a5d9532f9 Merge 'Sanitize batchlog manager start/stop' from Pavel Emelyanov
This code is currently spread over main() and differs in cql_test_env. The PR unifies both places and makes the manager's start/stop look standard.

refs: #2795

Closes #15375

* github.com:scylladb/scylladb:
  batchlog_manager: Remove start() method
  batchlog_manager: Start replay loop in constructor
  main, cql_test_env: Start-stop batchlog manager in one "block"
  batchlog_manager: Move shard-0 check into batchlog_replay_loop()
  batchlog_manager: Fix drain() reentrability
2023-09-13 18:20:56 +03:00
Aleksandra Martyniuk
14598fdfdd test: add test for compaction strategy validation 2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
53ecc29cd7 compaction: unify exception messages
Use fmt::format in exception messages in all methods validating
compaction strategies.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
ac08b57555 compaction: cql3: validate options in check_restricted_table_properties
Check whether valid compaction strategy options are set for the given
strategy type in check_restricted_table_properties.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
44744d6229 compaction: validate options used in different compaction strategies
For each compaction strategy, validate whether options values are valid.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
0ed39af221 compaction: validate common compaction strategy options
Add compaction_strategy_impl::validate_options to validate common
compaction strategy options.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
a2e6081984 compaction: split compaction_strategy_impl constructor
Split compaction_strategy_impl constructor into methods that will
be reused for validation.

Add additional checks ensuring that option values are legal.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
5c72bcd40e compaction: validate size_tiered_compaction_strategy specific options 2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
7e5b6ea09a compaction: validate time_window_compaction_strategy specific options 2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
84fd90e472 compaction: add method to validate min and max threshold
Add compaction_strategy_impl::validate_min_max_threshold method
that will be used to validate min and max threshold values
for different compaction methods.
2023-09-13 16:59:40 +02:00
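The shared min/max threshold check described above can be modeled as below. This is a Python sketch rather than ScyllaDB's C++; the option names follow the commit, but the defaults and exact rules are assumed for illustration:

```python
# A minimal model of validate_min_max_threshold: shared validation of the
# min_threshold / max_threshold compaction options used by several
# strategies. Defaults and limits here are assumptions, not ScyllaDB's.
def validate_min_max_threshold(options: dict) -> None:
    min_t = int(options.get("min_threshold", 4))
    max_t = int(options.get("max_threshold", 32))
    if min_t < 2:
        raise ValueError(f"min_threshold must be at least 2, got {min_t}")
    if max_t < min_t:
        raise ValueError(f"max_threshold ({max_t}) must not be less than "
                         f"min_threshold ({min_t})")
```

Centralizing the check means every strategy rejects the same illegal combinations with the same message, which is what the "unify exception messages" commit in this series is after.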
Aleksandra Martyniuk
50c1bb555b compaction: split size_tiered_compaction_strategy_options constructor
Split size_tiered_compaction_strategy_options constructor into
methods that will be reused for validation.

Add additional checks ensuring that option values are legal.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
702c19f941 compaction: make compaction strategy keys static constexpr 2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
e3d8f71a88 compaction: use helpers in validate_* functions
To be consistent with other compaction_strategy_options,
time_window_compaction_strategy_options uses compaction_strategy_impl::get_value
and cql3::statements::property_definitions::to_long helpers for
parsing.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
c8c3c0e6a6 compaction: split time_window_compaction_strategy_options constructor
Split time_window_compaction_strategy_options constructor into
functions that will be reused for validation.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
a01dd1351e compaction: add validate method to compaction_strategy_options
Add temporarily empty validate method to compaction_strategy_options.
The method will validate the options and help determine whether
only the allowed options were set.
2023-09-13 16:59:40 +02:00
Benny Halevy
e5cf6f0897 time_window_compaction_strategy_options: make copy and move-able
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-13 16:59:40 +02:00
Benny Halevy
c9475d6fe0 size_tiered_compaction_strategy_options: make copy and move-able
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-13 16:59:40 +02:00
Pavel Emelyanov
f9b09d4549 migration_manager: Register RPC verbs on start
There's a dedicated call to register the migration manager's verbs somewhere
in the middle of main. However, until messaging service listening starts,
it makes no difference when the verbs are registered.

This patch moves the verbs registration into the migration manager's
constructor, so it is called with sharded<migration_manager>::start().

Unregistration happens in migration_manager::drain() and it's not
touched here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15367
2023-09-13 17:32:51 +03:00
Pavel Emelyanov
9dea26aa03 storage_service: Remove proxy arg from init_messaging_service_part()
It's only used to be carried along down to a handler to get
sharded<database> from. The storage service itself can provide it, and the
handler in question already uses it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15368
2023-09-13 17:11:33 +03:00
Raphael S. Carvalho
c53b8fb1b5 storage_service: initialize group0 in ctor
There are a couple of places that check that group0 is not nullptr,
so let's set it to nullptr in the ctor, so shards that don't
have it initialized will trip the assert instead of failing
with a cryptic segfault.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15330
2023-09-13 14:51:24 +02:00
Botond Dénes
50e3448527 Merge 'unified: add --build-dir option and respect --pkgs' from Kefu Chai
in this series, `unified/build_unified.sh` is improved in a couple of ways:

1. add a `--build-dir` option, so we don't hardwire the build directory to the `build/$mode` naming convention.
2. respect `--pkgs`. This allows the caller to specify the paths to the dist tarballs instead of hardwiring them to the paths defined in this script.

These changes give us more flexibility when building the unified package, and enable us to switch over to the CMake-based build system.

Refs #15241

Closes #15377

* github.com:scylladb/scylladb:
  unified: respect --pkgs option
  unified: allow passing --pkgs with a semicolon-separated list
  unified: prefer SCYLLA-PRODUCT-FILE in build_dir
  unified: derive UNIFIED_PKG from --build-dir
  unified: add --build-dir option to build_unified.sh
2023-09-13 15:30:57 +03:00
Kamil Braun
a184b07cbb Merge 'raft topology: make CDC_GENERATIONS_V3 single-partition, timeuuid-sorted' from Patryk Jędrzejczak
We make the `CDC_GENERATIONS_V3` table single-partition and change the
clustering key from `range_end` to `(id, range_end)`. We also change the
type of `id` to `timeuuid` and ensure that a new generation always has
the highest `id`. These changes allow efficient clearing of obsolete CDC
generation data, which we need to prevent Raft-topology snapshots from
endlessly growing as we introduce new generations over time.

All this code is protected by an experimental feature flag. It includes
the definition of `CDC_GENERATIONS_V3`. The table is not created unless
the feature flag is enabled.

Fixes #15163

Closes #15319

* github.com:scylladb/scylladb:
  system_keyspace: rename cdc_generation_id_v2
  system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
  cdc: generation: remove topology_description_generator
  cdc: do not create uuid in make_new_generation_data
  system_keyspace: make CDC_GENERATIONS_V3 single-partition
  cdc: generation: introduce get_common_cdc_generation_mutations
  cdc: generation: rename get_cdc_generation_mutations
2023-09-13 12:54:49 +02:00
Kefu Chai
bbb6e4f822 docs: s/tar xvfz tar/tar xvfz/ in command line sample
we should not pass "tar" to tar, otherwise we'd get the following error:
```
tar (child): tar: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
```
as "tar" is not the compressed tarball we want to untar.

Fixes #15328
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15383
2023-09-13 13:37:38 +03:00
Aleksandra Martyniuk
d0d0ad7aa4 node_ops: extract classes related to node operations
Node operations will be integrated with task manager and so node_ops
directory needs to be created. To have an access to node ops related
classes from task manager and preserve consistent naming, move
the classes to node_ops/node_ops_data.cc.
2023-09-13 10:49:31 +02:00
Aleksandra Martyniuk
e90e10112f node_ops: repair: move node_ops_id to node_ops directory 2023-09-13 10:40:04 +02:00
Piotr Dulikowski
66206207f9 gossiper: properly acquire lock_endpoint_update_semaphore in reset_endpoint_state_map
The `gossiper::reset_endpoint_state_map` function is supposed to acquire
a lock in order to serialize with `replicate_live_endpoints_on_change`.
The `lock_endpoint_update_semaphore` is called, but its result is a
future - and it is not co_awaited. Therefore, the lock has no effect.

This commit fixes the issue by adding the missing co_await.

Fixes: #15361

Closes #15362
2023-09-13 10:03:47 +02:00
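The bug described above has a compact analogue in Python's asyncio, shown here as an illustration only (this is not ScyllaDB code, and asyncio locks are only a loose model of seastar semaphores): acquiring a lock returns an awaitable, and discarding it without awaiting means the lock is never actually held.

```python
# asyncio analogue of the missing co_await: the acquisition returns an
# awaitable; until it is awaited, the lock has no effect.
import asyncio

async def demo() -> None:
    lock = asyncio.Lock()
    pending = lock.acquire()      # BUG pattern: result not awaited...
    assert not lock.locked()      # ...so nothing is serialized yet
    await pending                 # the fix: (co_)await the acquisition
    assert lock.locked()          # now the critical section is protected
    lock.release()

asyncio.run(demo())
```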
Botond Dénes
7e7101c180 Revert "Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes"
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.

The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.

Fixes: #15371
2023-09-13 10:45:37 +03:00
Avi Kivity
2c810e221a Merge 'Gossiper: replace seastar threads with coroutines' from Benny Halevy
Many of the gossiper's internal functions currently use seastar threads for historical reasons,
but since they are short-lived, the cost of spawning a seastar thread for them is excessive,
and they can be simplified and made more efficient using coroutines.

Closes #15364

* github.com:scylladb/scylladb:
  gossiper: reindent do_stop_gossiping
  gossiper: coroutinize do_stop_gossiping
  gossiper: reindent assassinate_endpoint
  gossiper: coroutinize assassinate_endpoint
  gossiper: coroutinize handle_ack2_msg
  gossiper: handle_ack_msg: always log warning on exception
  gossiper: reindent handle_ack_msg
  gossiper: coroutinize handle_ack_msg
  gossiper: reindent handle_syn_msg
  gossiper: coroutinize handle_syn_msg
  gossiper: message handlers: no need to capture shared_from_this
  gossiper: add_local_application_state: throw internal error if endpoint state is not found
  gossiper: coroutinize add_local_application_state
2023-09-12 21:50:52 +03:00
Benny Halevy
47dc287efd gossiper: reindent do_stop_gossiping
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:33:09 +03:00
Benny Halevy
8fa65ed016 gossiper: coroutinize do_stop_gossiping
Simplify the function.  It does not need to spawn
a seastar thread.

While at it, declare it as private since it's called
only internally by the gossiper (and on shard 0).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:33:09 +03:00
Benny Halevy
a792babbda gossiper: reindent assassinate_endpoint
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:33:09 +03:00
Benny Halevy
5dbc168c03 gossiper: coroutinize assassinate_endpoint
It has no need to spawn a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:33:09 +03:00
Benny Halevy
29b9596050 gossiper: coroutinize handle_ack2_msg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:33:09 +03:00
Benny Halevy
cc030a5040 gossiper: handle_ack_msg: always log warning on exception
Unlike handle_syn_msg, the warning is currently printed only
`if (_ack_handlers.contains(from.addr))`.
Unclear why. It is interesting in any case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:32:40 +03:00
Benny Halevy
990ac23d19 gossiper: reindent handle_ack_msg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:27:08 +03:00
Benny Halevy
2ca2118130 gossiper: coroutinize handle_ack_msg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:26:03 +03:00
Benny Halevy
8c065bf023 gossiper: reindent handle_syn_msg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:24:14 +03:00
Benny Halevy
264f4daded gossiper: coroutinize handle_syn_msg
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:23:09 +03:00
Benny Halevy
63ab5f1ab3 gossiper: message handlers: no need to capture shared_from_this
The handlers' future is waited on under `background_msg`,
which is closed in gossiper::stop, so the instance is
already guaranteed to be kept valid.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:21:07 +03:00
Benny Halevy
8bfec81985 gossiper: add_local_application_state: throw internal error if endpoint state is not found
If the function is called too early, the first get_endpoint_state_ptr
would throw an exception that is later caught and degraded
into a warning.

But that endpoint_state should never disappear after yielding,
so call on_internal_error in that case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:21:07 +03:00
Benny Halevy
d1c67300d4 gossiper: coroutinize add_local_application_state
There is no need for it to spawn a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-12 19:20:41 +03:00
Kefu Chai
75f458f2a5 unified: respect --pkgs option
let's provide the default value only if the user does not specify --pkgs;
otherwise the --pkgs option is always ignored.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 22:56:10 +08:00
Kefu Chai
84387e3856 unified: allow passing --pkgs with a semicolon-separated list
simpler than passing a space-separated list that requires escaping, which
is a source of headaches.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 22:56:10 +08:00
Kefu Chai
d9dcda9dd5 unified: prefer SCYLLA-PRODUCT-FILE in build_dir
unlike `configure.py`, the build system created by CMake does not
share the `SCYLLA-PRODUCT-FILE` across different builds, so we cannot
assume that build/SCYLLA-PRODUCT-FILE exists.

so, in this change, we check $BUILD_DIR/SCYLLA-PRODUCT-FILE first,
and fall back to $BUILD_DIR/../SCYLLA-PRODUCT-FILE. this should work
for both the configure.py and CMake build systems.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 22:56:10 +08:00
Kefu Chai
fea3a11716 unified: derive UNIFIED_PKG from --build-dir
we should respect --build-dir if --unified-pkg is not specified,
and deduce the path to the unified pkg from BUILD_DIR.

so, in this change, we deduce the path to the unified pkg from BUILD_DIR
unless --unified-pkg is specified.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 22:56:10 +08:00
Kefu Chai
4bb5af763b unified: add --build-dir option to build_unified.sh
this allows build_unified.sh to generate the unified pkg in a specified
directory, instead of assuming the build/$mode naming convention.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 22:56:10 +08:00
Pavel Emelyanov
d48aff5789 batchlog_manager: Remove start() method
It's now a no-op, can be dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 16:37:52 +03:00
Pavel Emelyanov
3966a50ed4 batchlog_manager: Start replay loop in constructor
... and sanitize the future used on stop.

The loop in question was started in .start(), but all callers now
construct the manager late enough, so the loop spawning can be moved.
This also calls for renaming the future member of the class, and allows
making it a regular, rather than shared, future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 16:35:53 +03:00
Pavel Emelyanov
512465288f main, cql_test_env: Start-stop batchlog manager in one "block"
Currently, starting and stopping of the b.m. is spread over main(). Keep
them close to each other.

Another subtlety here is that calling b.m.::start() can only be done
after joining the cluster, because this start() spawns the replay loop,
which, in turn, calls token_metadata::count_normal_token_owners(), and if
the latter returns zero, the b.m. code uses it as a fraction denominator
and crashes.

With the above in mind, cql_test_env should start batchlog manager after
it "joins the ring" too. For now it doesn't make any difference, but
next patch will make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 16:33:31 +03:00
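The crash mode described above boils down to a division by zero. A tiny sketch (Python rather than C++, with invented names; only the zero-owners hazard is taken from the commit):

```python
# Sketch of the failure mode: the replay logic uses the number of normal
# token owners as a denominator, so running it before the node joins the
# ring (zero owners) would divide by zero. Guard for it explicitly.
def replay_fraction(my_rank: int, normal_token_owners: int) -> float:
    if normal_token_owners == 0:
        raise RuntimeError(
            "batchlog replay must start only after joining the cluster")
    return my_rank / normal_token_owners
```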
Pavel Emelyanov
9f45778467 batchlog_manager: Move shard-0 check into batchlog_replay_loop()
Currently the only caller is the batchlog manager itself. It
checks that the shard id is zero, calls the method, and then the method
asserts that it runs on shard 0.

Moving the check into the method removes the need for assertion and
makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 16:32:12 +03:00
Pavel Emelyanov
38d0ea0916 batchlog_manager: Fix drain() reentrability
Currently drain() is called twice -- first from
storage_service::drain() (on shutdown), second via
batchlog_manager::stop(). The routine is unintentionally re-entrant,
because:
- there is an explicit check for not aborting the abort source twice
- breaking the semaphore can be done multiple times
- co_await-ing the _started future works because the future is shared

That's not extremely elegant; it's better to make drain() bail out early
if it was already called.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 16:30:07 +03:00
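The fix described above amounts to an early-exit guard instead of accidental re-entrancy. A minimal model (Python, invented names; not the actual batchlog_manager code):

```python
# Model of the drain() fix: the second caller bails out early instead of
# relying on every internal teardown step happening to be re-entrant.
class BatchlogManagerModel:
    def __init__(self) -> None:
        self._drained = False
        self.drain_work_runs = 0  # counts how many times teardown ran

    def drain(self) -> None:
        if self._drained:         # second caller (e.g. stop()) returns early
            return
        self._drained = True
        # abort the abort source, break the semaphore, join the replay loop...
        self.drain_work_runs += 1
```

With the guard, calling drain() from both storage_service::drain() and stop() performs the teardown exactly once.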
Avi Kivity
a1b2ca6184 Merge 'build: cmake: package cqlsh and fix the noarch postfix of python3 package' from Kefu Chai
in this series, the packaging of the tools modules is improved:

- package cqlsh as well, since cqlsh should be redistributed as a part of the unified package
- use ${arch} in the postfix of the python3 package, since the python3 package is not architecture independent
- set the version with a tilde for `Scylla_VERSION`, so it can be reused elsewhere

Refs #15241

Closes #15369

* github.com:scylladb/scylladb:
  build: cmake: build cqlsh as a submodule
  build: cmake: always use the version with tilde
  build: cmake: build python3 dist tarball with arch postfix
  build: cmake: use the default comment message
2023-09-12 16:27:03 +03:00
David Garcia
5177ddac17 Support advanced db config scenarios
docs: skip html tags from description

Closes #15338
2023-09-12 15:29:16 +03:00
Tomasz Grabiec
6e83e54b0d Merge 'gossiper: get rid of uses_host_id' from Benny Halevy
This function practically returned true from inception.

In d38deef499
it started using messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.

So there's no point calling it since we can assume
that `uses_host_id` is true for all versions.

Closes #15343

* github.com:scylladb/scylladb:
  storage_service: fixup indentation after last patch
  gossiper: get rid of uses_host_id
2023-09-12 12:44:56 +02:00
Kefu Chai
571fab4179 build: cmake: build cqlsh as a submodule
since we also redistribute cqlsh, let's package it as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 18:18:31 +08:00
Kefu Chai
4ff5ce9933 build: cmake: always use the version with tilde
since we always use a tilde ("~") in the version number,
let's just cache it as an internal variable in CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 18:18:31 +08:00
Kefu Chai
111d20958e build: cmake: build python3 dist tarball with arch postfix
now that `configure.py` always generates the python3 dist tarball with
an ${arch} postfix, let's mirror this behavior, as `build_unified.sh`
uses this naming convention.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 18:18:31 +08:00
Kefu Chai
760b7c8772 build: cmake: use the default comment message
it turns out "Generating submodule python3 in python3" is not
as informative as the default one:
"/home/kefu/dev/scylladb/tools/python3/build/scylla-python3-5.4.0~dev-0.20230908.1668d434e458.noarch.tar.gz"
so let's drop the "COMMENT" argument.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-12 18:18:31 +08:00
Patryk Jędrzejczak
92209996b5 system_keyspace: rename cdc_generation_id_v2
Changing the second value of cdc_generation_id_v2 from uuid_type
to timeuuid_type made the name of cdc_generation_id_v2 unsuitable
because it does not match cdc::generation_id_v2 anymore.
2023-09-12 11:43:34 +02:00
Patryk Jędrzejczak
1c58c6336a system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.

Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match id
in CDC_GENERATIONS_V3.
2023-09-12 11:43:34 +02:00
Patryk Jędrzejczak
fab066cffe cdc: generation: remove topology_description_generator
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
2023-09-12 11:18:54 +02:00
Patryk Jędrzejczak
3bf4cac72e cdc: do not create uuid in make_new_generation_data
In the future commit, we change how we initialize uuid of the
new CDC generation in the Raft-based topology. It forces us to
move this initialization out of the make_new_generation_data
function shared between Raft-based and gossiper-based topologies.

We also rename make_new_generation_data to
make_new_generation_description since it only returns
cdc::topology_description now.
2023-09-12 11:18:38 +02:00
Patryk Jędrzejczak
2cd430ac80 system_keyspace: make CDC_GENERATIONS_V3 single-partition
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.

After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
2023-09-12 09:51:45 +02:00
Patryk Jędrzejczak
29f54836d0 cdc: generation: introduce get_common_cdc_generation_mutations
In the following commit, we implement the
get_cdc_generation_mutations_v3 function very similar to
get_cdc_generation_mutations_v2. The only differences in creating
mutations between CDC_GENERATIONS_V2 and CDC_GENERATIONS_V3 are:
- a need to set the num_ranges cell for CDC_GENERATIONS_V2,
- different partition keys,
- different clustering keys.

To avoid code duplication, we introduce
get_common_cdc_generation_mutations, which does most of the work
shared by both functions.
2023-09-12 09:37:21 +02:00
Botond Dénes
bc4b3e4fa3 Merge 'build: cmake: add packaging support ' from Kefu Chai
this change allows CMake to build the dist tarball for a certain build.

Refs https://github.com/scylladb/scylladb/issues/15241

Closes #15352

* github.com:scylladb/scylladb:
  build: cmake: add packaging support
  build: cmake: enable build of seastar/apps/iotune
2023-09-12 09:59:53 +03:00
Pavel Emelyanov
3d0a5f2173 test: Extend object_store test to validate GC works
The test-case creates an S3-backed ks, populates it with a table and data,
then forces flush to make sstables appear on the backend. Then it
updates the registry by marking all the objects as 'removing' so that on
next boot they will be garbage-collected.

After reboot, check that the table is "empty" and also validate that the
backend really no longer has the corresponding objects.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 09:56:13 +03:00
Pavel Emelyanov
2c9ec6bc93 sstable_directory: Garbage collect S3 sstables on reboot
When booting, there can be dangling entries in the sstables registry as well
as objects on the storage itself. This patch makes the S3 lister list
those entries and then kick s3_storage to remove the corresponding
objects. At the end, the dangling entries are removed from the registry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 09:56:13 +03:00
Pavel Emelyanov
6cb4e3d05a sstable_directory: Pass storage to garbage_collect()
The lister method is going to list the dangling objects and then call
storage to actually wipe them (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 09:29:34 +03:00
Pavel Emelyanov
a957e97ab4 sstable_directory: Create storage instance too
Right now the directory instance only creates the lister, but the lister is
unaware of exact object manipulations. The storage is, so create it
too; it's going to be used by the garbage collector in the next patches.

This change also requires fixing the way cql_test_env is configured for
schema_change_test. There are cases that try to pick up a keyspace with the S3
storage option from the pre-created sstables, and populating those would
need some (even empty) object storage endpoint to be provided.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-12 09:29:34 +03:00
Avi Kivity
89ba4e4a5e Merge 'Stop using anonymous minio bucket for tests' from Pavel Emelyanov
Currently minio starts with a bucket that has public anonymous access. Consequently, all tests use unsigned S3 requests. That was done for simplicity; it's better to apply some policy to the bucket and, consequently, make tests sign their requests.

Other than the obvious benefit that we test requests signing in unit tests, another goal of this PR is to make it possible to simulate and test various error paths locally, e.g. #13745 and #13022

Closes #14525

* github.com:scylladb/scylladb:
  test/s3: Remove AWS_S3_EXTRA usage
  test/s3: Run tests over non-anonymous bucket
  test/minio: Create random temp user on start
  code: Rename S3_PUBLIC_BUCKET_FOR_TEST
2023-09-11 23:12:56 +03:00
Tomasz Grabiec
f77e90a0f0 tests: test_tablets: Reconnect the driver after server restart
This is a workaround for the flakiness of the test, where INSERT
statements following the rolling restart fail with a "No host available"
exception. The hypothesis is that those INSERTs race with the driver
reconnecting to the cluster, and if INSERTs are attempted before
reconnection is finished, the driver will refuse to execute the
statements.

The real fix should be in the driver, to join with reconnections, but
before that is ready we want to fix the CI flakiness.

Refs #14746

Closes #15355
2023-09-11 21:58:46 +03:00
Kefu Chai
34e3302c01 dbuild: use --userns option when using podman
instead of fabricating `/etc/passwd` manually, we can just
leave it to podman to add an entry to `/etc/passwd` in the container,
as podman allows us to map the user's account to the same UID in the
container. see
https://docs.podman.io/en/stable/markdown/options/userns.container.html.

this is not only a cosmetic change, it also avoids the permission-denied
failure when accessing `/etc/passwd` in the container when selinux is
enabled. without this change, we would otherwise need to add the
selinux label to the bind volume with the ':Z' option to address failures
like:

```
type=AVC msg=audit(1693449115.261:2599): avc:  denied  { open } for  pid=2298247 comm="bash" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1693449115.263:2600): avc:  denied  { open } for  pid=2298249 comm="id" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
```

found in `/var/log/audit/audit.log`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15230
2023-09-11 21:41:48 +03:00
Avi Kivity
b8a655f55e Update tools/python3 submodule
* tools/python3 45fbd05...3e833f1 (1):
  > install.sh: replace <tab> with spaces
2023-09-11 21:38:02 +03:00
Avi Kivity
628e6ffd33 Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stems from the fact that the resulting reconcilable_result will be large:

1. Large allocations.  Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator  does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.

This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of the query.

Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).

Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111

Closes #14923

* github.com:scylladb/scylladb:
  test/topology_custom: add test_read_repair.py
  replica/mutation_dump: detect end-of-page in range-scans
  tools/scylla-sstable: write: abort parser thread if writing fails
  test/pylib: add REST methods to get node exe and workdir paths
  test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
  service/storage_proxy: add trace points for the actual read executor type
  service/storage_proxy: add trace points for read-repair
  storage_proxy: Add more trace-level logging to read-repair
  database: Fix accounting of small partitions in mutation query
  database, storage_proxy: Reconcile pages with no live rows incrementally
2023-09-11 19:20:19 +03:00
Kefu Chai
a0dcbb09c3 build: cmake: add packaging support
this change allows CMake to build the dist tarball for a certain build.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-11 23:05:30 +08:00
Kefu Chai
649c8f248d build: cmake: enable build of seastar/apps/iotune
scylla redistributes iotune, so let's enable the related build
options, so that we can build iotune on demand.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-11 23:02:52 +08:00
Nadav Har'El
45ec76cfbf Merge 'Enlighten native-transport shutdown' from Pavel Emelyanov
When the `nodetool disablebinary` command executes, its handler aborts listening sockets, shuts down all client connections _and_ (!) then waits for the connections to stop existing. Effectively the command tries to make sure that no activity initiated by a CQL query continues, even though the client would never see its result (client sockets are closed).

This sometimes makes the disablebinary command hang for a long time, which is not really nice. The proposal is to wait for the connections to terminate in the background. So once the disablebinary command exits, what's guaranteed is that all client connections are aborted and new connections are not admitted, but some activity started by them may still be running (e.g. up until `nodetool drain` is issued). Driver-side sockets won't get the queries' results anyway.

The behavior of `disablebinary` is not documented wrt whether it should wait for CQL processing to stop or not, so technically we're not breaking anything. However, it may turn out to be a disruptive change, and some setups may behave differently after it.

refs: #14031
refs: #14711

Closes #14743

* github.com:scylladb/scylladb:
  test/cql-pytest: Add enable|disable-binary test case
  test.py: Add suite option to auto-dirty cluster after test
  test/pylib: Add nodetool enable|disable-binary commands
  transport: Shutdown server on disablebinary
  generic_server: Introduce shutdown()
  generic_server: Decouple server stopped from connection stopped
  transport/controller: Coroutinize do_stop_server()
  transport/controller: Coroutinize stop_server()
2023-09-11 17:54:52 +03:00
Pavel Emelyanov
821a9c1fd4 test/cql-pytest: Add enable|disable-binary test case
The test checks that `nodetool disablebinary` makes subsequent queries
fail and `nodetool enablebinary` lets clients establish new
connections.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:38:49 +03:00
Pavel Emelyanov
375b8c6213 test.py: Add suite option to auto-dirty cluster after test
ScyllaCluster can be marked as 'dirty', which means that the cluster is
in an unusable state (after a test) and shouldn't be re-used by other tests
launched by test.py. For now this is only implemented via the cluster
manager class, which is only available for topology tests.

Add a less flexible shortcut for cql-pytest suites via suite.yaml marking.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:37:48 +03:00
Pavel Emelyanov
2c3b30b395 test/pylib: Add nodetool enable|disable-binary commands
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:37:48 +03:00
Pavel Emelyanov
b42391bfbe transport: Shutdown server on disablebinary
... and do the real "sharded::stop" in the background. On node shutdown
we need to pick up all dangling background stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:37:48 +03:00
Pavel Emelyanov
4682c7f9a5 generic_server: Introduce shutdown()
The method waits for listening sockets to stop listening and aborts the
connected sockets, but doesn't wait for the established connections to
finish processing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:37:48 +03:00
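The shutdown()/stop() split can be illustrated with a small Python threading sketch (all names invented for the example; the real generic_server is C++/Seastar, this only mirrors the idea):

```python
import threading

class Server:
    """shutdown() aborts connections without waiting for their handlers;
    stop() additionally joins them, like the final full stop."""
    def __init__(self):
        self.listening = True
        self._abort = threading.Event()  # signals connections to abort
        self._handlers = []              # in-flight connection handlers

    def handle(self, fn):
        t = threading.Thread(target=fn, args=(self._abort,))
        t.start()
        self._handlers.append(t)

    def shutdown(self):
        # Stop accepting and abort connected sockets; handlers may still
        # be draining in the background when this returns.
        self.listening = False
        self._abort.set()

    def stop(self):
        # Full stop: shutdown plus waiting for every handler to finish.
        self.shutdown()
        for t in self._handlers:
            t.join()

srv = Server()
done = []
srv.handle(lambda abort: (abort.wait(), done.append("drained")))
srv.shutdown()                # returns promptly; the handler may still run
assert srv.listening is False
srv.stop()                    # now the handler is guaranteed finished
```

The point of the design is that shutdown() gives the caller a fast "no new work is admitted" guarantee, while the slower "all work finished" guarantee is deferred to stop().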
Pavel Emelyanov
6dcf653995 generic_server: Decouple server stopped from connection stopped
The _stopped future resolves when all "sockets" stop -- listening and
connected ones. Future patches will need to wait for listening sockets
to stop separately from connected ones.

Rename the `_stopped` to reflect what it is now while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:32:07 +03:00
Pavel Emelyanov
bc2d44994a transport/controller: Coroutinize do_stop_server()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:32:07 +03:00
Pavel Emelyanov
7701aa0789 transport/controller: Coroutinize stop_server()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-11 17:32:07 +03:00
Benny Halevy
7119c1d8cc token_metadata: update_topology: make endpoint_dc_rack arg optional
It's better to pass a disengaged optional when
the caller doesn't have the information, rather than
passing the default dc_rack location, so that the latter
never implicitly overrides a known endpoint dc/rack location.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15300
2023-09-11 16:16:19 +02:00
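The reasoning above -- a disengaged optional means "no information", while a default value would silently clobber known state -- can be sketched in Python (names and shapes invented for illustration, not the actual token_metadata API):

```python
from typing import Dict, Optional, Tuple

DcRack = Tuple[str, str]

def update_topology(locations: Dict[str, DcRack], endpoint: str,
                    dc_rack: Optional[DcRack] = None) -> None:
    # None means the caller has no dc/rack information: keep what we know.
    # Passing a *default* location here instead would overwrite it.
    if dc_rack is not None:
        locations[endpoint] = dc_rack

locations = {"10.0.0.1": ("dc1", "rack7")}
update_topology(locations, "10.0.0.1")                    # no info: unchanged
update_topology(locations, "10.0.0.2", ("dc2", "rack1"))  # known location
```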
Benny Halevy
08f8fd30ea gossiper: get rid of comment about advertise_removing
It was deleted in 66ff072540.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20230911140349.1809014-1-bhalevy@scylladb.com>
2023-09-11 16:14:26 +02:00
Benny Halevy
ed32ba7431 storage_service: fixup indentation after last patch
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-11 16:51:28 +03:00
Benny Halevy
f855479c9d gossiper: get rid of uses_host_id
This function has practically returned true since its inception.

In d38deef499 it started using messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.

So there's no point in calling it, since we can assume
that `uses_host_id` is true for all versions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-11 16:48:07 +03:00
Kefu Chai
87088b65b6 util: replace <tab> with spaces
to align with seastar's coding-style.md: scylladb uses seastar's
coding style, so let's adhere to it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15345
2023-09-11 14:38:46 +03:00
Botond Dénes
e5f724d5bd Merge 'create-relocatable-package.py: add --node-exporter-dir and --build-dir options' from Kefu Chai
this series adds `--node-exporter-dir` and `--build-dir` options to `create-relocatable-package.py`. this enables us to create the relocatable package from arbitrary build directories.

Refs #15241

Closes #15299

* github.com:scylladb/scylladb:
  create-relocatable-package.py: add --node-exporter-dir option
  build: specify the build dir instead mode
2023-09-11 14:32:53 +03:00
Botond Dénes
0c107c2076 Merge 'dist/debian: add command line option for builddir ' from Kefu Chai
so we can point `debian_files_gen.py` to a builddir other than
'build', and can optionally use another output directory. this would
help reduce the number of "magic numbers" in our build system.

Refs https://github.com/scylladb/scylladb/issues/15241

Closes #15282

* github.com:scylladb/scylladb:
  dist/debian: specify debian/* file encodings
  dist/debian: wrap lines whose length exceeds 100 chars
  dist/debian: add command line option for builddir
  dist/debian: modularize debian_files_gen.py
2023-09-11 14:31:33 +03:00
Botond Dénes
f770ff7a2b test/topology_custom: add test_read_repair.py 2023-09-11 07:07:12 -04:00
Botond Dénes
b55cead5cd replica/mutation_dump: detect end-of-page in range-scans
The current read-loop fails to detect end-of-page, and if the query
result builder cuts the page, it will just proceed to the next
partition. This results in distorted query results, as the result
builder will request that consumption stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
2023-09-11 07:02:14 -04:00
Botond Dénes
82f4563757 tools/scylla-sstable: write: abort parser thread if writing fails
Currently, if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this will be a no-op, but on the error path, this will
ensure that the parser thread doesn't hang.
2023-09-11 07:02:14 -04:00
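The hang-on-error scenario and the teardown-side abort can be mimicked with a bounded queue in Python (illustrative only; the real code is a C++ seastar thread, and every name here is made up):

```python
import queue
import threading

class StreamParser:
    """Producer thread pushing parsed items into a bounded queue; close()
    aborts it so it cannot block forever once the consumer stops reading."""
    def __init__(self, items):
        self._out = queue.Queue(maxsize=1)   # bounded: put() can block
        self._abort = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(items,))
        self._thread.start()

    def _run(self, items):
        for item in items:
            while not self._abort.is_set():
                try:
                    self._out.put(item, timeout=0.05)
                    break
                except queue.Full:
                    continue          # re-check the abort flag and retry
            if self._abort.is_set():
                return                # error path: stop producing

    def get(self):
        return self._out.get()

    def close(self):
        # Analogue of the destructor fix: a no-op if the thread already
        # exited, otherwise it unblocks the stuck producer.
        self._abort.set()
        self._thread.join()

p = StreamParser(list(range(100)))
first = p.get()   # consumer reads one item, then bails out (error path)
p.close()         # without the abort, the producer would hang on put()
```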
Botond Dénes
46e37436d0 test/pylib: add REST methods to get node exe and workdir paths 2023-09-11 07:02:14 -04:00
Botond Dénes
dc269cb6bd test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
To support the equivalent (roughly) of the following nodetool commands:
* nodetool refresh
* nodetool flush
* nodetool compact
2023-09-11 07:01:20 -04:00
Botond Dénes
ff29f43060 service/storage_proxy: add trace points for the actual read executor type
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
2023-09-11 06:56:13 -04:00
Botond Dénes
727e519c3a service/storage_proxy: add trace points for read-repair
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
f76f5f6bfe storage_proxy: Add more trace-level logging to read-repair
Extremely helpful in debugging.
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
0d773c9f9f database: Fix accounting of small partitions in mutation query
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.

Fix by accounting the real size of the accumulated frozen_mutation.

Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.

Refs #7933
2023-09-11 06:56:13 -04:00
Tomasz Grabiec
2c8a0e4175 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
are a lot of dead rows or partitions before we reach a live row, which
stems from the fact that the resulting reconcilable_result will be large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical duration (a few ms of
variation can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-11 06:56:13 -04:00
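The core idea -- count every fragment, live or dead, against the page's memory budget and cut the page when the budget is exhausted -- can be sketched like this (Python toy model with invented names; the real accounter lives in the C++ query code):

```python
class MemoryAccounter:
    def __init__(self, limit_bytes):
        self._limit = limit_bytes
        self._used = 0

    def account(self, nbytes):
        # Returns False once the page's memory budget is exhausted.
        self._used += nbytes
        return self._used < self._limit

def paginate(fragments, limit_bytes):
    # fragments: (size_bytes, is_live) pairs. Before the fix, only live rows
    # could end a page, so a long run of dead rows grew one huge page.
    pages, page, acct = [], [], MemoryAccounter(limit_bytes)
    for frag in fragments:
        size, _is_live = frag
        page.append(frag)
        if not acct.account(size):   # budget exhausted: cut the page here
            pages.append(page)
            page, acct = [], MemoryAccounter(limit_bytes)
    if page:
        pages.append(page)
    return pages

# The 1M-dead-rows scenario in miniature: 5 dead rows, then the live one,
# with a 250-byte page budget. The dead rows now cut pages too.
pages = paginate([(100, False)] * 5 + [(100, True)], 250)
```

With dead rows exempt from accounting (the "before" behavior), this input would come back as a single unbounded page; with the accounter applied uniformly, it splits into small pages.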
Patryk Jędrzejczak
ed1c1369d9 cdc: generation: rename get_cdc_generation_mutations
In the following commits, we modify the CDC_GENERATIONS_V3 schema
to enable efficient clearing of obsolete CDC generation data.
These modifications make the current get_cdc_generation_mutations
work only for the CDC_GENERATIONS_V2 schema, and we need a new
function for CDC_GENERATIONS_V3, so we add the "_v2" suffix.
2023-09-11 12:30:21 +02:00
Botond Dénes
685486a20d Update tools/python3 submodule
* tools/python3 30b8fc21...45fbd056 (1):
  > build_reloc: do not run SCYLLA-VERSION-GEN twice
2023-09-11 10:59:56 +03:00
Kefu Chai
2bbffccaca SCYLLA-VERSION-GEN: do not print version by default
actually, we never use its output in our workflow, and the
output is distracting when building the package. so, in this
change, let's print it only on demand. this feature is preserved
just in case some of us want to use this script to get
the version number string.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15327
2023-09-11 10:50:50 +03:00
Kefu Chai
bcc05305ae build: cmake: set the default CMAKE_BUILD_TYPE to Release
if the user fails to set "CMAKE_BUILD_TYPE", it will be empty, and
CMake will fail with confusing error messages like

```
CMake Error at CMakeLists.txt:21 (list):
  list sub-command FIND requires three arguments.

CMake Error at CMakeLists.txt:27 (include):
  include could not find requested file:

    mode.
```

so, in this change

* set the default CMAKE_BUILD_TYPE to "Release"
* quote ${CMAKE_BUILD_TYPE} when searching for it
  in the list of allowed build types.

this should address the issues above.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15326
2023-09-11 10:49:28 +03:00
Botond Dénes
b062b245ad Merge 'Don't cache dc:rack on system keyspace local cache' from Pavel Emelyanov
The local node's dc:rack pair is cached in the system keyspace on start. However, most other code doesn't need it, as it gets dc:rack from the topology or directly from the snitch. There are a few places left that still touch the system keyspace cache, but they are easy to patch. So after this patch, all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.

Closes #15280

* github.com:scylladb/scylladb:
  system_keyspace: Don't require snitch argument on start
  system_keyspace: Don't cache local dc:rack pair
  system_keyspace: Save local info with explicit location
  storage_service: Get endpoint location from snitch, not system keyspace
  snitch: Introduce and use get_location() method
  repair: Local location variables instead of system keyspace's one
  repair: Use full endpoint location instead of datacenter part
2023-09-11 10:26:26 +03:00
Nadav Har'El
ea56c8efcd test/alternator: reduce code duplication in test for list_append()
A reviewer noted that test_update_expression_list_append_non_list_arguments
has too much code duplication - the same long API call to run
"SET a = list_append(...)" was repeated many times.

So in this patch we add a short inner function "try_list_append" to
avoid this duplication.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes: #15298
2023-09-11 10:09:35 +03:00
David Garcia
a14bcf7c6a docs: improve configuration properties reference
- Adds type for each option.
- Filters out unused / invalid values, moves them to a separate section.
- Adds the term "liveness" to the glossary.
- Removes unused and invalid properties from the docs.
- Updates to the latest version of pyaml.

docs: rename config template directive

Closes #15164
2023-09-11 09:47:16 +03:00
Botond Dénes
d92620868d Merge 'docs: improve command line samples in unified-installer.rst' from Kefu Chai
in this series, we try to improve `unified-installer.rst`

- encourage the user to install a smaller package
- run `./install.sh` directly instead of relying on `sh` pointing to `bash`

Closes #15325

* github.com:scylladb/scylladb:
  doc: run install.sh directly
  doc: install headless jdk in sample command line
2023-09-11 09:34:14 +03:00
Botond Dénes
7385f93816 Merge 'Task manager repair tasks progress' from Aleksandra Martyniuk
Find progress of repair tasks based on the number of ranges
that have been repaired.

Fixes: [#1156](https://github.com/scylladb/scylla-enterprise/issues/1156).

Closes #14698

* github.com:scylladb/scylladb:
  test: repair tasks test
  repair: add methods making repair progress more precise
  tasks: make progress related methods virtual
  repair: add get_progress method to shard_repair_task_impl
  repair: add const noexcept qualifiers to shard_repair_task_impl::ranges_size()
  repair: log a name of a particular table repair is working on
  tasks: delete move and copy constructors from task_manager::task::impl
2023-09-11 09:32:23 +03:00
Guy Shtub
8606cca64f updating presto integration page 2023-09-11 08:31:20 +03:00
Aleksandra Martyniuk
932f39e37c compaction: warn about compaction_done()
compaction_done() returns a ready future before compaction_task_executor::run_compaction()
is called, even though the compaction has not started.

Make compaction_done() private and add a comment to warn against
incorrect usage.
2023-09-09 11:19:11 +02:00
Aleksandra Martyniuk
59b7a45f73 compaction: do not run stopped compaction
Before compaction_task_executor::do_run is called, the executor can
already be aborted. Check if the compaction was stopped and set
_compaction_done to an exceptional future.
2023-09-09 11:19:11 +02:00
Aleksandra Martyniuk
515b8d4890 compaction: modify lowest compaction tasks' run method
For compaction_task_executors, unlike all other task manager
tasks, the run method does not encompass the operations performed in the
scope of a task, but only waits until the shared_future connected with
those operations is resolved.

Apart from breaking the task manager's task conventions, such a run
method must consider all corner cases so as not to break task manager or
compaction manager functionality.

To fix existing bugs, and to prevent further ones related to task
manager and compaction manager coexistence, call perform_task inside
the run method and wait for it in the standard way.

Executors that are not going to be reflected in the task manager call
perform_task the old way.
2023-09-09 11:19:11 +02:00
Aleksandra Martyniuk
832df38d26 compaction: pass do_throw_if_stopping to compaction_task_executor
As a preparation for further changes, keep do_throw_if_stopping flag
as a member of compaction_task_executor.
2023-09-09 11:19:11 +02:00
Raphael S. Carvalho
c7e02a1077 storage_service: Enforce tablet streaming runs on shard 0
A SIGSEGV was caught during tablet streaming; the reason was
that storage_service::_group0 (set via set_group0()) is only set on
shard 0, so when streaming ran on any other shard,
it dereferenced garbage, which resulted in the crash.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15307
2023-09-08 20:45:13 +03:00
Kefu Chai
ce6464b649 sstable: do not call into sstable in filesystem_storage::open()
before this change, filesystem_storage::open() reuses
`sstable::make_component_file_writer()` to create the
temporary toc; it then renames the temporary toc to the
real TOC when sealing the sstable.

but this prevents us from reusing filesystem_storage in
yet another storage backend. as the

1. create temporary
2. rename temporary to toc

dance only applies to filesystem_storage. when
filesystem_storage calls into sstable, it calls `sst.make_component_file_writer()`,
which in turn calls the `_storage->make_component_sink()`.
but at this moment, `_storage` is not necessarily `filesystem_storage`
anymore. it could be a wrapper around `filesystem_storage`,
which is not aware of the create-rename dance. and could do
a lot more than create a temporary file when asked to
"make_component_sink()".

if we really want to go this way by reusing sstable's API
in `filesystem_storage` to create a temporary toc, we will
have to rename whatever temporary toc component was created
by the wrapper backend to the toc with the seal() func. but
again, this rename op is only implemented in the
filesystem_storage backend. to mirror this operation in
the wrapper backend does not make sense at all -- it
does not have to be aware of the filesystem_storage's internals.

so in this change, instead of reusing the
`sstable::make_component_file_writer()`, we just inline
its implementation in filesystem_storage to avoid this
problem. this is also an improvement from the design
perspective, as the storage should not call into
the higher abstraction -- sstable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14443
2023-09-08 19:57:39 +03:00
Kefu Chai
ce291f4385 s3/client: do not use deprecated tls::connect() overload
seastar has deprecated the overload which accepts `server_name`,
let's use the one which accepts `tls::tls_options`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15324
2023-09-08 18:44:45 +03:00
Avi Kivity
0656810c28 Update tools/java submodule
* tools/java 585b30fda6...9dddad27bf (1):
  > install-dependencies.sh: do not install weak dependencies

Frozen toolchain regenerated.

Closes #15322
2023-09-08 17:22:07 +03:00
Kamil Braun
26d9a82636 Merge 'raft topology: replace publish_cdc_generation with a bg fiber' from Patryk Jędrzejczak
Currently, the topology coordinator has the
`topology::transition_state::publish_cdc_generation` state responsible
for publishing the already created CDC generations to the user-facing
description tables. This process cannot fail as it would cause some CDC
updates to be missed. On the other hand, we would like to abort the
`publish_cdc_generation` state when bootstrap aborts. Of course, we
could also wait until handling this state finishes, even in the case of
the bootstrap abort, but that would be inefficient. We don't want to
unnecessarily block topology operations by publishing CDC generations.

The solution proposed by this PR is to remove the
`publish_cdc_generation` state completely and introduce a new background
fiber of the topology coordinator -- `cdc_generation_publisher` -- that
continually publishes committed CDC generations.

Apart from introducing the CDC generation publisher, we add
`test_cdc_generation_publishing.py` that verifies its correctness and we
adapt other CDC tests to the new changes.

Fixes #15194

Closes #15281

* github.com:scylladb/scylladb:
  test: test_cdc: introduce wait_for_first_cdc_generation
  test: move cdc_streams_check_and_repair check
  test: add test_cdc_generation_publishing
  docs: remove information about publish_cdc_generation
  raft topology: introduce the CDC generation publisher
  system_keyspace: load unpublished_cdc_generations to topology
  raft topology: mark committed CDC generations as unpublished
  raft topology: add unpublished_cdc_generations to system.topology
2023-09-08 15:08:41 +02:00
Kamil Braun
8bff5843b5 Merge 'test: topology: add tests for gossiper/endpoint/live and gossiper/endpoint/down' from Aleksandra Martyniuk
Add tests for gossiper/endpoint/live and gossiper/endpoint/down
which run only in release mode.

Enable test_remove_node_with_concurrent_ddl and fix types and
variables names used by it, so that they can be reused in gossiper
test.

Fixes: #15223.

Closes #15244

* github.com:scylladb/scylladb:
  test: topology: add gossiper test
  test: fix types and variable names in wait_for_host_down
2023-09-08 12:43:11 +02:00
Nadav Har'El
548386a0bb treewide: reduce include of cql_statement.hh
ClangBuildAnalyzer reports cql3/cql_statement.hh as being one of the
most expensive header files in the project - being included (mostly
indirectly) in 129 source files, and costing a total of 844 CPU seconds
of compilation.

This patch is an attempt, only *partially* successful, to reduce the
number of times that cql_statement.hh is included. It succeeds in
lowering the number 129 to 99, but not less :-( One of the biggest
difficulties in reducing it further is that query_processor.hh includes
a lot of templated code, which needs stuff from cql_statement.hh.
The solution should be to un-template the functions in
query_processor.hh and move them from the header to a source file, but
this is beyond the scope of this patch and query_processor.hh appears
problematic in other respects as well.

Unfortunately the compilation speedup by this patch is negligible
(the `du -bc build/dev/**/*.o` metric shows less than 0.01% reduction).
Beyond the fact that this patch only removes 30% of the inclusions of
this header, it appears that most of the source files that no longer
include cql_statement.hh after this patch, included anyway many of the
other headers that cql_statement.hh included, so the saving is minimal.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15212
2023-09-08 13:23:50 +03:00
Kefu Chai
7591b1b384 doc: run install.sh directly
strictly speaking, `sh` is not necessarily bash, while `install.sh`
is written in the Bash dialect and errors out if it is not executed
with Bash. we also don't need to add "-x" when running the script; if
we have to, we should add it in `install.sh`, not ask the user to add
this option. also, `install.sh` is executable with a shebang line using
bash, so we can just execute it.

so, in this change, we just launch the script directly in the command
line sample.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-08 17:21:30 +08:00
Kefu Chai
1e0c7d14aa doc: install headless jdk in sample command line
in comparison with java-11-openjdk, java-11-openjdk-headless does not
offer audio and video support, and has fewer dependencies. for instance,
java-11-openjdk depends on the X11 libraries, and it also provides
icons representing the JDK. but since scylla is a server-side application,
we don't expect users to run a desktop on it, so there is no need to
support audio and video.

in this change, we just suggest a "smaller" package, which is
actually also a dependency of java-11-openjdk.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-08 17:21:30 +08:00
Patryk Jędrzejczak
23a4557662 test: test_cdc: introduce wait_for_first_cdc_generation
After introducing the CDC generation publisher,
test_cdc_log_entries_use_cdc_streams could (at least in theory)
fail by accessing system_distributed.cdc_streams_descriptions_v2
before the first CDC generation has been published.

To avoid flakiness, we simply wait until the first CDC generation
is published in a new function -- wait_for_first_cdc_generation.
2023-09-08 09:05:01 +02:00
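A wait-until-published helper of this kind is typically a bounded polling loop. Below is a minimal, generic sketch of the idea; the names and the fake query results are illustrative, not the actual test code:

```python
import time

def wait_for_condition(check, deadline=30.0, period=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns a truthy value or the deadline passes."""
    end = clock() + deadline
    while True:
        result = check()
        if result:
            return result
        if clock() >= end:
            raise TimeoutError("condition not met before deadline")
        sleep(period)

# Example: successive query results; empty lists model "generation not
# yet published", so the helper keeps polling until rows show up.
results = [[], [], ["gen-1"]]
first_generation = wait_for_condition(lambda: results.pop(0), period=0)
```

In the real test the `check` callable would query `system_distributed.cdc_streams_descriptions_v2` instead of popping from a list.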
Patryk Jędrzejczak
3a2c080cbe test: move cdc_streams_check_and_repair check
The part of test_topology_ops that tests the
cdc_streams_check_and_repair request could (at least in theory)
fail on
`assert(len(gen_timestamps) + 1 == len(new_gen_timestamps))`
after introducing the CDC generation publisher because we can
no longer assume that all previously committed CDC generations
have been published before sending the request.

To prevent flakiness, we move this part of the test to
test_cdc_generations_are_published. This test allows for ensuring
that all previous CDC generations have been published.
Additionally, checking cdc_streams_check_and_repair there is
simpler and arguably fits the test better.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
4ee68a47bb test: add test_cdc_generation_publishing
We add two test cases that test the new CDC generation publisher
to detect potential bugs like incorrect order of publications or
not publishing some generations at all.

The purpose of the second test case --
test_multiple_unpublished_cdc_generations -- is to enforce and test
a scenario when there are multiple unpublished CDC generations at
the same time. We expect that this is a rare case. The main fiber
of the topology coordinator would have to make much more progress
(like finishing two bootstraps) than the CDC generation publisher
fiber. Since multiple unpublished CDC generations might never
appear in other tests but could be handled incorrectly, having
such a test is valuable.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
2643ccc70e docs: remove information about publish_cdc_generation
We update documentation after replacing the
topology::transition_state::publish_cdc_generation state with
the CDC generation publisher fiber.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
fc1ee2cc14 raft topology: introduce the CDC generation publisher
Currently, the topology coordinator has the
topology::transition_state::publish_cdc_generation state
responsible for publishing the already created CDC generations
to the user-facing description tables. This process cannot fail
as it would cause some CDC updates to be missed. On the other
hand, we would like to abort the publish_cdc_generation state when
bootstrap aborts. Of course, we could also wait until handling this
state finishes, even in the case of the bootstrap abort, but that
would be inefficient. We don't want to unnecessarily block topology
operations by publishing CDC generations.

The solution is to remove the publish_cdc_generation state
completely and introduce a new background fiber of the topology
coordinator -- cdc_generation_publisher -- that continually
publishes committed CDC generations.

The implementation of the CDC generation publisher is very similar
to the main fiber of the topology coordinator. One noticeable
difference is that we don't catch raft::commit_status_unknown,
which is handled in raft_group0_client::add_entry.

Note that this modification changes the Raft-based topology a bit.
Previously, the publish_cdc_generation state had to end before
entering the next state -- write_both_read_old. Now, committed
CDC generations can theoretically be published at any time.
Although it is correct because the following states don't depend on
publish_cdc_generation, it can cause problems in tests. For example,
we can't assume now that a CDC generation is published just because
the bootstrap operation has finished.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
d404443b54 system_keyspace: load unpublished_cdc_generations to topology
We extend service::topology with the list of unpublished CDC
generations and load its contents from system.topology. This step
is the last one in making unpublished CDC generations accessible
to the topology coordinator.

Note that when we load unpublished_cdc_generations, we don't
perform any sanity checks, unlike for current_cdc_generation_uuid.
Every unpublished CDC generation was the current generation once,
and we checked it at that moment.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
bc726a066f raft topology: mark committed CDC generations as unpublished
We add committed CDC generations to unpublished_cdc_generations
so that we can load them to topology and properly handle them
in the following commits.
2023-09-08 09:05:01 +02:00
Patryk Jędrzejczak
5ed9d4db6d raft topology: add unpublished_cdc_generations to system.topology
In the following commits, we replace the
topology::transition_state::publish_cdc_generation state with
a background fiber that continually publishes committed CDC
generations. To make these generations accessible to the
topology coordinator, we store them in the new column of
system.topology -- unpublished_cdc_generations.
2023-09-08 09:05:01 +02:00
Israel Fruchter
3d082acd29 Update tools/cqlsh submodule
* tools/cqlsh 2254e920...66ae7eac (5):
  > switch from `ssl_options` to `ssl_context`
  > cqlsh should use cql v4 by default when connecting #44
  > Revert "Skip pp38-macosx wheel builds"
  > update to newer cibuildwheel
  > Skip pp38-macosx wheel builds

Closes #15308
2023-09-07 22:48:37 +03:00
Aleksandra Martyniuk
8a65477202 tasks: db: change default task_ttl value
If a test isn't going to use the task manager or isn't interested in
the statuses of finished tasks, then keeping them in memory
for some time (currently 10s by default) after they finish
is a waste of memory.

Set default task_ttl value to zero. It can be changed by setting
--task-ttl-in-seconds or through rest api (/task_manager/ttl).

In conf/scylla.yaml set task-ttl-in-seconds to 10.

Closes #15239
2023-09-07 12:42:29 +03:00
Nadav Har'El
42e26ab13b Merge 'Explicitly use do_with_cql_env_thread in query test' from Pavel Emelyanov
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda with seastar::async(). The cql env already provides a helper for that

Closes #15305

* github.com:scylladb/scylladb:
  cql_query_test: Fix indentation after previous patch
  cql_query_test: Use do_with_cql_env_thread() explicitly
2023-09-07 11:54:54 +03:00
Pavel Emelyanov
4dc4f65b18 test/s3: Remove AWS_S3_EXTRA usage
Now that the keys and region can be configured with "standard"
environment variables, the old custom one can be removed. No automation
uses it; it was purely a support for manual testing of a client against
AWS's S3 server.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 11:16:13 +03:00
Pavel Emelyanov
1d00cc5baa test/s3: Run tests over non-anonymous bucket
Currently minio applies anonymous public policy for the test bucket and
all tests just use unsigned S3 requests. This patch generates a policy
for the temporary minio user and removes the anon public one. All tests
are updated accordingly to use the provided key:secret pair.

The use-https bit is off by default as minio still starts with plain
http. That's OK for now, all tests are local and have no secret data
anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 11:16:13 +03:00
Benny Halevy
c5e4dace8e gossiper: real_mark_alive: do not erase from unreachable_endpoints without holding lock
This code was supposed to be moved into
`mutate_live_and_unreachable_endpoints`
in 2c27297dbd
but it looks like the original statements were left
in place outside the mutate function.

This patch just removes the stale code since the required
logic is already done inside `mutate_live_and_unreachable_endpoints`.

Fixes scylladb/scylladb#15296

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15304
2023-09-07 10:02:49 +02:00
Pavel Emelyanov
bff8064abd test/minio: Create random temp user on start
The user is going to have rights to access the test bucket. For now,
just create one and export it to the tests via the environment

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 10:40:12 +03:00
Pavel Emelyanov
e8e8539c7c code: Rename S3_PUBLIC_BUCKET_FOR_TEST
The bucket is going to stop being public, rename the env variable in
advance to make the essential patch smaller

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 10:25:53 +03:00
Pavel Emelyanov
308db51306 s3/client: Add IO stats metrics
These metrics mimic the existing IO ones -- total number of read
operations, total number of read bytes, and total read delay. And the
same for writing.

This patch makes no distinction between writing an object with a plain
PUT vs putting it with multipart uploading. Instead, it "measures"
individual IO writes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Pavel Emelyanov
91235a84cd s3/client: Add HTTP client metrics
Currently an http client has several exported "numbers" regarding the
number of transport connections the client uses. This patch exports
those via S3 client's per-sched-group metrics and prepares the ground
for more metrics in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Pavel Emelyanov
08a12cd4a6 s3/client: Split make_request()
Another make_request() helper will appear that does mostly the
same. This split will help avoid code duplication.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Pavel Emelyanov
4b548dd240 s3/client: Wrap http client with struct group_client
The http-client is per-sched-group. The next patch will need to keep
metrics per-sched-group too, and this sched-group -> http-client map is
a good place to put them. The wrapping struct will allow extending it
with metrics.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Pavel Emelyanov
627c1932e4 s3/client: Move client::stats to namespace scope
The stats are about the object, not about the client, so it's better if
they live in namespace scope. This also avoids conflicts with the client
stats that will be reported as metrics (later patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Pavel Emelyanov
896b582850 s3/client: Keep part size local variable
This serves two purposes. First, it fixes a potential use-after-move,
since the bufs are moved into the lambda and bufs.size() is called in
the same statement with no defined evaluation order.

Second, this keeps the 'size' variable alive up to the time the request
is complete, thus making it possible to update stats with it (later
patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-07 09:25:00 +03:00
Nadav Har'El
c52e0fd333 test/alternator: avoid warnings about unverified HTTPS
The Alternator tests can run against HTTPS - namely when using
test/alternator/run with the "--https" option (local Alternator
configured with HTTPS) or "--aws" option (DynamoDB, using HTTPS).

In some cases we make these HTTPS requests with verify=False, to avoid
checking the SSL certificates. E.g., this is necessary for Alternator
with a self-signed certificate. Unfortunately, the urllib3 library adds
an ugly warning message when SSL certificate verification is disabled.

In the past we tried to disable these warnings, using the documented
urllib3.disable_warnings() function, but it didn't help. It turns out
that pytest has its own warning handling, so to disable warnings in
pytest we must say so in a special configuration parameter in pytest.ini.

So in this patch, we drop the disable_warnings call from conftest.py
(where it didn't help), and instead put a similar declaration in
pytest.ini. The disable_warnings call in the test/alternator/run
script needs to remain - it is run outside pytest, so pytest.ini
doesn't affect it.

After this patch, running test/alternator/run with --https or --aws
finishes without warnings, as desired.

Fixes #15287

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15292
2023-09-07 07:23:57 +03:00
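The pytest.ini declaration referred to above takes roughly this shape (the exact filter target is an assumption, not copied from the patch; urllib3's class for this warning is `InsecureRequestWarning`):

```ini
[pytest]
filterwarnings =
    ignore::urllib3.exceptions.InsecureRequestWarning
```

Because pytest installs its own warning filters for each test, a filter declared here wins over a `urllib3.disable_warnings()` call made at import time.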
Tomasz Grabiec
dd57c53328 Merge 'Topology: use this host_id in is_configured_this_node' from Benny Halevy
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.

Prepare for extending the token_metadata interface
to provide host_id in update_topology.

We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).

Closes #15297

* github.com:scylladb/scylladb:
  locator: topology: is_configured_this_node: delete spurious semicolon
  locator: topology: is_configured_this_node: compare host_id first
2023-09-06 22:13:29 +02:00
Pavel Emelyanov
9da4668c71 cql_query_test: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 16:54:25 +03:00
Pavel Emelyanov
84e30ab56c cql_query_test: Use do_with_cql_env_thread() explicitly
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda
with seastar::async(). The cql env already provides a helper for that

Indentation is deliberately left broken until next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 16:54:14 +03:00
Kefu Chai
e1d08d888a create-relocatable-package.py: add --node-exporter-dir option
before this change, we assumed that node_exporter artifacts are
always located under `build/node_exporter`. but this might not
hold anymore if we want to have a self-contained build, in the sense
that different builds do not share the same set of node_exporter
artifacts. this could be a waste, as the node_exporter artifacts
are identical across different builds, but it makes things
a lot simpler -- different builds do not have to hardwire to
a certain directory.

so, a new option is added to `create-relocatable-package.py`; it
allows us to specify the directory where node_exporter artifacts
are located.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 20:52:07 +08:00
Kefu Chai
8bd9645d1d build: specify the build dir instead of mode
instead of specifying the build "mode" and assuming that
the build directory is always located at "build/${mode}", specify
the build directory explicitly. this allows us to use
`create-relocatable-package.py` to package artifacts built
in a build directory whose path does not comply with the
naming convention; for instance, we might want to build
scylla in `build/yet-another-super-feature/release`.

so, in this change, we trade `--mode` for an option named
`--build-dir` and update `configure.py` accordingly.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 20:47:42 +08:00
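Trading `--mode` for an explicit `--build-dir` looks roughly like this in argparse terms (a sketch under assumed names; the real option handling in `create-relocatable-package.py` may differ):

```python
import argparse
import pathlib

def parse_args(argv):
    parser = argparse.ArgumentParser(description="package build artifacts")
    # Before: a --mode flag implied the fixed layout build/<mode>.
    # After: the caller names the directory explicitly, so layouts such as
    # build/yet-another-super-feature/release work too.
    parser.add_argument("--build-dir", type=pathlib.Path, required=True,
                        help="directory containing the build artifacts")
    return parser.parse_args(argv)

args = parse_args(["--build-dir", "build/yet-another-super-feature/release"])
```

argparse derives the attribute name `build_dir` from the flag, and `type=pathlib.Path` hands callers a path object rather than a raw string.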
Kefu Chai
1ed894170c sstables: throw at seeing invalid chunk_len
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is thrown in place
of the `assert()`, which is supposed to verify programming
invariants, not to identify corrupted data files.

Fixes #15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15264
2023-09-06 14:20:38 +03:00
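The assert-versus-exception distinction the commit draws — assertions guard programming invariants, while corrupted input deserves a typed, catchable error — can be sketched as follows (a Python stand-in for the C++ change; `MalformedSstableError` and `init_offsets` are illustrative names):

```python
class MalformedSstableError(Exception):
    """Raised when on-disk data fails validation -- not a programming bug."""

def init_offsets(chunk_size):
    # Before: `assert chunk_size != 0` aborted the whole process on bad data.
    # After: corrupted input is rejected with a descriptive, catchable error,
    # while asserts remain reserved for genuine programming invariants.
    if chunk_size == 0:
        raise MalformedSstableError("compression chunk_len must be non-zero")
    return chunk_size

try:
    init_offsets(0)
    caught = False
except MalformedSstableError:
    caught = True
```

The caller can now report the corrupted sstable and keep serving, instead of crashing with a backtrace.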
Nadav Har'El
5930637ad8 Merge 'task_manager: module: make_task: enter gate when the task is created' from Benny Halevy
Passing the gate_closed_exception to the task promise
ends up with abandoned exception since no-one is waiting
for it.

Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.

Fixes scylladb/scylladb#15211

In addition, this series adds a private abort_source for each task_manager module
(chained to the main task_manager::abort_source) and abort is requested on task_manager::module::stop().

gate holding in compaction_manager is hardened
and makes sure to stop compaction_manager and task_manager in sstable_compaction_test cases.

Closes #15213

* github.com:scylladb/scylladb:
  compaction_manager: stop: close compaction_state:s gates
  compaction_manager: gracefully handle gate close
  task_manager: task: start: fixup indentation
  task_manager: module: make_task: enter gate when the task is created
  task_manager: module: stop: request abort
  task_manager: task::impl: subscribe to module abort_source
  test: compaction_manager_stop_and_drain_race_test: stop compaction and task managers
  test: simple_backlog_controller_test: stop compaction and task managers
2023-09-06 13:29:26 +03:00
Benny Halevy
574c7e349a locator: topology: is_configured_this_node: delete spurious semicolon
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Benny Halevy
115462be17 locator: topology: is_configured_this_node: compare host_id first
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.

Prepare for extending the token_metadata interface
to provide host_id in update_topology.

We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Nadav Har'El
cfc70810d3 test/alternator: more error-path tests for list_append() function
Improved the coverage of the tests for the list_append() function
in UpdateExpression - test that if one of its arguments is not a list,
including a missing attribute or item, it is reported as an error as
expected.

The new tests pass on both Alternator and DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15291
2023-09-06 11:59:54 +03:00
Kefu Chai
4856e95a1c dist/debian: specify debian/* file encodings
instead of using the encoding returned by "locale.getencoding()",
explicitly specify the encoding used. per the Debian policies,
debian/control, debian/changelog, and file paths should be encoded
in UTF-8. see

- https://www.debian.org/doc/debian-policy/ch-controlfields.html
- https://manpages.debian.org/testing/dpkg-dev/deb-changelog.5.en.html
- https://www.debian.org/doc/debian-policy/ch-files.html#file-names

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 11:26:19 +08:00
Kefu Chai
a43cd7f03e dist/debian: wrap lines whose length exceeds 100 chars
to be more PEP 8 compliant (it actually recommends < 80 chars).
see https://peps.python.org/pep-0008/#maximum-line-length

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 11:26:19 +08:00
Kefu Chai
524ccc6da7 dist/debian: add command line option for builddir
so we can point debian_files_gen.py to a builddir other than
'build', and can optionally use another output directory. this would
help to reduce the number of "magic numbers" in our build system.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 11:26:19 +08:00
Kefu Chai
899b12da54 dist/debian: modularize debian_files_gen.py
restructure the script into functions, prepare for the change which
allows us to specify the build directory when preparing the "debian"
packaging recipes.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-06 11:26:19 +08:00
Avi Kivity
f594175042 Merge 'build: extract generate_compdb() out' from Kefu Chai
instead of flattening everything into the script, let's structure the code into functions, so it can be reused and is more maintainable this way.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15242

* github.com:scylladb/scylladb:
  build: early return when appropriate
  build: extract generate_compdb() out
2023-09-05 20:54:06 +03:00
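The early-return restructuring mentioned in the merge reduces nesting in the same way as this generic sketch (illustrative names and logic, not the actual `generate_compdb()` code):

```python
def generate_compdb(build_dir, have_ninja):
    # Before: the happy path was nested inside several if-blocks.
    # After: guard clauses return early, keeping the main logic flat.
    if build_dir is None:
        return None
    if not have_ninja:
        return None
    return f"{build_dir}/compile_commands.json"

compdb = generate_compdb("build/dev", True)
```

Each guard clause handles one failure mode and exits, so the remaining code reads top-to-bottom at a single indentation level.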
Dawid Medrek
c7fe5d7f94 utils/lister: Limit the API of scan_dir() to fs::path
Right now, the function allows for passing the path to a file as a seastar::sstring,
which is then converted to std::filesystem::path -- implicitly to the caller.
However, the function performs I/O, and there is no reason to accept any other type
than std::filesystem::path, especially because the conversion is straightforward.
Callers can perform it on their own.

This commit introduces the more constrained API.

Closes #15266
2023-09-05 20:50:42 +03:00
Nadav Har'El
1cbe60a7e3 Update seastar submodule
* seastar 6e80e84a...576ee47d (9):
  > http/client: Add "total new connections" metrics
  > semaphore: initialize wait_list in move ctor

Fixes #15253
Fixes #15263

  > tutorial: Add a missing argument in code example
  > sstring: format sstring without implicitly conversion
  > coroutine: Add a necessary include in generator.hh
  > tls: Move server name into tls_options
  > net/arp|ip: fix unused param warning in forward virtual method
  > net/ethernet: fix unused param ethernet_address::adjust_endianness
  > tls: Optionally skip client EOF wait

Closes #15273
2023-09-05 17:07:08 +03:00
Aleksandra Martyniuk
c96224e97d test: topology: add gossiper test
Add tests for gossiper/endpoint/live and gossiper/endpoint/down
which run only in release mode.
2023-09-05 15:04:26 +02:00
Aleksandra Martyniuk
ede8182dd4 test: fix types and variable names in wait_for_host_down
Fix types and variable names in ManagerClient::wait_for_host_down
and related methods.
2023-09-05 15:01:59 +02:00
Pavel Emelyanov
1ef4ba196b Merge 'Gossiper: mark const methods and remove dead code' from Benny Halevy
This series cleans up gossiper.
Methods that do not change the gossiper object are marked as const.
Dead code is removed.

Closes #15272

* github.com:scylladb/scylladb:
  gossiper: get_current* methods: mark as const
  gossiper: get_generation_for_nodes: mark as const
  gossiper: examine_gossiper: mark as const
  gossiper: request_all, send_all: mark as const
  gossiper: do_on_*notifications: mark as const
  utils: atomic_vector: mark for_each functions as const
  gossiper: compare_endpoint_startup: mark as const
  gossiper: get_state_for_version_bigger_than: mark as const
  gossiper: make_random_gossip_digest: delete dead legacy code
  gossiper: make_random_gossip_digest: mark as const
  gossiper: do_sort: mark as const
  gossiper: is* methods: mark as const
  gossiper: wait_for_gossip and friends: mark as const
  gossiper: drop unused dump_endpoint_state_map
  gossiper: remove unused shadow version members
2023-09-05 13:47:29 +03:00
Pavel Emelyanov
5d52a35e05 system_keyspace: Don't require snitch argument on start
Now system keyspace is finally independent from snitch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:57:09 +03:00
Pavel Emelyanov
1daa8fa3bb system_keyspace: Don't cache local dc:rack pair
Now no code needs it from system keyspace

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:56:45 +03:00
Pavel Emelyanov
9926917bf5 system_keyspace: Save local info with explicit location
On boot system keyspace is kicked to insert local info into system.local
table. Among other things there's dc:rack pair which sys.ks. gets from
its cache which, in turn, should have been previously initialized from
snitch on sys.ks. start. This patch makes the local info updating method
get the dc:rack from caller via argument. Callers, in turn, call snitch
directly, because these are main and cql_test_env startup routines.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:54:46 +03:00
Pavel Emelyanov
99cfd018c5 storage_service: Get endpoint location from snitch, not system keyspace
Storage service needs to get local dc:rack pair in some places and it
calls system keyspace's local_dc_rack() method for it. However, the
method returns back the data from sys.ks. cache which, in turn, was
previously initialized from snitch's data. This patch makes storage
service get location from snitch directly, without messing with system
keyspace.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:53:20 +03:00
Pavel Emelyanov
d2bd203cba snitch: Introduce and use get_location() method
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:52:30 +03:00
Pavel Emelyanov
153607d587 repair: Local location variables instead of system keyspace's one
Previous patch made full endpoint location be available as a local
variable near the places that get this location from the system
keyspace. This patch replaces the sys.ks. calls with the variables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:51:34 +03:00
Pavel Emelyanov
620273899b repair: Use full endpoint location instead of datacenter part
There are several places in repair code that get datacenter from the
topology. Nearby there are calls to update_topology() which, in turn,
needs full location ({dc, rack} pair). This patch makes the former
places obtain full location from topology and get the dc part from it.
This is needed as a preparation to let latter places use that location.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:50:00 +03:00
Kefu Chai
f6cca741ea config: remove "experimental" option
"experimental" option was marked "Unused" in 64bc8d2f7d. but we
chose to keep it in hope that the upgrade test does not fail.
despite that the upgrade tests per-se survived the "upgrade",
after the upgrade, the tests exercising the experimental features
are still failing hard. they have not been updated to set the
"experimental-features" option, and are still relying on
"experimental" to enable all the experimental features under
test.

so, in this change, let's just drop the option so that
scylla can fail early at seeing this "experimental" option.
this should help us to identify the tests relying on it
quicker. as the "experimental" features should only be used
in development environment, this change should have no impact
to production.

Refs #15214
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15233
2023-09-05 10:09:04 +03:00
Benny Halevy
cfecb68245 compaction_manager: stop: close compaction_state:s gates
Make sure the compaction_state:s are idle before
they are destroyed. Although all tasks are stopped
in stop_ongoing_compactions, make sure there is
fiber holding the compaction_state gate.

compaction_manager::remove now needs to close the
compaction_state gate and to stop_ongoing_compactions
only if the gate is not closed yet.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
96055414c7 compaction_manager: gracefully handle gate close
Check if the compaction_state gate is closed
along with _state != state::enabled and return early
in this case.

At this point entering the gate is guaranteed to succeed.
So enter the gate before calling `perform_compaction`
keeping the std::optional<gate_holder> throughout
the compaction task.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
a5b7f1a275 task_manager: task: start: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
f9a7635390 task_manager: module: make_task: enter gate when the task is created
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.

Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.

Fixes scylladb/scylladb#15211

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
51792d2292 task_manager: module: stop: request abort
Have a private abort_source for every module
and request abort on stop() to signal all outstanding
tasks to abort (especially when they are sleeping
for the task_ttl).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
d7205db863 task_manager: task::impl: subscribe to module abort_source
Rather than to the top-level task_manager abort_source,
to provide separation between task_manager modules
so each one can be aborted and stopped independently
of the others (in the next patch).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
062684eb1f test: compaction_manager_stop_and_drain_race_test: stop compaction and task managers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Benny Halevy
b9127f55ac test: simple_backlog_controller_test: stop compaction and task managers
The compaction_manager and task_manager should
be orderly stopped before they are destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-05 09:17:25 +03:00
Pavel Emelyanov
13a0c29618 storage_service: Remove query processor arg from join_cluster()
The storage service, since d42685d0cb, has an on-board query processor
ref^w pointer and can use it to join the cluster

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15236
2023-09-05 07:30:37 +03:00
Kefu Chai
ea91342d4b build: early return when appropriate
less indentation for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-05 12:14:02 +08:00
Kefu Chai
ce5f7d36cd build: extract generate_compdb() out
instead of flattening everything into the script, let's structure
the code into functions, so it can be reused and is more
maintainable this way.

Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-05 12:14:02 +08:00
Piotr Smaroń
eb46f1bd17 guardrails: restrict replication factor (RF)
Replacing the `minimum_keyspace_rf` config option with 4 config options:
`{minimum,maximum}_replication_factor_{warn,fail}_threshold`, which
allow us to impose soft limits (issue a warning) and hard limits (refuse
to execute the CQL) on RF when creating/altering a keyspace.
The reason to replace rather than extend the `minimum_keyspace_rf`
config option is to be aligned with Cassandra, which did the same and
uses the same parameter names.
Only the min soft limit is enabled by default, and it is set to 3, which
means that we'll generate a CQL warning whenever RF is set to either 1
or 2. An RF value of 0 is always allowed and means that there will not
be any replicas in a given DC. This was agreed with PM.
Because we don't allow changing guardrails' values while scylla is
running (per PM), there are no tests provided with this PR; dtests will
be provided separately.
Exceeding guardrails' thresholds will be tracked by metrics.

Resolves #8619
Refs #8892 (the RF part, not the replication-strategy part)

Closes #14262
2023-09-04 19:22:17 +03:00
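A hedged sketch of what the options from the commit above might look like in `scylla.yaml`. The option names come from the commit message; the specific values below (other than the default min soft limit of 3) and the semantics of the fail thresholds are illustrative assumptions, not verified against the implementation:

```yaml
# Soft limit (issue a CQL warning) when creating/altering a keyspace with
# a replication factor below 3 -- the only limit enabled by default.
minimum_replication_factor_warn_threshold: 3
# Hard limits (reject the CQL statement). Values here are illustrative.
minimum_replication_factor_fail_threshold: 2
maximum_replication_factor_warn_threshold: 5
maximum_replication_factor_fail_threshold: 7
```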
Benny Halevy
04ba560b8d gossiper: get_current* methods: mark as const
We need to const_cast `this` since the const
container() has no const invoke_on override.
Trying to fix this in seastar sharded.hh breaks
many other call sites in Scylla.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:18:04 +03:00
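The pattern the commit above describes can be sketched in miniature (the `container_t` and `gossiper_like` types below are stand-ins, not the real Scylla/seastar code): when a container's `invoke_on()` has no const overload, a logically-const accessor can `const_cast` away the constness of `this`.

```cpp
// Minimal stand-in for a container whose invoke_on() has no const
// overload (analogous to seastar::sharded<T> in the commit above).
struct container_t {
    int value = 42;
    template <typename Func>
    auto invoke_on(Func f) { return f(value); }  // non-const only
};

struct gossiper_like {
    container_t cont;
    // Logically const: the call only reads state, but const_cast is
    // needed because invoke_on() lacks a const overload.
    int get_current_generation() const {
        return const_cast<gossiper_like*>(this)->cont.invoke_on(
            [](int v) { return v; });
    }
};
```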
Benny Halevy
43d883c5aa gossiper: get_generation_for_nodes: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:17:38 +03:00
Benny Halevy
cfe0ec2203 gossiper: examine_gossiper: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:17:25 +03:00
Benny Halevy
ce05bbe32f gossiper: request_all, send_all: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:16:19 +03:00
Benny Halevy
cc1d5771e5 gossiper: do_on_*notifications: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:16:10 +03:00
Benny Halevy
eb51b70e6d utils: atomic_vector: mark for_each functions as const
They only need to access the _vec_lock rwlock,
so mark it as mutable; otherwise they provide a const
interface to the callers, as the called func receives
the entries by value and cannot modify them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:14:38 +03:00
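The `mutable` lock idiom from the commit above can be sketched as follows (illustrative code, not the real `utils::atomic_vector`; a plain `std::mutex` stands in for the rwlock):

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// The lock member is `mutable` because locking is not a logical mutation
// of the container, which lets the purely-observing for_each() be const.
template <typename T>
class observable_vector {
    std::vector<T> _vec;
    mutable std::mutex _vec_lock;  // stand-in for the rwlock in the commit
public:
    void add(T v) {
        std::lock_guard l(_vec_lock);
        _vec.push_back(std::move(v));
    }
    void for_each(const std::function<void(T)>& func) const {
        std::lock_guard l(_vec_lock);
        for (auto v : _vec) {
            func(v);  // entries passed by value: func cannot modify them
        }
    }
};
```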
Benny Halevy
963d6fb009 gossiper: compare_endpoint_startup: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:14:22 +03:00
Benny Halevy
2899e07572 gossiper: get_state_for_version_bigger_than: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:13:02 +03:00
Benny Halevy
87ac1a26f2 gossiper: make_random_gossip_digest: delete dead legacy code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:12:51 +03:00
Benny Halevy
33f004587e gossiper: make_random_gossip_digest: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:12:43 +03:00
Benny Halevy
02e8fdc4b8 gossiper: do_sort: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:11:56 +03:00
Benny Halevy
482963b2c4 gossiper: is* methods: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:11:00 +03:00
Benny Halevy
f7eddf0322 gossiper: wait_for_gossip and friends: mark as const
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:09:15 +03:00
Benny Halevy
044a696aca gossiper: drop unused dump_endpoint_state_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:09:04 +03:00
Benny Halevy
083506d479 gossiper: remove unused shadow version members
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-04 16:08:25 +03:00
Patryk Jędrzejczak
06c0c6a9d9 raft_group0: start leadership monitor fiber during restart
Currently, we start the group 0 leadership monitor fiber only during
a normal bootstrap. However, we should also do it when we restart
a node (either with or without upgrading it to Raft).

Fixes #15166

Closes #15204
2023-09-04 10:41:50 +02:00
Kefu Chai
dd59b90999 open-coredump: pass the content of "image" file not its path to dbuild
in a4eb3c6e0f, we passed the path of the
"image" file to `dbuild`, but that was wrong: we should pass its content
to this script. this change fixes that.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15247
2023-09-04 07:34:44 +03:00
Avi Kivity
78fc3b5f56 config: rename stream_plan_ranges_percentage to *_fraction
The value is specified as a fraction between 0 and 1, so don't
mislead users into specifying a value between 0 and 100.

Closes #15261
2023-09-03 23:24:29 +03:00
Avi Kivity
9a3d57256a Merge 'config: add index_cache_fraction' from Michał Chojnowski
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See https://github.com/scylladb/scylladb/issues/11202.

However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.

As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.

This series adds such a configurable limit, sets it to 0.2 of total cache memory by default,
and re-enables index caching by default.

Fixes #15118

Closes #14994

* github.com:scylladb/scylladb:
  test: boost/cache_algorithm_test: add cache_algorithm_test
  sstables: partition_index_cache: deglobalize stats
  utils: cached_file: deglobalize cached_file metrics
  db: config: enable index caching by default
  config: add index_cache_fraction
  utils: lru: add move semantics to list links
2023-09-03 19:39:31 +03:00
Dawid Medrek
a5448fade9 Simplify service_set constructor in init.hh
Leverage the comma operator with a parameter pack.

Closes #15246
2023-09-03 18:14:44 +03:00
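The technique named in the commit above can be shown with a small sketch (illustrative types, not the real `init.hh` `service_set`): a fold expression over the comma operator runs one statement per element of a parameter pack, replacing a recursive helper.

```cpp
#include <string>
#include <vector>

struct service { std::string name; };

// Illustrative constructor: the comma-operator fold expands to
// names.push_back(s1.name), names.push_back(s2.name), ... in pack order.
struct service_set_like {
    std::vector<std::string> names;
    template <typename... Services>
    explicit service_set_like(const Services&... svcs) {
        (names.push_back(svcs.name), ...);  // C++17 fold over the comma operator
    }
};
```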
Aleksandra Martyniuk
cf37ab96f4 api: task_manager: fix indentation
Closes #15173
2023-09-02 08:18:59 +03:00
Michał Chojnowski
bcc235ad5f test: boost/cache_algorithm_test: add cache_algorithm_test
The tests added in this patch validate that index_cache_fraction
does what it's supposed to do.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
f00bed9429 sstables: partition_index_cache: deglobalize stats
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.

This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
c7d9d35030 utils: cached_file: deglobalize cached_file metrics
Move cached_file metrics from a thread_local variable
to cache_tracker.

This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
023accf246 db: config: enable index caching by default
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See #11202.

However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.

As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.

This patch sets such a limit to 0.2 of total cache memory, and re-enables
index caching by default.
2023-09-01 22:34:23 +02:00
Michał Chojnowski
50b429f255 config: add index_cache_fraction
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.

This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggered. We will change the default in a future
patch.
2023-09-01 22:34:23 +02:00
Michał Chojnowski
6a7ce6781e utils: lru: add move semantics to list links
Before the patch, fixing list links is done manually in the move constructor of
`evictable`. After the patch, it is done by the move constructors of the links
themselves.

This makes for slightly cleaner code, especially after we add more links in an
upcoming patch.
2023-09-01 22:34:23 +02:00
Tomasz Grabiec
7b65d4d947 Merge 'Gossiper: provide strong exception safety for endpoint state changes' from Benny Halevy
This series ensures that endpoint state changes (for each single endpoint) are applied to the gossiper endpoint_state_map as a whole and on all shards.
Any failure in the process will keep the existing endpoint state intact.

Note that verbs that modify the endpoint states of multiple endpoints may still succeed in modifying some of them before hitting an error, and those changes are committed to the endpoint_state_map, so we don't ensure atomicity when updating multiple endpoints' states.

Fixes scylladb/scylladb#14794
Fixes scylladb/scylladb#14799

Closes #15073

* github.com:scylladb/scylladb:
  gossiper: move endpoint_state by value to apply it
  gossiper: replicate: make exception safe
  gms: pass endpoint_state_ptr to endpoint_state change subscribers
  gossiper: modify endpoint state only via replicate
  gossiper: keep and serve shared endpoint_state_ptr in map
  gossiper: get_max_endpoint_state_version: get state by reference
  api/failure_detector: get_all_endpoint_states: reduce allocations
  cdc/generation: get_generation_id_for: get endpoint_state&
  gossiper: add for_each_endpoint_state helpers
  gossiper: add num_endpoints
  gossiper: add my_endpoint_state
2023-09-01 12:23:19 +02:00
Kamil Braun
117dedab19 Merge 'Cluster features on raft: topology coordinator + check on boot followups' from Piotr Dulikowski
This PR collects followups described in #14972:

- The `system.topology` table is now flushed every time feature-related
  columns are modified. This is done because of the feature check that
  happens before the schema commitlog is replayed.
- The implementation now guarantees that, if all nodes support some
  feature as described by the `supported_features` column, then support
  for that feature will not be revoked by any node.  Previously, in an
  edge case where a node is the last one to add support for some feature
  `X` in `supported_features` column, crashes before applying/persisting
  it and then restarts without supporting `X`, it would be allowed to boot
  anyway and would revoke support for the `X` in `system.topology`.
  The existing behavior, although counterintuitive, was safe - the
  topology coordinator is responsible for explicitly marking features as
  enabled, and in order to enable a feature it needs to perform a special
  kind of a global barrier (`barrier_after_feature_update`) which only
  succeeds after the node has updated its features column - so there is no
  risk of enabling an unsupported feature.  In order to make the behavior
  less confusing, the node now will perform a second check when it tries
  to update its `supported_features` column in `system.topology`.
- The `barrier_after_feature_update` is removed and the regular global
  `barrier` topology command is used instead. The `barrier` handler now
  performs a feature check if the node did not have a chance to verify and
  update its cluster features for the second time.

JOIN_NODE rpc will be sent separately as it is a big item on its own.

Fixes: #14972

Closes #15168

* github.com:scylladb/scylladb:
  test: topology{_experimental_raft}: don't stop gracefully in feature tests
  storage_service: remove _topology_updated_with_local_metadata
  topology_coordinator: remove barrier_after_feature_update
  topology_coordinator: perform feature check during barrier
  storage_service: repeat the feature check after read barrier
  feature_service: introduce unsupported_feature_exception
  feature_service: move startup feature check to a separate function
  topology_coordinator: account for features to enable in should_preempt_balancing
  group0_state_machine: flush system.topology when updating features columns
2023-09-01 11:52:26 +02:00
Nadav Har'El
5625624533 doc/dev: add document about analyzing build time
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.

I've used this tool in the past to reduce Scylla build time - see
commits:

   fa7a302130 (reduced 6.5%)
   f84094320d (reduced 0.1%)
   6ebf32f4d7 (reduced 1%)
   d01e1a774b (reduced 4%)

I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.

Refs #1.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15209
2023-09-01 11:33:36 +03:00
Botond Dénes
b6986668d4 Merge 'sstables: change make_descriptor() to accept fs::path' from Kefu Chai
to lower the programmer's cognitive load: a programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.

Refs #15187

Closes #15188

* github.com:scylladb/scylladb:
  sstable: add samples of fname to be matched by regex
  sstables: change make_descriptor() to accept fs::path
  sstables: switch entry_descriptor(sstring..) to std::string_view
  sstables: change make_descriptor() to accept fs::path
2023-09-01 10:55:25 +03:00
Kefu Chai
a4eb3c6e0f open-coredump: use the dbuild script in current branch
the dbuild script provided by the branch being debugged might not
include the recent fixes included in the current branch from which
`open-coredump.sh` is launched.

so, instead of using the dbuild script in the repo being debugged,
let's use the dbuild provided by the current branch. also, wrap the
dbuild command line for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15240
2023-09-01 07:27:21 +03:00
Botond Dénes
34d94fb549 test/cql-pytest/test_tools.py: improve tempdir usage for scrub tests
Scrub tests use a lot of temporary directories. This is suspected to
cause problems in some cases. To improve the situation, this patch:
* Creates a single root temporary directory for all scrub tests
* All further fixtures create their files/directories inside this root
  dir.
* All scrub tests create their temporary directories within this root
  dir.
* All temporary directories now use an appropriate "prefix", so we can
  tell which temporary directory is part of the problem if a test fails.

Refs: #14309

Closes #15117
2023-09-01 07:17:49 +03:00
Gleb Natapov
55f047f33f raft: drop assert in server_impl::apply_snapshot for a condition that may happen
server_impl::apply_snapshot() assumes that it cannot receive a snapshot
from the same host until the previous one is handled, and usually this is
true, since a leader will not send another snapshot until it gets a
response to the previous one. But it may happen that the snapshot-sending
RPC fails after the snapshot was sent but before the reply is received,
because of a connection disconnect. In this case the leader may send
another snapshot, and there is no guarantee that the previous one was
already handled, so the assumption may break.

Drop the assert that verifies the assumption and return an error in this
case instead.

Fixes: #15222

Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
2023-09-01 07:17:49 +03:00
Pavel Emelyanov
91cc544b05 Update seastar submodule
* seastar 0784da87...6e80e84a (29):
  > Revert "shared_token_bucket: Make duration->tokens conversion more solid"
  > Merge 'chunked_fifo: let incremetal operator return iterator not basic_iterator' from Kefu Chai
  > memory: diable transparent hugepages if --overprovisioned is specified
Ref https://github.com/scylladb/scylladb/issues/15095
  > http/exception: s/<TAB>/    /
  > install-dependencies.sh: re-add protobuf
  > Merge 'Keep capacity on fair_queue_entry' from Pavel Emelyanov
  > Merge 'Fix server-side RPC stream shutdown' from Pavel Emelyanov
Fixes https://github.com/scylladb/scylladb/issues/13100
  > smp: make service management semaphore thread local
  > tls_test: abort_accept() after getting server socket
  > Merge 'Print more IO info with ioinfo app' from Pavel Emelyanov
  > rpc: Fix client-side stream registration race
Ref https://github.com/scylladb/scylladb/issues/13100
  > tests: perf: shard_token_bucket: avoid capturing unused variables in lambdas
  > build: pass -DBoost_NO_CXX98_FUNCTION_BASE to C++ compiler
  > reactor: Drop some dangling friend declarations
  > fair_queue: Do not re-evaluate request capacity twice
  > build: do not use serial number file when signing a cert
  > shared_token_bucket: Make duration->tokens conversion more solid
  > tests: Add perf test for shard_token_bucket
  > Merge 'Make make_file_impl() less yielding' from Pavel Emelyanov
  > fair_queue: Remove individual requests counting
  > reactor, linux-aio: print value of aio-max-nr on error
  > Merge 'build, net: disable implicit fallthough' from Kefu Chai
  > shared_token_bucket: Fix duration_for() underflow
  > rpc: Generalize get_stats_internal() method
  > doc/building-dpdk.md: fix invalid file path of README-DPDK.md
  > install-dependencies: add centos9
  > Merge 'log: report scheduling group along with shard id' from Kefu Chai
  > dns: handle exception in do_sendv for udp
  > Merge 'Add a stall detector histogram' from Amnon Heiman

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15218
2023-09-01 07:17:49 +03:00
Alexey Novikov
87fa7d0381 compact and remove expired range tombstones from cache on read
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs #2252
Fixes #6033

Closes #14463
2023-09-01 07:17:49 +03:00
Kefu Chai
94a056bda8 sstable: add samples of fname to be matched by regex
for better readability of code

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-01 07:44:06 +08:00
Kefu Chai
a29838f9e1 sstables: change make_descriptor() to accept fs::path
change another overload of `make_descriptor()` to accept `fs::path`,
in the same spirit as a previous change in this area, so we have
a more consistent API for creating sstable descriptors. this
new API is also simpler to use.

Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-01 07:44:06 +08:00
Kefu Chai
50c1d9aee7 sstables: switch entry_descriptor(sstring..) to std::string_view
so its callers don't need to construct a temporary `sstring` if
the parameter's type is not `sstring`. for instance, before
this change, `entry_descriptor::make_descriptor(const std::filesystem::path...)`
would have to construct two temporary instances of `sstring`
for calling this function.

after this change, it does not have to do so.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-01 07:44:06 +08:00
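The benefit described in the commit above can be sketched with a simplified signature (this is not the real `entry_descriptor` API): accepting `std::string_view` lets callers pass pieces of a path, or any string-like data, without materializing temporary string objects first.

```cpp
#include <string>
#include <string_view>

// Simplified stand-in for an API that previously took (sstring, sstring):
// string_view parameters accept string literals, std::string, and
// substrings of a path without constructing temporaries.
std::string make_descriptor_key(std::string_view sstdir, std::string_view fname) {
    std::string key;
    key.reserve(sstdir.size() + 1 + fname.size());
    key.append(sstdir);
    key.push_back('/');
    key.append(fname);
    return key;
}
```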
Kefu Chai
6656707164 sstables: change make_descriptor() to accept fs::path
to lower the programmer's cognitive load: a programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.

Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-09-01 07:44:06 +08:00
Piotr Dulikowski
5471330ee7 test: topology{_experimental_raft}: don't stop gracefully in feature tests
The current cluster feature tests are stopping nodes in a graceful way.
Doing it gracefully isn't strictly necessary for the test scenarios and
we can switch `server_stop_gracefully` calls to `server_stop`. This only
became possible after a previous commit which causes the `system.topology`
table to be flushed when cluster feature columns are modified, and will
serve as a good test for it.
2023-08-31 16:46:11 +02:00
Piotr Dulikowski
62167d584b storage_service: remove _topology_updated_with_local_metadata
After removing barrier_after_feature_update, the flag is no longer
needed by anybody. The field in storage_service is removed and
do_update_topology_with_local_metadata is inlined.
2023-08-31 16:46:11 +02:00
Piotr Dulikowski
3d7cf3bfe6 topology_coordinator: remove barrier_after_feature_update
The `barrier_after_feature_update` was introduced as a variant of the
`barrier` command, meant to be used by the topology coordinator when
enabling a feature. It was meant to give more guarantees to the topology
coordinator than the regular barrier, but the regular barrier has been
adjusted in the previous commits so that it can be used instead of the
special barrier.

This commit gets rid of `barrier_after_feature_update` and replaces its
uses with `barrier`.
2023-08-31 16:46:11 +02:00
Piotr Dulikowski
1b62abfc42 topology_coordinator: perform feature check during barrier
Due to the possible situation where a node commits a command that
advertises support for a feature but crashes before applying it, there
is a period of time where a node might have its group 0 server running
but does not support all of the features. Currently, we solve the issue
by using a special `barrier_after_feature_update` which will not succeed
until the node makes sure to update its `supported_features` column (or,
since the previous commit, shuts down if it doesn't support all required
features).

However, we can make it work with regular barrier after adjusting it
slightly. In case the local metadata was not updated yet, it will
perform a feature check. This will make sure that the global barrier
issued by the topology coordinator before enabling features will not
succeed if the problematic situation occurs.
2023-08-31 16:46:11 +02:00
Piotr Dulikowski
3edfd29c86 storage_service: repeat the feature check after read barrier
We would like to guarantee the following property: if all nodes have
some feature X in their `supported_features` column in
`system.topology`, then it's no longer possible for anybody to revoke
support for it. Currently, it is not guaranteed because the following
can happen:

1. A node commits a command that updates its `supported_features`,
   marking feature X as supported. It is the last node to do so and now
   all nodes support X.
2. Node crashes before applying the command locally.
3. Node is downgraded not to support X and restarted.
4. The feature check in `enable_features_on_startup` passes because it
   happens before starting the group 0 server.
5. The `supported_features` column is updated in
   `update_topology_with_local_metadata`, removing support for X.

Even though the guarantee does not hold, it's not a problem because the
`barrier_after_metadata_update` is required to succeed on all nodes
before topology coordinator moves to enable a feature, and - as the name
suggests - it requires `update_topology_with_local_metadata` to finish.

However, choosing to give this guarantee makes it simpler to reason
about how cluster features on raft work and removes some pathological
cases (e.g. trying to downgrade some other node after step 1 will fail,
but will be again possible after step 5). Therefore, this commit adds a
second check to `update_topology_with_local_metadata` which disallows
removing support for a feature that is supported by everybody - and
stops the boot process if necessary.
2023-08-31 16:46:11 +02:00
Piotr Dulikowski
aa5401383f feature_service: introduce unsupported_feature_exception
The new `unsupported_feature_exception` is introduced so that the
exception thrown by `check_features` can be caught in a type-safe way.
2023-08-31 16:46:10 +02:00
Piotr Dulikowski
8286a2c369 feature_service: move startup feature check to a separate function
The logic responsible for checking supported features against the
currently enabled features (and features that are unsafe to disable) is
moved to a separate function, `check_features`. Currently, it is only
used from `enable_features_on_startup`, but more checks against features
in raft will be added in the commits that follow.
2023-08-31 16:45:40 +02:00
Benny Halevy
98fd9fcc11 gossiper: move endpoint_state by value to apply it
Save a copy of the applied endpoint state by moving
the value towards replicate.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:35:15 +03:00
Benny Halevy
38c2347a3c gossiper: replicate: make exception safe
First replicate the new endpoint_state on all shards
before applying the replicated endpoint_state objects
to _endpoint_state_map.

Fixes scylladb/scylladb#14794

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:35:15 +03:00
Benny Halevy
c16ec870da gms: pass endpoint_state_ptr to endpoint_state change subscribers
Now that the endpoint_state isn't changed in place,
we do not need to copy it to each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:35:15 +03:00
Benny Halevy
1d04242a90 gossiper: modify endpoint state only via replicate
And restrict the accessor methods to return const pointers
or references.

With that, the endpoint_state_ptr:s held in the _endpoint_state_map
point to immutable endpoint_state objects - with one exception:
the endpoint_state update_timestamp may be updated in place,
but the endpoint_state_map is immutable.

replicate() replaces the endpoint_state_ptr in the map
with a new one to maintain immutability.

A later change will also make replicate() exception safe, so that
it guarantees strong exception safety: either all shards
are updated or none of them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:35:15 +03:00
Benny Halevy
d00e49a1bb gossiper: keep and serve shared endpoint_state_ptr in map
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in place anyhow,
while internally gossiper still has the legacy helpers
to manage the endpoint_state.

Fixes scylladb/scylladb#14799

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:34:36 +03:00
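The snapshot pattern from the commit above can be sketched with `std::shared_ptr` standing in for `seastar::lw_shared_ptr` (types and names are illustrative, not the real gossiper code): the map stores pointers to const state, so readers get an immutable snapshot, and updates swap in a freshly allocated object rather than mutating in place.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

struct endpoint_state { int version; };
// Readers receive a pointer to const state: a stable snapshot.
using endpoint_state_ptr = std::shared_ptr<const endpoint_state>;

struct gossiper_map {
    std::map<std::string, endpoint_state_ptr> states;

    endpoint_state_ptr get(const std::string& ep) const {
        auto it = states.find(ep);
        return it == states.end() ? nullptr : it->second;
    }
    // Updates replace the pointer wholesale instead of mutating the
    // pointed-to object, preserving previously handed-out snapshots.
    void replicate(const std::string& ep, endpoint_state st) {
        states[ep] = std::make_shared<const endpoint_state>(std::move(st));
    }
};
```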
Benny Halevy
f33a6d37f2 gossiper: get_max_endpoint_state_version: get state by reference
No need to copy the endpoint_state since the function
is synchronous.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 09:33:42 +03:00
Benny Halevy
f1a88c01a2 api/failure_detector: get_all_endpoint_states: reduce allocations
Reserve the result vector based on the known
number of endpoints and then move-construct each entry
rather than copying it.
Also, use references to traverse the application_state_map
rather than copying each entry.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 08:34:42 +03:00
Benny Halevy
1e0d19b89d cdc/generation: get_generation_id_for: get endpoint_state&
No need to look up the application_state again using the
endpoint, as both callers already have a reference to
the endpoint_state handy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 08:34:23 +03:00
Benny Halevy
4f5ffc7719 gossiper: add for_each_endpoint_state helpers
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).

With that, gossiper::get_endpoint_states() can be made private.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 08:32:31 +03:00
Benny Halevy
3208af1880 gossiper: add num_endpoints
Return the number of endpoints tracked by gossiper.
This is useful when the caller doesn't need
access to the endpoint states map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 08:30:40 +03:00
Benny Halevy
b82c77ed9c gossiper: add my_endpoint_state
Get or create the endpoint_state for this node
instead of accessing _endpoint_state_map directly.
Do this before changing the map to hold a lw_shared_ptr<endpoint_state>
in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-31 08:27:27 +03:00
Kefu Chai
3bdbe620aa open-coredump: do not assume remote repo name
open-coredump.sh allows us to specify --scylla-repo-path, but
the developer's remote repo name is not always "origin" -- the
"origin" could be his/her own remote repo, not the one
we want to pull from.

so, in this change, assuming that the remote repo to be pulled
from has been added to the local repo, we query the local repo
for its name and pull using that name instead.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15220
2023-08-31 08:04:28 +03:00
Kefu Chai
a47dcb29d5 open-coredump: add selinux label to shared volume
before this change, we didn't configure the selinux label when
binding the shared volume, but this results in permission denied
when accessing `/opt/scylladb` in the container when selinux
is enabled. since we are not likely to share the volume with
other containers, we can use `Z` to indicate that the bind
mount is private and unshared. this allows the launched container
to access `/opt/scylladb` even if selinux is enabled.

since selinux is enabled by default on an installation of fedora 38,
this change should improve the user experience of open-coredump
when a developer uses fedora distributions.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15229
2023-08-31 08:03:40 +03:00
Dawid Medrek
9afaf39acb Get rid of UB in commitlog.hh
Identifiers starting with an underscore followed by a capital letter
are reserved. They should not be used.

Closes #15227
2023-08-31 00:03:04 +03:00
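The rule behind the fix in the commit above comes from the C++ standard ([lex.name]): identifiers beginning with an underscore followed by an uppercase letter, or containing a double underscore, are reserved for the implementation. A small illustrative sketch (the type below is hypothetical, not the real `commitlog.hh`):

```cpp
struct commitlog_config_sketch {
    // int _Size;          // BAD: "_" + uppercase letter is reserved
    // int buffer__size;   // BAD: "__" anywhere is reserved
    int _size = 1024;      // OK: lowercase after the leading underscore
    int segment_size_kb = 32 * 1024;  // OK: no leading underscore at all
};
```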
Aleksandra Martyniuk
92fad5769a test: repair tasks test
Add tests checking whether repair tasks are properly structured and
their progress is gathered correctly.
2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
848dfb26ef repair: add methods making repair progress more precise
Override the methods returning the expected children number and job size
in repair tasks. With them, the get_progress method will be able to
return a more precise progress value.
2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
564454f8c2 tasks: make progress related methods virtual 2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
4766f74623 repair: add get_progress method to shard_repair_task_impl
Count shard_repair_task_impl progress based on a number of ranges
which have already been repaired.
2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
c9d68869b8 repair: add const noexcept qualifiers to shard_repair_task_impl::ranges_size() 2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
09abbdddae repair: log a name of a particular table repair is working on
Instead of logging the list of all tables' names for a given repair,
log the name of the particular table the repair is working on.
2023-08-30 15:34:25 +02:00
Aleksandra Martyniuk
587715b399 tasks: delete move and copy constructors from task_manager::task::impl 2023-08-30 15:34:22 +02:00
Botond Dénes
008bde5461 Merge 'gossiper: apply_new_states: tolerate listener errors during shutdown' from Benny Halevy
Change 6449c59 brought back abort on listener failure, but this causes crashes when listeners hit expected errors like gate_closed.

Detect shutdown using the gossiper _abort_source
and in this case just log a warning about the errors but do not abort.

Fixes scylladb/scylladb#15031

Closes #15100

* github.com:scylladb/scylladb:
  gossiper: apply_new_states: tolerate listener errors during shutdown
  gossiper: do_on_change_notifications: check abort source
  gossiper: lock_endpoint_update_semaphore: get_units with _abort_source
  gossiper: lock_endpoint: get_units with _abort_source
  gossiper: is_enabled: consider also _abort_source
2023-08-30 11:52:13 +03:00
Kamil Braun
0ee23b260e Merge 'raft topology: add and deprecate support for --ignore-dead-nodes with IPs' from Patryk Jędrzejczak
We want to stop supporting IPs for `--ignore-dead-nodes` in
`raft_removenode` and `--ignore-dead-nodes-for-replace` for
`raft_replace`. However, we shouldn't remove these features without the
deprecation period because the original `removenode` and `replace`
operations still support them. So, we add them for now.

The `IP -> Raft ID` translation is done through the new
`raft_address_map::find_by_addr` member function.

We update the documentation to inform about the deprecation of the IP
support for `--ignore-dead-nodes`.

Fixes #15126

Closes #15156

* github.com:scylladb/scylladb:
  docs: inform about deprecating IP support for --ignore-dead-nodes
  raft topology: support IPs for --ignore-dead-nodes
  raft_address_map: introduce find_by_addr
2023-08-30 10:41:23 +02:00
Botond Dénes
eb7618406f Merge 'Gossiper: do_on_dead_notifications' from Benny Halevy
Use common code to notify subscribers on_dead
from remove_endpoint() and from mark_dead().

Modeled after do_on_change_notifications.

Refs https://github.com/scylladb/scylladb/pull/15179#discussion_r1306969125

Closes #15206

* github.com:scylladb/scylladb:
  gossiper: remove_endpoint: get the endpoint_state before yielding
  gossiper: add do_on_dead_notifications
2023-08-30 09:32:35 +03:00
Botond Dénes
3e7ec6cc83 Merge 'Move cell assertion from cql_test_env to cql_assertions' from Pavel Emelyanov
The cql_test_env has a virtual require_column_has_value() helper that better fits the cql_assertions crowd. Also, the helper in question duplicates some existing code, so it can be made shorter (and one table helper class gets removed afterwards)

Closes #15208

* github.com:scylladb/scylladb:
  cql_assertions: Make permit from env
  table: Remove find_partition_slow() helper
  sstable_compaction_test: Do not re-decorate key
  cql_test_env: Move .require_column_has_value
  cql_test_env: Use table.find_row() shortcut
2023-08-30 08:34:05 +03:00
Kamil Braun
0bff96a611 Merge 'gossip: add group0_id attribute to gossip_digest_syn' from Mikołaj Grzebieluch
Motivation:

The user can bootstrap 3 different clusters and then connect them
(#14448). When these clusters start gossiping, their token rings will be
merged, but there will be 3 different group 0s in there. It results in a
corrupted cluster.

We need to prevent such situations from happening in clusters which
don't use Raft-based topology.

-------

Gossiper service sets its group0 id on startup if it is stored in
`scylla_local` or sets it during joining group0.

Send group0_id (if it is set) when the node tries to initiate the gossip
round. When a node gets gossip_digest_syn it checks if its group0 id
equals the local one and if not, the message is discarded.

Fixes #14448

Performed manual tests with the following scenario:
1. setup a cluster of two nodes (one compiled with and one without this patch)
2. setup a new node
3. create a basic keyspace and table
4. execute simple select and insert queries

Tested 4 scenarios: the seed node was with or without this patch, and the third node was with or without this patch.
These tests didn't detect any errors.

Closes #15004

* github.com:scylladb/scylladb:
  tests: raft: cluster of nodes with different group0 ids
  gossip: add group0_id attribute to gossip_digest_syn
2023-08-29 16:41:29 +02:00
Pavel Emelyanov
5c95b1cb7f scylla-gdb: Remove _cost_capacity from fair-group debug
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb

(see also ae6fdf1599)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15203
2023-08-29 16:13:50 +03:00
Pavel Emelyanov
137c7116dc cql_assertions: Make permit from env
To call table::find_row() one needs to provide a permit. Tests have
short and neat helper to create one from cql_test_env

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 16:01:29 +03:00
Kefu Chai
6a55e4120e encoding_state: mark helper methods protected
these methods are only used by the public methods of this class and
its derived class "memtable_encoding_stats_collector".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15190
2023-08-29 15:41:13 +03:00
Pavel Emelyanov
c2f2e0fd7a table: Remove find_partition_slow() helper
It's no longer used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 15:38:41 +03:00
Pavel Emelyanov
0a727a9b2e sstable_compaction_test: Do not re-decorate key
The is_partition_dead() local helper accepts a partition key argument and
decorates it. However, its caller gets the partition key from the decorated
key itself, and can just pass it along

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 15:38:41 +03:00
Pavel Emelyanov
4e9f380608 cql_test_env: Move .require_column_has_value
This env helper is only used by tests (from cql_query_test)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 15:38:33 +03:00
Pavel Emelyanov
7597663ef5 cql_test_env: Use table.find_row() shortcut
The require_column_has_value() finds the cell in three steps -- finds
partition, then row, then cell. The class table already has a method to
facilitate row finding by partition and clustering key

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 15:37:27 +03:00
Kamil Braun
ebc9056237 Merge 'Restore storage_service -> cdc_generation_service dependency' from Pavel Emelyanov
The main goal of this PR is to stop cdc_generation_service from calling
system_keyspace::bootstrap_complete(). The reason why it's there is that
gen. service doesn't want to handle generation before node joined the
ring or after it was decommissioned. The cleanup is done with the help
of storage_service->cdc_generation_service explicit dependency brought
back and this, in turn, suddenly freed the raft and API code from the
need to carry cdc gen. service reference around.

Closes #15047

* github.com:scylladb/scylladb:
  cdc: Remove bootstrap state assertion from after_join()
  cdc: Rework gen. service check for bootstrap state
  api: Don't carry cdc gen. service over
  storage_service: Use local cdc gen. service in join_cluster()
  storage_service: Remove cdc gen. service from raft_state_monitor_fiber()
  raft: Do not carry cdc gen. service over
  storage_service: Use local cdc gen. service in topo calls
  storage_service: Bring cdc_generation_service dependency back
2023-08-29 14:10:06 +02:00
Piotr Dulikowski
5b99a9c084 topology_coordinator: account for features to enable in should_preempt_balancing
In #14722, a source of work was added to `handle_topology_transition`,
but the `should_preempt_balancing` function was not updated accordingly,
as is suggested by the comment in `handle_topology_transition`. This
omission happened due to a hasty rebase.

This commit fixes the issue, and now `should_preempt_balancing` will
return true if there are some features that should be enabled.
2023-08-29 11:53:12 +02:00
Piotr Dulikowski
371b640309 group0_state_machine: flush system.topology when updating features columns
The `supported_features` and `enabled_features` columns from
`system.topology` are read during the feature check that happens early
on boot. The check enforces two properties:

- A node is not allowed to revoke support for a feature after it notices
  in its local topology state that the feature is supported by all
  nodes.
- Similarly, a node is not allowed to revoke support for a feature after
  seeing that it was put to the `enabled_features` column by the
  topology coordinator.

However, due to the fact that the check has to happen before (schema)
commitlog replay and the table is not explicitly flushed when
`supported_features` or `enabled_features` columns are modified, the
feature check on boot might operate on old data and not do its job
properly.

In order to fix this, this commit modifies the `group0_state_machine` so
that it flushes the `system.topology` table every time the
`supported_features` or `enabled_features` column is modified, and after
every snapshot transfer.
2023-08-29 11:53:11 +02:00
Mikołaj Grzebieluch
bac8aa38d9 tests: raft: cluster of nodes with different group0 ids
The reproducer for #14448.

The test starts two nodes with different group0_ids. The second node
is restarted and tries to join the cluster consisting of the first node.
gossip_digest_syn message should be rejected by the first node, so
the second node will not be able to join the cluster.

This test uses repair-based node operations to make it easier.
If the second node successfully joins the cluster, their token metadata
will be merged and the repair service will allow decommissioning the second node.
If not - decommissioning the second node will fail with an exception
"zero replica after the removal" thrown by the repair service.
2023-08-29 11:09:15 +02:00
Mikołaj Grzebieluch
2230abc9b2 gossip: add group0_id attribute to gossip_digest_syn
Gossiper service sets its group0 id on startup if it is stored in `scylla_local`
or sets it during joining group0.

Send group0_id (if it is set) when the node tries to initiate the gossip round.
When a node gets gossip_digest_syn it checks if its group0 id equals the local
one and if not, the message is discarded.

Fixes #14448.
2023-08-29 11:09:15 +02:00
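
The group0-id check described in the commit above can be sketched with a small standalone function. This is an illustrative model, not Scylla's actual code: `accept_syn` and the simplified `gossip_digest_syn` struct below are made-up names, and the real implementation works with Raft group ids rather than plain strings.

```cpp
#include <optional>
#include <string>

// Simplified stand-in for the real message type: the sender's group0 id
// is optional, since nodes that predate the feature do not send one.
struct gossip_digest_syn {
    std::optional<std::string> group0_id;
};

// Returns true if the SYN should be processed, false if it must be discarded.
bool accept_syn(const std::optional<std::string>& local_group0_id,
                const gossip_digest_syn& syn) {
    // If either side has no group0 id yet, fall back to the old behavior
    // and accept, for compatibility with unupgraded nodes.
    if (!local_group0_id || !syn.group0_id) {
        return true;
    }
    // Discard messages coming from a node that belongs to a different group 0.
    return *local_group0_id == *syn.group0_id;
}
```

With this check, two independently bootstrapped clusters that start gossiping at each other reject one another's SYN messages instead of merging their token rings.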
Aleksandra Martyniuk
5e31ca7d20 tasks: api: show tasks' scopes
To make manual analysis of task manager tasks easier, task_status
and task_stats contain operation scope (e.g. shard, table).

Closes #15172
2023-08-29 11:32:16 +03:00
Benny Halevy
fbc2907e70 gossiper: remove_endpoint: get the endpoint_state before yielding
We want to call the on_dead notifications if the
node was alive and it had endpoint_state.
Get the ep state before we may yield in
mutate_live_and_unreachable_endpoints, similarly
to mark_dead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-29 10:58:51 +03:00
Benny Halevy
adf5a1e082 gossiper: add do_on_dead_notifications
Use common code to notify subscribers on_dead
from remove_endpoint() and from mark_dead().

Modeled after do_on_change_notifications.

Refs https://github.com/scylladb/scylladb/pull/15179#discussion_r1306969125

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-29 10:58:37 +03:00
Botond Dénes
57deeb5d39 Merge 'gossiper: add get_unreachable_members_synchronized and use over api' from Benny Halevy
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.

Fixes #12261
Fixes #15088

Also, add rest_api unit test for those apis

Closes #15093

* github.com:scylladb/scylladb:
  test: rest_api: add test_gossiper
  gossiper: add get_unreachable_members_synchronized
2023-08-29 10:43:22 +03:00
Pavel Emelyanov
4bf8f693ee cdc: Remove bootstrap state assertion from after_join()
As was described in the previous patch, this method is explicitly called
by storage service after updating the bootstrap state, so it's unneeded

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:47:35 +03:00
Pavel Emelyanov
566f57b683 cdc: Rework gen. service check for bootstrap state
The legacy_handle_cdc_generation() checks if the node had bootstrapped
with the help of a system_keyspace method. The former is called in two
cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().

The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still the case when it may be stepped on -- decommission. When performed
it calls storage_service::leave_ring() which updates the bootstrap state
to be "needed", thus preventing the cdc gen. service from doing anything
inside gossiper's on_...() notifications.

It would be more correct to stop the cdc gen. service from handling
gossiper notifications by unsubscribing it, not by adding fragile
implicit dependencies on the bootstrap state.

Checks for sys.dist.ks in the legacy_handle_...() are kept in a form
of on-internal-error. The system distributed keyspace is activated by
storage service even before the bootstrap state is updated and is
never deactivated, but it's anyway good to have this assertion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:46:13 +03:00
Pavel Emelyanov
c3a6e31368 api: Don't carry cdc gen. service over
There's a storage_service/cdc_streams_check_and_repair endpoint that
needs to provide cdc gen. service to call storage_service method on. Now
the latter has its own reference to the former and API can stop taking
care of that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Pavel Emelyanov
a61454be00 storage_service: Use local cdc gen. service in join_cluster()
The method in question accepts cdc_generation_service ref argument from
main and cql_test_env, but storage service now has local cdc gen.
service reference, so this argument and its propagation down the stack
can be removed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Pavel Emelyanov
acc646fab6 storage_service: Remove cdc gen. service from raft_state_monitor_fiber()
This argument is just unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Pavel Emelyanov
4c89249c29 raft: Do not carry cdc gen. service over
There's a cdc_generation_service ref sitting on group0_state_machine and
the only reason it's there is to call storage_service::topology_...()
methods. Now that storage service can access cdc gen. service on its
own, raft code can forget about cdc

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Pavel Emelyanov
c03d419adc storage_service: Use local cdc gen. service in topo calls
The topology_state_load() and topology_transition() both take cdc gen.
service as an argument, but can work with the local reference. This
makes the next patch possible

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Pavel Emelyanov
933ea0afe6 storage_service: Bring cdc_generation_service dependency back
It sort of reverts the 5a97ba7121 commit, because storage service now
uses the cdc generation service to serve raft topo updates which, in
turn, carries the cdc gen. service all over the raft code _just_ to pass
it as an argument to storage service topo calls.

Also there's API carrying cdc gen. service for the single call and also
there's an implicit need to kick cdc gen. service on decommission which
also needs storage service to reference cdc gen. after boot is complete

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Benny Halevy
2c54d7a35a view, storage_proxy: carry effective_replication_map along with endpoints
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.

Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138

The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141

Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15142
2023-08-29 09:08:42 +03:00
Benny Halevy
0b9f221f2a gossiper: wait_for_live_nodes_to_show_up: increase timeout
This function is too flaky with the 30 seconds timeout.

For example, the following was seen locally with
`test_updated_shards_during_add_decommission_node` in dev mode:

alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node6.log:
```
INFO  2023-08-27 15:47:25,753 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-08-27 15:47:30,754 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-08-27 15:47:35,761 [shard 0] gossip - (rate limiting dropped 495 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-08-27 15:47:40,766 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-08-27 15:47:45,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-08-27 15:47:50,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
ERROR 2023-08-27 15:47:55,758 [shard 0] gossip - Timed out waiting for 2 live nodes to show up in gossip
INFO  2023-08-27 15:47:55,759 [shard 0] init - Shutting down group 0 service
```

alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node1.log:
```
INFO  2023-08-27 15:48:02,532 [shard 0] gossip - InetAddress 127.0.43.6 is now UP, status = UNKNOWN
...
WARN  2023-08-27 15:48:03,552 [shard 0] gossip - failure_detector_loop: Send echo to node 127.0.43.6, status = failed: seastar::rpc::closed_error (connection is closed)
```

Note that node1 saw node6 as UP after node6 already timed out
and was shutting down.

Increase the timeout to 3 minutes in all modes to reduce flakiness.

Fixes #15185

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15186
2023-08-29 09:02:41 +03:00
Michał Chojnowski
2000a09859 reader_concurrency_semaphore: fix a deadlock between stop() and execution_loop()
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.

Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.

A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.

After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.

Fixes #15198

Closes #15199
2023-08-29 08:18:49 +03:00
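
The shutdown ordering that fixes the deadlock above can be modeled with plain standard-library primitives. This is a simplified sketch under stated assumptions, not Scylla's code (which uses seastar's `condition_variable` and `gate`, not threads): `stop()` may break the loop's condition variable only after it is certain no new work can arrive (the `_permit_gate.close()` step — represented here by the caller ceasing submissions before calling `stop()`), and the loop drains everything still queued before exiting.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class ready_queue {
    std::mutex _m;
    std::condition_variable _cv;
    std::deque<int> _ready;
    bool _broken = false;       // set by stop(), only after work can no longer arrive
    std::vector<int> _done;
    std::thread _loop;          // stand-in for execution_loop()
public:
    ready_queue() : _loop([this] { run(); }) {}

    void submit(int v) {
        { std::lock_guard<std::mutex> g(_m); _ready.push_back(v); }
        _cv.notify_one();
    }

    // Analogue of the fixed stop(): the caller guarantees no further
    // submit() calls (the gate-close step); only then do we break the cv.
    void stop() {
        { std::lock_guard<std::mutex> g(_m); _broken = true; }
        _cv.notify_one();
        _loop.join();
    }

    std::vector<int> done() {
        std::lock_guard<std::mutex> g(_m);
        return _done;
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(_m);
        for (;;) {
            _cv.wait(lk, [this] { return !_ready.empty() || _broken; });
            // Drain everything queued before even considering exit, so no
            // entry can be stranded on the ready list after the loop quits.
            while (!_ready.empty()) {
                _done.push_back(_ready.front());
                _ready.pop_front();
            }
            if (_broken) {
                return;
            }
        }
    }
};
```

The buggy pattern the commit removes is checking the `_stopped` flag inside the loop while work could still arrive: a task enqueued after the flag flips would never run, and shutdown would block forever waiting for it to be destroyed.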
Benny Halevy
5afc242814 token_metadata: get_endpoint_to_host_id_map_for_reading: just inform that normal node has null host_id
It is too early to require that all nodes in normal state
have a non-null host_id.

The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.

In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.

Fixes scylladb/scylladb#15181

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15184
2023-08-28 21:40:55 +03:00
Botond Dénes
47ce69e9bf Merge 'paxos_response_handler: carry effective replication map' from Benny Halevy
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Closes #15141

* github.com:scylladb/scylladb:
  storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
  storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
2023-08-28 11:42:38 +03:00
Kefu Chai
86e8be2dcd replica:database: log if endpoint not found
if the endpoint specified when creating a KEYSPACE is not found,
when flushing a memtable, we would throw an `std::out_of_range`
exception when looking up the client in `storage_manager::_s3_endpoints`
by the name of endpoint. and scylla would crash because of it. so
far, we don't have a good way to error out early. since the
storage option for keyspace is still experimental, we can live
with this, but would be better if we can spot this error in logging
messages when testing this feature.

also, in this change, `std::invalid_argument` is thrown instead of
`std::out_of_range`. it's more appropriate in this circumstance.

Refs #15074
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15075
2023-08-28 10:51:19 +03:00
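
The change above (log and throw a clearer error instead of letting `map::at()` raise a bare `std::out_of_range`) can be sketched as follows. Names here are illustrative stand-ins, not the actual `storage_manager` API.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the per-endpoint S3 client object.
struct s3_client {
    std::string name;
};

// Look the endpoint up explicitly so a missing entry produces a descriptive
// std::invalid_argument (in Scylla, this is also where the error is logged)
// rather than the opaque std::out_of_range that map::at() would throw.
s3_client& find_endpoint_client(std::map<std::string, s3_client>& endpoints,
                                const std::string& endpoint) {
    auto it = endpoints.find(endpoint);
    if (it == endpoints.end()) {
        throw std::invalid_argument("unknown S3 endpoint: " + endpoint);
    }
    return it->second;
}
```

This does not error out at CREATE KEYSPACE time (the commit notes there is no good way to do that yet), but it makes the misconfiguration visible in logs instead of crashing on an unhandled exception during memtable flush.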
Avi Kivity
fb8375e1e7 Merge 'storage_proxy: mutate_atomically_result: carry effective replication map down to create_write_response_handler' from Benny Halevy
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.

Fixes scylladb/scylladb#15147

Closes #15149

* github.com:scylladb/scylladb:
  storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
  storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
2023-08-27 16:34:34 +03:00
Benny Halevy
a5d5b6ded1 gossiper: remove_endpoint: call on_dead notifications if endpoint was alive
Since 75d1dd3a76
gossiper::convict will no longer call `mark_dead`
(e.g. when called from the failure detection loop
after a node is stopped following decommission)
and therefore the on_dead notification won't get called.

To make that explicit, if the node was alive before
remove_endpoint erased it from _live_endpoint,
and it has an endpoint_state, call the on_dead notifications.
These are important to clean up after the node is dead
e.g. in storage_proxy::on_down which cancels all
respective write handlers.

This is preferred over going through `mark_dead` as the latter
marks the endpoint as unreachable, which is wrong in this
case as the node left the cluster.

Fixes #15178

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15179
2023-08-27 16:18:27 +03:00
Takuya ASADA
ae25a216bc scylla_fstrim_setup: stop disabling fstrim.timer
Disabling fstrim.timer was meant to avoid running fstrim on /var/lib/scylla from
both scylla-fstrim.timer and fstrim.timer, but fstrim.timer actually never does
that, since it only looks at fstab entries, not our systemd unit.

To run fstrim correctly on rootfs and other filesystems not related to
scylla, we should stop disabling fstrim.timer.

Fixes #15176

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #15177
2023-08-27 14:56:37 +03:00
Kefu Chai
83ceedb18b storage_service: do not cast a string to string_view before formatting
seastar::format() just forwards the parameters to be formatted to
`fmt::format_to()`, which is able to format `std::string`, so there is
no need to cast the `std::string` instance to `std::string_view` for
formatting it.

in this change, the cast is dropped. simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15143
2023-08-25 16:43:38 +03:00
Mikołaj Grzebieluch
a031a14249 tests: add asynchronous log browsing functionality
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" to the current position of the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.

Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.

Fixes #14782

Closes #14834
2023-08-25 14:19:09 +02:00
Raphael S. Carvalho
a22f74df00 table: Introduce storage snapshot for upcoming tablet streaming
New file streaming for tablets will require integration with compaction
groups. So this patch introduces a way for streaming to take a storage
snapshot of a given tablet using its token range. Memtable is flushed
first, so all data of a tablet can be streamed through its sstables.
The interface is compaction group / tablet agnostic, but user can
easily pick data from a single tablet by using the range in tablet
metadata for a given tablet.

E.g.:

	auto erm = table.get_effective_replication_map();
	auto& tm = erm->get_token_metadata();
	auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id());

	for (auto tid : tablet_map.tablet_ids()) {
		auto tr = tablet_map.get_token_range(tid);

		auto ssts = co_await table.take_storage_snapshot(tr);

		...
	}

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15128
2023-08-25 13:06:02 +02:00
Patryk Jędrzejczak
64a9bbe0ce docs: inform about deprecating IP support for --ignore-dead-nodes
We also remove one of the removenode examples because it uses the
deprecated IP support.
2023-08-25 12:33:49 +02:00
Patryk Jędrzejczak
b2755755f4 raft topology: support IPs for --ignore-dead-nodes
We want to stop supporting IPs for --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace. However, we shouldn't remove these features without
the deprecation period because the original removenode and
replace operations still support them. So, we add them for now.

Additionally, we modify test_raft_ignore_nodes.py so that it
verifies the added IP support.
2023-08-25 12:33:45 +02:00
Patryk Jędrzejczak
9806bddf75 test: fix a test case in raft_address_map_test
The test didn't test what it was supposed to test. It would pass
even if set_nonexpiring() didn't insert a new entry.

Closes #15157
2023-08-25 12:11:33 +02:00
Kefu Chai
d2d1141188 sstables: writer: delegate flush() in checksummed_file_data_sink_impl
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` wants to ensure that the data sent to the sink
is pushed to the underlying storage / channel.

this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystem
implements the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data loss if the application / machine /
network breaks when the data is considered persisted but it is
_not_!

but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.

the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.

in this change, we just delegate the `flush()` call to the
wrapped class.

Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15134
2023-08-24 18:03:10 +03:00
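
The delegation fix in the commit above can be sketched with minimal stand-in types. This is a hedged model, not Scylla's code (the real classes wrap seastar's `data_sink_impl`): a checksumming wrapper must forward `flush()` to the sink it wraps, otherwise a multipart-upload sink underneath never sees the flush and aborts the upload on `close()`.

```cpp
#include <memory>
#include <utility>

// Stand-in for data_sink_impl: flush() defaults to a no-op.
struct sink {
    virtual ~sink() = default;
    virtual void flush() {}
    virtual void close() {}
};

// Stand-in for the multipart upload sink: closing without a prior flush
// aborts the upload and deletes the partially uploaded parts.
struct upload_sink : sink {
    bool flushed = false;
    bool aborted = false;
    void flush() override { flushed = true; }
    void close() override { if (!flushed) { aborted = true; } }
};

// Stand-in for checksummed_file_data_sink_impl wrapping an inner sink.
struct checksummed_sink : sink {
    std::unique_ptr<sink> out;
    explicit checksummed_sink(std::unique_ptr<sink> s) : out(std::move(s)) {}
    // The bug: inheriting the no-op flush() short-circuits the wrapped sink.
    // The fix: delegate the flush to the wrapped sink.
    void flush() override { out->flush(); }
    void close() override { out->close(); }
};
```

Before the fix, `output_stream`'s flush-then-close sequence reached the wrapper but never the inner upload sink, so the upload was always aborted; with delegation, the inner sink sees the flush and the subsequent close finalizes the upload.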
Patryk Jędrzejczak
59df5ce7e4 raft_address_map: introduce find_by_addr
In the following commit, we add IP support for --ignore-dead-nodes
in raft_removenode and raft_replace. To implement it, we need
a way to translate IPs to Raft IDs. The solution is to add a new
member function -- find_by_addr -- to raft_address_map that
does the IP->ID translation.

The IP support for --ignore-dead-nodes will be deprecated and
find_by_addr shouldn't be called for other reasons, so it always
logs a warning.

We also add some unit tests for find_by_addr.
2023-08-24 15:10:43 +02:00
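
The IP -> Raft ID translation described above amounts to a reverse lookup over the address map. The sketch below uses simplified stand-in types (plain strings rather than the real `raft_address_map` entry types); a linear scan is acceptable since the function only serves the deprecated IP-based `--ignore-dead-nodes` path.

```cpp
#include <map>
#include <optional>
#include <string>

// Simplified stand-ins for the real identifier types.
using raft_id = std::string;
using ip_addr = std::string;

// Reverse-lookup a Raft ID by IP address over the forward ID -> IP map.
// In Scylla, find_by_addr always logs a deprecation warning first, since
// callers should be passing host IDs instead of IPs.
std::optional<raft_id> find_by_addr(const std::map<raft_id, ip_addr>& id_to_ip,
                                    const ip_addr& addr) {
    for (const auto& [id, ip] : id_to_ip) {
        if (ip == addr) {
            return id;
        }
    }
    return std::nullopt;
}
```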
Raphael S. Carvalho
d6cc752718 test: Fix flakiness in sstable_compaction_test.autocompaction_control_test
It's possible that compaction task is preempted after completion and
before reevaluation, causing pending_tasks to be > 1.

Let's only exit the loop if there are no pending tasks, and also
reduce 100ms sleep which is an eternity for this test.

Fixes #14809.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15059
2023-08-24 13:37:06 +03:00
Benny Halevy
4a2e367e92 storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
As `create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently, the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.

Fixes scylladb/scylladb#15138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:45:13 +03:00
Benny Halevy
672ec66769 test: rest_api: add test_gossiper
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:37:12 +03:00
Benny Halevy
8825828817 gossiper: add get_unreachable_members_synchronized
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.

Fixes scylladb/scylladb#15088

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:36:55 +03:00
Benny Halevy
415d923c08 gossiper: apply_new_states: tolerate listener errors during shutdown
Change 6449c59 brought back abort on listener failure,
but this causes crashes when listeners hit expected errors
like gate_closed.

Detect shutdown using the gossiper _abort_source
and in this case just log a warning about the errors
but do not abort.

Fixes scylladb/scylladb#15031

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:12:03 +03:00
Benny Halevy
40f508a51d gossiper: do_on_change_notifications: check abort source
As Tomasz Grabiec correctly noted:
> We should also ensure that once _abort_source is aborted, we don't attempt to process any further notifications, because that would violate monotonicity due to partially failed notification. Even if the next listener eventually fails too, if this invariant is violated, it can have undesired side effects.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:12:03 +03:00
Benny Halevy
85f553e723 gossiper: lock_endpoint_update_semaphore: get_units with _abort_source
Locking the _endpoint_update_semaphore should be abortable with the
gossiper _abort_source.  No further processing should
be done once abort is requested.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:12:03 +03:00
Benny Halevy
d5dff1a16e gossiper: lock_endpoint: get_units with _abort_source
Locking an endpoint should be abortable with the
gossiper _abort_source.  No further processing should
be done once abort is requested.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:12:03 +03:00
Benny Halevy
ae70afd099 gossiper: is_enabled: consider also _abort_source
Once abort is requested we should not process any more
gossip RPCs to prevent undesired side effects
of partially applied state changes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 11:12:02 +03:00
Benny Halevy
6af0b281a6 storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.

Fixes scylladb/scylladb#15147

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 10:43:40 +03:00
Benny Halevy
098dd5021a storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.

It will be used in the next patch to get a respective
effective_replication_map_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 10:43:23 +03:00
Benny Halevy
27c33015a5 storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
Return error in production rather than crashing
as in https://github.com/scylladb/scylladb/issues/15138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-24 08:59:38 +03:00
Kefu Chai
2f17b76df7 docs/operating-scylla/admin-tools: add note on deprecating sstabledump
sstabledump is deprecated in favor of the `scylla sstable` commands, so
let's reflect this in the documentation.

Fixes #15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15021
2023-08-24 08:31:29 +03:00
Botond Dénes
1609c76d62 tools/scylla-sstable: scrub: don't quarantine sstables after validate
Scylla sstable promises to *never* mutate its input sstables. This
promise was broken by `scylla sstable scrub --scrub-mode=validate`,
because validate moves invalid input sstables into quarantine. This is
unexpected and caused occasional failures in the scrub tests in
test_tools.py. Fix by propagating a flag down to
`scrub_sstables_validate_mode()` in `compaction.cc`, specifying whether
validate should quarantine invalid sstables, then setting this flag to false
in `scylla-sstable.cc`. The existing test for validate-mode scrub is
amended to check that the sstable is not mutated. The test now fails
before the fix and passes afterwards.

Fixes: #14309

Closes #15139
2023-08-23 21:53:12 +03:00
Kamil Braun
93be4c0cb0 Merge 'Base node liveliness consistently on gossiper::is_alive' from Benny Halevy
Currently the gossiper marks endpoint_state objects as alive/dead.
In some cases the endpoint_state::is_alive function is checked, but in many other cases
gossiper::is_alive(endpoint) is used to determine if the endpoint is alive.

This series removes the endpoint_state::is_alive state and moves all the logic to gossiper::is_alive,
which bases its decision on the endpoint having an endpoint_state and being in the _live_endpoints set.

For that, _live_endpoints is made sure to be replicated to all shards when changed,
and the endpoint_state changes are serialized under lock_endpoint, also making sure that the
endpoint_state in the _endpoint_states_map is never updated in place; rather, a temporary copy is changed
and then safely replicated using gossiper::replicate.

Refs https://github.com/scylladb/scylladb/issues/14794

Closes #14801

* github.com:scylladb/scylladb:
  gossiper: mark_alive: remove local_state param
  endpoint_state: get rid of _is_alive member and methods
  gossiper: is_alive: use _live_endpoints
  gossiper: evict_from_membership: erase endpoint from _live_endpoints
  gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change
  gossiper: run: no need to replicate live_endpoints
  gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change
  gossiper: add mutate_live_and_unreachable_endpoints
  gossiper: reset_endpoint_state_map: clear also shadow endpoint sets
  gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards
  gossiper: functions that change _live_endpoints must be called on shard 0
  gossiper: add lock_endpoint_update_semaphore
  gossiper: make _live_endpoints an unordered_set
  endpoint_state: use gossiper::is_alive externally
2023-08-23 17:18:05 +02:00
Gleb Natapov
d1654ccdda storage_service: register schema version observer before joining group0 and starting gossiper
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.

Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.

Fixes: #15078

Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
2023-08-23 17:11:51 +02:00
Patryk Jędrzejczak
ef2eac9941 raft topology: make every type in request_param a named struct
We make every alternative type in the request_param variant
a named struct to make the code more readable. Additionally, this
change will make extending request parameters easier if we decide
to do so in the future.

Closes #15132
2023-08-23 16:56:00 +02:00
Patryk Jędrzejczak
7eab9f8a02 raft_removenode: remove "raft topology" from errors
Some runtime errors thrown in storage_service::raft_removenode
start with the "raft topology " prefix. Since "raft topology" is
an implementation detail, we don't want to throw this information
through the user API. Only logs should contain it.

Closes #15136
2023-08-23 16:20:14 +02:00
Amnon Heiman
4b1be88c93 service/storage_proxy.cc: mark counters with skip_when_empty
This patch marks per-scheduling-group counters with the skip_when_empty flag.
This reduces metrics reporting for scheduling groups that do not use
those counters.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-23 09:38:35 -04:00
Amnon Heiman
c279409d48 cql3/query_processor.cc: mark cas related metrics with skip_when_empty
This patch marks the conditional metrics counters with the skip_when_empty
flag, to reduce metrics reporting when CAS is not in use.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-23 09:30:35 -04:00
Amnon Heiman
1abcd4bb11 transport/server.cc: mark metric counter with skip_when_empty
This patch marks scylla_transport_cql_errors_total with the skip_when_empty
flag.

It reduces the overhead for metrics for types that are never reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-23 09:30:35 -04:00
Nadav Har'El
5530c529c2 test/cql-pytest: regression test for old bug with CAST(f AS TEXT) precision
When casting a float or double column to a string with `CAST(f AS TEXT)`,
Scylla is expected to print the number with enough digits so that reading
that string back to a float or double restores the original number
exactly. This expectation isn't documented anywhere, but makes sense,
and is what Cassandra does.

Before commit 71bbd7475c, this wasn't the
case in Scylla: `CAST(f AS TEXT)` always printed 6 digits of precision,
which was a bit under enough for a float (which can have 7 decimal digits
of precision), but very much not enough for a double (which can need 15
digits). The origin of this magic "6 digits" number was that Scylla uses
seastar::to_sstring() to print the float and double values, and before
the aforementioned commit those functions used sprintf with the "%g"
format - which always prints 6 decimal digits of precision! After that
commit, to_sstring() now uses a different approach (based on fmt) to
print the float and double values, that prints all significant digits.

This patch adds a regression test for this bug: We write float and double
values to the database, cast them to text, and then recover the float
or double number from that text - and check that we get back exactly the
same float or double object. The test *fails* before the aforementioned
commit, and passes after it. It also passes on Cassandra.

Refs #15127

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15131
2023-08-23 16:06:52 +03:00
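The round-trip property that test checks can be illustrated in Python, whose repr() likewise prints the shortest string that parses back to the exact same double (an illustration only, not Scylla's fmt-based code path):

```python
d = 16777217.0 / 3.0          # a double whose decimal expansion needs more than 6 digits

# Shortest round-tripping representation: parsing it back restores d exactly.
assert float(repr(d)) == d

# The old "%g"-style 6-significant-digit printing loses information:
assert float(f"{d:.6g}") != d
```

The same property is what the regression test asserts end-to-end: write the value, CAST it to text, parse the text, and compare for exact equality.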
Botond Dénes
e7af2a7de8 Merge 'token_metadata::get_endpoint_to_host_id_map_for_reading: restrict to token owners' from Benny Halevy
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.

Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793

Closes #14844

* github.com:scylladb/scylladb:
  api: storage_service: improve description of /storage_service/host_id
  token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
2023-08-23 13:55:14 +03:00
Botond Dénes
139ba553b8 Merge 'sstable, test: log sstable name and pk when capping local_deletion_time ' from Kefu Chai
in this series, we also print the sstable name and pk when writing a tombstone whose local_deletion_time (ldt for short) is greater than INT32_MAX, which cannot be represented by a uint32_t.

Fixes #15015

Closes #15107

* github.com:scylladb/scylladb:
  sstable/writer: log sstable name and pk when capping ldt
  test: sstable_compaction_test: add a test for capped tombstone ldt
2023-08-23 09:29:54 +03:00
Botond Dénes
f7505405f0 scylla-gdb.py: use for_each_table() everywhere
scylla-gdb.py has two methods for iterating over all tables:
* all_tables()
* for_each_table()

Despite this, many places in the code iterate over the column family map
directly. This patch leaves just a single method (for_each_table()) and
migrates all the codebase to use it, instead of iterating over the raw
map. While at it, the access to the map is made backward compatible with
pre-52afd9d42d code; said commit wrapped database::_column_families in a
tables_metadata object, which broke scylla-gdb.py for older versions.

Closes #15121
2023-08-22 20:39:31 +03:00
Kamil Braun
169d19e5b0 Merge 'raft topology: support --ignore-dead-nodes in removenode and replace' from Patryk Jędrzejczak
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.

The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.

Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.

Fixes #15025

Closes #15113

* github.com:scylladb/scylladb:
  test: add test_raft_ignore_nodes
  test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
  raft topology: pass ignore_nodes to {replace, remove}_with_repair
  raft topology: exec_global_command: add ignore_nodes to exclude_nodes
  raft topology: exec_global_command: change type of exclude_nodes
  topology_state_machine: extend request_param with a set of raft ids
  raft topology: set ignore_nodes in raft_removenode and raft_replace
  utils: introduce split_comma_separated_list
  raft topology: add the ignore_nodes column to system.topology
2023-08-22 18:04:59 +02:00
Kamil Braun
cdc3cd2b79 Merge 'raft: add fencing tests' from Petr Gusev
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.

The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.

Closes #14901

* github.com:scylladb/scylladb:
  test_fencing: add test_fence_hints
  test.py: output the skipped tests
  test.py: add skip_mode decorator and fixture
  test.py: add mode fixture
  hints: add debug log for dropped hints
  hints: send_one_hint: extend the scope of file_send_gate holder
  pylib: add ScyllaMetrics
  hints manager: add send_errors counter
  token_metadata: add debug logs
  fencing: add simple data plane test
  random_tables.py: add counter column type
  raft topology: don't increment version when transitioning to node_state::normal
2023-08-22 16:28:21 +02:00
Piotr Grabowski
17e3e367ca test: use more frequent reconnection policy
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.

This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.

Fix the issue by using an exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with a reconnection exception stack trace.

Fixes #15104

Closes #15112
2023-08-22 15:40:39 +02:00
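The capped schedule this fix relies on can be sketched as follows (a simplified model; the actual driver policy also adds jitter):

```python
def reconnect_schedule(base=1.0, cap=4.0, attempts=6):
    """Exponential backoff intervals, doubling each retry but capped at
    `cap` seconds so a long outage never pushes retries minutes apart."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]

print(reconnect_schedule())   # [1.0, 2.0, 4.0, 4.0, 4.0, 4.0]
```

With the default 600-second cap, the same schedule would keep doubling toward ten-minute gaps, which is what made restarted nodes look unreachable for the remainder of a test.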
Avi Kivity
d944872d19 Merge 'Prevent reactor stalls in to_repair_rows_list' from Benny Halevy
This short series deals with two stall sources in row-level repair's `to_repair_rows_list`:
1. Freeing the input `repair_rows_on_wire` in one shot on return (as seen in https://github.com/scylladb/scylladb/issues/14537)
2. Freeing the result `row_list` in one shot on error. This hasn't been seen in testing, but there is no reason to believe it is not susceptible to stalls exactly like `repair_rows_on_wire` with the same number of rows and mutations.

Fixes https://github.com/scylladb/scylladb/issues/14537

Closes #15102

* github.com:scylladb/scylladb:
  repair: reindent to_repair_rows_list
  repair: to_repair_rows_list: clear_gently on error
  repair: to_repair_rows_list: consume frozen rows gently
2023-08-22 15:29:37 +03:00
Patryk Jędrzejczak
b044ee535f test: add test_raft_ignore_nodes
We add two tests verifying that --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace are handled correctly.

We need a 7-node cluster to maintain a Raft majority. Therefore, these
tests are quite slow, and we want to run them only in the dev mode.
2023-08-22 14:19:21 +02:00
Patryk Jędrzejczak
6818d13f7d test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
ManagerClient.remove_node allows passing ignore_dead only as
List[IPAddress]. However, raft_removenode currently supports
only host ids. To write a test that passes ignore_dead to
ManagerClient.remove_node in the Raft topology mode, we allow
passing ignore_dead as List[HostId].

Note that we don't want to use List[IPAddress | HostId] because
mixing IP addresses and host ids fails anyway. See
ss::remove_node.set(...) in api::set_storage_service.
2023-08-22 14:19:09 +02:00
Patryk Jędrzejczak
26ad527666 raft topology: pass ignore_nodes to {replace, remove}_with_repair
To properly stream ranges during the removenode or replace
operation in the Raft topology mode, we pass IPs of the ignored
nodes to replace_with_repair and remove_with_repair in
storage_service::raft_topology_cmd_handler.
2023-08-22 14:18:39 +02:00
Patryk Jędrzejczak
e685182290 raft topology: exec_global_command: add ignore_nodes to exclude_nodes
We add ignore_nodes to exclude_nodes in exec_global_command
to ignore nodes marked as dead by --ignore-dead-nodes for
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace.
2023-08-22 14:18:37 +02:00
Patryk Jędrzejczak
5ebee35f99 raft topology: exec_global_command: change type of exclude_nodes
We extend exclude_nodes in exec_global_command with ignore_nodes
in the next commit. Since we already use std::unordered_set to
store ids of the ignored nodes and their number is unknown, we
change the type of exclude_nodes from utils::small_vector to
std::unordered_set.
2023-08-22 14:17:55 +02:00
Patryk Jędrzejczak
1f57d80ba1 topology_state_machine: extend request_param with a set of raft ids
We add two new alternative types to service::request_param:
removenode_param and replace_param. They allow storing the list
of ignored nodes loaded from the ignore_nodes column of
system.topology. We also remove the raft::server_id type because
it has been only used by the replace operation.
2023-08-22 14:17:37 +02:00
Patryk Jędrzejczak
7d3dc306eb raft topology: set ignore_nodes in raft_removenode and raft_replace
To handle --ignore-dead-nodes in raft_removenode and
--ignore-dead-nodes-for-replace in raft_replace, we set the
ignore_nodes value of the topology mutation in these functions. In
the following commits, we ensure that the topology coordinator
properly makes use of it.
2023-08-22 14:13:51 +02:00
Petr Gusev
1ddc76ffd1 test_fencing: add test_fence_hints
The test makes a write through the first node with
the third node down; this causes a hint to be stored on the
first node for the second. We increment the version
and fence_version on the third node, restart it,
and expect to see a hint delivery failure
because of the version mismatch. Then we update the versions
of the first node and expect the hint to be successfully
delivered.
2023-08-22 15:48:40 +04:00
Petr Gusev
3ccd2abad4 test.py: output the skipped tests
The pytest option -rs forces it to print
all the skipped tests along with
the reasons. Without this option we
can't tell why certain tests were skipped;
maybe some of them shouldn't be skipped anymore.
2023-08-22 15:48:40 +04:00
Petr Gusev
c434d26b36 test.py: add skip_mode decorator and fixture
Syntactic sugar for marking tests to be
skipped in a particular mode.

There are skip_in_debug/skip_in_release in suite.yaml,
but they can be applied only to an entire file,
which is unnatural and inconvenient. Also, they
don't allow specifying a reason why the test is skipped.

Separate dictionary skipped_funcs is needed since
we can't use pytest fixtures in decorators.
2023-08-22 15:48:40 +04:00
Petr Gusev
a639d161e6 test.py: add mode fixture
Sometimes a test wants to know what mode
it is running in so that, e.g., it can skip
itself in some modes.
2023-08-22 15:48:40 +04:00
Petr Gusev
439c91851f hints: add debug log for dropped hints
Dropping data is rather important event,
let's log it at least at the debug level.
It'll help in debugging tests.
2023-08-22 15:48:40 +04:00
Petr Gusev
9fd3df13a2 hints: send_one_hint: extend the scope of file_send_gate holder
The problem was that the holder in the with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (the segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending some hints led to errors.

Fixes #15110
2023-08-22 15:48:40 +04:00
Petr Gusev
0b7a90dff6 pylib: add ScyllaMetrics
This patch adds facilities to work
with Scylla metrics from test.py tests.
The new metrics property was added to
ManagerClient; its query method
sends a request to the Scylla metrics
endpoint and returns an object
to conveniently access the result.

ScyllaMetrics is copy-pasted from
test_shedding.py. It's difficult
to reuse code between the 'new' and 'old'
styles of tests; we can't just import
pylib in 'old' tests because of some
problems with Python search directories.
A past commit of mine that attempted
to solve this problem was rejected on review.
Petr Gusev
1b7603af23 hints manager: add send_errors counter
There was no indication of problems
in the hints manager metrics before.
We need this counter for the fencing tests
in a later commit, but it seems to be
useful on its own.
2023-08-22 14:31:04 +04:00
Petr Gusev
fa25e6d63e token_metadata: add debug logs
We log the new version when the new token
metadata is set.

Also, the log for fence_version is moved
in shared_token_metadata from storage_service
for uniformity.
2023-08-22 14:31:04 +04:00
Petr Gusev
360453fd87 fencing: add simple data plane test
The test starts a three node cluster
and manually decrements the version on
the last node. It then tries to write
some data through the last node and
expects to get 'stale topology' exception.
2023-08-22 14:31:01 +04:00
Benny Halevy
801987ab19 gossiper: mark_alive: remove local_state param
It is not used anymore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 12:06:45 +03:00
Benny Halevy
75d1dd3a76 endpoint_state: get rid of _is_alive member and methods
Now that gossiper bases its is_alive status on _live_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 12:06:45 +03:00
Benny Halevy
8a92a1c699 gossiper: is_alive: use _live_endpoints
Use the presence of the endpoint in _live_endpoints
as the authoritative source for is_alive,
rather than the endpoint_state::is_alive status.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 12:06:45 +03:00
Benny Halevy
a79acbb643 gossiper: evict_from_membership: erase endpoint from _live_endpoints
Although it shouldn't be necessary, erase the endpoint
from _live_endpoints, just in case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 12:06:30 +03:00
Benny Halevy
ce2b8724ed gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change
Rather than keeping expensive copies of `_live_endpoints`
and `_unreachable_endpoints` in shadow members just to detect
when their corresponding members change (they aren't currently
used for their content anyhow), use `_live_endpoints_version`.

With that, it is renamed to replicate_live_and_unreachable_endpoints.

This still doesn't provide strong exception safety guarantees,
but at least we don't "cheat" about shard state
and we don't mark shard 0 state as "replicated" by
updating the shadow members.
Also, we save some unneeded allocations.

Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 12:04:25 +03:00
Benny Halevy
d666fbfe8f gossiper: run: no need to replicate live_endpoints
As asias@scylladb.com noticed, after the previous
patch that calls replicate_live_endpoints_on_change
in mutate_live_and_unreachable_endpoints, _live_endpoints
are always updated on all shards when they change,
so there's no need anymore to replicate them here.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 11:57:32 +03:00
Benny Halevy
2c27297dbd gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change
We want to propagate any change to _live_endpoints
to all shards.  Currently we just update the `_live_endpoints_version`
and `replicate_live_endpoints_on_change` propagates the
change some undetermined time in the future.

To rely on `_live_endpoints` for gossiper::is_alive,
that may be called on any shard, we want to propagate
the change to all shards as soon as it happens.

Use `mutate_live_and_unreachable_endpoints` to update
_live_endpoints and/or _unreachable_endpoints safely,
under `lock_endpoint_update_semaphore`. It is responsible
for incrementing _live_endpoints_version and
calling `replicate_live_endpoints_on_change` to
propagate the change to all shards.

Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 11:56:41 +03:00
Benny Halevy
86ccc1f49b gossiper: add mutate_live_and_unreachable_endpoints
To be used for safely modifying _live_endpoints
and/or _unreachable_endpoints and replicating the
new version to all shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 11:56:37 +03:00
Patryk Jędrzejczak
0beabdc6ba utils: introduce split_comma_separated_list
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.

Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right: we replace '\"' and '\'' with ' '
but don't do anything with the result afterward.

To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.

This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
2023-08-22 10:30:36 +02:00
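A rough Python equivalent of what such a helper needs to do — trim whitespace, strip surrounding quotes, and drop empty items (the actual C++ signature and semantics of utils::split_comma_separated_list may differ):

```python
def split_comma_separated_list(text):
    """Split `text` on commas, trimming whitespace and surrounding quotes,
    and dropping items that end up empty."""
    items = (item.strip().strip("'\"") for item in text.split(","))
    return [item for item in items if item]

print(split_comma_separated_list(" a, 'b' ,\"c\", "))   # ['a', 'b', 'c']
```

Centralizing this avoids the bug described above, where quote characters were replaced but the result was discarded.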
Patryk Jędrzejczak
16f5db8af2 raft topology: add the ignore_nodes column to system.topology
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
2023-08-22 10:30:12 +02:00
Benny Halevy
a14e5ab8a3 gossiper: reset_endpoint_state_map: clear also shadow endpoint sets
If we don't clear them, there is a slight chance
that the next update will make `_live_endpoints` or `_unreachable_endpoints`
equal to their shadow counterparts and prevent an update
in `replicate_live_endpoints_on_change`.

Fixes #15003

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:17:19 +03:00
Benny Halevy
0cc0a95543 gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards
Not only on the calling shard (shard 0).
Essentially this change folds `update_live_endpoints_version`
into `reset_endpoint_state_map`.

Acquire the _endpoint_update_semaphore to serialize
this with `replicate_live_endpoints_on_change`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:17:19 +03:00
Benny Halevy
c45868e3bc gossiper: functions that change _live_endpoints must be called on shard 0
`update_live_endpoints_version` and functions that call it
must be called on shard 0, since it updates the authoritative
`_live_endpoints` and `_live_endpoints_version`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:17:19 +03:00
Benny Halevy
b0b1c8ae6e gossiper: add lock_endpoint_update_semaphore
Add a private helper to acquire the _endpoint_update_semaphore
before calling replicate_live_endpoints_on_change.

Must be called on shard 0.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:17:19 +03:00
Benny Halevy
18881bc89d gossiper: make _live_endpoints an unordered_set
It is more efficient to maintain as an unordered_set,
and it will be used in a following patch
to determine is_alive(endpoint) in O(1) on average.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:17:19 +03:00
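The O(1) liveness check reduces to hash-set membership; a tiny Python analogue (Python's set mirrors the average-case behavior of std::unordered_set):

```python
class Gossiper:
    """Sketch of liveness tracking via a single live-endpoints hash set."""
    def __init__(self):
        self._live_endpoints = set()      # hash set: O(1) average membership

    def mark_alive(self, endpoint):
        self._live_endpoints.add(endpoint)

    def mark_dead(self, endpoint):
        self._live_endpoints.discard(endpoint)

    def is_alive(self, endpoint):
        # Liveness is simply presence in the live set.
        return endpoint in self._live_endpoints
```

This is the invariant the series above establishes: there is no per-endpoint alive flag to fall out of sync, only membership in one authoritative set.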
Benny Halevy
97061cc3b8 endpoint_state: use gossiper::is_alive externally
Before we remove endpoint_state::_is_alive to rely
solely on gossiper::_live_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 09:06:09 +03:00
Benny Halevy
758dc252ff repair: reindent to_repair_rows_list 2023-08-22 08:46:26 +03:00
Benny Halevy
7406e9f99b repair: to_repair_rows_list: clear_gently on error
Prevent destroying potentially large `rows` and `row_list`
in one shot on error, as it might cause a reactor stall.

Instead, use utils::clear_gently on the error return path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 08:45:59 +03:00
Benny Halevy
e55143148f repair: to_repair_rows_list: consume frozen rows gently
Although to_repair_rows_list may yield if needed
between rows and mutation fragments, the input
`repair_rows_on_wire` is freed in one shot,
and that may cause stalls as seen in QA:
```
  |              bytes_ostream::free_chain at ././bytes_ostream.hh:163
  ++           - addr=0x4103be0:
  |              bytes_ostream::~bytes_ostream at ././bytes_ostream.hh:199
  |              (inlined by) frozen_mutation_fragment::~frozen_mutation_fragment at ././mutation/frozen_mutation.hh:273
  |              (inlined by) std::destroy_at<frozen_mutation_fragment> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
  |              (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
  |              (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
  |              (inlined by) std::__cxx11::_List_base<frozen_mutation_fragment, std::allocator<frozen_mutation_fragment> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
  |              (inlined by) partition_key_and_mutation_fragments::~partition_key_and_mutation_fragments at ././repair/repair.hh:203
  |              (inlined by) std::destroy_at<partition_key_and_mutation_fragments> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
  |              (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
  |              (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
  |              (inlined by) std::__cxx11::_List_base<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
  |              (inlined by) to_repair_rows_list at ./repair/row_level.cc:597
```

This change consumes the rows and frozen mutation fragments
incrementally, freeing each after being processed.

Fixes scylladb/scylladb#14537

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-22 08:45:54 +03:00
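The idea behind consuming the container gently can be sketched with asyncio: destroy a bounded batch of elements, then yield to the scheduler (an analogue of utils::clear_gently and Seastar's maybe_yield, not the actual implementation):

```python
import asyncio

async def clear_gently(container, batch=1024):
    """Destroy elements incrementally, yielding between batches so one huge
    container cannot monopolize the reactor the way a one-shot destructor would."""
    freed = 0
    while container:
        container.pop()
        freed += 1
        if freed % batch == 0:
            await asyncio.sleep(0)    # give other tasks a chance to run

rows = list(range(10_000))
asyncio.run(clear_gently(rows))
print(len(rows))    # 0
```

The fix above applies the same pattern to `repair_rows_on_wire`: each row and frozen mutation fragment is freed as it is consumed rather than all at once when the list goes out of scope.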
Nadav Har'El
a963b59495 test/cql-pytest: add reproducer for IN not working with secondary index
We already have a test for issue #13533, where an "IN" doesn't work with
a secondary index (the secondary index isn't used in that case, and
instead inefficient filtering is required). Recently a user noticed the
same problem also exists for local secondary indexes - and this patch
includes a reproducing test. The new test is marked xfail, as the issue is
still unfixed. The new test is Scylla-only because local secondary index
is a Scylla-only extension that doesn't exist in Cassandra.

Refs #13533.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15106
2023-08-22 07:25:32 +03:00
Avi Kivity
23be6f0336 tablets: change persistent type of replica set from set to list
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3] n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.

The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.

The code that generates new replica sets already preserves order:
see locator::replace_replica().

While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.

Closes #15111
2023-08-21 22:55:14 +02:00
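The ordering problem the commit describes is easy to demonstrate (a Python stand-in for `locator::replace_replica`; the function name and shape are illustrative):

```python
def replace_replica(replicas, old, new):
    """Replace `old` with `new` in place, preserving every other
    replica's position in the replica set."""
    return [new if r == old else r for r in replicas]

tablet = ["n1", "n2", "n3"]
as_list = replace_replica(tablet, "n2", "n4")
as_set = sorted(as_list)      # a CQL set column stores its elements sorted

print(as_list)   # ['n1', 'n4', 'n3'] -- n3 is still the third replica
print(as_set)    # ['n1', 'n3', 'n4'] -- n3 silently became the second replica
```

Persisting the replicas as a list keeps the first ordering across reloads, which is what base/view replica pairing for materialized views depends on.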
Nadav Har'El
18e8e62798 cql-pytest: translate Cassandra's tests for SELECT with LIMIT
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectLimitTest.java into our cql-pytest framework.

The tests reproduce two already-known bugs:

Refs #9879:  Using PER PARTITION LIMIT with aggregate functions should
             fail as Invalid query
Refs #10357: Spurious static row returned from query with filtering,
             despite not matching filter

And also helped discover two new issues:

Refs #15099: Incorrect sort order when combining IN, and ORDER BY
Refs #15109: PER PARTITION LIMIT should be rejected if SELECT DISTINCT
             is used

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #15114
2023-08-21 22:29:11 +03:00
Kefu Chai
63b32cbdb4 tasks: s/stoppping/stopping/
fix a typo

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15103
2023-08-21 22:28:38 +03:00
Eliran Sinvani
eb368f9f6e internal_keyspace extension: extend the semantics also to flushes
commit 7c8c020 introduced a new type of keyspace, an internal keyspace,
and defined its semantics. This keyspace is somewhat of a hybrid
between a system and a user keyspace.

Here we extend the semantics to also include flushes, meaning that
flushes will be done using the system dirty_memory_manager. This is
in order to allow interdependencies between internal tables and user
tables and to prevent deadlocks.

One example of such a deadlock is our `replicated_key_provider`
encryption in the enterprise version. The deadlock occurs because, in some
circumstances, an encrypted user table flush is dependent upon the
`encrypted_keys` table being flushed, but since the requests are
serialized, we get a deadlock.

Tests: unit tests dev + debug
The deadlock dtest reproducer:
encryption_at_rest_test.py::TestEncryptionAtRest::test_reboot

Fixes #14529

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #14547
2023-08-21 18:17:05 +03:00
Avi Kivity
ce43effc21 Merge "fix rebuild with consistent topology management" From Gleb Natapov
"
The series fixes bogus asserting during topology state load and add a
test that runs rebuild to make sure the code will not regress again.

Fixes #14958
"

* 'gleb/rebuilding_fix_v1' of github.com:scylladb/scylla-dev:
  test: add rebuild test
  system_keyspace: fix assertion for missing transition_state
2023-08-21 16:00:42 +03:00
Kefu Chai
8cc215db96 test: randomized_nemesis_test: do not brace around scalars
Clang and GCC's `-Wbraced-scalar-init` warning option warns
when it sees superfluous braces, like:
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2187:32: error: braces around scalar initializer [-Werror,-Wbraced-scalar-init]
            .snapshot_threshold{1},
                               ^~~

```
usually, this does not hurt, but by taking the braces out we get
a more readable piece of code and fewer warnings.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15086
2023-08-21 15:57:06 +03:00
Kefu Chai
9c24be05c3 sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large, beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of a bad configuration or a bug in scylla,
so let's add more information to the logging message to
help track down the source of the problem.
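The capping behavior can be sketched as follows (a minimal illustration, not scylla's actual writer code):

```python
INT32_MAX = 2**31 - 1  # the on-disk local_deletion_time field is 32-bit

def cap_ldt(ldt):
    # An ldt beyond INT32_MAX signals bad configuration or a bug, so it
    # is capped; the commit adds the sstable name and partition key to
    # the log message emitted here.
    if ldt > INT32_MAX:
        print(f"capping out-of-range local_deletion_time {ldt}")
        return INT32_MAX - 1
    return ldt

print(cap_ldt(1_700_000_000))  # sane epoch time, kept as-is
print(cap_ldt(2**33))          # abnormal value, capped to INT32_MAX - 1
```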

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-21 19:25:32 +08:00
Kefu Chai
0bc99c7f49 test: sstable_compaction_test: add a test for capped tombstone ldt
local_deletion_time (ldt for short) is a timestamp used for the
purpose of purging a tombstone after gc_grace_seconds. if its value
is greater than INT32_MAX, it is capped when being written to the sstable.
this is very likely a signal of a bad configuration or even a bug in
scylla. so we keep track of it with a metric named
"scylla_sstables_capped_tombstone_deletion_time".

in this change, a test is added to verify that the metric is updated
upon seeing a tombstone with this abnormal ldt.

because we validate the consistency before and after compaction in
tests, this change adds a parameter to disable this check; otherwise,
because capping the ldt changes the mutation, the validation would
fail the test.

Refs #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-21 19:25:32 +08:00
Aleksandra Martyniuk
e0ce711e4f compaction: do not swallow compaction_stopped_exception for reshape
The loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job,
but the exception is swallowed for every type of compaction
in compaction_manager::perform_task.

Rethrow the exception in perform_task for reshape compaction.

Fixes: #15058.

Closes #15067
2023-08-21 12:41:55 +03:00
Vlad Zolotarov
e13a2b687d scylla_raid_setup: make --online-discard argument useful
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.

Fixes #14963

Closes #14964
2023-08-21 12:21:23 +03:00
Anna Stuchlik
b5c4d13e36 doc: update the Seastar Perftune page
This commit updates the description of perftune.py.
It is based on the information in the reported issue (below),
the contents of help for perftune.py, and the input from
@vladzcloudius.

Fixes https://github.com/scylladb/scylladb/issues/14233

Closes #14879
2023-08-21 10:23:30 +03:00
Anna Stuchlik
57e86b05f1 doc: fix the outdated Networking section
Fixes https://github.com/scylladb/scylla-docs/issues/2467

This commit updates the Networking section. The scope is:
- Removing the outdated content, including the reference to
  the super outdated posix_net_conf.sh script.
- Adding the guidelines provided by @vladzcloudius.
- Adding the reference to the documentation for
  the perftune.py script.

Closes #14859
2023-08-21 10:17:37 +03:00
Petr Gusev
9176a3341a test_topology_smp: more logs for debug/aarch64
The test is flaky on CI in debug builds
on aarch64 (#14752), here we sprinkle more
logs for debug/aarch64 hoping it'll help to
debug it.

Ref #14752

Closes #14822
2023-08-21 10:03:09 +03:00
Kefu Chai
adfc139a74 tools/scylla-sstable: path::parent_path() when appropriate
in load_sstables(), `sst_path` is already an instance of `std::filesystem::path`,
so there is no need to cast it to `std::filesystem::path`. also,
`path.remove_filename()` returns something like
"system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/", with the
trailing slash kept. when we get a component's path in `sstable::filename`,
we always add a "/" between the `dir` and the filename, so this would
end up with two slashes in the path, like:

"/var/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//mc-2-big-Data.db"

so, in order to remove the duplicated slash, let's just use
`path.parent_path()` here.
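The doubled slash is easy to reproduce; here is a Python/pathlib analogy of the fix (the directory name below is just the example from above):

```python
from pathlib import PurePosixPath

# remove_filename() keeps the trailing slash, so "dir" + "/" + name
# yields a doubled slash:
dir_with_slash = "system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/"
bad = dir_with_slash + "/" + "mc-2-big-Data.db"
print(bad)  # ...aca33f//mc-2-big-Data.db

# The parent_path() analogue drops the trailing slash, so joining
# with a single "/" produces a clean path:
parent = str(PurePosixPath(dir_with_slash + "mc-2-big-Data.db").parent)
good = parent + "/" + "mc-2-big-Data.db"
print(good)  # ...aca33f/mc-2-big-Data.db
```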

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15035
2023-08-21 09:28:03 +03:00
Benny Halevy
6e416b8ff2 api: storage_service: improve description of /storage_service/host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-21 09:20:39 +03:00
Benny Halevy
44c14f3e2b token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case,
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.

Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-21 09:16:42 +03:00
Benny Halevy
0f54e24519 migration_notifier: get schema_ptr by value
To prevent a use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097
where a temporary schema_ptr retrieved from a global_schema_ptr
gets destroyed when the notification function yields.

Capturing the schema_ptr on the coroutine frame
is inexpensive since it's a shared pointer, and it makes sure
that the schema remains valid throughout the coroutine's
lifetime.

Fixes scylladb/scylladb#15097

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15098
2023-08-20 21:36:57 +03:00
David Garcia
e23d9cd7eb docs: Autogenerate db/config.cc docs
Update layout

docs: remove output param

docs: generate cc properties on build

docs: track cc file on change

rm: note dependency

docs: clean _data

Fixes #8424.

Closes #14973
2023-08-20 21:27:37 +03:00
Kefu Chai
1aa01d63d4 test: randomized_nemesis_test: mark direct_fd_{pinger,clock} final
`raft_server` in test/raft/randomized_nemesis_test.cc manages
instances of direct_fd_pinger and direct_fd_clock with unique_ptr<>.
this unique_ptr<> deletes these managed instances using delete.
but since these two classes have virtual methods and no virtual
destructors, the compiler warns when deleting them: in theory, these
pointers could be pointing at derived classes, and deleting them
through the base-class pointer could lead to a leak.

so, to silence the warning and to prevent potential issues, let's
just mark these two classes final.

this should address the warning like:

```
In file included from /home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:9:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/reactor.hh:24:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/aligned_buffer.hh:24:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'direct_fd_pinger<int>' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
        delete __ptr;
        ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<direct_fd_pinger<int>>::operator()' requested here
          get_deleter()(std::move(__ptr));
          ^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1400:5: note: in instantiation of member function 'std::unique_ptr<direct_fd_pinger<int>>::~unique_ptr' requested here
    ~raft_server() {
    ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: note: in instantiation of member function 'raft_server<ExReg>::~raft_server' requested here
        delete __ptr;
        ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<raft_server<ExReg>>::operator()' requested here
          get_deleter()(std::move(__ptr));
          ^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1704:24: note: in instantiation of member function 'std::unique_ptr<raft_server<ExReg>>::~unique_ptr' requested here
            ._server = nullptr,
                       ^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1742:19: note: in instantiation of member function 'environment<ExReg>::new_node' requested here
        auto id = new_node(first, std::move(cfg));
                  ^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2113:39: note: in instantiation of member function 'environment<ExReg>::new_server' requested here
        auto leader_id = co_await env.new_server(true);
                                      ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15084
2023-08-20 21:26:08 +03:00
Avi Kivity
4db5d8dd56 Merge 'build: cmake: support Coverage and Sanitize build modes' from Kefu Chai
to mirror the build modes supported by `configure.py`.

Closes #15085

* github.com:scylladb/scylladb:
  build: cmake: support Coverage and Sanitize build modes
  build: cmake: error out if specified build type is unknown
2023-08-20 21:25:21 +03:00
Pavel Emelyanov
6bc30f1944 system_keyspace: De-bloat .setup() from messing with system.local
On boot several manipulations with system.local are performed.

1. The host_id value is selected from it with key = local

   If not found, system_keyspace generates a new host_id, inserts the
   new value into the table and returns back

2. The cluster_name is selected from it with key = local

   Then it's system_keyspace that either checks that the name matches
   the one from db::config, or inserts the db::config value into the
   table

3. The row with key = local is updated with various info like versions,
   listen, rpc and bcast addresses, dc, rack, etc. Unconditionally

All three steps are scattered over main, p.1 is called directly, p.2 and
p.3 are executed via system_keyspace::setup() that happens rather late.
The cql_test_env startup code also touches this table.

The proposal is to collect this setup into one place and execute it
early -- as soon as the system.local table is populated. This frees the
system_keyspace code from the logic of selecting the host id and cluster
name, leaving that to main, and keeps it doing only select/insert work.

refs: #2795

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15082
2023-08-20 21:24:31 +03:00
Tomasz Grabiec
1552044615 storage_service, tablets: Fix corrupting tablet metadata on migration concurrent with table drop
Tablet migration may execute a global token metadata barrier before
executing updates of system.tablets. If the table is dropped while the
barrier is happening, the updates will bring back rows for migrated
tablets in a table which is no longer there. This will cause tablet
metadata loading to fail with error:

 missing_column (missing column: tablet_count)

Like in this log line:

storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")

The fix is to read and execute the updates in a single group0 guard
scope, and move execution of the barrier later. We can no longer generate
updates in the same handle_tablet_migration() step if the barrier needs to
be executed, so we reuse the mechanism for two-step stage transitions
which we already have for the handling of streaming. The next pass will
notice that the barrier is not needed for a given tablet and will
generate the stage update.

Fixes #15061

Closes #15069
2023-08-20 21:17:57 +03:00
Avi Kivity
a4e7f9bed0 docs: cql: split DML page into one page per statement
The DML page is quite long (21 screenfuls on my monitor); split
it into one page per statement to make it more digestible.

The sections that are common to multiple statement are kept
in the main DML page, and references to them are added.

Closes #15053
2023-08-20 17:14:32 +03:00
Kefu Chai
12d6ec5a18 config: respect --log-with-color 1
scylladb overrides some of seastar's logging-related options with its
own options by applying them with `logging::apply_settings()`. but
we fail to inherit `with_color` from Seastar, as we are using a
designated initializer, so the unspecified members are zero-initialized.
that's why logging messages are always in black and white even
if scylla is running in a tty and `--log-with-color 1` is specified.

so, to make the debugging life more colorful, let's inherit the option
from Seastar and apply it when setting logging-related options.
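The pitfall generalizes: building a settings object from scratch silently drops every field you did not mention. A Python analogy of the bug and the fix (`LoggingSettings` is a hypothetical type, not Seastar's actual one):

```python
from dataclasses import dataclass, replace

@dataclass
class LoggingSettings:
    default_level: str = "info"
    with_color: bool = False

# Settings inherited from the command line, e.g. --log-with-color 1.
inherited = LoggingSettings(with_color=True)

# Bug analogue: constructing fresh settings leaves with_color at its
# default value, losing the inherited one.
overridden_badly = LoggingSettings(default_level="debug")
print(overridden_badly.with_color)  # False

# Fix analogue: start from the inherited settings and override only
# the fields we actually want to change.
overridden_well = replace(inherited, default_level="debug")
print(overridden_well.with_color)  # True
```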

see also 29e09a3292

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15076
2023-08-20 13:47:43 +03:00
Tomasz Grabiec
bd8bb5d4b1 Merge 'Wire tablet into compaction group' from Raphael "Raph" Carvalho
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.

There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.

```
INFO  2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO  2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```

Closes #14863

* github.com:scylladb/scylladb:
  Kill scylla option to configure number of compaction groups
  replica: Wire tablet into compaction group
  token_metadata: Add this_host_id to topology config
  replica: Switch to chunked_vector for storing compaction groups
  replica: Generate group_id for compaction_group on demand
2023-08-18 15:17:17 +02:00
Kefu Chai
9fa0b9b75b build: cmake: support Coverage and Sanitize build modes
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-18 14:17:12 +08:00
Kefu Chai
3c3fb03b01 build: cmake: error out if specified build type is unknown
this should help the developer to understand what build types are
supported if the specified one is unknown.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-18 14:17:12 +08:00
Avi Kivity
1901475598 Merge 'config: mark "experimental" option unused and cleanups' from Kefu Chai
in this series, the "experimental" option is marked `Unused`, as it has been deprecated for almost 2 years, since scylla 4.6; `experimental_features` is used instead to specify the enabled experimental features explicitly.

Closes #14948

* github.com:scylladb/scylladb:
  config: remove unused namespace alias
  config: use std::ranges when appropriate
  config: drop "experimental" option
  test: disable 'enable_user_defined_functions' if experimental_features does not include udf
  test: pylib: specify experimental_features explicitly
2023-08-17 20:42:02 +03:00
Kefu Chai
7275b8967c docs: add sstablemetadata to operating-scylla/admin-tools
to note that sstablemetadata is being deprecated and encourage
users to switch over to the native tools.

Fixes #15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15040
2023-08-17 18:48:46 +03:00
Avi Kivity
e91256a621 Merge 'build: cmake: fix the build of rpm/deb from submodules' from Kefu Chai
in this series, the build of rpm and deb from submodules is fixed:

1. correct the path of reloc package
2. add the dependency of reloc package to deb/rpm build targets

Closes #15062

* github.com:scylladb/scylladb:
  build: cmake: correct reloc_pkg's path
  build: cmake: build rpm/deb from reloc_pkg
2023-08-17 17:58:49 +03:00
Pavel Emelyanov
3ed5b00ba2 Merge 's3/client: generate config file for tests and cleanups' from Kefu Chai
before this change, object_store/test_basic.py creates a config file
for specifying the object storage settings, and passes the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.

to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.

Closes #15064

* github.com:scylladb/scylladb:
  s3/client: avoid hardwiring env variables names
  s3/client: generate config file for tests
2023-08-17 16:39:23 +03:00
Gleb Natapov
4ffc39d885 cql3: Extend the scope of group0_guard during DDL statement execution
Currently we hold group0_guard only during DDL statement's execute()
function, but unfortunately some statements access underlying schema
state also during check_access() and validate() calls which are called
by the query_processor before it calls execute. We need to cover those
calls with group0_guard as well, and also move the retry loop up. This patch
does it by introducing a new function, take_guard(), to the cql_statement class.
Schema altering statements return group0 guard while others do not
return any guard. Query processor takes this guard at the beginning of a
statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to the execute in query_state structure.

Fixes: #13942

Message-ID: <ZNsynXayKim2XAFr@scylladb.com>
2023-08-17 15:52:48 +03:00
Kefu Chai
6788903fd6 db: config: mark config class final
in 34c3688017, we added a virtual function
to `config_file`, and we new and delete a pointer to a
`db::config` instance via `unique_ptr<>`. this makes the compiler
nervous, as deleting a pointer to an instance of a non-final
class with virtual functions could lead to a leak, if the pointer actually
points to a derived class of that non-final class. so, in order to
silence the warning and to prevent potential problems in the future, let's
mark `db::config` final.

the warning from Clang 16 looks like:

```
In file included from /home/kefu/dev/scylladb/test/lib/test_services.cc:10:
In file included from /home/kefu/dev/scylladb/test/lib/test_services.hh:25:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'db::config' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
        delete __ptr;
        ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<db::config>::operator()' requested here
          get_deleter()(std::move(__ptr));
          ^
/home/kefu/dev/scylladb/test/lib/test_services.cc:189:16: note: in instantiation of member function 'std::unique_ptr<db::config>::~unique_ptr' requested here
    auto cfg = std::make_unique<db::config>();
               ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15071
2023-08-17 13:43:16 +03:00
Kefu Chai
fc6b8d4040 s3/client: avoid hardwiring env variables names
instead of hardwiring the names in multiple places, let's just
keep them in a single place as variables, and reference them by
these variables instead of their values.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-17 16:06:55 +08:00
Kefu Chai
ec7fa3628c s3/client: generate config file for tests
before this change, object_store/test_basic.py creates a config file
for specifying the object storage settings, and passes the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.

to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-17 16:06:55 +08:00
Raphael S. Carvalho
b578d6643f Kill scylla option to configure number of compaction groups
The option was introduced to bootstrap the project. It's still
useful for testing, but that translates into maintaining an
additional option and code that will not be really used
outside of testing. A possible option is to later map the
option in boost tests to initial_tablets, which may yield
the same effect for testing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-16 18:23:53 -03:00
Raphael S. Carvalho
cc60598368 replica: Wire tablet into compaction group
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.

There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.

INFO  2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO  2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf

There's a need for compaction_group_manager, as table will still support
"tabletless" mode, and we don't want to sprinkle ifs here and there,
to support both modes. It's not really a manager (it's not even supposed
to store state), but I couldn't find a better name.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-16 18:23:53 -03:00
Raphael S. Carvalho
5d1f60439a token_metadata: Add this_host_id to topology config
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as the raft topology is pulled later
than new tables are registered and created, and this node is added
to the topology even later.

To allow compaction-group creation to retrieve "my id" from
token metadata early, initialization will now feed the local id
into the topology config, which is immutable for each node anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-16 18:23:44 -03:00
Piotr Smaroń
34c3688017 db: config: add live_updatable_config_params_changeable_via_cql option
If `live_updatable_config_params_changeable_via_cql` is set to true, configuration parameters defined with the `liveness::LiveUpdate` option can be updated at runtime with CQL, i.e. by updating the `system.config` virtual table.
If we don't want any configuration parameter to be changed at
runtime by updating the `system.config` virtual table, this option should be
set to false, e.g. for cloud users,
who can only perform CQL queries and should not be able to change
scylla's configuration on the fly.

The current implementation is generic, but has a small drawback: messages
returned to the user may not be fully accurate. Consider:
```
cqlsh> UPDATE system.config SET value='2' WHERE name='task_ttl_in_seconds';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="option is not live-updateable" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
```
where `task_ttl_in_seconds` has been defined with
`liveness::LiveUpdate`, but because `live_updatable_config_params_changeable_via_cql` is set to
`false` in `scylla.yaml`, `task_ttl_in_seconds` cannot be modified at
runtime by updating the `system.config` virtual table.
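The resulting two-level gate can be sketched like this (hypothetical class and field names; in scylla the real check lives in the virtual-table write path):

```python
class Config:
    def __init__(self, cql_updates_enabled):
        # Global switch: live_updatable_config_params_changeable_via_cql.
        self.cql_updates_enabled = cql_updates_enabled
        # "live" marks parameters declared with liveness::LiveUpdate.
        self.params = {"task_ttl_in_seconds": {"live": True, "value": "1"}}

    def update_via_cql(self, name, value):
        param = self.params[name]
        # Both gates must be open; the error text is the (somewhat
        # inaccurate) one the user actually sees.
        if not (param["live"] and self.cql_updates_enabled):
            raise PermissionError("option is not live-updateable")
        param["value"] = value

locked = Config(cql_updates_enabled=False)
try:
    locked.update_via_cql("task_ttl_in_seconds", "2")
except PermissionError as e:
    print(e)  # option is not live-updateable

unlocked = Config(cql_updates_enabled=True)
unlocked.update_via_cql("task_ttl_in_seconds", "2")
print(unlocked.params["task_ttl_in_seconds"]["value"])  # 2
```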

Fixes #14355

Closes #14382
2023-08-16 17:56:27 +03:00
Aleksandra Martyniuk
e9d94894f1 compaction: release resources of compaction executors
Before compaction task executors started inheriting from
compaction_task_impl, they were destructed immediately after
compaction finished. Destructors of executors and their
fields performed actions that affected global structures and
statistics and had impact on compaction process.

Currently, task executors are kept in memory much longer, as their
are tracked by task manager. Thus, destructors are not called just
after the compaction, which results in compaction stats not being
updated, which causes e.g. infinite cleanup loop.

Add release_resources() method which is called at the end
of compaction process and does what destructors used to.

Fixes: #14966.
Fixes: #15030.

Closes #15005
2023-08-16 15:51:17 +03:00
Kefu Chai
564522c4a8 s3/test: remove tempdir if log does not exist
we should have used `ignore_errors=True` to ignore
the error. this issue has not popped up because
we haven't run into the case where the log file
does not exist.

this was a regression introduced by
d4ee84ee1e

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15063
2023-08-16 15:11:00 +03:00
Kefu Chai
32c26624bf build: cmake: correct reloc_pkg's path
before this change, the filename in the path of the reloc package looks like:
tools-scylla-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
but it should have been:
scylla-tools-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
so, when repackaging the reloc tarball into rpm / deb, the scripts
just fail to find the reloc tarball.

after this change, the filename is corrected to match with the one
generated using `build_reloc.sh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-16 16:15:23 +08:00
Kefu Chai
a19c7fa8d5 build: cmake: build rpm/deb from reloc_pkg
before this change, dist-${name}-rpm and dist-${name}-deb targets
do not depend on the corresponding reloc pkg from which these
prebuilt packages are created. so these scripts fail if the reloc
package does not exist.

to address this problem, the reloc package is added as a
dependency of these targets.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-16 16:15:23 +08:00
Avi Kivity
e8f3b073c3 Merge 'Maintain sstable state explicitly' from Pavel Emelyanov
An sstable can be in one of several states -- normal, quarantined, staging, uploading. Right now this "state" is hard-wired into the sstable's path, e.g. a quarantined sstable would sit in a directory like /var/lib/data/ks-cf-012345/quarantine/. Correspondingly, there's a bunch of directory-name constexprs in sstables.hh defining each "state". Other than being confusing, this approach doesn't work well with the S3 backend. Additionally, there's the snapshot subdir that adds to the confusion, because a snapshot is not quite a state.

This PR converts the "state" from constexpr char* directory names into an enum class and patches the sstable creation, opening and state-changing API to use that enum instead of parsing the path.

refs: #13017
refs: #12707

Closes #14152

* github.com:scylladb/scylladb:
  sstable/storage: Make filesystem storage with initial state
  sstable: Maintain state
  sstable: Make .change_state() accept state, not directory string
  sstable: Construct it with state
  sstables_manager: Remove state-less make_sstable()
  table: Make sstables with required state
  test: Make sstables with upload state in some cases
  tools: Make sstables with normal state
  table: Open-code sstables making streaming helpers
  tests: Make sstables with normal state by default
  sstable_directory: Make sstable with required state
  sstable_directory: Construct with state
  distributed_loader: Make sstable with desired state when populating
  distributed_loader: Make sstable with upload state when uploading
  sstable: Introduce state enum
  sstable_directory: Merge verify and g.c. calls
  distributed_loader: Merge verify and gc invocations
  sstable/filesystem: Put underscores to dir members
  sstable/s3: Mark make_s3_object_name() const
  sstable: Remove filename(dir, ...) method
2023-08-15 17:44:06 +03:00
Avi Kivity
5949623e0d Merge 'sstable_set: maintain bytes on disk' from Benny Halevy
and use that in compaction_group, rather than
respective accumulators of its own.

This is part of of larger series to make cache updates exception safe.

Refs #14043

Closes #15052

* github.com:scylladb/scylladb:
  sstable_set: maintain total bytes_on_disk
  sstable_set: insert, erase: return status
2023-08-15 17:32:12 +03:00
Kefu Chai
64ed0127d7 s3/client: retry if minio server fails to start
there is a small time window after we find a free port and before
the minio server listens on that port. if another server sneaks
into that window and listens on the port, the minio server can
still fail to start even though there might be a free port for it.

so, in this change, we just retry with a random port for a fixed
number of times until the minio server is able to serve.
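The retry loop described above can be sketched roughly like so (Python, with a socket bind standing in for the actual minio startup; all names here are hypothetical):

```python
import random
import socket

def pick_port() -> int:
    return random.randint(9000, 9999)

def try_start_server(port: int) -> bool:
    # Stand-in for launching minio: binding the port fails if another
    # process grabbed it between picking the port and starting the server.
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", port))
        s.close()
        return True
    except OSError:
        return False

def start_with_retries(attempts: int = 5) -> int:
    # Retry with a fresh random port a fixed number of times,
    # mirroring the approach the commit describes.
    for _ in range(attempts):
        port = pick_port()
        if try_start_server(port):
            return port
    raise RuntimeError("could not find a free port")

print(start_with_retries())
```

Retrying with a fresh random port each time shrinks the race window instead of eliminating it, which is why a bounded retry count is still needed.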

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15042
2023-08-15 16:17:47 +03:00
Raphael S. Carvalho
d3f71ae4ee replica: Switch to chunked_vector for storing compaction groups
We aim for a large number of tablets, therefore let's switch
to chunked_vector to avoid large contiguous allocs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-15 09:04:05 -03:00
Raphael S. Carvalho
2590eec352 replica: Generate group_id for compaction_group on demand
There are a few good reasons for this change.
1) compaction_group doesn't have to be aware of # of groups
2) thinking forward to dynamic tablets, # of groups cannot be
statically embedded in group id, otherwise it gets stale.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-15 09:04:05 -03:00
Raphael S. Carvalho
9400b79658 gce_snitch: Fix use-after-move in load_config()
The use-after-move is not very harmful as it's only used when
handling an exception, so the user would be left with a bogus message.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15054
2023-08-15 10:23:57 +03:00
Kefu Chai
c82f1d2f57 tools/scylla-sstable: dump column_desc as an object
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:

```
        "regular_columns": [
          "name",
          "clustering_order",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "column_name_bytes",
          "type_name",
          "org.apache.cassandra.db.marshal.BytesType",
          "name",
          "kind",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "position",
          "type_name",
          "org.apache.cassandra.db.marshal.Int32Type",
          "name",
          "type",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type"
        ]
```

but according to
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,

> $SERIALIZATION_HEADER_METADATA := {
>     "min_timestamp_base": Uint64,
>     "min_local_deletion_time_base": Uint64,
>     "min_ttl_base": Uint64",
>     "pk_type_name": String,
>     "clustering_key_types_names": [String, ...],
>     "static_columns": [$COLUMN_DESC, ...],
>     "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
>     "name": String,
>     "type_name": String
> }

"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columns". this schema makes sense,
as each column should be considered a single object which
is composed of two properties. but we dump them as a flat list.

so, in this change, we guard each visit() call of `json_dumper()`
with a `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.
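In plain JSON terms, the change turns the flat key/value stream into one object per column. A Python illustration of the before/after shapes (this is an analogy, not the actual rapidjson-based dumper):

```python
import json

columns = [
    ("clustering_order", "org.apache.cassandra.db.marshal.UTF8Type"),
    ("position", "org.apache.cassandra.db.marshal.Int32Type"),
]

# Before: each visit() emitted bare key/value strings, flattening
# every column into one long list.
flat = [item for name, type_name in columns
        for item in ("name", name, "type_name", type_name)]

# After: each visit() is wrapped in StartObject()/EndObject(), so a
# column becomes one object with two properties.
objects = [{"name": name, "type_name": type_name}
           for name, type_name in columns]

print(json.dumps({"regular_columns": objects}, indent=2))
```

The "after" shape is exactly the `$COLUMN_DESC` schema quoted above: a list of `{"name": ..., "type_name": ...}` objects.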

after the change, "regular_columns" are printed like:
```
        "regular_columns": [
          {
            "name": "clustering_order",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "column_name_bytes",
            "type_name": "org.apache.cassandra.db.marshal.BytesType"
          },
          {
            "name": "kind",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "position",
            "type_name": "org.apache.cassandra.db.marshal.Int32Type"
          },
          {
            "name": "type",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          }
        ]
```

Fixes #15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15037
2023-08-15 08:22:51 +03:00
Avi Kivity
d57a951d48 Revert "cql3: Extend the scope of group0_guard during DDL statement execution"
This reverts commit 70b5360a73. It generates
a failure in group0_test.test_concurrent_group0_modifications in debug
mode with about 4% probability.

Fixes #15050
2023-08-15 00:26:45 +03:00
Patryk Jędrzejczak
e7077da12d replica: reduce the size limit of the schema commitlog
The size of the schema commitlog is incorrectly set to 10 TB. To
avoid wasting space, we reduce it to 2 * schema commitlog segment
size.

Closes #14946
2023-08-14 20:41:15 +02:00
Benny Halevy
f54ab48273 sstable_set: maintain total bytes_on_disk
and use that in compaction_group, rather than
maintaining its own accumulators.

bytes_on_disk is implemented by each sstable_set_impl
and is updated on insert and erase (whether directly
on the sstable_set_impl or via the sstable_set).

Although compound_sstable_set doesn't implement
insert and erase, it overrides `bytes_on_disk()` to return
the sum of all the underlying `sstable_set::bytes_on_disk()`.

Also, added respective unit tests for `partitioned_sstable_set`
and `time_series_sstable_set`, that test each type's
bytes_on_disk, including cloning of the set, and the
`compound_sstable_set` bytes_on_disk semantics.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-14 21:07:27 +03:00
Benny Halevy
9f77a32805 compaction_manager: run_offstrategy_compaction: retrieve owned_ranges from compaction_state
perform_offstrategy is called from try_perform_cleanup
when there are sstables in the maintenance set that require
cleanup.

The input sstables are inserted into the compaction_state
`sstables_requiring_cleanup` and `try_perform_cleanup`
expects offstrategy compaction to clean them up along
with reshape compaction.

Otherwise, the maintenance sstables that require cleanup
are not cleaned up by cleanup compaction, since
the reshape output sstable(s) are not analyzed again
after reshape compaction, which would insert
the output sstable(s) into `sstables_requiring_cleanup`
and trigger their cleanup in the subsequent cleanup compaction.

The latter method is viable too, but it is less efficient
since we can do reshape+cleanup in one pass, vs.
reshape first and cleanup later.

Fixes scylladb/scylladb#15041

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15043
2023-08-14 18:37:34 +03:00
Avi Kivity
1937a5c1dd docs: cql: document the relative priority of SELECT clauses
Document how SELECT clauses are considered. For example, given the query

    SELECT * FROM tab WHERE a = 3 LIMIT 1

We'll get different results if we first apply the WHERE clause and then
LIMIT the result set, or if we first LIMIT the result set and then apply
the WHERE clause.
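The ordering matters because filtering and truncation don't commute. A plain Python analogy of the query above (hypothetical rows; CQL applies WHERE before LIMIT):

```python
rows = [{"a": 1}, {"a": 3}, {"a": 3}]

# WHERE first, then LIMIT (what CQL does): filter the whole set,
# then cap the result.
where_then_limit = [r for r in rows if r["a"] == 3][:1]

# LIMIT first, then WHERE: cap first, then filter what's left over.
limit_then_where = [r for r in rows[:1] if r["a"] == 3]

print(where_then_limit)  # -> [{'a': 3}]
print(limit_then_where)  # -> []
```

With WHERE applied first the query returns a matching row; with LIMIT applied first it may return nothing, which is exactly the ambiguity the doc change resolves.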

Closes #14990
2023-08-14 17:40:37 +03:00
Benny Halevy
2dc9ef17be sstable_set: insert, erase: return status
To be used for maintaining disk_space_used
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-14 17:10:39 +03:00
Aleksandra Martyniuk
7a28cc60ec compaction: ignore future explicitly
discard_result ignores only successful futures. Thus, if the
perform_compaction<regular_compaction_task_executor> call fails,
the failure is considered abandoned, causing tests to fail.

Explicitly ignore the failed future.
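The distinction can be mirrored with asyncio (an analogy only -- the patch touches Seastar futures, not asyncio): a dropped failed task is "abandoned", while explicitly awaiting and swallowing the exception ignores the failure deliberately.

```python
import asyncio

async def compaction():
    # Stand-in for a perform_compaction() call that fails.
    raise RuntimeError("compaction failed")

async def main() -> str:
    task = asyncio.ensure_future(compaction())
    # Merely dropping `task` would leave the failure abandoned (asyncio
    # logs "exception was never retrieved", much like the test failures
    # described above). Instead, explicitly consume the failure:
    try:
        await task
    except RuntimeError:
        return "failure explicitly ignored"
    return "unexpected success"

print(asyncio.run(main()))  # -> failure explicitly ignored
```

The fix is the same shape in Seastar: consume the failed future explicitly rather than discarding only the success path.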

Fixes: #14971.

Closes #15000
2023-08-14 16:41:15 +03:00
Pavel Emelyanov
296eb61432 sstable/storage: Make filesystem storage with initial state
The filesystem storage driver uses different paths depending on sstable
state. It's possible to keep only the table directory _and_ the state
and construct this path on demand when needed, but it's faster to keep
the full path onboard. All the more so since it's only exported outside
via the .prefix() call, which is for logs only, but still

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:40:44 +03:00
Pavel Emelyanov
5c39d61b62 sstable: Maintain state
This means -- keep state on sstable, change it with change_state() call
and (!) fix the is_<state>() helpers not to check storage->prefix()

nit: mark requires_view_building() noexcept while at it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:40:44 +03:00
Pavel Emelyanov
b06917f235 sstable: Make .change_state() accept state, not directory string
Pretty cosmetic change, but it will allow S3 to finally support moving
sstables between states (after this patch it still doesn't)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:40:44 +03:00
Pavel Emelyanov
ef25352412 sstable: Construct it with state
This just moves make_path() call from outside of sstable::sstable()
inside it. Later it will be moved even further. Also, now sstable can
know its state and keep it (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:28:54 +03:00
Pavel Emelyanov
1f247f0b05 sstables_manager: Remove state-less make_sstable()
Now all callers specify the state they want their sstables in explicitly
and the old API can be removed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:28:54 +03:00
Pavel Emelyanov
fdfec474ae table: Make sstables with required state
By default it's created with normal state, but there are some places
that need to put it into staging. Do it with new state enum

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:28:54 +03:00
Pavel Emelyanov
855f2b4b86 test: Make sstables with upload state in some cases
As was mentioned in the previous patch, there are a few places in tests
that put sstables in the upload/ subdir and really mean it. Those need
to use the sstables manager/directory API directly (already) and specify
the state explicitly (this patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 15:28:54 +03:00
Pavel Emelyanov
9e752ca6ab tools: Make sstables with normal state
Just like tests, tools open sstables by their full path and don't make any
assumptions about sstable state

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:56:02 +03:00
Pavel Emelyanov
6628dc47c5 table: Open-code sstables making streaming helpers
There are two of those that call each other and end up calling the plain
make_sstable() one. It's simpler to patch both if they just call the
latter directly.

While at it -- drop the unused default argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:56:02 +03:00
Pavel Emelyanov
734c0820df tests: Make sstables with normal state by default
It's assumed that tests are not very specific about which
subdirectory an sstable is in, so they can use the normal state. Places
that need to move sstables between states will use the sstable manager
API explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:56:02 +03:00
Pavel Emelyanov
e7bbdbcef0 sstable_directory: Make sstable with required state
The state is on sstable_directory, so we can switch to using the new
manager API. The full path is still there and will be dropped later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:56:02 +03:00
Pavel Emelyanov
c0b922a8af sstable_directory: Construct with state
This is to replace the full path sitting on this object eventually. For
now they have to co-exist, but the state will be used to make_sstable()-s
from the manager with its new API

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:56:01 +03:00
Pavel Emelyanov
6fc62c2d9f distributed_loader: Make sstable with desired state when populating
This still needs to convert state to directory name internally, as
sstable_directory instances are hashed on the populator by subdir string.
Also the full string path is printed in logs. All this is now internal
to the populate method and will be fixed later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:45:52 +03:00
Pavel Emelyanov
b0064f5c55 distributed_loader: Make sstable with upload state when uploading
Just make use of the new shiny sstables_manager API

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:45:52 +03:00
Pavel Emelyanov
249a6a4d27 sstable: Introduce state enum
There are several states between which an sstable can migrate. Nowadays
the state is encoded into the sstable directory, which is not nice. Also
S3-backed sstables don't support states, only keeping sstables in "normal".

This patch adds enum state in order to replace the path-encoded one
eventually. The new sstables_manager::make_sstable() method is added
that accepts table directory (without quarantine/ or staging/ component)
and the desired initial state (optional). Next patches will make use of
this maker and the existing one will be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:45:52 +03:00
Pavel Emelyanov
c257ad90e1 sstable_directory: Merge verify and g.c. calls
Name it .prepare() and remove the sstable_directory() public method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:45:51 +03:00
Pavel Emelyanov
07d4672054 distributed_loader: Merge verify and gc invocations
Both are launched on shard-0, no need to invoke_on two times

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:41:48 +03:00
Pavel Emelyanov
e955186765 sstable/filesystem: Put underscores to dir members
They are private class fields, must be _-prefixed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:00:50 +03:00
Pavel Emelyanov
84b318228a sstable/s3: Mark make_s3_object_name() const
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:00:50 +03:00
Pavel Emelyanov
f15246f5ef sstable: Remove filename(dir, ...) method
It's only used by the fs storage driver, which can do dir/file
concatenation on its own. Moreover, this method is not welcome to be
used even internally

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-14 14:00:50 +03:00
Pavel Emelyanov
116d08d10f distributed_loader: Mark tables for write in populator
Both, init_system_keyspace() and init_non_system_keyspaces() populate
the keyspaces with the help of distributed_loader::populate_keyspace().
That method, in turn, walks the list of keyspaces' tables to load
sstables from disk and attach to them.

After that, both init_...-s take a 2nd pass over keyspaces' tables to
call table::mark_ready_for_writes() on each. This marking can be
moved into populate_keyspace(); that's much easier and shorter because
that method already has the shard-wide table pointer and can just call
whatever it needs on the table.

This changes the initialization sequence: before the patch, all tables
were populated before any of them was marked as ready for write. This
looks safe, however, as marking a table for write means resetting its
generation generator, and different tables' generators are independent
from each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #15026
2023-08-14 13:22:41 +03:00
Aleksandra Martyniuk
9ec43fd3a7 compaction: update comment in compaction_manager::submit
Closes #15023
2023-08-14 09:34:56 +03:00
Avi Kivity
b120d35c58 Merge 'Relax cql_test_env services maintenance' from Pavel Emelyanov
To add a sharded service to the cql_test_env one needs to patch it in 5 or 6 places

- add cql_test_env reference
- add cql_test_env constructor argument
- initialize the reference in initializer list
- add service variable to do_with method
- pass the variable to cql_test_env constructor
- (optionally) export it via cql_test_env public method

Steps 1 through 5 are annoying; things get much simpler if the process looks like

- add cql_test_env variable
- (optionally) export it via cql_test_env public method

This is what this PR does

refs: #2795

Closes #15028

* github.com:scylladb/scylladb:
  cql_test_env: Drop local *this reference
  cql_test_env: Drop local references
  cql_test_env: Move most of the stuff in run_in_thread()
  cql_test_env: Open-code env start/stop and remove both
  cql_test_env: Keep other services as class variables
  cql_test_env: Keep services as class variables
  cql_test_env: Construct env early
  cql_test_env: De-static fdpinger variable
  cql_test_env: Define all services' variables early
  cql_test_env: Keep group0_client pointer
2023-08-13 20:24:52 +03:00
Benny Halevy
8fbcf1ab9f view: start: ignore also abort_requested_exception
We see the abort_requested_exception error from time
to time, instead of sleep_aborted that was expected
and quietly ignored (in debug log level).

Treat abort_requested_exception the same way since
the error is expected on shutdown and to reduce
test flakiness, as seen for example, in
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3033/artifact/logs-full.release.010/1691896356104_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_schema/node2.log
```
INFO  2023-08-13 03:12:29,151 [shard 0] compaction_manager - Asked to stop
WARN  2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO  2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Finished main loop
WARN  2023-08-13 03:12:29,152 [shard 0] cdc - Aborted update CDC description table with generation (2023/08/13 03:12:17, d74aad4b-6d30-4f22-947b-282a6e7c9892)
INFO  2023-08-13 03:12:29,152 [shard 1] compaction_manager - Asked to stop
INFO  2023-08-13 03:12:29,152 [shard 1] compaction_manager - Stopped
INFO  2023-08-13 03:12:29,153 [shard 0] init - Signal received; shutting down
INFO  2023-08-13 03:12:29,153 [shard 0] init - Shutting down view builder ops
INFO  2023-08-13 03:12:29,153 [shard 0] view - Draining view builder
INFO  2023-08-13 03:12:29,153 [shard 1] view - Draining view builder
INFO  2023-08-13 03:12:29,153 [shard 0] compaction_manager - Stopped
ERROR 2023-08-13 03:12:29,153 [shard 0] view - start failed: seastar::abort_requested_exception (abort requested)
ERROR 2023-08-13 03:12:29,153 [shard 1] view - start failed: seastar::abort_requested_exception (abort requested)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15029
2023-08-13 18:39:09 +03:00
Gleb Natapov
70b5360a73 cql3: Extend the scope of group0_guard during DDL statement execution
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during check_access() and validate() calls, which are called
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well and also move the retry loop up. This patch
does it by introducing a new function on the cql_statement class, take_guard().
Schema altering statements return group0 guard while others do not
return any guard. Query processor takes this guard at the beginning of a
statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to the execute in query_state structure.

Fixes: #13942

Message-ID: <ZNSWF/cHuvcd+g1t@scylladb.com>
2023-08-13 14:19:39 +03:00
Patryk Jędrzejczak
2e2271f639 raft: make a replaced node a non-voter early
We make a replaced node a non-voter early, just as a removed node
in 377f87c91a.

Closes #15022
2023-08-12 22:03:46 +02:00
Avi Kivity
0cd0be6275 Merge 'Remove some stale sugar from cql_test_env' from Pavel Emelyanov
There are some asserting checks for keyspace and table existence on cql_test_env that perform some one-line work in a complex manner; tests can do better on their own. Removing them makes cql_test_env simpler

refs: #2795

Closes #15027

* github.com:scylladb/scylladb:
  test: Remove require_..._exists from cql_test_env
  test: Open-code ks.cf name parse into cdc_test
  test: Don't use require_table_exists() in test/lib/random_schema
  test: Use BOOST_REQUIRE(!db.has_schema())
  test: Use BOOST_REQUIRE(db.has_schema())
  test: Use BOOST_REQUIRE(db.has_keyspace())
  test: Threadify cql_query_test::test_compact_storage case
  test: Threadify some cql_query_test cases
2023-08-12 22:32:47 +03:00
Pavel Emelyanov
64ddc9e4b4 cql_test_env: Drop local *this reference
The auto& env = *this is also now excessive, so drop it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:30:34 +03:00
Pavel Emelyanov
de679d7c36 cql_test_env: Drop local references
The local auto& foo = env._foo references in run_in_thread() are no longer
needed; the code that uses foo can be switched to use _foo (this->_foo)
instead

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:29:42 +03:00
Pavel Emelyanov
487ecae517 cql_test_env: Move most of the stuff in run_in_thread()
The do_with() method is static and cannot just access the cql_test_env
variable's fields, so it uses local references instead. To simplify this,
most of the method's content is moved to the non-static run_in_thread()
method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:28:40 +03:00
Pavel Emelyanov
2c175660f2 cql_test_env: Open-code env start/stop and remove both
These two just make more churn in the next patch, so drop both

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:28:03 +03:00
Pavel Emelyanov
10f9292fe8 cql_test_env: Keep other services as class variables
There are more services on do_with() stack that are not referenced from
the cql_test_env. Move them to be class variables too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:27:19 +03:00
Pavel Emelyanov
08a3be3b17 cql_test_env: Keep services as class variables
Now they are duplicated -- the variables exist on the do_with() stack and
the class references some of them. This patch makes it vice-versa -- all
the variables are on the cql_test_env and do_with() references them. The
latter will change soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:26:21 +03:00
Pavel Emelyanov
b31d2097b8 cql_test_env: Construct env early
Its constructor is _just_ assigning references and setting up rlimits.
Both can happen early

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:25:49 +03:00
Pavel Emelyanov
49d4760655 cql_test_env: De-static fdpinger variable
So that it could be moved onto cql_test_env as a class member

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:25:25 +03:00
Pavel Emelyanov
749c5baf21 cql_test_env: Define all services' variables early
Nowadays they are all scattered along the .do_with() function. Keeping
them in one early place makes it possible to relocate them onto the
cql_test_env later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:23:54 +03:00
Pavel Emelyanov
d36737f094 cql_test_env: Keep group0_client pointer
It's now a reference, but some time later it won't be able to be
initialized at construction time, so turn it into a pointer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 15:23:16 +03:00
Pavel Emelyanov
da98355bc8 test: Remove require_..._exists from cql_test_env
Not used by any code anymore. This makes cql_test_env shorter and nicer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:36 +03:00
Pavel Emelyanov
64c8a59e9b test: Open-code ks.cf name parse into cdc_test
The test uses qualified ks.cf name to find the schema, but it's the only
test case that does it. There's no point in maintaining a dedicated
helper on the cql_test_env just for that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:36 +03:00
Pavel Emelyanov
6ead9a5255 test: Don't use require_table_exists() in test/lib/random_schema
This check is pointless. The subsequent call to find_column_family()
would call on_internal_error() in case the schema is not found, and since
cql_test_env sets abort-on-internal-error to true, this would fail just
the same

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:36 +03:00
Pavel Emelyanov
b4c84f9174 test: Use BOOST_REQUIRE(!db.has_schema())
Surprisingly there's a dedicated helper for the check opposite to the
one fixed in the previous patch. Fix that one too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:36 +03:00
Pavel Emelyanov
063baabaee test: Use BOOST_REQUIRE(db.has_schema())
Same as in the previous patch: the cql_test_env::require_table_exists()
helper does exactly the same check, but returns a future and asserts on
failures for no gain

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:32 +03:00
Pavel Emelyanov
a128108acd test: Use BOOST_REQUIRE(db.has_keyspace())
The cql_test_env::require_keyspace_exists() performs exactly the same
check, but is a future-returning function for no reason, and it assert()s
on failure, which is less informative (not that it ever failed) than
BOOST_REQUIRE

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:46:29 +03:00
Pavel Emelyanov
53c309063e test: Threadify cql_query_test::test_compact_storage case
It's like the previous patch, and for the same reason, but the change is
a bit more complicated because it uses resolved futures' results in a few
places, so it likely deserves a separate commit

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:39:40 +03:00
Pavel Emelyanov
59db35fba0 test: Threadify some cql_query_test cases
Those two use straight .then() sequences; no point in keeping them that
long. Being threads makes the next patches shorter and nicer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-12 11:38:51 +03:00
Petr Gusev
5361de76f9 random_tables.py: add counter column type
We'll need it for fencing test.
2023-08-11 17:37:09 +04:00
Petr Gusev
f5b41a8075 raft topology: don't increment version when transitioning to node_state::normal
This version increment is not accompanied by a
global_token_metadata_barrier, which means the new
version won't be reflected in fence_version
and will basically have no effect in terms of fencing.
2023-08-11 16:22:25 +04:00
Aleksandra Martyniuk
db932c7106 compaction: hold gate immediately after task executor is created
If make_task call in compaction_manager::perform_compaction yields,
compaction_task_executor::_compaction_state may be gone and gate
won't be held.

Hold gate immediately after compaction_task_executor is created.
Add comment not to call prepare_task without preparation.

Refs: #14971.
Fixes: #14977.

Closes #14999
2023-08-11 13:56:38 +02:00
Kefu Chai
17c1b15c81 create-relocatable-package.py: add version file with tempfile
before this change, we build multiple relocatable packages for
different builds in parallel using ninja. all these relocatable
packages are built using the same
`create-relocatable-package.py` script, but this script always uses
the same directory and file for the `.relocatable_package_version`
file.

so there is a chance that the jobs building the relocatable
packages race, writing / accessing the same file at the same
time. so, in this change, instead of using a fixed file path
for this temporary file, we use a NamedTemporaryFile for this purpose.
this should help us avoid build failures like

```
[2023-08-10T09:38:00.019Z] FAILED: build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz
[2023-08-10T09:38:00.019Z] scripts/create-relocatable-package.py --mode debug 'build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz'
[2023-08-10T09:38:00.019Z] Traceback (most recent call last):
[2023-08-10T09:38:00.019Z]   File "/jenkins/workspace/scylla-master/scylla-ci/scylla/scripts/create-relocatable-package.py", line 130, in <module>
[2023-08-10T09:38:00.019Z]     os.makedirs(f'build/{SCYLLA_DIR}')
[2023-08-10T09:38:00.019Z]   File "<frozen os>", line 225, in makedirs
[2023-08-10T09:38:00.019Z] FileExistsError: [Errno 17] File exists:
'build/scylla-package'
```
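The fix follows the standard tempfile pattern -- each job gets a unique path instead of a shared fixed one. A minimal sketch (the suffix and contents here are illustrative, not the script's actual values):

```python
import tempfile

def write_version_file(version: str) -> str:
    # Each parallel ninja job writes its version marker to a unique
    # temporary file, so concurrent builds cannot race on a fixed path.
    with tempfile.NamedTemporaryFile(
            mode="w", suffix=".relocatable_package_version",
            delete=False) as f:
        f.write(version + "\n")
        return f.name

path_a = write_version_file("5.4.0~dev")
path_b = write_version_file("5.4.0~dev")
print(path_a != path_b)  # -> True: no shared path, no race
```

NamedTemporaryFile generates a fresh, collision-free name per call, which is what removes the FileExistsError race between parallel jobs.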

Fixes #15018
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15007
2023-08-11 12:57:28 +03:00
Aleksandra Martyniuk
ae67f5d47e api: ignore future in task_manager_json::wait_task
Before returning task status, wait_task waits for it to finish with
done() method and calls get() on a resulting future.

If requested task fails, an exception will be thrown and user will
get internal server error instead of failed task status.

Result of done() method is ignored.

Fixes: #14914.

Closes #14915
2023-08-11 08:18:51 +03:00
Patryk Jędrzejczak
d1d1b6cf6e raft: remove a replaced node from group 0 earlier
The topology coordinator only marks a replaced node as LEFT
during the replace operation and actually removes it from the
group 0 config in cleanup_group0_config_if_needed. If this
function is called before raft has committed a replacing node as
a voter, it does not remove the replaced node from the group 0
config. Then, the coordinator can decide that it has no work to
do and starts sleeping, leaving us with an outdated config.

This behavior reduces group 0 availability and causes problems in
tests (see #14885). Also, it makes the coordinator's logic
confusing - it claims that it has no work to do when it has some
work to do. Therefore, we modify the coordinator so that it
removes the replaced node earlier in handle_topology_transition.

Fixes #14885
Fixes #14975

Closes #15009
2023-08-11 01:32:24 +02:00
Botond Dénes
403ba9b055 Merge 'gossiper: lock_endpoint: fix review comments' from Benny Halevy
This series fixes a couple of review comments on #14845

Closes #14976

* github.com:scylladb/scylladb:
  gossiper: lock_endpoint: fix comment regarding permit_id mismatch
  gossiper: lock_endpoint: change assert to on_internal_error
2023-08-10 17:37:32 +03:00
Gleb Natapov
517f6bfa8a test: add rebuild test
Add simple rebuild test that makes sure that rebuild operation does not
fail.
2023-08-10 16:46:13 +03:00
Gleb Natapov
53120c1d57 system_keyspace: fix assertion for missing transition_state
The code assumes that if there is no transition_state, there should be no
nodes currently in transition in a state other than the left_token_ring
state, but the rebuild operation also creates such nodes, so add a check
for it as well.
2023-08-10 16:37:56 +03:00
Kamil Braun
8f658fb139 Merge 's3/client: check for available port before starting minio server' from Kefu Chai
there is a chance that the default port of 9000 is already used on the
host running the test; in that case, we should try to use another
available port.

so, in this change, we try ports in the range of [9000, 9000+1000), and
use the first one which is not connectable.

Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14997

* github.com:scylladb/scylladb:
  test: stop using HostRegistry in MinioServer
  s3/client: check for available port before starting minio server
2023-08-10 14:01:13 +02:00
Alejo Sanchez
e2122163f5 test/pylib: protect double call to cluster stop
test.py schedules calls to cluster .uninstall() and .stop(), making
double calls to it run at the same time. Mark the cluster as not
running early on.

While there, do the same for .stop_gracefully() for consistency.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14987
2023-08-10 13:37:49 +02:00
Kamil Braun
39330b9c11 Merge 'gossiper: convict: lock_endpoint' from Benny Halevy
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.

This change somehow escaped #14845 so it amends it.

Fixes #14838

Closes #15001

* github.com:scylladb/scylladb:
  gossiper: verify permit_id in all private functions
  gossiper: convict: lock_endpoint
2023-08-10 13:09:05 +02:00
Benny Halevy
623ed1a249 gossiper: verify permit_id in all private functions
Instead of acquiring the permit if the permit_id arg is null,
like in mark_as_shutdown, just assert that the permit_id is
non-null. The functions are both private and we control all callers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-10 08:17:04 +03:00
Benny Halevy
42c1c5ead8 gossiper: convict: lock_endpoint
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.

This change somehow escaped #14845 so it amends it.

Fixes #14838

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-10 07:50:33 +03:00
Kefu Chai
0c0a59bf62 test: stop using HostRegistry in MinioServer
since MinioServer finds a free port by itself, there is no need to
provide an IP address to it anymore -- we can always use
127.0.0.1.

so, in this change, we just drop the HostRegistry parameter passed
to the constructor of MinioServer, and pass the host address in place
of it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 23:40:22 +08:00
Kamil Braun
59c410fb97 Merge 'migration_manager: announce: provide descriptions for all calls' from Patryk Jędrzejczak
The `system.group0_history` table provides useful descriptions for each
command committed to Raft group 0. One way of applying a command to
group 0 is by calling `migration_manager::announce`. This function has
the `description` parameter set to empty string by default. Some calls
to `announce` use this default value which causes `null` values in
`system.group0_history`. We want `system.group0_history` to have an
actual description for every command, so we change all default
descriptions to reasonable ones.

Going further, we remove the default value for the `description`
parameter of `migration_manager::announce` to avoid using it in the
future. Thanks to this, all commands in `system.group0_history` will
have a non-null description.

Fixes #13370

Closes #14979

* github.com:scylladb/scylladb:
  migration_manager: announce: remove the default value of description
  test: always pass empty description to migration_manager::announce
  migration_manager: announce: provide descriptions for all calls
2023-08-09 16:58:41 +02:00
Kefu Chai
29554b0fc6 s3/client: check for available port before starting minio server
there is a chance that the default port of 9000 has been used on the
host running the test; in that case, we should try to use another
available port.

so, in this change, we try ports in the range of [9000, 9000+1000),
and use the first one that is not connectable.

Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 17:33:42 +08:00
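The probing loop described in this commit can be sketched like so (an illustrative Python helper, not the actual s3/client or MinioServer code; the name `find_unused_port` is made up):

```python
import socket

def find_unused_port(host: str = "127.0.0.1",
                     base: int = 9000, span: int = 1000) -> int:
    """Return the first port in [base, base + span) nothing is listening on."""
    for port in range(base, base + span):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.2)
            # connect_ex() returns 0 when something already accepts
            # connections on the port, i.e. the port is taken; any
            # nonzero error code means the port is free to use.
            if s.connect_ex((host, port)) != 0:
                return port
    raise RuntimeError(f"no free port in [{base}, {base + span})")
```

Note this check is inherently racy (the port can be grabbed between probing and binding), which is acceptable for test tooling.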
Botond Dénes
108e510a23 Merge 'Update sstable_requiring_cleanup on compaction completion' from Benny Halevy
Currently `sstable_requiring_cleanup` is updated using `compacting_sstable_registration`, but that mechanism is not used by offstrategy compaction, leading to #14304.

This series introduces `compaction_manager::on_compaction_completion` that intercepts the call
to the table::on_compaction_completion. This allows us to update `sstable_requiring_cleanup` right before the compacted sstables are deleted, making sure they are not leaked to `sstable_requiring_cleanup`, which would hold a reference to them until cleanup attempts to clean them up.

`cleanup_incremental_compaction_test` was adjusted to observe the sstables `on_delete` (by adding a new observer event) to detect the case where cleanup attempts to delete the leaked sstables and fails since they were already deleted from the file system by offstrategy compaction. The test fails without the fix and passes with it.

Fixes #14304

Closes #14858

* github.com:scylladb/scylladb:
  compaction_manager: on_compaction_completion: erase sstables from sstables_requiring_cleanup
  compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
  sstable: add on_delete observer
  compaction_manager: add on_compaction_completion
  sstable_compaction_test: cleanup_incremental_compaction_test: verify sstables_requiring_cleanup is empty
2023-08-09 11:03:45 +03:00
Botond Dénes
69d6778daf Merge 'build: cmake: fixes for the release build' from Kefu Chai
before this change, we use generator expressions to initialize CMAKE_CXX_FLAGS_RELEASE, and this has three problems:

1. the generator expression is not expanded when setting a regular variable.
2. the ending ">" is missing in one of the generator expressions.
3. the parameters are not separated with ";"

to address them, let's just

* use `add_compile_options()` directly, as the corresponding `mode.${build_mode}.cmake` is only included when the "${build_mode}" is used.
* add the missing ";" separators between the command line options.
* add the missing closing ">"

Closes #14996

* github.com:scylladb/scylladb:
  build: cmake: pass --gc-sections to ld not ar
  build: cmake: use add_compile_options() in release build
2023-08-09 09:55:02 +03:00
Kefu Chai
47c9b25bac compaction_manager: correct comment on compaction_task_executor::state
when it comes to `regular_compaction_task_executor`, we repeat the
compaction until it cannot proceed, so after an iteration
of compaction completes successfully, the task can still continue with
yet another round of compaction as it sees appropriate. so let's
update the comment to reflect this fact.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14891
2023-08-09 09:49:18 +03:00
Kefu Chai
6dc885a8e2 compaction: mark more member variables const
quite a few member variables serve as the configuration for
a given compaction; they are immutable over its life cycle,
so for better readability, let's mark them `const`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14981
2023-08-09 09:28:44 +03:00
Botond Dénes
5f65ac73ed Merge 'Remove qctx' from Pavel Emelyanov
The only place that still calls it is the static force_blocking_flush method. It can be made non-static already. Also, while at it, coroutinize some system_keyspace methods and fix a FIXME regarding replica::database access in it.

Closes #14984

* github.com:scylladb/scylladb:
  code: Remove query-context.hh
  code: Remove qctx
  system_keyspace: Use system_keyspace's container() to flush
  system_keyspace: Make force_blocking_flush() non-static
  system_keyspace: Coroutinize update_tokens()
  system_keyspace: Coroutinize save_truncation_record()
2023-08-09 09:27:53 +03:00
Kefu Chai
782b1992b2 build: cmake: pass --gc-sections to ld not ar
ar is not able to tell which sections are to be GC'ed, hence it does
not care about --gc-sections, but ld does. let's add this option
to CMAKE_EXE_LINKER_FLAGS.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 13:50:44 +08:00
Kefu Chai
f7377725c2 build: cmake: use add_compile_options() in release build
before this change, we use generator expressions to initialize
CMAKE_CXX_FLAGS_RELEASE, and this has three problems:

1. the generator expression is not expanded when setting
   a regular variable.
2. the ending ">" is missing in one of the generator
   expressions.
3. the parameters are not separated with ";"

to address them, let's just

* use `add_compile_options()` directly, as the corresponding
  `mode.${build_mode}.cmake` is only included when the
  "${build_mode}" is used.
* add the missing ";" separators between the command line options.
* add the missing closing ">"

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 12:56:06 +08:00
Kefu Chai
153a808f52 config: remove unused namespace alias
bpo is not used after it is defined, so drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 10:17:34 +08:00
Kefu Chai
6355270120 config: use std::ranges when appropriate
use std::ranges functions for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 10:17:34 +08:00
Kefu Chai
64bc8d2f7d config: drop "experimental" option
"experimental" was marked deprecated in 8b917f7c. this change
was included since Scylla 4.6. now that 5.3 has been branched,
this change will be included 5.4. this should be long enough
for the user's turn around if this option is ever used.

the dtests using this option have been audited and updated
accordingly. and the unit test covering this option is removed as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 10:17:34 +08:00
Kefu Chai
959362f85b test: disable 'enable_user_defined_functions' if experimental_features does not include udf
"enable_user_defined_functions" is enabled by default by
`make_scylla_conf()` in pylib/scylla_cluster.py, and we've being
using `experimental` = True in this very function. this combination
works fine, as "udf" is enabled by `experimental`. but
since `experimental` is deprecated, if we drop this option and stop
handling it. this peace is broken. "enable_user_defined_function"
requires "udf" experimental feature. but test_boost_after_ip_change
feed the scylla with an empty `experimental_features` list, so
the test fails. to pave for the road of dropping `experimental`
option, let's disable `enable_user_defined_function` as well
in test_boost_after_ip_change.

the same applies to other tests changed in this commit.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 10:17:34 +08:00
Kefu Chai
91e677c6c8 test: pylib: specify experimental_features explicitly
"experimental" was marked deprecated in 8b917f7c. so let's
specify the experimental features explicitly using
`experimental_feature` option.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-09 10:17:34 +08:00
Pavel Emelyanov
f1515c610e code: Remove query-context.hh
The whole thing is unused now, so the header is no longer needed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:11:07 +03:00
Pavel Emelyanov
413d81ac16 code: Remove qctx
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:10:56 +03:00
Pavel Emelyanov
d7f5d6dba8 system_keyspace: Use system_keyspace's container() to flush
In force_blocking_flush() there's an invoke-on-all invocation of
replica::database::flush() and a FIXME to get the replica database from
somewhere else rather than via query-processor -> data_dictionary.

Now that force_blocking_flush() is non-static, the invoke-on-all can
happen via system_keyspace's container, and the database can be obtained
directly from the local system_keyspace instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:09:32 +03:00
Pavel Emelyanov
7a342ed5c0 system_keyspace: Make force_blocking_flush() non-static
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:09:20 +03:00
Pavel Emelyanov
6b8fe5ac43 system_keyspace: Coroutinize update_tokens()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:09:15 +03:00
Pavel Emelyanov
1700d79b60 system_keyspace: Coroutinize save_truncation_record()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-08 11:09:09 +03:00
Benny Halevy
7a7c8d0d23 compaction_manager: on_compaction_completion: erase sstables from sstables_requiring_cleanup
Erase retired sstables from compaction_state::sstables_requiring_cleanup
also in on_compaction_completion (in addition to
compacting_sstable_registration::release_compacting)
for offstrategy compaction with piggybacked cleanup
or any other compaction type that doesn't use
compacting_sstable_registration.

Add cleanup_during_offstrategy_incremental_compaction_test
that is modeled after cleanup_incremental_compaction_test to check
that cleanup doesn't attempt to clean up already-deleted
sstables that were left over by offstrategy compaction
in sstables_requiring_cleanup.

Fixes #14304

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-08 08:16:46 +03:00
Benny Halevy
b1e164a241 compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
Prevent div-by-zero by returning a constant level of 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to also cover offstrategy compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-08 08:16:46 +03:00
Benny Halevy
b08f2ac4c6 sstable: add on_delete observer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-08 08:15:00 +03:00
Benny Halevy
df66895080 compaction_manager: add on_compaction_completion
Pass the call to the table on_compaction_completion
so we can manage the sstables requiring cleanup state
along the way.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-08 08:12:05 +03:00
Benny Halevy
ea64ae54f8 sstable_compaction_test: cleanup_incremental_compaction_test: verify sstables_requiring_cleanup is empty
Make sure that there are no sstables_requiring_cleanup
after cleanup compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-08 08:12:01 +03:00
Patryk Jędrzejczak
356e131acd migration_manager: announce: remove the default value of description
We remove the default value for the description parameter of
migration_manager::announce to avoid using it in the future.
Thanks to this, all commands in system.group0_history will
have a non-null description.
2023-08-07 14:38:11 +02:00
Patryk Jędrzejczak
866c9a904d test: always pass empty description to migration_manager::announce
In the next commit, we remove the default value for the
description parameter of migration_manager::announce to avoid
using it in the future. However, many calls to announce in tests
use the default value. We have to change it, but we don't really
care about descriptions in the tests, so we pass the empty string
everywhere.
2023-08-07 14:38:11 +02:00
Patryk Jędrzejczak
27ddf78171 migration_manager: announce: provide descriptions for all calls
The system.group0_history table provides useful descriptions
for each command committed to Raft group 0. One way of applying
a command to group 0 is by calling migration_manager::announce.
This function has the description parameter set to empty string
by default. Some calls to announce use this default value which
causes null values in system.group0_history. We want
system.group0_history to have an actual description for every
command, so we change all default descriptions to reasonable ones.

We can't provide a reasonable description to announce in
query_processor::execute_thrift_schema_command because this
function is called in multiple situations. To solve this issue,
we add the description parameter to this function and to
handler::execute_schema_command that calls it.
2023-08-07 14:38:11 +02:00
Avi Kivity
4f7e83a4d0 cql3: select_statement: reject DISTINCT with GROUP BY on clustering keys
While in SQL DISTINCT applies to the result set, in CQL it applies
to the table being selected, and doesn't allow GROUP BY with clustering
keys. So reject the combination like Cassandra does.

While this is not an important issue to fix, it blocks un-xfailing
other issues, so I'm clearing it ahead of fixing those issues.

An issue is unmarked as xfail, and other xfails lose this issue
as a blocker.

Fixes #12479

Closes #14970
2023-08-07 15:35:59 +03:00
Benny Halevy
db7a4109dd gossiper: lock_endpoint: fix comment regarding permit_id mismatch
Fixes a code review comment.
See https://github.com/scylladb/scylladb/pull/14845#discussion_r1283572889

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-07 14:39:42 +03:00
Benny Halevy
4ebd2fa09d gossiper: lock_endpoint: change assert to on_internal_error
Fixes a code review comment.
See https://github.com/scylladb/scylladb/pull/14845#discussion_r1283060243

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-07 14:36:35 +03:00
Patryk Jędrzejczak
1772433ae2 raft_group0: log gaining and losing leadership on the INFO level
Knowing that a server gained or lost leadership in group 0 is
sometimes useful for the purpose of debugging, so we log
information about these events on the INFO level.

Gaining and losing leadership are relatively rare events, so
this change shouldn't flood the logs.

Closes #14877
2023-08-07 12:13:24 +02:00
Kamil Braun
9edc98f8e9 Merge 'raft: make a removed/decommissioning node a non-voter early' from Patryk Jędrzejczak
For `removenode`, we make a removed node a non-voter early. There is no
downside to it because the node is already dead. Moreover, it improves
availability in some situations.

For `decommission`, if we decommission a node when the number of nodes
is even, we make it a non-voter early to improve availability. All
majorities containing this node will remain majorities when we make this
node a non-voter and remove it from the set because the required size of
a majority decreases.

We don't change `decommission` when the number of nodes is odd since
this may reduce availability.

Fixes #13959

Closes #14911

* github.com:scylladb/scylladb:
  raft: make a decommissioning node a non-voter early
  raft: topology_coordinator: implement step_down_as_nonvoter
  raft: make a removed node a non-voter early
2023-08-07 10:14:33 +02:00
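The availability argument above boils down to simple majority arithmetic, sketched here (illustrative only):

```python
def majority(voters: int) -> int:
    # A Raft majority is strictly more than half of the voters.
    return voters // 2 + 1

# Even cluster: demoting the leaving node to a non-voter shrinks the
# required majority, so every surviving majority remains a majority.
assert majority(4) == 3   # before demotion: need 3 of 4 voters
assert majority(3) == 2   # after demotion: need 2 of the remaining 3

# Odd cluster: the required majority does not shrink, so demoting
# early could only reduce availability.
assert majority(5) == 3
assert majority(4) == 3
```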
Botond Dénes
fa4aec90e9 Merge 'test: tasks: Fix task_manager/wait_task test ' from Aleksandra Martyniuk
Rewrite test that checks whether task_manager/wait_task works properly.
The old version didn't work. Delete functions used in old version.

Closes #14959

* github.com:scylladb/scylladb:
  test: rewrite wait_task test
  test: move ThreadWrapper to rest_util.py
2023-08-07 09:04:29 +03:00
Benny Halevy
6f037549ac sstables: delete_with_pending_deletion_log: batch sync_directory
When deleting multiple sstables with the same prefix
the deletion atomicity is ensured by the pending_delete_log file,
so if scylla crashes in the middle, deletions will be replayed on
restart.

Therefore, we don't have to ensure atomicity of each individual
`unlink`.  We just need to sync the directory once, before
removing the pending_delete_log file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14967
2023-08-06 18:52:13 +03:00
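The batching idea can be sketched in Python (a simplified, hypothetical helper -- not the actual `delete_with_pending_deletion_log` implementation; Linux-specific `O_DIRECTORY`/directory-fsync semantics assumed):

```python
import os

def delete_with_pending_log(directory: str, names: list[str],
                            pending_log: str) -> None:
    """Unlink several files, sync their directory once, then drop the log."""
    for name in names:
        # No per-unlink directory sync: a crash here is fine, because
        # the pending-delete log replays the deletions on restart.
        os.unlink(os.path.join(directory, name))
    # One sync makes all the unlinks durable together...
    dir_fd = os.open(directory, os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    # ...and only then is it safe to remove the pending-delete log.
    os.unlink(os.path.join(directory, pending_log))
```

The per-unlink syncs are dropped precisely because atomicity is delegated to the log, which is the observation the commit makes.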
Avi Kivity
6c1e44e237 Merge 'Make replica::database and cql3::query_processor share wasm manager' from Pavel Emelyanov
This makes it possible to remove remaining users of the global qctx.

The thing is that db::schema_tables code needs to get wasm's engine, alien runner and instance cache to build a wasm context for the merged function, or to drop it from the cache in the opposite case. To get the wasm stuff, this code uses the global qctx -> query_processor -> wasm chain. However, the function (un)merging code already has the database reference at hand, and it's natural to get the wasm stuff from it, not from the q.p., which is not available.

So this PR packs the wasm engine, runner and cache into a sharded<wasm::manager> instance, makes the manager be referenced by both the q.p. and the database, and removes the qctx from the schema tables code.

Closes #14933

* github.com:scylladb/scylladb:
  schema_tables: Stop using qctx
  database: Add wasm::manager& dependency
  main, cql_test_env, wasm: Start wasm::manager earlier
  wasm: Shuffle context::context()
  wasm: Add manager::remove()
  wasm: Add manager::precompile()
  wasm: Move stop() out of query_processor
  wasm: Make wasm sharded<manager>
  query_processor: Wrap wasm stuff in a struct
2023-08-06 17:00:28 +03:00
Avi Kivity
412629a9a1 Merge 'Export tablet load-balancer metrics' from Tomasz Grabiec
The metrics are registered on-demand when the load-balancer is invoked, so that only the leader exports the metrics. When the leader changes, the old leader will stop exporting.

The metrics are divided into two levels: per-dc and per-node. In prometheus, they will have appropriate labels for dc and host_id values.

Closes #14962

* github.com:scylladb/scylladb:
  tablet_allocator: unregister metrics when leadership is lost
  tablets: load_balancer: Export metrics
  service, raft: Move balance_tablets() to tablet_allocator
  tablet_allocator: Start even if tablets feature is not enabled
  main, storage_service: Pass tablet allocator to storage_service
2023-08-06 16:58:27 +03:00
Tomasz Grabiec
f26e65d4d4 tablets: Fix crash on table drop
Before the patch, tablet metadata update was processed on local schema merge
before table changes.

When a table is dropped, this means that for a while the table will exist
without a corresponding tablet map. This can cause a memtable flush for
this table to fail, resulting in an intentional abort(). That's because
sstable writing attempts to access the tablet map to generate sharding
metadata.

If auto_snapshot is enabled, this is much more likely to happen,
because we flush memtables on table drop.

To fix the problem, process tablet metadata after dropping tables, but
before creating tables.

Fixes #14943

Closes #14954
2023-08-06 16:45:43 +03:00
Pavel Emelyanov
3c6686e181 bptree: Replace assert with static_assert
The assertion runs under an already-checked constexpr value anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14951
2023-08-06 16:36:12 +03:00
Tomasz Grabiec
f827cfd5b6 tablet_allocator: unregister metrics when leadership is lost
So that graphs are not polluted with stale metrics from past leaders.
2023-08-05 21:48:08 +02:00
Tomasz Grabiec
d653cbae53 tablets: load_balancer: Export metrics 2023-08-05 21:48:08 +02:00
Tomasz Grabiec
67c7aadded service, raft: Move balance_tablets() to tablet_allocator
The implementation will access metrics registered from tablet_allocator.
2023-08-05 21:48:08 +02:00
Tomasz Grabiec
cb0d763a22 tablet_allocator: Start even if tablets feature is not enabled
topology coordinator will call it. Rather than spreading ifs there,
it's simpler to start it and disable functionality in the tablet
allocator.
2023-08-05 21:48:08 +02:00
Tomasz Grabiec
5bfc8b0445 main, storage_service: Pass tablet allocator to storage_service
Tablet balancing will be done through tablet_allocator later.
2023-08-05 03:10:26 +02:00
Pavel Emelyanov
fd50ba839c schema_tables: Stop using qctx
There are two places in there that need qctx to get the query_processor,
and through it, in turn, the wasm::manager. Fortunately, both places have the
database reference at hand and can get the wasm::manager from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
fa93ac9bfd database: Add wasm::manager& dependency
The dependency is needed by db::schema_tables to get wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
f4e7ffa0fc main, cql_test_env, wasm: Start wasm::manager earlier
It will be needed by replica::database and should be available that
early. It doesn't depend on anything and can be moved in the starting
order safely

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
595c5abbf9 wasm: Shuffle context::context()
Add a constructor that builds a context out of a const manager reference.
The existing one needs to get the engine and instance cache and does it via
query_processor. This change lets us remove those exports and, finally,
drop the wasm::manager -> cql3::query_processor friendship.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
56404ee053 wasm: Add manager::remove()
This is one of the users of query_processor's export of wasm::manager's
instance cache. Remove it in advance

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
93cb73fddb wasm: Add manager::precompile()
This is so that query_processor doesn't have to export the alien runner
from the wasm::manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
d58a2d65b5 wasm: Move stop() out of query_processor
When the q.p. stops, it also "stops" the wasm manager. Move this call
into main. The cql test env doesn't need this change; it stops the whole
sharded service, which stops instances on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
243f2217dd wasm: Make wasm sharded<manager>
The wasm::manager is just cql3::wasm_context renamed. It now sits in
lang/wasm* and is started as a sharded service in main (and cql test
env). This move also needs some headers shuffling, but it's not severe

This change is required to make it possible for the wasm::manager to be
shared (by reference) between q.p. and replica::database further

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Pavel Emelyanov
dde285e7e9 query_processor: Wrap wasm stuff in a struct
There are three wasm-only fields on the q.p. -- engine, cache and runner.
This patch groups them in a single wasm_context structure to make it
easier to manipulate them in the next patches.

The 'friend' declaration is temporary; it will go away soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Kamil Braun
421a5ad55c Merge 'feature_service: don't load whole topology state to check features' from Piotr Dulikowski
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.

However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.

In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.

Fixes: #14944

Closes #14955

* github.com:scylladb/scylladb:
  feature_service: don't load whole topology state to check features
  system_keyspace: separate loading topology_features from topology
  topology_state_machine: extract features-related fields to a struct
  untyped_result_set: add missing_column_exception
2023-08-04 15:09:12 +02:00
Kamil Braun
fed775e13b Merge 'group0_state_machine: await transfer_snapshot' from Benny Halevy
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.

Fixes #14907

Also, use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()

Closes #14952

* github.com:scylladb/scylladb:
  raft: group0_state_machine: transfer_snapshot: make abortable
  raft: group0_state_machine: transfer_snapshot: hold gate
2023-08-04 14:21:57 +02:00
Botond Dénes
68d2397d01 Merge 'repair: delete unused fields' from Aleksandra Martyniuk
Delete unused shard_repair_task_impl members and incorrectly used method's argument.

Closes #14956

* github.com:scylladb/scylladb:
  repair: delete task_manager_module::get_progress argument
  repair: delete unused shard_repair_task_impl fields
2023-08-04 15:08:31 +03:00
Aleksandra Martyniuk
629f893355 test: rewrite wait_task test
Rewrite test that checks whether task_manager/wait_task works properly.
The old version didn't work. Delete functions used in old version.
2023-08-04 13:34:58 +02:00
Aleksandra Martyniuk
9d2e55fd37 test: move ThreadWrapper to rest_util.py
Move ThreadWrapper to rest_util.py so it can be reused in different tests.
2023-08-04 13:29:03 +02:00
Piotr Dulikowski
b7d9348229 feature_service: don't load whole topology state to check features
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.

However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.

In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.

Fixes: #14944
2023-08-04 12:32:05 +02:00
Piotr Dulikowski
8f491457ae system_keyspace: separate loading topology_features from topology
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
2023-08-04 12:32:04 +02:00
Piotr Dulikowski
f1704eeee6 topology_state_machine: extract features-related fields to a struct
`enabled_features` and `supported_features` are now moved to a new
`topology::features` struct. This will allow loading this
information independently from the `topology` struct, which will be
needed for feature checking during start.
2023-08-04 12:21:51 +02:00
Aleksandra Martyniuk
66df686980 repair: delete task_manager_module::get_progress argument
Accepting a reason argument in task_manager_module::get_progress is deceiving,
as the method works properly only for streaming::stream_reason::repair
(repair::shard_repair_task_impl::nr_ranges_finished isn't updated for
any other reason).
2023-08-04 11:09:37 +02:00
Aleksandra Martyniuk
93ebbdcf1d repair: delete unused shard_repair_task_impl fields 2023-08-04 10:52:24 +02:00
Botond Dénes
00a62866ac Merge 'Make database::add_column_family exception safe.' from Aleksandra Martyniuk
If some state update in database::add_column_family throws,
the info about the column family would be left inconsistent.

Undo already performed operations in database::add_column_family
when one throws.

Fixes: #14666.

Closes #14672

* github.com:scylladb/scylladb:
  replica: undo the changes if something fails
  replica: start table earlier in database::add_column_family
2023-08-04 10:58:17 +03:00
Botond Dénes
4d538e1363 Merge 'Task manager tasks covering compaction group compaction' from Aleksandra Martyniuk
All compaction task executors, except for regular compaction one,
become task manager compaction tasks.

Creating and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.

Closes #14505

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to cover compaction group tasks
  compaction: turn custom_task_executor into compaction_task_impl
  compaction: turn sstables_task_executor into sstables_compaction_task_impl
  compaction: change sstables compaction tasks type
  compaction: move table_upgrade_sstables_compaction_task_impl
  compaction: pass task_info through sstables compaction
  compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
  compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
  compaction: use optional task info in major compaction
  compaction: use perform_compaction in compaction_manager::perform_major_compaction
2023-08-04 10:11:00 +03:00
Michał Jadwiszczak
b92d47362f schema::describe: print 'synchronous_updates' only if it was specified
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.

Fixes: #14924

Closes #14928
2023-08-04 09:52:37 +03:00
Kefu Chai
d8d91379e7 test: remove unnecessary check in compaction_manager_basic_test
we wait for the same condition a couple of lines before, so there is no need
to check it again using `BOOST_CHECK_EQUAL()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14921
2023-08-04 09:26:22 +03:00
Piotr Dulikowski
fad1e82bf7 untyped_result_set: add missing_column_exception
Currently, when one tries to access a column that an untyped_result_set
does not contain, a `std::bad_variant_access` exception is thrown. This
exception's message provides very little context and it can be difficult
to even figure out where this message is coming from.

In order to improve the situation, a new exception `missing_column` is
introduced which includes the missing column's name in its error
message. The exception derives from `std::bad_variant_access` for
compatibility with existing code that may want to catch it.
2023-08-04 07:37:12 +02:00
Kefu Chai
374bed8c3d tools: do not create bpo::value unless transfer it to an option_description
`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer, managing its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.

before this change, we initialized positional options as global
variables using `boost::program_options::value()`. but unfortunately,
we did not always initialize a `bpo::option_description` from them --
we only did this on demand, when the corresponding subcommand was
called.

so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked, hence LeakSanitizer warns us.

after this change, we create the option vector as a static
local variable in a function, so it is created on demand as well.
as an alternative, we could initialize the options vector as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.

Fixes #14929
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14939
2023-08-04 08:03:11 +03:00
Aleksandra Martyniuk
1e9b2972ea replica: undo the changes if something fails
If a step of adding a table fails, previous steps are undone.
2023-08-03 17:37:31 +02:00
Benny Halevy
46c9e3032d storage_service: get_all_ranges: reserve enough space in ranges
Commit bc5f6cf45d
added a reserve call to the `ranges` vector before
inserting all the returned token ranges into it.
However, that reservation is too small as we need
to express size+1 ranges for size tokens with
<unbound, token[0]> and <token[size-1], unbound>
ranges at the front and back, respectively.

Fixes #14849

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14938
2023-08-03 17:13:03 +03:00
Benny Halevy
357d57c82d raft: group0_state_machine: transfer_snapshot: make abortable
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-03 16:32:08 +03:00
Benny Halevy
a23b58231e raft: group0_state_machine: transfer_snapshot: hold gate
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.

Fixes #14907

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-03 15:45:34 +03:00
Botond Dénes
946c6487ee Merge 'repair: Add ranges_parallelism option' from Asias He
This patch adds the ranges_parallelism option to repair restful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job to a smaller number than the Scylla core calculated default max_repair_ranges_in_parallel.

Scylla manager can also use this option to provide more ranges (>N) in a single repair job while repairing only N (ranges_parallelism) in parallel, instead of providing N ranges per repair job.

To make it safer, unlike the PR #4848, this patch does not allow the user to exceed the max_repair_ranges_in_parallel.

Fixes #4847

Closes #14886

* github.com:scylladb/scylladb:
  repair: Add ranges_parallelism option
  repair: Change to use coroutine in do_repair_ranges
2023-08-03 11:34:05 +03:00
Kefu Chai
d4ee84ee1e s3/test: nuke tempdir but keep $tempdir/log
before this change, if the object_store test fails, the tempdir
will be preserved. and if our CI test pipeline is used to perform
the test, the test job would scan for the artifacts, and if the
test in question fails, it would take over 1 hour to scan the tempdir.

to alleviate the pain, let's just keep the scylla logging file
whether the test fails or succeeds, so that jenkins can scan the
artifacts faster if the test fails.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14880
2023-08-03 11:07:59 +03:00
Avi Kivity
cb3b808e3f Merge 'replica/table.cc: Add per-node-per-table metrics' from Amnon Heiman
Per-table metrics are very valuable for users, but they come with a high load on both the reporting and the collecting metrics systems.

This patch adds a small subset of per-table metrics that will be reported at the node level.

The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
  resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
  operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
  operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
  performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes

The names of the read/write metrics are based on the histogram convention, so when latency histograms are added, the names will not change.

The metrics are labeled with a specific label __per_table="node" so it will be possible to easily manipulate them.

The metrics will be available when enable_metrics_reporting (the per-table full metrics flag) is off and enable_node_table_metrics is true.

Fixes #2198

Closes #13293

* github.com:scylladb/scylladb:
  replica/table.cc: Add node-per-table metrics
  config: add enable_node_table_metrics flag
2023-08-02 22:17:47 +03:00
Patryk Jędrzejczak
d9137c7bdc raft: make a decommissioning node a non-voter early
If we decommission a node when the number of nodes is even, we
make it a non-voter early to improve availability. All majorities
containing this node will remain majorities when we make this node
a non-voter and remove it from the set because the required size
of a majority decreases.
2023-08-02 17:02:55 +02:00
Patryk Jędrzejczak
20b13f89a1 raft: topology_coordinator: implement step_down_as_nonvoter
We move the logic that makes topology_coordinator a non-voter to
a separate function called step_down_as_nonvoter to avoid code
duplication. We use this function in the next commit.
2023-08-02 16:52:34 +02:00
Patryk Jędrzejczak
377f87c91a raft: make a removed node a non-voter early
For removenode, we make a removed node a non-voter early. There is
no downside to it because the node is already dead. Moreover, it
improves availability in some situations. Consider a 4-node
cluster with one dead node. If we make the dead node a non-voter
at the beginning of removenode, group 0 will survive the death
of another node in the middle of removenode.
2023-08-02 16:52:33 +02:00
Aleksandra Martyniuk
9f68566038 replica: start table earlier in database::add_column_family
In database::add_column_family, table::start() is now called before
the table is registered in various structures.
2023-08-02 16:35:34 +02:00
Kamil Braun
39ca07c49b Merge 'Gossiper endpoint locking' from Benny Halevy
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.

We make sure that all notifications (except for `before_change`, which
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.

An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.

Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471

Closes #14845

* github.com:scylladb/scylladb:
  gossiper: replicate: ensure non-null permit
  gossiper: add_saved_endpoint: lock_endpoint
  gossiper: mark_as_shutdown: lock_endpoint
  gossiper: real_mark_alive: lock_endpoint
  gossiper: advertise_token_removed: lock_endpoint
  gossiper: do_status_check: lock_endpoint
  gossiper: remove_endpoint: lock_endpoint if needed
  gossiper: force_remove_endpoint: lock_endpoint if needed
  storage_service: lock_endpoint when removing node
  gossiper: use permit_id to serialize state changes while preventing deadlocks
  gossiper: lock_endpoint: add debug messages
  utils: UUID: make default tagged_uuid ctor constexpr
  gossiper: lock_endpoint must be called on shard 0
  gossiper: replicate: simplify interface
  gossiper: mark_as_shutdown: make private
  gossiper: convict: make private
  gossiper: mark_as_shutdown: do not call convict
2023-08-02 13:50:08 +02:00
Konstantin Osipov
df97135583 test.py: forward the optional property file when creating a server
To support multi-DC tests we need to provide a property
file when creating a server.
Forward it from the test client to test.py.

Closes #14683
2023-08-02 13:45:19 +02:00
Kamil Braun
b835acf853 Merge 'Cluster features on raft: topology coordinator + check on boot' from Piotr Dulikowski
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.

Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:

1. First, a global barrier is performed to make sure that all nodes saw
   and persisted the same state of the `system.topology` table as the
   coordinator and see the same supported features of all nodes. When
   booting, nodes are now forbidden to revoke support for a feature if all
   nodes declare support for it, so a successful barrier makes sure that
   no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
   `system.topology` table.

The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).

For now, the implementation relies on gossip shadow round check to
protect from nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify joining node's cluster features
before the new node is added to group 0 and to the cluster.

A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.

Closes #14722

* github.com:scylladb/scylladb:
  test: topology_experimental_raft: cluster feature tests
  test: topology: fix a skipped test
  storage_service: add injection to prevent enabling features
  storage_service: initialize enabled features from first node
  topology_state_machine: add size(), is_empty()
  group0_state_machine: enable features when applying cmds/snapshots
  persistent_feature_enabler: attach to gossip only if not using raft
  feature_service: enable and check raft cluster features on startup
  storage_service: provide raft_topology_change_enabled flag from outside
  storage_service: enable features in topology coordinator
  storage_service: add barrier_after_feature_update
  topology_coordinator: exec_global_command: make it optional to retake the guard
  topology_state_machine: add calculate_not_yet_enabled_features
2023-08-02 12:32:27 +02:00
Pavel Emelyanov
c3b23fc03d Merge 'Skip mode validation for snapshots' from Benny Halevy
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Closes #14892

* github.com:scylladb/scylladb:
  distributed_loader: process_sstable_dir: do not verify snapshots
  utils/directories: verify_owner_and_mode: add recursive flag
2023-08-02 13:05:47 +03:00
Kefu Chai
d28c06b65b test: remove unused #include in sstable_*_test.cc
for faster build times and clear inter-module dependencies, we
should not #includes headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed, except for "test/lib/scylla_test_case.hh"
as it brings some command line options used by scylla tests.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14922
2023-08-02 11:58:03 +03:00
Kefu Chai
1bcd9dd80a compaction: drop unnecessary type cast
get_compacted_fragments_writer() returns an instance of
`compacted_fragments_writer`; there is no need to cast it again.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14919
2023-08-02 11:36:10 +03:00
Amnon Heiman
c30d7ba5d7 replica/table.cc: Add node-per-table metrics
Per-table metrics are very valuable for users, but they come with a
high load on both the reporting and the collecting metrics systems.

This patch adds a small subset of per-table metrics that will be
reported at the node level.

The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
  resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
  operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
  operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
  performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes

The names of the read/write metrics are based on the histogram convention,
so when latency histograms are added, the names will not change.

The metrics are labeled with a specific label __per_table="node" so it
will be possible to easily manipulate them.

The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-02 10:20:18 +03:00
Amnon Heiman
d10a3dd19a config: add enable_node_table_metrics flag
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.

There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.

This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.

For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-02 10:20:18 +03:00
Kefu Chai
6c66030b7b compaction: add formatter for compaction_task_executor
add fmt formatter for `compaction_task_executor::state` and
`compaction_task_executor` and its derived classes.

this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `compaction_task_executor`, its derived classes and
`compaction_task_executor::state` without the help of `operator<<`.

since all of the callers of `operator<<` of these types now use
formatters, the operator<< overloads are removed in this change. the helpers
like `to_string()` and `describe()` are removed as well, as it'd
be more consistent if we always use fmtlib for formatting instead
of inventing APIs with different names.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14906
2023-08-02 09:15:43 +03:00
Benny Halevy
949ea43034 topology: unindex_node: erase dc from datacenters when empty
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.

This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.

Fixes scylladb/scylladb#14896

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14897
2023-08-02 09:08:24 +03:00
Piotr Dulikowski
d40bb0bacb test: topology_experimental_raft: cluster feature tests
Although the implementation of cluster features on raft is not complete
yet, it makes sense to add some tests for the existing implementation.
The `test_raft_cluster_features.py` file includes the same set of tests
as the file with non-raft-based cluster feature tests, plus one
additional test which checks that a node will not allow disabling a
feature if it sees that other nodes support it (even though the feature
is not enabled yet).
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
435005b6a5 test: topology: fix a skipped test
The `test_partial_upgrade_can_be_finished_with_removenode` test does not
work because the `cql` variable is used before it is declared. It was
not noticed because the test is marked as skipped, and does not work for
the non-raft cluster feature implementation. The variable declaration is
moved higher and the test now works; it will be used to test the raft
cluster feature implementation.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
0e29abae8e storage_service: add injection to prevent enabling features
Adds the `raft_topology_suppress_enabling_features` error injection
which, while enabled, prevents the topology coordinator from enabling
features.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
b0c57f34d2 storage_service: initialize enabled features from first node
The first node in the cluster defines it and it does not need to consult
with anybody whether its features should be enabled or not. We can
immediately mark those features as enabled in raft when the first node
inserts its join request to the topology table.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
82fc6d9360 topology_state_machine: add size(), is_empty()
The latter method will be used in the next commit.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
232f2b49d2 group0_state_machine: enable features when applying cmds/snapshots
As declared in the previous commit, the group0 state machine now enables
features on command application and snapshot transfer.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
7c309549d6 persistent_feature_enabler: attach to gossip only if not using raft
The enable_features_on_join function is now only called if the node does
not use topology over raft, and so the node will not react to changes in
gossip features.

In the future, support for switching to topology coordinator in runtime
will be added and the persistent feature enabler should disconnect
itself during the upgrade procedure. We don't have such procedure yet,
so a bunch of TODOs is added instead.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
3c1ca12e62 feature_service: enable and check raft cluster features on startup
The enable_features_on_startup method is adjusted for the raft-based
cluster features. In topology coordinator mode:

- Information about enabled features is taken from system.topology
  instead of the usual system.scylla_local (`enabled_features` key).
- Features which, according to the local state, are supported by all
  nodes but not enabled yet are also checked. Support for such features
  cannot be revoked safely because the topology coordinator might have
  performed a successful global barrier and might have proceeded with
  marking the feature as enabled.
2023-08-01 18:54:58 +02:00
Piotr Dulikowski
61a44e0bc0 storage_service: provide raft_topology_change_enabled flag from outside
Information about whether we are using topology changes on raft or not
will be soon necessary for the persistent feature enabler, so that it
can do some additional checks based on the local raft topology state.
2023-08-01 18:54:57 +02:00
Piotr Dulikowski
5a45301ac8 storage_service: enable features in topology coordinator
If the topology coordinator notices that there are no nodes requesting
to be joined, no topology operations in progress and there are some
features that are declared to be supported by all normal nodes but not
enabled yet, the topology coordinator will attempt to enable those
features. This is done in the following way, under a group 0 guard:

- A global `barrier_after_feature_update` is performed to make sure
  that:
  - All nodes have already updated their supported_features column after
    boot and won't attempt to revoke any during current runtime,
  - Saw and persisted the latest topology state so that, after restart,
    the feature check won't allow them to revoke support for features
    that the topology coordinator is going to enable.
- After the barrier succeeds, the coordinator tries to add the features
  to the `enabled_features` column.
2023-08-01 18:54:57 +02:00
Benny Halevy
e7f9700836 gossiper: replicate: ensure non-null permit
Ensure that replicate is called under lock_endpoint
to serialize endpoint state changes on all shards.
Otherwise, we may end up with inconsistent state
across shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
cf7858d960 gossiper: add_saved_endpoint: lock_endpoint
Modifying and replicating the endpoint state
must be done under lock_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
6fdec20b59 gossiper: mark_as_shutdown: lock_endpoint
The function manipulates the endpoint state
and calls replicate and mark_dead, therefore it
must ensure this is done under lock_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
d6fcfdcd65 gossiper: real_mark_alive: lock_endpoint
The function manipulates internal state on shard 0
and calls subscribers async callbacks so we should
lock the endpoint to serialize state changes on it.

With that, get_endpoint_state_for_endpoint_ptr is called after
locking the endpoint in real_mark_alive, not
before calling it, in the background continuation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
6d58be59d1 gossiper: advertise_token_removed: lock_endpoint
The function manipulates the endpoint state
and calls replicate, therefore it
must ensure this is done under lock_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
2ac1796a5c gossiper: do_status_check: lock_endpoint
The function manipulates the endpoint state by
calling remove_endpoint and evict_from_membership
(and possibly yielding in-between), so it should
serialize the state change with lock_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
3293c45682 gossiper: remove_endpoint: lock_endpoint if needed
lock_endpoint to serialize changes to endpoint state
and calling the on_remove notification.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
13124f0db4 gossiper: force_remove_endpoint: lock_endpoint if needed
lock_endpoint to serialize changes to endpoint state
via remove_endpoint and evict_from_membership.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
c7805f303d storage_service: lock_endpoint when removing node
Hold the endpoint lock across advertise_token_removed
and excise.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:42:02 +03:00
Benny Halevy
f74d154fe3 gossiper: use permit_id to serialize state changes while preventing deadlocks
Pass permit_id to subscribers when we acquire one
via lock_endpoint.  The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-01 17:41:57 +03:00
Piotr Dulikowski
082af79111 storage_service: add barrier_after_feature_update
Adds a variant of the existing `barrier` topology command which requires
all participating nodes to confirm that they updated their features
after boot and won't remove any features until restart. A
successful global barrier of this type gives the topology coordinator a
guarantee that it can safely enable features that were supported by all
nodes at the moment of the barrier.
2023-08-01 14:33:20 +02:00
Piotr Dulikowski
af931553b1 topology_coordinator: exec_global_command: make it optional to retake the guard
Currently, exec_global_command takes a group 0 guard, drops it and
retakes it after the command is finished. For current uses it is fine
from a correctness point of view and, given that an operation can
take a long time, shorter duration of the guard improves the odds of the
operation succeeding.

However, this is not sufficient for cluster features because they will
need to execute a global barrier under the group 0 guard.

This commit modifies the interface of `exec_global_command` so that
dropping and retaking the guard is optional (the default is to retake
it).
2023-08-01 14:16:49 +02:00
Piotr Dulikowski
7868d8ec17 topology_state_machine: add calculate_not_yet_enabled_features
Adds a function which calculates a set of features that are supported by
all normal nodes but are not enabled yet - according to the state of the
topology state machine.
2023-08-01 14:16:49 +02:00
Tomasz Grabiec
0239ba4527 Merge 'fencing: handle counter_mutations' from Gusev Petr
In this PR we add proper fencing handling to the `counter_mutation` verb.

As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The latter is important in case the fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if the topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.

In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.

We don't pass the fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture the erm, and if the drain on the coordinator timed out, the erm for replication might be different and we should use the corresponding (maybe the new) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.

Closes #14564

* github.com:scylladb/scylladb:
  counter_mutation: add fencing
  encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
  counter_mutation: add replica::exception_variant to signature
2023-08-01 12:41:22 +02:00
Kamil Braun
8bb3732d66 Merge 'storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal' from Patryk Jędrzejczak
We add the CDC generation optimality check in
`storage_service::raft_check_and_repair_cdc_streams` so that it doesn't
create new generations when unnecessary. Since
`generation_service::check_and_repair_cdc_streams` already has this
check, we extract it to the new `is_cdc_generation_optimal` function to
not duplicate the code.

After this change, multiple tasks could wait for a single generation
change. Calling `signal` on `topology_state_machine.event` wouldn't wake
them all. Moreover, we must ensure the topology coordinator wakes when
its logic expects it. Therefore, we change all `signal` calls on
`topology_state_machine.event` to `broadcast`.

We delay the deletion of the `new_cdc_generation` request to the moment
when the topology transition reaches the `publish_cdc_generation` state.
We need this change to ensure the added CDC generation optimality check
in the next commit has an intended effect. If we didn't make it, it
would be possible that a task makes the `new_cdc_generation` request,
and then, after this request was removed but before committing the new
generation, another task also makes the `new_cdc_generation` request. In
such a scenario, two generations are created, but only one should. After
delaying the deletion of `new_cdc_generation` requests, the second
request would have no effect.

Additionally, we modify the `test_topology_ops.py` test in a way that
verifies the new changes. We call
`storage_service::raft_check_and_repair_cdc_streams` multiple times
concurrently and verify that exactly one generation has been created.

Fixes #14055

Closes #14789

* github.com:scylladb/scylladb:
  storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
  storage_service: delay deletion of the new_cdc_generation request
  raft topology: broadcast on topology_state_machine.event instead of signal
  cdc: implement the is_cdc_generation_optimal function
2023-08-01 12:10:00 +02:00
Kamil Braun
84bb75ea0a Merge 'service: migration_manager: change the prepare_ methods to functions' from Patryk Jędrzejczak
The `migration_manager` service is responsible for schema convergence in
the cluster - pushing schema changes to other nodes and pulling schema
when a version mismatch is observed. However, there is also a part of
`migration_manager` that doesn't really belong there - creating
mutations for schema updates. These are the functions with `prepare_`
prefix. They don't modify any state and don't exchange any messages.
They only need to read the local database.

We take these functions out of `migration_manager` and make them
separate functions to reduce the dependency of other modules (especially
`query_processor` and CQL statements) on `migration_manager`. Since all
of these functions only need access to `storage_proxy` (or even only
`replica::database`), doing such a refactor is not complicated. We just
have to add one parameter, either `storage_proxy` or `database` and both
of them are easily accessible in the places where these functions are
called.

This refactor makes `migration_manager` unneeded in a few functions:
- `alternator::executor::create_keyspace`,
- `cql3::statements::alter_type_statement::prepare_announcement_mutations`,
- `cql3::statements::schema_altering_statement::prepare_schema_mutations`,
- `cql3::query_processor::execute_thrift_schema_command:`,
- `thrift::handler::execute_schema_command`.

We remove the `migration_manager&` parameter from all these functions.

Fixes #14339

Closes #14875

* github.com:scylladb/scylladb:
  cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
  cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
  cql3: alter_type_statement::prepare_announcement_mutations: change parameters
  alternator: executor::create_keyspace: remove an unused parameter
  service: migration_manager: change the prepare_ methods to functions
2023-08-01 11:56:56 +02:00
Patryk Jędrzejczak
233d801f39 cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
query_processor::execute_thrift_schema_command and
thrift::handler::execute_schema_command (that calls
query_processor::execute_thrift_schema_command) has been unused.
2023-08-01 10:07:31 +02:00
Patryk Jędrzejczak
ffc3c1302e cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
schema_altering_statement::prepare_schema_mutations has been
unused by all classes inheriting from schema_altering_statement.
2023-08-01 10:07:31 +02:00
Patryk Jędrzejczak
b6ead8de10 cql3: alter_type_statement::prepare_announcement_mutations: change parameters
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
alter_type_statement::prepare_announcement_mutations has become
unneeded. However, the function needs access to
service::storage_proxy and data_dictionary::database. Passing
storage_proxy& to it is enough.
2023-08-01 10:06:38 +02:00
Patryk Jędrzejczak
928ee9616c alternator: executor::create_keyspace: remove an unused parameter
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of executor::create_keyspace
has been unused.
2023-08-01 09:36:04 +02:00
Asias He
9b3fd9407b repair: Add ranges_parallelism option
This patch adds the ranges_parallelism option to the repair RESTful API.

Users can use this option to specify a number of ranges to repair in
parallel per repair job that is smaller than the default
max_repair_ranges_in_parallel calculated by the Scylla core.

Scylla manager can also use this option to provide more ranges (>N) in
a single repair job while repairing only N (ranges_parallelism) of them
in parallel, instead of providing only N ranges per repair job.

To make it safer, unlike PR #4848, this patch does not allow users to
exceed the max_repair_ranges_in_parallel.

Fixes #4847
2023-08-01 10:58:14 +08:00
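The clamping described in the commit above can be sketched as follows (a minimal illustration; the function name and signature are hypothetical, not Scylla's actual repair code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: honor a user-supplied ranges_parallelism, but never
// let it exceed the core-calculated max_repair_ranges_in_parallel.
std::size_t effective_ranges_parallelism(std::size_t user_ranges_parallelism,
                                         std::size_t max_repair_ranges_in_parallel) {
    return std::min(user_ranges_parallelism, max_repair_ranges_in_parallel);
}
```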
Asias He
1a875ec0f1 repair: Change to use coroutine in do_repair_ranges
The with_semaphore call was changed to use a permit inside the coroutine.
2023-08-01 10:57:35 +08:00
Avi Kivity
3de7cacdf3 Merge 'De-static system_keyspace's [gs]et_scylla_local_param(_as)?' from Pavel Emelyanov
Those without `_as` suffix are just marked non-static
The `..._as` ones are made class methods (now they are local to system_keyspace.cc)
After that the `..._as` ones are patched to use `this->` instead of `qctx`

Closes #14890

* github.com:scylladb/scylladb:
  system_keyspace: Stop using qctx in [gs]et_scylla_local_param_as()
  system_keyspace: Reuse container() and _db member for flushing
  system_keyspace: Make [gs]et_scylla_local_param_as() class methods
  system_keyspace: De-static [gs]et_scylla_local_param()
2023-07-31 21:51:04 +03:00
Botond Dénes
2d26613f28 tools: move operation-options to the operations themselves
Currently, operation-options are declared in a single global list, then
operations refer to the options they support via name. This system was
born at a time, when scylla-sstable had a lot of shared options between
its operations, so it was desirable to declare them centrally and only
add references to individual operations, to reduce duplication.
However, as the dust settled, only 2 options are shared by 2 operations
each. This is a very low benefit. Up to now the cost was also very low
-- shared options meant the same in all operations that used them.
However this is about to change and this system becomes very awkward to
use as soon as multiple operations want to have an option with the same
name, but slightly (or very) different meaning/semantics.
So this patch moves the options to the operations themselves.
Each will declare the list of options it supports, without having to
reference some common list.
This also removes an entire (although very uncommon) class of bugs:
an option name referring to a nonexistent option.

Closes #14898
2023-07-31 20:16:41 +03:00
Benny Halevy
5f2e2a78e6 gossiper: lock_endpoint: add debug messages
Keep the endpoint address and the caller function name
around and print them in the different lock life cycle
state changes.

While at it, coroutinize gossiper::lock_endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 19:29:18 +03:00
Benny Halevy
929d03b370 utils: UUID: make default tagged_uuid ctor constexpr
So it can be used for gms::null_permit_id in the next patch

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 19:29:18 +03:00
Benny Halevy
6401348ada gossiper: lock_endpoint must be called on shard 0
We can't lock an endpoint on arbitrary shards
since collision will not be detected this way.

Assert that, and while at it, make the method private
as it is only used internally by the class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 19:29:18 +03:00
Benny Halevy
dc6e7e47c8 gossiper: replicate: simplify interface
Before making further changes to the endpoint_state_map
implementation, simplify `replicate` by providing only
one variant, replicating complete endpoint_state across
shards, instead of applying finer resolution changes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 19:29:17 +03:00
Mikołaj Grzebieluch
37b548f463 raft: stop group0 server during group0 service shutdown
When a `topology_change` command is applied, the topology state is
reloaded and `cdc::generation_service::handle_cdc_generation` is called.
This creates a dependency of group0 on `cdc::generation_service`.

Currently, the group0 server is stopped during `raft_group_registry`
shutdown. However, it is called after `cdc::generation_service`
shutdown, which can result in a segfault.

To prevent this issue, this commit stops the group0 server and removes it
from `raft_group_registry` during `group0_service` shutdown.

Fixes #14397.

Closes #14779

Reproducer:
97d6946e31

It creates two nodes. The second one is forced to stop after joining
group0. It sleeps before calling handle_cdc_generation and sleeps just
before raft_group_registry is stopped. It ensures that
handle_cdc_generation wakes up after starting the second sleep. If the
cdc_generation_service shutdown waits for raft_group_registry to stop,
handle_cdc_generation will be called without any issue. Otherwise, it
will crash since cdc_generation_service won't exist. The test always
passes; if the crash happens, it can be seen in the log file of the
second node.
2023-07-31 16:17:11 +02:00
Avi Kivity
dac93b2096 Merge 'Concurrent tablet migration and balancing' from Tomasz Grabiec
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.

The migration track is interrupted by pending topology change operations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.

Overload of shards is limited by the fact that load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

Closes #14851

* github.com:scylladb/scylladb:
  tablets: load_balancer: Remove double logging
  tests: tablets: Check that load balancing is interrupted by topology change
  tests: tablets: Add test for load balancing with active migrations
  tablets: Balance tablets concurrently with active migrations
  storage_service, tablets: Extract generate_migration_updates()
  storage_service, tablets: Move get_leaving_replica() to tablets.cc
  locator: tablets: Move std::hash definition earlier
  storage_service: Advance tablets independently
  topology_coordinator: Fix missed notification on abort
  tablets: Add formatter for tablet_migration_info
2023-07-31 16:44:33 +03:00
Pavel Emelyanov
a596186e47 system_keyspace: Stop using qctx in [gs]et_scylla_local_param_as()
Now those methods are non-static and can use their own reference to the
query processor instead of the global qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-31 16:02:21 +03:00
Pavel Emelyanov
ec4040496b system_keyspace: Reuse container() and _db member for flushing
The set_scylla_local_param_as() wants to flush replica::database on all
shards. For that it uses smp::invoke_on_all() and qctx, but since the
method is now a non-static one on system_keyspace, it can use
container().invoke_on_all() and this->_db (on the target shard)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-31 16:02:21 +03:00
Pavel Emelyanov
1ac4b7d2fe system_keyspace: Make [gs]et_scylla_local_param_as() class methods
These are now two .cc-local templatized helpers, but they are only
called by non-static system_keyspace:: methods, so they can be class
methods as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-31 16:02:18 +03:00
Pavel Emelyanov
04b12d24fd system_keyspace: De-static [gs]et_scylla_local_param()
All same-class callers are now non-static methods of system_keyspace,
all external callers do it via an object at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-31 16:02:18 +03:00
Benny Halevy
845b6f901b distributed_loader: process_sstable_dir: do not verify snapshots
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 16:01:46 +03:00
Benny Halevy
60862c63dd utils/directories: verify_owner_and_mode: add recursive flag
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 16:01:43 +03:00
Botond Dénes
4a02865ea1 Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.

Fixes: #13290

Closes #13349

* github.com:scylladb/scylladb:
  replica: make tables_metadata's attributes private
  replica: add methods to get a filtered copy of tables map
  replica: add methods to check if given table exists
  replica: add methods to get table or table id
  replica: api: return table_id instead of const table_id&
  replica: iterate safely over tables related maps
  replica: pass tables_metadata to phased_barrier_top_10_counts
  replica: add methods to safely add and remove table
  replica: wrap column families related maps into tables_metadata
  replica: futurize database::add_column_family and database::remove
2023-07-31 15:31:59 +03:00
Botond Dénes
72043a6335 Merge 'Avoid using qctx in schema_tables' column-mapping queries' from Pavel Emelyanov
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not nice.

Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them access its private `query_processor& _qp` member

Closes #14876

* github.com:scylladb/scylladb:
  schema_tables: Extract query_processor from system_keyspace for querying
  schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
  migration_manager: Add system_keyspace argument to get_schema_mapping()
2023-07-31 15:00:59 +03:00
Botond Dénes
781721218f Merge 'storage_service: refresh_sync_nodes: restrict to normal token owners' from Benny Halevy
It is possible that topology will contain nodes that are no longer normal token owners, so they don't need to be sync'ed with.

Fixes scylladb/scylladb#14793

Closes #14798

* github.com:scylladb/scylladb:
  storage_service: refresh_sync_nodes: restrict to reachable token owners
  storage_service: refresh_sync_nodes: fix log message
  locator: topology: node::state: make fine grained
2023-07-31 14:52:19 +03:00
Avi Kivity
f2c1a214e5 Merge 'Prevent stalls in query_partition_key_range_concurrent' from Benny Halevy
Prevent stalls caused by query_partition_key_range_concurrent
nested calls when it never yields.

Fixes #14008

Closes #14884

* github.com:scylladb/scylladb:
  storage_proxy: query_partition_key_range_concurrent: maybe_yield in loop
  storage_proxy: query_partition_key_range_concurrent: fixup indentation
  storage_proxy: query_partition_key_range_concurrent: turn tail recursion to iteration
  storage_proxy: coroutinize query_partition_key_range
2023-07-31 13:36:53 +03:00
Benny Halevy
1431e2798b storage_service: refresh_sync_nodes: restrict to reachable token owners
It is possible that the topology contains nodes that no longer
own tokens or that are unreachable, so they can't be sync'ed with.

Restrict the list to nodes in a normal or being_decommissioned state.

Fixes scylladb/scylladb#14793

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:49:06 +03:00
Benny Halevy
431bfd6c3a storage_service: refresh_sync_nodes: fix log message
Remove outdated args to log message.
The issue was introduced in ca61d88764

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:33:58 +03:00
Benny Halevy
d903d03bf8 locator: topology: node::state: make fine grained
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:33:48 +03:00
Kefu Chai
47e27dd2d2 test: wait until there is no pending tasks in compaction_manager_basic_test
before this change, after triggering the compaction,
compaction_manager_basic_test waits until the triggered compaction
completes. but the regular compaction is run in a loop which does not
stop until either the daemon is stopping, there are no more sstables
to be compacted, or the compaction is disabled.

we only get the input sstables for compaction after switching to the
"pending" state and acquiring the read lock of the compaction_state.
acquiring the read lock is implemented as a coroutine, so there is a
chance that the coroutine is suspended and execution switches to the
test. in this case, the test will find that even after the triggered
compaction completes, there are still one or more pending compactions.
hence the test fails.

to address this problem, instead of just waiting for the compaction
to complete, we also wait until the number of pending compaction tasks
is 0, so that even if the test manages to sneak into that time window,
it won't proceed to checking the compaction manager's stats prematurely.

Fixes #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14889
2023-07-31 10:29:18 +03:00
Benny Halevy
2902a4136f storage_proxy: query_partition_key_range_concurrent: maybe_yield in loop
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.

Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they mioght stall when called
with a large number of ranges.

Fixes #14008

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:54:34 +03:00
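The fix above follows a general pattern: long per-range loops should periodically give control back to the scheduler. A framework-free sketch of the idea (Scylla's real code uses seastar's maybe_yield(), which only actually yields when the task quota is exhausted; the hook here is a simplified stand-in):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Process n ranges, invoking a yield hook after each one so a cooperative
// scheduler gets a chance to run other tasks. Without the hook, a large
// range count would monopolize the thread (a "reactor stall" in seastar).
void process_ranges(std::size_t n_ranges,
                    const std::function<void(std::size_t)>& process_one,
                    const std::function<void()>& maybe_yield) {
    for (std::size_t i = 0; i < n_ranges; ++i) {
        process_one(i);
        maybe_yield(); // give other tasks a chance to run
    }
}
```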
Kefu Chai
1c525c02a3 tools/utils: use std::shift_left() when appropriate
instead of using a loop of std::swap(), let's use std::shift_left()
when appropriate. simpler and more readable this way.

moreover, the pattern of looking for a command and consume it from
the command line resembles what we have in main(), so let's use
similar logic to handle both of them. probably we can consolidate
them in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14888
2023-07-31 09:46:52 +03:00
Kefu Chai
eab160e947 tools/scylla-sstable: mark const variable with constexpr
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable, but also initialized at
compile-time, and can be used by constexpr expressions and functions.

this change is introduced to reduce the size of the change when moving
to compile-time format string in future. so far, seastar::format() does
not use the compile-time format string, but we have patches pending on
review implementing this. and the author of this change has local
branches implementing the changes on scylla side to support compile-time
format string, which practically replaces most of the `format()` calls
with `seastar::format()`.

to reduce the size of the change and the pain of rebasing, some of the
less controversial changes are extracted and upstreamed. this one is one
of them.

this change also addresses following compilation failure:

```
/home/kefu/dev/scylladb/tools/scylla-sstable.cc:2836:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
 2836 |             .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
      |                                            ^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
 3148 |   FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
      |                                                                   ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14887
2023-07-31 09:44:00 +03:00
Benny Halevy
8d5020b8f6 storage_proxy: query_partition_key_range_concurrent: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:43:33 +03:00
Benny Halevy
3c122a87b5 storage_proxy: query_partition_key_range_concurrent: turn tail recursion to iteration
Update the function state and loop for the next ranges
instead of nesting it oneself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 09:43:33 +03:00
Nadav Har'El
04e5082d52 alternator: limit expression length and recursion depth
DynamoDB limits of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.

But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:

    a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))

or
    (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((

and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.

This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.

Unfortunately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes, so this patch needs to do
more:

The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunately the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other Antlr limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls have a pretty large frame. In some cases, this
overflows the stack.

The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".

With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.

While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too deep
nested expression) was recognized.

Fixes #14473

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14477
2023-07-31 08:57:54 +03:00
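The depth-parameter technique from expressions.g can be illustrated with a tiny recursive-descent parser (MAX_DEPTH and the toy grammar below are illustrative, not Alternator's actual values):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr std::size_t MAX_DEPTH = 100; // illustrative limit, not Alternator's

// Recognizes balanced parentheses like "((()))". The depth parameter is
// threaded through the recursion, and the guard turns what would be a
// stack overflow into an ordinary "nested too deeply" parse failure --
// the same idea as the {depth < MAX_DEPTH}? predicate in expressions.g.
bool parse_nested(std::string_view s, std::size_t& pos, std::size_t depth) {
    if (depth >= MAX_DEPTH) {
        return false; // "expression nested too deeply"
    }
    if (pos < s.size() && s[pos] == '(') {
        ++pos; // consume '('
        if (!parse_nested(s, pos, depth + 1)) {
            return false;
        }
        if (pos >= s.size() || s[pos] != ')') {
            return false; // unbalanced
        }
        ++pos; // consume ')'
    }
    return true;
}

bool parse(std::string_view s) {
    std::size_t pos = 0;
    return parse_nested(s, pos, 0) && pos == s.size();
}
```

Deeply nested but in-limit input parses fine; input deeper than MAX_DEPTH is rejected with bounded stack use instead of crashing.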
Botond Dénes
a637ddd09c Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL  `counterasblob()` and `blobascounter()` functions (issue #14742).

As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.

Fixes #14501
Fixes #14742

Closes #14745

* github.com:scylladb/scylladb:
  test/cql-pytest: test confirming that casting to counter doesn't work
  cql: support casting of counter to other types
  cql: implement missing counterasblob() and blobascounter() functions
  cql: implement missing type functions for "counters" type
2023-07-31 08:55:45 +03:00
Benny Halevy
fd119469d8 storage_proxy: coroutinize query_partition_key_range
Prepare for coroutinizing query_partition_key_range_concurrent.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 08:22:24 +03:00
Tomasz Grabiec
3f221b1f05 tablets: load_balancer: Remove double logging 2023-07-31 01:45:23 +02:00
Tomasz Grabiec
96d06b58df tests: tablets: Check that load balancing is interrupted by topology change
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.

We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that the bootstrap request interrupts
the tablet migration track. If this were not the case, the test
would hang.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
8fdbc42e71 tests: tablets: Add test for load balancing with active migrations 2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fe181b3bac tablets: Balance tablets concurrently with active migrations
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decisions which, when executed with the active migrations, will produce the
desired result.

Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
c9ea215ce1 storage_service, tablets: Extract generate_migration_updates() 2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fbc6076e6a storage_service, tablets: Move get_leaving_replica() to tablets.cc
For better encapsulation of tablet-specific code.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
18a59ab5ff locator: tablets: Move std::hash definition earlier
Will be needed in order to define a struct which has
unordered_set<tablet_replica> as a field.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
889f2ceb1e storage_service: Advance tablets independently
This change makes the topology state machine advance each tablet
independently which allows them to finish migrations at different
speeds, not at the speed of the slowest tablet.

It will also open the possibility of starting new transitions concurrently
with already active ones.

This is implemented by having a single transition state "tablet
migration", and handling it by scanning all the transitions and
advancing tablet state machines. Updates and barriers are batched for
all tablets in each cycle.

One complication is the tracking of streaming sessions. The operations
are no longer nested in the scope of a single handle method, and
cannot be waited on explicitly, as that would inhibit progress of the
coordinator, which starts later migrations. They live as independent
fibers, which associated with tablets in a transient data structure
which lives within the coordinator instance. This data structure is
consulted for a given tablet in each cycle of the
handle_tablet_migration() pump to check if streaming has finished and
we can move the tablet to the next stage. If the pump has no work,
only then it waits for any streaming to finish by blocking on the
_topo_sm.event.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
2811b1df0a topology_coordinator: Fix missed notification on abort
If _as is aborted while the coordinator is in the middle of handling,
and decides to go to sleep, it may go to sleep without noticing that
it was aborted. Fix by checking before blocking on the condition
variable.

In general, every condition which can cause signal() should be checked
before when(). This patch doesn't fix all the cases. For example,
signal() can be called when there arrives a new topology request. This
can happen after the coordinator checked because it releases the guard
before calling when().
2023-07-31 01:45:23 +02:00
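The rule stated above - every condition that can cause signal() should be checked before when() - is the classic lost-wakeup pitfall. A sketch with std::condition_variable (Scylla uses seastar's condition variable, but the pattern is the same); the predicate overload of wait() performs the check-before-block atomically:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool abort_requested = false; // a condition that a signal can set

// Wrong: a bare cv.wait(lk) would sleep forever if the abort was signaled
// before we got here. Right: check the condition before (and after every
// wakeup from) blocking, which the predicate overload does for us.
void coordinator_wait_for_abort() {
    std::unique_lock lk(m);
    cv.wait(lk, [] { return abort_requested; });
}

void request_abort() {
    {
        std::lock_guard lg(m);
        abort_requested = true;
    }
    cv.notify_all();
}
```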
Tomasz Grabiec
e338679266 tablets: Add formatter for tablet_migration_info 2023-07-31 01:45:23 +02:00
Nadav Har'El
b55b8f29b9 test/cql-pytest: test confirming that casting to counter doesn't work
In the previous patch we implemented CAST operations from the COUNTER
type to various other types. We did not implement the reverse cast,
from different types to the counter type. Should we? In this patch
we add a test that shows we don't need to bother - Cassandra does not
support such casts, so it's fine that we don't too - and indeed the
test shows we don't support them.
It's not a useful operation anyway.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Nadav Har'El
b513bba201 cql: support casting of counter to other types
We were missing support in the "CAST(x AS type)" function for the counter
type. This patch adds this support, as well as extensive testing that it
works in Scylla the same as Cassandra.

We also un-xfail an existing test translated from Cassandra's unit
test. But note that this old test did not cover all the edge-cases that
the new test checks - some missing cases in the implementation were
not caught by the old test.

Fixes #14501

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Nadav Har'El
c1762750ed cql: implement missing counterasblob() and blobascounter() functions
Code in functions.cc creates the different TYPEasblob() and blobasTYPE()
functions for all type names TYPE. The functions for the "counter" type
were skipped, supposedly because "counters are not supported yet". But
counters are supported, so let's add the missing functions.

The code fix is trivial, the tests that verify that the result behaves
like Cassandra took more work.

After this patch, unimplemented::cause::COUNTERS is no longer used
anywhere in the code. I wanted to remove it, but noticed that
unimplemented::cause is a graveyard of unused causes, so decided not
to remove this one either. We should clean it up in a separate patch.

Fixes #14742

Also includes tests for tangentially related issues:
Refs #12607
Refs #14319

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Nadav Har'El
d9c2cd3024 cql: implement missing type functions for "counters" type
types.cc had eight of its functions unimplemented for the "counters"
types, throwing an "unimplemented::cause::COUNTERS" when used.
A ninth function (validate) was unimplemented for counters but did not
even throw.
Many code paths did not use any of these functions so didn't care, but
some do - e.g., the silly do-nothing "SELECT CAST(c AS counter)" when
c is already a counter column, which causes this operation to fail.

When the types.cc code encounters a counter value, it is (if I understand
it correctly) already a single uint64_t ("long_type") value, so we fall
back to the long_type implementation of all the functions. To avoid mistakes,
I simply copied the reversed_type implementation for all these functions -
whereas the reversed_type implementation falls back to using the underlying
type, the counter_type implementation always falls back to long_type.

After this patch, "SELECT CAST(c AS counter)" for a counter column works.
We'll introduce a test that verifies this (and other things) in a later
patch in this series.

The following patches will also need more of these functions to be
implemented correctly (e.g., blobascounter() fails to validate the size
of the input blob if the validate function isn't implemented for the
counter type).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-07-30 20:16:25 +03:00
Avi Kivity
accd6271bc Merge 'tools: introduce tool_app_template and migrate all tools to it' from Botond Dénes
The scaffolding required to have a working scylla tool app is considerable, leading to a large amount of boilerplate code in each such app. This logic is also very similar across the two tool apps we have and would presumably be very similar in any future app. This PR extracts this logic into `tools/utils.hh` and introduces `tool_app_template`, which is similar to `seastar::app_template` in that it centralizes all the option handling and more in a single class, which each tool has to just instantiate and then call `run()` on to run the app.
This cuts down on the repetition and boilerplate in our current tool apps and makes prototyping new tool apps much easier.

Closes #14855

* github.com:scylladb/scylladb:
  tools/utils.hh: remove unused headers
  tools/utils: make get_selected_operation() and configure_tool_mode() private
  tools/utils.hh: de-template get_selected_operation()
  tools/scylla-types: migrate to tools_app_template
  tools/scylla-types: prepare for migration to tool_app_template
  tools/scylla-sstable.cc: fix indentation
  tools/scylla-sstables: migrate to tool_app_template
  tools/scylla-sstables: prepare for migration to tool_app_template
  tools: extract tool app skeleton to utils.hh
2023-07-30 18:31:10 +03:00
Pavel Emelyanov
b8d1c7fc0b sstables-format-selector: Add and use system_keyspace dependency
The selector keeps selected format in system.local and uses static
db::system_keyspace::(get|set)_scylla_local_param() helpers to access
it. The helpers are turning into non-static so the selector should call
those on system_keyspace object, not class

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14871
2023-07-30 18:12:16 +03:00
Benny Halevy
d9aee0929c gossiper: mark_as_shutdown: make private
It is used only internally in gossiper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-30 12:00:12 +03:00
Benny Halevy
b324bf38ea gossiper: convict: make private
It is used only internally in gossiper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-30 12:00:12 +03:00
Benny Halevy
a6a66edc84 gossiper: mark_as_shutdown: do not call convict
convict doesn't do anything useful in this case
since we're already in mark_as_shutdown and
convict is called after mark_dead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-30 12:00:12 +03:00
Avi Kivity
1c3d22b717 build: update frozen toolchain to Fedora 38
This refreshes clang to 16.0.6 and libstdc++ to 13.1.1.

compiler-rt, libasan, and libubsan are added to install-dependencies.sh
since they are no longer pulled in as dependencies.

Closes #13730
2023-07-30 03:08:48 +03:00
Avi Kivity
14dee7a946 Revert "build: build with -O0 if Clang >= 16 is used"
This reverts commit fb05fddd7d. After
1554b5cb61 ("Update seastar submodule"),
which fixed a coroutine bug in Seastar, it is no longer necessary.

Also revert the related "build: drop the warning on -O0 might fail tests"
(894039d444).
2023-07-29 08:07:04 +03:00
Avi Kivity
1554b5cb61 Update seastar submodule
* seastar c0e618bbb...0784da876 (11):
  > Revert "metrics: Remove registered_metric::operator()"
  > build: use new behavior defined by CMP0127
  > build: pass -DBOOST_NO_CXX98_FUNCTION_BASE to C++ compiler
  > coroutine: fix a use-after-free in maybe_yield
Ref #13730.
  > Merge 'sstring: add more accessors' from Kefu Chai
  > Merge 'semaphore: semaphore_units: return units when reassigned' from Benny Halevy
  > metrics: do not define defaulted copy assignment operator
  > HTTP headers in http_response are now case insensitive
  > rpc: Make server._proto a reference
  > Merge 'Cleanup class metrics::registered_metrics' from Pavel Emelyanov
  > core: undefine fallthrough to fix compilation error

Closes #14862
2023-07-28 23:45:30 +03:00
Tomasz Grabiec
4e9d95d78c Merge 'Compact data before streaming' from Botond Dénes
Currently, streaming and repair process and send data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference is in how these choose the current compaction time to use:
* Load'n'stream and streaming use the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to all repair followers. This is to ensure all repair participants work with the exact same state of compactness.

 Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).

Fixes: https://github.com/scylladb/scylladb/issues/3561

Closes #14756

* github.com:scylladb/scylladb:
  replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
  repair/row_level: opt in to compacting the stream
  streaming: opt-in to compacting the stream
  sstables_loader: opt-in for compacting the stream
  replica/table: add optional compacting to make_multishard_streaming_reader()
  replica/table: add optional compacting to make_streaming_reader()
  db/config: add config item for enabling compaction for streaming and repair
  repair: log the error which caused the repair to fail
  readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
  mutation/mutation_compactor: allow user to abandon current partition
2023-07-28 16:42:13 +02:00
Pavel Emelyanov
24fdd4297b schema_tables: Use query_processor argument in save_system_schema()
... instead of the global qctx. The now-used qctx->execute_cql() just calls
query_processor::execute_internal with cache_internal::yes

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14874
2023-07-28 15:55:16 +02:00
Pavel Emelyanov
ab6dbe654f schema_tables: Extract query_processor from system_keyspace for querying
The schema_tables() column-mapping code runs queries over a system table,
but it needs LOCAL_ONE CL and to cherry-pick on caching, so regular
system_keyspace::execute_cql() won't work here.

However, since schema_tables is somewhat part of system_keyspace, it's
natural to let the former fetch private query_processor& from the latter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 16:02:14 +03:00
Pavel Emelyanov
cf4d4d7e9b schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
The callers all have local sys_ks argument:

- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()

And a test that can get it from cql_test_env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 15:55:13 +03:00
Pavel Emelyanov
c9530eae4e migration_manager: Add system_keyspace argument to get_schema_mapping()
It will need one to pass to db::schema_tables code. The caller is paxos
code with sys_ks local variable at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 15:54:19 +03:00
Kefu Chai
cc2bbde8f1 test: use BOOST_CHECK_EQUAL when appropriate in compaction_manager_basic_test
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after triggering
the compaction and waiting for its completion. but in #14865, there are
still active compaction(s) after the compaction_manager's stats shows there
is at least one task completed.

to understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have a better understanding of the status when the test
fails.

Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14872
2023-07-28 15:45:07 +03:00
Botond Dénes
1eca60fe10 tools/utils.hh: remove unused headers 2023-07-28 08:41:34 -04:00
Botond Dénes
cbcb20f0f9 tools/utils: make get_selected_operation() and configure_tool_mode() private
Their only user is in tools/utils.cc, so move them there, into an
anonymous namespace.
2023-07-28 08:41:34 -04:00
Botond Dénes
fc0c87002c tools/utils.hh: de-template get_selected_operation()
It now has a single user, so it doesn't have to be a template.
For now, make the method inline, so it can stay in the header. It will
be moved to utils.cc in the next patch.
2023-07-28 08:41:16 -04:00
Botond Dénes
8caf258539 tools/scylla-types: migrate to tools_app_template
Discard the locally coded app skeleton and reuse the tool app template
instead. Reduces boilerplate greatly.
2023-07-28 08:30:53 -04:00
Botond Dénes
68a452be00 tools/scylla-types: prepare for migration to tool_app_template
Make options more declarative and create a local reference to
app.configuration() in the main lambda. To facilitate further patching.
2023-07-28 08:30:53 -04:00
Botond Dénes
7598c23359 tools/scylla-sstable.cc: fix indentation
Broken in the previous patch.
2023-07-28 08:30:53 -04:00
Botond Dénes
d082622ab9 tools/scylla-sstables: migrate to tool_app_template
Removing a great amount of boilerplate, streamlining the main method.
2023-07-28 08:30:53 -04:00
Botond Dénes
092650b20b tools/scylla-sstables: prepare for migration to tool_app_template
Make options more declarative. To facilitate further patching.
2023-07-28 08:30:53 -04:00
Botond Dénes
89d7d80fce tools: extract tool app skeleton to utils.hh
The skeleton of the two existing scylla-native tools (scylla-types and
scylla-sstable) is very similar. By skeleton, I mean all the boilerplate
around creating and configuring a seastar::app_template, representing
operations/command and their options, and presenting and selecting
these.
To facilitate code-sharing and quick development of any new tools,
extract this skeleton from scylla-sstable.cc into tools/utils.hh,
in the form of a new tool_app_template, which wraps a
seastar::app_template and centralizes all the boilerplate logic in a
single place. The extracted code is not a simple copy-paste, although
many elements are simply copied. The original code is not removed yet.
2023-07-28 08:30:53 -04:00
Patryk Jędrzejczak
3468cbd66b service: migration_manager: change the prepare_ methods to functions
The migration_manager service is responsible for schema convergence
in the cluster - pushing schema changes to other nodes and pulling
schema when a version mismatch is observed. However, there is also
a part of migration_manager that doesn't really belong there -
creating mutations for schema updates. These are the functions with
prepare_ prefix. They don't modify any state and don't exchange any
messages. They only need to read the local database.

We take these functions out of migration_manager and make them
separate functions to reduce the dependency of other modules
(especially query_processor and CQL statements) on
migration_manager. Since all of these functions only need access
to storage_proxy (or even only replica::database), doing such a
refactor is not complicated. We just have to add one parameter,
either storage_proxy or database and both of them are easily
accessible in the places where these functions are called.
2023-07-28 13:55:27 +02:00
Botond Dénes
3a51053e66 Merge 'De-static system_keyspace::*_group0_* methods' from Pavel Emelyanov
These are users of global `qctx` variable or call `(get|set)_scylla_local_param(_as)?` which, in turn, also reference the `qctx`. Unfortunately, the latter(s) are still in use by other code and cannot be marked non-static in this PR

Closes #14869

* github.com:scylladb/scylladb:
  system_keyspace: De-static set_raft_group0_id()
  system_keyspace: De-static get_raft_group0_id()
  system_keyspace: De-static get_last_group0_state_id()
  system_keyspace: De-static group0_history_contains()
  raft: Add system_keyspace argument to raft_group0::join_group0()
2023-07-28 14:53:22 +03:00
Kefu Chai
df041c7dc8 build: cmake: add missing source file
TLS certificate authenticator registers itself using a
`class_registrator`. that's why CMake is able to build without
compiling this source file. but for the sake of completeness, and
to be in sync with configure.py, let's add it to CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14866
2023-07-28 14:30:58 +03:00
Pavel Emelyanov
d311784721 system_keyspace: De-static set_raft_group0_id()
The caller is group0 code with sys_ks local variable

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:13:59 +03:00
Pavel Emelyanov
7837bc7d5a system_keyspace: De-static get_raft_group0_id()
The callers are in group0 code that have sys_ks local variable/argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:13:11 +03:00
Pavel Emelyanov
26dd7985a8 system_keyspace: De-static get_last_group0_state_id()
The caller is raft_group0_client with sys.ks. dependency reference and
group0_state_machine with raft_group0_client exposing its sys.ks.

This makes it possible to instantly drop one more qctx reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:12:04 +03:00
Pavel Emelyanov
3de0efd32c system_keyspace: De-static group0_history_contains()
The caller is raft_group0_client with sys.ks. dependency reference.
This allows dropping one qctx reference right away

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:11:08 +03:00
Pavel Emelyanov
0dbe83ce89 raft: Add system_keyspace argument to raft_group0::join_group0()
The method will need one to access db::system_keyspace methods. The
sys.ks. is at hand and in use in both callers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:10:24 +03:00
Patryk Jędrzejczak
3f29c98394 storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
We add the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams so that it
doesn't create new generations when unnecessary.

Additionally, we modify the test_topology_ops.py test in a way
that verifies the new changes. We call
storage_service::raft_check_and_repair_cdc_streams multiple
times concurrently and verify that exactly one generation has been
created.
2023-07-28 11:04:30 +02:00
Patryk Jędrzejczak
b11f42951b storage_service: delay deletion of the new_cdc_generation request
We delay the deletion of the new_cdc_generation request to the
moment when the topology transition reaches the
publish_cdc_generation state. We need this change to ensure
adding the CDC generation optimality check in the next commit
has the intended effect. If we didn't make it, it would be possible
that a task makes the new_cdc_generation request, and then, after
this request was removed but before committing the new generation,
another task also makes the new_cdc_generation request. In such
a scenario, two generations are created, but only one should be.
After the change introduced by this commit, the second request
would have no effect.
2023-07-28 11:04:30 +02:00
Patryk Jędrzejczak
c416c9ff33 raft topology: broadcast on topology_state_machine.event instead of signal
After adding the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams in the
following commits, multiple tasks will be waiting for a single
generation change. Calling signal on topology_state_machine.event
won't wake them all. Moreover, we must ensure the topology
coordinator wakes when its logic expects it. Therefore, we change
all signal calls on topology_state_machine.event to broadcast.
2023-07-28 11:04:30 +02:00
Patryk Jędrzejczak
b05b4a352a cdc: implement the is_cdc_generation_optimal function
In the following commits, we add the CDC generation optimality
check to storage_service::raft_check_and_repair_cdc_streams so
that it doesn't create new CDC generations when unnecessary. Since
generation_service::check_and_repair_cdc_streams already has
this check, we extract it to the new is_cdc_generation_optimal
function to not duplicate the code.
2023-07-28 11:04:17 +02:00
Aleksandra Martyniuk
bfa3a7325a test: extend test_compaction_task.py to cover compaction group tasks 2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
139e147ae1 compaction: turn custom_task_executor into compaction_task_impl
custom_task_executor inherits both from compaction_task_executor
and compaction_task_impl.
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
1853a5a355 compaction: turn sstables_task_executor into sstables_compaction_task_impl
sstables_task_executor inherits both from compaction_task_executor
and sstables_compaction_task_impl.

Delete unused perform_task_on_all_files version.
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
1decf86d71 compaction: change sstables compaction tasks type 2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
59b838688b compaction: move table_upgrade_sstables_compaction_task_impl
Move table_upgrade_sstables_compaction_task_impl so that the related
classes are placed next to each other.
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
71db8645d5 compaction: pass task_info through sstables compaction 2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
4e439ac957 compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
offstrategy_compaction_task_executor inherits both from compaction_task_executor
and offstrategy_compaction_task_impl.
2023-07-28 10:51:55 +02:00
Aleksandra Martyniuk
92f2987217 compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.

Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
2023-07-28 10:48:58 +02:00
Avi Kivity
d73a393670 main: increase Seastar reactor task quota in debug mode
Debug mode is so slow that the work:poll ratio decreases, leading
to even more slowness as more polling is done for the same amount
of work.

Increase the task quota to recover some performance.

Ref #14752.

Closes #14820
2023-07-28 10:34:18 +03:00
Aleksandra Martyniuk
8317e4dd7f compaction: use optional task info in major compaction
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.

Thanks to that we can distinguish between a task that has no parent
and the task which won't be registered in task manager.
2023-07-28 09:25:21 +02:00
Aleksandra Martyniuk
ef8512f65a compaction: use perform_compaction in compaction_manager::perform_major_compaction 2023-07-28 09:25:21 +02:00
Avi Kivity
cf81eef370 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables
2023-07-28 00:01:33 +03:00
Anna Stuchlik
8ee6f6ecb6 doc: add the requirement to upgrade drivers
This commit adds a requirement to upgrade ScyllaDB
drivers before upgrading ScyllaDB.

The requirement to upgrade the Monitoring Stack
has been moved to the new section so that
both prerequisites are documented together.

NOTE: The information is added to the 5.2-to-5.3
upgrade guide because all future upgrade guides
will be based on this one (as it's the latest one).

If 5.3 is released, this commit should be backported
to branch-5.3.

Refs https://github.com/scylladb/scylladb/issues/13958

Closes #14771
2023-07-27 15:21:38 +02:00
Patryk Jędrzejczak
b81a6037f1 test: pylib: ensure ScyllaCluster.add_server does not start a second cluster
If the cluster isn't empty and all servers are stopped, calling
ScyllaCluster.add_server can start a new cluster. That's because
ScyllaCluster._seeds uses the running servers to calculate the
seed node list, so if all nodes are down, the new node would
select only itself as a seed, starting a new cluster.

As a single ScyllaCluster should describe a single cluster, we
make ScyllaCluster.add_server fail when called on a non-empty
cluster with all its nodes stopped.

Closes #14804
2023-07-27 13:27:23 +02:00
Botond Dénes
7351c8424d mutation/mutation_rebuilder: add comment about validity of returned mutation reference
Closes #14853
2023-07-27 12:13:46 +03:00
Alexey Novikov
ff721ec3e3 make timestamp string format cassandra compatible
when we convert a timestamp into a string it must look like: '2017-12-27T11:57:42.500Z'
this concerns any conversion except the JSON timestamp format
a JSON string has a space as the time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and the timezone specification

Fixes #14518
Fixes #7997

Closes #14726
2023-07-27 12:01:09 +03:00
Botond Dénes
b599f15b26 replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
opt-outs by mistake.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
2023-07-27 04:57:52 -04:00
Botond Dénes
fdaf908967 repair/row_level: opt in to compacting the stream
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being in some undefined state of compaction.
2023-07-27 04:57:50 -04:00
Botond Dénes
5452fd1ce4 streaming: opt-in to compacting the stream
Use locally generated compaction time on each node. This could lead to
different nodes making different decisions on what is expired or not.
But this is already the case for streaming, as what exactly is expired
depends on when compaction last ran.
2023-07-27 03:22:11 -04:00
Botond Dénes
5a73c3374e sstables_loader: opt-in for compacting the stream
No point in loading expired/covered data.
2023-07-27 03:22:11 -04:00
Botond Dénes
2f8d77e97b replica/table: add optional compacting to make_multishard_streaming_reader()
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.

Call sites are updated, but none opt in just yet.
2023-07-27 03:22:11 -04:00
Botond Dénes
42b0dd5558 replica/table: add optional compacting to make_streaming_reader()
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!

All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
2023-07-27 03:22:11 -04:00
Botond Dénes
9e3987fc96 db/config: add config item for enabling compaction for streaming and repair
Compacting can greatly reduce the amount of data to be processed by
streaming and repair, but with certain data shapes, its effectiveness
can be reduced and its CPU overhead might outweigh the benefits. This
should very rarely be the case, but leave an off switch in case
this becomes a problem in a deployment.
Not wired yet.
2023-07-27 03:22:11 -04:00
Botond Dénes
a22446afe0 repair: log the error which caused the repair to fail
Instead of just a boolean _failed flag, persist the error message of the
exception which caused the repair to fail, and include it in the log
message announcing the failure.
2023-07-27 03:22:11 -04:00
Botond Dénes
ac44efea11 readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
When next_partition() or fast_forward_to() is called. Instead of trying
to simulate a properly closed partition by injecting synthetic mutation
fragments to properly close it.
2023-07-27 02:50:44 -04:00
Botond Dénes
326c3b92e5 mutation/mutation_compactor: allow user to abandon current partition
Currently, the compactor requires a valid stream and thus abandoning a
partition in the middle was not possible. This causes some complications
for the compacting reader, which implements methods such as
`next_partition()` which is possibly called in the middle of a
partition. In this case the compacting reader attempts to close the
partition properly by inserting a synthetic partition-end fragment into
the stream. This is not enough however as it doesn't close any range
tombstone changes that might be active. Instead of piling on more
complexity, add an API to the compactor which allows abandoning the
current partition.
2023-07-27 02:50:44 -04:00
Kefu Chai
1b7bde2e9e compaction_manager: use range in compacting_sstable_registration
simpler than the "begin, end" iterator pair. and also tighten the
type constraints, now require the value type to be
sstables::shared_sstable. this matches what we are expecting in
the implementation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14678
2023-07-27 09:40:20 +03:00
Pavel Emelyanov
e9218e6873 system_keyspace: Don't update schema version in .setup()
The db.get_version() called that early returns the value that the database
got at construction time, i.e. the empty_version placeholder. It makes
little sense committing it into the system keyspace, all the more so since
the "real" version is calculated and updated a few steps after .setup().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14833
2023-07-27 09:38:57 +03:00
Pavel Emelyanov
c017117340 system_keyspace: Remove qctx usage from load_topology_state()
Fortunately, this is pretty simple -- the only caller is storage_service
that has a sharded<system_keyspace> dependency reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14824
2023-07-27 08:56:40 +03:00
Raphael S. Carvalho
050ce9ef1d cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.

cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will be linked into the LRU so it doesn't sit in
memory indefinitely. This also allows the cached_file dtor to
clear the whole cache if some of those pages brought in advance
aren't fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818
2023-07-27 00:01:46 +02:00
Anna Stuchlik
3ed6754afc doc: update info about cassandra superuser
Fixes https://github.com/scylladb/scylla-docs/issues/4028

The goal of this update is to discourage the use of
the default cassandra superuser in favor of a custom
super user - and explain why it's a good practice.

The scope of this commit:

- Adding a new page on creating a custom superuser.
  The page collects and clarifies the information
  about the cassandra superuser from other pages.
- Remove the (incomplete) information about
  superuser from the Authorization and Authentication
  pages, and add the link to the new page instead.

Additionally, this update will result in better
searchability and ensures language clarity.

Closes #14829
2023-07-26 23:15:31 +03:00
Avi Kivity
615544a09a Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov
This is to make m.s. initialization more solid and simplify sys.ks.::setup()

Closes #14832

* github.com:scylladb/scylladb:
  system_keyspace: Remove unused snitch arg from setup()
  messaging_service: Setup preferred IPs from config
2023-07-26 22:12:28 +03:00
Nadav Har'El
59c1498338 test/alternator: don't forget to delete tables on test failures
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.

We missed a few such cases in test_gsi.py, and this patch corrects them.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.

There are no functional changes in this patch - most of the lines
changed are just reindents.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14835
2023-07-26 21:51:22 +03:00
Benny Halevy
1e7e2eeaee gossiper: mark_alive: use deferred_action to unmark pending
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.

Fixes #14839

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14840
2023-07-26 21:24:56 +03:00
Nadav Har'El
056d04954c Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of them can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting for the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()
2023-07-26 20:04:28 +03:00
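The accounting fix can be sketched as follows (a simplified model, not the actual `view_updating_consumer` code; the sizes and flush threshold are made up):

```python
class ViewUpdatingConsumer:
    def __init__(self, buffer_size_soft_limit, flush):
        self._buffer = []
        self._buffer_size = 0
        self._limit = buffer_size_soft_limit
        self._flush = flush

    def consume_new_partition(self, mutation_base_size):
        self._buffer.append([])
        # The fix: account for the freshly created mutation object, so
        # even an empty partition moves _buffer_size forward.
        self._buffer_size += mutation_base_size

    def consume_row(self, row_size):
        self._buffer[-1].append(row_size)
        self._buffer_size += row_size

    def consume_end_of_partition(self):
        if self._buffer_size >= self._limit:
            self._flush(self._buffer)
            self._buffer = []
            self._buffer_size = 0

flushes = []
c = ViewUpdatingConsumer(buffer_size_soft_limit=1024,
                         flush=lambda buf: flushes.append(len(buf)))
# Before the fix, empty partitions would accumulate here without bound.
for _ in range(100):
    c.consume_new_partition(mutation_base_size=64)
    c.consume_end_of_partition()
assert len(flushes) > 0  # empty partitions now trigger flushes too
```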
Pavel Emelyanov
6b82071064 system_keyspace: Remove unused snitch arg from setup()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-26 16:05:26 +03:00
Pavel Emelyanov
0fba57a3e8 messaging_service: Setup preferred IPs from config
Population of the messaging service preferred IPs cache happens inside
the system keyspace setup() call, and it needs the messaging service per
se and additionally the snitch. Moving the preferred IP cache to initial
configuration keeps the messaging service start more self-contained and
keeps system_keyspace::setup() simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-26 16:03:23 +03:00
Nadav Har'El
d2ca600eec test/*/run: kill Scylla with SIGTERM
Today, test/*/run always kills Scylla at the end of the test with
SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was
believed that a clean shutdown would take a long time, but in fact,
it turns out that 99% of the shutdown time was a silly sleep in the
gossip code, which this patch disables with the "--shutdown-announce-in-ms"
option.

After enabling this option, clean shutdown takes (in a dev build on
my laptop) just 0.02 seconds. It's worth noting that this shutdown
has no real work to do - no tables to flush, and so on, because the
pytest framework removes all the tables in its own fixture cleanup
phase.

So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL.
We then wait until a timeout of 10 seconds (much much more than 0.02
seconds!) for Scylla to exit. If for some reason it didn't exit (e.g.,
it hung during the shutdown), it is killed again with SIGKILL, which
is guaranteed to succeed.

This change gives us two advantages:

1. Every test run with test/*/run exercises the shutdown path. It is perhaps
   excessive, but since the shutdown is so quick, there is no big downside.

2. In a test-coverage run, a clean shutdown allows flushing the counter
   files, which wasn't possible when Scylla was killed with kill -9.

Fixes #8543

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14825
2023-07-26 14:06:24 +03:00
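The terminate-then-wait-then-kill sequence can be sketched like this (a generic Python sketch of the run-script logic, not the scripts' actual code):

```python
import subprocess

def stop_process(proc, timeout=10):
    """SIGTERM first so the clean shutdown path runs; SIGKILL only if
    the process doesn't exit within the timeout."""
    proc.terminate()                      # SIGTERM (15)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL (9), guaranteed to work
        proc.wait()
    return proc.returncode

proc = subprocess.Popen(["sleep", "60"])
rc = stop_process(proc)
assert rc == -15  # child exited on SIGTERM, never needed SIGKILL
```

On POSIX, `Popen.returncode` is the negated signal number when the child was killed by a signal, hence the `-15` check.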
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by means of a load balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using the existing `range_streamer`, which is the infrastructure for "the old streaming". This will later be replaced by sstable transfer once integration of tablets with compaction groups is finished. Cleanup is not wired up yet either, also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstrapping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
d0f725c1b9 test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
2023-07-26 03:09:53 -04:00
Botond Dénes
d66b07823b db/view/view_updating_consumer: account for the size of mutations
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumed. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
2023-07-26 03:07:25 -04:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(); its call chain, in turn, starts from:

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get a system_keyspace reference from. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Benny Halevy
90b2e6515c gossiper: mark_alive: enter background_msg gate
The function dispatches a background operation that must be
waited on in stop().

Fixes scylladb/scylladb#14791

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14797
2023-07-26 00:51:22 +02:00
Tomasz Grabiec
ae8ffe23fc tests: test_tablets: Add test for bootstrapping a node 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f0b9dcee04 storage_service: topology_coordinator: Implement tablet migration state machine
See the documentation in topology-over-raft.md for description of the mechanism.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
5c681a1d63 tablets: Introduce tablet_mutation_builder 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
6f4a35f9ae service: tablet_allocator: Introduce tablet load balancer
Will be invoked by the topology coordinator later to decide
which tablets to migrate.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
d59b8d316c tablets: Introduce tablet_map::for_each_tablet() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
0e3eac29d0 topology: Introduce get_node() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f2fdf37415 token_metadata: Add non-const getter of tablet_metadata
Needed for tests.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
1885f94474 storage_service: Notify topology state machine after applying schema change
Table construction may allocate tablets which may need rebalancing.
Notify topology change coordinator to invoke the load balancer.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
6d545b2f9e storage_service: Implement stream_tablet RPC
Performs streaming of data for a single tablet between two tablet
replicas. The node which gets the RPC is the receiving replica.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
e3a8bb7ec9 tablets: Introduce global_tablet_id
Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share global_tablet_id.

Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f88220aeee stream_transfer_task, multishard_writer: Work with table sharder
So that we can use it on tablet-based tables.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
c2b18ae483 db: Do not create per-keyspace erm for tablet-based tables
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.

It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
91dee5c872 tablets: effective_replication_map: Take transition stage into account when computing replicas 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00
Tomasz Grabiec
05519bd5e5 doc: Document tablet migration state machine and load balancer 2023-07-25 21:08:02 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid: the first node inserts
its tokens as normal without going through the bootstrap procedure.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
b642e69eb3 storage_service: topology_coordinator: Sleep on failure between retries
Avoid failing in a tight loop. Can happen if some node is down, for example.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
f0e9dbf911 storage_service: topology_coordinator: Simplify coordinator loop
This refactoring removes a boolean and branching which makes it easier
to reason about the flow, and easier to extend it with more steps.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
b294932cf1 main: Require experimental raft to enable tablets
Tablets depend on the topology changes on raft feature.

Drop "tablets" from suite.yaml of the topology/ suite, which doesn't
use tablets anymore.
2023-07-25 21:08:01 +02:00
Aleksandra Martyniuk
6e6ba7309e replica: make tables_metadata's attributes private
Make _column_families and _ks_cf_to_uuid private to prevent unsafe
access. The maps can be accessed only through methods which use locks
if preemption is possible.
2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
c5cad803b3 replica: add methods to get a filtered copy of tables map 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
ff26b2ba3f replica: add methods to check if given table exists 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
6796721c3d replica: add methods to get table or table id 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
e072a2341d replica: api: return table_id instead of const table_id&
Return table_id instead of const table_id& from database::find_uuid
as copying table_id does not cause much overhead and simplifies
methods signature.
2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by the reader mode of an rwlock so that iterators won't
be invalidated.
2023-07-25 17:13:04 +02:00
Botond Dénes
fda4168300 mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
To allow const access to the mutation under construction, e.g. so the
user can query its size.
2023-07-25 10:34:31 -04:00
Botond Dénes
e6fa21d1b3 mutation/mutation: add memory_usage() 2023-07-25 10:34:30 -04:00
Aleksandra Martyniuk
a21d3357c3 replica: pass tables_metadata to phased_barrier_top_10_counts 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
8842bd87c3 replica: add methods to safely add and remove table 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
52afd9d42d replica: wrap column families related maps into tables_metadata
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
395ce87eff replica: futurize database::add_column_family and database::remove
As a preparation for further changes, database::add_column_family
and database::remove return future<>.
2023-07-25 16:13:00 +02:00
Pavel Emelyanov
c46c57d535 messaging_service: Clear list of clients on shutdown
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, means calling client.stop() on each.

Setting _shutting_down is used in two places.

First, when a client is stopped it may happen that it's in the middle of
some operation, which may result in a call to remove_error_rpc_client().
To avoid calling .stop() a second time, that call just does nothing if the
shutdown flag is set (see 357c91a076).

Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.

However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients in it are stopped. This is not
very debugging-friendly; the clients are better removed on shutdown.

fixes: #14624

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14632
2023-07-25 13:08:20 +03:00
Botond Dénes
ed025890e5 scripts/coverage.py: --run: swallow KeyboardInterrupt
It is quite common to stop a tested scylla process with ^C, which will
raise KeyboardInterrupt from subprocess.run(). Catch and swallow this
exception, allowing the post-processing to continue.
The interrupted process has to handle the interrupt correctly too --
flush the coverage data even on premature exit -- but this is for
another patch.

Closes #14815
2023-07-25 12:29:22 +03:00
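The pattern can be sketched as follows (a generic sketch; the `runner` parameter is a hypothetical injection point for testing, not part of coverage.py):

```python
import subprocess

def run_tested_binary(cmd, runner=subprocess.run):
    """Run the tested process, but let post-processing continue even if
    the user stops it with ^C."""
    try:
        runner(cmd)
    except KeyboardInterrupt:
        # subprocess.run() raises this on ^C; the child already got the
        # SIGINT, so swallow it and fall through to post-processing.
        pass
    return "post-processing continues"

def interrupted(cmd):
    raise KeyboardInterrupt

assert run_tested_binary(["scylla", "--version"], runner=interrupted) \
    == "post-processing continues"
```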
Kefu Chai
2943d3c1b0 tools/scylla-sstable: s/foo.find(bar) != foo.end()/foo.count(bar) != 0/
just for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14816
2023-07-25 11:38:44 +03:00
Petr Gusev
116444a01b counter_mutation: add fencing
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The latter
is important in case the fence was moved while
we were handling the request - some post-fence
actions might have already happened by that
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.

In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.

We don't pass fencing token for replication explicitly in
replicate_counter_from_leader since
mutate_counter_on_leader_and_replicate doesn't capture erm
and if the drain on the coordinator timed out the erm for
replication might be different and we should use the
corresponding (maybe the new one) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
2023-07-25 12:10:03 +04:00
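The before-and-after fence check described above can be sketched as follows (a simplified model with made-up names, not Scylla's actual types):

```python
class StaleTopologyError(Exception):
    pass

class Topology:
    def __init__(self, version):
        self.version = version

def check_fence(fencing_token, topology):
    if fencing_token < topology.version:
        raise StaleTopologyError(f"fence {fencing_token} < {topology.version}")

def handle_counter_mutation(fencing_token, topology, apply_mutations):
    # Check before applying: reject requests that are already fenced out.
    check_fence(fencing_token, topology)
    apply_mutations()
    # Check again after applying: if the fence moved while we worked,
    # post-fence actions (e.g. streaming) may have missed this update,
    # so the request must not be reported as successful.
    check_fence(fencing_token, topology)

topo = Topology(version=7)

def bump_during_apply():
    topo.version = 8  # topology changed while the mutation was applied

try:
    handle_counter_mutation(7, topo, bump_during_apply)
    succeeded = True
except StaleTopologyError:
    succeeded = False
assert not succeeded  # the post-apply check catches the moved fence
```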
Petr Gusev
edbb5cbb5f encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
We will need it in later commit to return
exceptions from handle_counter_mutation.

We also add utils::Tuple concept restriction
for add_replica_exception_to_query_result
since its type parameters are always tuples.
2023-07-25 12:09:21 +04:00
Petr Gusev
f2cbdc7f18 counter_mutation: add replica::exception_variant to signature
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshall exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.

The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
2023-07-25 12:09:19 +04:00
Raphael S. Carvalho
0ac43ea877 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was fed into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813
2023-07-25 10:38:20 +03:00
Botond Dénes
3eec990e4e Merge 'test: use different table names in simple_backlog_controller_test ' from Kefu Chai
in this series, we use different table names in simple_backlog_controller_test. this test exercises sstables compaction strategies, and it creates and keeps multiple tables in a single test session. but we are going to add metrics on a per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same label, the test will fail when the metrics are being added.

to address this problem, in this change

1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names
2. use the new ctor in simple_backlog_controller_test

Fixes #14767

Closes #14783

* github.com:scylladb/scylladb:
  test: use different table names in simple_backlog_controller_test
  test/lib/simple_schema: add ctor for customizing ks.cf
  test/lib/simple_schema: do not hardwire ks.cf
2023-07-25 10:26:33 +03:00
Anna Stuchlik
f6732865b9 doc: move unified installer from web to docs
This commit adds the information on how to install ScyllaDB
without root privileges (with "unified installer", but we've
decided to drop that name - see the page title).

The content taken from the website
https://www.scylladb.com/download/?platform=tar&version=scylla-5.2#open-source
is divided into two sections: "Download and Install" and
"Configure and Run ScyllaDB".
In addition, the "Next Steps" section is also copied from
the website, and adjusted to be in sync with other installation
pages in the docs.

Refs https://github.com/scylladb/scylla-docs/issues/4091

Closes #14781
2023-07-25 10:23:02 +03:00
Benny Halevy
a07440173f storage_service: node_ops_ctl: send_to_all: fix "Node is down for" log message args order
The node and op_desc args are reversed.

Fixes #14807

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14808
2023-07-24 21:13:06 +03:00
Petr Gusev
5fb8da4181 hints: add fencing
In this commit we just pass a fencing_token
through hint_mutation RPC verb.

The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.

Closes #14580
2023-07-24 18:12:48 +02:00
Tomasz Grabiec
5b30931406 Merge 'raft topology: restore gossiper eps' from Gusev Petr
We don't load gossiper endpoint states in `storage_service::join_cluster` if `_raft_topology_change_enabled`, but gossiper is still needed even in case of `_raft_topology_change_enabled` mode, since it still contains part of the cluster state. To work correctly, the gossiper needs to know the current endpoints. We cannot rely on seeds alone, since it is not guaranteed that seeds will be up to date and reachable at the time of restart.

The problem was demonstrated by the test `test_joining_old_node_fails`, it fails occasionally with `experimental_features: [consistent-topology-changes]` on the line where it waits for `TEST_ONLY_FEATURE` to become enabled on all nodes. This doesn't happen since `SUPPORTED_FEATURES` gossiper state is not disseminated, and feature_service still relies on gossiper to disseminate information around the cluster.

The series also contains a fix for a problem in `gossiper::do_send_ack2_msg`, see commit message for details.

Fixes #14675

Closes #14775

* github.com:scylladb/scylladb:
  storage_service: restore gossiper endpoints on topology_state_load fix
  gossiper: do_send_ack2_msg fix
2023-07-24 13:55:50 +02:00
Botond Dénes
a8feb7428d Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check into `querier_cache` and performs it in all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve the problem of mismatched semaphores caused by changes in service levels/scheduling groups, but only mitigates it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one
2023-07-24 14:13:09 +03:00
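The lookup behaviour can be sketched as follows (a simplified model; names are illustrative, not the real `querier_cache` API):

```python
class QuerierCache:
    def __init__(self, is_user_semaphore):
        self._cache = {}          # key -> (querier, semaphore it was created on)
        self._is_user = is_user_semaphore
        self.sg_mismatches = 0    # querier_cache_scheduling_group_mismatches

    def insert(self, key, querier, semaphore):
        self._cache[key] = (querier, semaphore)

    def lookup(self, key, current_semaphore):
        entry = self._cache.pop(key, None)
        if entry is None:
            return None
        querier, cached_semaphore = entry
        if cached_semaphore is current_semaphore:
            return querier
        if self._is_user(cached_semaphore) and self._is_user(current_semaphore):
            # User scheduling group changed mid-query: count it and drop
            # the cached reader instead of throwing an error.
            self.sg_mismatches += 1
            return None
        raise RuntimeError("semaphore mismatch between internal semaphores")

user_semaphores = {"user_sem_a", "user_sem_b"}
cache = QuerierCache(lambda sem: sem in user_semaphores)
cache.insert("q1", "reader", "user_sem_a")
assert cache.lookup("q1", "user_sem_b") is None  # dropped, no exception
assert cache.sg_mismatches == 1
```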
Petr Gusev
75694aa080 storage_service: restore gossiper endpoints on topology_state_load fix
We don't load gossiper endpoint states in
storage_service::join_cluster if
_raft_topology_change_enabled, but gossiper
is still needed even in case of
_raft_topology_change_enabled mode, since it
still contains part of the cluster state.
To work correctly, the gossiper needs to know
the current endpoints. We cannot rely on seeds alone,
since it is not guaranteed that seeds will be
up to date and reachable at the time of restart.

The specific scenario of the problem: cluster with
three nodes, the second has the first in seeds,
the third has the first and second. We restart all
the nodes simultaneously, the third node uses its
seeds as _endpoints_to_talk_with in the first gossiper round
and sends SYN to the first and sedond. The first node
hasn't started its gossiper yet, so handle_syn_msg
returns immediately after if (!this->is_enabled());
The third node receives ack from the second node and
no communication from the first node, so it fills
its _live_endpoints collection with the second node
and will never communicate with the first node again.

The problem was demonstrated by the test
test_joining_old_node_fails, it fails occasionally with
experimental_features: [consistent-topology-changes]
on the line where it waits for TEST_ONLY_FEATURE
to become enabled on all nodes. This doesn't happen
since SUPPORTED_FEATURES gossiper state is not
disseminated because of the problem described above.

The first commit is needed since add_saved_endpoint
adds the endpoint with some default app states with locally
incrementing versions, and without that fix the gossiper
refuses to fill the real app states for this endpoint later.

Fixes: #14675
2023-07-24 12:36:39 +04:00
Kamil Braun
e6099c4685 Merge 'config: set schema_commitlog_segment_size_in_mb to 128 ' from Patryk Jędrzejczak
Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.

Additionally, we do not derive the commitlog sync period for the schema commitlog anymore because the schema commitlog runs in batch mode, so it doesn't need this parameter. This has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable
2023-07-24 10:23:34 +02:00
Petr Gusev
87cd7e8741 gossiper: do_send_ack2_msg fix
This commit is a first part of the fix for #14675.
The issue is about the test test_joining_old_node_fails
failing occasionally with
experimental_features: [consistent-topology-changes].
The next commit contains a fix for it, here we
solve the pre-existing gossiper problem
which we stumble upon after the fix.

Local generation for addr may have been
increased since the current node sent
an initial SYN. Comparing versions across different
generations in get_state_for_version_bigger_than
could result in losing some app states with
smaller versions.

More specifically, consider a cluster with nodes
.1, .2, .3, .3 has .1 and .2 as seeds, .2 has .1
as a seed. Suppose .2 receives a SYN from .3 before
its gossiper starts, and it has a
version 0.24 for .1 in endpoint_states.

The digest from .3 contains 0.25 as a version for .1,
so examine_gossiper produces .1->0.24 as a digest
and this digest is send to .3 as part of the ack.
Before processing this ack, .3 processed an ack from
.1 (scylla sends SYN to many nodes) and updates
its endpoint_states according to it, so now it
has .1->100500.32 for .1. Then
we get to do_send_ack2_msg and call
get_state_for_version_bigger_than(.1, 24).
This returns properties which has version > 24,
ignoring a lot of them with smaller versions
which has been received from .1. Also,
get_state_for_version_bigger_than updates
generation (it copies get_heart_beat_state from
.3), so when we apply the ack in handle_ack2_msg
at .2 we update the generation and now the
skipped app states will only be updated on .2
if somebody changes them and increments their version.

Cassandra behaviour is the same in this case
(see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/GossipDigestAckVerbHandler.java#L86). This is probably less
of a problem for them since most of the time  they
send only one SYN in one gossiper round
(save for unreachable nodes), so there is less
room for conflicts.
2023-07-24 11:52:56 +04:00
Kefu Chai
3ad844a4bb build: cmake: set scylla version strings as CACHED strings
before this change, add_version_library() is a single function
which accomplishes two tasks:

1. build the scylla-version target
2. add an object library

but this has two problems:

1. we should run `SCYLLA-VERSION-GEN` at configure time, instead
   of at build time. otherwise the targets which read from the
   SCYLLA-{VERSION, RELEASE, PRODUCT}-FILE cannot access them,
   unless they are able to read them in their build rules. but
   they always use `file(STRINGS ..)` to read them, and this
   `file()` command is executed at configure time. so, this
   is a dead end.
2. we repeat the `file(STRINGS ..)` call in multiple places. this is
   not ideal if we want to minimize repetition.

so, to address this problem, in this change:

1. use `execute_process()` instead of `add_custom_command()`
   for generating these *-FILE files. so they are always ready
   at build time. this partially reverts bb7d99ad37.
2. extract `generate_scylla_version()` out of `add_version_library()`.
   so we can call the former much earlier than the latter.
   this would allow us to reference the variables defined by
   the `generate_scylla_version()` much earlier.
3. define cached strings in the extracted function, so that
   they can be consumed by other places.
4. reference the cached variables in `build_submodule.cmake`.

also, take this opportunity to fix the version string
used in build_submodule.cmake: we should have used
`scylla_version_tilde`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14769
2023-07-24 08:57:19 +03:00
Michał Jadwiszczak
246728cbbb querier_cache: add stats of scheduling group mismatches
Add stats to count queriers dropped because of scheduling group
mismatch.
2023-07-21 19:05:55 +02:00
Michał Jadwiszczak
a5fc53aa11 querier_cache: check semaphore mismatch during querier lookup
Previously semaphore mismatch was checked only in multi-partition
queries and if happened, an internal error was thrown.

This commit pushes the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.

What's more, if semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning and drop cached reader instead of
throwing an error.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
2023-07-21 19:05:50 +02:00
Michał Jadwiszczak
e5c965b280 querier_cache: add reference to replica::database::is_user_semaphore() 2023-07-21 18:58:57 +02:00
Jan Ciolek
decbc841b7 cql3/prepare_expr: fix partially preparing function arguments
Before choosing a function, we prepare the arguments that can be
prepared without a receiver. Preparing an argument makes
its type known, which allows choosing the best overload
among many possible functions.

The function that prepared the argument passes the unprepared
argument by mistake. Let's fix it so that it actually uses
the prepared argument.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #14786
2023-07-21 18:59:56 +03:00
Jan Ciolek
cbc97b41d4 cql.g: make the parser reject INSERT JSON without a JSON value
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```

When no JSON value is specified, the query should be rejected.

Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706); it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.

A unit test is added to prevent regressions.

Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #14785
2023-07-21 18:52:47 +03:00
Kefu Chai
d78c6d5f50 test: use different table names in simple_backlog_controller_test
in `simple_backlog_controller_test`, we need to have multiple tables
at the same time. but the default constructor of `simple_schema` always
creates a schema with the table name "ks.cf". we are going to have
per-table metrics, and the new metric group will use the table name
in its counter labels, so we need to either disable the per-table
metrics or use a different table name for each table.

as in the real world we don't have multiple tables with the same name
at the same time, it is better to stop reusing the same table name in a
single test session. so, in this change, we use a random cf_name for
each created table.

Fixes #14767
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:08:29 +08:00
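The approach in the commit above can be sketched in plain C++ (the helper name `random_cf_name` and the hex-suffix scheme are illustrative assumptions, not Scylla's actual test code):

```cpp
#include <cassert>
#include <random>
#include <string>
#include <unordered_set>

// Sketch: instead of hardwiring "ks.cf", derive a fresh cf name per
// created table, so per-table metrics never collide on the name.
std::string random_cf_name() {
    static std::mt19937 rng{std::random_device{}()};
    static std::uniform_int_distribution<int> dist(0, 15);
    std::string name = "cf_";
    for (int i = 0; i < 8; ++i) {
        name += "0123456789abcdef"[dist(rng)];
    }
    return name;
}
```

With an 8-hex-digit suffix there are 16^8 possible names, so collisions within one test session are vanishingly unlikely.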
Kefu Chai
1f596e4669 test/lib/simple_schema: add ctor for customizing ks.cf
some low-level tests, like the ones exercising sstables, create
multiple tables. and we are going to add per-table metrics, where
the new metrics use the ks.cf as part of their unique id. so,
once the per-table metrics are enabled, the sstable tests would fail,
as the metrics subsystem does not allow registering multiple
metric groups with the same name.

so, in this change, we add a new constructor for `simple_schema`,
so that we can customize the schema's ks and cf when creating
the `simple_schema`. in the next commit, we will use this new
constructor in an sstable test which creates multiple tables.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:07:45 +08:00
Kefu Chai
306439d3aa test/lib/simple_schema: do not hardwire ks.cf
instead, query the names of ks and cf from the schema. this change
prepares us for a simple_schema whose ks and cf can be customized
by its constructor.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:07:45 +08:00
Mikołaj Grzebieluch
37ceef23a6 test: raft: skip test_old_ip_notification_repro in debug mode
Closes #14777
2023-07-21 12:41:03 +02:00
Pavel Emelyanov
db1c6e2255 system_keyspace: Make save_truncation_record() non-static
... and stop using qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:12:50 +03:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The argument goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Pavel Emelyanov
1ef34a5ada db: Add sharded<system_keyspace>& to legacy_schema_migrator
One of the class's methods calls db::drop_table_on_all_shards(), which
will need sys.ks in the next patch.

The reference in question is provided from the only caller -- main.cc

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 12:38:46 +03:00
Kefu Chai
a87b0d68cd s3/test: remove the tempdir if test succeeds
in 46616712, we tried to keep the tmpdir only if the test failed,
and keep up to 1 of them using the recently introduced
option of `tmp_path_retention_count`. but it turns out this option
is not supported by the pytest used by our jenkins nodes, where we
have pytest 6.2.5. this is the one shipped along with fedora 36.

so, in this change, the tempdir is removed if the test completes
without failures. as the tempdir contains huge number of files,
and jenkins is quite slow scanning them. after nuking the tempdir,
jenkins will be much faster when scanning for the artifacts.

Fixes #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14772
2023-07-21 12:21:51 +03:00
Nadav Har'El
5860820934 Merge 'mutation/mutation_compactor: validate the input stream' from Botond Dénes
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to
the validator for the most part.

This mistake was introduced by https://github.com/scylladb/scylladb/commit/e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.

This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.

The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.

Closes #13817

* github.com:scylladb/scylladb:
  test/mutation_test: test_compactor_validator_sanity_test
  mutation/mutation_compactor: fix indentation
  mutation/mutation_compactor: validate the input stream
  mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
  readers: reader-from-fragment: don't modify stream when created without range
2023-07-21 00:26:46 +03:00
Avi Kivity
e00811caac cql3: grammar: reject intValue with no contents
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).

Easily fixed by removing the empty alternative. A unit test is
added.

Fixes #14705.

Closes #14707
2023-07-21 00:24:51 +03:00
Pavel Emelyanov
98609e2115 Merge 's3/test: close using deferred_close() or deferred()' from Kefu Chai
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.

Closes #14765

* github.com:scylladb/scylladb:
  s3/test: use seastar::deferred() to perform cleanup
  s3/test: close using deferred_close()
2023-07-20 20:05:34 +03:00
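The RAII idea behind `deferred_close()`/`deferred()` can be sketched with a minimal standard-C++ scope guard (this `deferred` type is a generic stand-in for illustration, not seastar's actual API):

```cpp
#include <cassert>
#include <utility>

// Minimal scope guard: runs the cleanup action when the scope is left,
// whether normally or via an exception.
template <typename Func>
class deferred {
    Func _cleanup;
public:
    explicit deferred(Func f) : _cleanup(std::move(f)) {}
    ~deferred() { _cleanup(); }
    deferred(const deferred&) = delete;
    deferred& operator=(const deferred&) = delete;
};

// Usage sketch: the cleanup fires even though the body throws,
// which is exactly why the tests switched to this pattern.
inline bool cleanup_runs_on_throw() {
    bool cleaned = false;
    try {
        deferred guard([&] { cleaned = true; });
        throw 42; // simulate a failing test body
    } catch (int) {
    }
    return cleaned;
}
```

In seastar, closing a reader or file is itself asynchronous, so the real `deferred_close()` awaits the close in a continuation rather than a plain destructor; the ownership idea is the same.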
Botond Dénes
bf6186ed7e Update tools/java submodule
* tools/java 9f63a96f...585b30fd (1):
  > cassandra-stress: add support for using RackAwareRoundRobinPolicy
2023-07-20 18:13:32 +03:00
Botond Dénes
819b45d107 Merge 'Remove dead replacing_nodes_pending_ranges_updater manipulations' from Pavel Emelyanov
The set in question is read-and-delete-only and thus always empty. Originally it was removed by commit c9993f020d (storage_service: get rid of handle_state_replacing), but some dangling ends were left. Consequently, the on_alive() callback can get rid of a few dead if-else branches.

Closes #14762

* github.com:scylladb/scylladb:
  storage_service: Relax on_alive()
  storage_service: Remove _replacing_nodes_pending_ranges_updater
2023-07-20 16:55:44 +03:00
Pavel Emelyanov
9df750fd4c storage_service: Remove dead get_rpc_address()
Unused. Locator calls gossiper directly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14761
2023-07-20 16:54:24 +03:00
Botond Dénes
53da97416a Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.

Closes #14758

* github.com:scylladb/scylladb:
  db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
  db/system_keyspace: Make paxos methods non-static
  service/paxos: Add db::system_keyspace& argument to some methods
  test: Optionally initialize proxy remote for cql_test_env
  proxy/remote: Keep sharded<db::system_keyspace>& dependency
2023-07-20 16:53:25 +03:00
Botond Dénes
e62325babc Merge 'Compaction reshard task' from Aleksandra Martyniuk
Task manager tasks covering reshard compaction.

Reattempt on https://github.com/scylladb/scylladb/pull/14044. Bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.

Closes #14739

* github.com:scylladb/scylladb:
  test: add test for resharding with non-empty owned_ranges_ptr
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  compaction: add reshard_sstables_compaction_task_impl
  compaction: create resharding_compaction_task_impl
2023-07-20 16:43:22 +03:00
Botond Dénes
a35f4f6985 test/mutation_test: test_compactor_validator_sanity_test
Greatly expand this test to check that the compactor validates the input
stream properly.
The test is renamed (the _sanity_test suffix is removed) to reflect the
expanded scope.
2023-07-20 08:48:50 -04:00
Botond Dénes
18ed94e60b mutation/mutation_compactor: fix indentation
Left broken by the previous patch.
2023-07-20 08:48:50 -04:00
Botond Dénes
3d5b70e0d7 mutation/mutation_compactor: validate the input stream
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
2023-07-20 08:48:50 -04:00
Botond Dénes
dbb2a6f03a mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator 2023-07-20 08:48:50 -04:00
Botond Dénes
93dd16fccc readers: reader-from-fragment: don't modify stream when created without range
The fragment reader currently unconditionally forwards its buffer to the
passed-in partition range. Even if this range is
`query::full_partition_range`, this will involve dropping any fragments
up to the first partition start. This causes problems for test users who
intentionally create invalid fragment streams that don't start with a
partition-start.
Refactor the reader to not do any modifications on the stream, when
neither slice, nor partition-range was passed by the user.
2023-07-20 08:48:50 -04:00
Kefu Chai
fdf61d2f7c compaction_manager: prevent gc-only sstables from being compacted
before this change, there are chances that the temporary sstables
created for collecting the GC-able data produced by a certain
compaction can be picked up by another compaction job. this
wastes CPU cycles, adds write amplification, and causes
inefficiency.

in general, these GC-only SSTables are created with the same run id
as the non-GC SSTables, but when a new sstable exhausts its input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. the
GC-only SSTables are thus added to the new main set along with
the non-GC SSTables, and since the former have a good chance of
overlapping the latter, these GC-only SSTables are assigned
different run ids. but we fail to register them with the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up while the compaction
which created them is not yet completed.

so, in this change,

* to prevent sstables in the transient stage from being picked
  up by regular compactions, a new interface class is introduced
  so that the sstable is always added to the registration before
  it is added to the sstable set, and removed from the registration
  after it is removed from the sstable set. the struct helps to
  consolidate the registration-related logic in a single place, and
  helps to make it more obvious that the timespan of an sstable in
  the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
  overlap with the output sstable run. the run_id for the
  gc sstable run is created only when the gc sstable writer
  is created, because gc sstables are not created for all
  compactions.

please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` pass a non-empty
`std::function` to this function, so there is no need to check for
emptiness before calling it. so, in this change, the check is dropped.

Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14725
2023-07-20 15:47:48 +03:00
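The registration invariant described in the first bullet above can be sketched roughly like this (all type names here are illustrative placeholders, not Scylla's actual classes):

```cpp
#include <cassert>
#include <set>

// Placeholder types standing in for the real sstable and
// compaction_manager.
struct sstable { int run_id; };

struct compaction_manager {
    std::set<const sstable*> registered;
    bool is_registered(const sstable* s) const {
        return registered.count(s) != 0;
    }
};

// Sketch of the interface class idea: registration begins before the
// sstable can enter any sstable set, and ends only after it has left,
// so the registration's lifetime always covers set membership and
// in-flight (e.g. GC-only) sstables cannot be picked up by another
// compaction.
class registered_sstable {
    compaction_manager& _cm;
    const sstable* _sst;
public:
    registered_sstable(compaction_manager& cm, const sstable* s)
            : _cm(cm), _sst(s) {
        _cm.registered.insert(_sst);  // register first...
    }
    ~registered_sstable() {
        _cm.registered.erase(_sst);   // ...deregister last
    }
};
```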
Asias He
865891cf02 doc: Repair system_auth with nodetool repair -pr option
Since repair is performed on all nodes, each node can just repair the
primary ranges instead of all owned ranges. This avoids repairing
ranges more than once.

Closes #14766
2023-07-20 15:12:20 +03:00
Anna Stuchlik
6c70aef2d1 doc: document customizing CPUSET
Fixes https://github.com/scylladb/scylla-docs/issues/4004

This commit adds a Knowledge Base article on how to customize
CPUSET.

Closes #13941
2023-07-20 15:07:32 +03:00
Michał Jadwiszczak
d7a3aa2698 replica:database: add method to determine if semaphore is user one
Add a method comparing a semaphore with the system ones (streaming,
compaction, system read) to be able to tell if the semaphore belongs to
a user.
2023-07-20 10:24:21 +02:00
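A minimal sketch of what such a check could look like (the `semaphore` and `database` types here are bare placeholders for illustration; replica::database's actual members differ):

```cpp
#include <cassert>

// Placeholder standing in for the real reader concurrency semaphore.
struct semaphore {};

// Hypothetical database fragment: a semaphore is a "user" one iff it
// is none of the well-known system semaphores (streaming, compaction,
// system reads), compared by identity.
struct database {
    semaphore streaming_sem, compaction_sem, system_read_sem, user_sem;

    bool is_user_semaphore(const semaphore& s) const {
        return &s != &streaming_sem
            && &s != &compaction_sem
            && &s != &system_read_sem;
    }
};
```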
Raphael S. Carvalho
3117f2f066 tests: Add test for table's mutation source excluding staging
Commit f5e3b8df6d introduced an optimization for
as_mutation_source_excluding_staging() and added a test that
verifies correctness of single key and range reads based
on supplied predicates. This new test aims to improve the
coverage by testing directly both table::as_mutation_source()
and as_mutation_source_excluding_staging(), therefore
guaranteeing that both supply the correct predicate to
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14763
2023-07-20 07:14:36 +03:00
Kefu Chai
77faec4f38 s3/test: use seastar::deferred() to perform cleanup
let's use RAII to remove the object used as a fixture, so we don't
leave objects behind in the bucket. a leftover object might interfere
with other tests which share the same minio server, if the test
fails to do its cleanup because an exception is thrown.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-20 10:04:54 +08:00
Kefu Chai
7a9c802fc3 s3/test: close using deferred_close()
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-20 10:04:54 +08:00
Pavel Emelyanov
7a3d61ce2c storage_service: Relax on_alive()
Now that the local variable is always false, it can also be dropped, and
all the associated if-else branches can be simplified.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 22:02:12 +03:00
Pavel Emelyanov
61a37cf6bf storage_service: Remove _replacing_nodes_pending_ranges_updater
The set in question is always empty, so it can be removed and the only
check for its contents can be constified

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 22:01:06 +03:00
Pavel Emelyanov
ea9db1b35c Merge 'cql3: expr: remove the default constructor' from Avi Kivity
`expression`'s default constructor is dangerous, as it can leak
into computations and generate surprising results. Fix that by
removing the default constructor.

This is made somewhat difficult by the parser generator's reliance
on default construction, and we need to expand our workaround
(`uninitialized<>`) capabilities to do so.

We also remove some incidental uses of default-constructed expressions.

Closes #14706

* github.com:scylladb/scylladb:
  cql3: expr: make expression non-default-constructible
  cql3: grammar: don't default-construct expressions
  cql3: grammar: improve uninitialized<> flexibility
  cql3: grammar: adjust uninitialized<> wrapper
  test: expr_test: don't invoke expression's default constructor
  cql3: statement_restrictions: explicitly initialize expressions in index match code
  cql3: statement_restrictions: explicitly intitialize some expression fields
  cql3: statement_restrictions: avoid expression's default constructor when classifying restrictions
  cql3: expr: prepare_expression: avoid default-constructed expression
  cql3: broadcast_tables: prepare new_value without relying on expression default constructor
2023-07-19 21:46:03 +03:00
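The `uninitialized<>` workaround mentioned above can be sketched on top of `std::optional` (a hedged approximation of the idea; Scylla's real wrapper differs in detail):

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Sketch of an uninitialized<T> wrapper: default-constructible (as the
// parser generator requires for rule-local variables) even when T
// itself is not, while making access-before-assignment an explicit,
// checkable error instead of a silently default-constructed value.
template <typename T>
class uninitialized {
    std::optional<T> _value;
public:
    uninitialized() = default;
    uninitialized& operator=(T v) {
        _value = std::move(v);
        return *this;
    }
    bool has_value() const { return _value.has_value(); }
    T& get() {
        assert(_value.has_value());
        return *_value;
    }
};

// A type with no default constructor, standing in for cql3 expressions.
struct expression {
    int node;
    explicit expression(int n) : node(n) {}
};
```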
Pavel Emelyanov
8a87c87824 db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
This template call is only used by system keyspace paxos methods. All
those methods are no longer static and can use the system_keyspace::_qp
reference to the real query processor instead of the global qctx. The
execute_cql_with_timeout() wrapper is moved to system_keyspace to make
this work.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Pavel Emelyanov
b9ef16c06f db/system_keyspace: Make paxos methods non-static
The service::paxos_state methods that call those already have system
keyspace reference at hand and can call method on an object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Pavel Emelyanov
d9ba8eb8df service/paxos: Add db::system_keyspace& argument to some methods
The paxos_state's .prepare(), .accept(), .learn() and .prune() methods
access system keyspace via its static methods. The only caller of those
(storage_proxy::remote) already has the sharded system k.s. reference
and can pass its .local() one as argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Pavel Emelyanov
b4fc1076e3 test: Optionally initialize proxy remote for cql_test_env
Some test cases that use cql_test_env involve paxos state updates. Since
this update now goes via proxy->remote->system_keyspace, those test
cases need cql_test_env to initialize the remote part of the proxy too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Aleksandra Martyniuk
bfb81b8cdd test: add test for resharding with non-empty owned_ranges_ptr 2023-07-19 17:19:10 +02:00
Aleksandra Martyniuk
4fc4c2527c test: extend test_compaction_task.py to test resharding compaction 2023-07-19 17:19:10 +02:00
Aleksandra Martyniuk
77dcdd743e compaction: add shard_reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction on one shard.
2023-07-19 17:19:10 +02:00
Aleksandra Martyniuk
f73178a114 compaction: invoke resharding on sharded database
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In the lambda
passed to that method, we use both the sharded sstable_directory
service and its local instance.

To make it straightforward that the sharded and local instances are
dependent, we call sharded<replica::database>::invoke_on_all
instead and access the local directory through the sharded one.
2023-07-19 17:19:10 +02:00
Aleksandra Martyniuk
fa10c352a1 compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run() 2023-07-19 17:19:10 +02:00
Aleksandra Martyniuk
7a7e287d8c compaction: add reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction.

A struct and some functions are moved from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
2023-07-19 17:15:40 +02:00
Pavel Emelyanov
b0b91bf5ec proxy/remote: Keep sharded<db::system_keyspace>& dependency
This dependency will be needed for the service::paxos_state:: calls, and
all of them are done in storage_proxy::remote() methods only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 17:36:42 +03:00
Avi Kivity
fc71f49907 Update seastar submodule
* seastar bac344d58...c0e618bbb (7):
  > resource: take kernel min_free_kbytes into account when allocating memory
Fixes #14721
  > build: append JOB_POOLS definition instead of setting it
  > net: use designated initialization when appropriate
  > websocket: print logging message before handle_ping()
  > circular_buffer_fixed_capacity_test: enable randomize test to reverse
  > prometheus: do not qualify return type with const.
  > alien: do not define customized copy ctor

Closes #14755
2023-07-19 16:47:05 +03:00
Patryk Jędrzejczak
ee1c240f2a replica: do not derive the commitlog sync period for schema commitlog
We don't want to apply the value of the commitlog_sync_period_in_ms
variable to schema commitlog. Schema commitlog runs in batch mode,
so it doesn't need this parameter.
2023-07-19 14:16:50 +02:00
Patryk Jędrzejczak
b3be9617dc config: set schema_commitlog_segment_size_in_mb to 128
We increase the default schema commitlog segment size so that the
large mutations do not fail. We have agreed that 128 MB is sufficient.
2023-07-19 14:16:49 +02:00
Patryk Jędrzejczak
5b167a4ad7 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
2023-07-19 14:16:41 +02:00
Kefu Chai
8f390997cb db: do not use std::cmp_not_equal() when appropriate
this change is a follow-up of 3129ae3c8c.
since in both cases in this change, the `num_ranges` should always
be greater than zero, there is no need to use `int` for its type,
and the "num_ranges" returned by the CQL query should always be greater
than or equal to zero, so there is no need to check if it is positive.

in this change, we

* change the type of `num_ranges` to `size_t`
* change std::cmp_not_equal() to !=

to avoid using the verbose `std::cmp_not_equal()` helper, for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14754
2023-07-19 13:25:21 +03:00
Mikołaj Grzebieluch
00db47292b test: raft: do not update raft address map with obsolete gossip data
Regression test for #14257.

It starts two nodes. It introduces a sleep in raft_group_registry::on_alive
(in raft_group_registry.cc) when receiving a gossip notification about HOST_ID
update from the second node. Then it restarts the second node with a different IP.
Due to the sleep, the old notification from the old IP arrives after the second
node has restarted. If the bug is present, this notification overrides the address
map entry and the second read barrier times out, since the first node cannot reach
the second node with the old IP.

Closes #14609.

Closes #14728
2023-07-19 11:57:49 +02:00
Botond Dénes
8916aa311e Merge 'build: cmake: build: cmake: build submodules ' from Kefu Chai
this series enables CMake to build submodules. it helps developers to build, for instance, the java tools on demand.

Closes #14751

* github.com:scylladb/scylladb:
  build: cmake: build submodules
  build: cmake: generate version files with add_custom_command()
2023-07-19 12:04:29 +03:00
Kefu Chai
665135553d build: cmake: remove nonexistent test
the "type_json_test" test was added locally and has not landed
on master. but it somehow slipped into 87170bf07a by accident.

so, let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14749
2023-07-19 11:58:34 +03:00
Pavel Emelyanov
312184c0c7 keys: Move exploded_clustering_prefix's operator<< to keys.cc
Now it sits in replica/database.cc, but the latter is overloaded with
code and is worth slimming down, all the more so since the ..._prefix
itself lives in the keys.hh header.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14748
2023-07-19 11:57:27 +03:00
Pavel Emelyanov
5162028c71 storage_service: Remove do_stop_ms()
The helper was left over from the storage-service shutdown-vs-drain
rework (long ago); now it just occupies space in the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14747
2023-07-19 11:56:27 +03:00
Avi Kivity
460b28d067 Merge 'Introduce SELECT MUTATION FRAGMENTS statement' from Botond Dénes
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underlying mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Consequently, the output of this statement has a schema that is different from that of the underlying table. The output schema is derived from the table's schema, as follows:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
    - mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
    - partition_region (int): represents the enum with the same name.
    - the copy of the table's clustering columns
    - position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
    - metadata (text): the JSON representation of the mutation-fragment's metadata.
    - value (text): the JSON representation of the mutation-fragment's value.

Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is forbidden.

More details in the documentation commit (last commit).

Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));

cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;

 pk | ck | v
----+----+---
  1 |  0 | 0
  0 |  0 | 0
  0 |  1 | 0
  0 |  2 | 0

(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);

 pk | mutation_source | partition_region | ck | position_weight | metadata                                                                                                                 | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
  1 |      memtable:0 |                0 |    |                 |                                                                                                         {"tombstone":{}} |        partition start |      null
  1 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} |         clustering row | {"v":"0"}
  1 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null
  0 |      memtable:0 |                0 |    |                 |                                      {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} |        partition start |      null
  0 |      memtable:0 |                2 |  0 |               0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  0 |               1 |                                      {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  1 |               0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                2 |  2 |              -1 |                                                                                                         {"tombstone":{}} | range tombstone change |      null
  0 |      memtable:0 |                2 |  2 |               0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} |         clustering row | {"v":"0"}
  0 |      memtable:0 |                3 |    |                 |                                                                                                                     null |          partition end |      null

(10 rows)
```

Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```

Before:
```
median 141596.39 tps ( 62.1 allocs/op,  13.1 tasks/op,   43688 insns/op,        0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op,  13.1 tasks/op,   43692 insns/op,        0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```

Fixes: https://github.com/scylladb/scylladb/issues/11130

Closes #14347

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
  test/topology_custom: add test_select_from_mutation_fragments.py
  test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
  test/cql-pytest: add test_select_mutation_fragments.py
  test/cql-pytest: move scylla_data_dir fixture to conftest.py
  cql3/statements: wire-in mutation_fragments_select_statement
  cql3/restrictions/statement_restrictions: fix indentation
  cql3/restrictions/statement_restrictions: add check_indexes flag
  cql3/statments/select_statement: add mutation_fragments_select_statement
  cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
  service/pager: allow passing a query functor override
  service/storage_proxy: un-embed coordinator_query_options
  replica: add mutation_dump
  replica: extract query_state into own header
  replica/table: add make_nonpopulating_cache_reader()
  replica/table: add select_memtables_as_mutation_sources()
  tools,mutation: extract the low-level json utilities into mutation/json.hh
  tools/json_writer: fold SstableKey() overloads into callers
  tools/json_writer: allow writing metadata and value separately
  tools/json_writer: split mutation_fragment_json_writer in two classes
  tools/json_writer: allow passing custom std::ostream to json_writer
2023-07-19 11:54:11 +03:00
Asias He
c29e7e4644 Revert "Revert "view_update_generator: Increase the registration_queue_size""
This reverts commit 4cee8206f8.

The test is fixed.

Closes #14750
2023-07-19 11:46:28 +03:00
Aleksandra Martyniuk
e486f4eba6 compaction: create resharding_compaction_task_impl
resharding_compaction_task_impl serves as a base class of all
concrete resharding compaction task classes.
2023-07-19 10:41:35 +02:00
Kefu Chai
6ce0d3a202 build: cmake: build api/api-doc/metrics.json
metrics.json was added in d694a42745,
and `configure.py` was updated accordingly. this commit mirrors that
change in the CMake build system.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14753
2023-07-19 11:39:21 +03:00
Avi Kivity
503d21b570 cql3: expr: avoid separating column_mutation_attribute from its column_value when levellizing aggregation depth
Since ec77172b4b (" Merge 'cql3: convert
the SELECT clause evaluation phase to expressions' from Avi Kivity"),
we rewrite non-aggregating selectors to include an aggregation, in order
to have the rest of the code either deal with no aggregation, or
all selectors aggregating, with nothing in between. This is done
by wrapping column selectors with "first" function calls: col ->
first(col).

This broke non-aggregating selectors that included the ttl() or
writetime() pseudo functions. This is because we rewrote them as
writetime(first(col)), and writetime() isn't a function that operates
on any values; it operates on mutations and so must have access to
a column, not an expression.

Fix by detecting this scenario and rewriting the expression as
first(writetime(col)).

Unit and integration tests are added.

Fixes #14715.

Closes #14716
2023-07-19 11:35:01 +03:00
Botond Dénes
718f57c510 docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement 2023-07-19 01:28:28 -04:00
Botond Dénes
a8fc71dbc0 test/topology_custom: add test_select_from_mutation_fragments.py 2023-07-19 01:28:28 -04:00
Botond Dénes
7540e62522 test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
Checking that the generated schema has deterministic id and version.
2023-07-19 01:28:28 -04:00
Botond Dénes
6709a71b96 test/cql-pytest: add test_select_mutation_fragments.py 2023-07-19 01:28:28 -04:00
Botond Dénes
05e010b1d3 test/cql-pytest: move scylla_data_dir fixture to conftest.py
It will soon be used by more than one test file.
2023-07-19 01:28:28 -04:00
Botond Dénes
6458ff9917 cql3/statements: wire-in mutation_fragments_select_statement
This commit contains all the changes required to wire-in the new select
from mutation_fragments() statement.
2023-07-19 01:28:28 -04:00
Botond Dénes
81175b5ffc cql3/restrictions/statement_restrictions: fix indentation
Left broken in the previous patch.
2023-07-19 01:28:28 -04:00
Botond Dénes
c7b3faccd2 cql3/restrictions/statement_restrictions: add check_indexes flag
Allows the caller to turn off checking for indexes. Useful if the
restrictions are applied on a pseudo-table, which has no corresponding
table object, and therefore no index manager (or indexes for that
matter).
2023-07-19 01:28:28 -04:00
Botond Dénes
0b6b00178e cql3/statments/select_statement: add mutation_fragments_select_statement
Not wired in yet. SELECT * FROM MUTATION_FRAGMENTS($table) is a new
select statement sub-type, which allows dumping the underlying mutations
making up the data of a given table. The output of this statement is
mutation fragments presented as CQL rows. Each row corresponds to a
mutation fragment. Consequently, the output of this statement has a
schema that is different from that of the underlying table.
Data is always read from the local replica, on which the query is
executed. Migrating queries between coordinators is not allowed.
2023-07-19 01:28:28 -04:00
Botond Dénes
aa31321da9 cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
SELECT * FROM MUTATION_FRAGMENTS($table) is a new select statement
sub-type. More information will be provided in the patch which introduces
it. This patch adds only the Cql.g changes and what is further strictly
necessary.
2023-07-19 01:28:28 -04:00
Botond Dénes
ccf9eba521 service/pager: allow passing a query functor override
To allow paging for requests that don't go through storage-proxy
directly. By default, there is no override and the code falls back to
directly invoking storage_proxy::query() as before.
2023-07-19 01:28:28 -04:00
Botond Dénes
2174276bb7 service/storage_proxy: un-embed coordinator_query_options
So it can be forward declared. Add an embedded alias to reduce churn.
Requires similarly un-embedding clock_type.
2023-07-19 01:28:28 -04:00
Botond Dénes
a507ff5d88 replica: add mutation_dump
This file contains facilities to dump the underlying mutations contained
in various mutation sources -- like memtable, cache and sstables -- and
return them as query results. This can be used with any table on the
system. The output represents the mutation fragments which make up said
mutations, and it will be generated according to a schema, which is a
transformation of the table's schema.
This file provides a method, which can be used to implement the backend
of a select-statement: it has a similar signature to regular query
methods.
2023-07-19 01:28:28 -04:00
Botond Dénes
8643e23d0d replica: extract query_state into own header
So it can be reused outside of replica/table.cc.
2023-07-19 01:28:28 -04:00
Botond Dénes
3053996371 replica/table: add make_nonpopulating_cache_reader()
Allows reading the content of the cache, without populating it.
2023-07-19 01:28:28 -04:00
Botond Dénes
e2936b1eda replica/table: add select_memtables_as_mutation_sources()
Allows reading from each individual memtable which contains the given
token, without exposing the memtables themselves to the caller. Exposing
the memtables directly to any code outside of table is undesired because
they are mutable objects.
2023-07-19 01:28:28 -04:00
Botond Dénes
665f69b80d tools,mutation: extract the low-level json utilities into mutation/json.hh
Soon, we will want to convert mutation fragments into json inside the
scylla codebase, not just in tools. To avoid scylla-core code having to
include tools/ (and link against it), move the low-level json utilities
into mutation/.
2023-07-19 01:28:28 -04:00
Botond Dénes
36bca5a6af tools/json_writer: fold SstableKey() overloads into callers
These are very simple methods, and we want the low-level writers
not to depend on knowing the sstable type.
2023-07-19 01:28:28 -04:00
Botond Dénes
043b0f316f tools/json_writer: allow writing metadata and value separately
The values of cells are potentially very large and thus, when presenting
row content as json in SELECT * FROM MUTATION_FRAGMENTS($table) queries,
we want to split metadata and cell values into separate columns, so
users can opt out of having the potentially big values included.
To support this use-case, write(row) and its downstream write methods
get a new `include_value` flag, which defaults to true. When set to
false, cell values will not be included in the json output. At the same
time, new methods are added to convert only cell values of a row to
json.
2023-07-19 01:28:28 -04:00
Botond Dénes
1df004db8c tools/json_writer: split mutation_fragment_json_writer in two classes
1) mutation_partition_json_writer - containing all the low level
   utilities for converting sub-fragment level mutation components (such
   as rows, tombstones, etc.) and their components into json;
2) mutation_fragment_stream_json_writer - containing all the high level
   logic for converting mutation fragment streams to json;

The latter uses the former behind the scenes. The goal is to enable
reuse of the mutation-fragment-to-json conversion, without being forced
to work around differences in how the mutation fragments are represented
in json at the higher level.
2023-07-19 01:28:28 -04:00
Botond Dénes
0a5b67d6d9 tools/json_writer: allow passing custom std::ostream to json_writer
To allow for use-cases where the user wants to write the json into a
string.
2023-07-19 01:28:28 -04:00
Kefu Chai
959bfae665 build: cmake: build submodules
this mirrors what we have in the `build.ninja` generated by
`configure.py`. with this change, we can build, for instance,
`dist-tool-tar` from the `build.ninja` generated by CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-19 13:08:35 +08:00
Kefu Chai
bb7d99ad37 build: cmake: generate version files with add_custom_command()
instead of using execute_process(), let's use add_custom_command()
to generate the SCYLLA-{VERSION,RELEASE,PRODUCT}-FILE, so that we
can let other targets depend on these generated files, and generate
them on demand.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-19 13:08:35 +08:00
Kamil Braun
bfaac5192a gossiper: call on_remove subscriptions in the foreground in remove_endpoint
`gossiper::remove_endpoint` performs `on_remove` callbacks for all
endpoint change subscribers. This was done in the background (with a
discarded future) due to the following reason:
```
    // We can not run on_remove callbacks here because on_remove in
    // storage_service might take the gossiper::timer_callback_lock
```
however, `gossiper::timer_callback_lock` no longer exists; it was
removed in 19e8c14.

Today it is safe to perform the `storage_service::on_remove` callback in
the foreground -- it's only taking the token metadata lock, which is
also taken and then released earlier by the same fiber that calls
`remove_endpoint` (i.e.  `storage_service::handle_state_normal`).

Furthermore, we want to perform it in the foreground. First, there
already was a comment saying:
```
     // do subscribers first so anything in the subscriber that depends on gossiper state won't get confused
```
it's not too precise, but it argues that subscriber callbacks should be
serialized with the rest of `remove_endpoint`, not done concurrently
with it.

Second, we now have a concrete reason to do them in the foreground. In
issue #14646 we observed that the subscriber callbacks are racing with
the bootstrap procedure. Depending on scheduling order, if
`storage_service::on_remove` is called too late, a bootstrapping node
may try to wait for a node that was earlier replaced to become UP, which
is incorrect. By putting the `on_remove` call into the foreground of
`remove_endpoint`, we ensure that a node that was replaced earlier will
not be included in the set of nodes that the bootstrapping node waits
for (because `storage_service::on_remove` will clear it from
`token_metadata` which we use to calculate this set of nodes).

We also get rid of an unnecessary `seastar::async` call.

Fixes #14646

Closes #14741
2023-07-18 21:29:29 +02:00
Pavel Emelyanov
8bc42f54d4 Merge 'feature_service: handle deprecated features correctly in feature check' from Piotr Dulikowski
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.

When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.

The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.

This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:

- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
  list. For some reason, it was put there when the feature was being
  introduced. It does not break anything because there is
  a `gms::feature` object for it, but it's slightly confusing
  and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
  features there as they are introduced is removed. It is no longer
  necessary to do so because registered features are put there
  automatically. Deprecated features should still be put there,
  as indicated by another comment.

Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.

An error injection and a regression test are added.

Closes #14701

* github.com:scylladb/scylladb:
  topology_custom: add deprecated features test
  feature_service: add error injection for deprecated cluster feature
  feature_service: move error injection check to helper function
  feature_service: handle deprecated features correctly in feature check
2023-07-18 21:01:48 +03:00
Kamil Braun
6f22ed9145 Merge 'raft: move group0_state_machine::merger to its own header and add unit test for it' from Mikołaj Grzebieluch
Move `merger` to its own header file. Leave the logic of applying
commands to `group0_state_machine`. Remove `group0_state_machine`
dependencies from `merger` to make it an independent module.

Add a test that checks if `group0_state_machine_merger` preserves
timeuuid monotonicity. `last_id()` should be equal to the largest
timeuuid, based on its timestamps.

This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid
order and uuid order. Consequently, the resulting command should have a
more recent timeuuid.

Fixes #14568

Closes #14682

* github.com:scylladb/scylladb:
  raft: group0_state_machine_merger: add test for timeuuid ordering
  raft: group0_state_machine: extract merger to its own header
2023-07-18 17:43:50 +02:00
Kamil Braun
56c91473f2 Merge 'storage_proxy: silence abort_requested_exception on reads and writes' from Patryk Jędrzejczak
Fixes #10447

This issue is an expected behavior. However, `abort_requested_exception` is not handled properly.

-- Why this issue appeared

1. The node is drained.
2. `migration_manager::drain` is called and executes `_as.request_abort();`.
3. The coordinator sends read RPCs to the drained replica. On the replica side, `storage_proxy::handle_read` calls `migration_manager::get_schema_for_read`, which is defined like this:
```cpp
future<schema_ptr> migration_manager::get_schema_for_write(/* ... */) {
    if (_as.abort_requested()) {
        co_return coroutine::exception(std::make_exception_ptr(abort_requested_exception()));
    }
    /* ... */
```
So, `abort_requested_exception` is thrown.
4. The RPC layer doesn't preserve the exception's type, so it is converted to a string containing its error message.
5. It is rethrown as `std::runtime_error` on the coordinator side, and `abstract_resolve_reader::error()` logs information about it. However, we don't want to report `abort_requested_exception` there. This exception should be caught and ignored:
```cpp
void error(/* ... */) {
    /* ... */
   else if (try_catch<abort_requested_exception>(eptr)) {
        // do not report aborts, they are trigerred by shutdown or timeouts
    }
    /* ... */
```

-- Proposed solution

To fix this issue, we can add `abort_requested_exception` to `replica::exception_variant` and make sure that if it is thrown by `migration_manager::get_schema_for_write`, `storage_proxy::handle_read` correctly encodes it. Thanks to this change, `abstract_read_resolver::error` can correctly handle `abort_requested_exception` thrown on the replica side by not reporting it.

-- Side effect of the proposed solution

If the replica supports it, the coordinator doesn't, and all nodes support `feature_service::typed_errors_in_read_rpc`, the coordinator will fail to decode `abort_requested_exception` and it will be decoded to `unknown_exception`. It will still be rethrown as `std::runtime_error`, however the message will change from *abort requested* to *unknown exception*.

-- Another issue

Moreover, `handle_write` reports abort requests for the same reason, which also floods the logs (this time on the replica side). I don't think it is intended, so I've changed it too. This change is in the last commit.

Closes #14681

* github.com:scylladb/scylladb:
  service: storage_proxy: do not report abort requests in handle_write
  service: storage_proxy: encode abort_requested_exception in handle_read
  service: storage_proxy: refactor encode_replica_exception_for_rpc
  replica: add abort_requested_exception to exception_variant
2023-07-18 17:04:05 +02:00
Nadav Har'El
4ce46a998a cql-pytest: translate Cassandra's tests for BATCH operations
This is a translation of Cassandra's CQL unit test source file
BatchTest.java into our cql-pytest framework.

This is an old (2014) and small test file, with only minimal
testing of mostly error paths in batch statements. All the tests pass in
both Cassandra and Scylla.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14733
2023-07-18 17:01:18 +03:00
Raphael S. Carvalho
da18a9badf Fix test.py with compaction groups
test.py with --x-log2-compaction-groups option rotted a little bit.
Some boost tests added later didn't use the correct header which
parses the option or they didn't adjust suite.yaml.
Perhaps it's time to set up a weekly (or bi-weekly) job to verify
there are no regressions with it. It's important as it stresses
the data plane for tablets, reusing the existing tests available.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14732
2023-07-18 16:57:11 +03:00
Botond Dénes
7d5cca1958 Merge 'Regular compaction task' from Aleksandra Martyniuk
Task manager's tasks covering regular compaction.

Uses multiple inheritance on the already existing
regular_compaction_task_executor to keep track of
the operation with the task manager.

Closes #14377

* github.com:scylladb/scylladb:
  test: add regular compaction task test
  compaction: turn regular_compaction_task_executor into regular_compaction_task_impl
  compaction: add compaction_manager::perform_compaction method
  test: modify sstable_compaction_test.cc
  compaction: add regular_compaction_task_impl
  compaction: switch state after compaction is done
2023-07-18 16:52:53 +03:00
Kefu Chai
4661671220 s3/test: do not keep the tempdir forever
by default, up to 3 temporary directories are kept by pytest.
but we run only once for each $TMPDIR. per
our recent observation, it takes a lot more time for jenkins
to scan the tempdir if we use it for scylla's rundir.

so, to alleviate this symptom, we just keep up to one failed
session in the tempdir. if the test passes, the tempdir
created by pytest will be nuked. normally it is located at
scylladb/testlog/${mode}/pytest-of-$(whoami).

see also
https://docs.pytest.org/en/7.3.x/reference/reference.html#confval-tmp_path_retention_policy

Refs #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14735

[xemul: Withdrawing from PR's comments

    object_store is the only test which

        is using tmpdir fixture
        starts / stops scylla by itself
        and put the rundir of scylla in its own tmpdir

    we don't register the step of cleaning up [the temp dir] using the utilities provided by
    cql-pytest. we rely on pytest to perform the cleanup. while cql-pytest performs the
    cleanup using a global registry.
]
2023-07-18 16:49:25 +03:00
Kamil Braun
69e22de54d Merge 'minor test/pylib type fixes' from Alecco
Some minor fixes reported by `mypy`.

Closes #14693

* github.com:scylladb/scylladb:
  test/pylib: fix function attribute
  test/pylib: check cmd is defined before using it
  test/pylib: fix return type hint
  test/pylib: remove redundant method
2023-07-18 15:17:51 +02:00
Avi Kivity
a51fdadfed Merge 'treewide: remove #includes not used directly' from Kefu Chai
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files rely on the incorrectly
included header file, those ones are updated to #include the header
file they directly use.

if a forward declaration suffices, the declaration is added instead.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Closes #14740

* github.com:scylladb/scylladb:
  treewide: remove #includes not used directly
  size_tiered_backlog_tracker: do not include removed header
2023-07-18 14:45:33 +03:00
Alejo Sanchez
8fceb7b7a0 test/pylib: fix function attribute
Instead of globally hardcoding an attribute, set it in the function
itself.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-18 13:33:46 +02:00
Alejo Sanchez
f7ee4ee7f6 test/pylib: check cmd is defined before using it
Add an assert to check cmd is defined. Helps the type checker.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-18 13:33:46 +02:00
Alejo Sanchez
ff564583a4 test/pylib: fix return type hint
Fix type hint of return when using @asynccontextmanager.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-18 13:33:46 +02:00
Alejo Sanchez
2194d8864b test/pylib: remove redundant method
The ManagerClient.get_cql method is defined twice. Remove one and fix
the assert.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-18 13:33:46 +02:00
Kamil Braun
eb6202ef9c Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak
Fixes #9405

The `sync_point` API, provided with an incorrect sync point id, might
allocate a huge amount of memory and fail with `std::bad_alloc`.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing it with
a checksum calculated in `db::hints::decode` before decoding.

Closes #14534

* github.com:scylladb/scylladb:
  db: hints: add checksum to sync point encoding
  db: hints: add the version_size constant
2023-07-18 13:05:10 +02:00
Kefu Chai
bab16eb30e treewide: remove #includes not used directly
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files rely on the incorrectly
included header file, those ones are updated to #include the header
file they directly use.

if a forward declaration suffices, the declaration is added instead.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 17:36:31 +08:00
Kefu Chai
58302ab145 size_tiered_backlog_tracker: do not include removed header
according to cppreference,

> <ctgmath> is deprecated in C++17 and removed in C++20

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 17:36:31 +08:00
Michał Jadwiszczak
62ced66702 schema: add scylla specific options to schema description
Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates`
options to schema description.

Fixes: #12389
Fixes: scylladb/scylla-enterprise#2979

Closes #14275
2023-07-18 11:16:19 +03:00
Botond Dénes
21ff6efd74 test/boost/view_build_test: improve test_view_update_generator_register_semaphore_unit_leak
By making it independent of the number of units the view update
generator's registration semaphore is created with. We want to increase
this number significantly, and that would destabilize the test.
To prevent this, detach the test from the number of units
completely, while still preserving the original intent behind it, as best
as it could be determined.

Closes #14727
2023-07-18 09:18:28 +03:00
Alejo Sanchez
13e31eaeca test.py: show mode and suite name when listing tests
For --list, show also mode and suite name.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14729
2023-07-18 09:06:47 +03:00
Botond Dénes
b3cb611be7 Merge 'treewide: enable -Wsign-compare and address the warnings from this option' from Kefu Chai
in order to identify the problems caused by integer type promotion when comparing unsigned and signed integers, in this series, we

- address the warnings raised by `-Wsign-compare` compiler option
- add `-Wsign-compare` compiler option to the building systems

Closes #14652

* github.com:scylladb/scylladb:
  treewide: use unsigned variable to compare with unsigned
  treewide: compare signed and unsigned using std::cmp_*()
2023-07-18 09:05:30 +03:00
Botond Dénes
6961fbcec7 Merge 'Add the metrics config api' from Amnon Heiman
This series is based on top of the seastar relabel config API.

The series adds a REST API for the configuration; it allows getting and setting it.

The API is registered under the V2 prefix and uses the swagger 2.0 definition.

After this series to get the current relabel-config configuration:

```
    curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```

A set config example:
```
    curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
       { \
         "source_labels": [ \
           "__name__" \
         ], \
         "action": "replace", \
         "target_label": "level", \
         "replacement": "1", \
         "regex": "io_que.*" \
       } \
     ]' 'http://localhost:10000/v2/metrics-config/'
```

This is how it looks in the UI
![image](https://user-images.githubusercontent.com/2118079/230763730-bafcaf8b-ea6d-4a6c-a778-6271fa3b6f82.png)

Closes #12670

* github.com:scylladb/scylladb:
  api: Add the metrics API
  api/config: make it optional if the config API is the first to register
  api: Add the metrics.json Swagger file
  Preparing for V2 API from files
2023-07-18 07:10:31 +03:00
Botond Dénes
f03efd7ea9 Merge 'build: cmake: fix the build of some tests' from Kefu Chai
this series addresses the FTBFS of tests with CMake, and also checks for unknown parameters in `add_scylla_test()`

Closes #14650

* github.com:scylladb/scylladb:
  build: cmake: build SEASTAR tests as SEASTAR tests
  build: cmake: error out if found unknown keywords
  build: cmake: link tests against necessary libraries
2023-07-18 06:51:40 +03:00
Kefu Chai
4c1a26c99f compaction_manager: sort sstables when compaction is enabled
before this change, we sort sstables with compaction disabled, when we
are about to perform the compaction. but the idea of guarding the
getting and registering as a transaction is to prevent other compactions
from mutating the sstables' state and causing inconsistency.

but since the state is tracked on a per-sstable basis, and is not
related to the order in which the sstables are processed by a certain
compaction task, we don't need to guard the "sort()" with this mutually
exclusive lock.

for better readability, and probably better performance, let's move the
sort out of the lock, and take this opportunity to use
`std::ranges::sort()` for more concise code.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14699
2023-07-18 06:40:43 +03:00
Kefu Chai
fa3129fa29 treewide: use unsigned variable to compare with unsigned
sometimes we initialize a loop variable like

auto i = 0;

or

int i = 0;

but since the type of `0` is `int`, what we get is a variable of
`int` type, but later we compare it with an unsigned number, if we
compile the source code with the `-Werror=sign-compare` option, the
compiler would warn on seeing this. in general, this is a false
alarm, as we are not likely to get a wrong comparison result
here. but in order to prevent issues due to integer promotion
in comparisons elsewhere, and to prepare for enabling
`-Werror=sign-compare`, let's use unsigned to silence this warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Kefu Chai
3129ae3c8c treewide: compare signed and unsigned using std::cmp_*()
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type,
so they can be compared. but sometimes it matters: after the
promotion, the comparison yields the wrong result. this can be
demonstrated with a short sample like:

```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    fmt::print("{}\n", x < y);
    return 0;
}
```

this error can be identified by `-Werror=sign-compare`, but before
enabling this compile option, let's use `std::cmp_*()` to compare
them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Amnon Heiman
123dd44c21 api: Add the metrics API
This patch adds a metrics API implementation.
The API supports get and set the metric relabel config.

Seastar supports metrics relabeling in runtime, following Prometheus
relabel_config.

Based on metrics and label name, a user can add or remove labels,
disable a metric and set the skip_when_empty flag.

The metrics-config API supports doing such configuration via the
RESTful API.

As it's a new API it is placed under the V2 path.

After this patch the following API will be available
'http://localhost:10000/v2/metrics-config/' GET/POST.

For example:
To get the current config:
```
curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```

To set a config:
```
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
   { \
     "source_labels": [ \
       "__name__" \
     ], \
     "action": "replace", \
     "target_label": "level", \
     "replacement": "1", \
     "regex": "io_que.*" \
   } \
 ]' 'http://localhost:10000/v2/metrics-config/'
```
2023-07-17 17:09:36 +03:00
Amnon Heiman
eeac846ea7 api/config: make it optional if the config API is the first to register
Until now, only the configuration API was part of the V2 API.

Now, when other APIs are added, it is possible that another API would be
the first to register. The first API to register is different in the
sense that it does not have a leading ','.

This patch adds an option to mark the config API if it's the first.
2023-07-17 17:09:35 +03:00
Amnon Heiman
d694a42745 api: Add the metrics.json Swagger file
This patch adds the swagger definition for the metrics API.

Currently, the API defines a get and set of the metric_relabel_config.
2023-07-17 17:09:35 +03:00
Amnon Heiman
9e0ec3afba Preparing for V2 API from files
This patch changes the base path of the V2 of the API to be '/'.  That
means that the v2 prefix will be part of the path definition.
Currently, it only affects the config API that is created from code.

The motivation for the change is Swagger definitions that are read
from a file. Currently, when using swagger-ui with a doc path set
to http://localhost:10000/v2 and reading the Swagger definition from a
file, swagger-ui will concatenate the paths and look for
http://localhost:10000/v2/v2/{path}

Instead, the base path is now '/' and the /v2 prefix will be added by
each endpoint definition.

From the user perspective, there is no change in current functionality.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-07-17 17:09:35 +03:00
Patryk Jędrzejczak
02618831ef db: hints: add checksum to sync point encoding
The sync point API, when provided with an incorrect sync point id, might
allocate a huge amount of memory and fail with std::bad_alloc.

To fix this, we can check whether the encoded sync point has been modified
before decoding it. We achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing
it with a checksum calculated in db::hints::decode before decoding.
2023-07-17 16:05:07 +02:00
Patryk Jędrzejczak
0a424e1760 db: hints: add the version_size constant
The next commit changes the sync point encoding format to V2. The
new format appends a checksum to the encoded sync point, and its
implementation uses the checksum_size constant - the number of bytes
required to store the checksum. To increase consistency and readability,
we additionally add and use a version_size constant.

Definitions of sync_point::decode and sync_point::encode are slightly
changed so that they don't depend on the version_size value and make
implementation of the V2 format easier.
2023-07-17 16:02:18 +02:00
Aleksandra Martyniuk
7dbe624dee test: add regular compaction task test 2023-07-17 15:54:33 +02:00
Aleksandra Martyniuk
2e87ba1879 compaction: turn regular_compaction_task_executor into regular_compaction_task_impl
regular_compaction_task_executor inherits both from compaction_task_executor
and regular_compaction_task_impl.
2023-07-17 15:54:33 +02:00
Aleksandra Martyniuk
e3b068be4d compaction: add compaction_manager::perform_compaction method 2023-07-17 15:54:33 +02:00
Aleksandra Martyniuk
ab4ae6b84a test: modify sstable_compaction_test.cc
Modify sstable_compaction_test.cc so that it does not depend on
how quickly compaction manager stats are updated after compaction
is triggered.

It is required since in the following changes the context may
switch before the stats are updated.
2023-07-17 15:54:33 +02:00
Aleksandra Martyniuk
9fdd130943 compaction: add regular_compaction_task_impl
regular_compaction_task_impl serves as a base class of all
concrete regular compaction task classes.
2023-07-17 15:54:33 +02:00
Aleksandra Martyniuk
33cb156ee3 compaction: switch state after compaction is done
Compaction task executors which inherit from compaction_task_impl
may stay in memory after the compaction is finished.
Thus, state switch cannot happen in destructor.

Switch state to none in perform_task defer.
2023-07-17 15:54:33 +02:00
Mikołaj Grzebieluch
bdf3959ae6 raft: group0_state_machine_merger: add test for timeuuid ordering
This test checks that `group0_state_machine_merger` preserves timeuuid monotonicity:
`last_id()` should be equal to the largest timeuuid, based on its timestamp.

This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid order and
uuid order. Consequently, the resulting command should have a more recent timeuuid.

Closes #14568
2023-07-17 15:51:20 +02:00
Mikołaj Grzebieluch
96c6e0d0f7 raft: group0_state_machine: extract merger to its own header
Move `merger` to its own header file. Leave the logic of applying commands to
`group0_state_machine`. Remove `group0_state_machine` dependencies from `merger`
to make it an independent module. Add `static` and `const` keywords to its
method signatures. Change it to a `class`. Add documentation.

With this patch, it is easier to write unit tests for the merger.
2023-07-17 15:45:49 +02:00
Anna Stuchlik
2aa3672e5f doc: fix the 5.2-to-5.3 upgrade guide
Fixes https://github.com/scylladb/scylladb/issues/13993

This commit applies feedback from @mykaul added in
https://github.com/scylladb/scylladb/pull/13960 after
it was merged.

In addition, I've removed the information about
the Ubuntu version the images are based - the info
doesn't belong here, and, it addition, it causes
maintenance issues.

Closes #14703
2023-07-17 15:26:33 +02:00
Patryk Jędrzejczak
7ae7be0911 locator: remove this_host_id from topology::config
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.

Closes #14638

Closes #14723
2023-07-17 14:57:36 +02:00
Patryk Jędrzejczak
56bd9b5db3 service: storage_proxy: do not report abort requests in handle_write
We don't want to report aborts in storage_proxy::handle_write because they
can only be triggered by shutdowns and timeouts.

Before this change, such reports flooded logs when a drained node still
received the write RPCs.
2023-07-17 12:27:36 +02:00
Patryk Jędrzejczak
f9db9f5943 service: storage_proxy: encode abort_requested_exception in handle_read
storage_proxy::handle_read now makes sure that abort_requested_exception
is encoded in a way that preserves its type information. This allows
the coordinator to properly deserialize and handle it.

Before this change, if a drained replica was still receiving the read
RPCs, it would flood the coordinator's logs with std::runtime_error
reports.
2023-07-17 12:27:36 +02:00
Patryk Jędrzejczak
68bd0424c2 service: storage_proxy: refactor encode_replica_exception_for_rpc
To properly handle abort_requested_exception thrown from
migration_manager::get_schema_for_read in storage_proxy::handle_read (we
do in the next commit) we have to somehow encode and return it. The
encode_replica_exception_for_rpc function is not suitable for that because
it requires the SourceTuple type (of a value returned by do_query()) which
we don't know when calling get_schema_for_read.

We move the part of encode_replica_exception_for_rpc responsible for
handling exceptions to a new function and rewrite it in a way that doesn't
require the SourceTuple type. As this function fits the name
encode_replica_exception_for_rpc better, we name it this way and rename
the previous encode_replica_exception_for_rpc.
2023-07-17 12:27:33 +02:00
Patryk Jędrzejczak
7f83dbd9e7 test: disable raft-topology in test_remove_garbage_group0_members
With Raft-topology enabled, test_remove_garbage_group0_members has been
flaky when it should always fail. This has been discussed in #14614.

Disabling Raft-topology in the topology suite is problematic because
the initial cluster size is non-zero, so we have nodes that already use
Raft-topology at the beginning of the test. Therefore, we move
test_topology_remove_garbage_group0.py to the topology_custom suite.
Apart from disabling Raft-topology, we have to start 4 servers instead
of 1 because of the different initial cluster sizes.

Closes #14692
2023-07-17 11:42:57 +02:00
Anna Stuchlik
c53bbbf1b9 doc: document nodetool checkAndRepairCdcStreams
Fixes https://github.com/scylladb/scylladb/issues/13783

This commit documents the nodetool checkAndRepairCdcStreams
operation, which was missing from the docs.

The description is added in a new file and referenced from
the nodetool operations index.

Closes #14700
2023-07-17 11:41:54 +02:00
Avi Kivity
bfaac3a239 Merge 'Make replace sstables implementations exception safe' from Benny Halevy
This is the first phase of providing strong exception safety guarantees by the generic `compaction_backlog_tracker::replace_sstables`.

The goal is for all compaction strategies' backlog trackers' replace_sstables to provide strong exception safety guarantees (i.e. they may throw an exception but must revert on error any intermediate changes they made, restoring the tracker to the pre-update state).

Once this series is merged and ICS replace_sstables is also made strongly exception safe (using infrastructure from size_tiered_backlog_tracker introduced here), `compaction_backlog_tracker::replace_sstables` may allow exceptions to propagate back to the caller rather than disabling the backlog tracker on errors.

Closes #14104

* github.com:scylladb/scylladb:
  leveled_compaction_backlog_tracker: replace_sstables: provide strong exception safety guarantees
  time_window_backlog_tracker: replace_sstables: provide strong exception safety guarantees
  size_tiered_backlog_tracker: replace_sstables: provide strong exception safety guarantees
  size_tiered_backlog_tracker: provide static calculate_sstables_backlog_contribution
  size_tiered_backlog_tracker: make log4 helper static
  size_tiered_backlog_tracker: define struct sstables_backlog_contribution
  size_tiered_backlog_tracker: update_sstables: update total_bytes only if set changed
  compaction_backlog_tracker: replace_sstables: pass old and new sstables vectors by ref
  compaction_backlog_tracker: replace_sstables: add FIXME comments about strong exception safety
2023-07-17 12:32:27 +03:00
Botond Dénes
c4f35d67e5 Merge 'utils: add fmt formatter for pretty printers' from Kefu Chai
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.

this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.

please note, although it is more common to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we keep using the SI-style prefixes as
the default, in order to preserve the existing behavior.
the operator<< overloads for these types are removed.

the tests are updated accordingly.

Refs #13245

Closes #14719

* github.com:scylladb/scylladb:
  utils: drop operator<< for pretty printers
  utils: add fmt formatter for pretty printers
2023-07-17 12:06:00 +03:00
Kefu Chai
ed5825ebdb s3/test: correct outdated comments
these comments and docstrings are not in sync with the code they
are supposed to explain, so let's update them accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14545
2023-07-17 12:03:11 +03:00
Kefu Chai
3ed982df87 query_context: do not include unused header
in this header, none of the exceptions defined by
`exceptions/exceptions.hh` is used. so let's drop the `#include`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14718
2023-07-17 12:00:49 +03:00
Kefu Chai
18166e0e43 sstable: do not include unused header
`db/query_context.hh` contains the declaration of class
`db::query_context`. but `replica/table.cc` does not use or need
`db::query_context`.

so, in this change, the `#include` is removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14717
2023-07-17 11:47:02 +03:00
Aleksandra Martyniuk
241b56b7b5 test: drain old compaction tasks from task manager
When the compaction task test runs on the same Scylla instance where
other tests run, some compaction tasks from those test cases may
be left in the task manager. If they stay in memory long enough, they may
get unregistered during the compaction task test and cause a bad_request
status.

Drain old compaction tasks before and after each test.

Fixes: #14584.

Closes #14585
2023-07-17 10:57:36 +03:00
Harsh Soni
78c8e92170 dbuild: fix ulimits hard value for docker on osx
Docker-on-osx cannot parse "unlimited" as the hard limit value of ulimit, so hardcode it to a fixed value.

Closes #14295
2023-07-17 10:30:39 +03:00
Kefu Chai
a8254111ef utils: drop operator<< for pretty printers
since all callers of these operators have switched to fmt formatters.
let's drop them. the tests are updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-17 14:02:13 +08:00
Kefu Chai
fc6b84ec1f utils: add fmt formatter for pretty printers
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.

this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.

please note, although it is more common to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we keep using the SI-style prefixes as
the default, in order to preserve the existing behavior.

also, we use the singular form of "byte" when formatting "1". this is
more correct.

the tests are updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-17 14:02:13 +08:00
Botond Dénes
3945721dd6 Merge 'test/boost/database_test: split mutation sub-tests' from Alecco
Split long-running database mutation tests.

At a trade-off with verbosity, split these sub-tests for the long-running tests `database_with_data_in_sstables_is_a_mutation_source_`*.

Refs #13905

Closes #14455

* github.com:scylladb/scylladb:
  test/lib/mutation_source_test: bump ttl
  test/boost/memtable_test: split memtable sub-tests
  test/boost/database_test: split mutation sub-tests
2023-07-17 08:29:28 +03:00
Botond Dénes
1f5b1679b0 Merge 'test: use different table names in sstable_expired_data_ratio and cleanups' from Kefu Chai
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.

this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry. for instance, the metrics names for the tables would conflict.

so, in this series, we use different names for the tables under
test. they can share the same set of sstables though. this fulfills
the needs of the test in question. we also rename some variables
for better readability in this series.

Fixes https://github.com/scylladb/scylladb/issues/14657

Closes #14665

* github.com:scylladb/scylladb:
  test: rename variables with better names
  test: use different table names in sstable_expired_data_ratio
  test: explicitly capture variables
2023-07-17 08:27:30 +03:00
Kefu Chai
567b453689 utils: avoid using out-of-range index in pretty_printers
before this change, if the formatted size was greater than a petabyte,
`exp` would be 6, but we would still use it as the index to find the
suffix in `suffixes`, even though the array's size is 6. so we would be
referencing random bits after "PB" for the suffix of the formatted size.

in this change

* loop over the suffixes for better readability and to avoid
  off-by-one errors.
* add tests for both pretty printers

Branches: 5.1,5.2,5.3
Fixes #14702
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14713
2023-07-16 18:46:09 +03:00
Kefu Chai
6459bf9c0b test: randomized_nemesis_test: do not perform tautological comparison
Chained comparison is not supported by C++ and does not yield the
expected result: "0 <= d" evaluates to a bool, which is always less
than "magic", so the assertion always passes.

so let's avoid using it.

```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2908:23: error: result of comparison of constant 54313 with expression of type 'bool' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
 2908 |         assert(0 <= d < magic);
      |                ~~~~~~ ^ ~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14695
2023-07-16 18:30:58 +03:00
Avi Kivity
4fc870a31a cql3: expr: avoid redoing prepare work when evaluating field_selection
prepare_expression() already validates the types and computes
the index of the field; no need to redo that work when
evaluating the expression.

The tests are adjusted to also prepare the expression.

Closes #14562
2023-07-16 14:29:19 +03:00
Alejo Sanchez
6d9709679d test/lib/mutation_source_test: bump ttl
Use a large ttl (2h+) to avoid deletions for database_test.

An actual fix would be to make database_test to not ignore query_time,
but this is much harder.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-15 10:51:09 +02:00
Alejo Sanchez
a9350493e3 test/boost/memtable_test: split memtable sub-tests
Split long-running memtable tests.

At a trade-off with verbosity, split these sub-tests for the
long-running tests
test_memtable_with_many_versions_conforms_to_mutation_source*.

Refs #13905
2023-07-15 10:51:09 +02:00
Alejo Sanchez
79eedded35 test/boost/database_test: split mutation sub-tests
Split long-running database mutation tests.

At a trade-off with verbosity, split these sub-tests for the
long-running tests database_with_data_in_sstables_is_a_mutation_source_*.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-15 10:51:06 +02:00
Kefu Chai
42bba50727 test: rename variables with better names
we first use `cf` and then `lcs_table` later on in
`sstable_expired_data_ratio` to represent "tables_for_tests"
with schemas of different compaction strategies.

to improve readability, we rename the variables which are
related to STCS (Size-Tiered Compaction Strategy) to "stcs_*" to
better reflect their relations.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-15 13:29:55 +08:00
Kefu Chai
f7af971181 test: use different table names in sstable_expired_data_ratio
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.

this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry. for instance, the metrics names for the tables would conflict.

to avoid creating multiple tables with the same ${ks}.${cf},
after this change we use different names for the tables under
test; they can still share the same set of sstables. this
fulfills the needs of the test in question, and the needs of
having per-table metrics with table ids as their identifiers.

Fixes #14657
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-15 13:29:55 +08:00
Kefu Chai
c836e7940e test: explicitly capture variables
* sstable_expired_data_ratio: capture variables explicitly for better
  readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-15 13:29:55 +08:00
Avi Kivity
b54265034d cql3: expr: make expression non-default-constructible
There is no obvious default expression, so better not to allow
default construction of expressions to prevent unintended values
from leaking in. Resolves a FIXME.
2023-07-14 18:35:59 +03:00
Avi Kivity
c0ba7040d5 cql3: grammar: don't default-construct expressions
Use uninitialized<expression> for that. Since it's heavily used,
alias it as "uexpression".

To prevent uninitialized<> from leaking into the rest of the
system, change do_with_parser() to unwrap it. We add an
unwrap_uninitialized_t template type alias for that.

Lots of std::move()s are sprinkled around to make things compile,
as uninitialized<T> refuses to convert to T without them.
2023-07-14 18:33:06 +03:00
Anna Stuchlik
a93fd2b162 doc: fix internal links
Fixes https://github.com/scylladb/scylladb/issues/14490

This commit fixes multiple links that were broken
after the documentation was published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directives for pages and sections, respectively.

Closes #14664
2023-07-14 18:32:47 +03:00
Avi Kivity
ee2607324b cql3: grammar: improve uninitialized<> flexibility
uninitialized<> is used to work around the parser generator's propensity
to default-construct return values by supplying a default constructor
to otherwise non-default-constructible types. Make it easier to initialize
it not only from the wrapped type, but also from types convertible to
the wrapped type.

This is useful to initialize an uninitialized<expression> from an
expression element (say a binary_operator), without an explicit
conversion.
2023-07-14 18:31:54 +03:00
Avi Kivity
4bc0b42639 cql3: grammar: adjust uninitialized<> wrapper
The grammar generator relies on everything having a default
constructor, and to accommodate it we have an uninitialized<>
template that fakes a default constructor where one doesn't
exist. For convenience we have implicit conversion operators
from uninitialized<T> to T. Currently, we have them for both
rvalue-reference and normal reference wrappers.

It turns out that C++ isn't clever enough to deal with both
of them when templates are involved. When it needs a T but
has an uninitialized_wrapper<T>&&, it sees both conversion
operators and can't pick one.

Aid it by removing the non-rvalue conversion operator. The
rvalue conversion operator is more efficient, and is all that
is needed, since we don't use values more than once in the grammar.

Sprinkle std::move()s on the rest of the grammar to keep it
compiling. In a few places the odd "$production" syntax
is changed to the more common "var=production ... { var }".
2023-07-14 16:48:16 +03:00
Botond Dénes
eb8d7fa1c2 Merge 'test/pylib: handle paging for run_async' from Alecco
Provide a way to fetch all pages for `run_async`.

While there, move the code to a common helper module.

Fixes https://github.com/scylladb/scylladb/issues/14451

Closes #14688

* github.com:scylladb/scylladb:
  test/pylib: handle paged results for async queries
  test/pylib: move async query wrapper to common module
2023-07-14 16:37:24 +03:00
Raphael S. Carvalho
d6029a195e Remove DateTieredCompactionStrategy
This is the last step of deprecation dance of DTCS.

In Scylla 5.1, users were warned that DTCS was deprecated.

In 5.2, altering or creating tables with DTCS was forbidden.

The 5.3 branch was already created, so this is targeting 5.4.

Users who refuse to move away from DTCS will have Scylla
fall back to the default strategy, either STCS or ICS.

See:
WARN  2023-07-14 09:49:11,857 [shard 0] schema_tables - Falling back to size-tiered compaction strategy after the problem: Unable to find compaction strategy class 'DateTieredCompactionStrategy

The user can later switch to a supported strategy with
ALTER TABLE.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14559
2023-07-14 16:20:48 +03:00
Avi Kivity
5f6d00babf test: expr_test: don't invoke expression's default constructor
std::unordered_map::operator[] requires the default constructor of
expression, which we're about to remove. Use std::unordered_map::at()
instead.
2023-07-14 16:06:36 +03:00
Avi Kivity
4b6e38e704 cql3: statement_restrictions: explicitly initialize expressions in index match code
The index match code has some default-initialized expressions. These won't
compile when we remove expression's default constructor, so replace them
with the current default value, an empty conjunction.

An empty conjunction doesn't make any special sense here; the code
should be refactored not to rely on this random initial value. But this
is delicate code and the refactoring shouldn't be done in the middle of
an unrelated series.
2023-07-14 16:03:14 +03:00
Avi Kivity
a5921e4923 cql3: statement_restrictions: explicitly intitialize some expression fields
_partition_key_restrictions, _clustering_columns_restrictions, and
_nonprimary_key_restrictions are currently default-initialized. As
we're about to remove expression's default constructor, we need
to initialize them with something.

Use conjunction({}). Not only is this what the default constructor does,
that's what those fields' manipulators assume - they adjust field x
using make_conjunction(y, x). This dates to expression's roots as
a replacement for restrictions.
2023-07-14 15:57:41 +03:00
Avi Kivity
f94eb708e9 cql3: statement_restrictions: avoid expression's default constructor when classifying restrictions
We have some gnarly code that classifies restrictions by the column
they restrict. This uses std::unordered_map::operator[], which uses
the value's default constructor. This happens to be "expression", and
as we're about to remove the default constructor, this won't do.

Fix by using try_emplace(), which makes the code nicer and more
efficient. It could be further improved, but it's better to demolish it
instead.
2023-07-14 15:52:03 +03:00
Avi Kivity
61be544431 cql3: expr: prepare_expression: avoid default-constructed expression
We're about to remove expression's default constructor, so adjust
the usertype_constructor code that checks whether a field has an
initializer or whether we must supply a NULL to not rely on it.
2023-07-14 15:49:51 +03:00
Kefu Chai
8b10b1408b migration_manager: correct format string when printing warning
we intend to print the error message but fail to pass it to the
formatter. if we actually ran into this case, fmtlib would throw.

so in this change, we also print the error when
announcing schema change fails.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14623
2023-07-14 15:47:10 +03:00
Avi Kivity
f01c4b3094 cql3: broadcast_tables: prepare new_value without relying on expression default constructor
A broadcast_table modification query consists of the key, the new value,
and the condition. When preparing it, we construct the query with
a default new_value expression, and pass it to
operation::prepare_for_broadcast_tables() to fill .new_value.

Since we're removing expression's default constructor, this won't work.
So instead, return the new value from a (renamed)
operation::prepare_new_value_for_broadcast_tables(), and use the return
value to fill the query.
2023-07-14 15:42:58 +03:00
Piotr Dulikowski
39e41dec84 topology_custom: add deprecated features test
Adds a test which simulates marking a cluster feature as deprecated.
2023-07-14 12:41:37 +02:00
Piotr Dulikowski
794d3f0b03 feature_service: add error injection for deprecated cluster feature
Adds an error injection which allows enabling the TEST_ONLY_FEATURE as
a deprecated feature, i.e. it is assumed to be always enabled, but is still
considered to be supported by the node and advertised in gossip.
2023-07-14 12:41:37 +02:00
Piotr Dulikowski
a775f929df feature_service: move error injection check to helper function
And also extract "features_enable_test_feature" literal to a string
constant. This should slightly improve readability and make it more
consistent with the next commit.
2023-07-14 12:41:37 +02:00
Piotr Dulikowski
1704d7e4f0 feature_service: handle deprecated features correctly in feature check
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.

When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.

The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.

This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:

- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
  list. For some reason, it was put there when the feature was being
  introduced. It does not break anything because there is
  a `gms::feature` object for it, but it's slightly confusing
  and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
  features there as they are introduced is removed. It is no longer
  necessary to do so because registered features are put there
  automatically. Deprecated features should still be put there,
  as indicated by another comment.

Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.
2023-07-14 12:41:18 +02:00
Asias He
dad5caf141 streaming: Add stream_plan_ranges_percentage
This option allows the user to change the number of ranges streamed in a
batch per stream plan.

Currently, each stream plan streams 10% of the total ranges.

With more ranges per stream plan, it reduces the waiting time between
two stream plans. For example,

stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)

We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finish at different times, one of
the shards will be idle.

If we stream more ranges in a single stream plan, the waiting time will
be reduced.

Previously, we retried a stream plan if it failed. That's one
of the reasons we wanted more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.

Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.

The default value is the same as before: 10% of the total ranges.

Fixes #14191

Closes #14402
2023-07-14 09:03:01 +03:00
Botond Dénes
5c5c56820c Merge 'Automatically close exhausted SSTable readers for cleanup' from Raphael "Raph" Carvalho
This is a followup to 1545ae2d3b

A new reader is introduced that automatically closes the underlying sstable reader once it's exhausted after a fast forward call.

This allows us to revert 1fefe597e6, which was fragile.

Closes #14669

* github.com:scylladb/scylladb:
  Revert "sstables: Close SSTable reader if index exhaustion is detected in fast forward call"
  sstables: Automatically close exhausted SSTable readers in cleanup
2023-07-14 09:00:57 +03:00
Patryk Jędrzejczak
a21c4abad7 replica: add abort_requested_exception to exception_variant
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).

To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
2023-07-13 16:57:10 +02:00
Alejo Sanchez
9fefb601ef test/pylib: handle paged results for async queries
Provide a flag to fetch all pages for run_async().

Add a simple test to random tables. Runs within 6 seconds in debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-13 16:56:01 +02:00
Mikołaj Grzebieluch
b165f1e88b utils: error injection: check if it is an ongoing one-shot injection in is_enabled
Change it for consistency with `enabled_injections`.

Closes #14597
2023-07-13 15:56:33 +02:00
Botond Dénes
4cee8206f8 Revert "view_update_generator: Increase the registration_queue_size"
This reverts commit d3034e0fab.

The test modified by this commit
(view_build_test.test_view_update_generator_register_semaphore_unit_leak)
often fails, breaking build jobs.
2023-07-13 16:48:50 +03:00
Pavel Emelyanov
ddbccf1952 main: Use invoke_on_all(&class::method, ...) where possible
The sharded<service>::invoke_on_all() has the ability to call method by
pointer with automagical unwrapping of sharded references. This makes
the code shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14684
2023-07-13 16:31:14 +03:00
Anna Stuchlik
9db9dedb41 doc: document the minimum_keyspace_rf option
Fixes https://github.com/scylladb/scylladb/issues/14598

This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)

This commit must be backported to branch-5.3, because
the feature is already on that branch.

Closes #14686
2023-07-13 15:37:52 +03:00
Kefu Chai
057701299c compaction_manager: remove unnecessary include
also, remove unnecessary forward declarations.

* compaction_manager_test_task_executor is only referenced
  in the friend declaration. but this declaration does not need
  a forward declaration of the friend class
* compaction_manager_test_task_executor is not used anywhere.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14680
2023-07-13 14:59:39 +03:00
Patryk Jędrzejczak
ed5627cb78 test: raft: add more unit tests for raft address map
https://github.com/scylladb/scylladb/pull/12035 and
https://github.com/scylladb/scylladb/pull/14329 have introduced a few
features to the raft address map that haven't been tested yet:
- mappings without an actual IP address (the first PR)
- marking entries with generation numbers (the second PR)

This commit adds unit tests that verify these changes.

Closes #14572
2023-07-13 12:00:43 +02:00
Kamil Braun
a2fe63349d Merge 'utils: error injection: add a string-to-string map of injection's parameters' from Mikołaj Grzebieluch
Add `parameters` map to `injection_shared_data`. Now tests can attach
string data to injections that can be read in injected code via
`injection_handler`.

Closes #14521

Closes #14608

* github.com:scylladb/scylladb:
  tests: add a `parameters` argument to code that enables injections
  api/error_injection: add passing injection's parameters to enable endpoint
  tests: utils: error injection: add test for injection's parameters
  utils: error injection: add a string-to-string map of injection's parameters
  utils: error injection: rename received_messages_counter to injection_shared_data
2023-07-13 11:52:15 +02:00
Kefu Chai
a871de33e6 test.py: remove redundant message in report
before this change, we would have a report in Jenkins like:

```
[Info] - 1 out of 3 times failed: failed.
 == [File] - test/boost/commitlog_test.cc
 == [Line] - 298

[Info] - passed: release=1, dev=1
 == [File] - test/boost/commitlog_test.cc
 == [Line] - 298

[Info] - failed: debug=1
 == [File] - test/boost/commitlog_test.cc
 == [Line] - 298
```

the first section is rendered from an `Info` tag,
created by `test.py`. but the ending "failed" does not
help in this context, as we already understand it's failing.
so, in this change, it is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14546
2023-07-13 11:31:13 +03:00
Nadav Har'El
e01a369708 alternator: detect errors in AttributeDefinitions parameter
Add missing validation of the AttributeDefinitions parameter of the
CreateTable operation in Alternator. This validation isn't needed
for correctness or safety - the invalid entries would have been
ignored anyway. But this patch is useful for user-experience - the
user should be notified when the request is malformed instead of
ignoring the error.

The fix itself is simple (a new validate_attribute_definitions()
function, calling it in the right place), but much of the contents
of this patch is a fairly large set of tests covering all the
interesting cases of how AttributeDefinitions can be broken.
Particularly interesting is the case where the same AttributeName
appears more than once, e.g., attempting to give two different types
to the same key attribute - which is not allowed.

One of the new tests remains xfail even after this patch - it checks
the case that a user attempts to add a GSI to an existing table where
another GSI defined the key's type differently. This test can't
succeed until we allow adding GSIs to existing tables (Refs #11567).

Fixes #13870.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14556
2023-07-13 11:28:47 +03:00
Tomasz Grabiec
6449c59963 gossiper: Bring back abort on listener failure
The refactoring in c48dcf607a dropped
the noexcept around listener notification. This is probably
unintentional, as the comment which explains why we need to abort was
preserved.

Closes #14573
2023-07-13 11:26:23 +03:00
Kefu Chai
565f5c7380 transport: correct format string when printing logging message
we print the stream id in the logging messages, but in this case,
we forgot to pass `stream` to `log::debug()`, even though the
placeholder for `stream` was added. if fmtlib actually formatted
the message with this format string, it would throw.

fortunately, we don't enable debug level logging often, which is
probably why we haven't spotted this issue yet.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14620
2023-07-13 11:21:43 +03:00
Kefu Chai
3a67c31df0 compaction_manager: pass const reference to ctor
the callers of the constructor do not move a variable into this
parameter, and the constructor itself cannot consume it,
as the parameter is a vector while `compaction_sstable_registration`
uses an `unordered_set` for tracking the sstables being compacted.

so, to avoid creating a temporary copy of the vector, let's just
pass by reference.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14661
2023-07-13 11:19:44 +03:00
Petr Gusev
3737bf8fa2 topology.cc: unindex_node: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.

The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got an empty node list for the single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.

With this fix the test passes both with
--consistent-cluster-management option and
without it.

A specific unit test for this problem was added.

Fixes: #14184

Closes #14673
2023-07-13 11:16:01 +03:00
Mikołaj Grzebieluch
382d797d81 tests: add a parameters argument to code that enables injections 2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch
507f750754 api/error_injection: add passing injection's parameters to enable endpoint 2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch
ef712e5d21 tests: utils: error injection: add test for injection's parameters 2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch
f60580ab3e utils: error injection: add a string-to-string map of injection's parameters
Add `parameters` map. Now tests can attach string data to
injections that can be read in injected code via `injection_handler`.
2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch
b33714a0f0 utils: error injection: rename received_messages_counter to injection_shared_data
For now, `received_messages_counter` holds only the data for messaging the injection.
In the future, there will be more data to keep, for example a string-to-string map of
the injection's parameters.

Rename this class and its attributes.
2023-07-13 10:10:52 +02:00
Asias He
1b577e0414 repair: Release permit earlier when the repair_reader is done
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677
2023-07-13 11:00:35 +03:00
Nadav Har'El
6a7d980a5d docs/alternator: list more DynamoDB features not in Alternator
This patch adds to docs/alternator/compatibility.md mentions of three
recently-added DynamoDB features (ReturnValuesOnConditionCheckFailure,
DeletionProtectionEnabled and TableClass) which Alternator does not yet
support.

Each of these mentions also links to the github issue we have on each
feature - issues #14481, #14482 and #10431 respectively.

During a review of this patch, the reviewers didn't like that I used
words like "recent" and "new" to describe recently-added DynamoDB
features, and asked that I use specific dates instead. So this is what
I do in this patch for the new features - and I also went back and
fixed a few pre-existing references to "recent" and "new" features,
and added the dates.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14483
2023-07-13 09:52:08 +02:00
Kamil Braun
9d4b3c6036 test: use correct timestamp resolution in test_group0_history_clearing_old_entries
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.

Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.

Fix it by using microseconds everywhere.

Fixes #14653

Closes #14670
2023-07-13 10:33:52 +03:00
Kefu Chai
aeb160a654 sstables: use sstables_manager::uuid_stable_identifier()
instead of accessing the `feature_service`'s member variable, use
the accessor provided by sstable_manager. so we always access the
this setting via a single channel. this should helps with the
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14658
2023-07-13 10:31:06 +03:00
Tomasz Grabiec
1ecd3c1a9a test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
The new test cases are a mirror of old test cases, but with updated digests.
2023-07-12 21:21:55 +02:00
Tomasz Grabiec
b7bc991aa1 Merge 'Fix test_node_isolation flakiness' from Kamil Braun
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.

The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.

At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.

Closes #14663

* github.com:scylladb/scylladb:
  test: test_node_isolation: use `ManagerClient.con_gen` to create CQL connection
  test: manager_client: make `con_gen` for `ManagerClient.__init__` nonoptional
2023-07-12 16:36:54 +02:00
Raphael S. Carvalho
8829ff02c5 Revert "sstables: Close SSTable reader if index exhaustion is detected in fast forward call"
This reverts commit 1fefe597e6.

This can be reverted now that the auto-closing reader is in place.

Refs #12998.
2023-07-12 10:48:28 -03:00
Raphael S. Carvalho
ca8705bd82 sstables: Automatically close exhausted SSTable readers in cleanup
Add a reader that will automatically close the underlying sstable
reader if fast forward is called with a range past the range
spanned by the SSTable. This is only to be used in the context
of fast forward calls in cleanup, as the combined reader in full
scans can proactively close the readers that returned EOS.

Regular reads that go through cache enable fast forwarding to
position range, therefore won't enable auto-closed reader.
Compactions don't enable any kind of forward, and they won't
have it enabled either.

The overhead is minimal, with cleanup being able to reach the
same 38MB/s as before this patch.

Refs #12998.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-12 10:48:14 -03:00
Calle Wilund
890f1f4ad3 generic_server: Handle TLS error codes indicating broken pipe
Fixes  #14625

In broken pipe detection, also handle TLS error codes.

Requires https://github.com/scylladb/seastar/pull/1729

Closes #14626
2023-07-12 16:04:33 +03:00
Botond Dénes
6a63abcb9f Merge 'doc: fix broken links reported by the link checker' from Anna Stuchlik
This PR fixes or removes broken links reported by an online link checker.

Fixes https://github.com/scylladb/scylladb/issues/14488

Closes #14462

* github.com:scylladb/scylladb:
  doc: update the link to ABRT
  doc: fix broken links on the Scylla SStable page
2023-07-12 16:02:23 +03:00
Asias He
d3034e0fab view_update_generator: Increase the registration_queue_size
When repair writes a sstable to disk, we check if the sstable needs view
update processing. If yes, the sstable will be placed into the staging
dir for processing, with the _registration_sem semaphore to prevent too
many pending unprocessed sstables.

We have seen multiple cases in the field where view update processing is
inefficient and way too slow which blocks the base table repair to
finish on time.

This patch increases the registration_queue_size to a bigger number to
mitigate the problem that slow view update processing blocks repair.

It is better to have a consistent base table + inconsistent view table
than inconsistent base table + inconsistent view table.

Currently, sstables in the staging dir are not compacted, so we cannot
increase _registration_sem to too big a number, to avoid accumulating
too many sstables.

The view_build_test.cc is updated to make the test pass.

Closes #14241
2023-07-12 15:51:35 +03:00
Tomasz Grabiec
e8ee0a2f86 Merge 'group0_state_machine: use correct comparison for timeuuids in merger' from Kamil Braun
In d2a4079bbe, `merger` was modified so that when we merge a command, `last_group0_state_id` is taken to be the maximum of the merged command's state_id and the current `last_group0_state_id`. This is necessary for achieving the same behavior as if the commands were applied individually instead of being merged -- where we take the maximum state ID from `group0_history` table which was applied until now (because the table is sorted using the state IDs and we take the greatest row).

However, a subtle bug was introduced -- the `std::max` function uses the `utils::UUID` standard comparison operator which is unfortunately not the same as timeuuid comparison that Scylla performs when sorting the `group0_history` table. So in rare cases it could return the *smaller* of the two timeuuids w.r.t. the correct timeuuid ordering. This would then lead to commands being applied which should have been turned to no-ops due to the `prev_state_id` check -- and then, for example, permanent schema desync or worse.

Fix it by using the correct comparison method.

Fixes: #14600

Closes #14616

* github.com:scylladb/scylladb:
  utils/UUID: reference `timeuuid_tri_compare` in `UUID::operator<=>` comment
  group0_state_machine: use correct comparison for timeuuids in `merger`
  utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&`
  utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t*`
2023-07-12 14:48:18 +02:00
Botond Dénes
296837120d db: move virtual tables into virtual_tables.cc
The definitions of virtual tables make up approximately a quarter of the
huge system_keyspace.cc file (almost 4K lines), pulling in a lot of
headers only used by them.
Move them to a separate source file to make system_keyspace.cc easier
for humans and compilers to digest.
This patch also moves the `register_virtual_tables()`,
`install_virtual_readers()` as well as the `virtual_tables` global.

Closes #14308
2023-07-12 15:26:54 +03:00
Anna Stuchlik
a414ac8fde doc: update the link to ABRT 2023-07-12 14:13:42 +02:00
Kefu Chai
8f31f28446 build: cmake: add test/raft tests
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14656
2023-07-12 15:06:59 +03:00
Kamil Braun
820d7e9520 test: test_node_isolation: use ManagerClient.con_gen to create CQL connection
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.

The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.

At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
2023-07-12 12:34:02 +02:00
Kefu Chai
20c7b6057b test: silence the deprecation warning.
because `lw_shared_ptr::operator=(T&&)` was deprecated, we started to
see the following warning:

```
/home/kefu/dev/scylladb/test/boost/statement_restrictions_test.cc:394:41: warning: 'operator=' is deprecated: call make_lw_shared<> and assign the result instead [-Wdeprecated-declarations]
  394 |         definition.column_specification = std::move(specification);
      |                                         ^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:346:7: note: 'operator=' has been explicitly marked deprecated here
  346 |     [[deprecated("call make_lw_shared<> and assign the result instead")]]
      |       ^
1 warning generated.
```

so, in this change, we use the recommended way to update a lw_shared_ptr.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14648
2023-07-12 13:10:33 +03:00
Kamil Braun
3464877276 test: manager_client: make con_gen for ManagerClient.__init__ nonoptional
`ManagerClient` is given a function that is used to create CQL
connections to the Scylla cluster. For some reason it was typed as
`Optional` even though it was never passed `None`. Fix it.
2023-07-12 11:44:15 +02:00
Kefu Chai
5443bf69f7 storage_proxy: print the expected ex.what()
before this change, the format string contains two placeholders,
but only one extra argument is passed in. if we actually format
this logging message, fmtlib would throw.

after this change, we pass the exception's error message as yet
another argument.

this logging message is printed at "trace" level, which is probably
why we haven't seen the exception thrown by fmtlib.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14628
2023-07-12 12:34:51 +03:00
Nadav Har'El
a4087f58df alternator: fix error path for size() function on constants
The DynamoDB documentation for the size() function claims that it only
works on paths (attribute names or references), but it actually works on
constants from the query (e.g., ":val") as well.

It turns out that Alternator supports this undocumented case already, but
gets the error path wrong: Usually, when size() is calculated on the data,
if the data has the wrong type of size() (e.g., an integer), the condition
simply doesn't match. But if the value comes from the query - it should
generate an error that the query is wrong - ValidationException.

This patch fixes this case, and also adds tests for it that pass on both
DynamoDB and Alternator (after this patch).

Fixes #14592

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14593
2023-07-12 12:29:05 +03:00
Pavel Emelyanov
eb549234b0 scylla-gdb: Fix tables filtering
There's a -k|--keyspace argument to the tables command that's supposed to
filter tables belonging to a specific keyspace, but it doesn't work. Fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14634
2023-07-12 12:26:25 +03:00
Avi Kivity
0fc067a54c build: add -Wimplicit-fallthrough to cmake
In 0cabf4eeb9 ("build: disable implicit fallthrough"), we added
-Wimplicit-fallthrough to configure.py, but forgot to add it to cmake.

Closes #14629
2023-07-12 12:24:22 +03:00
Nadav Har'El
f08bc83cb2 cql-pytest: translate Cassandra's tests for CAST operations
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.

There are 13 tests, 9 of them currently xfail.

The failures are caused by one recently-discovered issue:

Refs #14501: Cannot Cast Counter To Double

and by three previously unknown or undocumented issues:

Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
             milliseconds
Refs #14522: Support CAST function not only in SELECT

Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14528
2023-07-12 11:42:04 +03:00
Nadav Har'El
599636b307 test/alternator: fix flaky test test_ttl_expiration_gsi_lsi
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!

So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.

This patch also fixes a small bug which prevented this test from running
on DynamoDB.

Fixes #14495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14496
2023-07-12 11:23:12 +03:00
Botond Dénes
968421a3e0 Merge 'Stop task manager compaction module properly' from Aleksandra Martyniuk
Due to wrong order of stopping of compaction services, shutdown needs
to wait until all compactions are complete, which may take really long.

Moreover, the test version of the compaction manager does not abort the task
manager, which is tightly bound to it, but only stops its compaction module. This results
in tests waiting for compaction task manager's tasks to be unregistered,
which never happens.

Stopping and aborting of the compaction manager and the task manager's
compaction module are now performed in the proper order.

Closes #14461

* github.com:scylladb/scylladb:
  tasks: test: abort task manager when wrapped_compaction_manager is destructed
  compaction: swap compaction manager stopping order
  compaction: modify compaction_manager::stop()
2023-07-12 09:54:00 +03:00
Avi Kivity
118fa59ba8 tools: add cqlsh shortcut
Add bin/cqlsh as a shortcut to tools/cqlsh/bin/cqlsh, intended for
developers.

Closes #14362
2023-07-12 09:36:59 +03:00
Pavel Emelyanov
033e5348aa scylla-gdb: Print all clients from all idx's
The scylla netw command prints clients from index [0] only, but there
are more of them on the messaging service. Print them all.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14633
2023-07-12 09:29:02 +03:00
Botond Dénes
c5cb23a825 Merge 'Add scylla table to scylla-gdb' from Pavel Emelyanov
The command is to print interesting and/or hard-to-get-by-hand info about individual tables

Closes #14635

* github.com:scylladb/scylladb:
  test: Add 'scylla table' cmd test
  scylla-gdb: Print table phased barriers
  scylla-gdb: Add 'table' command
2023-07-12 09:26:59 +03:00
Kefu Chai
cca8db5f03 build: cmake: build SEASTAR tests as SEASTAR tests
both tagged_integer_test and tablets_test are driven by
"scylla_test_case", and they use seastar threads. let's make them
"SEASTAR" tests, so they can link against the libraries they use.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-12 12:21:14 +08:00
Kefu Chai
bfe169a41c build: cmake: error out if found unknown keywords
this should help to identify errors from passing
wrong keywords to this function.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-12 12:21:14 +08:00
Kefu Chai
7c6ecb1c54 build: cmake: link tests against necessary libraries
* link alternator_unit_test against alternator
* link schema_loader_test against tools

since alternator_unit_test references the symbols defined by alternator, let's
link the test against the library. otherwise, we'd have the following
link failure:

```
FAILED: test/boost/alternator_unit_test
: && /usr/bin/clang++ -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -fprofile-instr-generate="/home/kefu/dev/scylladb/build/cmake/profiles/%m.profraw"  -O0 -g -gz -Wl,--build-id=sha1 -fuse-ld=lld -fprofile-instr-generate="/home/kefu/dev/scylladb/build/cmake/profiles/%m.profraw" test/boost/CMakeFiles/alternator_unit_test.dir/alternator_unit_test.cc.o -o test/boost/alternator_unit_test  test/lib/libtest-lib.a  seastar/libseastar.a  /usr/lib64/libxxhash.so  /usr/lib64/libboost_unit_test_framework.so.1.78.0  libscylla-main.a  /usr/lib64/libabsl_hash.so.2206.0.0  /usr/lib64/libabsl_city.so.2206.0.0  /usr/lib64/libabsl_bad_variant_access.so.2206.0.0  /usr/lib64/libabsl_low_level_hash.so.2206.0.0  -Xlinker --push-state -Xlinker --whole-archive  auth/libscylla_auth.a  -Xlinker --pop-state  /usr/lib64/libcrypt.so  cdc/libcdc.a  compaction/libcompaction.a  mutation_writer/libmutation_writer.a  -Xlinker --push-state -Xlinker --whole-archive  dht/libscylla_dht.a  -Xlinker --pop-state  gms/libgms.a  types/libtypes.a  index/libindex.a  -Xlinker --push-state -Xlinker --whole-archive  locator/libscylla_locator.a  -Xlinker --pop-state  sstables/libsstables.a  /usr/lib64/libz.so  readers/libreaders.a  schema/libschema.a  -Xlinker --push-state -Xlinker --whole-archive  tracing/libscylla_tracing.a  -Xlinker --pop-state  service/libservice.a  -lsystemd  raft/libraft.a  repair/librepair.a  streaming/libstreaming.a  replica/libreplica.a  db/libdb.a  mutation/libmutation.a  data_dictionary/libdata_dictionary.a  cql3/libcql3.a  transport/libtransport.a  cql3/libcql3.a  transport/libtransport.a  lang/liblang.a  /usr/lib64/liblua-5.4.so  -lm  /usr/lib64/libsnappy.so.1.1.9  /usr/lib64/libabsl_raw_hash_set.so.2206.0.0  /usr/lib64/libabsl_bad_optional_access.so.2206.0.0  
/usr/lib64/libabsl_hashtablez_sampler.so.2206.0.0  /usr/lib64/libabsl_exponential_biased.so.2206.0.0  /usr/lib64/libabsl_synchronization.so.2206.0.0  /usr/lib64/libabsl_graphcycles_internal.so.2206.0.0  /usr/lib64/libabsl_stacktrace.so.2206.0.0  /usr/lib64/libabsl_symbolize.so.2206.0.0  /usr/lib64/libabsl_malloc_internal.so.2206.0.0  /usr/lib64/libabsl_debugging_internal.so.2206.0.0  /usr/lib64/libabsl_demangle_internal.so.2206.0.0  /usr/lib64/libabsl_time.so.2206.0.0  /usr/lib64/libabsl_strings.so.2206.0.0  /usr/lib64/libabsl_int128.so.2206.0.0  /usr/lib64/libabsl_throw_delegate.so.2206.0.0  /usr/lib64/libabsl_strings_internal.so.2206.0.0  /usr/lib64/libabsl_base.so.2206.0.0  /usr/lib64/libabsl_spinlock_wait.so.2206.0.0  /usr/lib64/libabsl_raw_logging_internal.so.2206.0.0  /usr/lib64/libabsl_log_severity.so.2206.0.0  /usr/lib64/libabsl_civil_time.so.2206.0.0  /usr/lib64/libabsl_time_zone.so.2206.0.0  rust/libwasmtime_bindings.a  rust/rust-debug/librust_combined.a  /usr/lib64/libdeflate.so  utils/libutils.a  seastar/libseastar.a  /usr/lib64/libboost_program_options.so  /usr/lib64/libboost_thread.so  /usr/lib64/libboost_chrono.so  /usr/lib64/libboost_atomic.so  /usr/lib64/libcares.so  /usr/lib64/libcryptopp.so  /usr/lib64/libfmt.so.9.1.0  /usr/lib64/liblz4.so  /usr/lib64/libgnutls.so  -latomic  /usr/lib64/libsctp.so  /usr/lib64/libyaml-cpp.so  -fsanitize=address  -fsanitize=undefined  -fno-sanitize=vptr  /usr/lib64/libhwloc.so  //usr/lib64/liburing.so  /usr/lib64/libnuma.so  /usr/lib64/libxxhash.so  -lcryptopp  /usr/lib64/libboost_regex.so.1.78.0  /usr/lib64/libicui18n.so  /usr/lib64/libicuuc.so  -ldl && :
ld.lld: error: undefined symbol: alternator::internal::get_magnitude_and_precision(std::basic_string_view<char, std::char_traits<char>>)
>>> referenced by alternator_unit_test.cc:148 (/home/kefu/dev/scylladb/test/boost/alternator_unit_test.cc:148)
>>>               test/boost/CMakeFiles/alternator_unit_test.dir/alternator_unit_test.cc.o:(test_magnitude_and_precision::test_method())
>>> referenced by alternator_unit_test.cc:158 (/home/kefu/dev/scylladb/test/boost/alternator_unit_test.cc:158)
>>>               test/boost/CMakeFiles/alternator_unit_test.dir/alternator_unit_test.cc.o:(test_magnitude_and_precision::test_method())
>>> referenced by alternator_unit_test.cc:160 (/home/kefu/dev/scylladb/test/boost/alternator_unit_test.cc:160)
>>>               test/boost/CMakeFiles/alternator_unit_test.dir/alternator_unit_test.cc.o:(test_magnitude_and_precision::test_method())
>>> referenced 2 more times
```

also, schema_loader_test references tools::load_schemas(). let's
link the test against the library. otherwise, we'd have the following
link failure:

```
FAILED: test/boost/schema_loader_test
: && /usr/bin/clang++ -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -fprofile-instr-generate="/home/kefu/dev/scylladb/build/cmake/profiles/%m.profraw"  -O0 -g -gz -Wl,--build-id=sha1 -fuse-ld=lld -fprofile-instr-generate="/home/kefu/dev/scylladb/build/cmake/profiles/%m.profraw" test/boost/CMakeFiles/schema_loader_test.dir/schema_loader_test.cc.o -o test/boost/schema_loader_test  test/lib/libtest-lib.a  seastar/libseastar.a  /usr/lib64/libxxhash.so  seastar/libseastar_testing.a  libscylla-main.a  /usr/lib64/libabsl_hash.so.2206.0.0  /usr/lib64/libabsl_city.so.2206.0.0  /usr/lib64/libabsl_bad_variant_access.so.2206.0.0  /usr/lib64/libabsl_low_level_hash.so.2206.0.0  -Xlinker --push-state -Xlinker --whole-archive  auth/libscylla_auth.a  -Xlinker --pop-state  /usr/lib64/libcrypt.so  cdc/libcdc.a  compaction/libcompaction.a  mutation_writer/libmutation_writer.a  -Xlinker --push-state -Xlinker --whole-archive  dht/libscylla_dht.a  -Xlinker --pop-state  gms/libgms.a  types/libtypes.a  index/libindex.a  -Xlinker --push-state -Xlinker --whole-archive  locator/libscylla_locator.a  -Xlinker --pop-state  sstables/libsstables.a  /usr/lib64/libz.so  readers/libreaders.a  schema/libschema.a  -Xlinker --push-state -Xlinker --whole-archive  tracing/libscylla_tracing.a  -Xlinker --pop-state  service/libservice.a  -lsystemd  raft/libraft.a  repair/librepair.a  streaming/libstreaming.a  replica/libreplica.a  db/libdb.a  mutation/libmutation.a  data_dictionary/libdata_dictionary.a  cql3/libcql3.a  transport/libtransport.a  cql3/libcql3.a  transport/libtransport.a  lang/liblang.a  /usr/lib64/liblua-5.4.so  -lm  /usr/lib64/libsnappy.so.1.1.9  /usr/lib64/libabsl_raw_hash_set.so.2206.0.0  /usr/lib64/libabsl_bad_optional_access.so.2206.0.0  
/usr/lib64/libabsl_hashtablez_sampler.so.2206.0.0  /usr/lib64/libabsl_exponential_biased.so.2206.0.0  /usr/lib64/libabsl_synchronization.so.2206.0.0  /usr/lib64/libabsl_graphcycles_internal.so.2206.0.0  /usr/lib64/libabsl_stacktrace.so.2206.0.0  /usr/lib64/libabsl_symbolize.so.2206.0.0  /usr/lib64/libabsl_malloc_internal.so.2206.0.0  /usr/lib64/libabsl_debugging_internal.so.2206.0.0  /usr/lib64/libabsl_demangle_internal.so.2206.0.0  /usr/lib64/libabsl_time.so.2206.0.0  /usr/lib64/libabsl_strings.so.2206.0.0  /usr/lib64/libabsl_int128.so.2206.0.0  /usr/lib64/libabsl_throw_delegate.so.2206.0.0  /usr/lib64/libabsl_strings_internal.so.2206.0.0  /usr/lib64/libabsl_base.so.2206.0.0  /usr/lib64/libabsl_spinlock_wait.so.2206.0.0  /usr/lib64/libabsl_raw_logging_internal.so.2206.0.0  /usr/lib64/libabsl_log_severity.so.2206.0.0  /usr/lib64/libabsl_civil_time.so.2206.0.0  /usr/lib64/libabsl_time_zone.so.2206.0.0  rust/libwasmtime_bindings.a  rust/rust-debug/librust_combined.a  /usr/lib64/libdeflate.so  utils/libutils.a  /usr/lib64/libxxhash.so  -lcryptopp  /usr/lib64/libboost_regex.so.1.78.0  /usr/lib64/libicui18n.so  /usr/lib64/libicuuc.so  /usr/lib64/libboost_unit_test_framework.so.1.78.0  seastar/libseastar.a  /usr/lib64/libboost_program_options.so  /usr/lib64/libboost_thread.so  /usr/lib64/libboost_chrono.so  /usr/lib64/libboost_atomic.so  /usr/lib64/libcares.so  /usr/lib64/libcryptopp.so  /usr/lib64/libfmt.so.9.1.0  /usr/lib64/liblz4.so  -ldl  /usr/lib64/libgnutls.so  -latomic  /usr/lib64/libsctp.so  /usr/lib64/libyaml-cpp.so  -fsanitize=address  -fsanitize=undefined  -fno-sanitize=vptr  /usr/lib64/libhwloc.so  //usr/lib64/liburing.so  /usr/lib64/libnuma.so  /usr/lib64/libboost_unit_test_framework.so && :
ld.lld: error: undefined symbol: tools::load_schemas(std::basic_string_view<char, std::char_traits<char>>)
>>> referenced by schema_loader_test.cc:14 (/home/kefu/dev/scylladb/test/boost/schema_loader_test.cc:14)
>>>               test/boost/CMakeFiles/schema_loader_test.dir/schema_loader_test.cc.o:(test_empty::do_run_test_case() const)
>>> referenced by schema_loader_test.cc:15 (/home/kefu/dev/scylladb/test/boost/schema_loader_test.cc:15)
>>>               test/boost/CMakeFiles/schema_loader_test.dir/schema_loader_test.cc.o:(test_empty::do_run_test_case() const)
>>> referenced by schema_loader_test.cc:19 (/home/kefu/dev/scylladb/test/boost/schema_loader_test.cc:19)
>>>               test/boost/CMakeFiles/schema_loader_test.dir/schema_loader_test.cc.o:(test_keyspace_only::do_run_test_case() const)
>>> referenced 21 more times
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-12 12:21:14 +08:00
Kamil Braun
dc6f6cb6b0 cql_test_env: load host ID from sstables after restart
Performance tests such as `perf-fast-forward` are executed in our CI
environments in two steps (two invocations of the `scylla` process):
first by populating data directories (with `--populate` option), then by
running the actual test.

These tests are using `cql_test_env`, which did not load the previously
saved (in the populate step) Host ID of this node, but generated a new
one randomly instead.

In b39ca97919 we enabled
`consistent_cluster_management` by default. This caused the perf tests
to hang in `setup_group0` at `read_barrier` step. That's because Raft
group 0 was initialized with old configuration -- the one created during
the populate step -- but the Raft server was started with a newly
generated Host ID (which is used as the server's Raft ID), so the server
considered itself as being outside the configuration.

Fix this by reloading the Host ID from disk, simulating more closely the
behavior of main.cc initialization.

Fixes #14599

Closes #14640
2023-07-11 23:30:44 +03:00
Avi Kivity
1545ae2d3b Merge 'Make SSTable cleanup more efficient by fast forwarding to next owned range' from Raphael "Raph" Carvalho
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.

That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.

To solve both problems, cleanup will now use a multi range reader, guaranteeing that it will only process the owned data and as a result skip unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This significantly reduces the amount of index data cached, as the index will only be invoked at each range boundary instead.

Without further ado,

before:

`INFO  2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`

after:

`INFO  2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`

Fixes #12998.
Fixes #14317.

Closes #14469

* github.com:scylladb/scylladb:
  test: Extend cleanup correctness test to cover more cases
  compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
  sstables: Close SSTable reader if index exhaustion is detected in fast forward call
  sstables: Simplify sstable reader initialization
  compaction: Extend make_sstable_reader() interface to work with mutation_source
  test: Extend sstable partition skipping test to cover fast forward using token
2023-07-11 23:28:15 +03:00
Avi Kivity
9cdae78d04 test: expr_test: add copyright/license
Closes #14613
2023-07-11 21:45:27 +03:00
Raphael S. Carvalho
60ba1d8b47 test: Extend cleanup correctness test to cover more cases
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-11 13:56:24 -03:00
Raphael S. Carvalho
8d58ff1be6 compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.

That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in its owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.

To solve both problems, cleanup will now use a multi range reader, to guarantee
that it will only process the owned data and as a result skip unowned data.
This results in cleanup scanning an owned range and then fast forwarding to the
next one, until it's done with them all. This significantly reduces the amount
of index data cached, as the index will only be invoked at each range
boundary instead.

Without further ado,

before:

... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.

after:

... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.

Fixes #12998.
Fixes #14317.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-11 13:56:24 -03:00
Raphael S. Carvalho
1fefe597e6 sstables: Close SSTable reader if index exhaustion is detected in fast forward call
When wiring multi range reader with cleanup, I found that cleanup
wouldn't be able to release disk space of input SSTables earlier.

The reason is that the multi range reader fast forwards to the next range,
therefore it enables mutation_reader::forwarding, and as a result, the
combined reader cannot release readers proactively, as it cannot tell
for sure that the underlying reader is exhausted. It may have reached
EOS for the current range, but it may have data for the next one.

The concept of EOS actually only applies to the current range being
read. A reader that returned EOS will actually get out of this
state once the combined reader fast forwards to the next range.

Therefore, only the underlying reader, i.e. the sstable reader,
can know for certain that the data source is completely exhausted,
given that tokens are read in monotonically increasing order.

For reversed reads that's not true, but fast forwarding to a range
is not actually supported for them yet.

Today, the SSTable reader already knows that the underlying SSTable
was exhausted in fast_forward_to(), after it calls index_reader's
advance_to(partition_range), therefore it disables subsequent
reads. We can take a step further and also check that the index
was exhausted, i.e. reached EOF.

So if the index is exhausted, and there's no partition to read
after the fast_forward_to() call, we know that there's nothing
left to do in this reader, and therefore the reader can be
closed proactively, allowing the disk space of SSTable to be
reclaimed if it was already deleted.

We can see that the combined reader, under the multi range reader,
will incrementally find a disjoint set of SSTables exhausted,
as it fast forwards to owned ranges:

1:
INFO  2023-07-05 10:51:09,570 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-4525396453480898112, start},{-4525396453480898112, end}]
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db, start == *end, eof ? true
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - closing reader 0x60100029d800 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-3-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false

2:
INFO  2023-07-05 10:51:09,572 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-2253424581619911583, start},{-2253424581619911583, end}]
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db, start == *end, eof ? true
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - closing reader 0x60100029d400 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO  2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false

And so on.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-11 13:56:24 -03:00
Raphael S. Carvalho
f08a4eaacb sstables: Simplify sstable reader initialization
It's odd that we see things like:

    if (!is_initialized()) {
        return initialize().then([this] {
            if (!is_initialized()) {

    and

    return ensure_initialized().then([this, &pr] {
        if (!is_initialized()) {

One might think initialize will actually initialize the reader by
setting up context, and ensure_initialized() will even have stronger
guarantees, meaning that the reader must be initialized by it.

But neither is true.

In the context of a single-partition read, it can happen that initialize()
does not set up the context, meaning is_initialized() returns false,
which is why initialization must be checked even after we call
ensure_initialized().

Let's merge ensure_initialized() and initialize() into a
maybe_initialize() which returns a boolean saying if the reader
is initialized.

It makes the code initializing the reader easier to understand.
2023-07-11 13:56:23 -03:00
Michał Chojnowski
b511d57fc8 Revert "Merge 'Compaction resharding tasks' from Aleksandra Martyniuk"
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.

After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.

Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.

Fixes #14475
Fixes #14618

Closes #14641
2023-07-11 19:11:37 +03:00
Calle Wilund
e1a52af69e messaging_service: Do TLS init early
Fixes #14299

failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). TLS needs to be initialized before this, so do it on service creation.

Closes #14493
2023-07-11 18:19:01 +03:00
Alejo Sanchez
7b621617a9 test/pylib: move async query wrapper to common module
Move async query wrapper code out of topology and into its own module.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-07-11 15:22:00 +02:00
Kefu Chai
b4dc3f7cd9 scylla-gdb: add sstable::generation_type printer
to inspect the sstable generation after uuid-based generation
change. in this change:

* a pretty printer for sstable::generation_type is added
* now that the pretty printer for the generation_type is registered,
  we can just leverage it when printing the sstable name: instead
  of checking if the `_generation` member variable contains
  `_value`, we just delegate to `str()`, which is used by
  `str.format()`. `str()` behaves like the gdb `print` command: it
  calls `value.format_string()`, which in turn calls into
  `to_string()` if the "value" in question has a pretty printer.

after this change, the printer is able to print both the generations
before the uuid change and the ones after the change.

a typical gdb session looks like:

```
(gdb) p generation._value
$5 = f0770b40-1c7c-11ee-b136-bf28f8d18b88
(gdb) p generation
$10 = 3g7g_0bu7_0jpvk2p0mmtlsb8lu0
(gdb) p/x generation._value.least_sig_bits
$7 = 0xb136bf28f8d18b88
(gdb) p/x generation._value.most_sig_bits
$8 = 0xf0770b401c7c11ee
```

if we use `scripts/base36-uuid.py` to encode
the msb and lsb, we'd need to:
```console
scripts/base36-uuid.py -e 0xf0770b401c7c11ee 0xb136bf28f8d18b88
3g7g_0bu7_0jpvk2p0mmtlsb8lu0
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14561
2023-07-11 15:56:20 +03:00
Raphael S. Carvalho
3b1829f0d8 compaction: base compaction throughput on amount of data read
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.

A good example is a compaction which expires 99% of the data;
today throughput would be calculated on the 1% written, which
misleads the reader into thinking that compaction was terribly
slow.

Fixes #14533.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14615
2023-07-11 15:48:05 +03:00
Kefu Chai
25f4a7c400 sstables: format using format string
instead of concatenating strings, let's format using the builtin
support of `log::debug()`. for two reasons:

1. better performance: after this change, we don't need to
   materialize the concatenated string if the "debug" level logging
   is not enabled. seastar::log only formats when a certain log
   level is enabled.
2. better readability. with the format string, it is clear what
   is the fixed part, and which arguments are to be formatted.
   this also helps us to move to compile-time formatting check,
   as fmtlib requires the caller to be explicit when it wants
   to use runtime format string.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14627
2023-07-11 15:31:20 +03:00
Pavel Emelyanov
5518502085 test: Add 'scylla table' cmd test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-11 15:12:43 +03:00
Pavel Emelyanov
2c2ad09d3c scylla-gdb: Print table phased barriers
These barriers show if there's any operation in progress (read, write,
flush or stream). These are crucial to know if stopping fails, e.g. see
issue #13100

These barriers are summarized in the 'scylla memory' command, but they are
also good to know on a per-table basis

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-11 15:10:47 +03:00
Pavel Emelyanov
1948b8fa17 scylla-gdb: Add 'table' command
There's 'scylla tables' one that lists tables on the given/current
shard, but the list is unable to show lots of information. It prints the
table address so it can be explored by hand, but some data is handier
to parse and print with the script

The syntax is

  $ scylla table ks.cf

For now just print the schema version. To be extended in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-11 15:08:55 +03:00
Botond Dénes
bc5174ced6 Merge 'doc: move the package installation instructions to the documentation' from Anna Stuchlik
Refs: https://github.com/scylladb/scylla-docs/issues/4091
Fixes https://github.com/scylladb/scylla-docs/issues/3419

This PR moves the installation instructions from the [website](https://www.scylladb.com/download/) to the documentation. Key changes:
- The instructions are mostly identical, so they were squeezed into one page with different tabs.
- I've merged the info for Ubuntu and Debian, as well as CentOS and RHEL.
- The page uses variables that should be updated each release (at least for now).
- The Java requirement was updated from Java 8 to Java 11 following [this issue](https://github.com/scylladb/scylla-docs/issues/3419).
- In addition, the title of the Unified Installer page has been updated to communicate better about its contents.

Closes #14504

* github.com:scylladb/scylladb:
  doc: update the prerequisites section
  doc: improve the title of the Unified Installer page
  doc: move package install instructions to the docs
2023-07-11 14:30:11 +03:00
Kamil Braun
051728318d utils/UUID: reference timeuuid_tri_compare in UUID::operator<=> comment 2023-07-11 13:19:50 +02:00
Avi Kivity
f26e36f448 Update seastar submodule
* seastar 2b7a341210...bac344d584 (3):
  > tls: Export error_category instance used by tls + some common error codes
  > reactor: cast enum to int when formatting it
  > cooking: bump up zlib to 1.2.13
2023-07-11 13:24:32 +03:00
Kamil Braun
5779230d28 group0_state_machine: use correct comparison for timeuuids in merger
In d2a4079bbe, `merger` was modified so
that when we merge a command, `last_group0_state_id` is taken to be the
maximum of the merged command's state_id and the current
`last_group0_state_id`. This is necessary for achieving the same
behavior as if the commands were applied individually instead of being
merged -- where we take the maximum state ID from `group0_history` table
which was applied until now (because the table is sorted using the state
IDs and we take the greatest row).

However, a subtle bug was introduced -- the `std::max` function uses the
`utils::UUID` standard comparison operator which is unfortunately not
the same as timeuuid comparison that Scylla performs when sorting the
`group0_history` table. So in rare cases it could return the *smaller*
of the two timeuuids w.r.t. the correct timeuuid ordering. This would
then lead to commands being applied which should have been turned to
no-ops due to the `prev_state_id` check -- and then, for example,
permanent schema desync or worse.

Fix it by using the correct comparison method.

Fixes: #14600
2023-07-11 11:48:02 +02:00
Kamil Braun
5ce802676f utils/UUID: introduce timeuuid_tri_compare for const UUID&
The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte
buffers. Introduce a version which operates directly on the
`utils::UUID` type.

To reuse existing comparison code, we serialize to a buffer before
comparing. But we avoid allocations by using `std::array`. Since the
serialized size needs to be known at compile time for `std::array`, mark
`UUID::serialized_size()` as `constexpr`.
2023-07-11 11:48:02 +02:00
Kamil Braun
668beedadc utils/UUID: introduce timeuuid_tri_compare for const int8_t*
`timeuuid_tri_compare` takes `bytes_view` parameters and converts them
to `const int8_t*` before comparing.

Extract the part that operates on `const int8_t*` to separate function
which we will reuse in a later commit.
2023-07-11 11:48:02 +02:00
Kefu Chai
ef78b31b43 s3/client: add tagging ops
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with tags, and
filter them accordingly.

* test/pylib/minio_server.py: enable anonymous user to perform
  more actions. because the tagging related ops are not enabled by
  "mc anonymous set public", we have to enable them using "set-json"
  subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14486
2023-07-11 09:30:46 +03:00
Kefu Chai
3b6e37051b build: cmake: add more tests to CMake
to be in-sync with configure.py

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14479
2023-07-11 09:21:26 +03:00
Botond Dénes
37dd2503ff Merge 'replica,sstable: do not assign a value to a shared_ptr' from Kefu Chai
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.

unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed to by the shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as a shared_ptr should behave
like a pointer rather than the value pointed to by it. and this
could be error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, expecting that the value pointed to by `p` is cleared
and that all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, but the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.

this behavior is not expected. so before deprecating and removing
this operator, let's stop using it.

in this change, we update two caller sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new
pointee in place, a new instance of lw_shared_ptr is
created and assigned to the existing shared_ptr.

Closes #14470

* github.com:scylladb/scylladb:
  sstables: use try_emplace() when appropriate
  replica,sstable: do not assign a value to a shared_ptr
2023-07-11 09:19:48 +03:00
Kefu Chai
0dca0a7f27 build: cmake: include pretty_printers.cc in util
we added pretty_printers.cc back in
83c70ac04f, in which configure.py is
updated. so let's sync the CMake building system accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14442
2023-07-11 09:16:33 +03:00
Pavel Emelyanov
2eebb1312e scylla-gdb: Format IPs with network byte order
The scylla netw command prints connections IPs reversed:

(gdb) scylla netw
Dropped messages: {0, 0, 0, 1, 0 <repeats 15 times>, 1, 0 <repeats 41 times>}
Outgoing connections:
IP: 31.0.142.10, (netw::messaging_service::rpc_protocol_client_wrapper*) 0x600008d6d490:
  stats: {replied = 0, pending = 0, exception_received = 0, sent_messages = 1192, wait_reply = 0, timeout = 0}
  outstanding: 0

It should unpack the address as if it was in big-endian to have it like

(gdb) scylla netw
Dropped messages: {0, 0, 0, 1, 0 <repeats 15 times>, 1, 0 <repeats 41 times>}
Outgoing connections:
IP: 10.142.0.31, (netw::messaging_service::rpc_protocol_client_wrapper*) 0x600008d6d490:
  stats: {replied = 0, pending = 0, exception_received = 0, sent_messages = 1192, wait_reply = 0, timeout = 0}
  outstanding: 0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14611
2023-07-11 09:12:12 +03:00
Raphael S. Carvalho
bd50943270 compaction: Extend make_sstable_reader() interface to work with mutation_source
As the goal is to make compaction filter to the next owned range,
make_sstable_reader() should be extended to create a reader with
parameters forwarded from mutation_source interface, which will
be used when wiring cleanup with multi range reader.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-10 17:19:30 -03:00
Avi Kivity
2de168e568 dist: sysctl: increase vm.vfs_cache_pressure
Our usage of inodes is dual:

 - the Index.db and Data.db components are pinned in memory as
   the files are open
 - all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.

Ref https://github.com/scylladb/scylladb/issues/14506

Closes #14509
2023-07-10 21:24:57 +03:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.

Closes #14607
2023-07-10 19:36:06 +02:00
Avi Kivity
d645e7a515 Update seastar submodule
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.

* seastar 99d28ff057...2b7a341210 (23):
  > Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes #8828.
  > reactor: use structured binding when appropriate
  > Simplify payload length and mask parsing.
  > memcached: do not used deprecated API
  > build: serialize calls to openssl certificate generation
  > reactor: epoll backend: initialize _highres_timer_pending
  > shared_ptr: deprecate lw_shared_ptr operator=(T&&)
  > tests: fail spawn_test if output is empty
  > Support specifying the "build root" in configure
  > Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
  > build: correct the syntax error in comment
  > util: print_safe: fix hex print functions
  > Add code examples for handling exceptions
  > smp: warn if --memory parameter is not supported
  > Merge 'gate: track holders' from Benny Halevy
  > file: call lambda with std::invoke()
  > deleter: Delete move and copy constructors
  > file: fix the indent
  > file: call close() without the syscall thread
  > reactor: use s/::free()/::io_uring_free_probe()/
  > Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
  > reactor: Don't re-evaliate local reactor for thread_pool
  > Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov

Closes #14602
2023-07-10 16:07:12 +03:00
Kamil Braun
3d58e8e424 Revert "cql3: Extend the scope of group0_guard during DDL statement execution"
This reverts commit c42a91ec72.

A significant performance regression was observed due to this change.

From Avi:
> perf-simple-query --smp 1
>
> before:
>
> 216489.88 tps ( 61.1 allocs/op, 13.1 tasks/op, 43558 insns/op, 0 errors)
> 217708.69 tps ( 61.1 allocs/op, 13.1 tasks/op, 43542 insns/op, 0 errors)
> 219495.02 tps ( 61.1 allocs/op, 13.1 tasks/op, 43538 insns/op, 0 errors)
> 216863.84 tps ( 61.1 allocs/op, 13.1 tasks/op, 43567 insns/op, 0 errors)
> 218936.48 tps ( 61.1 allocs/op, 13.1 tasks/op, 43546 insns/op, 0 errors)
>
> after:
>
> 201773.52 tps ( 63.1 allocs/op, 15.1 tasks/op, 44600 insns/op, 0 errors)
> 210875.48 tps ( 63.1 allocs/op, 15.1 tasks/op, 44558 insns/op, 0 errors)
> 210186.55 tps ( 63.1 allocs/op, 15.1 tasks/op, 44588 insns/op, 0 errors)
> 211021.76 tps ( 63.1 allocs/op, 15.1 tasks/op, 44569 insns/op, 0 errors)
> 208597.52 tps ( 63.1 allocs/op, 15.1 tasks/op, 44587 insns/op, 0 errors)
>
> Two extra allocations, two extra tasks, 1k extra instructions, for
> something that is DDL only.

Fixes #14590
2023-07-10 13:20:49 +02:00
Pavel Emelyanov
ec292721d6 open_coredump: Add --scylla-build-id CLI option
The script gets the build id on its own by eu-unstrip-ing the core file
and searching for the necessary value in the output. This can be a
somewhat lengthy operation, especially on huge core files. Sometimes
(e.g. in tests) the build id is known and can just be provided as an
argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14574
2023-07-10 11:17:54 +03:00
Tomasz Grabiec
65a5942ec0 Merge 'Fix bootstrap "wait for UP/NORMAL nodes" to handle ignored nodes, recently replaced nodes, and recently changed IPs' from Kamil Braun
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as a parameter and is modified appropriately; it now calculates
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).
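The recalculate-on-each-retry loop described above can be sketched as follows (hypothetical helper names; the real code polls gossiper state with abort support):

```python
import time

def wait_for_normal_state_handled(pending_nodes, timeout=60.0, poll=0.01):
    """pending_nodes: a callable returning the current set of NORMAL/shutdown
    gossiper entries whose handle_state_normal hasn't finished yet. It is
    recomputed on every retry, so entries garbage-collected from gossiper
    state simply drop out of the wait."""
    deadline = time.monotonic() + timeout
    while pending := pending_nodes():
        if time.monotonic() > deadline:
            raise TimeoutError(f"normal state not handled for {pending}")
        time.sleep(poll)
```

The key difference from the old code is that no static `sync_nodes` set is computed up front; the condition is re-evaluated from live state on each iteration.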

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
2023-07-10 00:28:20 +02:00
Kefu Chai
1eb76d93b7 streaming: cast the progress to a float before formatting it
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.

so, let's make the percentage a float before formatting it.

Fixes #14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14588
2023-07-10 00:00:40 +03:00
Kefu Chai
894039d444 build: drop the warning on -O0 might fail tests
Michał Chojnowski noted that this is not true. -O0 almost doubles
the run time of `./test.py --mode=debug`. but it does not fail
any of the tests.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14456
2023-07-09 23:23:12 +03:00
Avi Kivity
850d759fd9 Merge 'repair: optimise repair reader with different shard count' from Gusev Petr
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair we iterate over all (shard, table, range), read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for small num_tokens (16 in the tests) the number of affected ranges (those we need to consider) amounts to the total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard; there are `parallel_for_each` calls in the code, but they are throttled by the chosen memory constraints and in fact execute sequentially.

When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If sharder parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard, this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contains data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly because of `smp::submit_to`.

In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).
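The filter-based path can be sketched with a toy sharder (all names here are hypothetical; the real `dht` sharder hashes tokens with a shard bias, and this path is chosen only when the estimated number of partitions is <=100):

```python
def owning_shard(token, smp):
    # Toy sharder: shard = token mod smp (the real one hashes the token).
    return token % smp

def repair_read_filtered(read_range, rng, master_shard, master_smp):
    """Optimized path: one read of the whole range, then filter rows by
    the master shard, instead of one fast_forward_to per subrange."""
    return [t for t in read_range(rng)
            if owning_shard(t, master_smp) == master_shard]

# Peer holds tokens 0..11; the master (--smp 4) repairs shard 1's data.
tokens = lambda rng: range(rng[0], rng[1])
assert repair_read_filtered(tokens, (0, 12), 1, 4) == [1, 5, 9]
```

The trade-off: the filter path reads rows it then discards, which is only cheap when the range holds few partitions — hence the <=100 estimate gate.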

This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:

```
with this patch
    20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
    20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859

without this patch
    21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
    21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```

Fixes: #14093

Closes #14178

* github.com:scylladb/scylladb:
  repair_test: add test_reader_with_different_strategies
  repair: extract repair_reader declaration into reader.hh
  repair_meta: get_estimated_partitions fix
  repair_meta: use multishard_filter reader if the number of partitions is small
  repair_meta: delay _repair_reader creation
  database.hh: make_multishard_streaming_reader with range parameter
  database.cc: extract streaming_reader_lifecycle_policy
2023-07-09 23:21:06 +03:00
Aleksandra Martyniuk
61dc98b276 api: prevent non-owner cpu access to shared_ptr
In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs
to sstables is passed to the reducer of map_reduce0. The reducer then
accesses these shared pointers. As the reducer is invoked on the same
shard map_reduce0 is called from, we have an illegal access to a shared
pointer on a non-owner cpu.

A set of shared pointers to sstables is transformed in the map function,
which is guaranteed to be invoked on the shard associated with the service.

Fixes: #14515.

Closes #14532
2023-07-09 23:09:59 +03:00
Kefu Chai
7a334c53af cql3: expression: correct format string
fmtlib uses `{}` as the placeholder for the formatted argument, not
`{}}`.

so let's correct it.
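As an analogy (Python's str.format, not fmtlib itself, but it uses the same brace syntax), the malformed string is rejected at format time:

```python
# "{}}" parses as a "{}" replacement field followed by a stray "}",
# which the format-string parser rejects at runtime.
try:
    "{}}".format(42)
    raise AssertionError("should have thrown")
except ValueError:
    pass

assert "{}".format(42) == "42"
assert "{{}}".format() == "{}"  # doubled braces are the escape for literals
```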

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14586
2023-07-09 22:26:29 +03:00
Kefu Chai
56c3462cba alternator: correct format string
when formatting the error message for `api_error::validation`, we
always include the caller in the error message, but in this case we
forgot to pass the `caller` to `seastar::format()`. if fmtlib
actually formats them, it would throw.

so let's pass `caller` to `seastar::format()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14589
2023-07-09 22:25:13 +03:00
Aleksandra Martyniuk
23e3251fc3 tasks: test: abort task manager when wrapped_compaction_manager is destructed
When the task manager is not aborted, the tasks remain stored in memory,
preventing the tasks' gate from being closed.

When wrapped_compaction_manager is destructed, task manager gets
aborted, so that system could shutdown.
2023-07-09 12:08:32 +02:00
Aleksandra Martyniuk
529c703143 compaction: swap compaction manager stopping order
task_manager::module::stop() waits till all compactions are complete.
Thus, ongoing compactions should be aborted before stop() is called,
so as not to prolong the shutdown process.

Task manager's compaction module is stopped after
compaction_manager::do_stop(), which aborts ongoing compactions,
is called.
2023-07-09 12:05:49 +02:00
Aleksandra Martyniuk
a59485b6da compaction: modify compaction_manager::stop()
In compaction_manager::stop(), do_stop() is called unconditionally.
It relies on do_stop to return immediately when _state == none.
2023-07-09 12:04:14 +02:00
Michał Chojnowski
c41f0ebd2a test: mutation_test: unflake test_external_memory_usage
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.

Fight back.

Fixes #14563

Closes #14576
2023-07-08 15:20:25 +03:00
Kefu Chai
27d6ff36df compound_compat: do not format an sstring with {:d}
before this change, we format an sstring with "{:d}"; fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.

so, in this change, we just print the int8_t using `seastar::format()`
in a single pass, with the format specifier `#02x` instead of
adding the "0x" prefix manually.
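As an analogy in Python's format mini-language (not fmtlib): `d` applied to a string fails at runtime, and the `#` alternate form supplies the `0x` prefix automatically (width 4 below, since the prefix counts toward the field width):

```python
# A string formatted with "{:d}" throws at format time, like fmtlib.
try:
    "{:d}".format("abc")
    raise AssertionError("should have thrown")
except ValueError:
    pass

# "#" adds the 0x prefix; no manual prefixing needed.
assert "{:#04x}".format(7) == "0x07"
assert " ".join("{:#04x}".format(b) for b in b"\x01\xff") == "0x01 0xff"
```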

Fixes #14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14578
2023-07-08 15:13:11 +03:00
Kefu Chai
26dcfea84a estimated_histogram: do not use dynamic format_string
fmtlib allows us to specify the field width dynamically, so specifying
the field width in the same statement that formats the argument improves
readability. and using a constexpr fmt string allows us to switch
to the compile-time formatter supported by fmtlib v8.

this change also uses `fmt::print()` to format the argument right to
the output ostream, instead of creating a temporary sstring and
copying it to the output ostream.
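Python's format mini-language supports the same nested width fields as fmtlib's dynamic width specifiers (an analogy; the histogram bucket data below is illustrative):

```python
# The width is itself a replacement field, resolved at format time.
buckets = [(1, 12), (42, 370), (99, 4)]
for offset, count in buckets:
    line = "{:>10}{:>{w}}".format(offset, count, w=8)
    assert len(line) == 18  # 10 + 8, both widths applied dynamically

assert "{:>{}}".format(5, 8) == "       5"
```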

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14579
2023-07-08 15:10:41 +03:00
Anna Stuchlik
88e62ec573 doc: improve User Data info in Launch on AWS
Fixes https://github.com/scylladb/scylladb/issues/14565

This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.

Closes #14569
2023-07-07 16:34:06 +02:00
Kamil Braun
de7f668441 Merge 'raft topology: send cdc generation data in parts' from Mikołaj Grzebieluch
The CDC generation data can be large and not fit in a single command.
This pr splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.

Commands are sent sequentially to avoid concurrency problems.

Topology snapshots contain only mutation of current CDC generation data
but don't contain any previous or future generations. If a new
generation of data is being broadcasted but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.

Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.

A mechanism to remove old CDC generations will be introduced as a
follow-up.

Closes #13962

* github.com:scylladb/scylladb:
  test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
  service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
  raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
  raft: add release_guard
  raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
  raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
  raft topology: make `mutation_size_threshold` depends on `max_command_size`
  raft: reduce max batch size of raft commands and raft entries
  raft: add description argument to add_entry_unguarded
  raft: introduce `write_mutations` command
  raft: refactor `topology_change` applying
2023-07-07 16:31:29 +02:00
Kamil Braun
f9cfd7e4f5 Merge 'raft: do not ping self in direct failure detector' from Konstantin Osipov
Avoid pinging self in direct failure detector, this adds confusing noise and adds constant overhead.
Fixes #14388

Closes #14558

* github.com:scylladb/scylladb:
  direct_fd: do not ping self
  raft: initialize raft_group_registry with host id early
  raft: code cleanup
2023-07-07 14:26:17 +02:00
Mikołaj Grzebieluch
4e3c97d8d4 test: raft topology: test prepare_and_broadcast_cdc_generation_data
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies mutations
generated by `get_cdc_generation_mutations` n times, where n is picked that
the memory size of all mutations exceeds `max_command_size`.

This test passes if cdc generation data is committed by raft in multiple commands.
If all the data is committed in a single command, the leader node will loop trying
to send raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
8d6c95f9e3 service: raft topology: print warning in case of raft::commit_status_unknown exception in topology coordinator loop
When the topology_coordinator fiber gets `raft::commit_status_unknown`, it
prints an error. This exception is not an error in this case, and it can be
thrown when the leader has changed. It can happen in `add_entry_unguarded`
while sending a part of the CDC generation data in the `write_mutations` command.

Catch this exception in `topology_coordinator::run` and print a warning.
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
ade15ad74a raft topology: introduce prepare_and_broadcast_cdc_generation_data
Broadcasts all mutations returned from `prepare_new_cdc_generation_data`
except the last one. Each mutation is sent in a separate raft command. It takes
`group0_guard`, and if the number of mutations is greater than one, the guard
is dropped, and a new one is created and returned, otherwise the old one will
be returned. Commands are sent in parallel and unguarded (the guard used for
sending the last mutation will guarantee that the term hasn't been changed).
Returns the generation's UUID, guard and last mutation, which will be sent
with additional topology data by the caller.

If we send the last mutation in the `write_mutation` command, we would use a
total of `n + 1` commands instead of `n-1 + 1` (where `n` is the number of
mutations), so it's better to send it in `topology_change` (we need to send
it after all `write_mutations`) with some small metadata.

With the default commitlog segment size, `mutation_size_threshold` will be 4 MB.
In large clusters e.g. 100 nodes, 64 shards per node, 256 vnodes cdc generation
data can reach the size of 30 MB, thus there will be no more than 8 commands.

In a multi-DC cluster with 100ms latencies between DCs, this operation should
take about 200ms since we send the commands concurrently, but even if the commands
were replicated sequentially by Raft, it should take no more than 1.6s, which is
incomparably smaller than bootstrapping operation (bootstrapping is quick if there
is no data in the cluster, but usually if one has 100 nodes they have tons of data,
so indeed streaming/repair will take much longer (hours/days)).

Fixes FIXME in pr #13683.
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
04c38c6185 raft: add release_guard
This function takes a guard and destroys it. It's used to avoid calling the destructor explicitly.
2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch
d2a4079bbe raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
If `group0_state_machine` applies all commands individually (without batching),
the resulting current `state_id` -- which will be compared with the
`prev_state_id` of the next command if it is a guarded command -- equals the
maximum of the `next_state_id` of all commands applied up to this point.
That's because the current `state_id` is obtained from the history table by
taking the row with the largest clustering key.

When `group0_state_machine::apply` is called with a batch of commands, the
current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id`
only once. When a command is merged, its `next_state_id` overwrites
`last_group0_state_id`, regardless of their order.

Let's consider the following situation:
The leader sends two unguarded `write_mutations` commands concurrently, with
timeuuids T1 and T2, where T1 < T2. Leader waits to apply them and sends guarded
`topology_change` with `prev_state_id` equal T2.
Suppose that the command with timeuuid T2 is committed first, and these commands
are small enough that all of `write_mutations` could be merged into one command.
Some followers can get all of these three commands before its `fsm` polls them.
In this situation, `group0_state_machine::apply` is called with all three of
them and `merger` will merge both `write_mutations` into one command. After that,
`merger::last_group0_state_id` will be equal to T1 (this command was committed
as the second one). When it processes the `topology_change` command, it will
compare its `prev_state_id` and `merger::last_group0_state_id`, resulting in
making this command a no-op (which wouldn't happen if the commands were applied
individually).
Such a scenario results in inconsistent results: one replica applies `topology_change`,
but another makes it a no-op.
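The fix can be modeled with timeuuids as comparable integers (a simplification; real timeuuids compare by time component):

```python
class Merger:
    """Toy model of the state_id merge rule: the merged command must carry
    the maximum next_state_id seen, matching what the history table would
    hold had the commands been applied one by one."""
    def __init__(self, last_group0_state_id):
        self.last_group0_state_id = last_group0_state_id

    def merge(self, next_state_id):
        # The fix: take the maximum instead of blindly overwriting.
        self.last_group0_state_id = max(self.last_group0_state_id,
                                        next_state_id)

T1, T2 = 1, 2                # T1 < T2, but T2 was committed first
m = Merger(last_group0_state_id=0)
for cmd in (T2, T1):         # merged in commit order
    m.merge(cmd)
# topology_change(prev_state_id=T2) now matches and is applied, not no-oped.
assert m.last_group0_state_id == T2
```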
2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch
b2d22d665e raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
Topology snapshots contain only mutation of current CDC generation data but don't
contain any previous or future generations. If a new generation of data is being
broadcasted but hasn't been entirely applied yet, the applied part won't be sent
in a snapshot. In this scenario, new or delayed nodes can never get the applied part.

Send the entire cdc_generations_v3 table in the snapshot to resolve this problem.

As a follow-up, a mechanism to remove old CDC generations will be introduced.
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
dc6017b71b raft topology: make mutation_size_threshold depend on max_command_size
`get_cdc_generation_mutations` splits data into mutations of maximal size
`mutation_size_threshold`. Before this commit it was hardcoded to 2 MB.

Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
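The splitting can be sketched as greedy chunking against the threshold (hypothetical helper; sizes are in MB purely for illustration):

```python
def split_by_threshold(sizes, threshold):
    """Greedily pack item sizes into chunks whose total stays <= threshold;
    an oversized item still gets a chunk of its own."""
    chunks, current, total = [], [], 0
    for sz in sizes:
        if current and total + sz > threshold:
            chunks.append(current)
            current, total = [], 0
        current.append(sz)
        total += sz
    if current:
        chunks.append(current)
    return chunks

# 30 MB of generation data with a 4 MB threshold fits in 8 commands.
mutations = [4] * 7 + [2]   # one illustrative way the data could split
assert len(split_by_threshold(mutations, 4)) == 8
```

This matches the sizing argument in the series: a 4 MB threshold over ~30 MB of generation data yields no more than 8 commands.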
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
6dad582796 raft: reduce max batch size of raft commands and raft entries
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold resulting in throwing an exception when writing
mutation to the commitlog.

A batch of raft commands has the size at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't cause any trouble,
but it is preferable to be careful.

Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.

Fixed typo "1204" => "1024".
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
760d415781 raft: add description argument to add_entry_unguarded
Provide useful description for `write_mutations` and
`broadcast_tables_query` that is stored in `system.group0_history`.

Reduces scope of issue #13370.
2023-07-07 13:11:44 +02:00
Anna Stuchlik
799ae97b52 doc: add the Rust CDC Connector to the docs
Fixes https://github.com/scylladb/scylladb/issues/13877

This commit adds the information about Rust CDC Connector
to the documentation. All relevant pages are updated:
the ScyllaDB Rust Driver page, and other places in
the docs where Java and Go CDC connectors are mentioned.

In addition, the drivers table is updated to indicate
Rust driver support for CDC.

Closes #14530
2023-07-07 11:13:25 +02:00
Nadav Har'El
edfb89ef65 sstables: stop warning when auto-snapshot leaves non-empty directory
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:

```
WARN  2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```

It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.

And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
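The errno handling can be demonstrated directly from Python, where `os.rmdir` surfaces the same Linux `ENOTEMPTY`:

```python
import errno
import os
import tempfile

def remove_table_directory(path):
    """Try to remove the table directory. A leftover snapshot is normal:
    Linux returns ENOTEMPTY for a non-empty directory, while POSIX also
    permits EEXIST, so both are silently ignored. Anything else propagates
    (and would deserve a warning)."""
    try:
        os.rmdir(path)
    except OSError as e:
        if e.errno not in (errno.EEXIST, errno.ENOTEMPTY):
            raise

table_dir = tempfile.mkdtemp()
open(os.path.join(table_dir, "snapshots"), "w").close()
remove_table_directory(table_dir)        # non-empty: silently ignored
assert os.path.isdir(table_dir)
os.remove(os.path.join(table_dir, "snapshots"))
remove_table_directory(table_dir)        # empty now: actually removed
assert not os.path.exists(table_dir)
```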

To confirm that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.

Fixes #13538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14557
2023-07-07 11:08:10 +02:00
Benny Halevy
cd44ad9338 docs: compaction: correct min_sstable_size default value
DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L`
which is 50 MB, not 50 bytes.

Fixes #14413

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14414
2023-07-07 11:08:10 +02:00
Marcin Maliszkiewicz
c5de25be4c locator: use deferred_close in azure and gcp snitches
Close needs to be called even if function throws in the middle.

Closes #14458
2023-07-07 11:08:10 +02:00
Avi Kivity
1f9a999c26 cql3: statement_restrictions: clean up dead code
We have plenty of code marked with #if 0. Once it was an indication
of missing functionality, but the code has evolved so much it's
useless as an indication and only a distraction.

Delete it.

Closes #14511
2023-07-07 11:08:10 +02:00
Gleb Natapov
4f23eec44f Rename experimental raft feature to consistent-topology-changes
Make the name more descriptive

Fixes #14145

Message-Id: <ZKQ2wR3qiVqJpZOW@scylladb.com>
2023-07-07 11:08:10 +02:00
Kamil Braun
3c139265b3 Merge 'doc: remove the dead link to unirestore' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/14459

This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements.

Closes #14519

* github.com:scylladb/scylladb:
  doc: minor language improvements on the Migration Tools page
  doc: remove the link to the private repository
2023-07-07 11:08:10 +02:00
Nadav Har'El
d6aba8232b alternator: configurable override for DescribeEndpoints
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific endpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.

This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:

1. Empty string (the default) leaves the current behavior -
   DescribeEndpoints echos the request's "Host" header.

2. The string "disabled" disables the DescribeEndpoints (it will return
   an UnknownOperationException). This is how DynamoDB Local behaves,
   and the AWS C++ SDK and the Dynamodb Shell work well in this mode.

3. Any other string is a fixed string to be returned by DescribeEndpoints.
   It can be useful in setups that should return a known address.
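The three modes can be modeled as a small dispatch function (names illustrative, not the actual Alternator handler):

```python
class UnknownOperationException(Exception):
    pass

def describe_endpoints(config_value, request_host):
    """Resolve the DescribeEndpoints response from the
    alternator_describe_endpoints setting, per the three modes above."""
    if config_value == "":
        return request_host                # default: echo the Host header
    if config_value == "disabled":
        raise UnknownOperationException()  # DynamoDB Local behavior
    return config_value                    # operator-supplied fixed address

assert describe_endpoints("", "10.0.0.1:8000") == "10.0.0.1:8000"
assert describe_endpoints("alt.example.com", "x") == "alt.example.com"
```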

Note that this patch does not, by default, change the current behavior
of DescribeEndpoints. But it lets us override its behavior in the future
if a user experiences problems in the field - without code changes.

Fixes #14410.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14432
2023-07-07 11:08:10 +02:00
Konstantin Osipov
ff41ea86b6 direct_fd: do not ping self
No need to ping self in direct failure detector. This is confusing
during debugging and adds extra overhead.

Fixes #14388
2023-07-06 21:05:39 +03:00
Konstantin Osipov
50140980ac raft: initialize raft_group_registry with host id early
Earlier, when local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
to route RPC messages to specific raft groups, and needs
to reject messages destined to a different host.

Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
2023-07-06 20:54:05 +03:00
Konstantin Osipov
d79d05aa46 raft: code cleanup
Rename raft_rpc::_server_id to raft_rpc::_my_id as is already the
name used in raft_group0:
- for consistency
- to reflect which server id it is.
2023-07-06 19:46:24 +03:00
Kamil Braun
0d437a7d63 Merge 'utils: error injection: add inject_with_handler for interactions with injected code' from Mikołaj Grzebieluch
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.

This PR adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.

Closes #14357.

Closes #14460

* github.com:scylladb/scylladb:
  tests: introduce InjectionHandler class for communicating with injected code
  api/error_injection: add message_injection endpoint
  tests: utils: error injections: add test for inject_with_handler
  utils: error injection: add inject_with_handler for interactions with injected code
  utils: error injection: create structure for error injections data
2023-07-06 18:16:51 +02:00
Mikołaj Grzebieluch
907c0e8900 tests: introduce InjectionHandler class for communicating with injected code
Add a client for sending empty messages to the injected code from tests.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
8b1f5ba293 api/error_injection: add message_injection endpoint
Add an endpoint for sending empty messages to the injected code.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
7e5c42af0a tests: utils: error injections: add test for inject_with_handler
Add a test checking the correctness of the `inject_with_handler` method
in presence of concurrency.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
086b3369f4 utils: error injection: add inject_with_handler for interactions with injected code
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.

This commit adds the `inject_with_handler` method that executes injected function
and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages, `received_messages_counter`; it is shared
between the injection_data, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, which are created separately for each
firing of this injection. The counter is incremented when a message is received from
the REST endpoint, and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter`; that private counter is incremented whenever we wait for a
message, and compared to the received counter. We sleep on the condition variable
if not enough messages were received.
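The counter scheme can be modeled with a condition variable (a single-process sketch; the real implementation is per shard and driven by a REST endpoint):

```python
import threading

class InjectionData:
    """Per-injection shared state: the received-message counter, bumped
    and signaled whenever the endpoint delivers a message."""
    def __init__(self):
        self.cond = threading.Condition()
        self.received_messages_counter = 0

    def receive_message(self):
        with self.cond:
            self.received_messages_counter += 1
            self.cond.notify_all()

class InjectionHandler:
    """Per-firing handler with a private read counter; each wait claims
    the next message and sleeps until the shared counter catches up."""
    def __init__(self, data):
        self.data = data
        self._read_messages_counter = 0

    def wait_for_message(self, timeout=5.0):
        with self.data.cond:
            self._read_messages_counter += 1
            arrived = self.data.cond.wait_for(
                lambda: (self.data.received_messages_counter
                         >= self._read_messages_counter),
                timeout)
            if not arrived:
                raise TimeoutError("no injection message received")

data = InjectionData()
handler = InjectionHandler(data)
sender = threading.Thread(target=data.receive_message)
sender.start()
handler.wait_for_message()   # returns once the message is counted
sender.join()
assert data.received_messages_counter == handler._read_messages_counter == 1
```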
2023-07-06 12:32:07 +02:00
Kamil Braun
431a8f8591 test: rename test_topology_ip.py to test_replace.py
No idea why it was named like that before.
2023-07-06 10:24:46 +02:00
Kamil Braun
452d9a3c77 test: test bootstrap after IP change
Regression test for #14468.
2023-07-06 10:24:46 +02:00
Kamil Braun
2032d7dbe4 test: scylla_cluster: return the new IP from change_ip API
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
2023-07-06 10:24:46 +02:00
Kamil Braun
00f51ea753 test: node replace with ignore_dead_nodes test
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.

In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
2023-07-06 10:24:46 +02:00
Kamil Braun
9b136ee574 test: scylla_cluster: accept ignore_dead_nodes in ReplaceConfig 2023-07-06 10:24:46 +02:00
Kamil Braun
9b8e5550b1 storage_service: remove get_nodes_to_sync_with
It's no longer used.
2023-07-06 10:24:46 +02:00
Kamil Braun
96278a09d4 storage_service: use token_metadata to calculate nodes waited for to be UP
At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In #14468 and #14487 we observed that this set may contain entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes #14468
Fixes #14487
2023-07-06 10:24:46 +02:00
Kamil Braun
bbcf8305bb storage_service: don't calculate ignore_nodes before waiting for normal handlers
Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as parameter and is modified appropriately, it's now
calculating the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.

This addresses the immediate failure described in #14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
2023-07-06 10:24:44 +02:00
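The waiting scheme described in this commit (recomputing the pending set on every retry instead of fixing it up front) can be sketched with toy types; the names and the polling callback are illustrative stand-ins, not Scylla's gossiper API:

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Toy model: wait until every NORMAL/shutdown entry currently present has
// been handled. The pending set is recomputed on each iteration, so entries
// that are garbage-collected while we wait no longer block us.
using entry_set = std::set<std::string>;

// poll() returns the entries still pending; returns true once the set is
// empty within max_iterations polls.
inline bool wait_for_handled(const std::function<entry_set()>& poll,
                             int max_iterations) {
    for (int i = 0; i < max_iterations; ++i) {
        entry_set pending = poll();   // recomputed on every retry
        if (pending.empty()) {
            return true;
        }
    }
    return false;
}
```

Because the set is recomputed inside the loop, an entry that is garbage-collected while we wait simply stops appearing, instead of blocking the wait forever.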
Tomasz Grabiec
c25201c1a3 Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Fixes https://github.com/scylladb/scylladb/issues/14503

Closes #14502

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-05 21:21:43 +02:00
Michał Chojnowski
f6203f2bd4 test: view_build_test: add range tombstones to test_view_update_generator_buffering
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
2023-07-05 17:33:49 +02:00
Michał Chojnowski
aab10402ce test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
A random mutation test for view_updating_consumer's buffering logic.
Reproduces #14503.
2023-07-05 17:33:49 +02:00
Michał Chojnowski
ac29b6f198 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but this patch makes it variable for
unit testing purposes.
2023-07-05 17:33:47 +02:00
Raphael S. Carvalho
5d34db2532 test: Extend sstable partition skipping test to cover fast forward using token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-05 11:38:58 -03:00
Kefu Chai
fa8eaab62b build: remove duplicated test
this change has no impact on the `build.ninja` generated by `configure.py`,
as we are using a `set` for tracking the tests to be built. but it's
still an improvement, as we should not add duplicated entries to a set
when initializing it.

there are two occurrences of `test/boost/double_decker_test`; the one
which is grouped with the local cluster of collection tests - bptree,
btree, radix_tree and double_decker - is preserved.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14478
2023-07-05 15:43:04 +03:00
Kefu Chai
e4697e2bd2 sstable: remove stale comment
this comment should have been removed in
f014ccf369. but better late than never.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14497
2023-07-05 15:42:11 +03:00
Pavel Emelyanov
e91f95a629 Merge 's3/test: restructure object_store test into a pytest based test suite' from Kefu Chai
in this series, test/object_storage is restructured into a pytest based test. this paves the road to a test suite that covers more use cases, so we can add some more lower-level tests for the tiered/caching store.

Closes #14165

* github.com:scylladb/scylladb:
  s3/test: do not return ip in managed_cluster()
  s3/test: verify the behavior with asserts
  s3/test: restructure object_store/run into a pytest
  s3/test: extract get_scylla_with_s3_cmd() out
  s3/test: s/restart_with_dir/kill_with_dir/
  s3/test: vendor run_with_dir() and friends
  s3/test: remove get_tempdir()
  s3/test: extract managed_cluster() out
2023-07-05 15:40:43 +03:00
Gleb Natapov
c42a91ec72 cql3: Extend the scope of group0_guard during DDL statement execution
Currently we hold group0_guard only during DDL statement's execute()
function, but unfortunately some statements access underlying schema
state also during check_access() and validate() calls which are called
by the query_processor before it calls execute. We need to cover those
calls with group0_guard as well and also move retry loop up. This patch
does it by introducing new function to cql_statement class take_guard().
Schema altering statements return group0 guard while others do not
return any guard. Query processor takes this guard at the beginning of a
statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to the execute in query_state structure.

Fixes: #13942
Message-Id: <ZJ2aeNIBQCtnTaE2@scylladb.com>
2023-07-05 14:38:34 +02:00
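The retry-on-concurrent-modification loop this commit moves up into the query processor can be sketched as follows; the exception type and function names here are illustrative stand-ins for Scylla's group0 machinery:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Toy stand-in for service::group0_concurrent_modification.
struct group0_concurrent_modification : std::runtime_error {
    group0_concurrent_modification()
        : std::runtime_error("concurrent group0 modification") {}
};

// Take the guard before check_access/validate/execute and retry the whole
// statement if a concurrent group0 modification is detected.
inline int execute_with_retries(const std::function<int()>& execute_once,
                                int max_retries) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        try {
            return execute_once();  // guard taken + validation + execution
        } catch (const group0_concurrent_modification&) {
            continue;               // state changed under us; take a fresh guard
        }
    }
    throw std::runtime_error("too many retries");
}
```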
Mikołaj Grzebieluch
01bc6f5294 utils: error injection: create structure for error injections data
This enables holding additional data associated with the injection.
2023-07-05 13:52:46 +02:00
Anna Stuchlik
4656d8c338 doc: update the prerequisites section 2023-07-05 11:52:03 +02:00
Anna Stuchlik
088a31cdb0 doc: minor language improvements on the Migration Tools page 2023-07-05 11:39:52 +02:00
Pavel Emelyanov
dfff5f2f2e Merge 'test/pylib: retry if minio_server is not ready and define a name for alias' from Kefu Chai
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.

in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should make sure the
minio server is ready before tests start to connect to it.

also, in this change, instead of hardwiring the alias of "local" in the code,
define a variable for it. less repeating this way.

Fixes https://github.com/scylladb/scylladb/issues/1719

Closes #14517

* github.com:scylladb/scylladb:
  test/pylib: do not hardwire alias to "local"
  test/pylib: retry if minio_server is not ready
2023-07-05 12:32:58 +03:00
Anna Stuchlik
3213feee5f doc: remove the link to the private repository
This commit removes the dead link to the unirestore tool in
the private repository.
2023-07-05 11:28:37 +02:00
Kefu Chai
9080f8842b s3/test: do not return ip in managed_cluster()
let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster. so the name of
managed_cluster() is less weird.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:07:39 +08:00
Kefu Chai
ec6410653f s3/test: verify the behavior with asserts
instead of assigning to "success", let's use assert for this purpose.
simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:07:21 +08:00
Kefu Chai
471d75c6c6 s3/test: restructure object_store/run into a pytest
instead of using a single run to perform the test, restructure
it into a pytest based test suite with a single test case.
this should allow us to add more tests exercising the object-storage
and cached/tiered storage in the future.

* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
  https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
  found in command line or env variables, so that the tests are
  self-contained without the fixture setup by test.py.
* instead of sys.exit(), use assert statement, as this is
  what pytest uses.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 17:05:13 +08:00
Petr Gusev
b69bc97673 repair_test: add test_reader_with_different_strategies 2023-07-05 13:02:17 +04:00
Kefu Chai
bffaf84395 s3/test: extract get_scylla_with_s3_cmd() out
* define a dedicated S3_server class which duck types MinioServer.
  it will be used to represent S3 server in place of MinioServer if
  S3 is used for testing
* prepare object_storage.yaml in get_scylla_with_s3(), so it is more
  clear that we are using the same set of settings for launching
  scylla

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:49:04 +08:00
Kefu Chai
f74218f434 s3/test: s/restart_with_dir/kill_with_dir/
replace the restart_with_dir() with kill_with_dir(), so
that we can simplify the usage of managed_cluster() by enabling it
to start and stop the single-node cluster. with this change, the caller
does not need to run the scylla and pass its pid to this function
any more.

since the restart_with_dir() call is superseded by managed_cluster(),
which tears down the cluster, teardown() is now only responsible to
print out the log file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:48:25 +08:00
Kefu Chai
a6bb5864ff s3/test: vendor run_with_dir() and friends
so we don't need to mess with cql-pytest/run.py, which is
used by cql-pytest.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:48:04 +08:00
Kefu Chai
b45049c968 s3/test: remove get_tempdir()
to match with another call of managed_cluster(), so it's clear that
we are just reusing test_tempdir.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:45:14 +08:00
Kefu Chai
a5a87d81c6 s3/test: extract managed_cluster() out
for setting up the cluster and tearing it down.
this helps to indent the code so that the lifecycle
of the cluster is visually explicit.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 16:45:14 +08:00
Kefu Chai
1faf50fc05 test/pylib: do not hardwire alias to "local"
define a variable for it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 15:58:41 +08:00
Kefu Chai
d55cfdc152 test/pylib: retry if minio_server is not ready
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.

in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should make sure the
minio server is ready before tests start to connect to it.

Fixes #1719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-05 15:57:59 +08:00
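The retry-until-ready idea behind the `mc()` helper can be sketched generically (an illustrative sketch only; the actual pylib helper is Python and shells out to the minio client):

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Keep probing until the server answers or the deadline passes, in the
// spirit of the mc() wrapper described above.
inline bool retry_until_ready(const std::function<bool()>& probe,
                              std::chrono::milliseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    do {
        if (probe()) {
            return true;   // server is up; tests may proceed
        }
    } while (std::chrono::steady_clock::now() < deadline);
    return false;          // timed out; caller decides whether to ignore
}
```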
Konstantin Osipov
b9c2b326bc raft: do not update raft address map with obsolete gossip data
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.

A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.

Fixes #14257
Refs #14357

Closes #14329
2023-07-05 00:16:28 +02:00
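A minimal model of the restart-counter check (toy types only; the real raft address map and gossip structures are more involved):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy address map keyed by host id; an update is applied only if it carries
// a restart (generation) counter not older than what we already have, so an
// out-of-order gossip message cannot overwrite a newer IP.
struct addr_entry {
    long restart_counter;
    std::string ip;
};

inline bool maybe_update(std::map<std::string, addr_entry>& m,
                         const std::string& host_id,
                         long restart_counter,
                         const std::string& ip) {
    auto it = m.find(host_id);
    if (it != m.end() && it->second.restart_counter > restart_counter) {
        return false;  // obsolete message from a previous incarnation; drop it
    }
    m[host_id] = addr_entry{restart_counter, ip};
    return true;
}
```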
Michał Chojnowski
5ad0846bff view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-04 20:33:21 +02:00
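The close-and-reopen behaviour can be modeled on plain integer positions (a toy sketch, not the real mutation fragment types):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model of the fix: a range tombstone [start, end) that spans flush
// points is closed at each flush position and reopened at the same position
// in the next chunk, so every emitted chunk is self-contained. Positions are
// plain ints here; the real code works on clustering positions.
using range = std::pair<int, int>;

inline std::vector<range> split_at_flushes(int start, int end,
                                           const std::vector<int>& flush_points) {
    std::vector<range> out;
    int pos = start;
    for (int fp : flush_points) {
        if (fp > pos && fp < end) {
            out.emplace_back(pos, fp);  // close at the flush position...
            pos = fp;                   // ...and reopen at the same position
        }
    }
    out.emplace_back(pos, end);
    return out;
}
```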
Mikołaj Grzebieluch
e6b0403326 raft: introduce write_mutations command
This command is used to send mutations over raft.

In later commits if `topology_change` doesn't fit the max command size,
it will be split into smaller mutations and sent over multiple raft
commands.
2023-07-04 16:12:50 +02:00
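The planned splitting can be sketched as simple size-based chunking (mutation sizes as plain ints; illustrative only, not the raft command encoding):

```cpp
#include <cassert>
#include <vector>

// Split a batch of mutations into commands so that no command exceeds a
// size limit; a single mutation larger than the limit still gets its own
// command (it cannot be split further at this level).
inline std::vector<std::vector<int>> split_into_commands(
        const std::vector<int>& mutation_sizes, int max_command_size) {
    std::vector<std::vector<int>> commands;
    std::vector<int> current;
    int current_size = 0;
    for (int s : mutation_sizes) {
        if (!current.empty() && current_size + s > max_command_size) {
            commands.push_back(current);  // flush the full command
            current.clear();
            current_size = 0;
        }
        current.push_back(s);
        current_size += s;
    }
    if (!current.empty()) {
        commands.push_back(current);
    }
    return commands;
}
```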
Mikołaj Grzebieluch
06cedaf978 raft: refactor topology_change applying
Split up the `topology_change` command's logic to apply mutations and reload
the topology state into separate functions.

This aims to extract the logic of applying mutations to use it in future raft commands.
2023-07-04 16:12:50 +02:00
Avi Kivity
0f59b17056 cql3: select_statement: don't copy metadata object needlessly
It's a shared_ptr<const metadata>, so it's safe to pass around.

perf-simple-query:

before:
211989.40 tps ( 62.1 allocs/op,  13.1 tasks/op,   43812 insns/op,        0 errors)
217889.09 tps ( 62.1 allocs/op,  13.1 tasks/op,   43713 insns/op,        0 errors)
211418.75 tps ( 62.1 allocs/op,  13.1 tasks/op,   43782 insns/op,        0 errors)
217388.46 tps ( 62.1 allocs/op,  13.1 tasks/op,   43733 insns/op,        0 errors)
211528.74 tps ( 62.1 allocs/op,  13.1 tasks/op,   43766 insns/op,        0 errors)

after:
215241.86 tps ( 61.1 allocs/op,  13.1 tasks/op,   43563 insns/op,        0 errors)
216172.41 tps ( 61.1 allocs/op,  13.1 tasks/op,   43562 insns/op,        0 errors)
212591.73 tps ( 61.1 allocs/op,  13.1 tasks/op,   43586 insns/op,        0 errors)
212217.28 tps ( 61.1 allocs/op,  13.1 tasks/op,   43553 insns/op,        0 errors)
215863.47 tps ( 61.1 allocs/op,  13.1 tasks/op,   43559 insns/op,        0 errors)

About 200 instructions saved.

Closes #14499
2023-07-04 16:41:51 +03:00
Marcin Maliszkiewicz
6424dd5ec4 alternator: close output_stream when exception is thrown during response streaming
When an exception occurs and we omit closing the output_stream, the whole process is brought
down by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454
2023-07-04 16:15:08 +03:00
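A toy model of the bug and the fix (the real assertion lives in seastar's ~output_stream; everything below is illustrative):

```cpp
#include <cassert>
#include <stdexcept>

// Toy stream that, like seastar's output_stream, treats destruction without
// a prior close() as a violation (here recorded in a flag rather than an
// abort). The handler must close it on the exception path too.
struct toy_output_stream {
    bool closed = false;
    bool* violation;
    explicit toy_output_stream(bool* v) : violation(v) {}
    void close() { closed = true; }
    ~toy_output_stream() { if (!closed) { *violation = true; } }
};

// Returns true if the response completed without error.
inline bool stream_response(bool* violation, bool fail_midway) {
    toy_output_stream os(violation);
    try {
        if (fail_midway) {
            throw std::runtime_error("error during response streaming");
        }
        os.close();
        return true;
    } catch (...) {
        os.close();   // the fix: close even when an exception is thrown
        return false;
    }
}
```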
Anna Stuchlik
6408b520d4 doc: improve the title of the Unified Installer page
Following the feedback, this commit changes the page title
into "Install ScyllaDB Without root Privileges".
2023-07-04 15:05:56 +02:00
Anna Stuchlik
5895e210fd doc: move package install instructions to the docs
This commit moves the installation instructions with Linux
packages from the website to the docs.

The scope:
- Added the install-on-linux.rst file that has information
about all supported linux platform. The replace variables
in the file must be updated per release.
- Updated the index page to include the new file.

Refs: scylladb/scylla-docs#4091
2023-07-04 15:03:01 +02:00
Pavel Emelyanov
3679792f49 Merge 'test/pylib: allow run minio_server.py as a stand-alone tool' from Kefu Chai
this would allow a developer to run a minio server for testing, for instance s3_test.

Closes #14485

* github.com:scylladb/scylladb:
  test/pylib: chmod +x minio_server.py
  test/pylib: allow run minio_server.py as a stand-alone tool
2023-07-04 13:41:51 +03:00
Petr Gusev
9198175b89 repair: extract repair_reader declaration into reader.hh
It's needed to write a unit test for it in the following
commits. No other code changes have been made.
2023-07-04 13:39:53 +03:00
Petr Gusev
b9f527bfa8 repair_meta: get_estimated_partitions fix
The shard_range parameter was unused.
2023-07-04 13:39:53 +03:00
Petr Gusev
3aeee90f04 repair_meta: use multishard_filter reader if the number of partitions is small
We replace is_local_reader bool_class with the
read_strategy enum, since now we have three options.
We choose our new multishard_streaming_reader if
the number of partitions is less than the
number of master subranges.
2023-07-04 13:39:53 +03:00
Petr Gusev
c0d049982c repair_meta: delay _repair_reader creation
In later commits we will need the estimated number
of partitions in _repair_reader creation,
so in this commit we delay it until the
reader is first used in read_rows_from_disk
function. read_rows_from_disk is used
in get_sync_boundary, which is called
by master after set_estimated_partitions.
2023-07-04 13:39:49 +03:00
Petr Gusev
f05ab33ee7 database.hh: make_multishard_streaming_reader with range parameter
We add an overload of make_multishard_streaming_reader
which reads all the data in the given range. We will use it later
in row level repair if --smp is different on the
nodes and the number of partitions is small.
2023-07-04 13:30:37 +03:00
Petr Gusev
614a1b3770 database.cc: extract streaming_reader_lifecycle_policy
We are going to use it later in a new
make_multishard_streaming_reader overload.
In this commit we just move it outside
into the anonymous namespace, no other code changes
were made.
2023-07-04 13:30:37 +03:00
Kefu Chai
949bb719cd sstables: use try_emplace() when appropriate
so we don't have to search in the unordered_map twice. and it's
more readable, as we don't need to compare an iterator with the
end() sentinel.

also, take the opportunity to simplify the code by using the
temporary `s3_cfg` when possible instead of `it->second.cfg`
which is less readable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 15:40:10 +08:00
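The `try_emplace()` pattern the commit switches to, shown in isolation:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// try_emplace() looks up and (if absent) inserts in a single traversal, and
// returns an iterator either way -- no second lookup and no comparison
// against the end() sentinel is needed to reach the stored value.
inline std::string& get_or_create(std::unordered_map<std::string, std::string>& m,
                                  const std::string& key,
                                  const std::string& default_value) {
    auto [it, inserted] = m.try_emplace(key, default_value);
    (void)inserted;  // true only when the entry was newly created
    return it->second;
}
```

An existing value is never overwritten; `try_emplace` only constructs the mapped value when the key is absent.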
Kefu Chai
dcfbc85485 replica,sstable: do not assign a value to a shared_ptr
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.

unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as the behavior of a shared_ptr
should be that of a pointer, not of the value pointed to by it. and this
could be error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, and expect that the value pointed to by `p` is cleared,
and that all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.

this behavior is not expected, so before deprecating and removing
this operator, let's stop using it.

in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new pointee
in-place, a new instance of lw_shared_ptr is
created and assigned to the existing shared_ptr.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 15:39:52 +08:00
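With `std::shared_ptr` the two intents have to be spelled out explicitly, which illustrates why the seastar operator is easy to misuse (self-contained sketch using standard types):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// *p = value;                  mutates the shared pointee; all copies see it.
// p = std::make_shared(...);   rebinds p only; other copies keep the old pointee.
inline std::pair<std::string, std::string> demo() {
    auto p = std::make_shared<std::string>("old");
    auto copy = p;

    *p = "mutated";                               // both p and copy observe this
    std::string seen_by_copy = *copy;

    p = std::make_shared<std::string>("rebound"); // copy still points at "mutated"
    return {seen_by_copy, *copy};
}
```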
Kefu Chai
c005b6dce0 test/pylib: chmod +x minio_server.py
add a shebang line. so we can just launch
a minio_server using

```console
test/pylib/minio_server.py --host 127.0.0.1
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 13:19:34 +08:00
Kefu Chai
2bae0b9aa8 test/pylib: allow run minio_server.py as a stand-alone tool
this would allow a developer to run a minio server for testing, for
instance s3_test, using something like:

```console
$ python3 test/pylib/minio_server.py --host 127.0.0.1
tempdir='/tmp/tmpfoobar-minio'
export S3_SERVER_ADDRESS_FOR_TEST=127.0.0.1
export S3_SERVER_PORT_FOR_TEST=900
export S3_PUBLIC_BUCKET_FOR_TEST=testbucket
```

and the developer is supposed to copy-and-paste the `export` commands
to prepare the environment variables for the test using the
minio server. the tempdir is used for the rundir of minio, and it
is also used for holding the log file of this tool. one might want
to check it when necessary.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-04 13:14:42 +08:00
Tomasz Grabiec
7d35cf8657 Merge 'migration_manager: disable schema pulls when schema is Raft-managed' from Kamil Braun
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).

We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
bootstrap procedure, before it setups group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.

Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
  enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
  schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
  feature enabled (which means we should start upgrading to use Raft),
  or restarting in the middle of Raft upgrade procedure, schema pulls must
  remain enabled until the Raft upgrade procedure finishes.
  This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
  perform a schema pull.

The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.

Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.

Correct some typos in comments and update the comments.

Fixes #12870

Closes #14428

* github.com:scylladb/scylladb:
  raft_group_registry: remove `has_group0()`
  raft_group0_client: remove `using_raft()`
  migration_manager: disable schema pulls when schema is Raft-managed
2023-07-03 23:54:34 +02:00
Tomasz Grabiec
f2ed9fcd7e schema_mutations, migration_manager: Ignore empty partitions in per-table digest
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d, it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.
2023-07-03 23:06:55 +02:00
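A toy model of the digest change (the hashing scheme below is illustrative only; the real code feeds schema mutations into a cryptographic digest):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy model of the fix: a partition whose content compacts away to nothing
// is skipped entirely -- its key is not fed into the digest -- so the digest
// is the same whether or not the empty partition has been physically
// compacted away on disk yet.
struct partition {
    std::string key;
    std::string content;  // empty once all its tombstones expire and compact away
};

inline std::size_t schema_digest(const std::vector<partition>& parts) {
    std::size_t digest = 0;
    for (const auto& p : parts) {
        if (p.content.empty()) {
            continue;  // ignore empty partitions in the digest
        }
        digest ^= std::hash<std::string>{}(p.key) * 31
                ^ std::hash<std::string>{}(p.content);
    }
    return digest;
}
```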
Nadav Har'El
ec77172b4b Merge 'cql3: convert the SELECT clause evaluation phase to expressions' from Avi Kivity
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.

We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.

In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.

Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.

The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.

In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.

While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.

Closes #14467

* github.com:scylladb/scylladb:
  cql3: selection: drop selector_factories, selectables, and selectors
  cql3: select_statement: stop using selector_factories in SELECT JSON
  cql3: selection: don't create selector_factories any more
  cql3: selection: collect column_definitions using expressions
  cql3: selection: reimplement selection::is_aggregate()
  cql3: selection: evaluate aggregation queries via expr::evaluate()
  cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
  cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
  cql3: selection: store primary key in result_set_builder
  cql3: expression: fix field_selection::type interpretation by evaluate()
  cql3: selection: make result_set_builder::current non-optional<>
  cql3: selection: simplify row/group processing
  cql3: selection: convert requires_thread to expressions
  cql: selection: convert used_functions() to expressions
  cql3: selection: convert is_reducible/get_reductions to expressions
  cql3: selection: convert is_count() to expressions
  cql3: selection convert contains_ttl/contains_writetime to work on expressions
  cql3: selection: make simple_selectors stateless
  cql3: expression: add helper to split expressions with aggregate functions
  cql3: selection: short-circuit non-aggregations
  cql3: selection: drop validate_selectors
  cql3: select_statement: force aggregation if GROUP BY is used
  cql3: select_statement: levellize aggregation depth
  cql3: selection: skip first_function when collecting metadata
  cql3: select_statement: explicitly disable automatic parallelization with no aggregates
  cql3: expression: introduce temporaries
  cql3: select_statement: use prepared selectors
  cql3: selection: avoid selector_factories in collect_metadata()
  cql3: expressions: add "metadata mode" formatter for expressions
  cql3: selection: convert collect_metadata() to the prepared expression domain
  cql3: selection: convert processes_selection to work on prepared expressions
  cql3: selection: prepare selectors earlier
  cql3: raw_selector: deinline
  cql3: expression: reimplement verify_no_aggregate_functions()
  cql3: expression: add helpers to manage an expression's aggregation depth
  cql3: expression: improve printing of prepared function calls
  cql3: functions: add "first" aggregate function
2023-07-03 23:21:33 +03:00
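The inner/outer split and the internal "first" function can be modeled on a single toy group (names and structure are illustrative, not the cql3 API): to evaluate `max(v) + 1` alongside a non-aggregated column `k`, the inner expression accumulates into temporaries per input row, and the outer expression reads them once per group.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Per-group state: the temporaries that connect the inner and outer
// expressions. A non-aggregated column is wrapped in a "first" accumulator
// that captures the first value seen in the group.
struct group_state {
    int max_v = 0;     // temporary fed by the inner expression for max(v)
    int first_k = 0;   // "first" aggregate for the non-aggregated column k
    bool seen = false;
};

// Inner expression: runs once per input row within the group.
inline void add_input_row(group_state& g, int k, int v) {
    if (!g.seen) { g.first_k = k; g.seen = true; }
    g.max_v = std::max(g.max_v, v);
}

// Outer expression: runs once per group, reading the temporaries.
inline std::pair<int, int> get_output_row(const group_state& g) {
    return {g.first_k, g.max_v + 1};
}
```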
Tomasz Grabiec
0c86abab4d migration_manager, schema_tables: Implement migration_manager::reload_schema()
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
2023-07-03 20:32:59 +02:00
Tomasz Grabiec
9bfe9f0b2f schema_tables: Avoid crashing when table selector has only one kind of tables
Currently not reachable, because selectors are always constructed with
both kinds initialized. Will be triggered by the next patch.
2023-07-03 20:32:59 +02:00
Avi Kivity
66c47d40e6 cql3: selection: drop selector_factories, selectables, and selectors
The whole class hierarchy is no longer used by anything and we can just
delete it.
2023-07-03 19:45:17 +03:00
Avi Kivity
d9cf81f1a6 cql3: select_statement: stop using selector_factories in SELECT JSON
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.

Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.

This is a compatibility problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
2023-07-03 19:45:17 +03:00
Avi Kivity
039472ffb9 cql3: selection: don't create selector_factories any more
We no longer use selector_factories for anything, so we can drop them.
2023-07-03 19:45:17 +03:00
Avi Kivity
e521557ce5 cql3: selection: collect column_definitions using expressions
The replica needs to know which columns we're interested in. Iterate
and recurse into all selector expressions to collect all mentioned columns.

We use the same algorithm that create_factories_and_collect_column_definitions()
uses, even though it is quadratic, to avoid causing surprises.
2023-07-03 19:45:17 +03:00
Avi Kivity
7bd317ace4 cql3: selection: reimplement selection::is_aggregate()
We can get rid of the last use of selector_factories by reimplementing
is_aggregate(). It's simple - if we have an inner loop, we're
aggregating.
2023-07-03 19:45:17 +03:00
Avi Kivity
91cdaa72bd cql3: selection: evaluate aggregation queries via expr::evaluate()
When constructing a selection_with_processing, split the
selectors into an inner loop and an outer loop with split_aggregation().
We can then reimplement add_input_row() and get_output_row() as follows:

 - add_input_row(): evaluate the inner loop expressions and store
   the results in temporaries
 - get_output_row(): evaluate the outer loop expressions, pulling in
   values from those temporaries.

reset(), which is called between groups, simply copies the initial
values gathered by split_aggregation() into the temporaries.

The only complexity comes from add_column_for_post_query_processing(),
which essentially re-does the work of split_aggregation(). It would
be much better if we added the column before split_aggregation() was
called, but some refactoring has to take place before that happens.
2023-07-03 19:45:17 +03:00
Avi Kivity
27254c4f50 cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
In three cases we need to consult a column that's possibly not explicitly
selected:
 - for the WHERE clause
 - for GROUP BY
 - for ORDER BY

The return value of the function is the index where the newly-added
column can be found. Currently, the index is correct for both
the internal column vector and the result set, but soon it won't
be.

In the first two cases (WHERE clause and GROUP BY), we're interested
in the column before grouping; in the last case (ORDER BY) we're interested
in the column after grouping, so we need to distinguish between the two.

Since we already have selection::index_of() that returns the pre-grouping
index, choose the post-grouping index for the return value of
selection::add_column_for_post_processing(), and change the GROUP BY
code to use index_of(). Comments are added.
2023-07-03 19:45:17 +03:00
Avi Kivity
6bf1bd7130 cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
Now that everything is in place, implement the fast-path
transform_input_row() for selection_with_processing. It's a
straightforward call to evaluate() in a loop.

We adjust add_column_for_post_processing() to also update _selectors,
otherwise ORDER BY clauses that require an additional column will not
see that column.

Since every sub-class implements transform_input_row(), mark
the base class declaration as pure virtual.
2023-07-03 19:45:17 +03:00
Avi Kivity
f5eb7fd6dc cql3: selection: store primary key in result_set_builder
expr::evaluate() expects an exploded primary key in its
evaluation_inputs structure (this dates back from the conversion
of filtering to expressions). But right now, the exploded primary
key is only available in the filter.

That's easy to fix however: move the primary key containers
to result_set_builder and just keep references in the filter.

After this, we can evaluate column_value expressions that
reference the primary key.
2023-07-03 19:45:17 +03:00
Avi Kivity
0021f77e30 cql3: expression: fix field_selection::type interpretation by evaluate()
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().

Fix that, and adjust the tests.
2023-07-03 19:45:17 +03:00
Avi Kivity
aed01018a3 cql3: selection: make result_set_builder::current non-optional<>
Previously, we used the engagedness of result_set_builder::optional
as a flag, but the previous patch eliminated that and it's always
engaged. Remove the optional wrapper to reduce noise.
2023-07-03 19:45:17 +03:00
Avi Kivity
44c8507075 cql3: selection: simplify row/group processing
Processing a result set relies on calling result_set_builder::new_row().
This function is quite complex as it has several roles:

 - complete processing of the previously computed row, if any
 - determine if GROUP BY grouping has changed, and flush the previous group
   if so
 - flush the last group if that's the case

This works now, but won't work with expr::evaluate. The reason is that
new_row() is called after the partition key and clustering key of the
new row have been evaluated, so processing of the previous row will see
incorrect data. It works today because we copy the partition key and
clustering key into result_set_builder::current, but expr::evaluate
uses the exploded partition key and clustering key, which have been
clobbered.

The solution is to separate the roles. Instead of new_row() that's
responsible for completing the previous row and starting a new one,
we have start_new_row() that's responsible for what its name says,
and complete_row() that's responsible for completing the row and
checking for group change. The responsibility for flushing the final
group is moved to result_set_builder::build(). This removes the
awkward "more_rows_coming" parameter that makes everything more
complicated.

result_set_builder::current is still optional, but it's always
engaged. The next patch will clean that up.
2023-07-03 19:45:17 +03:00
Avi Kivity
877f4f86d2 cql3: selection: convert requires_thread to expressions
If any function requires a thread to execute (due to running in Lua
or wasm), then the entire selection needs to run in a thread.
2023-07-03 19:45:17 +03:00
Avi Kivity
cbd68abde8 cql: selection: convert used_functions() to expressions
used_functions() is used to check whether prepared statements need
to be invalidated when user-defined functions change.

We need to skip over empty scalar components of aggregates, since
these can be defined by users (with the same meaning as if the
identity function was used).
2023-07-03 19:45:17 +03:00
Avi Kivity
bfb1acc6d3 cql3: selection: convert is_reducible/get_reductions to expressions
The current version of automatic query parallelization works when all
selectors are reducible (e.g. have a state_reduction_function member),
and all the inputs to the aggregates are direct column selectors without
further transformation. The actual column names and reductions need to
be packed up for forward_service to be used.

Convert is_reducible()/get_reductions() to the expression world. The
conversion is fairly straightforward.
2023-07-03 19:45:17 +03:00
Avi Kivity
d99fc29e2d cql3: selection: convert is_count() to expressions
Early versions of automatic query parallelization only
supported `SELECT count(*)` with one selector. Convert the
check to expressions.
2023-07-03 19:45:17 +03:00
Avi Kivity
d36eb8cea6 cql3: selection convert contains_ttl/contains_writetime to work on expressions
contains_ttl/contains_writetime are two attributes of a selection. If a selection
contains them, we must ask the replica to send them over; otherwise we don't
have data to process. Not sending ttl/writetime saves some effort.

The implementation is a straightforward recursive descent using expr::find_in_expression.
2023-07-03 19:45:17 +03:00
Avi Kivity
6c2bb5e1ed cql3: selection: make simple_selectors stateless
Now that we push all GROUP BY queries to selection_with_processing,
we always process rows via transform_input_row() and there's no
reason to keep any state in simple_selectors.

Drop the state and raise an internal error if we're ever
called for aggregation.
2023-07-03 19:45:17 +03:00
Avi Kivity
a26516ef65 cql3: expression: add helper to split expressions with aggregate functions
Aggregate functions cannot be evaluated directly, since they implicitly
refer to state (the accumulator). To allow for evaluation, we
split the expression into two: an inner expression that is evaluated
over the input vector (once per element). The inner expression calls
the aggregation function, with an extra input parameter (the accumulator).

The outer expression is evaluated once per input vector; it calls
the final function, and its input is just the accumulator. The outer
expression also contains any expressions that operate on the result
of the aggregate function.

The accumulator is stored in a temporary.

Simple example:

   sum(x)

is transformed into an inner expression:

   t1 = (t1 + x)   // really sum.aggregation_function

and an outer expression:

   result = t1     // really sum.state_to_result_function

Complicated example:

    scalar_func(agg1(x, f1(y)), agg2(x, f2(y)))

is transformed into two inner expressions:

    t1 = agg1.aggregation_function(t1, x, f1(y))
    t2 = agg2.aggregation_function(t2, x, f2(y))

and an outer expression

    output = scalar_func(agg1.state_to_result_function(t1),
                         agg2.state_to_result_function(t2))

There's a small wart: automatically parallelized queries can generate
"reducible" aggregates that have no state_to_result function, since we
want to pass the state back to the coordinator. Detect that and short
circuit evaluation to pass the accumulator directly.
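The inner/outer split above can be sketched in plain C++. This is a minimal, hypothetical model — the real code operates on cql3 expression nodes and general temporaries, not lambdas and a single int64_t — but the flow is the same: the inner function folds each input row into a temporary accumulator, and the outer function turns the accumulator into the group's result.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical model of the split produced for `sum(x)`.
struct split_aggregate {
    // inner: runs once per input row, folding into the temporary t1
    std::function<void(int64_t& t1, int64_t x)> inner;
    // outer: runs once per group, converting the accumulator to the result
    std::function<int64_t(int64_t t1)> outer;
};

inline split_aggregate make_sum_split() {
    return {
        // t1 = t1 + x   // really sum.aggregation_function
        [](int64_t& t1, int64_t x) { t1 += x; },
        // result = t1   // really sum.state_to_result_function
        [](int64_t t1) { return t1; },
    };
}

inline int64_t evaluate_group(const split_aggregate& agg,
                              const std::vector<int64_t>& rows) {
    int64_t t1 = 0;  // initial value of the temporary, as seeded by reset()
    for (int64_t x : rows) {
        agg.inner(t1, x);
    }
    return agg.outer(t1);
}
```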
2023-07-03 19:45:17 +03:00
Avi Kivity
f48ecb5049 cql3: selection: short-circuit non-aggregations
Currently, selector evaluation assumes the most complex case
where we aggregate, so multiple input rows combine into one output row.

In effect the query either specifies an outer loop (for the group)
and an inner loop (for input rows), or it only specifies the inner loop;
but we always perform the outer and inner loop.

Prepare to have a separate path for the non-aggregation case by
introducing transform_input_row().
2023-07-03 19:45:17 +03:00
Avi Kivity
4a2428e4ec cql3: selection: drop validate_selectors
It's unused. It dates from the (perhaps better) time when
regularity of aggregation across selectors was enforced.
2023-07-03 19:45:17 +03:00
Avi Kivity
432cb02d64 cql3: select_statement: force aggregation if GROUP BY is used
GROUP BY is typically used with aggregation. In one case the aggregation
is implicit:

    SELECT a, b, c
    FROM tab
    GROUP BY x, y, z

One row will appear from each group, even though no aggregation
was specified. To avoid this irregularity, rewrite this query as

    SELECT first(a), first(b), first(c)
    FROM tab
    GROUP BY x, y, z

This allows us to have different paths for aggregations and
non-aggregations, without worrying about this special case.
2023-07-03 19:45:17 +03:00
Avi Kivity
bc6c64e13c cql3: select_statement: levellize aggregation depth
Avoid mixed aggregate/non-aggregate queries by inserting
calls to the first() function. This allows us to avoid internal
state (simple_selector::_current) and make selector evaluation
stateless apart from explicit temporaries.
2023-07-03 19:45:17 +03:00
Avi Kivity
ecdded90cd cql3: selection: skip first_function when collecting metadata
We plan to rewrite aggregation queries that have a non-aggregating
selector using the first function, so that all selectors are
aggregates (or none are). Prevent the first function from affecting
metadata (the auto-generated column names), by skipping over the
first function if detected. The input and output types are unchanged,
so this only affects the name.
2023-07-03 19:45:17 +03:00
Avi Kivity
996e02f5bf cql3: select_statement: explicitly disable automatic parallelization with no aggregates
A query of the form `SELECT foo, count(foo) FROM tab` returns the first
value of the foo column along with the count. This can't be parallelized
today since the first selector isn't an aggregate.

We plan to rewrite the query internally as `SELECT first(foo), count(foo)
FROM tab`, in order to make the query more regular (no mixing of aggregates
and non-aggregates). However, this will defeat the current check since
after the rewrite, all selectors are aggregates.

Prepare for this by performing the check on a pre-rewrite variable, so
it won't be affected by the query rewrite in the next patch.

Note that although we could add support for running
first() in parallel, it's not possible to get the correct results,
since first() is not commutative and we don't reduce in order. It's
also not a particularly interesting query.
2023-07-03 19:45:17 +03:00
Avi Kivity
778ae2b461 cql3: expression: introduce temporaries
Temporaries are similar to bind variables - they are values provided from
outside the expression. While bind variables are provided by the user, temporaries
are generated internally.

The intended use is for aggregate accumulator storage. Currently aggregates
store the accumulator in aggregate_function_selector::_accumulator, which
means the entire selector hierarchy must be cloned for every query. With
expressions, we can have a single expression object reused for many computations,
but we need a way to inject the accumulator into an aggregation, which this
new expression element provides.
2023-07-03 19:45:17 +03:00
Avi Kivity
7c3ceb6473 cql3: select_statement: use prepared selectors
Change one more layer of processing to work on prepared
rather than raw selectors. This moves the call to prepare
the selectors early in select_statement processing. In turn
this changes maybe_jsonize_select_clause() and forward_service's
mock_selection() to work in the prepared realm as well.

This moves us one step closer to using evaluate() to process
the select clause, as the prepared selectors are now available
in select_statement. We can't use them yet since we can't evaluate
aggregations.
2023-07-03 19:45:17 +03:00
Avi Kivity
a338d0455d cql3: selection: avoid selector_factories in collect_metadata()
Generate the column headings in the result set metadata using
the newly introduced result_set_metadata mode of the expression
printer.
2023-07-03 19:45:17 +03:00
Avi Kivity
7aee322a6c cql3: expressions: add "metadata mode" formatter for expressions
When returning a result set (and when preparing a statement), we
return metadata about the result set columns. Part of that is the
column names, which are derived from the expressions used as selectors.

Currently, they are computed via selector::column_name(), but as
we're dismantling that hierarchy we need a different way to obtain
those names.

It turns out that the expression formatter is close enough to what
we need. To avoid disturbing the current :user mode, add a new
:metadata mode and apply the adjustments needed to bring it in line
with what column metadata looks like today.

Note that column metadata is visible to applications and they can
depend on it; e.g. the Python driver allows choosing columns based on
their names rather than ordinal position.
2023-07-03 19:45:17 +03:00
Avi Kivity
a1f4abb753 cql3: selection: convert collect_metadata() to the prepared expression domain
Simplifies refactoring later on.
2023-07-03 19:45:17 +03:00
Avi Kivity
91b251f6b4 cql3: selection: convert processes_selection to work on prepared expressions
processes_selection() checks whether a selector passes-through a column
or applies some form of processing (like a case or function application).

It's more sensible to do this in the prepared domain as we have more
information about the expression. It doesn't really help here, but
it does help the refactoring later in the series.
2023-07-03 19:45:17 +03:00
Avi Kivity
4fb797303f cql3: selection: prepare selectors earlier
Currently, each selector expression is individually prepared, then converted
into a selector object that is later executed. This is done (on a vector
of raw selectors) by cql3::selection::raw_selector::to_selectables().

Split that into two phases. The first phase converts raw_selector into
a new struct prepared_selector (a better name would be plain 'selector',
but it's taken for now). The second phase continues the process and
converts prepared_selector into selectables.

This gives us a full view of the prepared expressions while we're
preparing the select clause of the select statement.
2023-07-03 19:45:17 +03:00
Avi Kivity
70b246eaaf cql3: raw_selector: deinline
It's easier to refactor things if they don't cause the entire
universe to recompile, plus adding new headers is less painful.
2023-07-03 19:45:17 +03:00
Avi Kivity
99fe0ee772 cql3: expression: reimplement verify_no_aggregate_functions()
Most clauses in a CQL statement don't tolerate aggregate functions,
and so they call verify_no_aggregate_functions(). It can now be
reimplemented in terms of aggregation_depth(), removing some code.
2023-07-03 19:45:17 +03:00
Avi Kivity
b1b4a18ad8 cql3: expression: add helpers to manage an expression's aggregation depth
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.

The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).

The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.

When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)`) we don't have to bother, since a scalar
is implicitly promoted to a vector by evaluating it every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`), but it's not worth following.

A small unit test is added.
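The first helper can be sketched over a toy expression tree (the real implementation walks cql3::expr variants; the struct below is a stand-in): the aggregation depth is the maximum number of nested aggregate calls along any path from the root.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical expression node: an aggregate or non-aggregate call
// over child expressions; a leaf models a column reference.
struct expr {
    bool is_aggregate_call = false;
    std::vector<std::shared_ptr<expr>> children;
};

// 0: not aggregating; 1: simple aggregation;
// >= 2: something like max(max(x)), which CQL rejects.
inline int aggregation_depth(const expr& e) {
    int child_max = 0;
    for (const auto& c : e.children) {
        child_max = std::max(child_max, aggregation_depth(*c));
    }
    return child_max + (e.is_aggregate_call ? 1 : 0);
}
```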
2023-07-03 19:45:16 +03:00
Nadav Har'El
94bf6bbeaa Merge 'Remove unused storage_proxy args from some replica::database methods' from Pavel Emelyanov
null

Closes #14489

* github.com:scylladb/scylladb:
  database: Remove unused proxy arg from update_keyspace_on_all_shards()
  database: Remove unused proxy arg from update_keyspace()
2023-07-03 19:14:28 +03:00
Avi Kivity
faf0ea0f68 cql3: expression: improve printing of prepared function calls
Currently, a prepared function_call expression is printed as an
"anonymous function", but it's not really anonymous - the name is
available. Print it out.

This helps in a unit test later on (and is worthwhile by itself).
2023-07-03 19:02:33 +03:00
Alejo Sanchez
520bd90008 test/boost/memtable_test: split test plain/reverse
Split long running test
test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests
for _plain and _reverse.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14447
2023-07-03 15:20:12 +03:00
Pavel Emelyanov
0d4c981423 database: Remove unused proxy arg from update_keyspace_on_all_shards()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-03 14:19:54 +03:00
Pavel Emelyanov
42b9ba48de database: Remove unused proxy arg from update_keyspace()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-03 14:19:36 +03:00
Kefu Chai
04434c02b3 sstables: print generation without {:d}
the formatter for sstables::generation_type does not support "d"
specifier, so we should not use "{:d}" for printing it. this works
before d7c90b5239, but after that
change, generation_type is not an alias of int64_t anymore.
and its formatter does not support "d", so we should either
specialize fmt::formatter<generation_type> to support it or just
drop the specifier.

since seastar::format() is using
```c++
fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...);
```
to print the arguments with given fmt string, we cannot identify
these kind of error at compile time.

at runtime, if we have issues like this, {fmt} would throw exception
like:
```
terminate called after throwing an instance of 'fmt::v9::format_error'
  what():  invalid format specifier
```
when constructing the `std::runtime_error` instance.

so, in this change, "d" is removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14427
2023-07-03 13:53:13 +03:00
Pavel Emelyanov
f18bd23ec5 Merge 'repair: coroutinize some functions' from Kefu Chai
also, take this opportunity to let `handle_mutation_fragment()` return void, for better readability.

Closes #14258

* github.com:scylladb/scylladb:
  repair: do not check retval of handle_mutation_fragment()
  repair: coroutinize move_row_buf_to_working_row_buf()
  repair: coroutinize read_rows_from_disk()
  repair: coroutinize get_sync_boundary()
2023-07-03 13:45:42 +03:00
Avi Kivity
b7556e9482 cql3: functions: add "first" aggregate function
first(x) returns the first x it sees in the group. This is useful
for SELECT clauses that return a mix of aggregates and non-aggregates,
for example

    SELECT max(x), x

with inputs of x = { 1, 2, 3 } is expected to return (3, 1).

Currently, this behavior is handled by individual selectors,
which means they need to contain extra state for this, which
cannot be easily translated to expressions. The new first function
allows translating the SELECT clause above to

    SELECT max(x), first(x)

so all selectors are aggregations and can be handled in the same
way.

The first() function is not exposed to users.
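As a fold, first() can be modeled roughly as follows (names hypothetical — the real function is a cql3 aggregate with an aggregation function and a state-to-result function): the accumulator captures the first value it sees in the group and ignores every later one.

```cpp
#include <cassert>
#include <optional>

// Accumulation step (the "aggregation function"): only the first
// input in the group sticks; later inputs are ignored.
inline void first_accumulate(std::optional<int>& state, int value) {
    if (!state) {
        state = value;
    }
}

// Finalization step (the "state_to_result function"): return the
// captured first value.
inline int first_finalize(const std::optional<int>& state) {
    return *state;
}
```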
2023-07-02 18:15:00 +03:00
Kefu Chai
1ab2bb69b8 keys: do not use zip_iterator for printing key components
the operator==() implementation of boost's zip_iterator
returns true only if all elements in enclosed tuple of zip_iterator
are equal. and the zip_iterator always advances all the iterators in
the enclosed tuple. but in our case, some components might be missing.
in other words, the size of the `components` might be smaller than
that of the `types` range. so, when the zip_iterator advances past
the end of the components, scylla starts reading out of bounds.

because zip_iterator does not allow us to customize how it implements
the equal operator. and we cannot deduce the size of components without
reading all of them. so in this change, we partially revert
3738fcbe05, instead of using fmt::join(),
just iterate through the components manually. this should avoid
the out-of-bound reading, and also preserve the original behavior.
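The shape of the fix can be sketched like this (names and string types are hypothetical; the real code formats key components against the schema's type list): a plain loop bounded by the components' own size cannot advance past the end the way the zipped iteration did.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Manually iterate the key components alongside the schema types.
// A key with fewer components than the schema has types stops at the
// components' end instead of reading out of bounds.
inline std::string format_key(const std::vector<std::string>& components,
                              const std::vector<std::string>& types) {
    std::string out;
    for (size_t i = 0; i < components.size() && i < types.size(); ++i) {
        if (i) {
            out += ":";
        }
        out += types[i] + "=" + components[i];
    }
    return out;
}
```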

Branches: 5.3
Fixes #14435
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14457
2023-07-01 23:49:02 +03:00
Avi Kivity
d88dfa0ad2 tools: scylla-sstable: fix stack overflow due to multiple db::config placed on the stack
db::config is pretty large (~32k) and there are four of them, blowing the stack. Fix by
allocating them on the heap.

It's not clear why this shows up on my system (clang 16) and not in the frozen toolchain.
Perhaps clang 16 is less able to reuse stack space.
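The fix follows the standard pattern for oversized stack objects — a stand-in sketch, since db::config itself is not reproduced here: move the large object to the heap so several live instances no longer consume stack space.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a large object like db::config (~32 KiB each).
struct big_config {
    char payload[32 * 1024];
};

inline std::unique_ptr<big_config> make_config() {
    // was: big_config cfg;  // ~32 KiB of stack per instance;
    // four of these on one frame can blow the stack
    return std::make_unique<big_config>();
}
```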

Closes #14464
2023-07-01 09:21:05 +03:00
Michał Jadwiszczak
2071ade171 docs:cql: add information about generic describe
In our cql's documentation, there was no information that type can
be omitted in a describe statement.
Added this information along with the order of looking for the element.
2023-06-30 14:50:14 +02:00
Michał Jadwiszczak
58eb7a45b7 cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc 2023-06-30 14:50:08 +02:00
Michał Jadwiszczak
d5748fd895 cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
So far generic describe (`DESC <name>`) followed Cassandra
implementation and it only described keyspace/table/view/index.

This commit adds UDT/UDF/UDA to generic describe.

Fixes: #14170
2023-06-30 14:38:22 +02:00
Anna Stuchlik
e7bb86e0f1 doc: fix broken links on the Scylla SStable page 2023-06-30 12:00:59 +02:00
Piotr Dulikowski
ee9bfb583c combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
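Stripped of seastar and futures, the pattern looks roughly like this (a hypothetical, synchronous approximation of `seastar::repeat_until_value`): the step function returns std::nullopt when it cannot yet produce a value, and a flat loop retries on the same stack frame instead of recursing.

```cpp
#include <cassert>
#include <optional>

// Drive `step` until it yields a value. Stack depth stays constant
// no matter how many times `step` comes back empty.
template <typename Step>
int repeat_until_value(Step step) {
    for (;;) {
        if (std::optional<int> v = step()) {
            return *v;
        }
        // no value yet: loop instead of calling ourselves recursively
    }
}
```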

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452
2023-06-30 12:07:13 +03:00
Kamil Braun
1760a84873 raft_group_registry: remove has_group0()
The function is no longer used.
2023-06-30 11:06:05 +02:00
Kamil Braun
6c3f391c0a raft_group0_client: remove using_raft()
The function is no longer used.
2023-06-30 11:06:05 +02:00
Kamil Braun
0eee196a2e migration_manager: disable schema pulls when schema is Raft-managed
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).

We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
the bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.

Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
  enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
  schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
  feature enabled (which means we should start upgrading to use Raft),
  or restarting in the middle of Raft upgrade procedure, schema pulls must
  remain enabled until the Raft upgrade procedure finishes.
  This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
  perform a schema pull.

The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.

Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.

Correct some typos in comments and update the comments.

Fixes #12870
2023-06-30 11:06:02 +02:00
Botond Dénes
5648bfb9a0 Merge 'Find task manager task progress' from Aleksandra Martyniuk
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.

The method may be overriden in inheriting classes.

Closes #14381

* github.com:scylladb/scylladb:
  tasks: delete task_manager::task::impl::_progress as it's unused
  tasks: modify task_manager::task::impl::get_progress method
  tasks: add is_complete method
2023-06-30 09:47:07 +03:00
Botond Dénes
8a7261fd70 Merge 'doc: fix rollback in the 4.3-to-2021.1, 5.0-to-2022.1, and 5.1-to-2022.2 upgrade guides' from Anna Stuchlik
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).

This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

- [x]  5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)

(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)

Closes #14444

* github.com:scylladb/scylladb:
  doc: fix rollback in 4.3-to-2021.1 upgrade guide
  doc: fix rollback in 5.0-to-2022.1 upgrade guide
  doc: fix rollback in 5.1-to-2022.2 upgrade guide
2023-06-30 09:38:45 +03:00
Botond Dénes
d20ed2d4db Merge 'doc: improve the Unified Installer page' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/14033

This PR:
- replaces the OUTDATED list of platforms supported by Unified Installer with a link to the "OS Support" page. In this way, the list of supported OSes will be documented in one place, preventing outdated documentation.
- improves the language and syntax, including:
    - Improving the wording.
    - Replacing "Scylla" with "ScyllaDB"
    - Fixing language mistakes
    - Fixing heading underline so that the headings render correctly.

Closes #14445

* github.com:scylladb/scylladb:
  doc: update the language - Unified Installer page
  doc: update Unified Installer support
2023-06-30 09:38:18 +03:00
Kamil Braun
ff386e7a44 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for a cluster that bootstraps with Raft
enabled right away, but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336
2023-06-29 22:46:42 +02:00
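The index trick above can be modeled in a few lines. A toy sketch only; `needs_snapshot_transfer` is an illustrative name, not Scylla's actual Raft code:

```python
def needs_snapshot_transfer(follower_last_index: int, snapshot_index: int) -> bool:
    # A joining server whose log ends before the leader's snapshot index
    # cannot be caught up by log replication alone, so Raft falls back
    # to a full snapshot (state) transfer.
    return follower_last_index < snapshot_index

# With the initial snapshot index at 0, a fresh joiner (also at 0) is
# considered up to date and never pulls the group 0 state; at 1 it is
# behind, so the transfer is forced.
```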
Konstantin Osipov
3d81408a58 test.py: make experimental: raft the default for all tests
Make sure all tests use the new centralized topology
coordinator. This is a step forward towards maturing the
coordinator implementation.

Closes #14039
2023-06-29 14:44:00 +02:00
Tomasz Grabiec
a9282103ba Merge 'Call storage_service notifications only after keyspace schema changes are applied on all shards' from Benny Halevy
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.

As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.

In addition, any error while updating the keyspace will cause an abort, so as not to leave the database shards in an inconsistent state.

Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.

Fixes #13137

Closes #14158

* github.com:scylladb/scylladb:
  migration_manager: propagate listener notification exceptions
  storage_service: keyspace_changed: execute only on shard 0
  database: modify_keyspace_on_all_shards: execute func first on shard 0
  database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
  database: add modify_keyspace_on_all_shards
  schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
  database: create_keyspace_on_all_shards
  database: update_keyspace_on_all_shards
  database: drop_keyspace_on_all_shards
2023-06-29 12:17:53 +02:00
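The ordering this series establishes can be sketched as follows. Illustrative names only, not the actual Scylla API: shard 0 is updated first, the remaining shards follow, and listeners are notified only once everything is applied.

```python
def modify_keyspace_on_all_shards(shards, apply_change, notify_listeners):
    # Shard 0 goes first: it seeds shared state (e.g. the effective
    # replication map) that the other shards then reuse.
    apply_change(shards[0])
    for shard in shards[1:]:
        apply_change(shard)
    # Listeners run only after every shard has applied the change, so
    # they never observe the database shards in flux; the notification
    # itself runs once, on shard 0.
    notify_listeners(shards[0])
```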
Anna Stuchlik
d3aba00131 doc: update the language - Unified Installer page
This commit improves the language and syntax on
the Unified Installer page. The changes cover:

- Improving the wording.
- Replacing "Scylla" with "ScyllaDB"
- Fixing language mistakes
- Fixing heading underline so that the headings
  render correctly.
2023-06-29 12:11:22 +02:00
Anna Stuchlik
944ce5c5c2 doc: update Unified Installer support
This commit replaces the OUTDATED list of platforms supported
by Unified Installer with a link to the "OS Support" page.
In this way, the list of supported OSes will be documented
in one place, preventing outdated documentation.
2023-06-29 11:51:21 +02:00
Aleksandra Martyniuk
f63825151e tasks: delete task_manager::task::impl::_progress as it's unused 2023-06-29 11:30:27 +02:00
Aleksandra Martyniuk
d624be4e6b tasks: modify task_manager::task::impl::get_progress method
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.
2023-06-29 11:30:26 +02:00
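The aggregation rule can be sketched like this. A toy model; the real task_manager types and progress fields differ:

```python
class Task:
    def __init__(self, finished=False, children=()):
        self.finished = finished
        self.children = list(children)

def get_progress(task):
    # A parent's progress is the sum of its children's progress; a leaf
    # task only reports whether it is finished or not.
    if task.children:
        done = total = 0
        for child in task.children:
            c_done, c_total = get_progress(child)
            done += c_done
            total += c_total
        return done, total
    return (1, 1) if task.finished else (0, 1)
```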
Botond Dénes
2a58b4a39a Merge 'Compaction resharding tasks' from Aleksandra Martyniuk
Task manager's tasks covering resharding compaction
on table and shard level.

Closes #14044

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  replica: delete unused functions and struct
  compaction: add reshard_sstables_compaction_task_impl
  compaction: replica: copy struct and functions from distributed_loader.cc
  compaction: create resharding_compaction_task_impl
2023-06-29 12:10:54 +03:00
Aleksandra Martyniuk
0278b21e76 tasks: add is_complete method
Add is_complete method to task_manager::task::impl and
task_manager::task.
2023-06-29 11:02:14 +02:00
Anna Stuchlik
32cfde2f8b doc: fix rollback in 4.3-to-2021.1 upgrade guide
This commit fixes the Restore System Tables section
in the 4.3-to-2021.1 upgrade guide by adding the command
to restore system tables.
2023-06-29 10:36:47 +02:00
Anna Stuchlik
130ddc3d2b doc: fix rollback in 5.0-to-2022.1 upgrade guide
This commit fixes the Restore System Tables section
in the 5.0-to-2022.1 upgrade guide by adding the command
to restore system tables.
2023-06-29 10:29:23 +02:00
Anna Stuchlik
8b3153f9ef doc: fix rollback in 5.1-to-2022.2 upgrade guide
This commit fixes the Restore System Tables section
in the 5.1-to-2022.2 upgrade guide by adding a command
to clean upgraded SStables during rollback.
2023-06-29 10:17:06 +02:00
Nadav Har'El
dd63169077 Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco
Reduce test string value size, parallelize inserts, and use a prepared statement.

The debug running time for this test is reduced from 13:18 to 7:52.

Refs #13905

Closes #14380

* github.com:scylladb/scylladb:
  test/boost/index_with_paging_test: parallel insert
  test/boost/index_with_paging_test: prepared statement
  test/boost/index_with_paging_test: reduce running time
2023-06-29 10:45:01 +03:00
Kefu Chai
2894bc4954 repair: do not check retval of handle_mutation_fragment()
handle_mutation_fragment() does not return `stop_iteration::yes`
anymore after fbbc86e18c, so let's
stop checking its return value, and make it return void.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-29 10:35:50 +08:00
Kefu Chai
e16b5ceb48 repair: coroutinize move_row_buf_to_working_row_buf()
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-29 10:19:02 +08:00
Kefu Chai
a973b43b9f repair: coroutinize read_rows_from_disk()
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-29 10:17:48 +08:00
Kefu Chai
0cacc1fd4e repair: coroutinize get_sync_boundary()
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-29 10:17:48 +08:00
Tomasz Grabiec
50e8ec77c6 Merge 'Wait for other nodes to be UP and NORMAL on bootstrap right after enabling gossiping' from Kamil Braun
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
2023-06-28 20:40:03 +02:00
Avi Kivity
f6f974cdeb cql3: selection: fix GROUP BY, empty groups, and aggregations
A GROUP BY combined with aggregation should produce a single
row per group, except for empty groups. This is in contrast
to an aggregation without GROUP BY, which produces a single
row no matter what.

The existing code only considered the case of no grouping
and forced a row into the result, but this caused an unwanted
row if grouping was used.

Fix by refining the check to also consider GROUP BY.

XFAIL tests are relaxed.

Fixes #12477.

Note, forward_service requires that aggregation produce
exactly one row, but since it can't work with grouping,
it isn't affected.

Closes #14399
2023-06-28 18:56:22 +03:00
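The intended semantics can be illustrated with a toy aggregator. A hypothetical helper, not the CQL selection code:

```python
def aggregate(rows, group_key=None, agg=sum):
    # Without GROUP BY: exactly one result row, even for empty input.
    if group_key is None:
        return [agg(rows)]
    # With GROUP BY: one row per group, so empty input yields no rows.
    groups = {}
    for row in rows:
        groups.setdefault(group_key(row), []).append(row)
    return [agg(g) for g in groups.values()]
```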
Kamil Braun
b912eeade5 Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb
Since most group0 commands are just mutations, it is easy to combine them
before passing them to the subsystem they are destined for, which is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves the order between commands, so each
time it encounters a command for a different subsystem it flushes the
already combined batch and starts a new one. This extra safety assumes that
there may be dependencies between the subsystems managed by group0, so the
order matters. That may not be the case now, but we prefer to be on the safe side.

Broadcast table commands are not mutations, so they are never combined.

* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
  test: add test for group0 raft command merging
  service: raft: respect max mutation size limit when persisting raft entries
  group0_state_machine: merge commands before applying them whenever possible
2023-06-28 17:21:07 +02:00
Kamil Braun
1fa9678c64 messaging_service: print which connections are dropped due to missing topology info
This connection dropping caused us to spend a lot of time debugging.
Those debugging sessions would be shorter if Scylla logs indicated that
connections are being dropped and why.

Connection drops for a given node are a one-time event - we only do it
if we establish a connection to a node without topology info, which
should only happen before we handle the node's NORMAL status for the
first time. So it's a rare thing and we can log it on INFO level without
worrying about log spam.
2023-06-28 16:20:29 +02:00
Kamil Braun
51cec2be86 storage_service: wait for nodes to be UP on bootstrap
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Refs: #14042

This doesn't completely fix #14042 yet because it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
2023-06-28 16:20:29 +02:00
Kamil Braun
5ec5c7704c storage_service: wait for NORMAL state handler before setup_group0()
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

We do it only in non-raft-topology mode, because with Raft-based
topology, node state changes are propagated to the cluster through
explicit global barriers and we plan to remove node statuses from
gossiper altogether.

Fixes: #12972
2023-06-28 16:19:24 +02:00
Anna Stuchlik
f4ae2c095b doc: fix rollback in 5.2-to-2023.1 upgrade guide
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.

This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

Closes #14373
2023-06-28 17:16:32 +03:00
Alejo Sanchez
d4697ed21e test/boost/index_with_paging_test: parallel insert
Parallelize inserts for long-running test_index_with_paging.

Run time in debug mode reduced by 1 minute 48 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 16:11:58 +02:00
Alejo Sanchez
70a3179888 test/boost/index_with_paging_test: prepared statement
Prepare statement for insert.

Run time in debug mode reduced by 9 seconds.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 14:49:21 +02:00
Michał Jadwiszczak
0a8fcead08 cql3: Specify arguments types in UDA creation errors
Display not only function name but also expected arguments
if `state_function` or `final_function` was not found.

Fixes: #12088

Closes #14278
2023-06-28 15:27:49 +03:00
Alejo Sanchez
48d24269f1 test/boost/index_with_paging_test: reduce running time
Reduce test string value size for test_index_with_paging from 4096 to
100. With 100 bytes it should make the base row significantly larger
than the key so the test will exercise both types of paging in the
scanning code.

The debug running time for this test is reduced from 9 minutes to 6
minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-06-28 13:55:52 +02:00
Nadav Har'El
49c8c06b1b Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restrictions on the same clustering column when an IF is involved.
We reject them at runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?` and then run it, but it will unexpectedly throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT
2023-06-28 14:43:54 +03:00
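The shape of the fix can be sketched like this. Illustrative Python, not the actual `cas_request` code: the point is simply to reject an empty set of clustering ranges before dereferencing the first one.

```python
def first_clustering_range(ranges):
    # Guard the access that used to segfault: an IF condition whose
    # clustering restrictions produce no valid ranges is rejected
    # instead of crashing on ranges[0].
    if not ranges:
        raise ValueError(
            "invalid request: empty clustering range in conditional statement")
    return ranges[0]
```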
Kamil Braun
64c302e777 storage_service: extract gossiper::wait_for_live_nodes_to_show_up()
This piece of `storage_service::wait_for_ring_to_settle()` will be
performed earlier in the boot procedure in follow-up commits.

Make it more generic, to be able to wait for `n` nodes to show up. Here
we wait for `2` nodes - ourselves and at least one other.
2023-06-28 12:36:06 +02:00
Aleksandra Martyniuk
bf3e0744c1 test: extend test_compaction_task.py to test resharding compaction 2023-06-28 11:43:12 +02:00
Aleksandra Martyniuk
87c8d63b7a compaction: add shard_reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction on one shard.
2023-06-28 11:43:12 +02:00
Aleksandra Martyniuk
db6e4a356b compaction: invoke resharding on sharded database
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In lambda passed
to that method, we use both sharded sstable_directory service
and its local instance.

To make it straightforward that the sharded and local instances are
dependent, we call sharded<replica::database>::invoke_on_all
instead and access the local directory through the sharded one.
2023-06-28 11:43:12 +02:00
Aleksandra Martyniuk
1acaed026a compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run() 2023-06-28 11:43:11 +02:00
Aleksandra Martyniuk
85cc85fc5a replica: delete unused functions and struct 2023-06-28 11:41:43 +02:00
Aleksandra Martyniuk
837d77ba8c compaction: add reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction.
2023-06-28 11:41:43 +02:00
Aleksandra Martyniuk
0d6dd3eeda compaction: replica: copy struct and functions from distributed_loader.cc
As a preparation for integrating resharding compaction with task manager
a struct and some functions are copied from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
2023-06-28 11:41:42 +02:00
Aleksandra Martyniuk
2b4874bbf7 compaction: create resharding_compaction_task_impl
resharding_compaction_task_impl serves as a base class of all
concrete resharding compaction task classes.
2023-06-28 11:36:53 +02:00
Jan Ciolek
bfbc3d70b7 modification_statement: reject conditional statements with empty clustering key
`modification_statement::execute_with_condition` validates that a query with
an IF condition can be executed correctly.

There's already a check for empty partition key ranges, but there was no check
for empty clustering ranges.

Let's add a check for the clustering ranges as well, they're not allowed to be empty.
After this change Scylla outputs the same type of message for empty partition and
clustering ranges, which improves UX.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-28 10:30:52 +02:00
Jan Ciolek
ccdb26bf9e statements/cas_request: fix crash on empty clustering range in LWT
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

To fix it, let's throw an exception when the clustering range
is empty. Cassandra also rejects queries with `c = 1 AND c = 2`.

There's also a check for empty partition range, as it used
to crash in the past, can't really hurt to add it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-28 10:18:06 +02:00
Botond Dénes
586102b42e Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.

The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.

There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to detect this case.

After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.

Fixes #13491

Closes #14375

* github.com:scylladb/scylladb:
  readers: evictable_reader: don't accidentally consume the entire partition
  test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
2023-06-28 07:58:45 +03:00
Kefu Chai
c3d2f0cd81 script: add base36-uuid.py
this script provides a tool to decode a base36 encoded timeuuid
to the underlying msb and lsb bits, and to encode msb and lsb
to a string with base36.

Both Scylla and Cassandra 4.x support this new SSTable identifier used
in SSTable names, like "nb-3fw2_0tj4_46w3k2cpidnirvjy7k-big-Data.db".
Since this is a new way to print a timeuuid, and unlike the representation
defined by RFC4122, it is not straightforward to connect the
in-memory representation (0x6636ac00da8411ec9abaf56e1443def0) to its
string representation of SSTable identifiers, like
"3fw2_0tj4_46w3k2cpidnirvjy7k". It would be handy to have this
tool to encode/decode the number/string for debugging purposes.

For more context on the new SSTable identifier, please
see
https://cassandra.apache.org/_/blog/Apache-Cassandra-4.1-New-SSTable-Identifiers.html
and https://issues.apache.org/jira/browse/CASSANDRA-17048

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14374
2023-06-27 16:56:31 +03:00
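The digit mapping the script deals with is plain base36, sketched below. Note this is only the digit alphabet; the real SSTable id string additionally fixes field widths and byte ordering, which this sketch does not reproduce.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n: int) -> str:
    # Encode a non-negative integer using the base36 digit alphabet.
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def from_base36(s: str) -> int:
    # Python's int() accepts any base up to 36 directly.
    return int(s, 36)
```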
Kamil Braun
96bc78905d readers: evictable_reader: don't accidentally consume the entire partition
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.

The code guaranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.

There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.

Fixes #13491
2023-06-27 14:37:29 +02:00
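A toy model of the forward-progress check. Illustrative only: positions are plain integers here, whereas the real reader compares mutation fragment positions.

```python
def fill_buffer(fragments, prev_last_pos, min_size):
    # Fill a buffer, but stop as soon as it is full enough AND its last
    # fragment is past the previous fill's last position. The bug was
    # doing this comparison in the wrong direction, so a buffer that had
    # already progressed past prev_last_pos kept filling until the end
    # of the partition.
    buf = []
    for pos in fragments:
        buf.append(pos)
        if len(buf) >= min_size and pos > prev_last_pos:
            break
    return buf
```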
Kamil Braun
5800ce8ddd test: flat_mutation_reader_assertions: squash r_t_cs with the same position
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).

Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.

To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.

Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
2023-06-27 14:37:25 +02:00
Gleb Natapov
945f476363 test: add test for group0 raft command merging
Add a test that submits 3 large commands, each a little bit larger
than 1/3 of the maximum mutation size. Check that in the end 2 commands were
executed (the first 2 were merged and the third was executed separately).
2023-06-27 14:59:55 +03:00
Gleb Natapov
8307b09c64 service: raft: respect max mutation size limit when persisting raft entries
The code that preserves raft entries builds one batch statement to store
all of them, but the batch statement's execute() merges all of the
statements into one mutation and passes it to the database. The mutation
can be larger than max mutation size limit and the write will fail. Fix
it by splitting the write to multiple batch statements if needed.
2023-06-27 14:59:55 +03:00
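The splitting can be sketched as simple size-bounded chunking. Hypothetical names; the real entries are mutations sized in bytes, and a single oversized entry still goes into a batch of its own.

```python
def split_into_batches(entries, max_mutation_size):
    # Greedily pack entries into batches, flushing whenever adding the
    # next entry would push the batch past the size limit.
    batches, current, current_size = [], [], 0
    for entry in entries:
        size = len(entry)
        if current and current_size + size > max_mutation_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        batches.append(current)
    return batches
```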
Gleb Natapov
311cfa1be8 group0_state_machine: merge commands before applying them whenever possible
Since most group0 commands are just mutations, it is easy to combine them
before passing them to the subsystem they are destined for, which is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves the order between commands, so each
time it encounters a command for a different subsystem it flushes the
already combined batch and starts a new one. This extra safety assumes that
there may be dependencies between the subsystems managed by group0, so the
order matters. That may not be the case now, but we prefer to be on the safe side.

Broadcast table commands are not mutations, so they are never combined.

Fixes: #12581
2023-06-27 14:40:46 +03:00
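The merging rule above can be sketched in a few lines. A toy model with illustrative tuples: adjacent mutation commands for the same subsystem are combined; a command for a different subsystem, or a non-mutation such as a broadcast-table command, flushes the batch, preserving order.

```python
def merge_commands(commands):
    # commands: iterable of (subsystem, payload, is_mutation) tuples.
    batches = []
    for subsystem, payload, is_mutation in commands:
        if (batches and is_mutation
                and batches[-1][0] == subsystem and batches[-1][2]):
            # Same subsystem, both mutations: fold into the open batch.
            batches[-1][1].append(payload)
        else:
            # Different subsystem or non-mutation: start a new batch.
            batches.append([subsystem, [payload], is_mutation])
    return batches
```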
Botond Dénes
8d1dfbf0d9 Merge 'Fixing broken links to ScyllaDB University lessons, Scylla University…' from Guy Shtub
… -> ScyllaDB University

Closes #14385

* github.com:scylladb/scylladb:
  Update docs/operating-scylla/procedures/backup-restore/index.rst
  Fixing broken links to ScyllaDB University lessons, Scylla University -> ScyllaDB University
2023-06-27 13:05:46 +03:00
Avi Kivity
f86dd857ca Merge 'Certificate based authorization' from Calle Wilund
Fixes #10099

Adds the com.scylladb.auth.CertificateAuthenticator type. If set as the authenticator, it will extract roles from the TLS authentication certificate's subject (not the wire cert; those are server side), based on a configurable regex.

Example:

scylla.yaml:

```
    authenticator: com.scylladb.auth.CertificateAuthenticator
    auth_superuser_name: <name>
    auth_certificate_role_query: CN=([^,\s]+)

    client_encryption_options:
      enabled: True
      certificate: <server cert>
      keyfile: <server key>
      truststore: <shared trust>
      require_client_auth: True
```
In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. E.g. for cqlsh, set "usercert" and "userkey" to these certificate files.

No user/password needs to be sent, but the role will be picked up from the auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables), the authenticator mechanism will reject it.

Otherwise, connection becomes the role described.

To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup.

Closes #12214

* github.com:scylladb/scylladb:
  docs: Add documentation of certificate auth + auth_superuser_name
  auth: Add TLS certificate authenticator
  transport: Try to do early, transport based auth if possible
  auth: Allow for early (certificate/transport) authentication
  auth: Allow specifying initial superuser name + passwd (salted) in config
  roles-metadata: Coroutinuze some helpers
2023-06-27 12:52:14 +03:00
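The role extraction can be demonstrated with the example regex from the scylla.yaml above. A sketch; `role_from_subject` is an illustrative name, not the authenticator's API.

```python
import re

# The auth_certificate_role_query value from the example configuration.
ROLE_QUERY = re.compile(r"CN=([^,\s]+)")

def role_from_subject(subject):
    # Pull the role name out of the client certificate's subject; return
    # None when no role can be recognized (the connection would then be
    # rejected by the authenticator).
    m = ROLE_QUERY.search(subject)
    return m.group(1) if m else None
```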
Guy Shtub
89bb690098 Update docs/operating-scylla/procedures/backup-restore/index.rst
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-06-27 12:22:59 +03:00
Calle Wilund
00e5aec7ec docs: Add documentation of certificate auth + auth_superuser_name
Not great docs. But a start.
2023-06-27 07:38:50 +00:00
Benny Halevy
3ca0c6c0a5 compaction_manager: try_perform_cleanup: set owned_ranges_ptr with compaction disabled
Otherwise regular compaction can sneak in and
see !cs.sstables_requiring_cleanup.empty() with
cs.owned_ranges_ptr == nullptr and trigger
the internal error in `compaction_task_executor::compact_sstables`.

Fixes scylladb/scylladb#14296

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14297
2023-06-27 08:47:13 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
That shows a significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
each one has to be replicated into almost every interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to the sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
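
The predicate approach can be sketched in isolation (illustrative names, not Scylla's actual sstable_set API): the reader consumes the existing set through a filter instead of materializing a staging-free copy for every partition.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical sketch: instead of rebuilding a large sstable set minus
// the staging entries (an O(n) rebuild per partition), read through the
// original set with a predicate that skips staging sstables lazily.
struct sstable_info {
    std::string name;
    bool staging;
};

using sstable_predicate = std::function<bool(const sstable_info&)>;

// Returns the sstables a reader would consume, applying the filter at
// read time instead of materializing a fresh set.
std::vector<sstable_info> select_for_read(const std::vector<sstable_info>& all,
                                          const sstable_predicate& pred) {
    std::vector<sstable_info> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out), pred);
    return out;
}

inline bool exclude_staging(const sstable_info& s) { return !s.staging; }
```

The same original set (and its interval map) is reused across partitions; only the cheap filter runs per read.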

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00
Raphael S. Carvalho
1d8cb32a5d table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows a significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
each one has to be replicated into almost every interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to the sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 22:30:39 -03:00
Raphael S. Carvalho
1ff8645eaa view_update_generator: Dump throughput and duration for view update from staging
Very helpful for users to understand how fast view update generation
is processing the staging sstables. Today, the logs are completely
silent on that. It's not uncommon for operators to peek into the
staging dir and deduce the throughput from the removal of files,
which is terrible.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:23 -03:00
Raphael S. Carvalho
83c70ac04f utils: Extract pretty printers into a header
Can be easily reused elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:20 -03:00
Benny Halevy
9231a6c480 cql-pytest: test_using_timestamp: increase ttl
It seems like the current 1-second TTL is too
small for debug build on aarch64 as seen in
https://jenkins.scylladb.com/job/scylla-master/job/build/1513/artifact/testlog/aarch64/debug/cql-pytest.test_using_timestamp.1.log
```
            k = unique_key_int()
            cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v1}) USING TIMESTAMP {ts} and TTL 1")
            cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v2}) USING TIMESTAMP {ts}")
>           assert_value(k, v1)

test/cql-pytest/test_using_timestamp.py:140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

k = 10, expected = 2

    def assert_value(k, expected):
        select = f"SELECT k, v FROM {table} WHERE k = {k}"
        res = list(cql.execute(select))
>       assert len(res) == 1
E       assert 0 == 1
E        +  where 0 = len([])
```

Increase the TTL used to write data to de-flake the test
on slow machines running debug build.

Ref #14182

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14396
2023-06-26 21:35:31 +03:00
Benny Halevy
825d617a53 migration_manager: propagate listener notification exceptions
1e29b07e40 claimed
to make event notification exception safe,
but swallowing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards.  Propagating the exception here will cause an abort,
but that is better than leaving the node up in an
inconsistent state.

We keep notifying other listeners even if any of them failed.
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```

The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/event_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
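
The pattern described — keep notifying every listener, then propagate the first failure instead of swallowing it — can be sketched generically (illustrative code, not the actual migration_manager implementation):

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Notify every listener even if one of them throws, then rethrow the
// first captured exception so the caller sees the failure instead of
// the node silently continuing in an inconsistent state.
void notify_all(const std::vector<std::function<void()>>& listeners) {
    std::exception_ptr first;
    for (const auto& l : listeners) {
        try {
            l();
        } catch (...) {
            if (!first) {
                first = std::current_exception(); // remember, keep going
            }
        }
    }
    if (first) {
        std::rethrow_exception(first); // propagate instead of swallowing
    }
}
```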

Refs #3389

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
a690f0e81f storage_service: keyspace_changed: execute only on shard 0
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.

Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
13dd92e618 database: modify_keyspace_on_all_shards: execute func first on shard 0
When creating or altering a keyspace, we create a new
effective_replication_map instance.

It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate the new e_r_m (and reach
the same result).  When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.
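
A minimal sketch of the seed-then-clone idea, with hypothetical names (the real effective_replication_map machinery is far more involved):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative: shard 0 pays the expensive computation once ("seeds" the
// result), and every other shard clones the seeded value instead of
// recomputing it and reaching the same answer independently.
struct replication_map {
    std::string layout;
};

static int g_expensive_calls = 0; // counts how often the costly path ran

std::shared_ptr<replication_map> compute_expensive() {
    ++g_expensive_calls;
    return std::make_shared<replication_map>(replication_map{"ring-v2"});
}

std::shared_ptr<replication_map> get_erm(unsigned shard,
                                         std::shared_ptr<replication_map>& seed) {
    if (shard == 0) {
        seed = compute_expensive();   // shard 0 computes and seeds
        return seed;
    }
    // other shards find the seed and take a cheap local copy
    return std::make_shared<replication_map>(*seed);
}
```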

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
ba15786059 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.

Fixes #13137

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
3b8c913e61 database: add modify_keyspace_on_all_shards
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
dc9b0812e9 schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
3520c786bd database: create_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Kefu Chai
fb05fddd7d build: build with -O0 if Clang >= 16 is used
to work around https://github.com/llvm/llvm-project/issues/62842,
per the test this issue only surfaces when compiling the tree
with
ae7bf2b80b
which is included in Clang version 16, and the issue disappears
when the tree is compiled with -O0.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14391
2023-06-26 18:55:10 +03:00
Calle Wilund
a3db540142 auth: Add TLS certificate authenticator
Fixes #10099

Adds the com.scylladb.auth.CertificateAuthenticator type. If set as the
authenticator, it will extract roles from the TLS authentication certificate
subject (not the wire cert - those are server side), based on a configurable regex.

Example:

scylla.yaml:

authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_queries:
	- source: SUBJECT
	  query: CN=([^,\s]+)

client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True

In a client, use a certificate signed with the <shared trust>
store as the auth cert, with the common name <name>. I.e. for cqlsh,
set "usercert" and "userkey" to these certificate files.

No user/password needs to be sent; the role will be picked up
from the auth certificate. If none is present, the transport will
reject the connection. If the certificate subject does not
contain a recognized role name (from config or set in tables),
the authenticator mechanism will reject it.

Otherwise, connection becomes the role described.
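
As an illustration of the configured role query, a standalone sketch of extracting the CN with the example regex `CN=([^,\s]+)` (this mirrors the config sample above, not the authenticator's actual implementation):

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Pull a role name out of a certificate subject using the configured
// query regex. An empty result means the connection would be rejected.
std::optional<std::string> role_from_subject(const std::string& subject) {
    static const std::regex query(R"(CN=([^,\s]+))");
    std::smatch m;
    if (std::regex_search(subject, m, query)) {
        return m[1].str(); // first capture group is the role name
    }
    return std::nullopt;
}
```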
2023-06-26 15:00:21 +00:00
Calle Wilund
20e9619bb1 transport: Try to do early, transport based auth if possible
Bypassing the need for an AUTH message+response. I.e. do auth _without_ client having
login specified.
2023-06-26 15:00:21 +00:00
Calle Wilund
a4b13febde auth: Allow for early (certificate/transport) authentication
Preparing for new authenticators. Hint hint.
2023-06-26 15:00:20 +00:00
Calle Wilund
69217662bd auth: Allow specifying initial superuser name + passwd (salted) in config
Instead of locking this to "cassandra:cassandra", allow setting in scylla.yaml
or commandline. Note that config values become redundant as soon as auth tables
are initialized.
2023-06-26 15:00:20 +00:00
Calle Wilund
3638849e63 roles-metadata: Coroutinize some helpers
To make it easier to add options/features
2023-06-26 15:00:20 +00:00
Wojciech Mitros
b8473f45a5 auth: do not grant permissions to creator without actually creating
Currently, when creating a table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
1. The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
2. The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.

These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.

Additionally, a test is added that reproduces both of the mentioned
cases.
2023-06-26 16:29:49 +02:00
Alexey Novikov
ca4e7f91c6 compact and remove expired rows from cache on read
When reading from the cache, compact and expire row tombstones
and remove expired empty rows from the cache.
Range tombstones are not expired in this patch.

Refs #2252, #6033

Closes #12917
2023-06-26 15:29:01 +02:00
Benny Halevy
5710ec55c2 leveled_compaction_backlog_tracker: replace_sstables: provide strong exception safety guarantees
Modify a temporary copy of `_size_per_level`
and apply it back only when done.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 14:21:49 +03:00
Benny Halevy
39d4b548fc time_window_backlog_tracker: replace_sstables: provide strong exception safety guarantees
Modify a temporary copy of the `_windows` map
and move-assign it back atomically when done.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 14:20:05 +03:00
Benny Halevy
635c564a9d size_tiered_backlog_tracker: replace_sstables: provide strong exception safety guarantees
By making all changes on temporary variables
and eventually moving them back into the tracker members
in a noexcept block the function can safely throw
until the changes are committed.
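
The temp-copy-then-noexcept-commit pattern these patches apply can be sketched generically (an illustrative tracker, not the actual size_tiered_backlog_tracker):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <utility>

// Strong exception safety: do all throwing work on temporaries, then
// commit with noexcept move-assignments, so the tracker is never left
// half-updated if anything throws mid-way.
struct backlog_tracker {
    std::map<int, long> size_per_level;
    long total_bytes = 0;

    // Apply {level, delta} size changes atomically with respect to
    // exceptions: either all deltas land, or none do.
    void replace_sstables(const std::map<int, long>& deltas) {
        std::map<int, long> tmp = size_per_level; // copy may throw: state untouched
        long new_total = total_bytes;
        for (auto [level, delta] : deltas) {
            if (tmp[level] + delta < 0) {
                throw std::invalid_argument("level size would go negative");
            }
            tmp[level] += delta;
            new_total += delta;
        }
        size_per_level = std::move(tmp); // noexcept commit
        total_bytes = new_total;         // noexcept commit
    }
};
```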

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 14:02:21 +03:00
Benny Halevy
054a031504 size_tiered_backlog_tracker: provide static calculate_sstables_backlog_contribution
Instead of providing refresh_sstables_backlog_contribution
that updates the tracker in place, provide a static function
calculate_sstables_backlog_contribution that doesn't change
the tracker state to facilitate exception safety in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 13:32:56 +03:00
Benny Halevy
4e5bfe2c18 size_tiered_backlog_tracker: make log4 helper static
It is completely generic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 13:30:43 +03:00
Benny Halevy
5d6c2b0d12 size_tiered_backlog_tracker: define struct sstables_backlog_contribution
Encapsulate the contribution-related members in
struct contribution, to be used for strong exception safety.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 13:29:38 +03:00
Benny Halevy
bf69584ccc size_tiered_backlog_tracker: update_sstables: update total_bytes only if set changed
Although replace_sstables is supposed to be called
only once per {old_ssts, new_ssts} it is safer
to update `_total_bytes` with `sst->data_size()`
only if the sst was inserted/erased successfully.
Otherwise _total_bytes may go out of sync with the
contents of _all.

That said, the next step should be to refer to the
compaction_group's main sstable set directly rather
than maintaining a "shadow" set in the tracker.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 13:28:50 +03:00
Benny Halevy
1a8cc84981 compaction_backlog_tracker: replace_sstables: pass old and new sstables vectors by ref
To facilitate rollback on the error handling path,
to provide strong exception safety guarantees.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 13:27:18 +03:00
Wojciech Mitros
7883a88abd transport: add is_schema_change() method to result_message
In the next patch, we will want to observe when the result
message is a schema change and handle it differently than
when it is not. This patch adds a helper method for that,
which should be more readable than a dynamic_pointer_cast
and a comparison with nullptr.
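
A minimal sketch of the helper and the check it enables (illustrative class names, not the actual transport types):

```cpp
#include <cassert>
#include <memory>

// A virtual predicate reads better than a dynamic_pointer_cast followed
// by a comparison with nullptr.
struct result_message {
    virtual ~result_message() = default;
    virtual bool is_schema_change() const { return false; }
};

struct schema_change_message : result_message {
    bool is_schema_change() const override { return true; }
};

// Grant permissions to the creator only when the statement actually
// changed the schema (i.e. the table did not already exist).
bool should_grant(const std::shared_ptr<result_message>& msg) {
    return msg && msg->is_schema_change();
}
```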
2023-06-26 12:22:14 +02:00
Benny Halevy
0877e7a846 compaction_backlog_tracker: replace_sstables: add FIXME comments about strong exception safety
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 12:51:48 +03:00
Botond Dénes
b23361977b Merge 'Compaction reshape tasks' from Aleksandra Martyniuk
Task manager's tasks covering resharding compaction
on top and shard level.

Closes #14112

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test reshaping compaction
  compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run()
  compaction: add shard_reshaping_compaction_task_impl
  replica: delete unused function
  compaction: add table_reshaping_compaction_task_impl
  compaction: copy reshape to task_manager_module.cc
  compaction: add reshaping_compaction_task_impl
2023-06-26 11:56:07 +03:00
Alejo Sanchez
4999cbc1cf test/boost/cql_functions_test: split long running tests
Split long running test_aggregate_functions to one case per type.

This allows test.py to run them in parallel.

Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14368
2023-06-26 11:29:36 +03:00
Alejo Sanchez
8b1968cfbb test/boost/schema_changes_test: split long-running test
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.

Add static checks to alert if the array of types changed.

Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14367
2023-06-26 11:24:07 +03:00
Alejo Sanchez
633f026d63 test/boost/memtable_test: allow parallel run
Remove previous configuration blocking parallel run.

Test cases run fine in local debug.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14369
2023-06-26 11:23:43 +03:00
Alejo Sanchez
3cbfd863eb test/boost/database_test: split long running tests
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.

They run with x_log2_compaction_groups of 0 and 1, each one taking
from 10 to 15 minutes in debug mode, for a total of 28 and 22 minutes.

Split the test cases to run with 0 and 1, so test.py can run them in
parallel.

Refs #13905

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14356
2023-06-26 11:20:27 +03:00
Takuya ASADA
c70a9cbffe scylla_fstrim_setup: start scylla-fstrim.timer on setup
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior; we should start it during setup.

Also, unmask is unnecessary for enabling the timer.

Fixes #14249

Closes #14252
2023-06-26 11:17:51 +03:00
Petr Gusev
1e851262f2 storage_proxy: handler responses, use pointers to default constructed values instead of nulls
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.

The problem was introduced in this commit
 2d791a5ed4. The
 function `encode_replica_exception_for_rpc` used
 `default_tuple_maker` callback to create tuples
 containing exceptions. Callers returned pointers
 to default-constructed values in this callback,
 e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
 The commit changed this to just `SourceTuple{}`,
 which means nullptr for pointer types.
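
The idea can be sketched with a simplified stand-in for make_default_rpc_tuple (the real function works with Seastar's foreign_ptr/lw_shared_ptr; plain shared_ptr is used here for a self-contained example):

```cpp
#include <cassert>
#include <memory>
#include <tuple>

// Stand-in result type for the sketch.
struct reconcilable_result {
    int row_count = 0;
};

// Build a response tuple whose smart-pointer elements point at
// default-constructed values rather than being null, since the RPC
// layer cannot serialize null values in tuples.
template <typename... Ts>
std::tuple<std::shared_ptr<Ts>...> make_default_rpc_tuple() {
    return std::make_tuple(std::make_shared<Ts>()...); // never nullptr
}
```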

Fixes: #14282

Closes #14352
2023-06-26 11:10:38 +03:00
Anna Stuchlik
74fc69c825 doc: add Ubuntu 22 to 2021.1 OS support
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036

This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.

This commit fixes a bug and must be backported to
branch-5.3 and branch-5.2.

Closes #14372
2023-06-26 10:41:43 +03:00
Aleksandra Martyniuk
197635b44b compaction: delete generation of new sequence number for table tasks
Compaction tasks covering table major, cleanup, offstrategy,
and upgrade sstables compaction inherit the sequence number from their
parents. Thus they do not need to have a new sequence number
generated as it will be overwritten anyway.

Closes #14379
2023-06-26 10:36:10 +03:00
Benny Halevy
53a6ea8616 database: update_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 09:35:35 +03:00
Benny Halevy
9d40305ef6 database: drop_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 09:34:42 +03:00
Guy Shtub
fa9df1b216 Fixing broken links to ScyllaDB University lessons, Scylla University -> ScyllaDB University 2023-06-25 09:09:49 +03:00
Aleksandra Martyniuk
b02a5fd184 test: extend test_compaction_task.py to test reshaping compaction 2023-06-23 16:22:53 +02:00
Aleksandra Martyniuk
f9a527b06d compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run() 2023-06-23 16:22:53 +02:00
Aleksandra Martyniuk
1960904a72 compaction: add shard_reshaping_compaction_task_impl
shard_reshaping_compaction_task_impl covers reshaping compaction
on one shard.
2023-06-23 16:22:38 +02:00
Aleksandra Martyniuk
19ec5b4256 replica: delete unused function 2023-06-23 15:57:43 +02:00
Aleksandra Martyniuk
e3e2d6b886 compaction: add table_reshaping_compaction_task_impl 2023-06-23 15:57:37 +02:00
Aleksandra Martyniuk
dace5fb004 compaction: copy reshape to task_manager_module.cc
distributed_loader::reshape is copied to compaction/task_manager_module.cc
as it will be used in reshape compaction tasks.
2023-06-23 12:53:16 +02:00
Aleksandra Martyniuk
981a50e490 compaction: add reshaping_compaction_task_impl
reshaping_compaction_task_impl serves as a base class of all
concrete reshaping compaction task classes.
2023-06-23 12:53:15 +02:00
Kamil Braun
e6942d31d3 Merge 'query processor code cleanup' from Gleb
The series contains mostly cleanups for query processor and no functional
change. The last patch is a small cleanup for the storage_proxy.

* 'qp-cleanup' of https://github.com/gleb-cloudius/scylla:
  storage_proxy: remove unused variable
  client_state: co-routinise has_column_family_access function
  query_processor: get rid of internal_state and create individual query_state for each request
  cql3: move validation::validate_column_family from client_state::has_column_family_access
  client_state: drop unneeded argument from has.*access functions
  cql3: move check for dropping cdc tables from auth to the drop statement code itself
  query_processor: co-routinise execute_prepared_without_checking_exception_message function
  query_processor: co-routinize execute_direct_without_checking_exception_message function
  cql3: remove empty statement::validate functions
  cql3: remove empty function validate_cluster_support
  cql3/statements: fix indentation and spurious white spaces
  query_processor: move statement::validate call into execute_with_params function
  query_processor: co-routinise execute_with_params function
  query_processor: execute statement::validate before each execution of internal query instead of only during prepare
  query_processor: get rid of shared internal_query_state
  query_processor: co-routinize execute_paged_internal function
  query_processor: co_routinize execute_batch_without_checking_exception_message function
  query_processor: co-routinize process_authorized_statement function
2023-06-23 10:32:57 +02:00
Kamil Braun
be5b61b870 Merge 'cql3: expr: break up expression.hh header' from Avi Kivity
It's very annoying to add a declaration to expression.hh and watch
the whole world get recompiled. Improve that by moving less-common
functions to a new header expr-utils.hh. Move the evaluation machinery
to a new header evaluate.hh. The remaining definitions in expression.hh
should not change as often, and thus cause less frequent recompiles.

Closes #14346

* github.com:scylladb/scylladb:
  cql3: expr: break up expression.hh header
  cql3: expr: restrictions.hh: protect against double inclusions
  cql3: constants: deinline
  cql3: statement_restrictions: deinline
  cql3: deinline operation::fill_prepare_context()
2023-06-23 10:19:28 +02:00
Nadav Har'El
0a1283c813 Merge 'cql3:statements:describe_statement: check pointer after casting to UDF/UDA' from Michał Jadwiszczak
There was a bug in describe_statement. When executing `DESC FUNCTION <uda name>` or `DESC AGGREGATE <udf name>`, Scylla was crashing because a function was found (`functions::find()` searches both UDFs and UDAs) but it was of the wrong kind, and the pointer wasn't checked after the cast.

Added a test for this.

Fixes: #14360

Closes #14332

* github.com:scylladb/scylladb:
  cql-pytest:test_describe: add test for filtering UDF and UDA
  cql3:statements:describe_statement: check pointer to UDF/UDA
2023-06-22 20:54:25 +03:00
Michał Jadwiszczak
d3d9a15505 cql-pytest:test_describe: add test for filtering UDF and UDA 2023-06-22 18:08:45 +02:00
Michał Jadwiszczak
d498451cdf cql3:statements:describe_statement: check pointer to UDF/UDA
While looking for specific UDF/UDA, result of
`functions::functions::find()` needs to be filtered out based on
function's type.
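
The bug class and its fix can be sketched with illustrative types (not the actual cql3 classes):

```cpp
#include <cassert>
#include <memory>

// A shared registry lookup can return either kind of function; the
// downcast may therefore yield nullptr and must be checked before use.
struct cql_function {
    virtual ~cql_function() = default;
};
struct user_function : cql_function {};   // UDF
struct user_aggregate : cql_function {};  // UDA

// DESC FUNCTION should only describe UDFs; a UDA found under the same
// name must be filtered out, not dereferenced.
std::shared_ptr<user_function> find_udf(const std::shared_ptr<cql_function>& f) {
    return std::dynamic_pointer_cast<user_function>(f); // may be nullptr
}
```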

Fixes: #14360
2023-06-22 18:08:16 +02:00
Gleb Natapov
94fcba5662 storage_proxy: remove unused variable 2023-06-22 15:26:20 +03:00
Gleb Natapov
caee26ab4f client_state: co-routinise has_column_family_access function 2023-06-22 15:26:20 +03:00
Gleb Natapov
28f31bcbb1 query_processor: get rid of internal_state and create individual query_state for each request
We want to put a per-request field in query_state, so make it unique for
each internal execution (it is already unique for non-internal ones).
2023-06-22 15:26:20 +03:00
Avi Kivity
b858a4669d cql3: expr: break up expression.hh header
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:

 - moving some restrictions-related definitions to
   the existing expr/restrictions.hh
 - moving evaluation related names to a new header
   expr/evaluate.hh
 - move utilities to a new header
   expr/expr-utilities.hh

expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
2023-06-22 14:21:03 +03:00
Avi Kivity
25c351a4f6 cql3: expr: restrictions.hh: protect against double inclusions
Add #pragma once. Right now it's safe as it only has declarations
(which can be repeated), but soon it will have a definition.
2023-06-22 14:19:43 +03:00
Avi Kivity
7302088274 cql3: constants: deinline
To reduce future header fan-in, deinline all non-trivial functions.
While these are on the hot path, they can't be inlined anyway as they're
virtual, and they're quite heavy besides.
2023-06-22 14:19:43 +03:00
Avi Kivity
6c0f8a73c5 cql3: statement_restrictions: deinline
Reduce future header fan-in by deinlining functions. These are
all on the prepare path.
2023-06-22 14:19:43 +03:00
Avi Kivity
3834a1fd7c cql3: deinline operation::fill_prepare_context()
To reduce operation.hh include fan-in, deinline fill_prepare_context().
It's not performance sensitive, as it's on the prepare path.
2023-06-22 14:19:43 +03:00
Gleb Natapov
4bad482e4b cql3: move validation::validate_column_family from client_state::has_column_family_access
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today. For instance, keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
2023-06-22 13:57:36 +03:00
Gleb Natapov
31bddb65c7 client_state: drop unneeded argument from has.*access functions
After previous patch we can drop db argument to most of has.*access
functions in the client_state.
2023-06-22 13:57:36 +03:00
Gleb Natapov
06bcce53b5 cql3: move check for dropping cdc tables from auth to the drop statement code itself
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined in a test. Not sure if it is a good
idea to change it now.
2023-06-22 13:57:36 +03:00
Gleb Natapov
0820309c14 query_processor: co-routinise execute_prepared_without_checking_exception_message function 2023-06-22 13:57:36 +03:00
Gleb Natapov
818e72c029 query_processor: co-routinize execute_direct_without_checking_exception_message function 2023-06-22 13:57:36 +03:00
Gleb Natapov
45ce608117 cql3: remove empty statement::validate functions
There are a lot of empty overloads of this function, so let's remove them
and use the one in the parent class instead.
2023-06-22 13:57:33 +03:00
Gleb Natapov
3cd9b8548d cql3: remove empty function validate_cluster_support 2023-06-22 13:52:52 +03:00
Gleb Natapov
8c2c4a6a78 cql3/statements: fix indentation and spurious white spaces 2023-06-22 13:49:11 +03:00
Gleb Natapov
d75a41ba30 query_processor: move statement::validate call into execute_with_params function
It is called before any call to the function anyway, so let's do it once instead.
2023-06-22 13:49:11 +03:00
Gleb Natapov
24e78059a5 query_processor: co-routinise execute_with_params function 2023-06-22 13:49:11 +03:00
Gleb Natapov
725fa5e0f3 query_processor: execute statement::validate before each execution of internal query instead of only during prepare
There is a discrepancy in how statement::validate is used. On the regular
path it is called before each execution, but on the internal execution
path it is called only once, during prepare. Such a discrepancy makes it
hard to reason about what can and cannot be done during the call. Call it
uniformly before each execution. This allows validate to check state that
can change after prepare.
2023-06-22 13:49:11 +03:00
Gleb Natapov
ce12a18135 query_processor: get rid of shared internal_query_state
internal_query_state has been passed around in a shared_ptr since the Java
translation days. It can be a regular C++ type with a lifetime
bounded by the execution of the function it was created in.
2023-06-22 13:49:11 +03:00
Gleb Natapov
aabd05e2ef query_processor: co-routinize execute_paged_internal function 2023-06-22 13:49:11 +03:00
Gleb Natapov
64a67a59d6 query_processor: co_routinize execute_batch_without_checking_exception_message function 2023-06-22 13:49:11 +03:00
Gleb Natapov
c4ca24e636 query_processor: co-routinize process_authorized_statement function 2023-06-22 13:49:11 +03:00
Kamil Braun
23a60df92d Merge 'cql3: expr: simplify evaluate()' from Avi Kivity
Make evaluate()'s body more regular, then exploit it by
replacing the long list of branches with a lambda template.

Closes #14306

* github.com:scylladb/scylladb:
  cql3: expr: simplify evaluate()
  cql3: expr: standardize evaluate() branches to call do_evaluate()
  cql3: expr: rename evaluate(ExpressionElement) to do_evaluate()
2023-06-22 12:18:36 +02:00
Kamil Braun
563d466de1 Merge 'cql3: select_statement: coroutinize indexed statement's do_execute()' from Avi Kivity
Improves readability, and is probably a little faster too.

Closes #14311

* github.com:scylladb/scylladb:
  cql3: select_statement: reindent indexed_table_select_statement::do_execute
  cql3: select_statement: simplify inner lambda in indexed_table_select_statement::do_execute()
  cql3: select_statement: coroutinize indexed_table_select_statement::do_execute()
2023-06-22 12:10:45 +02:00
Botond Dénes
55e09dbdc0 Merge 'doc: move cloud deployment instruction to docs -v2' from Anna Stuchlik
This is V2 of https://github.com/scylladb/scylladb/pull/14108

This commit moves the installation instructions for the cloud from the [website](https://www.scylladb.com/download/) to the docs.

The scope:

* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updated the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Added new bookmarks in other files to create stable links, for example, ".. _networking-ports:"
* Moved common files to the new "installation-common" directory. This step is required to exclude the open source-only files in the Enterprise repository.

In addition:
- The Configuration Reference file was moved out of the installation section (it's not about installation at all)
- The links to creating a cluster were removed from the installation page (as not related).

Related: https://github.com/scylladb/scylla-docs/issues/4091

Closes #14153

* github.com:scylladb/scylladb:
  doc: remove the rpm-info file (What is in each RPM) from the installation section
  doc: move cloud deployment instruction to docs -v2
2023-06-22 12:58:30 +03:00
Avi Kivity
32b27d6a08 cql3: expr: change evaluation_input vector components to take spans
Spans are slightly cleaner, slightly faster (as they avoid an indirection),
and allow for replacing some of the arguments with small_vector:s.

Closes #14313
2023-06-22 11:28:01 +02:00
Anna Stuchlik
950ef5195e Merge branch 'master' into anna-install-cloud-v2 2023-06-22 10:03:29 +02:00
Botond Dénes
e1c2de4fb8 Merge 'forward_service: fix forgetting case-sensitivity in aggregates ' from Jan Ciołek
There was a bug that caused aggregates to fail when used on case-sensitive columns.

For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non-case-sensitive column names we convert them to lowercase, but for case-sensitive names we have to preserve the name as originally written.

The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Closes #14340

* github.com:scylladb/scylladb:
  service/forward_service.cc: make case-sensitivity explicit
  cql-pytest/test_aggregate: test case-sensitive column name in aggregate
  forward_service: fix forgetting case-sensitivity in aggregates
2023-06-22 08:25:33 +03:00
Botond Dénes
320159c409 Merge 'Compaction group major compaction task' from Aleksandra Martyniuk
Task manager task covering compaction group major
compaction.

Uses multiple inheritance on already existing
major_compaction_task_executor to keep track of
the operation with task manager.

Closes #14271

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py
  test: use named variable for task tree depth
  compaction: turn major_compaction_task_executor into major_compaction_task_impl
  compaction: take gate holder out of task executor
  compaction: extend signature of some methods
  tasks: keep shared_ptr to impl in task
  compaction: rename compaction_task_executor methods
2023-06-22 08:15:17 +03:00
Avi Kivity
8576502c48 Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.

This works not only for nodes that restart, but also for nodes that were running behind a network partition and were removed. Even when the partition resolves, the existing nodes will effectively maintain a firewall against that node.

Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.

Closes #13850

* github.com:scylladb/scylladb:
  test: pylib: increase checking period for `get_alive_endpoints`
  test: add node banning test
  test: pylib: manager_client: `get_cql()` helper
  test: pylib: ScyllaCluster: server pause/unpause API
  raft topology: ban left nodes
  raft topology: skip `left_token_ring` state during `removenode`
  raft topology: prepare decommission path for node banning
  raft topology: introduce `left_token_ring` state
  raft topology: `raft_topology_cmd` implicit constructor
  messaging_service: implement host banning
  messaging_service: exchange host IDs and map them to connections
  messaging_service: store the node's host ID
  messaging_service: don't use parameter defaults in constructor
  main: move messaging_service init after system_keyspace init
2023-06-21 20:16:45 +03:00
Anna Stuchlik
c65abb06cd doc: update the OSS docs landing page
Fixes https://github.com/scylladb/scylladb/issues/14333

This commit replaces the documentation landing page with
the Open Source-only documentation landing page.

This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.

Closes #14343
2023-06-21 17:06:48 +03:00
Jan Ciołek
16c21d7252 service/forward_service.cc: make case-sensitivity explicit
Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 16:02:41 +02:00
Jan Ciolek
854b0301be cql-pytest/test_aggregate: test case-sensitive column name in aggregate
There was a bug which made aggregates fail when used with case-sensitive
column names.
Add a test to make sure that this doesn't happen in the future.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 14:49:24 +02:00
Jan Ciolek
7fca350075 forward_service: fix forgetting case-sensitivity in aggregates
There was a bug that caused aggregates to fail when
used on case-sensitive columns.

For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non-case-sensitive column names we convert them to lowercase,
but for case-sensitive names we have to preserve the name
as originally written.

The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 14:37:42 +02:00
Nadav Har'El
8a9de08510 sstable: limit compression chunk size to 128 KB
The chunk size used in sstable compression can be set when creating a
table, using the "chunk_length_in_kb" parameter. It can be any power-of-two
multiple of 1KB. Very large compression chunks are not useful - they
offer diminishing returns on compression ratio, and require very large
memory buffers and reading a very large amount of disk data just to
read a small row. In fact, small chunks are recommended - Scylla
defaults to 4 KB chunks, and Cassandra lowered their default from 64 KB
(in Cassandra 3) to 16 KB (in Cassandra 4).

Therefore, allowing arbitrarily large chunk sizes is just asking for
trouble. Today, a user can ask for a 1 GB chunk size, and crash or hang
Scylla when it runs out of memory. So in this patch we add a hard limit
of 128 KB for the chunk size - anything larger is refused.

Fixes #9933

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14267
2023-06-21 14:26:02 +03:00
Kefu Chai
f014ccf369 Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai""
This reverts commit 562087beff.

The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.

Fixes #10459
2023-06-21 13:02:40 +03:00
Avi Kivity
e233f471b8 Merge 'Respect tablet shard assignment' from Tomasz Grabiec
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards, taking into account currently allocated tablets in the system. Each tablet has equal weight; vnode load is ignored.
2. The CDC subsystem was not adjusted (not supported yet).
3. sstable sharding metadata reflects tablet boundaries.
4. Resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables).
5. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
6. Sstable cleanup is not wired up properly yet.

After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.

To make life easier, support was added to obtain the table pointer from the schema pointer:

```
schema_ptr s;
s->table().shard_of(...)
```

Closes #13939

* github.com:scylladb/scylladb:
  locator: network_topology_strategy: Allocate shards to tablets
  locator: Store node shard count in topology
  service: topology: Extract topology updating to a lambda
  test: Move test_tablets under topology_experimental
  sstables: Add trace-level logging related to shard calculation
  schema: Catch incorrect uses of schema::get_sharder()
  dht: Rename dht::shard_of() to dht::static_shard_of()
  treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
  storage_proxy: Avoid multishard reader for tablets
  storage_proxy: Obtain shard from erm in the read path
  db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
  forward_service: Use table sharder
  alternator: Use table sharder
  db: multishard: Obtain sharder from erm
  sstable_directory: Improve trace-level logging
  db: table: Introduce shard_of() helper
  db: Use table sharder in compaction
  sstables: Compute sstable shards using sharder from erm when loading
  sstables: Generate sharding metadata using sharder from erm when writing
  test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
  dht: Make split_range_to_single_shard() prepared for tablet sharder
  sstables: Move compute_shards_for_this_sstable() to load()
  dht: Take sharder externally in splitting functions
  locator: Make sharder accessible through effective_replication_map
  dht: sharder: Document guarantees about mapping stability
  tablets: Implement tablet sharder
  tablets: Include pending replica in get_shard()
  dht: sharder: Introduce next_shard()
  db: token_ring_table: Filter out tablet-based keyspaces
  db: schema: Attach table pointer to schema
  schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
  schema_registry: Make learn(schema_ptr) attach entry to the target schema
  test: lib: cql_test_env: Expose feature_service
  test: Extract throttle object to separate header
2023-06-21 10:20:41 +03:00
Calle Wilund
f18e967939 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally, all code
using (and especially modifying) these metrics is executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to the creating scheduling group and using it, not
the active one, when/if creating new metrics.

Closes #14294
2023-06-21 10:08:27 +03:00
Tomasz Grabiec
ebdebb982b locator: network_topology_strategy: Allocate shards to tablets
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.

Takes load due to current tablet allocation into account.

Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
e110167a2a locator: Store node shard count in topology
Will be needed by tablet allocator.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
dd968e16bf service: topology: Extract topology updating to a lambda
Reduces code duplication.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
6defcb7bd5 test: Move test_tablets under topology_experimental
Tablets will rely on shard_count information in topology, which is set
only when using experimental raft-based topology.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
34f28aa0cb sstables: Add trace-level logging related to shard calculation 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
f6625e16ee schema: Catch incorrect uses of schema::get_sharder()
We still use it in many places in unit tests, which is ok because
those tables are vnode-based.

We want to catch incorrect uses in production as they may lead to
hard-to-debug consistency problems.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
29cbdb812b dht: Rename dht::shard_of() to dht::static_shard_of()
This is in order to prevent new incorrect uses of dht::shard_of() from
being accidentally added. Also, it makes sure that all current uses are
caught by the compiler and require an explicit rename.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
21198e8470 treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fb0bdcec0c storage_proxy: Avoid multishard reader for tablets
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.

The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
10e05eec66 storage_proxy: Obtain shard from erm in the read path
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
d4497a058e forward_service: Use table sharder
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ab94e74774 alternator: Use table sharder
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
d92287f997 db: multishard: Obtain sharder from erm
This is not strictly necessary, as the multishard reader will be later
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use the erm->get_sharder() instead of
schema::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
18f567385c sstable_directory: Improve trace-level logging 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
34ba8a6a53 db: table: Introduce shard_of() helper
Saves some boilerplate code.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
36da062bcb db: Use table sharder in compaction 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ad983ac23d sstables: Compute sstable shards using sharder from erm when loading
schema::get_sharder() does not use the correct sharder for
tablet-based tables.  Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
17d6163548 sstables: Generate sharding metadata using sharder from erm when writing
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
36e12020b9 test: partitioner: Test split_range_to_single_shard() on tablet-like sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
28b972a588 dht: Make split_range_to_single_shard() prepared for tablet sharder
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.

Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:

  start_token = sharder.token_for_next_shard(end_token, shard);

If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fe7922d65c sstables: Move compute_shards_for_this_sstable() to load()
Soon, compute_shards_for_this_sstable() will need to take a sharder object.

open_data() is called indirectly from sstable::load() and directly
after writing an sstable from various paths. The latter don't really
need to compute shards, since the field is already set by the writer. In
order to reduce code churn, move compute_shards_for_this_sstable() to
the load() path only so that only load() needs to take the sharder.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
390bcf3fae dht: Take sharder externally in splitting functions
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.

Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
353ce1a6d1 locator: Make sharder accessible through effective_replication_map
For tablets, sharding depends on the replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.

Existing users will be transitioned incrementally in later patches.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
606a8ee2da dht: sharder: Document guarantees about mapping stability 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
22ab100b41 tablets: Implement tablet sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e44e6033d8 tablets: Include pending replica in get_shard()
We need to move get_shard() from tablet_info to tablet_map in order to
have access to transition_info.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e8dd5e34c3 dht: sharder: Introduce next_shard()
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().

The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
current sharder and the tablet sharder.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
16797c2d1a db: token_ring_table: Filter out tablet-based keyspaces
Querying from virtual table system.token_ring fails if there is a
tablet-based table due to attempt to obtain a per-keyspace erm.

Fix by not showing such keyspaces.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
2303466375 db: schema: Attach table pointer to schema
This will make it easier to access table properties in places which
only have a schema_ptr. This is in particular useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.

Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.

It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
84cb0f5df7 schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
The entry may exist, but its schema may not yet be loaded. learn()
didn't take that into account. This problem is not reachable in
production code, which currently always calls get_or_load() before
learn(), except for boot, but there's no concurrency at that point.

Exposed by unit test added later.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
053484e762 schema_registry: Make learn(schema_ptr) attach entry to the target schema
System tables have static schemas and code uses those static schemas
instead of looking them up in the database. We want those schemas to
have a valid table() once the table is created, so we need to attach
registry entry to the target schema rather than to a schema duplicate.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ebc49e89ab test: lib: cql_test_env: Expose feature_service 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ad6d2b42f2 test: Extract throttle object to separate header 2023-06-21 00:58:24 +02:00
Kamil Braun
643e69af89 Merge 'Cluster features on raft: add storage for supported and enabled features' from Piotr Dulikowski
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.

Two new columns are added to `system.topology`:

- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation, the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.

These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.

During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.

Closes #14232

* github.com:scylladb/scylladb:
  storage_service: update supported cluster features in group0 on start
  storage_service: add methods for features to topology mutation builder
  storage_service: use explicit ::set overload instead of a template
  storage_service: reimplement mutation builder setters
  storage_service: introduce topology_mutation_builder_base
  topology_state_machine: include information about features
  system_keyspace: introduce deserialize_set_column
  db/system_keyspace: add storage for cluster features managed in group 0
2023-06-20 18:32:00 +02:00
Avi Kivity
453bbc1115 cql3: expr: improve error message when rejecting aggregation functions in illegal contexts
Fix a small grammatical error, and capitalize WHERE in accordance
with SQL tradition.

Closes #14288
2023-06-20 17:52:53 +03:00
Piotr Dulikowski
3e955945de storage_service: update supported cluster features in group0 on start
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.

At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything so it's safe to update this column
unconditionally. In the future, the node will only join / start group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
707e929831 storage_service: add methods for features to topology mutation builder
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:

- `supported_features` can now be overwritten via a new overload of
  `topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
  added to it) via `topology_mutation_builder::add_enabled_features`. As
  the set of enabled features only grows, this should be sufficient.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
2a4462a01f storage_service: use explicit ::set overload instead of a template
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather than at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view does not satisfy std::convertible_to<sstring> and
overload resolution fell back to the catch-all overload.

This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
a8aaeabfac storage_service: reimplement mutation builder setters
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.

Note that the `::set` method for the unordered set of tokens now does
not delete the column in case an empty value is set, instead it just
writes an empty set. This semantic is arguably clearer given that we
have an explicit `::del` method and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
ee12192125 storage_service: introduce topology_mutation_builder_base
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some detail about setting/deleting/etc.
column in the mutation, the actual topology (node) mutation builder will
only have to care about converting types and/or allowing only particular
columns to be set. The class is using CRTP: derived classes provide
access to the row being modified, schema and the timestamp.

For the sake of commit diff readability, this commit only introduces this
class and changes the builders to derive from it but no setter
implementations are modified - this will be done in the next commit.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
bc84d59665 topology_state_machine: include information about features
Now, the newly added `supported_features` and `enabled_features` columns
are reflected in the `topology_state_machine` structure.
2023-06-20 16:41:05 +02:00
Piotr Dulikowski
e527e63abc system_keyspace: introduce deserialize_set_column
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, therefore they can be
abstracted away.

This commit adds `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type` which can be
then passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
2023-06-20 16:37:09 +02:00
Avi Kivity
77ff78328b cql3: select_statement: reindent indexed_table_select_statement::do_execute 2023-06-20 14:12:58 +03:00
Avi Kivity
218d9fe384 cql3: select_statement: simplify inner lambda in indexed_table_select_statement::do_execute()
The lambda is defined to return a coordinator_result<stop_iteration>,
but in fact only returns successful outcomes, never failures.

Change it to return a plain stop_iteration, so its callers don't have
to check for failure.
2023-06-20 14:11:36 +03:00
Kamil Braun
b38dcba6ed test: pylib: increase checking period for get_alive_endpoints
`server_sees_others` and similar functions periodically call
`get_alive_endpoints`. The period was `.1` seconds; increase it to `.5`
to reduce the log spam (I checked empirically that `.5` is usually how
long it takes in dev mode on my laptop.)
2023-06-20 13:03:46 +02:00
Kamil Braun
279a109ce0 test: add node banning test
Pause one of the nodes and once it's marked as DOWN, remove it from the
cluster.

Check that it is not able to perform queries once it unpauses.
2023-06-20 13:03:46 +02:00
Kamil Braun
ae92932240 test: pylib: manager_client: get_cql() helper 2023-06-20 13:03:46 +02:00
Kamil Braun
e02249f0cd test: pylib: ScyllaCluster: server pause/unpause API 2023-06-20 13:03:46 +02:00
Kamil Braun
63229e48e8 raft topology: ban left nodes 2023-06-20 13:03:46 +02:00
Kamil Braun
737c1b4ae6 raft topology: skip left_token_ring state during removenode
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.

Besides, for removenode we don't really need the `left_token_ring`
state, we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead) and we can ban the node
immediately.

Remove the node from group 0 while in `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort, we're committed to removing the node - so it's fine to remove it
from group 0 at this point).
2023-06-20 13:03:46 +02:00
Kamil Braun
977680773b raft topology: prepare decommission path for node banning
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.

Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.

We also can't make the node leave as soon as it observes the
`left_token_ring` state, which would defeat the purpose of
`left_token_ring` - allowing all nodes to observe that the node has left
the token ring before it shuts down.

We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.

The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shutdown while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.

We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.

We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).

Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from configuration is also another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)

Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
2023-06-20 13:03:46 +02:00
Kamil Braun
b8ddfd9ef9 raft topology: introduce left_token_ring state
We want for the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that node after it has already shut down,
leading to unnecessary failures on the data path (e.g. for CL=ALL writes).

Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.

After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds do
we allow the node to shut down.
2023-06-20 13:03:46 +02:00
Kamil Braun
c94c07804d raft topology: raft_topology_cmd implicit constructor
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
2023-06-20 13:03:46 +02:00
Kamil Braun
8cf47d76a4 messaging_service: implement host banning
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
  be immediately dropped) when receiving the `CLIENT_ID` verb.
2023-06-20 13:03:46 +02:00
Kamil Braun
95c726a8df messaging_service: exchange host IDs and map them to connections
When a node first establishes a connection to another node, it always
sends a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.

Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.

This mapping will be used later when we ban nodes that we remove from
the cluster.
2023-06-20 13:03:46 +02:00
Kamil Braun
87f65d01b8 messaging_service: store the node's host ID 2023-06-20 13:03:46 +02:00
Kamil Braun
a78cc17bd4 messaging_service: don't use parameter defaults in constructor 2023-06-20 13:03:46 +02:00
Kamil Braun
7f3ad6bd25 main: move messaging_service init after system_keyspace init 2023-06-20 13:03:46 +02:00
Kamil Braun
8b152361f4 Merge 'raft topology: fixes after #13884' from Gusev Petr
This PR fixes some problems found after the PR was merged:
  * missed `node_to_work_on` assignment in `handle_topology_transition`;
  * change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since these exceptions can happen during normal operation.
  * `update_fence_version` has been moved after `group0_service.setup_group0_if_exist` in `main.cc`; otherwise we use an uninitialized `token_metadata::version` and get an error.

Fixes: #14303

Closes #14292

* github.com:scylladb/scylladb:
  main.cc: move update_fence_version after group0_service.setup_group0_if_exist
  shared_token_metadata: update_fence_version: on_internal_error -> throw
  storage_service: handle_topology_transition: fix missed node assignment
2023-06-20 13:02:17 +02:00
Avi Kivity
e5ed07e3e1 cql3: select_statement: coroutinize indexed_table_select_statement::do_execute()
Will lead to more readable code after a bit more prettifying. Also
has fewer allocations, though this isn't a hot path.
2023-06-20 13:51:50 +03:00
Aleksandra Martyniuk
8ad6f1f481 test: extend test_compaction_task.py
Extend test_compaction_task.py to test major compaction tasks
covering compaction group compaction.
2023-06-20 12:12:49 +02:00
Aleksandra Martyniuk
648cf4e748 test: use named variable for task tree depth 2023-06-20 12:12:49 +02:00
Aleksandra Martyniuk
74e5b4ebfc compaction: turn major_compaction_task_executor into major_compaction_task_impl
major_compaction_task_executor inherits both from compaction_task_executor
and major_compaction_task_impl.

Thanks to that an executed operation is represented in task manager.
2023-06-20 12:12:49 +02:00
Aleksandra Martyniuk
4922f4cf80 compaction: take gate holder out of task executor
In the following commits, classes deriving from compaction_task_executor
will be alive longer than they are kept in compaction_manager::_tasks.
Thus, the compaction_task_executor::_gate_holder would be held,
blocking other compactions.

compaction_task_executor::_gate_holder is moved outside of
compaction_task_executor object.
2023-06-20 12:12:45 +02:00
Tomasz Grabiec
87b4606cd6 Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from either upsert, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
2023-06-20 12:11:48 +02:00
Petr Gusev
41b950dd21 main.cc: move update_fence_version after group0_service.setup_group0_if_exist
Otherwise, the validation
new_fence_version <= token_metadata::version
inside update_fence_version will use an uninitialized
token_metadata::version == 0
and we will get an error.

The test_topology_ops was improved to
catch this problem.

Fixes: #14303
2023-06-20 13:40:01 +04:00
Petr Gusev
246eaec14e shared_token_metadata: update_fence_version: on_internal_error -> throw
on_internal_error is wrong for fence_version
condition violation, since in case of
topology change coordinator migrating to another
node we can have raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
2023-06-20 13:39:17 +04:00
Botond Dénes
8bfe3ca543 query: move max_result_size to query-request.hh
It is currently located in query_class_config.hh, which is named after a
now defunct struct. This arrangement is unintuitive and there is no
upside to it. The main user of max_result_size is query_command, so
colocate it next to the latter.

Closes #14268
2023-06-20 11:37:50 +02:00
Benny Halevy
26ff8f7bf7 docs: dml: add update ordering section
and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.

Refs scylladb/scylladb#14063

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 11:55:54 +03:00
Aleksandra Martyniuk
e317ffe23a compaction: extend signature of some methods
Extend a signature of table::compact_all_sstables and
compaction_manager::perform_major_compaction so that they get
the info of a covering task.

This allows to easily create child tasks that cover compaction group
compaction.
2023-06-20 10:45:34 +02:00
Aleksandra Martyniuk
ea470316fb tasks: keep shared_ptr to impl in task
Keep seastar::shared_ptr to task::impl instead of std::unique_ptr
in task. Some classes deriving from task::impl may be used outside
task manager context.
2023-06-20 10:45:34 +02:00
Aleksandra Martyniuk
3007fbeee3 compaction: rename compaction_task_executor methods
compaction_task_executor methods are renamed to prevent name
collisions between compaction_task_executor
and tasks::task_manager::task::impl.
2023-06-20 10:45:34 +02:00
Benny Halevy
31a3152a59 cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
Add reproducers for #14182:

test_rewrite_different_values_using_same_timestamp verifies
expiration-based cell reconciliation.

test_rewrite_different_values_using_same_timestamp_and_expiration
is a scylla_only test, verifying that when
two cells with same timestamp and same expiration
are compared, the one with the lesser ttl prevails.

test_rewrite_using_same_timestamp_select_after_expiration
reproduces the specific issue hit in #14182
where a cell is selected after it expires since
it has a lexicographically larger value than
the other cell with later expiration.

test_rewrite_multiple_cells_using_same_timestamp verifies
atomicity of inserts of multiple columns, with a TTL.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Benny Halevy
0aa13f70eb mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
As in compare_atomic_cell_for_merge, we want to consider
the row marker ttl for ordering, in case both are expiring
and have the same expiration time.

This was missed in a57c087c89
and a085ef74ff.

With that in mind, add documentation to compare_row_marker_for_merge
and a mutual note to both functions about their
equivalence.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Benny Halevy
6717e45ff0 atomic_cell: compare_atomic_cell_for_merge: update and add documentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Benny Halevy
761d62cd82 compare_atomic_cell_for_merge: compare value last for live cells
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.

In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.

Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break the tie will lead
to a more consistent result where a set of cells from
only one of the upserts will be selected.

Fixes scylladb/scylladb#14182

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Benny Halevy
ec034b92c0 mutation_test: test_cell_ordering: improve debuggability
Currently, it is hard to tell which of the many sub-cases
fail in this unit test, in case any of them fails.

This change uses logging at debug and trace levels
to help with that, allowing the error to be reproduced
with --logger-log-level testlog=trace.
(The cases are deterministic, so reproducing should not
be a problem.)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-20 10:10:39 +03:00
Botond Dénes
63b395fe70 Merge 'docs: changes subdomain to opensource.docs.scylladb.com' from David Garcia
docs.scylladb.com will point to https://github.com/scylladb/scylladb-docs-homepage

This pull request changes the domain of this repo to opensource.docs.scylladb.com and moves all the redirects to https://github.com/scylladb/scylladb-docs-homepage/blob/main/docs/_utils/redirects.yaml

Closes #14221

* github.com:scylladb/scylladb:
  Update conf.py
  docs: separate homepage
2023-06-20 10:00:40 +03:00
Botond Dénes
9e9636ef15 Merge 'cql3: select_statement: coroutinize and simplify do_execute()' from Avi Kivity
Split off do_execute() into a fast path and slow(ish) path, and
coroutinize the latter.

perf-simple-query shows no change in performance (which is
unsurprising since it picks the fast path which is essentially unchanged).

Closes #14246

* github.com:scylladb/scylladb:
  cql3: select_statement: reindent execute_without_checking_exception_message_aggregate_or_paged()
  cql3: select_statement: coroutinize execute_without_checking_exception_message_aggregate_or_paged()
  cql3: select_statement: split do_execute into fast-path and slow/slower paths
  cql3: select_statement: disambiguate execute() overloads
2023-06-20 08:02:07 +03:00
Kamil Braun
732feca115 storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was caught by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300
2023-06-20 07:56:40 +03:00
Botond Dénes
ddf8547f25 Merge 'Add concurrency control and workload isolation for S3 client' from Pavel Emelyanov
In its current state s3 client uses a single default-configured http client thus making different sched classes' workload compete with each other for sockets to make requests on. There's an attempt to handle that in upload-sink implementation that limits itself with some small number of concurrent PUT requests, but that doesn't help much as many sinks don't share this limit.

This PR makes S3 client maintain a set of http clients, one per sched-group, configures maximum number of TCP connections proportional to group's shares and removes the artificial limit from sinks thus making them share the group's http concurrency limit.

As a side effect, the upload-sink fixes the no-writes-after-flush protection -- if it's violated, write will result in exception, while currently it just hangs on a semaphore forever.

fixes: #13458
fixes: #13320
fixes: #13021

Closes #14187

* github.com:scylladb/scylladb:
  s3/client: Replace sink flush semaphore with gate
  s3/client: Configure different max-connections on http clients
  s3/client: Maintain several http clients on-board
  s3/client: Remove now unused http reference from sink and file
  s3/client: Add make_request() method
2023-06-20 07:09:21 +03:00
Nadav Har'El
7deba4f4a5 test/cql-pytest: add tests reproducing bugs in compression configuration
This patch adds some minimal tests for the "with compression = {..}" table
configuration. These tests reproduce three known bugs:

Refs #6442: Always print all schema parameters (including default values)

  Scylla doesn't return the default chunk_length_in_kb, but Cassandra
  does.

Refs #8948: Cassandra 3.11.10 uses "class" instead of "sstable_compression"
            for compression settings by default

  Cassandra switched, long ago, the "sstable_compression" attribute's
  name to "class". This can break Cassandra applications that create
  tables (where we won't understand the "class" parameter) and applications
  that inquire about the configuration of existing tables. This patch adds
  tests for both problems.

Refs #9933: ALTER TABLE with "chunk_length_kb" (compression) of 1MB caused a
            core dump on all nodes

  Our test for this issue hangs Scylla (or crashes, depending on the test
  environment configuration), when a huge allocation is attempted during
  memtable flush. So this test is marked "skip" instead of xfail.

The tests included here also uncovered a new minor/insignificant bug,
where Scylla allows floating point numbers as chunk_length_in_kb - this
number is truncated to an integer, and allowed, unlike Cassandra or
common sense.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14261
2023-06-20 06:36:13 +03:00
Avi Kivity
792c46c0f8 cql3: expr: simplify evaluate()
Now that all branches in the visitor are uniform and consist
of a single call to do_evaluate() overloads, we can simplify
by calling a lambda template that does just that.
2023-06-20 02:33:10 +03:00
Tomasz Grabiec
5fa08adc88 Merge 'cache_flat_mutation_reader: use the correct schema in prepare_hash' from Michał Chojnowski
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.

However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.

This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.

This series fixes the bug and adds a unit test which reproduces the issue.

Fixes #14110

Closes #14305

* github.com:scylladb/scylladb:
  test: boost/row_cache_test: add a reproducer for #14110
  cache_flat_mutation_reader: use the correct schema in prepare_hash
  mutation: mutation_cleaner: add pause()
2023-06-20 01:30:11 +02:00
Avi Kivity
66e0326385 cql3: expr: standardize evaluate() branches to call do_evaluate()
Extract the various snippets into do_evaluate() overloads. We'll
exploit this in the next patch.
2023-06-20 02:19:33 +03:00
Avi Kivity
b64eeefa35 cql3: expr: rename evaluate(ExpressionElement) to do_evaluate()
evaluate(expression) calls the various evaluate(ExpressionElement)
overloads to perform its work. However, if we add an ExpressionElement
and forget to implement its evaluate() overload, we'll end up in
with infinite recursion. It will be caught immediately, but better to
avoid it.

Also sprinkle static:s on do_evaluate() where missing.
2023-06-20 02:10:18 +03:00
Israel Fruchter
3889e9040c Update tools/cqlsh submodule
* tools/cqlsh 6e1000f1...2254e920 (2):
  > test: add support for testing cloud bundle option
  > Fix cloudconf handling

Closes #14259
2023-06-20 00:10:53 +03:00
Michał Chojnowski
02bcb5d539 test: boost/row_cache_test: add a reproducer for #14110 2023-06-19 22:50:46 +02:00
Michał Chojnowski
d56b0c20f4 cache_flat_mutation_reader: use the correct schema in prepare_hash
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.

However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.

This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.

Fixes #14110
2023-06-19 22:50:43 +02:00
Michał Chojnowski
4f73a28174 mutation: mutation_cleaner: add pause()
In unit tests, we would want to delay the merging of some MVCC
versions to test the transient scenarios with multiple versions present.

In many cases this can be done by holding snapshots to all versions.
But sometimes (i.e. during schema upgrades) versions are added and
scheduled for merge immediately, without a window for the test to
grab a snapshot to the new version.

This patch adds a pause() method to mutation_cleaner, which ensures
that no asynchronous/implicit MVCC version merges happen within
the scope of the call.

This functionality will be used by a test added in an upcoming patch.
2023-06-19 22:50:43 +02:00
David Garcia
43bb19ce62 Merge branch 'master' into separate-homepage 2023-06-19 14:02:32 +01:00
David Garcia
73806de8b0 Update conf.py 2023-06-19 14:01:04 +01:00
Nadav Har'El
a66c407bf1 Merge 'scylla-sstable: add scrub operation' from Botond Dénes
Exposing scrub compaction to the command-line. Allows for offline scrub of sstables, in cases where online scrubbing (via scylla itself) is not possible or not desired. One such case recently was an sstable from a backup which turned out to be corrupt, `nodetool refresh --load-and-stream` refusing to load it.

Fixes: #14203

Closes #14260

* github.com:scylladb/scylladb:
  docs/operating-scylla/admin-tools: scylla-sstable: document scrub operation
  test/cql-pytest: test_tools.py: add test for scylla sstable scrub
  tools/scylla-sstable: add scrub operation
  tools/scylla-sstable: write operation: add none to valid validation levels
  tools/scylla-sstable: handle errors thrown by the operation
  test/cql-pytest: add option to omit scylla's output from the test output
  tools/scylla-sstable: s/option/operation_option/
  tool/scylla-sstable: add missing comments
2023-06-19 15:40:51 +03:00
Nadav Har'El
25bbc424c3 Merge 'test_using_timestamp: update expected errors' from Benny Halevy
This mini-series updates the expected errors in `test/cql-pytest/test-timestamp.py`
to the ones changed in b7bbcdd178.
Then, it renames the test to `test_using_timestamp.py` so that it runs
automatically with `test.py`.

Closes #14293

* github.com:scylladb/scylladb:
  cql-pytest: rename test-timestamp.py to test_using_timestamp.py
  cql-pytest: test-timestamp: test_key_writetime: update expected errors
2023-06-19 15:12:10 +03:00
Avi Kivity
1c6c7992e4 Revert "build: cmake: use -O0 for debug build"
This reverts commit 8a54e478ba. As
commit 7dadd38161 ("Revert "configure: Switch debug build from
-O0 to -Og") was reverted (by b7627085cb, "Revert "Revert
"configure: Switch debug build from -O0 to -Og"""), we do the
same to cmake to keep the two build systems in sync.

Closes #14286
2023-06-19 14:31:28 +03:00
Botond Dénes
bd7a3e5871 Merge 'Sanitize sstables-making utils in tests' from Pavel Emelyanov
There are tons of wrappers that help test cases make sstables for their needs, and lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans up some of that.

Closes #14280

* github.com:scylladb/scylladb:
  test/utils: Generalize making memtable from vector<mutation>
  test/util: Generalize make_sstable_easy()-s
  test/sstable_mutation: Remove useless helper
  test/sstable_mutation: Make writer config in make_sstable_mutation_source()
  test/utils: De-duplicate make_sstable_containing-s
  test/sstable_compaction: Remove useless one-line local lambda
  test/sstable_compaction: Simplify sstable making
  test/sstables*: Make sstable from vector of mutations
  test/mutation_reader: Remove create_sstable() helper from test
2023-06-19 14:05:29 +03:00
Pavel Emelyanov
6bec03f96f test: Remove sstable_utils' storage_prefix() helper
It's excessive; a test case that needs it can get the storage prefix
without this fancy wrapper-helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14273
2023-06-19 13:51:04 +03:00
Pavel Emelyanov
1a332ef5e2 test: Check sstable bytes correctness on S3 too
Commit 4e205650 (test: Verify correctness of sstable::bytes_on_disk())
added a test to verify that sstable::bytes_on_disk() is equal to the
real size of real files. The same test case makes sense for S3-backed
sstables as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14272
2023-06-19 13:47:31 +03:00
Piotr Dulikowski
0bd8b7c663 test/topology/test_cluster_features: workaround for the python driver not reconnecting after full cluster restart in test_downgrade_after_successful_upgrade_fails
Followup to 9bfa63fe37. Like in
`test_downgrade_after_successful_upgrade_fails`, the test
`test_joining_old_node_fails` also restarts all nodes at once and is prone
to a bug in the Python driver which can prevent the session from
reconnecting to any of the nodes. This commit applies the same
workaround to the other test (manual reconnect by recreating the Python
driver session).

Closes #14291
2023-06-19 12:38:23 +02:00
Anna Stuchlik
5dbf169068 doc: remove the rpm-info file (What is in each RPM) from the installation section 2023-06-19 12:37:57 +02:00
Kamil Braun
aa2ccb3ac4 Merge 'raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs' from Mikołaj Grzebieluch
Another node can stop after it joined the group0 but before it
advertised itself in gossip. `get_inet_addrs` will try to resolve all
IPs and `wait_for_peers_to_enter_synchronize_state` will loop
indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one
of the nodes confirms that the upgrade procedure has finished. For that,
it doesn't need the IPs of all group 0 members - only the IP of some
nodes which can do the confirmation.

This PR restructures the code so that the IPs of nodes are resolved inside
the `max_concurrent_for_each` that
`wait_for_peers_to_enter_synchronize_state` performs. Then, even if some
IPs can't be resolved, as long as one of the nodes confirms a successful
upgrade, we can continue.

Fixes #13543

Closes #14046

* github.com:scylladb/scylladb:
  raft topology: test: check if aborted node replacing blocks bootstrap
  raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs
2023-06-19 12:31:27 +02:00
Benny Halevy
b0bcad0c91 cql-pytest: rename test-timestamp.py to test_using_timestamp.py
1. Otherwise test.py doesn't recognize it.
2. As it represents what the test does in a better way.
3. Following `test_using_timeout.py` naming convention.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-19 13:26:24 +03:00
Benny Halevy
19208c42dc cql-pytest: test-timestamp: test_key_writetime: update expected errors
The error messages were changed in
b7bbcdd178.

Extend the `match` regular expression param
to pytest.raises to include both old and new message
to remain backward compatible also with Cassandra,
as this test is run against both Cassandra and Scylla.

Note that the test didn't run automatically
since it's named `test-timestamp.py` and test.py
looks up only test scripts beginning with `test_`.
The test will be renamed in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-19 13:25:13 +03:00
Anna Stuchlik
77ebc18730 Merge branch 'master' into anna-install-cloud-v2 2023-06-19 12:09:31 +02:00
Anna Stuchlik
d0bae532bd doc: move cloud deployment instruction to docs -v2
This is V2 of https://github.com/scylladb/scylladb/pull/14108

This commit moves the installation instruction for the cloud from the [website](https://www.scylladb.com/download/) to the docs.

The scope:

* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updated the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Added new bookmarks in other files to create stable links, for example, ".. _networking-ports:"
* Moved common files to the new "installation-common" directory. This step is required to exclude the open source-only files
in the Enterprise repository.

In addition:
- The Configuration Reference file was moved out of the installation
  section (it's not about installation at all)
- The links to creating a cluster were removed from the installation
page (as not related).

Related: https://github.com/scylladb/scylla-docs/issues/4091
2023-06-19 12:06:28 +02:00
Nadav Har'El
ac3d0d4460 Merge 'cql3: expr: support evaluate(column_mutation_attribute)' from Avi Kivity
In preparation for converting selectors to evaluate expressions,
add support for evaluating column_mutation_attribute (representing
the WRITETIME/TTL pseudo-functions).

A unit test is added.

Fixes #12906

Closes #14287

* github.com:scylladb/scylladb:
  test: expr: test evaluation of column_mutation_attribute
  test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
  cql3: expr: evaluate() column_mutation_attribute
2023-06-19 11:11:49 +03:00
Petr Gusev
1770feebda storage_service: handle_topology_transition: fix missed node assignment
This defect remained after the refactoring of
exec_global_command in #13884.
2023-06-19 11:26:57 +04:00
Botond Dénes
562087beff Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.

Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.

Refs: #14283
2023-06-19 08:49:27 +03:00
Avi Kivity
135efa3360 Merge 'Simplify system_keyspace initialization' from Kamil Braun
Initialization of `system_keyspace` is now done in a single place instead of
being spread out through the entire procedure. `system_keyspace` is also
available for queries much earlier which allows, for example, to load our Host
ID before we initialize any of the distributed services (like gossiper,
messaging_service etc.) This is doable because `query_processor` is now
available early. A couple of FIXMEs have been resolved.

Refs: #14202

Closes #14285

* github.com:scylladb/scylladb:
  main, cql_test_env: simplify `system_keyspace` initialization
  db: system_keyspace: take simpler service references in `make`
  db: system_keyspace: call `initialize_virtual_tables` from `main`
  db: system_keyspace: refactor virtual tables creation
  db: system_keyspace: remove `system_keyspace_make`
  db: system_keyspace: refactor local system table creation code
  replica: database: remove `is_bootstrap` argument from create_keyspace
  replica: database: write a comment for `parse_system_tables`
  replica: database: remove redundant `keyspace::get_erm_factory()` getter
  db: system_keyspace: don't take `sharded<>` references
2023-06-18 23:48:46 +03:00
Avi Kivity
0f98e9f8c8 test: expr: test evaluation of column_mutation_attribute
There's no way to evaluate a column_mutation_attribute via CQL
yet (the only user uses old-style cql3::selection::selector), so
we only supply a unit test.
2023-06-18 22:47:46 +03:00
Avi Kivity
5e2fd0bbaf test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
While remaining backwards compatible, allow supplying custom timestamp/ttl
with each fake column value.

Note: I tried to use a formatter<> for the new data structure, but
got entangled in a template loop.
2023-06-18 22:45:25 +03:00
Avi Kivity
7090f4c43b cql3: expr: evaluate() column_mutation_attribute
Enhance evaluation_inputs with timestamps and ttls, and use
them to evaluate writetime/ttl.

The data structure is compatible with the current way of doing
things (see result_set_builder::_timestamps, result_set_builder::_ttls).
We use std::span<> instead of std::vector<> as it is more general
and a tiny bit faster.

The algorithm is taken from writetime_or_ttl_selector::add_input().
2023-06-18 22:41:09 +03:00
Avi Kivity
3b3f28fc12 test.py: report CPU utilization
Low CPU utilization is a major contributor to high test time.
Low CPU utilization can happen due to tests sleeping, or lack
of concurrency due to Amdahl's law.

Utilization is computed by dividing the utilized CPU by the available
CPU (CPU count times wall time).

Example output:

Found 134 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[134/134]   boost     dev   [ PASS ] boost.json_cql_query_test.test_unpack_decimal.1
------------------------------------------------------------------------------
CPU utilization: 4.8%

Closes #14251
2023-06-18 19:33:02 +03:00
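The arithmetic behind the commit above can be sketched as follows; this is an illustrative sketch, not test.py's actual code, and the function name is made up:

```python
def cpu_utilization(utilized_cpu_seconds: float, cpu_count: int,
                    wall_seconds: float) -> float:
    """Utilization = utilized CPU time divided by available CPU time,
    where available CPU time is CPU count times wall-clock time."""
    available = cpu_count * wall_seconds
    return utilized_cpu_seconds / available

# e.g. 23 CPU-seconds consumed on an 8-CPU machine over a 60-second run:
print(f"CPU utilization: {cpu_utilization(23.0, 8, 60.0):.1%}")
```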
Michał Chojnowski
db0871a644 test: test_keyspace: add a test checking that ALTER KEYSPACE preserves UDTs
Reproduces #14139

Closes #14144
2023-06-18 16:50:39 +03:00
Kamil Braun
028183c793 main, cql_test_env: simplify system_keyspace initialization
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
2023-06-18 13:39:27 +02:00
Kamil Braun
33c19baabc db: system_keyspace: take simpler service references in make
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.

This will allow us to move the `make` step right after starting
`system_keyspace`.
2023-06-18 13:39:27 +02:00
Kamil Braun
b34605d161 db: system_keyspace: call initialize_virtual_tables from main
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).

Call it from `main`/`cql_test_env` instead.

Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.

None of these steps apply to virtual tables.

There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.

Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
2023-06-18 13:39:27 +02:00
Kamil Braun
c931d9327d db: system_keyspace: refactor virtual tables creation
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.

This will allow, in later commit, to make `system_keyspace`
initialization independent of services used for distributed
communication such as `gossiper`. See further commits for details.
2023-06-18 13:39:27 +02:00
Kamil Braun
035045c288 db: system_keyspace: remove system_keyspace_make
The code can now be inlined in `system_keyspace::make` as we no longer
access private members of `database`.
2023-06-18 13:39:27 +02:00
Kamil Braun
cf120e46b8 db: system_keyspace: refactor local system table creation code
`system_keyspace_make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directory for `system` and
`system_schema`).

Extract this part into `database::create_local_system_table`.

Make `database::add_column_family` private.
2023-06-18 13:39:27 +02:00
Kamil Braun
3f04a5956c replica: database: remove is_bootstrap argument from create_keyspace
Unused.
2023-06-18 13:39:27 +02:00
Kamil Braun
8848c3b809 replica: database: write a comment for parse_system_tables 2023-06-18 13:39:27 +02:00
Kamil Braun
4ca149c1f0 replica: database: remove redundant keyspace::get_erm_factory() getter
`keyspace` can simply access its private field.
2023-06-18 13:39:27 +02:00
Kamil Braun
53cf646103 db: system_keyspace: don't take sharded<> references
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on were also already started.

Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
2023-06-18 13:39:26 +02:00
Nadav Har'El
97d444bbf7 Merge 'cql3/expression: implement evaluate(field_selection) ' from Jan Ciołek
Implement `expr::evaluate()` for `expr::field_selection`.

`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`.

Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it.

Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar, as well as to query planning. Currently, `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`.
Nonetheless, it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906)

Fixes: https://github.com/scylladb/scylladb/issues/12906

Closes #14235

* github.com:scylladb/scylladb:
  boost/expr_test: test evaluate(field_selection)
  cql3/expr: fix printing of field_selection
  cql3/expression: implement evaluate(field_selection)
  types/user: modify idx_of_field to use bytes_view
  column_identifer: add column_identifier_raw::text()
  types: add read_nth_user_type_field()
  types: add read_nth_tuple_element()
2023-06-18 11:08:25 +03:00
Avi Kivity
b7627085cb Revert "Revert "configure: Switch debug build from -O0 to -Og""
This reverts commit 7dadd38161.

The latest revert cited debuggability trumping performance, but the
performance loss is su huge here that debug builds are unusable and
next promotions time out.

In the interest of progress, pick the lesser of two evils.
2023-06-17 15:20:26 +03:00
Pavel Emelyanov
15ac192cc2 test/utils: Generalize making memtable from vector<mutation>
Both make_sstable_easy() and make_sstable_containing() prepare a memtable
by allocating it and applying mutations from a vector. Make a local
helper. Many test cases could probably benefit from it too, but they
often do more stuff before applying mutations to the memtable, so this is
left for future patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:24:24 +03:00
Pavel Emelyanov
2badad1b15 test/util: Generalize make_sstable_easy()-s
There are two of them, one making sstable from memtable and the other
one doing the same from a custom reader. The former can just call the
latter with memtable's flat reader

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:23:46 +03:00
Pavel Emelyanov
85310bc043 test/sstable_mutation: Remove useless helper
There are two make_sstable_mutation_source() helpers that call one
another and test cases only need one of them, so leave just one that's
in use.

Also don't pass env's tempdir to make_sstable() util call, it can get
env's tempdir on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:21:40 +03:00
Pavel Emelyanov
4a7be304ac test/sstable_mutation: Make writer config in make_sstable_mutation_source()
These local helpers accept a writer config which is made the same way by
all callers, so the helpers can make it on their own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:20:50 +03:00
Pavel Emelyanov
6fe7476ba9 test/utils: De-duplicate make_sstable_containing-s
The function that prepares a memtable from a mutations vector can call
its overload that writes this memtable into an sstable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:19:55 +03:00
Pavel Emelyanov
753b674c31 test/sstable_compaction: Remove useless one-line local lambda
The get_usable_sst() wrapper lambda is not needed; calling
make_sstable_containing() directly is shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:19:15 +03:00
Pavel Emelyanov
5b46993438 test/sstable_compaction: Simplify sstable making
There's a temporary memtable and an on-stack lambda that makes the
mutation. Both are overkill; make_sstable_containing() can work on just
a plain on-stack-constructed mutation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:18:13 +03:00
Pavel Emelyanov
ce29f41436 test/sstables*: Make sstable from vector of mutations
There are many cases that want to call make_sstable_containing() with
a vector of mutations at hand. For that they apply it to a temporary
memtable, but the sstable utils can work with the mutations vector as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:17:12 +03:00
Pavel Emelyanov
c2eb3e2c4c test/mutation_reader: Remove create_sstable() helper from test
It's a one-liner wrapper; the caller can get the same result with
existing utils facilities.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-16 21:16:34 +03:00
Kamil Braun
9bfa63fe37 Merge 'test/topology/test_cluster_features: workaround for python driver not reconnecting after full cluster restart' from Piotr Dulikowski
The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes and then restarts. Apparently, the python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes managed to start successfully.

More context can be found on the python driver issue.

This PR works around this issue by using the existing `reconnect_driver` function (which is a workaround for a _different_ python driver issue already) to help the driver reconnect after the full cluster restart.

Refs: scylladb/python-driver#230

Closes #14276

* github.com:scylladb/scylladb:
  tests/topology: work around python driver issue in cluster feature tests
  test/topology{_raft_disabled}: move reconnect_driver to topology utils
2023-06-16 16:54:58 +02:00
Pavel Emelyanov
900c609269 Merge 'Initialize query_processor early, without messaging_service or gossiper' from Kamil Braun
In https://github.com/scylladb/scylladb/pull/14231 we split `storage_proxy` initialization into two phases: for local and remote parts. Here we do the same with `query_processor`. This allows performing queries for local tables early in the Scylla startup procedure, before we initialize services used for cluster communication such as `messaging_service` or `gossiper`.

Fixes: #14202

As a follow-up we will simplify `system_keyspace` initialization, making it available earlier as well.

Closes #14256

* github.com:scylladb/scylladb:
  main, cql_test_env: start `query_processor` early
  cql3: query_processor: split `remote` initialization step
  cql3: query_processor: move `migration_manager&`, `forwarder&`, `group0_client&` to a `remote` object
  cql3: query_processor: make `forwarder()` private
  cql3: query_processor: make `get_group0_client()` private
  cql3: strongly_consistent_modification_statement: fix indentation
  cql3: query_processor: make `get_migration_manager` private
  tracing: remove `qp.get_migration_manager()` calls
  table_helper: remove `qp.get_migration_manager()` calls
  thrift: handler: move implementation of `execute_schema_command` to `query_processor`
  data_dictionary: add `get_version`
  cql3: statements: schema_altering_statement: move `execute0` to `query_processor`
  cql3: statements: pass `migration_manager&` explicitly to `prepare_schema_mutations`
  main: add missing `supervisor::notify` message
2023-06-16 17:41:08 +03:00
Kamil Braun
23d5ddbecb Merge 'storage_service: remove optimization in cleanup_group0_config_if_needed' from Piotr Dulikowski
The `topology_coordinator::cleanup_group0_config_if_needed` function first checks whether the number of group 0 members is larger than the number of non-left entries in the topology table, then attempts to remove nodes in left state from group 0 and prints a warning if no such nodes are found. There are some problems with this check:

- Currently, a node is added to group 0 before it inserts its entry to the topology table. Such a node may cause the check to succeed but no nodes will be removed, which will cause the warning to be printed needlessly.
- Cluster features on raft will reverse the situation and it will be possible for an entry in system.topology to exist without the corresponding node being a part of group 0. This, in turn, may cause the check not to pass when it should and nodes could be removed later than necessary.

This commit gets rid of the optimization and the warning, and the topology coordinator will always compute the set of nodes that should be removed. Additionally, the set of nodes to remove is now computed differently: instead of iterating over left nodes and including only those that are in group 0, we now iterate over group 0 members and include those that are in `left` state. As the number of left nodes can potentially grow unbounded and the number of group 0 members is more likely to be bounded, this should give better performance in long-running clusters.

Closes #14238

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous commit
  storage_service: remove optimization in cleanup_group0_config_if_needed
2023-06-16 15:59:32 +02:00
Piotr Dulikowski
fadb1351bd tests/topology: work around python driver issue in cluster feature tests
The test `test_downgrade_after_successful_upgrade_fails` stops all
nodes, reconfigures them to support the test-only feature and restarts
them. Unfortunately, it looks like the python driver sometimes does not
handle this properly and might not reconnect after all nodes are shut
down.

This commit adds a workaround for scylladb/python-driver#230 - the test
re-creates python driver session right after nodes are restarted.
2023-06-16 15:25:02 +02:00
Piotr Dulikowski
b3771e6011 test/topology{_raft_disabled}: move reconnect_driver to topology utils
The `reconnect_driver` function will be useful outside the
`topology_raft_disabled` test suite - namely, for cluster feature tests
in `topology`. The best course of action for this function would be to
put it into pylib utils; however, the function depends on ManagerClient
which is defined in `test.pylib.manager_client` that depends on
`test.pylib.utils` - therefore we cannot put it there as it would cause
an import cycle. The `topology.utils` module sounds like the next best
thing.

In addition, the docstring comment is updated to reflect that this
function will now be used to work around another issue as well.
2023-06-16 15:25:02 +02:00
Kamil Braun
9f9f4c224b main, cql_test_env: start query_processor early
Start it right after `storage_proxy`.

We also need to start `cql_config` earlier
because `query_processor` uses it.
2023-06-16 14:29:59 +02:00
Kamil Braun
c212370cf1 cql3: query_processor: split remote initialization step
Pass `migration_manager&`, `forward_service&` and `raft_group0_client&`
in the remote init step which happens after the constructor.

Add a corresponding uninit remote step.
Make sure that any use of the `remote` services is finished before we
destroy the `remote` object by using a gate.

Thanks to this in a later commit we'll be able to move the construction
of `query_processor` earlier in the Scylla initialization procedure.
2023-06-16 14:29:59 +02:00
Kamil Braun
ec5b831c13 cql3: query_processor: move migration_manager&, forwarder&, group0_client& to a remote object
These services are used for performing distributed queries, which
require remote calls. As a preparation for 2-phase initialization of
`query_processor` (for local queries vs for distributed queries), move
them to a separate `remote` object which will be constructed in the
second phase.

Replace the getters for the different services with a single `remote()`
getter. Once we split the initialization into two phases, `remote()`
will include a safety protection.
2023-06-16 14:08:21 +02:00
Kamil Braun
c2fa6406ad cql3: query_processor: make forwarder() private 2023-06-16 13:45:59 +02:00
Kamil Braun
f616408a87 cql3: query_processor: make get_group0_client() private 2023-06-16 13:45:19 +02:00
Kamil Braun
db769c8eb3 cql3: strongly_consistent_modification_statement: fix indentation 2023-06-16 13:44:59 +02:00
Kamil Braun
2e441e17cf cql3: query_processor: make get_migration_manager private
After previous commits it's no longer used outside `query_processor`.
Also remove the `const` version - not needed for anything.

Use the getter instead of directly accessing `_mm` in `query_processor`
methods. Later we will put `_mm` in a separate object.
2023-06-16 13:44:14 +02:00
Piotr Dulikowski
dcd520f6cf db/system_keyspace: add storage for cluster features managed in group 0
The `system.topology` table is extended with two new columns that will
be used to manage cluster features:

- `supported_features set<text>` is a new clustering column which holds
  the features that a given node advertises as supported. It will be first
  initialized when the node joins the cluster, and then updated every
  time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the
  features that are considered enabled by the cluster. Unlike in the
  current gossip-based implementation, features will not be enabled
  implicitly when all nodes support them, but rather via an explicit
  action of the topology coordinator.
2023-06-16 13:19:53 +02:00
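As an illustration of the data model above, the coordinator's decision input can be sketched as a pure function over the two columns; names are hypothetical, and in the actual design enabling happens only through an explicit group 0 action, never implicitly:

```python
def enableable_features(supported_by_node, enabled_features):
    """Features supported by every node but not yet enabled.

    supported_by_node maps node id -> set of feature names
    (the per-node `supported_features` clustering column);
    enabled_features is the static `enabled_features` column.
    """
    if not supported_by_node:
        return set()
    common = set.intersection(*supported_by_node.values())
    return common - enabled_features

# "UDA" is missing on n2, and "RAFT" is already enabled:
assert enableable_features(
    {"n1": {"RAFT", "UDA"}, "n2": {"RAFT"}},
    {"RAFT"},
) == set()
```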
Botond Dénes
e92b71c451 docs/operating-scylla/admin-tools: scylla-sstable: document scrub operation 2023-06-16 06:20:14 -04:00
Botond Dénes
19708d39ae test/cql-pytest: test_tools.py: add test for scylla sstable scrub
The tests are meant to exercise the command line interface and the
plumbing, not the scrub logic itself; we have dedicated tests for that.
2023-06-16 06:20:14 -04:00
Botond Dénes
c294f2480c tools/scylla-sstable: add scrub operation
Exposing scrub compaction to the command-line. Scrubbed sstables are
written into a directory specified by the `--output-directory` command
line parameter. This directory is expected to be empty, to avoid
clashes with any pre-existing sstables. This can be overridden by the
user if they wish.
2023-06-16 06:20:14 -04:00
Botond Dénes
84aeb21297 tools/scylla-sstable: write operation: add none to valid validation levels
This validation level was added recently, but scylla sstable write
didn't know about it yet; fix that.
2023-06-16 06:20:14 -04:00
Botond Dénes
34f1827ffc tools/scylla-sstable: handle errors thrown by the operation
Instead of letting the runtime catch them. Also, make sure all exceptions
thrown due to bad arguments are instances of `std::invalid_argument`;
these are now reported differently from other, runtime errors.
Remove the now extraneous `error:` prefix from all exception messages.
2023-06-16 06:20:14 -04:00
Botond Dénes
e32fdcba06 test/cql-pytest: add option to omit scylla's output from the test output
Scylla's output is often unnecessary to debug a failed test, or even
detrimental because one has to scroll back in the terminal after each
test run, to see the actual test's output. Add an option,
--omit-scylla-output, which when present on the command line of `run`,
the output of scylla will be omitted from the test output.
Also, to help discover this option (and others), don't run the tests
when either -h or --help is present on the command line. Just invoke
pytest (with said option) and exit.
2023-06-16 06:20:14 -04:00
Botond Dénes
21d9fbe875 tools/scylla-sstable: s/option/operation_option/
A future include will bring in a type with a similar name, resulting in
a name clash. Avoid by renaming to something more specific.
2023-06-16 06:20:14 -04:00
Botond Dénes
f31bf152aa tool/scylla-sstable: add missing comments
Separating entries in the operation list (pretty hard to visually
separate without comments).
2023-06-16 06:20:14 -04:00
Tomasz Grabiec
e41ff4604d Merge 'raft_topology: fencing and global_token_metadata_barrier' from Gusev Petr
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).

* the topology version (int64) was introduced; it's stored in the topology table and updated through Raft at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster; if some node is unavailable, we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; once this round of RPCs has finished, the topology change coordinator can be sure that there are no in-flight requests using previous versions and that such requests can't appear in the future.
* after `barrier_and_drain`, the topology change coordinator issues the `fence` command, which stores the current version in a local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with an older version was started before the fence, its reply will also be fenced.
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from nodes which were unavailable during the topology change and which were not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for the `mutation` and `read` RPCs; other RPCs will be handled in follow-ups; since currently all nodes are supposed to be alive, the missing parts of the fencing don't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will be also added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present.
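The version check at the heart of the fencing can be sketched in a few lines (a minimal Python model using the PR's terminology — `apply_fence`, `fence_version`, `stale_topology_exception`; the real implementation is C++ inside storage_proxy):

```python
class StaleTopologyException(Exception):
    """Raised when an RPC carries a fencing token older than the local fence."""

class Replica:
    def __init__(self):
        self.fence_version = 0  # bumped by the 'fence' raft command

    def apply_fence(self, fencing_token):
        # fencing_token is None for RPCs from old nodes (rpc::optional):
        # skip the check to stay compatible.
        if fencing_token is not None and fencing_token < self.fence_version:
            raise StaleTopologyException(
                f"token {fencing_token} < fence version {self.fence_version}")

r = Replica()
r.apply_fence(None)   # old client: check skipped
r.fence_version = 3
r.apply_fence(3)      # current version: accepted
```

A request carrying version 2 against `fence_version == 3` would raise; since the reply path is fenced too, a request started before the fence gets rejected on the way back.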

Closes #13884

* github.com:scylladb/scylladb:
  storage_service: warn if can't find ip for server
  storage_proxy.cc: add and use global_token_metadata_barrier
  storage_service: exec_global_command: bool result -> exceptions
  raft_topology: add cmd_index to raft commands
  storage_proxy.cc: add fencing to read RPCs
  storage_proxy.cc: extract handle_read
  storage_proxy.cc: refactor encode_replica_exception_for_rpc
  storage_proxy: fix indentation
  storage_proxy: add fencing for mutation
  storage_servie: fix indentation
  storage_proxy: add fencing_token and related infrastructure
  raft topology: add fence_version
  raft_topology: add barrier_and_drain cmd
  token_metadata: add topology version
2023-06-16 12:07:31 +02:00
Mikołaj Grzebieluch
fa76d6bd64 raft topology: test: check if aborted node replacing blocks bootstrap
Scenario:
1. Start a cluster with nodes node1, node2, node3
2. Start node4 replacing node node2
3. Stop node node4 after it joined group0 but before it advertised itself in gossip
4. Start node5 replacing node node2

Test simulates the behavior described in #13543.

Test passes only if `wait_for_peers_to_enter_synchronize_state` doesn't need to
resolve all IPs to return early. If not, node5 will hang trying to resolve the
IP of node4:
```
raft_group0_upgrade - : failed to resolve IP addresses of some of the cluster members ([node4's host ID])
```
2023-06-16 11:09:19 +02:00
Pavel Emelyanov
5412c7947a backlog_controller: Unwrap scheduling_group
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what sched group is can be relaxed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14266
2023-06-16 12:02:14 +03:00
Michał Chojnowski
3cf15e6ad7 test: perf: memory_footprint_test: don't use obsolete sstable versions
memory_footprint_test fails with:
`sstable - writing sstables with too old format`
because it attempts to write obsolete sstable formats,
for which the writer code has been long removed.

Fix that.

Closes #14265
2023-06-16 11:58:26 +03:00
Kefu Chai
f6c24c9b70 repair: set repair state correctly
repair_node_state::state is only for debugging purposes; see
ab57cea783 which introduced it.
so this change does not impact the behavior of scylla, but it can
improve the debugging experience by reflecting the state of repair
more accurately when we are actually inspecting it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14255
2023-06-16 11:16:59 +03:00
Jan Ciolek
d6728a7eb5 boost/expr_test: test evaluate(field_selection)
Add a unit test which tests evaluating field selections.

Alas, at the moment it's impossible to add a cql-pytest,
as the grammar and query planning don't handle field
selections inside the WHERE clause.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:21:02 +02:00
Jan Ciolek
ee660f2d61 cql3/expr: fix printing of field_selection
expression printing has two modes: debug and user.
The user mode should output standard CQL that can be
parsed back to an expression.
In debug mode there can be some additional information
that helps with debugging stuff.

The code for printing `field_selection` didn't distinguish
between user mode and debug mode. It just always printed
in debug mode, with extra parentheses around the field selection.

Let's change it so that it emits valid CQL in user mode.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:21:02 +02:00
Jan Ciolek
f79f3ea3ae cql3/expression: implement evaluate(field_selection)
Implement expr::evaluate() for expr::field_selection.

`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as
a field_selection, which selects the field 'a'.

Evaluating such an expression consists of finding the
right element's value in a serialized UDT value
and returning it.

Fixes: https://github.com/scylladb/scylladb/issues/12906

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:21:00 +02:00
Jan Ciolek
464437ef90 types/user: modify idx_of_field to use bytes_view
Let's change the argument type from `bytes`
to `bytes_view`. Sometimes it's possible to get
an instance of `bytes_view`, but getting `bytes`
would require a copy, which is wasteful.

`bytes_view` lets us avoid those copies.
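As an aside, the same borrow-instead-of-copy idea can be illustrated with Python's `memoryview` (an analogy only — the commit itself changes a C++ `bytes`/`bytes_view` signature):

```python
data = bytes(range(256))

view = memoryview(data)[10:20]   # a view: no bytes are copied
copy = data[10:20]               # a bytes slice: allocates a new buffer

assert bytes(view) == copy       # same contents either way
assert view.obj is data          # but the view still references the original
```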

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:11:31 +02:00
Jan Ciolek
d8d5442db8 column_identifer: add column_identifier_raw::text()
I would like to be able to get a reference to the
string inside `column_identifier_raw`, but there was
no such function. There was only `to_string()`, which
copies the entire string, which is wasteful.

Let's add the method `text()`, which returns a reference
instead of a copy. `column_identifier` already has
such a method, so `column_identifier_raw` can have one
as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:11:30 +02:00
Jan Ciolek
ab1ba497b5 types: add read_nth_user_type_field()
Add a function which can be used to read the nth
field of a serialized UDT value.

We could deserialize the whole value and then choose
one of the deserialized fields, but that would be wasteful.
Sometimes we only need the value of one field, not all of them.
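A sketch of what such a function might do, assuming the usual UDT serialization layout (each field encoded as a signed 32-bit big-endian length followed by that many bytes, with -1 marking null); this is an illustrative Python model, not the actual C++ code:

```python
import struct

def read_nth_user_type_field(serialized: bytes, n: int):
    """Return the raw bytes of field n (None if null), skipping over
    earlier fields without deserializing them."""
    pos = 0
    for i in range(n + 1):
        (length,) = struct.unpack_from('>i', serialized, pos)
        pos += 4
        if i == n:
            return None if length < 0 else serialized[pos:pos + length]
        if length > 0:
            pos += length  # skip the field's payload

# my_type (a int, b int) with a=1, b=2
udt = (struct.pack('>i', 4) + struct.pack('>i', 1) +
       struct.pack('>i', 4) + struct.pack('>i', 2))
assert read_nth_user_type_field(udt, 1) == struct.pack('>i', 2)
```

The skip loop only reads the length prefixes, which is what makes this cheaper than deserializing the whole value.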

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-16 01:11:30 +02:00
Mikołaj Grzebieluch
a45e0765e4 raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some node which
can provide the confirmation.

This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs cannot be resolved, as long as one of the nodes
confirms a successful upgrade, we can continue.

Fixes #13543
2023-06-15 16:28:15 +02:00
Nadav Har'El
e1513f1199 Merge 'cql3: prepare selectors' from Avi Kivity
CQL statements carry expressions in many contexts: the SELECT, WHERE, SET, and IF clauses, plus various attributes. Previously, each of these contexts had its own representation for an expression, and another one for the same expression but before preparation. We have been gradually moving towards a uniform representation of expressions.

This series tackles SELECT clause elements (selectors), in their unprepared phase. It's relatively simple since there are only five types of expression components (column references, writetime/ttl modifiers, function calls, casts, and field selections). Nevertheless, there isn't much commonality with previously converted expression elements so quite a lot of code is involved.

After the series, we are still left with a custom post-prepare representation of expressions. It's quite complicated since it deals with two passes, for aggregation, so it will be left for another series.

Closes #14219

* github.com:scylladb/scylladb:
  cql3: seletor: drop inheritance from assignment_testable
  cql3: selection: rely on prepared expressions
  cql3: selection: prepare selector expressions
  cql3: expr: match counter arguments to function parameters expecting bigint
  cql3: expr: avoid function constant-folding if a thread is needed
  cql3: add optional type annotation to assignment_testable
  cql3: expr: wire unresolved_identifier to test_assignment()
  cql3: expr: support preparing column_mutation_attribute
  cql3: expr: support preparing SQL-style casts
  cql3: expr: support preparing field_selection expressions
  cql3: expr: make the two styles of cast expressions explicit
  cql3: error injection functions: mark enabled_injections() as impure
  cql3: eliminate dynamic_cast<selector> from functions::get()
  cql3: test_assignment: pass optional schema everywhere
  cql3: expr: prepare_expr(): allow aggregate functions
  cql3: add checks for aggregation functions after prepare
  cql3: expr: add verify_no_aggregate_functions() helper
  test: add regression test for rejection of aggregates in the WHERE clause
  cql3: expr: extract column_mutation_attribute_type
  cql3: expr: add fmt formatter for column_mutation_attribute_kind
  cql3: statements: select_statement: reuse to_selectable() computation in SELECT JSON
2023-06-15 15:59:41 +03:00
Kefu Chai
befc78274b install.sh: pass -version to java executable
currently, although we are moving from Java-8 to
Java-11, we still support both Java versions, and the
docker image used for testing the Datastax driver has
not been updated to install java-11.

the "java" executable provided by openjdk-java-8 does
not support the "--version" command line argument; java-11
accepts both "-version" and "--version". so, to cater to
the needs of the outdated docker image, we pass
"-version" to the selected java, so the test passes
if java-8 is found. a better fix is to update the
docker image to install java-11, though.

the output of "java -version" and "java --version" is
attached here as a reference:

```console
$ /usr/lib/jvm/java-1.8.0/bin/java --version
Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```

```console
$ /usr/lib/jvm/java-1.8.0/bin/java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
```

```console
$ /usr/lib/jvm/jre-11/bin/java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```

```console
$ /usr/lib/jvm/jre-11/bin/java -version
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```

Fixes #14253
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14254
2023-06-15 15:42:09 +03:00
Botond Dénes
d1dc579062 Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai
this series adds an option named "uuid_sstable_identifier_enabled", and the related cluster feature bit, which is set once all nodes in this cluster set this option to "true". the sstable subsystem will then start using timeuuid instead of plain integers for sstable identifiers. timeuuid is a better choice for identifiers as we no longer need to worry about id conflicts. but we still have quite a few tests using static sstables with integers in their names; these tests are not changed in this series. we will create some tests to exercise the sstable subsystem with this option set.

a very simple inter-op test with Cassandra 4.1.1 was also performed to verify that the generated sstables can be read by the Cassandra:

1. start scylla, connect it with cqlsh, run the following commands, and stop it
    ```
    cqlsh> CREATE  KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
    cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
    cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
    cqlsh> SELECT * FROM ks.cf;
    ```
2. enable Cassandra's `uuid_sstable_identifiers_enabled`, start Cassandra 4.1.1, connect it with cqlsh, run the following commands, and stop it
    ```
    cqlsh> CREATE  KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
    cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
    cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
    cqlsh> SELECT * FROM ks.cf;
    ```
3. move away the sstables generated by Cassandra, and replace them with the sstables generated by scylladb:
    ```console
    $ mv ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49 /tmp
    $ mv ~/scylla/ks/cf-db47a12009d611eea6b8b179df3a2d5d ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49
    ```
4. start Cassandra 4.1.1 again, connect it with cqlsh, and run the following commands
    ```
    cqlsh> SELECT * FROM ks.cf;
     name | value
    ------+-------
        1 |   one
    ```

Fixes https://github.com/scylladb/scylladb/issues/10459

Closes #13932

* github.com:scylladb/scylladb:
  replica,sstable: introduce invalid generation id
  sstables, replica: pass uuid_sstable_identifiers to generation generator
  gms/feature_service: introduce UUID_SSTABLE_IDENTIFIERS cluster feature
  db: config: add uuid_sstable_identifiers_enabled option
  sstables, replica: support UUID in generation_type
2023-06-15 15:23:24 +03:00
Petr Gusev
fe5e1a5462 storage_service: warn if can't find ip for server
This shouldn't happen during normal operation.
2023-06-15 15:52:50 +04:00
Petr Gusev
5a3384f495 storage_proxy.cc: add and use global_token_metadata_barrier
fence_old_reads is removed since it's replaced by this barrier.
2023-06-15 15:52:50 +04:00
Petr Gusev
d9d29ec293 storage_service: exec_global_command: bool result -> exceptions
This allows us to reflect cause-and-effect
relationships in the logs: if some command
failed, we write to the log at the top level
of the topology state machine. The log message
includes the current state of the state
machine and a description of what
exactly went wrong.

Note that in the exec_global_command overload
returning node_to_work_on we don't call retake_node()
if the nested exec_global_command failed.
This is fine, since all the callers
just log/break in this case.
2023-06-15 15:52:50 +04:00
Petr Gusev
96a1c661bd raft_topology: add cmd_index to raft commands
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.

Topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against the topology change coordinator migrating
to another node by the already existing
terms check: if the term from the command
doesn't match the current term we return an error.
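The guard described above can be modeled in a few lines (a hypothetical Python sketch; the real check lives in the C++ RPC handler):

```python
class CommandHandler:
    """Reject a topology command unless its term matches the current raft
    term and its index is strictly greater than the last one seen."""

    def __init__(self, current_term):
        self.current_term = current_term
        self.last_index = 0

    def handle(self, term, cmd_index):
        if term != self.current_term:
            raise RuntimeError('command from a stale topology change coordinator')
        if cmd_index <= self.last_index:
            raise RuntimeError('reordered or duplicate topology command')
        self.last_index = cmd_index

handler = CommandHandler(current_term=1)
handler.handle(1, 1)   # accepted
handler.handle(1, 2)   # accepted: index grew within the same term
```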
2023-06-15 15:52:50 +04:00
Petr Gusev
94605e4839 storage_proxy.cc: add fencing to read RPCs
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.

Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when query coordinator was isolated from
topology change coordinator and didn't receive
barrier_and_drain.
2023-06-15 15:52:50 +04:00
Petr Gusev
4004ce1f44 storage_proxy.cc: extract handle_read
We continue the refactoring by introducing
the common implementation for all read methods.
2023-06-15 15:52:50 +04:00
Petr Gusev
2d791a5ed4 storage_proxy.cc: refactor encode_replica_exception_for_rpc
We are going to add fencing to read RPCs; it would be easier
to do it once for all three of them. This refactoring
enables that, since it allows us to use
encode_replica_exception_for_rpc for handle_read_digest.
2023-06-15 15:52:50 +04:00
Petr Gusev
6b115e902b storage_proxy: fix indentation 2023-06-15 15:52:50 +04:00
Petr Gusev
46f73fcaa6 storage_proxy: add fencing for mutation
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking, apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.

Fencing was also added to mutate_locally calls
on the request coordinator, for the case where
this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
2023-06-15 15:52:49 +04:00
Petr Gusev
7fe707570a storage_servie: fix indentation 2023-06-15 15:48:00 +04:00
Petr Gusev
d34da12240 storage_proxy: add fencing_token and related infrastructure
A new stale_topology_exception was introduced,
it's raised in apply_fence when an RPC comes
with a stale fencing_token.

An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
2023-06-15 15:48:00 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata,
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
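A toy model of the phased-barrier idea (illustrative Python only; the real code uses seastar futures and `utils::phased_barrier`):

```python
class PhasedBarrier:
    """Each token_metadata instance holds an operation on the phase it was
    created in; a version bump starts a new phase, and the barrier drains
    once every operation from earlier phases has been released."""

    def __init__(self):
        self.phase = 0
        self.pending = {}  # phase -> number of live operations

    def start_operation(self):
        self.pending[self.phase] = self.pending.get(self.phase, 0) + 1
        return self.phase

    def release(self, phase):
        self.pending[phase] -= 1

    def advance(self):
        self.phase += 1

    def drained_up_to(self, phase):
        return all(n == 0 for p, n in self.pending.items() if p < phase)

b = PhasedBarrier()
op = b.start_operation()              # a request captures token_metadata
b.advance()                           # version bump starts a new phase
assert not b.drained_up_to(b.phase)   # the old request is still in flight
b.release(op)                         # token_metadata instance destroyed
assert b.drained_up_to(b.phase)
```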
2023-06-15 15:48:00 +04:00
Petr Gusev
253d8a8c65 token_metadata: add topology version
It's stored as a static column in the topology table
and will be updated at various steps of the topology
change state machine.

The initial value is 1, zero means that topology
versions are not yet supported, will be
used in RPC handling.
2023-06-15 15:48:00 +04:00
Kefu Chai
2d265e860d replica,sstable: introduce invalid generation id
the invalid sstable id is the NULL of sstable identifiers. with
this concept, it is a lot simpler to find/track the greatest
generation. the complexity is hidden in generation_type, which
compares a) integer-based identifiers, b) uuid-based identifiers,
and c) the invalid identifier in different ways.

so, in this change

* the default constructor of generation_type is
  now public.
* we don't check for empty generation anymore when loading
  SSTables or enumerating them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
939fa087cc sstables, replica: pass uuid_sstable_identifiers to generation generator
before this change, we assumed that generations are always integer-based.
in order to enable the UUID-based generation identifier when the related
option is set, we need to pass this option down to the generation generator.

because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
sstable manager is added.

Fixes #10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
49071e48ae gms/feature_service: introduce UUID_SSTABLE_IDENTIFIERS cluster feature
UUID_SSTABLE_IDENTIFIERS is a new cluster-wide feature. if it is
enabled, all nodes will generate new sstables with UUIDs as their
generation identifiers. this feature is configured using the
config option "uuid_sstable_identifiers_enabled".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
4c2df04449 db: config: add uuid_sstable_identifiers_enabled option
unlike in Cassandra 4.1, this option is true by default. it will be used
to enable the "UUID_SSTABLE_IDENTIFIERS" cluster feature; it is not wired up yet.

please note, because we are still using sstableloader and
sstabledump based on 3.x branch, while the Cassandra upstream
introduced the uuid sstable identifier in its 4.x branch, these tools
fail to work with sstables with uuid identifiers, so this option
is disabled when performing these tests. we will enable it once
these tools are updated to support the uuid-based sstable identifiers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
15543464ce sstables, replica: support UUID in generation_type
this change generalizes the value of generation_type so that it also
supports UUID-based identifiers.

* sstables/generation_type.h:
  - add formatter and parser for UUID. please note, Cassandra uses
    a different format for stringifying the SSTable identifier, and
    that format suits our needs as it uses underscore "_" as the
    delimiter, while the file names of components use dash "-" as
    theirs. instead of reinventing the formatting or just using
    another delimiter in the stringified UUID, we choose to use
    Cassandra's formatting.
  - add accessors for accessing the type and value of generation_type
  - add constructor for constructing generation_type with UUID and
    string.
  - use a hash for placing sstables with uuid identifiers into shards,
    for a more uniform distribution of sstables across shards.
* replica/table.cc:
  - only update the generator if the given generation contains an
    integer
* test/boost:
  - add a simple test to verify the generation_type is able to
    parse and format
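The shard-placement bullet can be sketched like this (illustrative Python; `SHARD_COUNT` and `shard_of` are hypothetical stand-ins for the node's smp count and the C++ logic):

```python
import uuid

SHARD_COUNT = 8  # example value; the real count is the node's smp count

def shard_of(generation):
    """Integer generations map to shards by value; UUID generations by
    hash, so UUID-named sstables still spread evenly across shards."""
    if isinstance(generation, int):
        return generation % SHARD_COUNT
    return hash(generation) % SHARD_COUNT

assert shard_of(42) == 42 % SHARD_COUNT
assert 0 <= shard_of(uuid.uuid1()) < SHARD_COUNT
```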

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kamil Braun
59d4bb3787 tracing: remove qp.get_migration_manager() calls
Pass `migration_manager&` from top-level instead.
2023-06-15 09:48:54 +02:00
Kamil Braun
1b68e8582b table_helper: remove qp.get_migration_manager() calls
Push those calls up the call stack, to `trace_keyspace_helper` module.
Pass `migration_manager` reference around together with
`query_processor` reference.
2023-06-15 09:48:54 +02:00
Kamil Braun
9e25a3cbed thrift: handler: move implementation of execute_schema_command to query_processor
It's now named `execute_thrift_schema_command` in `query_processor`.

This allows us to remove yet another
`query_processor::get_migration_manager()` call.

Now that `execute_thrift_schema_command` sits near
`execute_schema_statement` (the latter used for CQL), we can see a
certain similarity. The Thrift version should also in theory get a retry
loop like the one CQL has, so the similarity would become even stronger.
Perhaps the two functions could be refactored to deduplicate some logic
later.
2023-06-15 09:48:54 +02:00
Kamil Braun
26cd3b9b78 data_dictionary: add get_version
The `replica::database` version simply calls `get_version`
on the real database.

The `schema_loader` version throws `bad_function_call`.
2023-06-15 09:48:54 +02:00
Kamil Braun
eace351ca3 cql3: statements: schema_altering_statement: move execute0 to query_processor
Rename it to `execute_schema_statement`.

This allows us to remove a call to
`query_processor::get_migration_manager`, the goal being to make it a
private member function.
2023-06-15 09:48:54 +02:00
Kamil Braun
2606c190af cql3: statements: pass migration_manager& explicitly to prepare_schema_mutations
We want to stop relying on `qp.get_migration_manager()`, so we can make
the function private in the future. This in turn is a prerequisite for
splitting `query_processor` initialization into two phases, where the
first phase will only allow local queries (and won't require
`migration_manager`).
2023-06-15 09:48:54 +02:00
Kamil Braun
817aff6615 main: add missing supervisor::notify message 2023-06-15 09:48:54 +02:00
Nadav Har'El
3a73048bc9 test/cql-pytest: reproducer for bug of PER PARTITION LIMIT with INDEX
This patch adds an xfailing test reproducing the bug in issue #12762:
When a SELECT uses a secondary index to list rows, if there is also a
PER PARTITION LIMIT given, Scylla forgets to apply it.

The test shows that the PER PARTITION LIMIT is correctly applied when
the index doesn't exist, but forgotten when the index is added.
In contrast, both cases work correctly in Cassandra.

This patch also adds a second variant of this test, which adds filtering
to the mix, and ensures that PER PARTITION LIMIT 1 doesn't give up on the
first row of each partition - but rather looks for the first row that
passes the filter, and only then moves on to the next partition.
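The semantics the test expects can be modeled directly (illustrative Python, not the test itself):

```python
def per_partition_limit(rows, limit, pred=lambda r: True):
    """Model of PER PARTITION LIMIT: after filtering, keep at most
    `limit` rows from each partition. rows = (partition_key, value)."""
    taken = {}
    out = []
    for pk, v in rows:
        if not pred((pk, v)):
            continue  # filtering happens before the per-partition count
        if taken.get(pk, 0) < limit:
            taken[pk] = taken.get(pk, 0) + 1
            out.append((pk, v))
    return out

rows = [('p1', 1), ('p1', 2), ('p2', 1), ('p2', 2)]
assert per_partition_limit(rows, 1) == [('p1', 1), ('p2', 1)]
# with a filter, the limit applies to the rows that pass it
assert per_partition_limit(rows, 1, lambda r: r[1] == 2) == [('p1', 2), ('p2', 2)]
```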

Refs #12762.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14248
2023-06-15 09:17:50 +03:00
Piotr Dulikowski
41fff6f425 storage_service: fix indentation after previous commit 2023-06-14 17:47:42 +02:00
Piotr Dulikowski
1f58c1e762 storage_service: remove optimization in cleanup_group0_config_if_needed
The `topology_coordinator::cleanup_group0_config_if_needed` function
first checks whether the number of group 0 members is larger than the
number of non-left entries in the topology table, then attempts to
remove nodes in left state from group 0 and prints a warning if no such
nodes are found. There are some problems with this check:

- Currently, a node is added to group 0 before it inserts its entry to
  the topology table. Such a node may cause the check to succeed but no
  nodes will be removed, which will cause the warning to be printed
  needlessly.
- Cluster features on raft will reverse the situation and it will be
  possible for an entry in system.topology to exist without the
  corresponding node being a part of group 0. This, in turn, may cause
  the check not to pass when it should and nodes could be removed later
  than necessary.

This commit gets rid of the optimization and the warning, and the
topology coordinator will always compute the set of nodes that should be
removed. Additionally, the set of nodes to remove is now computed
differently: instead of iterating over left nodes and including only
those that are in group 0, we now iterate over group 0 members and
include those that are in `left` state. As the number of left nodes can
potentially grow unbounded and the number of group 0 members is more
likely to be bounded, this should give better performance in
long-running clusters.
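The new computation amounts to iterating the bounded set (a hypothetical Python sketch of the idea):

```python
def nodes_to_remove(group0_members, topology_state):
    """Iterate over group 0 members (bounded) rather than over all nodes
    ever recorded in `left` state (which can grow without bound)."""
    return [m for m in group0_members if topology_state.get(m) == 'left']

state = {'n1': 'normal', 'n2': 'left', 'n3': 'left'}
# n3 already removed from group 0; n4 joined group 0 but has no topology entry yet
assert nodes_to_remove(['n1', 'n2', 'n4'], state) == ['n2']
```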
2023-06-14 17:47:24 +02:00
Kefu Chai
780aee9568 repair: extract remove_shard_task_id() in shard_repair_task_impl::run()
simpler this way. and this also matches the "add" call at the
beginning of this function.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14239
2023-06-14 17:03:27 +03:00
Avi Kivity
48f66dab38 cql3: select_statement: reindent execute_without_checking_exception_message_aggregate_or_paged() 2023-06-14 16:57:48 +03:00
Botond Dénes
a5ce2d5fb4 Merge 'Initialize storage_proxy early, without messaging_service and gossiper' from Kamil Braun
Move the initialization of `storage_proxy` early in the startup procedure, before starting
`system_keyspace`, `messaging_service`, `gossiper`, `storage_service` and more.

As a follow-up, we'll be able to move initialization of `query_processor` right
after `storage_proxy` (but this requires a bit of refactoring in
`query_processor` too).

Local queries through `storage_proxy` can be done after the early initialization step.
In a follow-up, when we do a similar thing for `query_processor`, we'll be able
to perform local CQL queries early as well. (Before starting `gossiper` etc.)

Closes #14231

* github.com:scylladb/scylladb:
  main, cql_test_env: initialize `storage_proxy` early
  main, cql_test_env: initialize `database` early
  storage_proxy: rename `init_messaging_service` to `start_remote`
  storage_proxy: don't pass `gossiper&` and `messaging_service&` during initialization
  storage_proxy: prepare for missing `remote`
  storage_proxy: don't access `remote` during local queries in `query_partition_key_range_concurrent`
  db: consistency_level: remove overload of `filter_for_query`
  storage_proxy: don't access `remote` when calculating target replicas for local queries
  storage_proxy: introduce const version of `remote()`
  replica: table: introduce `get_my_hit_rate`
  storage_proxy: `endpoint_filter`: remove gossiper dependency
2023-06-14 15:37:33 +03:00
Kefu Chai
4f85839be3 migration_manager: use try_emplace() when appropriate
try_emplace() is

- simpler than the lookup-and-insert dance,
- presumably more efficient, and
- most importantly, easier to read.
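For comparison, Python's `dict.setdefault` expresses the same single-lookup idea (analogy only; the commit changes C++ map code):

```python
key = 'schema'

# lookup-and-insert dance: two traversals of the map
cache = {}
if key not in cache:
    cache[key] = []
cache[key].append(1)

# single lookup, same result
cache2 = {}
cache2.setdefault(key, []).append(1)

assert cache == cache2
```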

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14237
2023-06-14 15:00:53 +03:00
Avi Kivity
809c67ad77 cql3: select_statement: coroutinize execute_without_checking_exception_message_aggregate_or_paged()
This function will always have a continuation, so converting it to a
coroutine will not cause extra allocations.

wrap_result_to_error_message(), which is used to convert a
coordinator_result-unaware continuation to a coordinator_result
aware continuation, is converted to traditional check-error-and-return.
2023-06-14 14:27:46 +03:00
Avi Kivity
f54049322d cql3: select_statement: split do_execute into fast-path and slow/slower paths
select_statement::do_execute() has a fast path where it forwards
to execute_without_checking_exception_message_non_aggregate_unpaged().
In this fast path, we aren't paging (a good reason for that is reading
a partition without clustering keys) and in the slow/slower paths
we page and/or perform complex processing like aggregation.

The fast path doesn't need any continuations, but the slow/slower paths
do. Split them off so that the slow/slower paths can be coroutinized
without impacting the fast path.
2023-06-14 14:24:41 +03:00
Kefu Chai
c508c656c5 Revert "build: make gen_headers a dependency of gen/*.o"
This reverts commit 9526258b89,
because the issue (#14213) it was supposed to fix only exists in the
enterprise branch, and that issue has been fixed in a different
way in a different place.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14234
2023-06-14 14:13:46 +03:00
Tomasz Grabiec
8986cdafb9 Merge 'migration_manager: coroutinize some member functions of migration_manager' from Kefu Chai
to reduce the indentation level and to improve readability. also, take this opportunity to name some variables for clarity.

Closes #14236

* github.com:scylladb/scylladb:
  migration_manager: coroutineize migration_manager::do_merge_schema_from()
  migration_manager: coroutineize migration_manager::sync_schema()
2023-06-14 12:49:07 +02:00
Tomasz Grabiec
87bbd2614b raft: Populate address mapping from system.peers early
Currently, the mapping is initialized from the gossiper state when
group0 server is started and updated from a gossiper change
listener. Gossiper state is restored from system.peers in
storage_service::join_cluster(), which is later than
setup_group0_if_exists() is called.

The restarted server will hang in
group0_service.setup_group0_if_exist(), which waits for snapshot
loading, which waits for storage_service::topology_state_load(), which
waits for IP mapping for servers mentioned in the topology, and
produces logs like this:

  WARN  2023-06-12 15:45:21,369 [shard 0] storage_service - (rate limiting dropped 196 similar messages) raft topology: cannot map c94ae68f-869d-4727-8b2f-d40814e395f0 to ip, retrying.

This is a regression after f26179c, where group0 server is initialized
before the gossiper is started.

The fix is to load the mapping from system.peers before group0 is
started. Gossiper state is not available at this point, so we read the
mapping directly from system keyspace. This change will also be needed
to implement messaging by host id, even if raft is disabled, where we
will need to restore the mapping early.

Fixes #14217

Closes #14220
2023-06-14 11:52:47 +02:00
Kamil Braun
b23cc9b441 main, cql_test_env: initialize storage_proxy early
This is another part of splitting Scylla initialization into two phases:
local and remote parts. Performing queries is done with `storage_proxy`,
so for local queries we want to initialize it before we initialize
services specific to cluster communication such as `gossiper`,
`messaging_service`, `storage_service`.

`system_keyspace` should also be initialized after `storage_proxy` (and
is after this patch) so in the future we'll be able to merge the
multiple initialization steps of `system_keyspace` into one (it only
needs the local part to work).
2023-06-14 11:41:36 +02:00
Kamil Braun
a8f6afc2fd main, cql_test_env: initialize database early
We want to separate two phases of Scylla service initialization: first
we initialize the local part, which allows performing local queries,
then a remote part, which requires contacting other nodes in a cluster
and allows performing distributed queries.

The `database` object is crucial for both remote and local queries, but it
was created pretty late, after services such as `gossiper` or
`storage_service` which are used for distributed operations.

Fortunately we can easily move `database` initialization and all of its
prerequisites early in the init procedure.
2023-06-14 11:41:36 +02:00
Kamil Braun
a740fbf58a storage_proxy: rename init_messaging_service to start_remote
The function now has more responsibilities than before, rename it
and add a comment to better illustrate this.
2023-06-14 11:41:36 +02:00
Kamil Braun
f26e98c3be storage_proxy: don't pass gossiper& and messaging_service& during initialization
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.

The `remote` object is then destroyed in `uninit_messaging_service`.

Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
2023-06-14 11:41:36 +02:00
Kamil Braun
10f11b89ea storage_proxy: prepare for missing remote
Prepare the users of `remote` for the possibility that it's gone.

The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).

There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
2023-06-14 11:41:36 +02:00
Kamil Braun
8db1d75c6c storage_proxy: don't access remote during local queries in query_partition_key_range_concurrent
In `query_partition_key_range_concurrent` there's a calculation of cache
hit rates which requires accessing `gossiper` through `remote`. We want
to support local queries when `remote` is unavailable.

Check if it's a local query and only if not, fetch `gossiper` from `remote`.
2023-06-14 11:41:36 +02:00
Kamil Braun
0e36377f56 db: consistency_level: remove overload of filter_for_query
Not used anymore after the previous commit.
2023-06-14 11:41:36 +02:00
Kamil Braun
ddcbade919 storage_proxy: don't access remote when calculating target replicas for local queries
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.

Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
2023-06-14 11:41:34 +02:00
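The early-return pattern this series introduces can be sketched in isolation as follows; the types and names here (`remote_parts`, `filter_replicas_for_read`) are illustrative stand-ins, not Scylla's actual `storage_proxy` API:

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct remote_parts { /* gossiper etc. would live here */ };

// Answer purely-local queries without ever dereferencing the remote
// subsystem, which may be unavailable during early initialization.
static std::vector<std::string> filter_replicas_for_read(
        bool local_query,
        const std::vector<std::string>& replicas,
        const std::optional<remote_parts>& remote) {
    if (local_query) {
        return {"localhost"};  // early return: no remote access at all
    }
    if (!remote) {
        throw std::runtime_error("remote subsystem unavailable");
    }
    return replicas;  // a real implementation would consult gossiper liveness
}
```

The point of the helper is that the local-query branch is taken before `remote` is touched, so local reads keep working before cluster services are up.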
Kefu Chai
6687aa35df repair: coroutinize repair_cf_range_row_level()
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14240
2023-06-14 12:39:53 +03:00
Pavel Emelyanov
d1de796f6b sstable: Move XFS renamer hack into fs storage
The method sits on sstable, but it is called only from fs storage, which is
the only place that really needs it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14230
2023-06-14 12:35:04 +03:00
Wojciech Mitros
89b6c84b49 database: remove unused header
After recent changes, all wasm related logic has been moved from
the database class to the query_processor. As a result, the wasm
headers no longer need to be included there, and in particular,
files that include replica/database.hh no longer need to wait
on the generated header rust/wasmtime_bindings.hh to compile.

Fixes #14224

Closes #14223
2023-06-14 12:33:20 +03:00
Nadav Har'El
5a75713ea7 cql-pytest: translate Cassandra's test for UPDATE operations
This is a translation of Cassandra's CQL unit test source file
validation/operations/UpdateTest.java into our cql-pytest framework.

There are 18 tests, and they did not reproduce any previously-unknown
bug, but did provide additional reproducers for two known issues:

Refs #12243: Setting USING TTL of "null" should be allowed

Refs #12474: DELETE/UPDATE print misleading error message suggesting
             ALLOW FILTERING would work
     Note that we knew about this issue for the DELETE operation, and
     the new test shows the same issue exists for UPDATE.

I had to modify some of the tests to allow for different error messages
in ScyllaDB (in cases where the different message makes sense), as well
as cases where we decided to allow in Scylla some behaviors that are
forbidden in Cassandra - namely Refs #12472.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14222
2023-06-14 12:31:15 +03:00
Botond Dénes
4191e97d19 Update tools/java submodule
* tools/java 0cbfeb03...9f63a96f (1):
  > s/egrep/grep -E/
2023-06-14 12:29:59 +03:00
Botond Dénes
3479adc85f Merge 'Prepare sstable_directory lister to garbage_collect() s3 stuff' from Pavel Emelyanov
When Scylla starts, it collects dangling sstables from the datadir. This includes temporary sstable directories and the pending-deletion log. S3-backed sstables cannot be garbage-collected like that; instead, "garbage" entries from the ownership table should be processed. Currently the g.c. code is unaware of storage and scans the datadir for whatever sstable it's called for.

This PR prepares the garbage_collect() call to become virtual, but a no-op for the ownership-table lister. Proper S3 garbage-collecting is not here yet; it needs an extra patch to the seastar http client.

refs: #13024

Closes #14023

* github.com:scylladb/scylladb:
  sstable_directory: Do not collect filesystem garbage for S3-backed sstables
  sstable_directory: Deduplicate .process() location argument
  sstable_directory: Keep directory lister on stack
  sstable_directory: Use directory_lister API directly
2023-06-14 12:06:37 +03:00
Botond Dénes
aaac455ebe Merge 'doc: add OS support for ScyllaDB 5.3' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/14084

This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.

Closes #14228

* github.com:scylladb/scylladb:
  doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
  doc: add OS support for ScyllaDB 5.3
2023-06-14 11:42:48 +03:00
Anna Stuchlik
bbd7c7db72 doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x 2023-06-14 09:46:23 +02:00
Kefu Chai
25587b679d migration_manager: coroutinize migration_manager::do_merge_schema_from()
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-14 15:31:46 +08:00
Kefu Chai
befc9096d1 migration_manager: coroutinize migration_manager::sync_schema()
to reduce the indentation level, and to improve the readability.
also, take this opportunity to name some variables for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-14 15:19:56 +08:00
Jan Ciolek
5fce4d9675 types: add read_nth_tuple_element()
Add a function which retrieves the value of the nth
field from a serialized tuple value.

I tried to make it as efficient as possible.
Other functions, like evaluate(subscript), tend to
deserialize the whole structure and put all of its
elements in a vector. Then they select a single element
from this vector.
This is wasteful, as we only need a single element's value.

This function goes over the serialized fields
and directly returns the one that is needed.
No allocations are needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-14 07:22:39 +02:00
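The scan-and-skip idea can be sketched outside Scylla's codebase as follows. The layout assumed here (each field prefixed by a 4-byte big-endian signed length, with -1 denoting null) matches the usual CQL tuple encoding, but the function name and types below are illustrative, not the real `read_nth_tuple_element()` signature:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Walk the serialized fields and skip them until the requested index,
// so no vector of all deserialized elements is ever materialized.
static std::optional<std::string> read_nth_field(const std::vector<uint8_t>& buf, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; pos + 4 <= buf.size(); ++i) {
        uint32_t u = (uint32_t(buf[pos]) << 24) | (uint32_t(buf[pos + 1]) << 16)
                   | (uint32_t(buf[pos + 2]) << 8) | uint32_t(buf[pos + 3]);
        int32_t len = int32_t(u);  // big-endian signed length prefix
        pos += 4;
        if (i == n) {
            if (len < 0) {
                return std::nullopt;  // null field
            }
            if (pos + size_t(len) > buf.size()) {
                return std::nullopt;  // truncated input
            }
            return std::string(buf.begin() + pos, buf.begin() + pos + len);
        }
        if (len > 0) {
            pos += size_t(len);  // skip this field's payload
        }
    }
    return std::nullopt;  // index past the last field
}
```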
Avi Kivity
190d1b20bf cql3: selector: drop inheritance from assignment_testable
Since all function overload selection is done by prepare_expression(),
we no longer need to implement the assignment_testable interface, so
drop it.

Since there's now just one implementation of assignment_testable,
we can drop it and replace it by the implementation (expressions),
but that is left for later.
2023-06-13 21:04:49 +03:00
Avi Kivity
f438b9b044 cql3: selection: rely on prepared expressions
Now that selector expressions are prepared, we can avoid doing
the work ourselves:

 - function_name:s are resolved into functions, so we can error
   out if we see a function_name (and drop the with_function class)
 - casts are converted to anonymous functions, so we can error
   out if we see them (and drop with with_cast class)
 - field_selection:s can rely on the prepared field_idx
2023-06-13 21:04:49 +03:00
Avi Kivity
1040589828 cql3: selection: prepare selector expressions
Call prepare_expression() on selector expressions to resolve types. This
leaves us with just one way to move from the unprepared domain to the
prepared domain.

The change is somewhat awkward since do_prepare_selectable() is re-doing
work that is done by prepare_expression(), but somehow it all works. The
next patch will tear down the unnecessary double-preparation.
2023-06-13 21:04:49 +03:00
Avi Kivity
6c55bdc417 cql3: expr: match counter arguments to function parameters expecting bigint
assignment_testable is used to convey type information to function overload
selection. The implementation for `selector` recognizes that counters are
really bigints and special cases them. The equivalent implementation for
expressions doesn't, so bring over that nuance here too.

With this, things like sum(counter_column) match the overload for
sum(bigint) rather than failing.
2023-06-13 21:04:49 +03:00
Avi Kivity
2c1e36d0ac cql3: expr: avoid function constant-folding if a thread is needed
Our prepare phase performs constant-folding: if an expression
is composed of constants, and is pure, it is evaluated during
the preparation phase rather than during query execution.

This however can't work for user-defined functions, as these require
running in a thread, and we aren't running in a thread during
preparation time. Skip the optimization in this case.
2023-06-13 21:04:49 +03:00
Avi Kivity
8d3d8eeedb cql3: add optional type annotation to assignment_testable
Before this series, function overload resolution peeked
at function arguments to see if they happened to be selectors,
and if so grabbed their type. If they did not happen to be
selectors, we wouldn't know their type, but as it happened
all generic functions are aggregates, and aggregates are only
legal in the SELECT clause, so that didn't matter.

In a previous patch, we changed assignment_testable to carry
an optional type and wired it to selector, so we wouldn't
need to dynamic_cast<selector>.

Now, we wire the optional type to assignment_testable_expression,
so overload resolution of generic functions can happen during
expression preparation.

The code that bridges the function argument expressions to
assignment_testable is extracted into a function, since it's
too complicated to be written as a transform.
2023-06-13 21:04:49 +03:00
Avi Kivity
2cb15d0829 cql3: expr: wire unresolved_identifier to test_assignment() 2023-06-13 21:04:49 +03:00
Avi Kivity
b7bbcdd178 cql3: expr: support preparing column_mutation_attribute
Fairly straightforward. A unit test is added.
2023-06-13 21:04:49 +03:00
Avi Kivity
73b6b6e3d1 cql3: expr: support preparing SQL-style casts
We convert the cast to a function, just like the existing
with_function selectable.
2023-06-13 21:04:49 +03:00
Avi Kivity
521a128a2a cql3: expr: support preparing field_selection expressions
The field_selection structure is augmented with the field
index so that does not need to be done at evaluation time,
similar to the current with_field_selection selectable.
2023-06-13 21:04:49 +03:00
Avi Kivity
ecfe4ad53a cql3: expr: make the two styles of cast expressions explicit
CQL supports two cast styles:

 - C-style: (type) expr, used for casts between binary-compatible types
   and for type hinting of bind variables
 - SQL-style: (expr AS type), used for real type conversions

Currently, the expression system differentiates them by the cast::type
field, which is a data_type for SQL-style casts and a cql3_type::raw
for C-style casts, but that won't work after the prepare phase is applied
to SQL-style casts when the type field will be prepared into a data_type.

Prepare for this by adding a separate enum to distinguish between the
two styles.
2023-06-13 21:04:49 +03:00
Avi Kivity
871c1c4f99 cql3: error injection functions: mark enabled_injections() as impure
A pure function should return the same value on every invocation,
but enabled_injections() returns true or false depending on global
state.

Mark it impure to reflect that.

Currently, the bug has no effect, but once we prepare selectors,
the prepare_function_call() will constant-fold calls to pure
functions, so we'll capture global state at prepare time rather
than evaluate it each time anew.
2023-06-13 21:04:49 +03:00
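A minimal model of why purity matters for prepare-time constant folding; the `expr` struct and `fold()` below are illustrative stand-ins, not Scylla's expression types:

```cpp
#include <functional>
#include <optional>

struct expr {
    std::optional<int> constant;  // engaged => already folded to a constant
    std::function<int()> call;    // stand-in for a prepared function call
    bool pure = true;
};

// A prepare-time folder may only evaluate a call when the function is pure:
// a pure function returns the same value on every invocation, so evaluating
// it once at prepare time is safe. An impure call (one reading global state,
// like enabled_injections()) must stay unevaluated until query execution.
static expr fold(expr e) {
    if (!e.constant && e.pure) {
        e.constant = e.call();  // fold now, at "prepare" time
    }
    return e;  // impure calls are left for execution time
}
```

Marking a function impure when it reads global state is exactly what keeps the folder from capturing that state at prepare time.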
Avi Kivity
c0f59f0789 cql3: eliminate dynamic_cast<selector> from functions::get()
Type inference for function calls is a bit complicated:
 - a function argument can be inferred from the signature: a call to
   my_func(:arg) will infer :arg's type from the function signature
 - a function signature can be inferred from its argument types:
   a call to max(my_column) will select the correct max() signature
   (as max is generic) from my_column's type

Currently, functions::get() implements this by invoking
dynamic_cast<selector*> on the argument. If the caller of
functions::get() is the SELECT clause preparation, then the
cast will succeed and we'll be able to find the type. If not,
we fail (and fall back to inferring the argument types from a
non-generic function signature).

Since we're about to move selectors to expressions, the dynamic_cast
will fail, so we must replace it with a less fragile approach.

The fix is to augment assignment_testable (the interface representing
a function argument) with an intentionally-awkwardly-named
assignment_testable_type_opt(), that sees whether we happen to know
the type for the argument in order to implement signature-from-argument
inference.

A note about assignment_testable: this is a bridge interface
that is the least common denominator of anything that calls functions.
Since we're moving towards expressions, there are fewer implementations of
the interface as the code evolves.
2023-06-13 21:04:49 +03:00
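The shape of the fix can be sketched in isolation: instead of `dynamic_cast`-ing an argument to a concrete type to peek at its value type, the bridge interface exposes an optional type accessor that any implementation may answer. The class names below mirror the commit's vocabulary but the details (using a string for the type) are illustrative:

```cpp
#include <optional>
#include <string>
#include <utility>

struct assignment_testable {
    virtual ~assignment_testable() = default;
    // Intentionally awkwardly named: engaged only when the implementation
    // happens to already know its type, enabling signature-from-argument
    // inference without any dynamic_cast.
    virtual std::optional<std::string> assignment_testable_type_opt() const {
        return std::nullopt;
    }
};

struct selector_like : assignment_testable {
    std::string type;
    explicit selector_like(std::string t) : type(std::move(t)) {}
    std::optional<std::string> assignment_testable_type_opt() const override {
        return type;  // a selector knows its type, so it can answer
    }
};
```

Callers that used to `dynamic_cast<selector*>` now just query the accessor and fall back gracefully when it returns `nullopt`.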
Avi Kivity
5983e9e7b2 cql3: test_assignment: pass optional schema everywhere
test_assignment() and related functions check for type compatibility between
a right-hand-side and a left-hand-side.

It started its life with a limited functionality for INSERT and UPDATE,
but now it's about to be used for cast expression in selectors, which
can cast a column_value. A column_value is still an unresolved_identifier
during the prepare phase, and cannot be resolved without a schema.

To prepare for this, pass an optional schema everywhere.

Ultimately, test_assignment likely needs to be folded into prepare_expr(),
but before that prepare_expr() has to be used everywhere.
2023-06-13 21:04:49 +03:00
Avi Kivity
8dc22293bf cql3: expr: prepare_expr(): allow aggregate functions
prepare_expr() began its life as a replacement for the WHERE clause,
so it shares its restrictions, one of which is not supporting aggregate
functions.

In previous patches, we added an explicit check to all users, so we can
now remove the check here, so that we can later prepare selectors.

In addition to dropping the check, we drop the dynamic_cast<scalar_function>,
as it can now fail. It turns out it's unnecessary since everything is available
from the base class.

Note we don't allow constant folding involving aggregate functions: first,
our evaluator doesn't support it, and second, we don't have the iteration count
at prepare time.
2023-06-13 21:04:49 +03:00
Avi Kivity
b7a90d51d2 cql3: add checks for aggregation functions after prepare
Since we don't yet prepare selectors, all calls to prepare_expr()
are adjusted.

Note that missing a check isn't fatal - it will be trapped at runtime
because evaluate(aggregate) will throw.
2023-06-13 21:04:49 +03:00
Avi Kivity
6db916e5b6 cql3: expr: add verify_no_aggregate_functions() helper
Aggregate functions are only allowed in certain contexts (the
SELECT clause and the HAVING clause, which we don't yet have).

prepare_expr() currently rejects aggregate functions, but that means
we cannot use it to prepare selectors.

To prepare for the use of prepare_expr() in selectors, we'll have to
move the check out of prepare_expr(). This helper is the beginning of
that change.

I considered adding a parameter to prepare_expr(), but that is even
more noisy than adding a call to the helper.
2023-06-13 21:04:49 +03:00
Avi Kivity
e7c1824ed0 test: add regression test for rejection of aggregates in the WHERE clause
The test passes on Cassandra and ScyllaDB.
2023-06-13 21:04:49 +03:00
Avi Kivity
54f3050225 cql3: expr: extract column_mutation_attribute_type
column_mutation_attribute_type() returns int32_type or long_type
depending on whether TTL or WRITETIME is requested.

Will be used later when we prepare column_mutation_attribute
expressions.
2023-06-13 21:04:49 +03:00
Avi Kivity
d2f4bd8b85 cql3: expr: add fmt formatter for column_mutation_attribute_kind
It's easier to use for logging.
2023-06-13 21:04:49 +03:00
Avi Kivity
220a3efa73 cql3: statements: select_statement: reuse to_selectable() computation in SELECT JSON
We store the result of to_selectables() in a local variable, then compute it
again in the next line. Fix by reusing the variable.
2023-06-13 21:04:49 +03:00
Avi Kivity
096e569054 cql3: select_statement: disambiguate execute() overloads
There are two execute() overloads, but they don't do the same thing - one is a
partial implementation of the other.

The same is true of two execute_without_checking_exception_message() overloads.

Change the name of the subordinate overload to indicate its role. Overloads should
be used when the only difference between overloads is the argument type, not when
one does a subset of the other.
2023-06-13 19:28:29 +03:00
Kefu Chai
8a54e478ba build: cmake: use -O0 for debug build
per clang's documentation, -Og is like -O1, which is in turn an optimization
level between -O0 and -O2. -O0 "generates the most debuggable code":
for instance, with -O0, presumably, the variables are not allocated in
registers and later overwritten; they are always allocated on
the stack. this helps with debugging.

in this change, -O0 is used for a better debugging experience. the
downside is that the emitted code size will be greater than with
-Og, and the executable will be slower.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14210
2023-06-13 18:25:36 +03:00
Avi Kivity
7dadd38161 Revert "configure: Switch debug build from -O0 to -Og"
This reverts commit 7e68ed6a5d. With -Og,
a simple 'gdb build/debug/scylla -ex start' shows the function parameters
as "optimized out", while with -O0 they display fine. This applies to
all variables, not just main's parameters.

Bisect revealed that this behavior started with the reverted commit; it's
not due to a later toolchain update.

Fixes #14196.

Closes #14197
2023-06-13 18:08:21 +03:00
David Garcia
7c22877613 docs: separate homepage
Update index.rst

Update index.rst

Update index.rst
2023-06-13 15:16:24 +01:00
Anna Stuchlik
152e90d4fa doc: add OS support for ScyllaDB 5.3
Fixes https://github.com/scylladb/scylladb/issues/14084

This commit adds OS support for version 5.3 to the table
on the OS Support by Linux Distributions and Version page.
2023-06-13 16:03:48 +02:00
Nadav Har'El
ed1f37a8f9 Merge 'Tests for cluster features' from Piotr Dulikowski
This PR adds a set of tests for cluster features. The set covers those tests from the test plan for the upcoming "cluster features on raft" that do not depend on implementation-specific details. Those tests are applicable to the existing gossip-based implementation, so they can be useful for us right now.

The tests simulate cluster upgrades by conditionally marking support for the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the tests work only in non-release mode.

The `test_partial_upgrade_can_be_finished_with_removenode` test is marked as skipped because of a bug in gossip that prevents features from being enabled if a node was removed within the last 3 days (#14194).

Closes #14211

* github.com:scylladb/scylladb:
  test/topology: add cluster feature tests
  test: introduce get_supported_features/get_enabled_features
  test: move wait_for_feature to pylib utils
  feature_service: add a test-only cluster feature
2023-06-13 16:59:56 +03:00
Avi Kivity
78f4ee385f cql3: functions: fix count(col) for non-scalar types
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.

The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.

As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.

A unit test is added. It passes with Cassandra as well.

Fixes #14198.

Closes #14199
2023-06-13 14:40:14 +03:00
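The intended `count(col)` semantics can be modeled in a few lines; this is a sketch of the behavior the commit restores, not the actual aggregate implementation, with the non-scalar cell modeled as an arbitrary payload type:

```cpp
#include <optional>
#include <vector>

// count(col) only examines its input to see if it is NULL -- a NULL cell
// does not contribute, whatever the cell's type T is (collection, tuple,
// user-defined type, ...). count(*) would simply be the number of rows.
template <typename T>
long count_column(const std::vector<std::optional<T>>& cells) {
    long n = 0;
    for (const auto& cell : cells) {
        if (cell.has_value()) {
            ++n;
        }
    }
    return n;
}
```

Since the check is type-agnostic, a single generic definition covers the infinite set of non-scalar types that cannot all be pre-declared.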
Michał Sala
e0855b1de2 forward_service: introduce shutdown checks
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.

The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.

This fix should resolve the issue where forward_service retries could
block the shutdown.

Fixes #12604

Closes #13922
2023-06-13 13:44:33 +03:00
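The flag-before-retry pattern can be sketched generically; the class and method names below are illustrative, not the real forward_service interface:

```cpp
#include <atomic>
#include <stdexcept>

struct forwarder {
    std::atomic<bool> shutting_down{false};

    void shutdown() { shutting_down.store(true); }

    // Check the flag before every attempt that would touch the messaging
    // service, so a retry loop cannot outlive (and block) shutdown.
    int dispatch_with_retry(int attempts, int (*send)(int)) {
        for (int i = 0; i < attempts; ++i) {
            if (shutting_down.load()) {
                throw std::runtime_error("forwarder is shutting down");
            }
            try {
                return send(i);
            } catch (const std::exception&) {
                // transient failure: fall through and retry
            }
        }
        throw std::runtime_error("all attempts failed");
    }
};
```

The key property is that once `shutdown()` flips the flag, the next retry iteration bails out immediately instead of re-entering a call that can no longer complete.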
Kamil Braun
ff8d88a228 storage_proxy: introduce const version of remote()
One version is implemented using the other (with `const_cast`) because
some additional safety checks will be added in a later commit.
2023-06-13 12:44:03 +02:00
Anna Stuchlik
9d1f62fdbf doc: remove warnings against reverse queries
Refs: https://github.com/scylladb/scylla-doc-issues/issues/831

This commit removes the troubleshooting page about reverse
queries, as well as a warning on the Tips page against using
reverse queries.

Closes #14190
2023-06-13 13:19:39 +03:00
David Garcia
b4b13f43dd docs: edit landing page
docs: add icons

docs: update icons

Closes #13559
2023-06-13 12:14:01 +03:00
Piotr Dulikowski
423dc613c3 system_keyspace: overwrite, not add tokens in topology_node_mutation_builder::set
The `topology_node_mutation_builder::set` function, when passed a
non-empty set of tokens, will construct a mutation that adds given
tokens to the column instead of overwriting them. This is not a problem
today because we are always calling `set` on an empty column, but given
the fact that the function is called `set` not `add` and other overloads
of `set` do overwrite, the function might be misused in the future.

This commit fixes the problem by initializing the tombstone in
`collection_mutation_description` properly, causing the previous state
to be dropped before applying new tokens. The tombstone has a timestamp
which is one less than the timestamp of the added cells, mimicking the
CQL behavior which happens when a non-frozen collection is overwritten.

Closes #14216
2023-06-12 23:37:43 +03:00
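A toy model of the merge semantics relied on here, assuming the usual rule that a cell survives a collection tombstone only if its timestamp is strictly greater; the types are illustrative, not Scylla's mutation classes:

```cpp
#include <map>
#include <set>
#include <string>

using cells = std::map<std::string, long>;  // element -> write timestamp

// Merge old and new collection state under a tombstone: an element is live
// only if its cell timestamp is greater than the tombstone timestamp.
// Writing the new cells at ts and a tombstone at ts - 1 therefore shadows
// all older state while keeping the newly written cells -- mimicking CQL
// overwrite of a non-frozen collection.
static std::set<std::string> merge(const cells& old_cells, const cells& new_cells, long tombstone_ts) {
    std::set<std::string> live;
    for (const auto& [k, ts] : old_cells) {
        if (ts > tombstone_ts) live.insert(k);
    }
    for (const auto& [k, ts] : new_cells) {
        if (ts > tombstone_ts) live.insert(k);
    }
    return live;
}
```

Without the tombstone (modeled below as a very old timestamp), old and new elements simply accumulate, which is the "add instead of set" bug the commit fixes.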
Kamil Braun
2cd17819cd replica: table: introduce get_my_hit_rate
Doesn't require `gossiper&`.
2023-06-12 15:23:56 +02:00
Kamil Braun
c5c78a7922 storage_proxy: endpoint_filter: remove gossiper dependency
The function used `gossiper&` to check whether an endpoint is considered
alive. Abstract this out through `noncopyable_function`.

This will allow us to use `endpoint_filter` during local queries when
`remote` (which contains the `gossiper` reference) is unavailable.
2023-06-12 15:23:48 +02:00
Kefu Chai
9526258b89 build: make gen_headers a dependency of gen/*.o
when compiling the generated source files, sometimes, we can run
into the FTBFS like:

02:18:54  FAILED: build/release/gen/cql3/CqlParser.o
02:18:54  clang++ ... -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
...
02:18:54  In file included from build/release/gen/cql3/CqlParser.cpp:44:
02:18:54  In file included from build/release/gen/cql3/CqlParser.hpp:75:
02:18:54  In file included from ./cql3/statements/create_function_statement.hh:12:
02:18:54  In file included from ./cql3/functions/user_function.hh:16:
02:18:54  ./lang/wasm.hh:15:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
02:18:54  #include "rust/wasmtime_bindings.hh"
02:18:54           ^~~~~~~~~~~~~~~~~~~~~~~~~~~

CqlParser.cc is a source file generated from cql3/Cql.g; this source in
turn includes another source file generated from
wasmtime_bindings/src/lib.rs. but we failed to set up this dependency in
the build.ninja rules -- we only taught ninja that "to compile the
grammar source files, please prepare the `serializers` source files
first", and this is not enough.

so, in this change, we just replace `serializers` with `gen_headers`,
as the latter is a superset of the former and should fulfill the needs
of CqlParser.cc.

Fixes #14213
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14214
2023-06-12 15:36:03 +03:00
Piotr Dulikowski
1fb17061d2 test/topology: add cluster feature tests
This commit adds a set of tests for cluster features. The set covers
those tests from the test plan for the upcoming "cluster features on
raft" that do not depend on implementation-specific details. Those tests
are applicable to the existing gossip-based implementation, so they can
be useful for us right now.

The tests simulate cluster upgrades by conditionally marking support for
the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the
tests work only in non-release mode.

The `test_partial_upgrade_can_be_finished_with_removenode` test is
marked as skipped because of a bug in gossip that prevents features from
being enabled if a node was removed within the last 3 days (#14194).
2023-06-12 13:28:31 +02:00
Piotr Dulikowski
e7c355e84f test: introduce get_supported_features/get_enabled_features
Introduces two helper functions that allow getting information about
supported/enabled features on a node, according to its system tables.

As a bonus, the `wait_for_feature` function is refactored to use
`get_enabled_features`.
2023-06-12 13:28:16 +02:00
Anna Stuchlik
b7022cd74e doc: remove support for Ubuntu 18
Fixes https://github.com/scylladb/scylladb/issues/14097

This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.

The update is in sync with the change made for
ScyllaDB 5.2.

This commit must be backported to branch-5.2 and
branch-5.3.

Closes #14118
2023-06-12 13:27:07 +03:00
Piotr Dulikowski
56d3d8b9e2 test: move wait_for_feature to pylib utils
The `wait_for_feature` can be useful, and will be used, in other test
suites than `topology_raft_disabled`, so it is moved to the common pylib
utils.
2023-06-12 10:09:00 +02:00
Piotr Dulikowski
4c5e44a1cd feature_service: add a test-only cluster feature
Adds a cluster feature called TEST_ONLY_FEATURE. It can only be marked
as supported via error injection "features_enable_test_feature", which
can be enabled on node startup via CLI option or YAML configuration.

The purpose of this cluster feature is to simulate upgrading a node to a
version that supports a new feature. This allows us to write tests which
verify that the cluster feature mechanism works.

The fact that TEST_ONLY_FEATURE can only be enabled via error injection
should make it impossible to accidentally enable it in release mode and,
consequently, in production.
2023-06-12 10:05:59 +02:00
Harsh Soni
d8c3b144cb alternator: optimize validate_table_name call
Prior to this, `table_name` was validated for every request in `find_table_name`, leading to unnecessary (although small) overhead. Now, `table_name` is only validated during the creation request, and in other requests only if the table does not exist (to keep compatibility with DynamoDB's exceptions).

Fixes: #12538

Closes #13966
2023-06-12 10:46:13 +03:00
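The described fast path can be sketched as follows; the name rules checked here (3-255 characters from [a-zA-Z0-9_.-]) follow DynamoDB's documented constraints, but the function names, the in-memory table map, and the exception choices are illustrative, not the actual alternator code:

```cpp
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

static bool valid_table_name(const std::string& name) {
    if (name.size() < 3 || name.size() > 255) { return false; }
    for (char c : name) {
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_' && c != '-' && c != '.') {
            return false;
        }
    }
    return true;
}

// Validate only when the lookup fails: the common case of an existing table
// skips validation entirely, while a missing table still produces the
// DynamoDB-compatible "invalid name" vs "no such table" distinction.
static int find_table(const std::map<std::string, int>& tables, const std::string& name) {
    auto it = tables.find(name);
    if (it != tables.end()) {
        return it->second;  // fast path: no validation needed
    }
    if (!valid_table_name(name)) {
        throw std::invalid_argument("invalid table name");
    }
    throw std::out_of_range("no such table");
}
```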
Avi Kivity
79bfe04d2a cql3: remove abstract_marker vestiges
Removed by e458340821 ("cql3: Remove term")

Closes #14192
2023-06-12 10:41:04 +03:00
Nadav Har'El
59f331c4e1 Merge 'create-relocatable-package.py: package build/node_export only for stripped version' from Kefu Chai
because we build stripped package and non-stripped package in parallel
using ninja. there are chances that the non-stripped build job could
be adding build/node_exporter directory to the tarball while the job
building stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. but objcopy creates
temporary files when processing the executables. and the temporary
files can be spotted by the non-stripped build job. there are two
consequences:

1. non-stripped build job includes the temporary files in its tarball,
   even they are not supposed to be distributed
2. non-stripped build job fails to include the temporary file(s), as
   they are removed after objcopy finishes its job. but the job did spot
   them when preparing the tarball. so when the tarfile python module
   tries to include the previous found temporary file(s), it throws.

neither of these consequences is expected. but fortunately, this only
happens when packaging the non-stripped package. when packaging the
stripped package, the build/node_exported directory is not in flux
anymore. as ninja ensures the dependencies between the jobs.

so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have already been added
separately as regular files. when packaging the stripped version,
we still use the existing step, as we don't have to list all the
files created by strip.sh:

node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}

we could list them in this script, but the repetition is unnecessary and
error-prone. so let's keep including the whole directory recursively,
so all the debug symbols are included.

Fixes https://github.com/scylladb/scylladb/issues/14079

Closes #14081

* github.com:scylladb/scylladb:
  create-relocatable-package.py: package build/node_export only for stripped version
  create-relocatable-package.py: use positive condition when possible
2023-06-12 10:39:10 +03:00
Kefu Chai
c3d91f5190 tracing: drop trace(.., std::string&&) overload
this change is a follow-up of 4f5fcb02fd,
the goal is to avoid programming oversights such as

```c++
trace(trace_ptr, "foo {} with {} but {} is {}");
```

as `trace(const trace_state_ptr& p, const std::string& msg)` is
a better match than the templated one, i.e.,
`trace(const trace_state_ptr& p, fmt::format_string<T...> fmt, T&&...
args)`. so we cannot detect this with the compile-time format checking.

so let's just drop this overload, and update its callers to use
the other overload.

The change was suggested by Avi. the example also came from him.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14188
2023-06-10 20:09:35 +03:00
Kefu Chai
e464ad2568 table: s/lw_shared/unique_ptr/ when appropriate
sel is a local variable, and it is not shared with anybody else.
so make it a unique_ptr<> for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14189
2023-06-09 17:38:18 +03:00
Pavel Emelyanov
c1c1752f88 s3/client: Replace sink flush semaphore with gate
Uploading sinks have an internal semaphore limiting the maximum number of
uploading parts and pieces to two. This approach has
several drawbacks.

1. The number is arbitrary. It could just as well be three, four, or
   anything else.

2. Jumbo upload in fact violates this parallelism, because the limit
   applies to the maximum number of pieces _and_ the maximum number of
   parts in each piece that can be uploaded in parallel. Thus jumbo
   upload results in four parts in parallel.

3. Multiple uploads don't sync with each other, so uploading N objects
   results in N * 2 (or even N * 4 with jumbo) uploads in parallel.

4. A single upload could benefit from using more sockets if no other
   uploads happen in parallel. IOW -- the limit should be shard-wide, not
   single-upload-wide.

Previous patches already put the per-shard parallelism under (some)
control, so this semaphore is in fact used only as a way to collect
background uploading fibers on the final flush, and thus can be replaced with
a gate.

As a side effect, this enforces that writes after flush don't
happen (see #13320) -- once flushed, the upload gate is closed and
subsequent writes hit a gate-closed error.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:38:57 +03:00
Pavel Emelyanov
99b92f0ed8 s3/client: Configure different max-connections on http clients
After previous patch different sched groups got different http clients.
By default each client is started with 100 allowed connections. This can
be too much -- 100 * nr-sched-groups * smp::count can be quite huge
number. Also, different groups should have different parallelizm, e.g.
flush/compaction doesn't care that much about latency and can use fewer
sockets while query class is more welcome to have larger concurrency.

As a starter -- configure http clients with maximum shares/100 sockets.
Thus query class would have 10 and flush/compaction -- 1.
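The shares-to-sockets mapping could look roughly like this (a sketch based on the description above; the concrete shares values and the floor of one socket per client are assumptions):

```python
def max_connections_for(shares: int) -> int:
    # One socket per 100 shares, with at least one socket per client.
    return max(shares // 100, 1)

# Hypothetical shares: query class = 1000, flush/compaction = 100.
print(max_connections_for(1000))  # 10
print(max_connections_for(100))   # 1
```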

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:35:59 +03:00
Pavel Emelyanov
81d1bfce2a s3/client: Maintain several http clients on-board
The intent is to isolate workloads from different sched groups from each
other and not let one sched group consume all sockets from the http
client thus affecting requests made by other sched groups.

The contention happens on the maximum number of sockets an http client may
have (see scylladb/seastar#1652). If requests take time and the client is
asked to make more and more of them, it will eventually stop spawning new
connections and will block internally, waiting for running
requests to complete and return a socket to the pool. If one sched group's
workload (e.g. -- memtable flush) consumes all the available sockets,
then a workload from another group (e.g. -- query) is blocked, thus
spoiling its latency (which is poor on its own, but still)

After this change S3 client maintains a sched_group:http_client map
thus making sure different sched groups don't clash with each other so
that e.g. query requests don't wait for flush/compaction to release a
socket.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:28:55 +03:00
Pavel Emelyanov
a8492a065b s3/client: Remove now unused http reference from sink and file
Now these two classes use client-> calls and no longer need the http&
shortcut.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:28:30 +03:00
Pavel Emelyanov
b9ee0d385b s3/client: Add make_request() method
This helper call will serve several purposes.

First, it makes the necessary preparations to the request before sending
it, in particular -- calling authorize()

Second, there's the need to re-issue requests that failed with a
"connection closed" error (see #13736)

Third, one S3 client is shared between different scheduling groups. In
order to isolate the groups' workloads from each other, different http
clients should be used, and this helper will be in charge of selecting one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-08 18:19:19 +03:00
Asias He
4592bbe182 repair: Use the updated estimated_partitions to create writer
The estimated_partitions is estimated after the repair_meta is created.

Currently, the default estimated_partitions was used to create the
writer, which is not correct.

To fix, use the updated estimated_partitions.

Reported by Petr Gusev

Closes #14179
2023-06-08 11:44:48 +03:00
Botond Dénes
b4c21cfaa0 Merge 'api: task_manager: Return proper response status code' from Aleksandra Martyniuk
Return 400 Bad Request instead of 500 Internal Server Error
when the user requests a task or module that does not exist through
the task manager and task manager test APIs.
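The pattern is the usual one for REST layers: translate lookup failures into a dedicated exception that maps to 400, so only genuine bugs surface as 500. A minimal sketch (the exception, registry, and handler names are hypothetical, not Scylla's actual API code):

```python
class BadParam(Exception):
    """Maps to HTTP 400 Bad Request."""

MODULES = {"compaction": {"task-1": "running"}}  # hypothetical registry

def get_task_status(module, task_id):
    try:
        tasks = MODULES[module]
    except KeyError:
        raise BadParam(f"module {module} not found")
    try:
        return tasks[task_id]
    except KeyError:
        raise BadParam(f"task {task_id} not found")

def handle(module, task_id):
    # The REST layer turns BadParam into 400; anything else is a 500.
    try:
        return 200, get_task_status(module, task_id)
    except BadParam as e:
        return 400, str(e)
    except Exception as e:
        return 500, str(e)
```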

Closes #14166

* github.com:scylladb/scylladb:
  test: add test checking response status when requested module does not exist
  api: fix indentation
  api: throw bad_param_exception when requested task/module does not exist
2023-06-08 11:31:41 +03:00
Botond Dénes
4f4d3f9d9e Merge 'tracing: use compile-time formatting check and avoid creating temporary string' from Kefu Chai
in this series, we use {fmt}'s compile-time formatting check, and avoid deep copy when creating sstring from std::string.

Closes #14169

* github.com:scylladb/scylladb:
  tracing: use std::string instead of sstring for event_record::message
  tracing: use compile-time formatting check
2023-06-08 11:26:43 +03:00
Botond Dénes
51672219f8 Merge 'Prevent errors while running compaction task tests in parallel' from Aleksandra Martyniuk
Compaction task tests should only check the intended group of tasks.
Thus, the tasks are filtered in each test.

In order to be able to run the tests in parallel, checks for the tasks
of the same type are grouped together.

Fixes: #14131.

Closes #14161

* github.com:scylladb/scylladb:
  test: put compaction task checks of the same type together
  test: filter tasks of given compaction type
2023-06-08 11:23:42 +03:00
Kefu Chai
c123f4644a test.py: do not abort if it fails to parse an XML logger file
there is a chance that a Boost::test binary fails to generate a
valid XML file after the test finishes, and
xml.etree.ElementTree.parse() throws when parsing it.
see https://github.com/scylladb/scylla-pkg/issues/3196

before this change, the exception was not handled, and test.py
aborted in this case. this does not help and could be misleading.

after this change, the exception is handled and printed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14180
2023-06-08 11:02:01 +03:00
Avi Kivity
5acb137c2e Merge 'docs/dev/reader-concurrency-semaphore.md: add section about operations' from Botond Dénes
Containing two tables, describing all the possible operations seen in user, system and streaming semaphore diagnostics dumps.

Closes #14171

* github.com:scylladb/scylladb:
  docs/dev/reader-concurrency-semaphore.md: add section about operations
  docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
  reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
2023-06-07 22:53:18 +03:00
Avi Kivity
fc0357de79 Merge 'Coroutinize and change return type for table::get_sstables_by_partition_key()' from Pavel Emelyanov
The helper is huge, in the form of a then-chain. Also it's generic enough not to limit itself to returning sstables' Data file names only.

refs: #14122 (detached from the one that needs more thinking about)

Closes #14174

* github.com:scylladb/scylladb:
  table: Return shared sstable from get_sstables_by_partition_key()
  table: Coroutinize get_sstables_by_partition_key()
2023-06-07 22:29:37 +03:00
Pavel Emelyanov
ce6a1ca13b Update seastar submodule
* seastar afe39231...99d28ff0 (16):
  > file/util: Include seastar.hh
  > http/exception: Use http::reply explicitly
  > http/client: Include lost condition-variable.hh
  > util: file: drop unnecessary include of reactor.hh
  > tests: perf: add a markdown printer
  > http/client: Introduce unexpected_status_error for client requests
  > sharded: avoid #include <seastar/core/reactor.hh> for run_in_background()
  > code: Use std::is_invocable_r_v instead of InvokeReturns
  > http/client: Add ability to change pool size on the fly
  > http/client: Add getters for active/idle connections counts
  > http/client: Count and limit the number of connections
  > http/client: Add connection->client RAII backref
  > build: use the user-specified compiler when building DPDK
  > build: use proper toolchain based on specified compiler
  > build: only pass CMAKE_C_COMPILER when building ingredients
  > build: use specified compiler when building liburing

Two changes are folded into the commit:

1. a missing seastar/core/coroutine.hh include in one .cc file that
   previously got it indirectly, before seastar's reactor.hh was dropped
   from file.hh

2. http client now returns unexpected_status_error instead of
   std::runtime_error, so s3 test is updated respectively

Closes #14168
2023-06-07 20:25:49 +03:00
Pavel Emelyanov
c68c154fb6 code: Reduce tracing/*hh fanout
There are some headers that include tracing/*.hh even though all they
need is a forward-declared trace_state_ptr.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14155
2023-06-07 19:19:22 +03:00
Tomasz Grabiec
b9a30dd5ac Merge 'raft: server: throw fewer commit_status_unknowns from wait_for_entry' from Kamil Braun
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.

Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.

Fixes: #14029
Fixes: #14072

Closes #14167

* github.com:scylladb/scylladb:
  raft: server: throw fewer `commit_status_unknown`s from `wait_for_entry`
  raft: replication test: don't hang if `_seen` overshoots `_apply_entries`
  raft: replication test: print a warning when handling `commit_status_unknown`
2023-06-07 16:34:51 +02:00
Kamil Braun
fd66bb1a61 storage_service: reduce timeout in wait_for_ring_to_settle
In 297c75c6d8 I set the timeout to
5 minutes mainly due to debug mode which is often quite slow on Jenkins.
But 5 minutes is a bit of an overkill. It wouldn't be a problem but
there is a dtest that waits for a node to fail bootstrap; it's wasteful
for the test to sleep for an entire 5 minutes.

Set it to:
- 3 minutes in debug mode,
- 30 seconds in dev/release modes.
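As a tiny sketch of the resulting logic (the mode names are assumed from the commit text):

```python
def ring_settle_timeout_seconds(build_mode: str) -> int:
    # Debug builds are slow on CI, so they keep a longer timeout.
    return 180 if build_mode == "debug" else 30

print(ring_settle_timeout_seconds("debug"))    # 180
print(ring_settle_timeout_seconds("release"))  # 30
```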

Ref: scylladb/scylla-dtest#3203

Closes #14140
2023-06-07 17:31:43 +03:00
Alejo Sanchez
5b8fc86737 test/pylib: minio unique temp dir
Create a unique minio server temp dir for each test run.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #14095
2023-06-07 16:29:58 +03:00
Aleksandra Martyniuk
2e54a322fb test: add test checking response status when requested module does not exist 2023-06-07 14:49:48 +02:00
Kamil Braun
5504da3745 raft: server: throw fewer commit_status_unknowns from wait_for_entry
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.

Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.

Fixes: #14029
2023-06-07 14:17:23 +02:00
Kamil Braun
2fea2fc19c raft: replication test: don't hang if _seen overshoots _apply_entries
As in the previous commit, if a command gets doubly applied due to
`commit_status_unknown`, this could lead to hard-to-debug failures;
one of them was the test hanging because we would never call
`_done.set_value()` in `state_machine::apply` due to `_seen`
overshooting `_apply_entries`.

Fix the problem and print a warning if we apply too many commands.

Fixes: #14072
2023-06-07 14:17:23 +02:00
Kamil Braun
43b48c59fd raft: replication test: print a warning when handling commit_status_unknown
`commit_status_unknown` may lead to double application and then a
hard-to-debug failure. But some tests actually rely on retrying it, so
print a warning and leave a FIXME for maybe a better future solution.

Ref: #14029
2023-06-07 14:17:20 +02:00
Piotr Sarna
9064d3c6ec docs: mention the new synchronous_updates option in mv docs
This commit adds a table (with 1 row) explaining Scylla-specific
materialized view options - which now consists just of
synchronous_updates.

Tested manually by running `make preview` from docs/ directory.

Closes #11150
2023-06-07 15:06:06 +03:00
Pavel Emelyanov
198bca98ec table: Return shared sstable from get_sstables_by_partition_key()
The call is generic enough not to drop the sstable itself on return, so
that callers can do whatever they need with it. The only caller today
is the API, which will convert sstables to filenames on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-07 15:04:48 +03:00
Pavel Emelyanov
f895ac0adb table: Coroutinize get_sstables_by_partition_key()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-07 15:04:48 +03:00
Botond Dénes
0c632b6e3d docs/dev/reader-concurrency-semaphore.md: add section about operations
Containing two tables, describing all the possible operations seen in
user, system and streaming semaphore diagnostics dumps.
2023-06-07 14:22:52 +03:00
Botond Dénes
0067fa0a09 docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
As they allow for more levels, than the current `---` and `===` ones.
2023-06-07 14:22:10 +03:00
Botond Dénes
c4faa05888 reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
"description" is not the respective column contains, so fix the header.
2023-06-07 14:21:48 +03:00
Kefu Chai
428c13076f tracing: use std::string instead of sstring for event_record::message
when creating an event_record, the typical use case is to use a
string created with fmt::format(), which returns a std::string.

before this change, we always converted the std::string to an sstring,
and moved this shiny new sstring into a new event_record. but
creating an sstring always performs a deep copy, which is not
necessary, as we already own the std::string.

so, in this change, instead of performing a deep copy, we just keep
the std::string and pass it all the way to where the event_record is
created. please note, the std::string will be implicitly converted
to data_value, and it will be dropped on the floor after being
serialized in abstract_type::decompose(), so that deep copy is
unavoidable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-07 18:59:37 +08:00
Kefu Chai
4f5fcb02fd tracing: use compile-time formatting check
in this change we pass the fmt string using fmt::format_string<T...>
in order to use {fmt}'s compile-time format checking, so we can identify
a bad format specifier or bad format placeholders at compile time.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-07 18:55:32 +08:00
Aleksandra Martyniuk
fcb15d3b8f api: fix indentation 2023-06-07 11:59:39 +02:00
Aleksandra Martyniuk
be3317623f api: throw bad_param_exception when requested task/module does not exist
In the task manager and task manager test REST APIs, when a task or
module that does not exist is requested, we get Internal Server Error.

In such cases, wrap thrown exceptions in bad_param_exception
to respond with Bad Request code.

Modify test accordingly.
2023-06-07 11:58:28 +02:00
Marcin Maliszkiewicz
8b06684a8c docs: dev: document pytest run convenience script
Closes #13995
2023-06-07 12:37:52 +03:00
Nadav Har'El
5984db047d Merge 'mv: forbid IS NOT NULL on columns outside the primary key' from Jan Ciołek
statement_restrictions: forbid IS NOT NULL on columns outside the primary key

IS NOT NULL is currently allowed only when creating materialized views.
It's used to convey that the view will not include any rows that would make the view's primary key columns NULL.

Generally, materialized views allow placing restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected; it's just silently ignored (see #10365).

Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns.
It would be a big effort, and there are some reasons why we don't support them.

For now let's forbid such restrictions, it's better to fail than be wrong silently.

Throwing a hard error would be a breaking change.
To avoid breaking existing code the reaction to an invalid IS NOT NULL restrictions is controlled by the `strict_is_not_null_in_views` flag.

This flag can have the following values:
* `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown.
* `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored.
* `false` - allow invalid `IS NOT NULL` restrictions, without any warnings or errors. The invalid restrictions are silently ignored.

The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml.

This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view.

New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`.
New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning.
This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters.

Fixes: #10365

Closes #13013

* github.com:scylladb/scylladb:
  boost/restriction_test: test the strict_is_not_null_in_views flag
  docs/cql/mv: columns outside of view's primary key can't be restricted
  cql-pytest: enable test_is_not_null_forbidden_in_filter
  statement_restrictions: forbid IS NOT NULL on columns outside the primary key
  schema_altering_statement: return warnings from prepare_schema_mutations()
  db/config: add strict_is_not_null_in_views config option
  statement_restrictions: add get_not_null_columns()
  test: remove invalid IS NOT NULL restrictions from tests
2023-06-07 12:12:19 +03:00
Kamil Braun
2dbf6f32cd Merge 'Fix crash during restart of a single node with topology over raft' from Gleb
This is a regression introduced in f26179cd27.

Fixes: #14136

* 'gleb/set_group0' of github.com:scylladb/scylla-dev:
  test: restart first node to see if it can boot after restart
  service: move setting of group0 point in storage_service earlier
2023-06-07 10:21:17 +02:00
Aleksandra Martyniuk
3c5094dce8 test: put compaction task checks of the same type together
In test_compaction_task.py tests concerning the same type of compaction
are squashed together so that they are run synchronously and there is
no data race when the tests are run in parallel.
2023-06-07 09:49:42 +02:00
Aleksandra Martyniuk
94a2895874 test: filter tasks of given compaction type
In test_compaction_task.py the tasks are filtered by compaction type
so that each test case checks only the intended tasks.
2023-06-07 09:30:44 +02:00
Jan Ciolek
ec0cac8862 boost/restriction_test: test the strict_is_not_null_in_views flag
Add unit tests for the strict_is_not_null_in_views flag.
This flag controls the behavior in case of an invalid
IS NOT NULL restrictions on a materialized view column.

Materialized views allow only restricting columns
that belong to the view's primary key, all other
restrictions should be rejected.

There was a bug where IS NOT NULL restrictions
weren't rejected, but simply ignored instead.

This flags controls what should happen when the user
runs a query with such an invalid IS NOT NULL restriction.

strict_is_not_null_in_views can have the following values:
* `true` - strict checking, invalid queries are rejected
* `warn` - the query is allowed, but a warning is printed
* `false` - the query is allowed, the invalid restrictions
            are silently ignored.

The tests are based on the ones for strict_allow_filtering,
which reside in the lines preceding the newly added tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 02:30:11 +02:00
Jan Ciolek
83932f9f37 docs/cql/mv: columns outside of view's primary key can't be restricted
We used to allow IS NOT NULL restrictions on columns
that were not part of the materialized view's primary key.
It turns out that such restrictions are silently ignored (see #10365),
so we no longer allow such restrictions.

Update the documentation to reflect that change.

Also there was a mistake in the documentation.
It said that restrictions are allowed on all columns
of the base table's primary key, but they are actually
allowed on all columns of the view table's primary key,
not the base tables.
This change also fixes that mistake.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 02:30:11 +02:00
Jan Ciolek
50943e825b cql-pytest: enable test_is_not_null_forbidden_in_filter
IS NOT NULL is now allowed only on the view's primary key columns,
so the xfail marker can be removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 02:30:11 +02:00
Jan Ciolek
3326f90709 statement_restrictions: forbid IS NOT NULL on columns outside the primary key
IS NOT NULL is currently allowed only
when creating materialized views.
It's used to convey that the view will
not include any rows that would make the
view's primary key columns NULL.

Generally, materialized views allow
placing restrictions on the primary key
columns, but restrictions on the regular
columns are forbidden. The exception was
IS NOT NULL - it was allowed to write
regular_col IS NOT NULL. The problem is
that this restriction isn't respected,
it's just silently ignored.

Supporting IS NOT NULL on regular columns
seems to be as hard as supporting
any other restrictions on regular columns.
It would be a big effort, and there are some
reasons why we don't support them.

For now let's forbid such restrictions,
it's better to fail than be wrong silently.

Throwing a hard error would be a breaking change.
To avoid breaking existing code the reaction to
invalid IS NOT NULL restrictions is controlled
by the `strict_is_not_null_in_views` flag.

The default values for this flag are `warn` in db::config
and `true` in scylla.yaml.

This way the existing clusters will have `warn` by default,
so they'll get a warning if they try to create such an
invalid view.

New clusters with fresh scylla.yaml will have the flag set
to `true`, as scylla.yaml overwrites the default value
in db::config.
New clusters will throw a hard error for invalid views,
but in older existing clusters it will just be a warning.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 02:30:11 +02:00
Jan Ciolek
a8cc5ed491 schema_altering_statement: return warnings from prepare_schema_mutations()
Validation of a CREATE MATERIALIZED VIEW statement takes place inside
the prepare_schema_mutations() method.
I would like to generate warnings during this validation, but there's
currently no way to pass them.

Let's add one more return value - a vector of CQL warnings generated
during the execution of this statement.

A new alias is added to make it clear what the function is returning:
```c++
// A vector of CQL warnings generated during execution of a statement.
using cql_warnings_vec = std::vector<sstring>;
```

Later the warnings will be sent to the user by the function
schema_altering_statment::execute(), which is the only caller
of prepare_schema_mutations().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 02:30:07 +02:00
Jan Ciolek
c67d65987e db/config: add strict_is_not_null_in_views config option
IS NOT NULL shouldn't be allowed on columns
which are outside of the materialized view's primary key.
It's currently allowed to create views with such restrictions,
but they're silently ignored, it's a bug.

In the following commits restricting regular columns
with IS NOT NULL will be forbidden.
This is a breaking change.

Some users might have existing code that creates
views with such restrictions, we don't want to break it.

To deal with this a new feature flag is introduced:
strict_is_not_null_in_views.

By default it's set to `warn`. If a user tries to create
a view with such invalid restrictions they will get a warning
saying that this is invalid, but the query will still go through,
it's just a warning.

The default value in scylla.yaml will be `true`. This way new clusters
will have strict enforcement enabled and they'll throw errors when the
user tries to create such an invalid view.
Old clusters without the flag present in scylla.yaml will
have the flag set to warn, so they won't break on an update.

There's also the option to set the flag to `false`. It's dangerous,
as it silences information about a bug, but someone might want it
to silence the warnings for a moment.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-07 01:48:39 +02:00
Kefu Chai
84683c3549 sstable_loader: update comment to reflect latest changes
we have a dedicated facility for loading sstables since
68dfcf5256, and column_family (i.e. table)
is not responsible for loading new sstables. so update the comment
to reflect this change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14154
2023-06-06 14:31:15 +03:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicable)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairy -- it needs to use the task-queue
list for IO class names and shares, and to detect whether it should, it
checks whether the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Gleb Natapov
e50f96fc4e test: restart first node to see if it can boot after restart
From: Kamil Braun <kbraun@scylladb.com>
2023-06-06 12:14:27 +03:00
Raphael S. Carvalho
156d771101 compaction: Fix sstable cleanup after resharding on refresh
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)

Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.

Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor should it be, because
resharding itself cleaned up its input files).

Dtest: https://github.com/scylladb/scylla-dtest/pull/3206

Fixes #14001.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14147
2023-06-06 12:14:03 +03:00
Gleb Natapov
8598cebb11 service: move setting of group0 point in storage_service earlier
group0 pointer in storage_service should be set when group0 starts.
After f26179cd27 we start group0 earlier,
so we need to move setting of the group0 pointer as well.
2023-06-06 12:12:48 +03:00
Benny Halevy
17795757d3 compaction_manager: compact_sstables: fix typo in log message about cleanup
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14151
2023-06-06 11:17:02 +03:00
Michał Chojnowski
1a521172ec data_dictionary: fix forgetting of UDTs on ALTER KEYSPACE
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.

Fixes #14139

Closes #14143
2023-06-06 11:03:17 +03:00
Kefu Chai
9ba610c811 build: specify link-args using build script
as an alternative to passing the link args via an environment variable,
we can also use a build script to pass "-C link-args=<FLAG>" to the compiler.
see https://doc.rust-lang.org/nightly/cargo/reference/build-scripts.html#cargorustc-link-argflag

to ensure that cargo is called again by ninja after build.rs is
updated, build.rs is added as a dependency of the {wasm} files along with
Cargo.lock.

this change is verified using following command
```
RUSTFLAGS='--print link-args' cargo build \
  --target=wasm32-wasi \
  --example=return_input \
  --locked \
  --manifest-path=Cargo.toml \
  --target-dir=build/cmake/test/resource/wasm/rust
```

the output includes "-zstack-size=131072" in the argument passed to lld:
```
   Compiling examples v0.0.0 (/home/kefu/dev/scylladb/test/resource/wasm/rust)
LC_ALL="C"
PATH="/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin:/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/kefu/.local/bin:/home/kefu/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin"
VSLANG="1033"
"lld"
"-flavor" "wasm" "--rsp-quoting=posix" "--export"
"_scylla_abi" "--export" "_scylla_free" "--export" "_scylla_malloc"
"--export" "return_input" "-z" "stack-size=1048576" "--stack-first"
"--allow-undefined" "--fatal-warnings" "--no-demangle"
...
"-L" "/usr/lib/rustlib/wasm32-wasi/lib"
"-L" "/usr/lib/rustlib/wasm32-wasi/lib/self-contained"
"-o"
"/home/kefu/dev/scylladb/build/cmake/test/resource/wasm/rust/wasm32-wasi/debug/examples/return_input-ef03083560989040.wasm"
"--gc-sections"
"--no-entry"
"-O0"
"-zstack-size=131072"
```

with this change, it'd be easier to build .wat files in CMake, so
we don't need to repeat the settings in both configure.py and
CMakeLists.txt.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14123
2023-06-06 10:54:39 +03:00
Kefu Chai
9e562f8707 build: cmake s/FATAL/FATAL_ERROR/
we should have used "FATAL_ERROR" instead of "FATAL", as the first
parameter passed to the "message()" command. see
https://cmake.org/cmake/help/v3.0/command/message.html

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14124
2023-06-06 10:53:32 +03:00
Botond Dénes
80b944a9b8 Merge 'Table compaction tasks' from Aleksandra Martyniuk
Implementation of task_manager's tasks that cover major, cleanup,
offstrategy, and upgrade sstables compaction of one table.

Closes #13619

* github.com:scylladb/scylladb:
  test: extend compaction tasks test
  compaction: fix indentation
  compaction: create table_upgrade_sstables_compaction_task_impl
  compaction: create table_offstrategy_keyspace_compaction_task_impl
  compaction: create table_cleanup_keyspace_compaction_task_impl
  compaction: create table_major_keyspace_compaction_task_impl
  compaction: add helpers for table tasks scheduling
  compaction: add run_on_table
  compaction: pass std::string to run_on_existing_tables
2023-06-06 10:51:53 +03:00
Botond Dénes
33e4ac9f2a Merge 'Enlighten messaging_service::shutdown()' from Pavel Emelyanov
A recent seastar update added an rpc::server::shutdown() method that only isolates the server from the network, but lets all internal handler callbacks continue running until stop() is called. This patch makes use of it in the messaging service by calling this new shiny shutdown() in its shutdown() and calling the good old stop() in its stop().

Intentionally, this will prevent scylla from freezing on drain in case some RPC handler gets stuck. It may later freeze on stop(), but that's less horrible. Also, chances are that by stop time some other handler's dependencies will have been drained/shut down, so the handler can wake up and stop normally.

fixes: #14031

Closes #14115

* github.com:scylladb/scylladb:
  messaging_service: Shutdown rpc server on shutdown
  messaging_service: Generalize stop_servers()
  messaging_service: Restore indentation after previous patch
  messaging_service: Coroutinize stop()
  messaging_service: Coroutinize stop_servers()
2023-06-06 10:47:06 +03:00
Pavel Emelyanov
dba00acbe9 Merge 's3/test: cleanups to avoid using hardcoded values' from Kefu Chai
this series replaces hard-coded values with variables. we will need to expand this test to cover more test cases when working on tiered storage.

Closes #14137

* github.com:scylladb/scylladb:
  s3/test: use variable for inserted data
  s3/test: replace test_ks and test_cf with variables
  s3/test: introduce format_tuples() for formatting CQL queries
2023-06-06 10:43:53 +03:00
Kefu Chai
32f5026ccb s3/test: use variable for inserted data
instead of repeating it, let's define it and reuse it later.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-06 14:16:23 +08:00
Kefu Chai
236d2ded42 s3/test: replace test_ks and test_cf with variables
instead of hardwiring the dataset in the test, let's define it with
variables and use those instead.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-06 14:16:23 +08:00
Kefu Chai
8ec56599f5 s3/test: introduce format_tuples() for formatting CQL queries
in order to make the data set for testing more visible, format_tuples() is
introduced for formatting a dict into a set of structured values
consumable by CQL.

this function is added to test/cql-pytest/util.py in the hope that it
can be reused by other tests using CQL.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-06 14:16:23 +08:00
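The commit above describes a format_tuples() helper added to test/cql-pytest/util.py; the actual signature and output there may differ, but a minimal Python sketch of the idea (rendering key/value pairs as CQL tuple literals) could look like:

```python
def format_tuples(d=None, **kwargs):
    """Render key/value pairs as CQL tuple literals, e.g.
    {'ks': 'test_ks'} -> "('ks', 'test_ks')". Illustrative only."""
    # Merge a positional dict with keyword arguments.
    pairs = dict(d or {}, **kwargs)

    def render(v):
        # Quote strings for CQL; leave numbers as-is.
        return f"'{v}'" if isinstance(v, str) else str(v)

    return ", ".join(f"({render(k)}, {render(v)})" for k, v in pairs.items())
```

A test could then build its dataset as a dict and format it once, instead of repeating literal tuples in every CQL statement.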
Asias He
32cad54c00 repair: Add aborted_by_user to repair status report
Add the aborted_by_user flag to the repair status report, for example:

INFO  [shard 0] repair - repair[4342512b-5a5f-48fc-a840-934100264cbc]: starting user-requested repair for keyspace ks2a, repair id 1,
      options {{trace -> false}, {columnFamilies -> tb5}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> false}}
INFO  [shard 0] repair - Started to abort repair jobs={4342512b-5a5f-48fc-a840-934100264cbc}, nr_jobs=1
WARN  [shard 0] repair - repair[4342512b-5a5f-48fc-a840-934100264cbc]: Repair job aborted by user, job=4342512b-5a5f-48fc-a840-934100264cbc, keyspace=ks2a, tables={tb5}
WARN  [shard 0] repair - repair[4342512b-5a5f-48fc-a840-934100264cbc]: 3 out of 513 ranges failed, keyspace=ks2a, tables={tb5}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true
WARN  [shard 1] repair - repair[4342512b-5a5f-48fc-a840-934100264cbc]: 3 out of 513 ranges failed, keyspace=ks2a, tables={tb5}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true
WARN  [shard 0] repair - repair[4342512b-5a5f-48fc-a840-934100264cbc]: user-requested repair failed: std::runtime_error ({
      shard 0: std::runtime_error (repair[4342512b-5a5f-48fc-a840-934100264cbc]: 3 out of 513 ranges failed, keyspace=ks2a, tables={tb5}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true),
      shard 1: std::runtime_error (repair[4342512b-5a5f-48fc-a840-934100264cbc]: 3 out of 513 ranges failed, keyspace=ks2a, tables={tb5}, repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true)})

In addition, change the log

from

"Aborted {} repair job(s), aborted={}"

to

"Started to abort repair jobs={}, nr_jobs={}"

to reflect the fact that the user-requested abort API is async.

Closes #14062
2023-06-06 09:08:00 +03:00
Takuya ASADA
45ef09218e test/perf/perf_fast_forward: avoid allocating AIO slots on startup
In main.cc, we have early commands that must run before Seastar is
initialized.
Currently, perf_fast_forward breaks this, since it defines
"app_template app" as a global variable.
To avoid that, we defer running app_template's constructor until
scylla_fast_forward_main().

Fixes #13945

Closes #14026
2023-06-06 08:53:36 +03:00
David Garcia
285066e8eb docs: update theme 1.5
Closes #14119
2023-06-06 08:36:56 +03:00
Avi Kivity
26c8470f65 treewide: use #include <seastar/...> for seastar headers
We treat Seastar as an external library, so fix the few places
that didn't do so to use angle brackets.

Closes #14037
2023-06-06 08:36:09 +03:00
Kamil Braun
f51312e580 auth: don't use infinite timeout in default_role_row_satisfies query
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shut down
the querying node, it will wait for the query to finish (it's a
background task in the auth service), which it never does due to the
infinite timeout.

Use the same timeout configuration as other queries in this module do.

Fixes #13545.

Closes #14134
2023-06-05 17:17:02 +03:00
Nadav Har'El
d2e089777b Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting, in preemptible sections (e.g. before `os.write`), code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and observing reactor stall times. After the patch they
did not increase, while before they kept building up due to the lack of preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation
2023-06-04 23:22:51 +03:00
Nadav Har'El
8a1334cf6f Merge 'alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues' from Marcin Maliszkiewicz
Summary of the patch set:
- eliminates unneeded calls to rjson::find (~1% tps improvement in `perf-simple-query --write`)
- adds some very specific tests in this area (more general cases were already covered)
- fixes a minor validation bug

Fixes https://github.com/scylladb/scylladb/issues/13251

Closes #12675

* github.com:scylladb/scylladb:
  alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
  alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
2023-06-04 23:10:12 +03:00
Alexey Novikov
ffd4fcceec Alternator: return full table description on return of DeleteTable
The DeleteTable operation in Alternator should return a TableDescription
object describing the table which has just been deleted, similar to what
DescribeTable returns.

Fixes scylladb#11472

Closes #11628
2023-06-04 21:00:26 +03:00
Israel Fruchter
1ce739b020 Update tools/cqlsh submodule
* tools/cqlsh 8769c4c2...6e1000f1 (5):
  > build: erase uid/gid information from tar archives
  > Add github action to update the dockerhub description
  > cqlsh: Add extension handler for "scylla_encryption_options"
  > requirements.txt: update python-driver==3.26.0
  > Add support for arm64 docker image

Closes #13878
2023-06-04 19:56:52 +03:00
Kefu Chai
3cd9aa1448 build: cmake: build .wat from source files
we compile .wat files from .rs and .c source files since
6d89d718d9.
these .wat files are used by test/cql-pytest/test_wasm.py. let's update
the CMake build system accordingly so these .wat files can also
be generated using the "wasm" target. since the ctest system is
not used, this change should allow us to perform this test manually.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14126
2023-06-04 14:55:38 +03:00
Aarav Arora
a12d2d5f16 fix: keyspace spell
Closes #14121
2023-06-04 13:48:43 +03:00
Kefu Chai
421331a20b test.py: consolidate multiple runs of the same test
before this change, when consolidating the boost XML logger files,
we practically just concatenated all the tests' logger files into a single
one. sometimes we run the tests multiple times, and these runs share
the same TestSuite and TestCase tags. this has two consequences:

1. there is a chance that a test has both successful and failed
   runs, but jenkins' "Test Results" page cannot identify the failed
   run; it just picks a random run when one clicks for the detail of
   the run, as it takes the TestCase's name as part of its identifier,
   and we have multiple of them if the argument passed to the --repeat
   option is greater than 1 -- this is the case when we promote the
   "next" branch.
2. the testReport page created by Jenkins' xUnit plugin for the "next"
   job is 3 times as large as the one for the regular "scylla-ci" run,
   as all tests are repeated 3 times. but what we really care about is
   the history of a certain test, not a certain run of it.

in this change, we just pick a representative run of a test if it is
repeated multiple times and add a "Message" tag including the
summary of the runs. this should address the problems above:

1. the failed tests always stand out, so we can always pinpoint them with
   Jenkins' "Test Results" page.
2. the tests are deduped by name.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14069
2023-06-04 13:15:46 +03:00
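The consolidation strategy described in the commit above (group repeated runs of a test, keep one representative run, prefer a failed one so failures stand out, and record a summary) can be sketched in Python with xml.etree; this is an illustrative sketch, not the actual test.py code:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def consolidate(testcases):
    """Pick one representative <testcase> per (classname, name) pair."""
    groups = defaultdict(list)
    for tc in testcases:
        groups[(tc.get('classname'), tc.get('name'))].append(tc)
    picked = []
    for runs in groups.values():
        # Prefer a failed run so Jenkins' "Test Results" page pinpoints it.
        failed = [tc for tc in runs if tc.find('failure') is not None]
        rep = failed[0] if failed else runs[0]
        if len(runs) > 1:
            # Summarize all runs so no information is silently dropped.
            msg = ET.SubElement(rep, 'Message')
            msg.text = f'{len(failed)} failed out of {len(runs)} runs'
        picked.append(rep)
    return picked
```

With --repeat greater than 1, each test name then appears exactly once in the consolidated XML, with a "Message" tag summarizing its runs.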
Konstantin Osipov
b39ca97919 consistent_cluster_management: make the default
As per our roll out plan, make consistent_cluster_management (aka Raft
for schema changes) the default going forward. It means all
clusters which upgrade from the previous version and don't have
`consistent_cluster_management` explicitly set in scylla.yaml will begin
upgrading to Raft once all nodes in the cluster have moved to the new
version.

Fixes #13980

Closes #13984
2023-06-02 09:05:09 +02:00
Pavel Emelyanov
7e8b9aecab messaging_service: Shutdown rpc server on shutdown
The RPC server now has a lighter .shutdown() method that does just what
messaging_service::shutdown() needs, so call it. On stop, call the
regular stop() to finalize the stopping process.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 21:24:13 +03:00
Pavel Emelyanov
a55fb7f1d7 messaging_service: Generalize stop_servers()
Rename it to do_with_servers() and make it accept the method to call and
the message to print. This gives the ability to reuse this helper in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 21:24:06 +03:00
Pavel Emelyanov
8b3149c942 messaging_service: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 20:48:01 +03:00
Pavel Emelyanov
13a6b25f24 messaging_service: Coroutinize stop()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 20:47:42 +03:00
Pavel Emelyanov
b643f18df6 messaging_service: Coroutinize stop_servers()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 20:47:28 +03:00
Kamil Braun
34b60ba82b test_tablets: use run_async instead of execute
Don't block the thread, which would prevent concurrent tests from running
during this time. Use the dedicated `run_async` instead.

Also to silence `mypy` which complains that `manager.cql` is `Optional`
(so in theory might be `None`, e.g. after `driver_close`), use
`manager.get_cql()`.

Closes #14109
2023-06-01 18:05:05 +02:00
Kamil Braun
8be69fc3a0 Merge 'Initialize group0 server on boot before allowing incoming requests' from Gleb
The series includes mostly cleanups and one bug fix.

The fix is for the race where messages that need to access the group0 server arrive
before the server is initialized.

* 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev:
  service: raft: fix typo
  service: raft: split off setup_group0_if_exist from setup_group0
  storage_service: do not allow override_decommission flag if consistent cluster management is enabled
  storage_service: fix indentation after the previous patch
  storage_service: co-routinize storage_service::join_cluster() function
  storage_service: do not reload topology from peers table if topology over raft is enabled
  storage_service: optimize debug logging code in case debug log is not enabled
2023-06-01 17:37:58 +02:00
Kamil Braun
297c75c6d8 storage_service: wait for schema agreement during initial boot
In production environments the Scylla boot procedure includes various
sleeps such as 'ring delay' and 'waiting for gossip to settle'. We
disable those sleeps in test.py tests and we'd also like to disable
them, if possible, in dtests.

Unfortunately, disabling the sleeps causes problems with schema: a
bootstrapping node creates its own versions of distributed keyspaces and
tables (such as `system_distributed`) because it doesn't first wait for
gossip to settle, during which it would usually pull existing schemas of
those keyspaces/tables from existing nodes. This may cause schema
disagreement for the whole duration of the bootstrap procedure (the
other nodes don't pull schema from a bootstrapping node; pulls are only
allowed once it becomes NORMAL), which causes the bootstrapping node to
costantly pull schema in attempts to synchronize, which doesn't work
because it's the other nodes which don't have schema mutations, not this
node. Even when the bootstrapping node finishes, the existing nodes
won't automatically pull schema from that node - only once we perform
another schema change a pull will be triggered.

The continuous pulls and the lack of schema synchronization until manual
schema change cause problems in tests. For example we observed the test
timing out in debug mode because bootstrap took too long due to the node
having to perform ~700 schema pulls (it attempts to synchronize schema
on each range repair). There's also potential for permanent schema
divergence, although I haven't seen this yet - in my experiments, once
the existing nodes pull from the new node, schema would always converge.

In any case, the safe and robust solution is to ensure that the
bootstrapping node pulls schema from existing nodes early in the boot
procedure. Then it won't try to create its own versions of the
distributed keyspaces/tables because it'll see they are already present
in the cluster.

In fact there already is `storage_service::wait_for_ring_to_settle`
which is supposed to wait until schema is in agreement before
proceeding.

However, this schema agreement wait relied on an earlier wait at the
beginning of the function - for a node to show up in gossiper
(otherwise, if we're the only node in gossiper, the schema agreement
wait trivially finishes immediately).

Unfortunately, this wait would time out after `ring_delay` and proceed,
even if no other node was observed, instead of throwing an error...

To make it safe, modify the logic so if we timeout, we refuse to
bootstrap. To make it work in tests which set `ring_delay` to 0, make it
independent of `ring_delay` - just set the timeout to 5 minutes.

Fixes #14065
Fixes #14073

Closes #14105
2023-06-01 13:24:43 +03:00
Petr Gusev
0415ac3d5f test_secondary_index_collections: change insert/create index order
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.

The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.

The patch just moves the index creation statements
from the second part to the first.

Fixes: #14076

Closes #14090
2023-05-31 23:30:57 +03:00
Nadav Har'El
0e602159b9 storage_service: avoid excessive delay in wait_for_ring_to_settle()
The function storage_service::wait_for_ring_to_settle() is called when
bootstrapping a new node in an existing cluster, and it's supposed to
wait until the caller has the right schema - to allow the bootstrap
to start (the bootstrap needs to copy all existing tables from other
nodes).

The code of this function mostly checks in-memory structures in the
gossiper and migration manager, and if they aren't ready, sleeps and
tries again (until a timeout of "ring_delay_ms"). Today we sleep a
whole second between each try, but that's excessive - the checks are
very cheap, and we can do them much more often, so we can stop the
loop much closer to when the schema becomes available.

This patch changes the sleep from 1 second to 10 milliseconds.

The benefit of this patch is not huge - on average I measured about
0.25 seconds saving on adding a node to a cluster. But I don't see
any downside either.

Noticed while looking into Refs #14073

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14101
2023-05-31 17:49:38 +02:00
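The trade-off the commit above describes (a cheap check polled frequently returns much closer to the moment the condition becomes true than a 1-second sleep does) can be sketched generically; this Python sketch is illustrative only and is not the actual C++ loop in wait_for_ring_to_settle():

```python
import time

def wait_until(cond, timeout, poll=0.01):
    """Poll a cheap condition every `poll` seconds until it holds or
    `timeout` seconds elapse. A short poll interval (e.g. 10 ms instead
    of 1 s) bounds the oversleep after the condition becomes true."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if cond():
            return True
        time.sleep(poll)
    # One final check at the deadline.
    return cond()
```

The key point is that shrinking `poll` only makes sense when the check itself is cheap, as it is here (in-memory gossiper and migration-manager state).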
Benny Halevy
bda3705974 test/lib: test_reader_conversions: always close reader
read_mutation_from_flat_mutation_reader might throw,
so we need to close the reader returned from
ms.make_fragment_v1_stream on the error path
as well, to avoid the internal-error abort when
the reader is destroyed while open.

Fixes #14098

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14099
2023-05-31 17:49:38 +02:00
Kamil Braun
8b718db42f Update seastar submodule
* seastar aff87d5b...afe39231 (12):
  > rpc: Fix formatting after previous patches
  > rpc: Introduce server::shutdown()
  > rpc: Wait for server socket to stop before killing conns
  > rpc: Document server::stop() method
  > util: remove unused #include
  > rpc: rpc_types: make `connection_id` a class
  > tests: rpc_test: simple test for connection aborting
  > rpc: introduce `server::abort_connection(connection_id)`
  > treewide: add C++ modules support
  > rpc: remove `connection::_server` field
  > rpc: add `server&` and `connection_id` to `client_info`
  > rpc: rpc_types: move `connection_id` definition before `client_info`
2023-05-31 17:49:38 +02:00
Aleksandra Martyniuk
b325bf11bc test: extend compaction tasks test
The compaction task test checks whether the child-parent relationship
in the task tree is valid.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
fecdd75cd6 compaction: fix indentation 2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
53c24c0f7d compaction: create table_upgrade_sstables_compaction_task_impl
Implementation of task_manager's task that covers upgrade sstables
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
143919cfa7 compaction: create table_offstrategy_keyspace_compaction_task_impl
Implementation of task_manager's task that covers offstrategy keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
55ef1c24e1 compaction: create table_cleanup_keyspace_compaction_task_impl
Implementation of task_manager's task that covers cleanup keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
5c7832ab59 compaction: create table_major_keyspace_compaction_task_impl
Implementation of task_manager's task that covers major keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
d0c4028d64 compaction: add helpers for table tasks scheduling
In shard compaction tasks, per-table tasks will be created all at once
and then wait for their turn to run.

A function that allows waking up tasks one after another and a function
that makes a task wait for its turn are added.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
6dacc45c70 compaction: add run_on_table
Extract code which runs a function on a particular table from
run_on_existing_tables to run_on_table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
5c65ac00ef compaction: pass std::string to run_on_existing_tables
The keyspace argument passed to run_on_existing_tables has its type
changed from std::string_view to std::string.
2023-05-31 14:59:24 +02:00
Gleb Natapov
dcfd224e8b service: raft: fix typo 2023-05-31 11:01:33 +03:00
Gleb Natapov
f26179cd27 service: raft: split off setup_group0_if_exist from setup_group0
Currently setup_group0 is responsible for starting an existing group0 on restart,
or creating a new one and joining the cluster with it during bootstrap. We
want to create the server for an existing group0 earlier, before we start
to accept messages, because some messages may assume that the server
already exists. For that we split the creation of the existing group0 server into
a separate function and call it on restart before the messaging service
starts accepting messages.

Fixes: #13887
2023-05-31 11:00:41 +03:00
Gleb Natapov
acc035b504 storage_service: do not allow override_decommission flag if consistent cluster management is enabled
If consistent cluster management is enabled it is not possible to
restart a decommissioned node, since it will not be part of group0.
2023-05-31 10:40:42 +03:00
Raphael S. Carvalho
23443e0574 compaction: Fix incremental compaction for sstable cleanup
After c7826aa910, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038
2023-05-31 06:46:12 +03:00
Pavel Emelyanov
66ccc14fcb scylla-gdb: Add commitlog command
The command prints the segment_manager address, because it's the manager
that is of interest, not the db::commitlog itself. It also prints out all
found segments, just for convenience -- segments are in a vector of
shared pointers and it's handy to have the object addresses instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14088
2023-05-30 22:55:18 +03:00
Avi Kivity
bb361f41d8 Merge 'RPC: add [[ref]] attribute to heavy parameters' from Gusev Petr
By default `idl-compiler.py` emits code to pass parameters by value. There was an attribute `[[ref]]` which makes it use `const&`, but it was not used systematically and in many cases parameters were redundantly copied. In this PR, all `verb` directives have been reviewed and the `[[ref]]` attribute has been added where it makes sense.

The parameters [are serialised synchronously](https://github.com/scylladb/seastar/blob/master/include/seastar/rpc/rpc_impl.hh#L471) so there should be no lifetime issues. This was not the case before, but the behaviour changed in [this commit](3942546d41). Now it's not a problem to pass an object by reference when using `send_` methods.

Fixes: #12504

Closes #14003

* github.com:scylladb/scylladb:
  tracing::trace_info: pass by ref
  storage_proxy: pass inet_address_vector_replica_set by ref
  raft: add [[ref]] attribute
  repair: add [[ref]] attribute
  forward_request: add [[ref]] attribute
  storage_proxy: paxos:: add [[ref]] attribute
  storage_proxy: read_XXX:: make read_command [[ref]]
  storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
  storage_proxy: mutation:: make frozen_mutation [[ref]]
2023-05-30 16:37:24 +03:00
Kefu Chai
037113f752 reloc: raise if rmtree fails
occasionally, we are observing build failures like:
```
17:20:54  FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54  dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54  Traceback (most recent call last):
17:20:54    File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54      os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54    File "<frozen os>", line 225, in makedirs
17:20:54  FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```

to understand the root cause better, instead of swallowing the error,
let's re-raise the exception if it is not caused by a non-existing directory.

a similar change was applied to scripts/create-relocatable-package.py
in a0b8aa9b13, which was correct per se,
but the original intention was to understand the root cause of the
failure when packaging scylla-debuginfo-*.tar.gz, which is created
by dist/debuginfo/scripts/create-relocatable-package.py.

so, in this change, that fix is ported to this script.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14082
2023-05-30 15:39:24 +03:00
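The pattern the commit above describes (remove a directory tree, but re-raise any error that is not simply "the directory does not exist") can be sketched in Python; rmtree_if_exists is a hypothetical name, not the actual function in the script:

```python
import shutil

def rmtree_if_exists(path):
    """Remove a directory tree if present; let other errors surface."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # A missing directory is fine; anything else (permissions,
        # a lingering file, etc.) should propagate so the root cause
        # shows up in the build log instead of being swallowed.
        pass
```

Catching only FileNotFoundError, rather than a bare `except`, is what lets unexpected failures like the FileExistsError in the quoted traceback be investigated rather than hidden.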
Botond Dénes
9a1e5784b0 Merge 'Use table_info in compaction' from Aleksandra Martyniuk
In compaction logs, a table is identified by {keyspace}.{table_id}.

Instead, the table name should be used in run_on_existing_tables
logs. To do so, task manager's compaction tasks use table_info
instead of table_id.

The keyspace argument is copied into run_on_existing_tables
to ensure it stays alive.

Closes #13816

* github.com:scylladb/scylladb:
  compaction: use table_info in compaction tasks
  api: move table_info to schema/schema_fwd.hh
2023-05-30 15:10:47 +03:00
Kefu Chai
82cac8e7cf treewide: s/std::source_location/seastar::compat::source_location/
CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.

notably, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of the compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. clang-16 implemented CWG 2631 in
https://reviews.llvm.org/D136554, but unfortunately this change
was not backported to clang-15.

before switching over to clang-16, to use std::source_location::current()
as the default parameter and get the behavior defined by CWG 2631,
we have to use the compatibility layer provided by Seastar. otherwise
we always end up with the source_location of the callee side, which
is not interesting under most circumstances.

so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). even though
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate from the "namespace compat" defined elsewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.

see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14086
2023-05-30 15:10:12 +03:00
Petr Gusev
3a88c7769f tracing::trace_info: pass by ref
sizeof(std::optional<tracing::trace_info>) == 64 bytes,
so passing it by reference should be more efficient.
2023-05-30 14:32:10 +04:00
Petr Gusev
48600049fc storage_proxy: pass inet_address_vector_replica_set by ref
sizeof(inet_address_vector_replica_set) == 96 bytes and
it has complex move constructor.
2023-05-30 14:04:53 +04:00
Pavel Emelyanov
577cd96da8 scripts: Fix options iteration in open-coredump.sh
When run like 'open-coredump.sh --help', the options parsing loop doesn't
run, because $# == 1 and [ $# -gt 1 ] evaluates to false.

The simplest fix is to parse -h|--help on its own, as the options parsing
loop assumes that a core-file argument is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14075
2023-05-30 12:25:01 +03:00
Petr Gusev
896e3bb425 raft: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
4ff1adaef9 repair: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
282d66d15d forward_request: add [[ref]] attribute 2023-05-30 13:14:19 +04:00
Petr Gusev
db4030f792 storage_proxy: paxos:: add [[ref]] attribute
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.

The call sites are fine; they already passed
by reference.
2023-05-30 13:14:19 +04:00
Petr Gusev
f2cba20945 storage_proxy: read_XXX:: make read_command [[ref]]
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it is quite
big by itself (368 bytes).
2023-05-30 13:14:19 +04:00
Petr Gusev
ffb4e39e40 storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
We had a redundant copy in hint_mutation::apply_remotely.
This frozen_mutation is dynamically allocated and
can be arbitrarily large.
2023-05-30 13:14:19 +04:00
Petr Gusev
5adbb6cde2 storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in the receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.

Fixes: #12504
2023-05-30 13:14:19 +04:00
Tzach Livyatan
e655060429 Remove Ubuntu 18.04 support from 5.2
Ubuntu [18.04 will soon be out of standard support](https://ubuntu.com/blog/18-04-end-of-standard-support) and can be removed from the 5.2 supported list.
https://github.com/scylladb/scylla-pkg/issues/3346

Closes #13529
2023-05-30 11:12:17 +03:00
Aleksandra Martyniuk
f48b57e7b9 compaction: use table_info in compaction tasks
Task manager compaction tasks need table names for logs.
Thus, compaction tasks store table infos instead of table ids.

The get_table_ids function is deleted, as it isn't used anywhere.
2023-05-30 09:58:55 +02:00
Aleksandra Martyniuk
4206139e5a api: move table_info to schema/schema_fwd.hh
table_info is moved from api/storage_service.hh to schema/schema_fwd.hh
so that it can be used in task manager's tasks.
2023-05-30 09:57:21 +02:00
Kefu Chai
024b96a211 create-relocatable-package.py: package build/node_export only for stripped version
because we build the stripped and non-stripped packages in parallel
using ninja, there are chances that the non-stripped build job could
be adding the build/node_exporter directory to the tarball while the job
building the stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. objcopy creates
temporary files when processing the executables, and those temporary
files can be spotted by the non-stripped build job. there are two
consequences:

1. the non-stripped build job includes the temporary files in its tarball,
   even though they are not supposed to be distributed
2. the non-stripped build job fails to include the temporary file(s), as
   they are removed after objcopy finishes its job. but the job did spot
   them when preparing the tarball, so when the tarfile python module
   tries to include the previously found temporary file(s), it throws.

neither of these consequences is expected. fortunately, this only
happens when packaging the non-stripped package. when packaging the
stripped package, the build/node_exporter directory is not in flux
anymore, as ninja ensures the dependencies between the jobs.

so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have been added
separately as regular files. when packaging the stripped version,
we still use the existing step, so we don't have to list all the
files created by strip.sh:

node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}

we could do so in this script, but the repetition is unnecessary and
error-prone. so let's keep including the whole directory recursively,
so all the debug symbols are included.

Fixes #14079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-30 14:00:02 +08:00
Kefu Chai
665a747fab create-relocatable-package.py: use positive condition when possible
to reduce the programmer's cognitive load.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-30 14:00:02 +08:00
Avi Kivity
ffce6d94fc Merge 'service: storage_proxy: make hint write handlers cancellable' from Kamil Braun
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose:
to make it possible to cancel a handler for a view update write,
which means we stop waiting for a response to the write, timing out
the handler immediately. This was done to solve an issue with node
shutdown hanging because it was waiting for a view update to finish;
view updates were configured with a 5-minute timeout. See #3966, #4028.

Now we have a similar problem with hint updates causing shutdown
to hang in tests (#8079).

`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to time out each
handler when we shut down or when the gossiper notifies `storage_proxy`
that a node is down.

To make it possible to reuse this algorithm for other handlers, move
the functionality into `abstract_write_response_handler`. We inherit
from `bi::list_base_hook`, so this introduces a small memory overhead to
each write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large; the
overhead is small compared to their size.

Use this new functionality to also cancel hint write handlers when we
shutdown. This fixes #8079.

Closes #14047

* github.com:scylladb/scylladb:
  test: reproducer for hints manager shutdown hang
  test: pylib: ScyllaCluster: generalize config type for `server_add`
  test: pylib: scylla_cluster: add explicit timeout for graceful server stop
  service: storage_proxy: make hint write handlers cancellable
  service: storage_proxy: rename `view_update_handlers_list`
  service: storage_proxy: make it possible to cancel all write handler types
2023-05-30 01:36:50 +03:00
Avi Kivity
27f7cc4032 Revert "Merge 'cql: update permissions when creating/altering a function/keyspace' from Wojciech Mitros"
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probability, which blocks other work.

Fixes #13919
Reopens #13747
2023-05-29 23:03:25 +03:00
Botond Dénes
a35758607a Update tools/java submodule
* tools/java eb3c43f8...0cbfeb03 (1):
  > nodetool: add `--primary-replica-only` option to `refresh`
2023-05-29 23:03:25 +03:00
Botond Dénes
fc24685b4d Update tools/jmx submodule
* tools/jmx 1fd23b60...d1077582 (1):
  > Support `--primary-replica-only` option from `nodetool refresh`
2023-05-29 23:03:25 +03:00
Pavel Emelyanov
b0525e20d5 main: Ignore sleep_aborted exception in main
When scylla starts, it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time, the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in the logs and exiting with a non-zero code. However, that's not an
error; the start was simply interrupted earlier than expected by
the stop_signal mechanism.

fixes: #12898

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14034
2023-05-29 23:03:25 +03:00
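The essence of the fix above — treating this one exception as a clean exit rather than an error — can be sketched generically (a toy model; `SleepAborted` stands in for seastar's sleep_aborted, and `run_startup` is a hypothetical stand-in for main):

```python
class SleepAborted(Exception):
    """Stand-in for seastar::sleep_aborted."""


def run_startup(sigint_during_sleep: bool) -> int:
    """Toy model of main(): returns the process exit code."""
    try:
        if sigint_during_sleep:
            # SIGINT arrived while startup was sleeping; the sleep is
            # aborted and the exception unrolls up to main().
            raise SleepAborted("startup sleep interrupted")
        return 0
    except SleepAborted:
        # Not an error: startup was merely interrupted early, so exit
        # cleanly instead of logging an error and returning non-zero.
        return 0


print(run_startup(True))   # exits cleanly even when interrupted
```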
Avi Kivity
2303f08eea utils: logalloc: correct asan_interface.h location
It's a system header, so it deserves angle brackets.

Closes #14036
2023-05-29 23:03:25 +03:00
Benny Halevy
c685ef9e71 partitioned_sstable_set: insert: return early if sst is already in the set
Currently, partitioned_sstable_set::insert may inadvertently erase
an sstable from the set if an exception is thrown while
(re-)inserting it.

To prevent that, simply return early after detecting that the
insertion didn't take place, based on the unordered_set::insert
result.

This issue is theoretical, as there are no known cases
of re-inserting sstables into the partitioned sstable set.

Fixes #14060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14061
2023-05-29 23:03:25 +03:00
Aleksandra Martyniuk
24864e39dd compaction: delete unnecessary sequence number incrementations
Task manager's tasks that have a parent task inherit the sequence number
from their parents. Thus, they do not need to have a new sequence number
generated, as it will be overwritten anyway.

Closes #14045
2023-05-29 23:03:25 +03:00
Kefu Chai
c00f4af5d4 build: cmake: link auth against libcrypt
libxcrypt is used by the auth subsystem; for instance, `crypt_r()` provided
by this library is used by passwords.cc. so let's link against it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14030
2023-05-29 23:03:24 +03:00
Benny Halevy
774a10017c backlog_controller: destroy _update_timer before _current_backlog
The _update_timer callback calls adjust(), which
depends on _current_backlog, and currently _current_backlog is
destroyed before _update_timer.

This is benign since there are no preemption points in
the destructor, but it's more correct and elegant
to destroy the timer first, before other members it depends on.

Fixes #14056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14057
2023-05-29 23:03:24 +03:00
Kefu Chai
a0b8aa9b13 create-relocatable-package.py: raise if rmtree fails
occasionally, we are observing build failures like:
```
17:20:54  FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54  dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54  Traceback (most recent call last):
17:20:54    File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54      os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54    File "<frozen os>", line 225, in makedirs
17:20:54  FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```

to understand the root cause better, instead of swallowing the error,
let's raise the exception if it is not caused by a non-existent directory.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13978
2023-05-29 23:03:24 +03:00
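The error-handling change can be sketched like this (a standalone illustration; `cleanup_build_dir` is a hypothetical helper name, not the script's actual code):

```python
import errno
import shutil


def cleanup_build_dir(path: str) -> None:
    """Remove a stale build directory, but surface any error other
    than the directory simply not existing."""
    try:
        shutil.rmtree(path)
    except OSError as e:
        # Swallow only "no such file or directory"; re-raise anything
        # else (permissions, path is a regular file, ...) so the real
        # root cause shows up in the build log instead of a later
        # FileExistsError from os.makedirs().
        if e.errno != errno.ENOENT:
            raise


# Removing a directory that does not exist is a silent no-op:
cleanup_build_dir('/nonexistent/scylla-debuginfo-package')
```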
Avi Kivity
2cef3350af Merge 'Initialize/destroy ks/cf directories with explicit class methods' from Pavel Emelyanov
This set encapsulates ks/cf directory creation and deletion into keyspace and table class methods. This is needed to facilitate making the storage initialization storage-type aware in the future. Also, this makes the replica/ code less involved in formatting sstables' directory paths by hand.

refs: #13020
refs: #12707

Closes #14048

* github.com:scylladb/scylladb:
  keyspace: Introduce init_storage()
  keyspace: Remove column_family_directory()
  table: Introduce destroy_storage()
  table: Simplify init_storage()
  table: Coroutinize init_storage()
  table: Relocate ks.make_directory_for_column_family()
  distributed_loader: Use cf.dir() instead of ks.column_family_directory()
  test: Don't create directory for system tables in cql_test_env
2023-05-29 23:03:24 +03:00
Kefu Chai
55ee0e2724 build: preserve $libs when linking a single testing executable
if we just want to build a single test and scylla executables, we
might want to use `configure.py` like:

./configure.py --mode debug --compiler clang++ --with scylla --with test/boost/database_test

which generates `build.ninja` for us, with following rules:

build $builddir/debug/test/boost/database_test_g: link.debug ... | $builddir/debug/seastar/libseastar.so
$builddir/debug/seastar/libseastar_testing.so
   libs = $seastar_libs_debug $libs -lthrift -lboost_system $seastar_testing_libs_debug
   libs = $seastar_libs_debug

but the last line prevents database_test_g from linking against
third-party libraries like libabsl, which could have been
pulled in by $libs: the second assignment expression just
makes the value of `libs` identical to that of `seastar_libs_debug`,
and that variable does not include the libraries which are only
used by scylla. so we could run into a link failure with the
`build.ninja` generated with this command line, like:
```
FAILED: build/debug/test/boost/database_test_g
...
ld.lld: error: undefined symbol: seastar::testing::entry_point(int, char**)
>>> referenced by scylla_test_case.hh:22 (./test/lib/scylla_test_case.hh:22)
>>>               build/debug/test/boost/database_test.o:(main)
...
ld.lld: error: undefined symbol: boost::unit_test::unit_test_log_t::set_checkpoint(boost::unit_test::basic_cstring<char const>, unsigned long, boost::unit_tes
t::basic_cstring<char const>)
>>> referenced by database_test.cc:298 (test/boost/database_test.cc:298)
>>>               build/debug/test/boost/database_test.o:(require_exist(seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool))
...
```

with this change, the extra assignment expression is dropped. this
should not cause any regression, as f'$seastar_libs_{mode}' has
been included as a part of `local_libs` before the grand if-then-else
block in the for loop before this `f.write()` statement.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14041
2023-05-29 23:03:24 +03:00
Kefu Chai
74dd6dc185 Revert "test: string_format_test: don't compare std::string with sstring"
This reverts commit 3c54d5ec5e.

The reverted change fixed the FTBFS of the test in question with Clang 16,
which rightly stopped converting the LHS of `"hello" == sstring{"hello"}` to
a type acceptable by the member operator, even though we have a
constructor for this conversion, like

class sstring {
public:
  sstring(const char*);
  bool operator==(const sstring&) const;
  bool operator!=(const sstring&) const;
};

because we have an operator!=, as per the draft of C++ standard
https://eel.is/c++draft/over.match.oper#4 :

> A non-template function or function template F named operator==
> is a rewrite target with first operand o unless a search for the
> name operator!= in the scope S from the instantiation context of
> the operator expression finds a function or function template
> that would correspond ([basic.scope.scope]) to F if its name were
> operator==, where S is the scope of the class type of o if F is a
> class member, and the namespace scope of which F is a member
> otherwise.

in 397f4b51c3, the seastar submodule was
updated; in it, we now have a dedicated overload for the `const char*`
case, so the compiler is now able to compile an expression like
`"hello" == sstring{"hello"}` in C++20.

so, in this change, the workaround is reverted.

Closes #14040
2023-05-29 23:03:24 +03:00
Benny Halevy
26705ba6af partitioned_sstable_set: erase empty runs
When erasing an sstable, first check if its run_id
exists in _all_runs, otherwise do nothing in
that respect; then, if the run becomes empty
when erasing the last sstable (and it could have been
a single-sstable run from the get-go), erase the run
from `_all_runs`.

Fixes #14052

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14054
2023-05-29 23:03:24 +03:00
Alejo Sanchez
2050a1a125 test.py: warn and skip for missing unit/boost tests
If the binary of a matching unit or boost test is not executable,
warn to the console and skip it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13982
2023-05-29 23:03:24 +03:00
Gleb Natapov
775e1dd4fd storage_service: fix indentation after the previous patch 2023-05-29 13:34:00 +03:00
Gleb Natapov
28bc5e365b storage_service: co-routinize storage_service::join_cluster() function 2023-05-29 13:33:32 +03:00
Gleb Natapov
fd1b1279c4 storage_service: do not reload topology from peers table if topology over raft is enabled
If topology over raft is enabled, the source of truth for the
topology is in the group0 state machine, and no other code should
create topology metadata.
2023-05-29 13:32:14 +03:00
Gleb Natapov
3a5cb9d35c storage_service: optimize debug logging code in case debug log is not enabled 2023-05-29 13:29:22 +03:00
Kamil Braun
beabb61566 test: reproducer for hints manager shutdown hang 2023-05-29 11:03:39 +02:00
Kamil Braun
7e56388721 test: pylib: ScyllaCluster: generalize config type for server_add
Generalize from `dict[str, str]` to `dict[str, Any]`.
2023-05-29 11:03:36 +02:00
Kamil Braun
ce13395ce4 test: pylib: scylla_cluster: add explicit timeout for graceful server stop
If server shutdown hangs, the `manager.server_stop_gracefully` call
would eventually (after 5 minutes) time out with a cryptic
`TimeoutError`; it's a generic timeout for requests made by the
tests to `ScyllaClusterManager`. It was non-obvious how to find what
actually caused the timeout - you'd have to browse multiple logs.

Introduce an explicit timeout in `ScyllaServer.stop_gracefully`. Set it
to 1 minute. Whether this is a good value may be arguable, but shutdown
taking longer than that probably indicates problems. The important thing
is that this timeout is shorter than the generic request timeout.

If this times out we get a nice error in the test:
```
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/server/1/stop_gracefully, params: None, json: None, body:
E               Stopping server ScyllaServer(1, 127.162.40.1, 826d5884-4696-4a22-80a7-cc872aa43102) gracefully took longer than 60s
```
2023-05-29 11:03:30 +02:00
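The pattern is a plain `asyncio.wait_for` wrapper; here is a minimal, self-contained sketch (the coroutine body, the 60 s value, and the error text are illustrative, not the actual pylib code):

```python
import asyncio

STOP_TIMEOUT = 60  # seconds; deliberately shorter than the generic request timeout


async def stop_gracefully() -> None:
    """Stand-in for the real graceful-stop coroutine."""
    await asyncio.sleep(0)  # pretend the server shut down promptly


async def stop_server(server_name: str) -> str:
    try:
        # Bound the graceful stop explicitly instead of relying on the
        # 5-minute generic request timeout further up the stack.
        await asyncio.wait_for(stop_gracefully(), timeout=STOP_TIMEOUT)
        return "stopped"
    except asyncio.TimeoutError:
        # The error is raised close to the cause, with a clear message.
        raise RuntimeError(f"Stopping server {server_name} gracefully "
                           f"took longer than {STOP_TIMEOUT}s")


print(asyncio.run(stop_server("ScyllaServer(1)")))
```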
Kamil Braun
0ef35ceed4 service: storage_proxy: make hint write handlers cancellable
Whether a write handler should be cancellable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint`, which is called by the hints manager.

This will cause hint write handlers to immediately timeout when we
shutdown or when a destination node is marked as dead.

Fixes #8079
2023-05-29 11:03:18 +02:00
Kamil Braun
eddb7406b4 service: storage_proxy: rename view_update_handlers_list
The list will be used for non-view-update write handlers as well, so
generalize the name. Also generalize some variable names used in the
implementation.

This commit only renames things + some comments were added,
there are no logical changes.
2023-05-29 10:59:50 +02:00
Kamil Braun
c7ef9a12ee service: storage_proxy: make it possible to cancel all write handler types
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose: to
make it possible to cancel a handler for a view update write, which
means we stop waiting for a response to the write, timing out the
handler immediately. This was done to solve an issue with node shutdown
hanging because it was waiting for a view update to finish; view updates
were configured with a 5-minute timeout. See #3966, #4028.

Now we have a similar problem with hint updates causing shutdown to
hang in tests (#8079).

`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to time out each
handler when we shut down or when the gossiper notifies `storage_proxy` that
a node is down.

To make it possible to reuse this algorithm for other handlers, move the
functionality into `abstract_write_response_handler`. We inherit from
`bi::list_base_hook`, so this introduces a small memory overhead to each
write handler (2 pointers) which was only present for view update
handlers before. But those handlers are already quite large; the
overhead is small compared to their size.

Not all handlers are added to the cancelling list; this is controlled by
the `cancellable` parameter passed to the constructor. For now, we're
only cancelling view handlers, as before. In following commits we'll also
cancel hint handlers.
2023-05-29 10:42:57 +02:00
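The cancellation-list mechanism described above can be sketched in Python (names are illustrative; the real code uses a `boost::intrusive` list hook on `abstract_write_response_handler` rather than a Python list):

```python
class WriteHandler:
    """Sketch of a write response handler that may be cancellable."""

    def __init__(self, registry: list, cancellable: bool):
        self.done = False
        self._registry = registry
        # Only cancellable handlers pay the cost of being linked into
        # the shared list (in C++ this is the intrusive-list hook).
        if cancellable:
            registry.append(self)

    def timeout_cb(self) -> None:
        """Complete the handler immediately, as if it timed out."""
        self.done = True
        if self in self._registry:
            self._registry.remove(self)


def cancel_all(registry: list) -> None:
    # Called on shutdown or when gossip marks a node down: walk the
    # list and time out every registered handler right away.
    for handler in list(registry):
        handler.timeout_cb()


handlers: list = []
view_update = WriteHandler(handlers, cancellable=True)
hint_write = WriteHandler(handlers, cancellable=True)    # new in this series
plain_write = WriteHandler(handlers, cancellable=False)  # never registered
cancel_all(handlers)
```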
Kefu Chai
af65d5a1e8 test: sstable: use BOOST_REQUIRE_*() when appropriate
instead of using BOOST_REQUIRE(), use, for instance,
BOOST_REQUIRE_NE() and BOOST_REQUIRE_EQUAL() for a better
error message when a test fails, as Boost.Test will
print out the LHS and RHS of the comparison expression
if it fails.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14050
2023-05-27 11:10:47 +03:00
Pavel Emelyanov
5861d15912 Merge 'Small gossiper and migration_manager cleanups' from Gleb
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/

Reviewed-by: Konstantin Osipov <kostja@scylladb.com>

* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
  storage_service: avoid unneeded copies in on_change
  storage_service: remove check that is always true
  storage_service: rename handle_state_removing to handle_state_removed
  storage_service: avoid string copy
  storage_service: delete code that handled REMOVING_TOKENS state
  gossiper: remove code related to advertising REMOVING_TOKEN state
  migration_manager: add wait_for_schema_agreement() function
2023-05-27 10:49:54 +03:00
Avi Kivity
e4d6ed7a70 Merge 'Coroutinize utils::verify_owner_and_mode()' from Pavel Emelyanov
Closes #14049

* github.com:scylladb/scylladb:
  utils: Restore indentation after previous patch
  utils: Coroutinize verify_owner_and_mode()
2023-05-26 23:20:30 +03:00
Pavel Emelyanov
2eb88945ea utils: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:53:14 +03:00
Pavel Emelyanov
4ebb812df0 utils: Coroutinize verify_owner_and_mode()
There's a helper verification_error() that prints a warning and returns
an exceptional future. It is converted into a void function that throws instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:52:15 +03:00
Pavel Emelyanov
29d80d1fe9 keyspace: Introduce init_storage()
Similarly to class table, the keyspace class also needs to create
a directory for itself for some reason. It looks excessive, as table
creation would call recursive_touch_directory() and would create the ks
directory too, but this call is there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:46 +03:00
Pavel Emelyanov
93d8240bfb keyspace: Remove column_family_directory()
It's no longer used outside of make_column_family_config(). So as not to
encourage people to use it, drop it and open-code it into that single
caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:43 +03:00
Pavel Emelyanov
0e50fc609c table: Introduce destroy_storage()
When a table is DROP-ed, the directory with all its sstables is removed
(unless it contains snapshots). Wrap this into a table.destroy_storage()
method; later it will need to become sstable::storage-specific.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:43 +03:00
Pavel Emelyanov
7ae49f513e table: Simplify init_storage()
There's no need to copy the datadirs vector to call parallel_for_each
upon it. datadirs[0] is in fact the datadir field.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:43 +03:00
Pavel Emelyanov
99dfade020 table: Coroutinize init_storage()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:43 +03:00
Pavel Emelyanov
a19b8af187 table: Relocate ks.make_directory_for_column_family()
This method, which initializes storage for a table, naturally belongs to that
class, so rename it while moving. Also, there's no longer a need to carry the
table name and uuid as arguments; being a table method, it can just get the
paths to work on from the config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 18:15:41 +03:00
Pavel Emelyanov
6db5f08eab distributed_loader: Use cf.dir() instead of ks.column_family_directory()
These two return the same thing, but the latter does it the harder way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 17:59:47 +03:00
Pavel Emelyanov
44b811ce19 test: Don't create directory for system tables in cql_test_env
The distributed_loader::init_system_keyspaces() does it when called a few
lines above this place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-26 17:58:46 +03:00
Marcin Maliszkiewicz
9ce65270d5 alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
A BatchGetItem request is a map of table names and 'sub-requests'; ExpressionAttributeNames is defined at the
'sub-request' level, but the code was instead checking the top level, obtaining nullptr every time, which
effectively disabled the unused-names check.

Fixes #13251
2023-05-26 15:03:15 +02:00
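The essence of the fix, with dicts standing in for the parsed JSON (the table name `tbl` and the field values are made up for illustration):

```python
batch_request = {
    "RequestItems": {
        "tbl": {
            "Keys": [{"key": {"S": "1"}}],
            # Defined per sub-request, not at the top level:
            "ExpressionAttributeNames": {"#k": "key"},
        }
    }
}


def find_names_buggy(req: dict):
    # Before: looked at the top level of BatchGetItem, where the field
    # never lives, so the unused-names check always saw nothing.
    return req.get("ExpressionAttributeNames")


def find_names_fixed(req: dict, table: str):
    # After: look inside the per-table sub-request.
    return req["RequestItems"][table].get("ExpressionAttributeNames")


print(find_names_buggy(batch_request))          # None -- check disabled
print(find_names_fixed(batch_request, "tbl"))   # the actual names map
```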
Marcin Maliszkiewicz
fb5c325cfd alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
rjson::find is not a very cheap function; it involves a bunch of function calls and loop iteration.
Overall it costs 120-170 instructions even for small requests. An example profile of alternator::executor::query
execution shows ~18 rjson::find calls, taking in total around 7% of the query's internal processing time (note that
JSON parse/print and HTTP handling are not part of this function).

This patch eliminates 2 rjson::find calls for most request types. I saw a 1-2% tps improvement
in `perf-simple-query --write`. Although it does improve tps, I suspect the real percentage is smaller and
don't have much confidence in this particular number; the observed benchmark variance is too high to measure it reliably.
2023-05-26 15:03:15 +02:00
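The shape of the optimization is the usual "find once, pass down" refactor; a toy sketch with a counting dict standing in for the parsed JSON request (the helper names are hypothetical, not the alternator code):

```python
class CountingDict(dict):
    """Dict that counts lookups, standing in for the rjson::find cost."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def get(self, key, default=None):
        self.lookups += 1
        return super().get(key, default)


def handle_before(req):
    # Before: each consumer (validation, expression parsing, unused-entry
    # checking) performs its own find() on the same keys.
    names_for_validation = req.get("ExpressionAttributeNames")
    names_for_parsing = req.get("ExpressionAttributeNames")
    values_for_parsing = req.get("ExpressionAttributeValues")
    values_for_checking = req.get("ExpressionAttributeValues")
    return names_for_parsing, values_for_parsing


def handle_after(req):
    # After: find each key once and pass the results to every consumer.
    names = req.get("ExpressionAttributeNames")
    values = req.get("ExpressionAttributeValues")
    return names, values


req = CountingDict({"ExpressionAttributeNames": {"#n": "name"}})
handle_before(req)
before = req.lookups
req.lookups = 0
handle_after(req)
print(before - req.lookups, "lookups saved")  # 2 fewer finds per request
```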
Kamil Braun
a58beb8ce4 Merge 'Fix flakiness of test_tablets.py' from Tomasz Grabiec
We've observed sporadic failures of this test in CI related to driver reconnection after server restart.

Fixes #14032

Closes #14027

* github.com:scylladb/scylladb:
  test: test_tablets.py: Wait for driver to see the hosts after restart
  test: test_tablets.py: Pass server id to server_restart()
  test: test_tablets.py: Add missing await on server_restart()
2023-05-25 14:38:37 +02:00
Gleb Natapov
0e80c5162a storage_service: avoid unneeded copies in on_change
Move the array of strings instead of copying it.
2023-05-25 14:51:14 +03:00
Gleb Natapov
3a201c25c8 storage_service: remove check that is always true
The array cannot be empty since we access the first element of the array
before we call this function.
2023-05-25 14:50:23 +03:00
Gleb Natapov
715897ff31 storage_service: rename handle_state_removing to handle_state_removed
The function no longer handles the REMOVING_TOKENS state, so rename the
function and drop the no-longer-needed checks for the nonexistent state.
2023-05-25 14:48:58 +03:00
Gleb Natapov
4103281648 storage_service: avoid string copy 2023-05-25 14:48:39 +03:00
Gleb Natapov
05aa07835d storage_service: delete code that handled REMOVING_TOKENS state
The state is never advertised so the code is never used.
2023-05-25 14:48:09 +03:00
Gleb Natapov
66ff072540 gossiper: remove code related to advertising REMOVING_TOKEN state
Apparently it was needed for removetoken support which was deprecated in
the ORIGIN already.
2023-05-25 14:47:16 +03:00
Gleb Natapov
a429018a8a migration_manager: add wait_for_schema_agreement() function
Several subsystems re-implement the same logic for waiting for schema
agreement. Provide the function in the migration_manager and use it
instead.
2023-05-25 14:44:53 +03:00
Tomasz Grabiec
9d3d9be29e test: test_tablets.py: Wait for driver to see the hosts after restart
Apparently, the driver may still be establishing connections in the
background after connecting to the cluster, and queries may fail with:

  cassandra.cluster.NoHostAvailable

Replace reconnection with wait_for_cql_and_get_hosts(), which ensures
that the driver sees the host.
2023-05-25 11:38:40 +02:00
Botond Dénes
5a14c3311a Merge 'Break S3 upload 50Gb file limit' from Pavel Emelyanov
The current S3 uploading sink has an implicit limit on the final file size that comes from two places. First, the S3 protocol declares that upload part numbers range from 1 to 10000 (inclusive). Second, the uploading sink sends out parts once they grow above the S3 minimal part size, which is 5 MB. Since sstables put data in 128 KB (or smaller) portions, parts are almost exactly 5 MB in size, so the total upload size cannot grow above ~50 GB. That's too low.

To break the limit, the new sink (called the jumbo sink) uses the UploadPartCopy S3 call, which splices several objects into one right on the server. The jumbo sink starts uploading parts into an intermediate temporary object called a piece, named ${original_object}_${piece_number}. When the number of parts in the current piece grows above the configured limit, the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background; meanwhile, a new piece is created and subsequent data is put into it. When the sink is flushed, the current piece is flushed as is and also squashed into the object.

The new jumbo sink is capable of uploading ~500 TB of data, which looks like enough.

fixes: #13019

Closes #13577

* github.com:scylladb/scylladb:
  sstables: Switch data and index sink to use jumbo uploader
  s3/test: Tune-up multipart upload test alignment
  s3/test: Add jumbo upload test
  s3/client: Wait for background upload fiber on close-abort
  c3/client: Implement jumbo upload sink
  s3/client: Move memory buffers to upload_sink from base
  s3/client: Move last part upload out of finalize_upload()
  s3/client: Merge do_flush() with upload_part()
  s3/client: Rename upload_sink -> upload_sink_base
2023-05-25 11:44:06 +03:00
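The limits quoted above follow from simple arithmetic on the S3 constraints (the per-piece part count below is an illustrative assumption chosen to reproduce the quoted ~500 TB figure; in the code it is configurable):

```python
MAX_PARTS = 10_000        # S3 multipart upload: part numbers 1..10000
PART_SIZE = 5 * 1024**2   # parts come out almost exactly at the 5 MB minimum

# Plain multipart upload: a single upload of at most 10000 parts.
plain_limit = MAX_PARTS * PART_SIZE
print(round(plain_limit / 1024**3, 1), "GiB")   # the ~50 GB ceiling

# Jumbo sink: each of the object's (up to) 10000 parts is itself a
# "piece" assembled from up to PARTS_PER_PIECE uploaded parts, spliced
# into the final object via UploadPartCopy.
PARTS_PER_PIECE = 10_000  # assumption; the real per-piece limit is configurable
jumbo_limit = MAX_PARTS * PARTS_PER_PIECE * PART_SIZE
print(round(jumbo_limit / 1024**4, 1), "TiB")   # the ~500 TB figure
```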
Kamil Braun
1339ae141a Merge 'Small improvements after pending_ranges, endpoints_for_reading -> erm PR' from Gusev Petr
This is a small follow-up to [this PR](https://github.com/scylladb/scylladb/pull/13715); it resolves some comments on the initial PR that didn't make their way into it.
* remove `noexcept` from `clear_gently`, since exceptions can be raised from the move constructor;
* an optimisation for `vnode_effective_replication_map::get_range_addresses`, avoiding a redundant binary search.

Closes #14015

* github.com:scylladb/scylladb:
  vnode_erm: optimize get_range_addresses
  clear_gently: remove noexcept for rvalue references overload
2023-05-25 10:37:27 +02:00
Pavel Emelyanov
222f21d180 messaging_service: Remove unused headers from m.s..hh
tracing.hh is large enough for its removal to matter.
The other one is a "while at it" cleanup.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14024
2023-05-25 08:38:49 +03:00
Kefu Chai
8e7c7e1079 docs/dev/repair_based_node_ops: better formatting
* indent the nested paragraphs of list items
* use table to format the time sequence for better
  readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14016
2023-05-25 08:31:43 +03:00
Kefu Chai
8e6fbb99c7 docs/operating-scylla: lowercase the name of an option
"Enable_repair_based_node_ops" is the name of an option, and the leading
character should be lowecase "e". so fix it.

Fixes #14017
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14018
2023-05-25 08:21:59 +03:00
Tomasz Grabiec
51e3b9321b Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is one of the last vestiges of the time when partitions were treated atomically, and it is a well-known source of numerous large stalls.

This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade.

The series:
1. Does some code cleanup in the mutation_partition area.
2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
3. Adds upgrading variants of constructors and apply() for `row` and its wrappers.
4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.

Fixes #2577

Closes #13761

* github.com:scylladb/scylladb:
  test: mvcc_test: add a test for gentle schema upgrades
  partition_version: make partition_entry::upgrade() gentle
  partition_version: handle multi-schema snapshots in merge_partition_versions
  mutation_partition_v2: handle schema upgrades in apply_monotonically()
  partition_version: remove the unused "from" argument in partition_entry::upgrade()
  row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
  partition_version: handle multi-schema entries in partition_entry::squashed
  partition_snapshot_row_cursor: handle multi-schema snapshots
  partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots
  partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
  partition_version: add a logalloc::region argument to partition_entry::upgrade()
  memtable: propagate the region to memtable_entry::upgrade_schema()
  mutation_partition: add an upgrading variant of lazy_row::apply()
  mutation_partition: add an upgrading variant of rows_entry::rows_entry
  mutation_partition: switch an apply() call to apply_monotonically()
  mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
  mutation_fragment: add an upgrading variant of clustering_row::apply()
  mutation_partition: add an upgrading variant of row::row
  partition_version: remove _schema from partition_entry::operator<<
  partition_version: remove the schema argument from partition_entry::read()
  memtable: remove _schema from memtable_entry
  row_cache: remove _schema from cache_entry
  partition_version: remove the _schema field from partition_snapshot
  partition_version: add a _schema field to partition_version
  mutation_partition: change schema_ptr to schema& in mutation_partition::difference
  mutation_partition: change schema_ptr to schema& in mutation_partition constructor
  mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
  mutation_partition: add upgrading variants of row::apply()
  partition_version: update the comment to apply_to_incomplete()
  mutation_partition_v2: clean up variants of apply()
  mutation_partition: remove apply_weak()
  mutation_partition_v2: remove a misleading comment in apply_monotonically()
  row_cache_test: add schema changes to test_concurrent_reads_and_eviction
  mutation_partition: fix mixed-schema apply()
2023-05-24 22:58:43 +02:00
Nadav Har'El
7cdee303cf Merge 'ks_prop_defs: disallow empty replication factor string in NTS' from Jan Ciołek
A CREATE KEYSPACE query which specifies an empty string ('') as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```

This is wrong: an empty replication factor string is invalid.
It creates a keyspace without any replication, so the tables inside it aren't writable.

Trying to create a `SimpleStrategy` keyspace with such a replication factor throws an error; `NetworkTopologyStrategy` should do the same.

The problem was in `prepare_options`: it treated an empty replication factor string as no replication factor.
Changing it to `std::optional` fixes the problem:
now `std::nullopt` means no replication factor, and `make_optional("")` means that there is a replication factor, but it's described by an empty string.
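
A minimal sketch of the optional-based fix (hypothetical helper names, not the actual ks_prop_defs code): with an optional, "option absent" and "option present but empty" are no longer conflated, so the empty string can be rejected.

```python
from typing import Optional

def prepare_rf(options: dict) -> Optional[str]:
    # None -> option absent (no replication factor given)
    # ''   -> option present but empty: must be rejected, not ignored
    return options.get('replication_factor')

def validate_rf(rf: Optional[str]) -> None:
    if rf is None:
        return  # no RF given: fine, per-DC options may apply
    if not rf.isdigit():
        raise ValueError(f"invalid replication factor: {rf!r}")

validate_rf(prepare_rf({}))                           # ok: absent
validate_rf(prepare_rf({'replication_factor': '3'}))  # ok: valid RF
try:
    validate_rf(prepare_rf({'replication_factor': ''}))
except ValueError as e:
    print(e)  # invalid replication factor: ''
```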

Fixes: https://github.com/scylladb/scylladb/issues/13986

Closes #13988

* github.com:scylladb/scylladb:
  test/network_topology_strategy_test: Test NTS with replication_factor option in test_invalid_dcs
  ks_prop_defs: disallow empty replication factor string in NTS
2023-05-24 21:39:31 +03:00
Pavel Emelyanov
d2f5a44e3b test/alternator: Don't use empty AWS secret key
There's a test case that checks invalid credentials (wrong key).
However, some boto3 library versions don't accept empty secret key values:

request = <FixtureRequest for <Function test_wrong_key_access>>
dynamodb = dynamodb.ServiceResource()

    def test_wrong_key_access(request, dynamodb):
        print("Please make sure authorization is enforced in your Scylla installation: alternator_enforce_authorization: true")
        url = dynamodb.meta.client._endpoint.host
        with pytest.raises(ClientError, match='UnrecognizedClientException'):
            if url.endswith('.amazonaws.com'):
                boto3.client('dynamodb',endpoint_url=url, aws_access_key_id='wrong_id', aws_secret_access_key='').describe_endpoints()
            else:
                verify = not url.startswith('https')
>               boto3.client('dynamodb',endpoint_url=url, region_name='us-east-1', aws_access_key_id='whatever', aws_secret_access_key='', verify=verify).describe_endpoints()

test_authorization.py:23:

...

cls = <class 'awscrt.auth.AwsCredentialsProvider'>, access_key_id = 'whatever'
secret_access_key = '', session_token = None

    @classmethod
    def new_static(cls, access_key_id, secret_access_key, session_token=None):
        """
        Create a simple provider that just returns a fixed set of credentials.

        Args:
            access_key_id (str): Access key ID
            secret_access_key (str): Secret access key
            session_token (Optional[str]): Optional session token

        Returns:
            AwsCredentialsProvider:
        """
        assert isinstance(access_key_id, str)
        assert isinstance(secret_access_key, str)
        assert isinstance(session_token, str) or session_token is None

>       binding = _awscrt.credentials_provider_new_static(access_key_id, secret_access_key, session_token)
E       RuntimeError: 34 (AWS_ERROR_INVALID_ARGUMENT): An invalid argument was passed to a function.

$ pip3 show boto3
Name: boto3
Version: 1.26.139
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /home/xemul/.local/lib/python3.11/site-packages
Requires: botocore, jmespath, s3transfer
Required-by:

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14022
2023-05-24 19:46:16 +03:00
Jan Ciolek
55fb91bf10 exceptions: remove relation field from unrecognized_entity_exception
The exception unrecognized_entity_exception used to have two fields:
* entity - the name that wasn't recognized
* relation_str - part of the WHERE clause that contained this entity

In 4e0a089f3e the places that throw
this exception were modified: the thrower started passing the unrecognized
column name to both fields, entity and relation_str. It was easier to
do things this way, since accessing the whole WHERE clause can be problematic.

The problem is that this caused error messages to get weird, e.g.:
"Undefined name x in where clause ('x')".
Here x is not the WHERE clause; it's the unrecognized name.

Let's remove the `relation_str` field, as it isn't used anymore
and only causes confusion. After this change the message becomes:
"Unrecognized name x"
which makes much more sense.
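
A tiny sketch of the message change (the format strings are paraphrased, not the exact Scylla ones):

```python
entity = "x"

# before: relation_str duplicated the entity, producing a confusing message
relation_str = entity
old_msg = f"Undefined name {entity} in where clause ('{relation_str}')"

# after: only the unrecognized entity is reported
new_msg = f"Unrecognized name {entity}"

print(old_msg)  # Undefined name x in where clause ('x')
print(new_msg)  # Unrecognized name x
```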

Refs #10632

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13944
2023-05-24 19:35:26 +03:00
Nadav Har'El
3b2c87a82b cql: fix column name in writetime() error message
Found and fixed yet another place where an error message prints a column
name as "bytes" type which causes it to be printed as hexadecimal codes
instead of the actual characters of the name.

The specific error message fixed here is "Cannot use selection function
writeTime on PRIMARY KEY part k" which happens when you try to use
writetime() or ttl() on a key column (which isn't allowed today - see
issue #14019). Before this patch we got "6b" in the error message instead
of "k".

The patch also includes a regression test that verifies that this
error condition is recognized and the real name of the column is
printed. This test fails before this patch, and passes after it.
As usual, the test also passes on Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14021
2023-05-24 19:28:44 +03:00
Pavel Emelyanov
e435ec1b5e sstable_directory: Do not collect filesystem garbage for S3-backed sstables
The sstable_directory::garbage_collect() scans /var/lib/scylla for
whatever sstable it's called for. S3-backed ones don't have anything
there, so the g.c. run is a no-op. Make this call a lister virtual
method, so that only the filesystem lister does this scan and the ownership
table lister becomes a real no-op. Later it will be filled with code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:45:50 +03:00
Pavel Emelyanov
16d66f2fe9 sstable_directory: Deduplicate .process() location argument
When the sstable directory calls the lister it passes _sstable_dir as an
argument. However, the very same _sstable_dir was used to construct the
lister, and by now all the lister implementations keep this value
around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:43:36 +03:00
Pavel Emelyanov
d6b5e18cb3 sstable_directory: Keep directory lister on stack
The directory_lister _lister exists as a class member, but is only used
once -- when .process() is called -- and is then closed forever.
It's simpler to keep the lister on the .process() stack.

This change also makes the filesystem lister keep a copy of the directory as
a class member, which will be useful for the next patch as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:42:08 +03:00
Pavel Emelyanov
524614087a sstable_directory: Use directory_lister API directly
The filesystem components lister has private wrappers on top of the
directory lister it uses internally. These are leftovers from making the
sstable directory storage-aware; now they can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:40:38 +03:00
Tomasz Grabiec
fbd103744c test: test_tablets.py: Pass server id to server_restart()
It works with ids, not ServerInfo
2023-05-24 15:01:06 +02:00
Tomasz Grabiec
b423d132f5 test: test_tablets.py: Add missing await on server_restart()
Could be responsible for test failures due to inability to connect to
the server afterwards.
2023-05-24 15:01:06 +02:00
Kefu Chai
b0c40a2a03 db: config: s/ingore/ignore/
this string is used as the option description in the command-line
help message, so it is a part of the user-facing interface.

in this change, the typo is fixed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14013
2023-05-24 13:35:24 +03:00
Alejo Sanchez
91f609d065 migration_manager: do not pull schema if raft is on
Now that schema changes are consistent, remove schema pulls from gossiper
events if Raft is enabled, taking the Raft upgrade state into account.

Only disable pull if Raft is fully enabled.

Fixes #12870

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13695
2023-05-24 10:39:45 +02:00
Petr Gusev
819d710753 vnode_erm: optimize get_range_addresses
In get_range_addresses we iterate
over vnode tokens; there is no need to
binary-search for them in tmptr->first_token,
as they can be used directly as keys
into _replication_map.
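
A toy illustration of the optimization (illustrative data, not the vnode_erm code): when the tokens being iterated are the map's own keys, a per-token binary search into a sorted key list is redundant.

```python
import bisect

# toy replication map: vnode token -> replica list
replication_map = {10: ["n1"], 20: ["n2"], 30: ["n3"]}
first_token = sorted(replication_map)  # stands in for a sorted token list

# before: binary-search each token in the sorted token list first
before = []
for token in replication_map:
    i = bisect.bisect_left(first_token, token)
    before.append(replication_map[first_token[i]])

# after: the vnode tokens are the map keys themselves, so use them directly
after = [replicas for _token, replicas in sorted(replication_map.items())]

assert before == after  # same result, without the extra search per token
```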
2023-05-24 12:16:37 +04:00
Petr Gusev
79c6bf0885 clear_gently: remove noexcept for rvalue references overload
We use this overload in vnode_erm; one of the
arguments is a boost::icl::interval_map,
whose move constructor is not noexcept.
2023-05-24 12:08:19 +04:00
Botond Dénes
eb457b6104 Merge 'fixed broken links, added community forum link, university link, spelling and other mistakes' from Guy Shtub
Closes #13979

* github.com:scylladb/scylladb:
  Update docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  Update docs/dev/docker-hub.md
  fixed broken links, added community forum link, university link,  other mistakes
2023-05-24 09:58:58 +03:00
Nadav Har'El
02d31786ff test/alternator: better README.md on how to run and write tests
Improve test/alternator/README.md by adding a better and more beginner-
friendly introduction to how to run the Alternator tests, as well
as a section about the philosophy of the Alternator test suite and
some guidelines on how to write good tests in that framework.

Much of this text was copied from test/cql-pytest/README.md.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13999
2023-05-24 09:58:12 +03:00
Kefu Chai
2fbcbc09b0 api: specialize fmt::formatter<api::table_info>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `api::table_info` without the help of `operator<<`.

but the corresponding `operator<<()` is preserved in this change, as we
still have lots of callers relying on this << operator in storage_service.cc,
where std::vector<table_info> is formatted using operator<<(ostream&, const Range&)
defined in to_string.hh. we could have used fmt/ranges.h to print the
std::vector<table_info>. but the combination of operator<<(ostream&, const Range&)
and FMT_DEPRECATED_OSTREAM renders this impossible. because
unlike the builtin range formatter specializations, the fallback formatter
synthesized from the operator<< does not have brackets defined for
the range printer. the brackets are used as the left and right marks
of the range, for instance, the array-alike containers are printed
like [1,2,3], while the tuple-alike containers are printed like
(1,2,3). once we are allowed to remove FMT_DEPRECATED_OSTREAM, we
should be able to use the builtin range formatter, and remove the
operator<< for api::table_info by then.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13975
2023-05-24 09:49:44 +03:00
Kefu Chai
8efb5c30ce counters: move fmt::formatter<counter_{shard,cell}_view>::format() to .cc
to reduce the size of the header file, in hope of speeding up compilation,
let's move the implementation of the format() function into the .cc file.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14010
2023-05-24 09:36:49 +03:00
Pavel Emelyanov
132260973a tests: Add perf test for S3 client (reading latencies)
Here's a simple test that can be used to check S3 object read latencies.
To run one must export the same variables as for any other S3 unit test:

- S3_SERVER_ADDRESS_FOR_TEST
- S3_SERVER_PORT_FOR_TEST
- S3_PUBLIC_BUCKET_FOR_TEST

and the AWS creds are a must via AWS_S3_EXTRA='$key:$secret:$region' env
variable.

Accepted options are

   --duration SEC -- test duration in seconds
   --parallel NR -- number of fibers to run in parallel
   --object-size BYTES -- object size to use (1MB by default)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13895
2023-05-24 09:29:48 +03:00
Botond Dénes
57758ec3e1 Merge 'Put streaming sched group onto stream manager' from Pavel Emelyanov
The manager is in charge of updating IO bandwidth on the respective prio class. Nowadays it uses the global priority manager, but the effort to unify sched classes will require it to use a non-global streaming sched group. After the patch the sched class field is unused, but it's a preparation towards a huge (really huge) "switch to seastar API level 7" patch.

ref: #13963

Closes #13997

* github.com:scylladb/scylladb:
  stream_manager: Add streaming sched group copy
  cql_test_env: Move sched groups initialization up
2023-05-24 09:27:30 +03:00
Nadav Har'El
644787535a test/cql-pytest: revert incorrect fix to avoid a warning
In commit 0a71151bc4 I wanted to avoid
a incorrect deprecation warning from the Python driver but fixed it
in an incorrect way. I never noticed the fix was incorrect because
the test was already xfailing, and the incorrect fix just made it
fail differently... In this patch I revert that commit.

With this revert, I am *not* bringing back the spurious warning -
the Python driver bug was already fixed in
https://github.com/datastax/python-driver/pull/1103 - so developers
with a fairly recent version will no longer see the spurious warning.
Both old and new drivers will at least do the correct thing, as
it was before that unfortunate commit.

Fixes #8752.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14002
2023-05-24 09:25:57 +03:00
Botond Dénes
2526b232f1 Merge 'Remove explicit default_priority_class() usage from sstable aux methods' from Pavel Emelyanov
There are a few places in sstables/ code that require the caller to specify a priority class to pass along to the file stream options. All these callers use the default class, so it makes little sense to keep it. This change makes the sched classes unification mega patch a bit smaller.

ref: #13963

Closes #13996

* github.com:scylladb/scylladb:
  sstables: Remove default prio class from rewrite_statistics()
  sstables: Remove prio class from validate_checksums subs
  sstables: Remove always default io-prio from validate_checksums()
2023-05-24 09:23:24 +03:00
Kefu Chai
cb22492379 raft: specialize fmt::formatter<raft::server_address&> and friends
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print

- raft::server_address
- raft::config_member
- raft::configuration

without the help of `operator<<`.

the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting now.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13976
2023-05-24 09:11:55 +03:00
Botond Dénes
1ef600fb7f Merge 'docs/dev/system_keyspace: move regular tables into another section and add the raft table' from Kefu Chai
not all tables in system keyspace are volatile. among other things, system.sstables and system.tablets are persisted using sstables like regular user tables. so add a dedicated section for them. also, in this change, raft table is added to the new section.

Closes #13981

* github.com:scylladb/scylladb:
  docs/dev/system_keyspace: add raft table
  docs/dev/system_keyspace: move sstables and tablets into another section
2023-05-24 08:54:10 +03:00
Botond Dénes
313ae4ddac Merge 'Generalize some file accessing helpers in test/' from Pavel Emelyanov
Several test cases use common operations on files, like existence checking, content comparing, etc., with the help of home-brew local helpers. This set makes use of some existing seastar:: ones and generalizes others into test/lib/. The primary intent here is `57 insertions(+), 135 deletions(-)`

Closes #13936

* github.com:scylladb/scylladb:
  test: Generalize touch_file() into test_utils.*
  test/database: Generalize file/dir touch and exists checks
  test/sstables: Use seastar::file_exists() to check
  test/sstables: Remove sstdesc
  test/sstables: Use compare_files from utils/ in sstable_test
  test/sstables: Use compare_files() from utils/ in sstable_3_x_test
  test/util: Add compare_file() helpers
2023-05-24 08:43:41 +03:00
Guy Shtub
65c0afc899 Update docker-hub.md 2023-05-24 07:34:58 +03:00
Guy Shtub
7e3d768369 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:27:07 +03:00
Guy Shtub
6329036656 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:26:42 +03:00
Guy Shtub
3538a2e1c2 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:51 +03:00
Guy Shtub
53183d6302 Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:37 +03:00
Guy Shtub
2677d47bbc Update docs/dev/docker-hub.md
Co-authored-by: Anna Stuchlik <37244380+annastuchlik@users.noreply.github.com>
2023-05-24 07:23:28 +03:00
Kefu Chai
b8c565875b docs/dev/system_keyspace: add raft table
it is one of the non-volatile tables. we need to add more of them,
but let's do this piecemeal.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-24 10:08:04 +08:00
Kefu Chai
eee0003312 docs/dev/system_keyspace: move sstables and tablets into another section
not all tables in system keyspace are volatile. among other things,
system.sstables and system.tablets are persisted using sstables like
regular user tables. so move them into the section where we have
other regular tables there.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-24 10:08:03 +08:00
Pavel Emelyanov
5aea6938ae commitlog: Introduce and use comitlog sched group
Nowadays all commitlog code runs in whatever sched group it's kicked
from. Since IO prio classes are going to be inherited from the current
sched group the commitlog IO loops should be moved into commitlog sched
group, not inherit a "random" one.

There are currently two places that need correct context for IO -- the
.cycle() method and segments replenisher.

`$ perf-simple-query --write -c2` results

--- Before the patch ---
194898.36 tps ( 56.3 allocs/op,  12.7 tasks/op,   54307 insns/op,        0 errors)
199286.23 tps ( 56.2 allocs/op,  12.7 tasks/op,   54375 insns/op,        0 errors)
199815.84 tps ( 56.2 allocs/op,  12.7 tasks/op,   54377 insns/op,        0 errors)
198260.98 tps ( 56.3 allocs/op,  12.7 tasks/op,   54380 insns/op,        0 errors)
198572.86 tps ( 56.2 allocs/op,  12.7 tasks/op,   54371 insns/op,        0 errors)

median 198572.86 tps ( 56.2 allocs/op,  12.7 tasks/op,   54371 insns/op,        0 errors)
median absolute deviation: 713.36
maximum: 199815.84
minimum: 194898.36

--- After the patch ---
194751.80 tps ( 56.3 allocs/op,  12.7 tasks/op,   54331 insns/op,        0 errors)
199084.70 tps ( 56.2 allocs/op,  12.7 tasks/op,   54389 insns/op,        0 errors)
195551.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54385 insns/op,        0 errors)
197953.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54386 insns/op,        0 errors)
198710.00 tps ( 56.3 allocs/op,  12.7 tasks/op,   54387 insns/op,        0 errors)

median 197953.47 tps ( 56.3 allocs/op,  12.7 tasks/op,   54386 insns/op,        0 errors)
median absolute deviation: 1131.24
maximum: 199084.70
minimum: 194751.80

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14005
2023-05-23 21:25:57 +03:00
Avi Kivity
da5467c687 Merge 'Use implicit default prio class in tests' from Pavel Emelyanov
There are several places in tests that either use default_priority_class() explicitly, or use some specific prio class obtained from the priority manager. There's currently ongoing work to remove all priority classes; this set makes the final patch a bit smaller and easier to review. In particular, in many cases default_priority_class() is implicit and can be avoided by callers. Also, using any specific prio class in a test is excessive; it can go with the (implicit) default_priority_class.

ref: #13963

Closes #13991

* github.com:scylladb/scylladb:
  test, memtable: Use default prio class
  test, memtable: Add default value for make_flush_reader() last arg
  test, view_build: Use default prio class
  test, sstables: Use implicit default prio class in dma_write()
  test, sstables: Use default sstable::get_writer()'s prio class arg
2023-05-23 18:46:52 +03:00
Avi Kivity
3956e01640 Merge 'Clean index_reader API' from Pavel Emelyanov
The way index_reader maintains io_priority_class can be relaxed a bit. The main intent is to shorten the #13963 final patch a bit, as a side effect index_reader gets its portion of API polishing.

ref: #13963

Closes #13992

* github.com:scylladb/scylladb:
  index_reader: Introduce and use default arguments to constructor
  index_reader: Use _pc field in get_file_input_stream_options() directly
  index_reader: Move index_reader::get_file_input_stream_options to private: block
2023-05-23 18:46:26 +03:00
Piotr Smaroń
5f6491987d Deregister table's metrics when disposing a table to work around #8627
The metrics being deregistered in this PR caused Scylla to crash when a
table was dropped, its corresponding table object in memory was not
yet deallocated, and a new table with the same name was created. This
caused a double-metrics-registration exception to be thrown. To
avoid it, we deregister the table's metrics as soon as the table is
marked for disposal from the database. The table's in-memory representation can
still live on, but shouldn't prevent another table with the same name from being
created.
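
A toy model of the problem and the fix (illustrative names, not the actual metrics code): deregistering at dispose time, rather than at destruction, lets a same-named table register its metrics while the old object still lingers.

```python
class MetricsRegistry:
    def __init__(self):
        self._names = set()

    def register(self, name):
        # mirrors the double-registration exception described above
        if name in self._names:
            raise RuntimeError(f"metric {name} already registered")
        self._names.add(name)

    def deregister(self, name):
        self._names.discard(name)

class Table:
    def __init__(self, registry, name):
        self.registry, self.name = registry, name
        registry.register(name)

    def dispose(self):
        # the fix: drop metrics as soon as the table is marked disposed,
        # even though the table object may outlive this call
        self.registry.deregister(self.name)

reg = MetricsRegistry()
t1 = Table(reg, "ks.t")
t1.dispose()             # old table marked disposed, metrics released
t2 = Table(reg, "ks.t")  # no double-registration error, t1 still alive
print("ok")
```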

Fixes #13548

Closes #13971
2023-05-23 18:41:51 +03:00
Nadav Har'El
88fd7f7111 Merge 'Docs: add feature store tutorial' from Attila Tóth
* Adds the new feature tutorial site to the docs
* fixes the unnecessary redirection (iot.scylladb.com)

Closes #13998

* github.com:scylladb/scylladb:
  Skip unnecessary redirection
  Add links to feature store tutorial
2023-05-23 16:17:23 +03:00
Alejo Sanchez
c276ac3099 test/topology: run first slow topology tests
To speed up total test suite run, change configuration to schedule slow
topology tests first.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13948
2023-05-23 15:12:40 +03:00
Attila Toth
cf686b4238 Skip unnecessary redirection 2023-05-23 14:09:39 +02:00
Attila Toth
a8008760f7 Add links to feature store tutorial 2023-05-23 14:08:01 +02:00
Kefu Chai
1246568e3b docs/dev/system_keyspace: use timeuuid for sstables.generation
we changed the type of generation column in system.sstables
from bigint to timeuuid in 74e9e6dd1a
but that change failed to update the document accordingly. so let's
update the document to reflect the change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13994
2023-05-23 14:37:28 +03:00
Pavel Emelyanov
678f8fb1b7 stream_manager: Add streaming sched group copy
The manager in question is responsible for updating the streaming
class IO bandwidth. Nowadays it does this via the priority manager's
global streaming IO priority class field, but it will need to switch to
the streaming sched group.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 14:31:23 +03:00
Pavel Emelyanov
ff9d65f6ad cql_test_env: Move sched groups initialization up
The streaming manager will need to keep its copy of
streaming/maintenance group, so groups should be created early.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 14:31:23 +03:00
Avi Kivity
1c0e8c25ca Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.

Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
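
The exception-safety pattern described above can be sketched as follows (illustrative names, not the actual reader_context code): once the querier is looked up, every error path, including the semaphore-mismatch check, must close it before propagating.

```python
class Querier:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

created = []  # track queriers so the sketch can show they get closed

def lookup_reader(semaphore_ok: bool) -> Querier:
    q = Querier()  # the looked-up querier
    created.append(q)
    try:
        if not semaphore_ok:
            # the mismatch check that used to leak the querier
            raise RuntimeError("semaphore mismatch")
        return q
    except Exception:
        q.close()  # close on every error path instead of leaking
        raise

try:
    lookup_reader(semaphore_ok=False)
except RuntimeError as e:
    print(e)  # semaphore mismatch

assert created[-1].closed  # the looked-up querier was closed, not leaked
```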

Fixes: #13784

Closes #13790

* github.com:scylladb/scylladb:
  test/boost/database_test: add unit test for semaphore mismatch on range scans
  partition_slice_builder: add set_specific_ranges()
  multishard_mutation_query: make reader_context::lookup_readers() exception safe
  multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
2023-05-23 14:05:10 +03:00
Pavel Emelyanov
6c453df9d7 sstables: Remove default prio class from rewrite_statistics()
The method is called with an explicitly default priority class and puts it
into the fstream options. This whole chain can be avoided.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
438132ad4b sstables: Remove prio class from validate_checksums subs
The sstable.read_checksum() and .read_digest() accept a prio class
argument from validate_checksums(), but it's always the "default" one.
Remove the arg and the stream options initializations, as they'll pick
up the default prio class when default-constructed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
7396d9d291 sstables: Remove always default io-prio from validate_checksums()
All calls to sstables::validate_checksums() happen with an explicitly
default priority class. Just hard-code it as such in the method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
2bb024c948 index_reader: Introduce and use default arguments to constructor
Most creators of index_reader construct it with the default prio class,
a null trace pointer, and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:29:04 +03:00
Avi Kivity
397f4b51c3 Update seastar submodule
* seastar f94b1bb9cb...aff87d5bb9 (9):
  > prometheus.cc: change function in foreach_metrics to use const ref
Fixes #13929
  > sstring: add == comparison operators
  > Add a simple code example for promise in tutorial.md
  > Add example for cpuset docs
  > Merge 'httpd: modernize' from Avi Kivity
  > Merge 'TLS: support for extracting certificate subject alt names from client certs' from Calle Wilund
  > Merge 'Update IO stats label set' from Pavel Emelyanov
  > file: s/(void)/()/ in function's parameter list
  > scripts: addr2line: allow specifying kallsyms path

Closes #13985
2023-05-23 11:24:39 +03:00
Pavel Emelyanov
3fd5d3cc2b index_reader: Use _pc field in get_file_input_stream_options() directly
No need to pass this-> field into this-> call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Pavel Emelyanov
21d24e8ea3 index_reader: Move index_reader::get_file_input_stream_options to private: block
A "while at it" cleanup. When patching the method (next patch) it turned
out that there are no callers other than the local class, so it _is_
private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Asias He
7056b7ee9a repair: Log nodes down during repair in case of failed repair
This helps users figure out whether the repair failed because a peer node
was down during the repair.

For example:

```
WARN  [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: Repair
1026 out of 1026 ranges, keyspace=ks2a, table={test_table, tb},
range=(9203128250168517738,+inf), peers={127.0.0.2}, live_peers={},
status=skipped_no_live_peers

INFO  [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: stats:
repair_reason=repair, keyspace=ks2a, tables={test_table, tb}, ranges_nr=513,
round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0,
tx_hashes_nr=0, rx_hashes_nr=0, duration=0 seconds, tx_row_nr=0, rx_row_nr=0,
tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={}, row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s, row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}

WARN  [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out
of 1026 ranges failed, keyspace=ks2a, tables={test_table, tb},
repair_reason=repair, nodes_down_during_repair={127.0.0.2}

WARN  [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]:
repair_tracker run failed: std::runtime_error ({shard 0: std::runtime_error
(repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out of 1026 ranges failed,
keyspace=ks2a, tables={test_table, tb}, repair_reason=repair,
nodes_down_during_repair={127.0.0.2})})
```

In addition, change the `status=skipped` to `status=skipped_no_live_peers`
to make it more clear.

Closes #13928
2023-05-23 11:12:42 +03:00
Anna Stuchlik
f45976730c doc: add versioning and support information
Fixes https://github.com/scylladb/scylla-docs/issues/3966
Fixes https://github.com/scylladb/scylladb/issues/12753

This commit adds a new page that explains the ScyllaDB
versioning convention and the new ScyllaDB Enterprise
support policy.

Closes #13987
2023-05-23 11:08:38 +03:00
Pavel Emelyanov
9bdc0d3f44 test: Generalize touch_file() into test_utils.*
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:40:55 +03:00
Pavel Emelyanov
730c0439e0 test/database: Generalize file/dir touch and exists checks
There are test cases that implement the same set of lambda helpers. Keep them
common in this .cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:40:55 +03:00
Pavel Emelyanov
54fb8a022e test/sstables: Use seastar::file_exists() to check
There's a rather boring test_sstable_exists() helper in the test that
can be replaced with a more standard seastar API call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:40:54 +03:00
Pavel Emelyanov
c06b5e2714 test/sstables: Remove sstdesc
The helper class is used to transfer a directory name and a generation int
value into the compare_sstables() helper. Remove both; the utils/ stuff
is useful enough not to need wrappers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:40:20 +03:00
Pavel Emelyanov
c3dbe37669 test/sstables: Use compare_files from utils/ in sstable_test
There's yet another implementation of read-the-whole-file and
check-file-contents-matches helpers in the test. Replace it with the
utils/ facility. Next patch will be able to wash more stuff out of
this test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:39:32 +03:00
Pavel Emelyanov
6619e87b70 test/sstables: Use compare_files() from utils/ in sstable_3_x_test
There's a static helper under the same name that can be replaced with
utils/ one. The code here runs in async context to .get0() the result.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:39:31 +03:00
Pavel Emelyanov
1f4c3be50c test/util: Add compare_file() helpers
To be used later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:37:08 +03:00
Anna Stuchlik
508b68377e doc: add the upgrade guide from 5.2 to 5.3
Fixes https://github.com/scylladb/scylladb/issues/13288

This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to 5.3.
The details of the metric update will be added with a separate commit.

Closes #13960
2023-05-23 09:36:39 +02:00
Pavel Emelyanov
f9ff5cdfdf test, memtable: Use default prio class
Similarly to previous patch with view-building -- using default class is
OK for a unit test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:21:27 +03:00
Pavel Emelyanov
daa808aa21 test, memtable: Add default value for make_flush_reader() last arg
Many places call memtable::make_flush_reader() with default priority
class. Make it a default-arg for the method, other reader making methods
of memtable already have it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:20:37 +03:00
Pavel Emelyanov
5e0a1d7546 test, view_build: Use default prio class
The test case tries to be "correct" and calls sst->write_components()
with streaming priority class. It's a test anyway, no need to be too
diligent here

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:14:05 +03:00
Pavel Emelyanov
dd387d4ec1 test, sstables: Use implicit default prio class in dma_write()
Calls to file.dma_write() may omit specifying default prio class by hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:14:05 +03:00
Pavel Emelyanov
5392f845a4 test, sstables: Use default sstable::get_writer()'s prio class arg
The sstable::get_writer()'s prio class argument has its default value.
No need to pass it explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 10:14:05 +03:00
Michał Chojnowski
81a753e69f real_dirty_memory_accounter: document what the class is doing
The current documentation of real_dirty_memory_accounter is not
quite compelling enough.

One can find some proper documentation by digging through the
git log and reading the description of the commit which added it,
but it shouldn't be that way.

This patch replaces the current documentation of the class with something
more explanatory.

Closes #13927
2023-05-23 09:11:31 +03:00
Jan Ciolek
d2ef55b12c test: use NetworkTopologyStrategy in all unit tests
As described in https://github.com/scylladb/scylladb/issues/8638,
we're moving away from `SimpleStrategy`; in the future
it will become deprecated.

We should remove all uses of it and replace them
with `NetworkTopologyStrategy`.

This change replaces `SimpleStrategy` with
`NetworkTopologyStrategy` in all unit tests,
or at least in the ones where it was reasonable to do so.
Some of the tests were written explicitly to test the
`SimpleStrategy` strategy, or to test changing the keyspace from
`SimpleStrategy` to `NetworkTopologyStrategy`.
These tests were left intact.
It's still a feature that is supported,
even if it's slowly getting deprecated.

The typical way to use `NetworkTopologyStrategy` is
to specify a replication factor for each datacenter.
This could be a bit cumbersome: we would have to fetch
the list of datacenters, set the repfactors, etc.

Luckily there is another way - we can just specify
a replication factor to use for each existing
datacenter, like this:
```cql
CREATE KEYSPACE {} WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1};
```

This makes the change rather straightforward - just replace all
instances of `'SimpleStrategy'` with `'NetworkTopologyStrategy'`.

Refs: https://github.com/scylladb/scylladb/issues/8638

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13990
2023-05-23 08:52:56 +03:00
Anna Stuchlik
cad83bd53d doc: publish the docs for branch-5.3
Fixes https://github.com/scylladb/scylladb/issues/13969

This commit enables publishing the docs for branch-5.3
The documentation for version 5.3 will be marked as
unstable until it is released, and an appropriate
warning is in place.

Closes #13977
2023-05-23 08:10:55 +03:00
Kamil Braun
e9b7bf82b4 Merge 'test/topology: split raft upgrade tests and run first slowest' from Alecco
To speed up the total test suite run, split the raft upgrade tests and schedule the slowest ones first.

Closes #13951

* github.com:scylladb/scylladb:
  test/topology: run first slow raft upgrade tests
  test/topology: split raft upgrade tests
2023-05-22 21:38:41 +02:00
Avi Kivity
a7c2c9f92b Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types
2023-05-22 21:38:41 +02:00
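The fallback rule described in the merge above can be sketched roughly as follows. This is an illustrative reconstruction, not Scylla's actual `message` code; the struct and function names are assumptions:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A known tenant maps a connection's isolation cookie to a scheduling group.
struct tenant {
    std::string name;
    int scheduling_group;
};

// Match the isolation cookie against the known tenants. After the fix
// described above, an unknown cookie is assumed to be a statement tenant
// this node doesn't know yet (e.g. an enterprise service level), so it is
// matched to the default *statement* scheduling group instead of the
// system/default one.
int match_scheduling_group(const std::vector<tenant>& known,
                           const std::string& cookie,
                           int default_statement_group) {
    for (const auto& t : known) {
        if (t.name == cookie) {
            return t.scheduling_group;
        }
    }
    return default_statement_group;
}
```

With this rule, a mixed OSS/enterprise cluster keeps user workloads in the statement scheduling group and concurrency semaphore on both old and new nodes.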
Tomasz Grabiec
809ddd7f79 Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr
This refactoring is a follow-up for https://github.com/scylladb/scylladb/pull/13376, move per keyspace data structures related to topology changes from `token_metadata` to `erm`.

We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`.

At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring.

The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`. We store the endpoints list only once with a special key
instead of duplicating it for each `vnode` token.

The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation.

Closes #13715

* github.com:scylladb/scylladb:
  token_metadata_test: add a test for everywhere strategy
  token_metadata_test: check read_endpoints when bootstrapping first node
  token_metadata_test: refactor tests, extract create_erm
  token_metadata: drop has_pending_ranges and migration_info
  effective_replication_map: add has_pending_ranges
  token_metadata: drop update_pending_ranges
  effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
  token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers
  vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
  calculate_effective_replication_map: compute pending_endpoints and read_endpoints
  vnode_erm: optimize replication_map
  vnode_erm::get_range_addresses: use sorted_tokens
  abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
  sequenced_set: add extract_vector method
  effective_replication_map: clone_endpoints_gently -> clone_data_gently
  vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
  stall_free.hh: add clear_gently for rvalues
  stall_free.hh: relax Container requirement
  token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
  token_metadata: introduce topology_change_info
  token_metadata: replace set_topology_transition_state with set_read_new
2023-05-22 21:37:06 +02:00
Jan Ciolek
07e7724468 test/network_topology_strategy_test: Test NTS with replication_factor option in test_invalid_dcs
test_invalid_dcs is a test which has a list of incorrect replication factor
values, and tries to create keyspaces with these incorrect values.

The standard way of creating a NetworkTopologyStrategy keyspace
is to specify the replication factor for each specific datacenter,
but there's also a simpler way - a user can just write: 'replication_factor': X
to convey that all of the current datacenters should have replication_factor X.

This way of creating a NetworkTopologyStrategy wasn't tested by test_invalid_dcs,
let's add it to the test to improve coverage.

Refs: https://github.com/scylladb/scylladb/issues/13986

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-22 17:56:27 +02:00
Jan Ciolek
9f5a55bcb9 ks_prop_defs: disallow empty replication factor string in NTS
A CREATE KEYSPACE query which specifies an empty string ('')
as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```

This is wrong; it's invalid to have an empty replication factor string.
It creates a keyspace without any replication, so the tables inside of it aren't writable.

Trying to create a `SimpleStrategy` keyspace with such replication factor throws an error,
`NetworkTopologyStrategy` should do the same.

The problem was in `prepare_options`: it treated an empty replication factor string
as no replication factor. Changing it to `std::optional` fixes the problem.
Now `std::nullopt` means no replication factor, and `make_optional("")` means
that there is a replication factor, but it's described by an empty string.

Fixes: https://github.com/scylladb/scylladb/issues/13986

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-22 17:56:16 +02:00
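The `std::optional` distinction this commit relies on can be illustrated with a minimal sketch. The function names are hypothetical, not the actual `ks_prop_defs` API:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Look up the replication factor option. std::nullopt means the option was
// not specified at all (allowed: per-DC factors may be used instead), while
// a present-but-empty string is preserved so the caller can reject it.
std::optional<std::string> get_replication_factor(
        const std::map<std::string, std::string>& options) {
    auto it = options.find("replication_factor");
    if (it == options.end()) {
        return std::nullopt;
    }
    return it->second;
}

// An absent replication factor is fine; an empty string is invalid.
bool is_valid_rf(const std::optional<std::string>& rf) {
    return !rf.has_value() || !rf->empty();
}
```

Before the fix, both cases collapsed into "no replication factor", which is why `'replication_factor': ''` silently created a keyspace without replication.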
Guy Shtub
eefaad189a fixed broken links, added community forum link and university link, fixed other mistakes 2023-05-22 13:12:16 +03:00
Alejo Sanchez
1940016cd1 test.py: warn and skip for missing unit/boost tests
If the executable of a matching unit or boost test is not present, warn
to console and skip.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13949
2023-05-22 12:49:32 +03:00
Tomasz Grabiec
9d4bca26cc Merge 'raft topology: implement check_and_repair_cdc_streams API' from Kamil Braun
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.

It also works when you change the number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.

This PR implements the API on top of raft topology, it was originally
implemented using gossiper.  It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.

Closes #13683

* github.com:scylladb/scylladb:
  docs: update topology-over-raft.md
  test: topology_experimental_raft: test `check_and_repair_cdc` API
  raft topology: implement `check_and_repair_cdc_streams` API
  raft topology: implement global request handling
  raft topology: introduce `prepare_new_cdc_generation_data`
  raft_topology: `get_node_to_work_on_opt`: return guard if no node found
  raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
  raft topology: separate `publish_cdc_generation` state
  raft topology: non-node-specific `exec_global_command`
  raft topology: introduce `start_operation()`
  raft topology: non-node-specific `topology_mutation_builder`
  topology_state_machine: introduce `global_topology_request`
  topology_state_machine: use `uint16_t` for `enum_class`es
  raft topology: make `new_cdc_generation_data_uuid` topology-global
2023-05-22 11:33:58 +02:00
Kefu Chai
8d79811c6a scripts/refresh-submodules.sh: use the correct sha1 in title
0d4ffe1d69 introduced a regression where
it used the sha1 of the local "master" branch instead of the remote's
"master" branch in the title of the commit message.

in this change, let's use the origin/${branch}'s sha1 in the title.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13974
2023-05-22 12:10:03 +03:00
Botond Dénes
93e4671c83 Merge 'doc: add a cloud instance recommendations page' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/13808

This commit moves cloud instance recommendations from the Requirements page to a new dedicated page.

The content of subsections is simply copy-pasted, but I added the introduction and metadata for better
searchability.

Closes #13935

* github.com:scylladb/scylladb:
  doc: add a cloud instance recommendations page
2023-05-22 08:38:40 +03:00
Tomasz Grabiec
c39332710d test: test_tablets: materialize all rows from the result set
When paging, iterating twice over the result set is not possible,
making the second loop a no-op.
2023-05-21 19:49:57 +03:00
Tomasz Grabiec
1d0be495b6 test: test_tablets: Reconnect the driver after rolling restart
Fixes sporadic failures to execute INSERT which follows the restart:

   cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.194.238.2:9042 datacenter1>: ConnectionShutdown('Connection to 127.194.238.2:9042 is closed')})
2023-05-21 19:49:23 +03:00
Tomasz Grabiec
493e7fc3de main: Load tablet metadata after schema commit log replay
There could be system.tablet mutations in the schema commit log. We
need to see them before loading sstables of user tables because we
need sharding information.
2023-05-21 18:50:11 +03:00
Kefu Chai
3928a9a4e9 counters: specialize fmt::formatter<counter_{shard,cell}_view>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `counter_shard_view` and `counter_cell_view` without the
help of `operator<<`.

the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13967
2023-05-21 17:13:06 +03:00
Petr Gusev
095f35a47d token_metadata_test: add a test for everywhere strategy 2023-05-21 13:17:42 +04:00
Petr Gusev
8877641b0f token_metadata_test: check read_endpoints when bootstrapping first node 2023-05-21 13:17:42 +04:00
Petr Gusev
e9a6fcc8e1 token_metadata_test: refactor tests, extract create_erm
No logical changes, just tidied up
2023-05-21 13:17:42 +04:00
Petr Gusev
5976277c2c token_metadata: drop has_pending_ranges and migration_info
Use the new erm::has_pending_ranges function, drop the
old implementation from token_metadata.
2023-05-21 13:17:42 +04:00
Petr Gusev
5495065242 effective_replication_map: add has_pending_ranges
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
2023-05-21 13:17:42 +04:00
Petr Gusev
8cb709d3d6 token_metadata: drop update_pending_ranges
The function storage_service::update_pending_ranges is
renamed to update_topology_change_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
2023-05-21 13:17:42 +04:00
Petr Gusev
87307781c4 effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
We already use the new pending_endpoints from erm through
the get_pending_ranges virtual function. In this commit
we update all the remaining places to use the new
implementation in erm, as well as remove the old implementation
in token_metadata.
2023-05-21 13:17:42 +04:00
Petr Gusev
d4f004f5c7 token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers
We want to switch token_metadata_test to the new
implementation of pending_endpoints and read_endpoints in erm.
To do this, it is convenient to have token_metadata and
replication_strategy as shared pointers, as it fits better with the signature
of calculate_effective_replication_map. In this commit we don't
change the logic of the tests, we just migrate them to use pointers.
2023-05-21 13:17:42 +04:00
Petr Gusev
e22a5c42c8 vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference is that we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.

The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.

For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
2023-05-21 13:17:42 +04:00
Petr Gusev
fbe3254a9e calculate_effective_replication_map: compute pending_endpoints and read_endpoints
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.

In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
2023-05-21 13:17:42 +04:00
Petr Gusev
a8c36aad0b vnode_erm: optimize replication_map
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.

We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them as empty sorted_tokens() is not a problem, we
store endpoints with a special key.

The function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by value,
but for for_each_natural_endpoint_until a reference is sufficient.
2023-05-21 13:17:42 +04:00
Beni Peled
1e63cf6c50 release: prepare for 5.4.0-dev 2023-05-21 10:39:21 +03:00
Petr Gusev
b9812023c6 vnode_erm::get_range_addresses: use sorted_tokens
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.

It's possible that the callers of this function
are ok with single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going in that direction.
2023-05-21 11:33:38 +04:00
Petr Gusev
99ff1fefe5 abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
We are going to use this function in vnode_erm::get_natural_endpoints,
so for efficiency it's better to have fewer virtual calls.
2023-05-21 11:33:38 +04:00
Petr Gusev
e0bc98a217 sequenced_set: add extract_vector method
Can be useful if we want to reuse the vector when
we are done with this sequenced_set instance.
2023-05-21 11:33:38 +04:00
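An `extract_vector`-style method can be sketched generically as below. This is an illustrative sketch of the idea (move the backing storage out of a spent set), not Scylla's actual `utils::sequenced_set`:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A set that preserves insertion order, backed by a vector.
template <typename T>
class sequenced_set {
    std::vector<T> _items;
public:
    // Insert v only if it is not already present.
    void push_back(const T& v) {
        for (const auto& x : _items) {
            if (x == v) {
                return;
            }
        }
        _items.push_back(v);
    }

    // Move the backing vector out. Rvalue-qualified: callable only on a
    // set we are done with, so the storage is reused rather than copied.
    std::vector<T> extract_vector() && {
        return std::move(_items);
    }
};
```

The `&&` qualifier makes the "we are done with this instance" contract explicit at the call site (`std::move(s).extract_vector()`).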
Petr Gusev
6f12c72c3f effective_replication_map: clone_endpoints_gently -> clone_data_gently
We need to account for the new fields in the clone implementation.

The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
2023-05-21 11:33:38 +04:00
Petr Gusev
959f9757d3 vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.

vnode_efficient_replication_map::clear_gently is removed as
it was not used.
2023-05-21 11:33:38 +04:00
Petr Gusev
700eb90ed8 stall_free.hh: add clear_gently for rvalues 2023-05-21 11:33:33 +04:00
Petr Gusev
4a127c3782 stall_free.hh: relax Container requirement
We don't use the return value of erase, so
we can allow it to return anything. We'll
need this for ring_mapping, since
boost::icl::interval_map::erase(it)
returns void.
2023-05-19 22:11:09 +04:00
Petr Gusev
084abc0e44 token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
In this commit, we just add fields and pass them through
the constructor. Calculation and usage logic will be added later.
2023-05-19 19:04:43 +04:00
Petr Gusev
10bf8c7901 token_metadata: introduce topology_change_info
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.

At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data and create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.

The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.

The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
2023-05-19 19:04:43 +04:00
Petr Gusev
51e80691ef token_metadata: replace set_topology_transition_state with set_read_new
This helps isolate topology::transition_state dependencies,
token_metadata doesn't need the entire enum, just this
boolean flag.
2023-05-19 19:04:43 +04:00
Anna Stuchlik
e106f6714d Merge branch 'scylladb:master' into anna-cloud-recommendation-pages 2023-05-19 12:27:42 +02:00
Botond Dénes
3b424e391b Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.

Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.

Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.

Fixes #9559

Closes #13812

* github.com:scylladb/scylladb:
  table: signal compaction_manager when staging sstables become eligible for cleanup
  compaction_manager: perform_cleanup: wait until all candidates are cleaned up
  compaction_manager: perform_cleanup: perform_offstrategy if needed
  compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
  sstable_set: add for_each_sstable_gently* helpers
2023-05-19 12:35:59 +03:00
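The retry-until-timeout policy in the merge above (poll once a second, give up after five minutes without progress) can be abstracted as a simple loop. This is a rough sketch with assumed names, not the actual `compaction_manager` code; the real implementation co_awaits a one-second sleep between polls and tracks cleanup progress:

```cpp
#include <cassert>
#include <functional>

// Poll all_cleaned() up to max_attempts times; return true as soon as it
// reports completion, false if the attempt budget is exhausted. The sleep
// between polls and the no-progress tracking are elided here.
bool wait_until_cleaned(const std::function<bool()>& all_cleaned,
                        int max_attempts) {
    for (int i = 0; i < max_attempts; ++i) {
        if (all_cleaned()) {
            return true;
        }
        // real code: co_await sleep(1s); abort if no progress for 5 minutes
    }
    return false;
}
```

Bounding the wait is what prevents `perform_cleanup` from hanging forever on a view-building bug while still letting staging sstables become eligible over time.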
Anna Stuchlik
a456222ec4 doc: add a cloud instance recommendations page
Fixes https://github.com/scylladb/scylladb/issues/13808

This commit moves cloud instance recommendations from
the Requirements page to a new dedicated page.

The content of subsections is simply copy-pasted, but
I added the introduction and metadata for better
searchability.
2023-05-19 11:08:16 +02:00
Kefu Chai
031f770557 install.sh: use scylla-jmx for detecting JRE
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13514
2023-05-19 11:22:57 +03:00
Kefu Chai
8bb1f15542 test: sstable_3_x_test: avoid using helper using generation_type::int_t
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13931
2023-05-19 11:21:35 +03:00
Tomasz Grabiec
05be5e969b migration_manager: Fix snapshot transfer failing if TABLETS feature is not enabled
Without the feature, the system schema doesn't have the table, and the
read will fail with:

   Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)

We should not attempt to read tablet metadata if the experimental
feature is not enabled.

Fixes #13946
Closes #13947
2023-05-19 09:58:56 +02:00
Botond Dénes
c2aee26278 Merge 'Keep sstables garbage collection in sstable_directory' from Pavel Emelyanov
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.

Once the garbage collection code is moved, the sstable class's list of static helpers can be cleaned up a bit.

refs: #13024
refs: #13020
refs: #12707

Closes #13767

* github.com:scylladb/scylladb:
  sstable: Toss tempdir extension usage
  sstable: Drop pending_delete_dir_basename()
  sstable: Drop is_pending_delete_dir() helper
  sstable_directory: Make garbage_collect() non-static
  sstable_directory: Move deletion log exists check
  distributed_loader: Move garbage collecting into sstable_directory
  distributed_loader: Collect garbage collecting in one call
  sstable: Coroutinize remove_temp_dir()
  sstable: Coroutinize touch_temp_dir()
  sstable: Use storage::temp_dir instead of hand-crafted path
2023-05-19 08:50:13 +03:00
Alejo Sanchez
4ed178c42e test/topology: run first slow raft upgrade tests
Mark the slowest raft upgrade tests to run first

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-05-19 01:10:41 +02:00
Alejo Sanchez
2de6b8f49c test/topology: split raft upgrade tests
Split raft upgrade tests to run in parallel by default

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-05-19 01:07:41 +02:00
Jan Ciolek
1bcb4c024c cql3/expr: print expressions in user-friendly way by default
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.

`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.

`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.

The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.

It's cumbersome and people forget about it,
so let's change the default to `user`.

There were issues about expressions being printed
in a strange way; this fixes them.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13916
2023-05-18 20:57:00 +03:00
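The user/debug printing split and the default change described above can be illustrated with a toy type. The names here are hypothetical, not the actual `cql3/expr` API:

```cpp
#include <cassert>
#include <string>

// Two rendering modes: `user` produces parseable CQL, `debug` exposes
// internal structure that cannot be parsed back.
enum class print_mode { user, debug };

struct column_ref {
    std::string name;

    // Defaulting to print_mode::user means user-facing messages (e.g.
    // invalid_request_exception text) get readable CQL without every call
    // site having to remember to opt in.
    std::string to_string(print_mode m = print_mode::user) const {
        if (m == print_mode::debug) {
            return "unresolved_identifier(" + name + ")";
        }
        return name;
    }
};
```

Call sites that genuinely want the internal form still pass `print_mode::debug` explicitly; forgetting the argument now yields the user-friendly output instead of the strange one.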
Kamil Braun
64dc76db55 test: pylib: fix read_barrier implementation
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.

Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.

This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).

Closes #13933
2023-05-18 18:30:11 +02:00
Kamil Braun
13df85ea11 Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov
This implicit link is pretty bad, because the feature service is a low-level
one which lots of other services depend on. System keyspace is the opposite
-- a high-level one that needs e.g. the query processor and database to
operate. This inverse dependency is created by the feature service's need
to commit enabled features' names into the system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).

The dependency can be cut. The only place where enabled features are
committed is when the gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.

Although the gossiper already has a system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with the needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.

Fixes: #13837

Closes #13172

* github.com:scylladb/scylladb:
  gossiper: Remove features and sysks from gossiper
  system_keyspace: De-static save_local_supported_features()
  system_keyspace: De-static load_|save_local_enabled_features()
  system_keyspace: Move enable_features_on_startup to feature_service (cont)
  system_keyspace: Move enable_features_on_startup to feature_service
  feature_service: Open-code persist_enabled_feature_info() into enabler
  gms: Move feature enabler to feature_service.cc
  gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
  gms: Persist features explicitly in features enabler
  feature_service: Make persist_enabled_feature_info() return a future
  system_keyspace: De-static load_peer_features()
  gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
  gossiper: Enable features and register enabler from outside
  gms: Add feature_service and system_keyspace to feature_enabler
2023-05-18 18:21:06 +02:00
Gleb Natapov
701d6941a5 storage_proxy: raft topology: use gossiper state to populate peers table
Some state that is used to fill in the 'peers' table is still propagated
over gossiper. When moving a node into the normal state, the raft
topology code uses the data from the gossiper to populate the peers table, because
storage_service::on_change() will not do it in case the node was not in
the normal state at the time it was called.

Fixes: #13911

Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
2023-05-18 16:00:29 +02:00
Pavel Emelyanov
5216dcb1b3 Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.

This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.

This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.

Refs: #11870

Closes #13869

* github.com:scylladb/scylladb:
  db/system_keyspace: remove dependency on storage_proxy
  db/system_keyspace: replace storage_proxy::query*() with  replica:: equivalent
  replica: add query.hh
2023-05-18 10:53:27 +03:00
Raphael S. Carvalho
38b226f997 Resurrect optimization to avoid bloom filter checks during compaction
Commit 8c4b5e4283 introduced an optimization which only
calculates the max purgeable timestamp when a tombstone satisfies the
grace period.

Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.

This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.

The reason is simple: now the calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.

Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
    45.64%    45.64%  sstable_compact  sstable_compaction_test_g
        [.] utils::filter::bloom_filter::is_present

With its resurrection, the problem is gone.

This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.

Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])

[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.

After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).

With repair mode for tombstone GC, the assumption that retrieval
of gc_before is more expensive than calculating max purgeable
is kept; we can revisit it later. But with the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13908
2023-05-18 09:01:50 +03:00
Kefu Chai
03be1f438c sstables: move get_components_lister() into sstable_directory
sstables_manager::get_component_lister() is used by sstable_directory,
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. Among other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. There is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. So let's
move this helper into sstable_directory as a member function.

With this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13853
2023-05-18 08:43:35 +03:00
Botond Dénes
88a2421961 Merge 'Generalize global table pointer' from Pavel Emelyanov
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.

This PR generalizes both into global_table_ptr facility.

Closes #13909

* github.com:scylladb/scylladb:
  replica: Use global_table_ptr in distributed loader
  replica: Make global_table_ptr a class
  replica: Add type alias for vector of foreign lw-pointers
  replica: Put get_table_on_all_shards() to header
  replica: Rewrite get_table_on_all_shards()
2023-05-18 08:42:04 +03:00
Kefu Chai
8bcbc9a90d sstables: add a maybe_owned_by_this_shard() helper
Instead of encoding the fact that we are using the generation identifier
as a hint for where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. This helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13846
2023-05-18 08:41:02 +03:00
Benny Halevy
8a7e77e0ed gossiper: is_alive: fix use-after-move if endpoint is unknown
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()

Fixes #13921

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13920
2023-05-17 21:57:26 +03:00
Pavel Emelyanov
c3fca9481c replica: Use global_table_ptr in distributed loader
The loader has a very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 18:14:34 +03:00
Pavel Emelyanov
d7f99d031d replica: Make global_table_ptr a class
Right now all users of global_table know it's a vector and reference its
elements with the this_shard_id() index. Making global_table_ptr a class
makes it possible to stop using operator[] and instead "index" by
this_shard_id() in its -> and * operators.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 18:14:34 +03:00
Pavel Emelyanov
b4a8843907 replica: Add type alias for vector of foreign lw-pointers
This is to convert the global_table_ptr into a class with a less bulky
patch later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 18:14:34 +03:00
Pavel Emelyanov
fffe3e4336 replica: Put get_table_on_all_shards() to header
This is to share it with distributed loader some time soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 18:14:34 +03:00
Pavel Emelyanov
f974617c79 replica: Rewrite get_table_on_all_shards()
Use sharded<database>::invoke_on_all() instead of an open-coded analogue.
Also don't access the database's _column_families directly; use the
find_column_family() method instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 18:14:34 +03:00
Jan Ciolek
9223cd8c3a statement_restrictions: add get_not_null_columns()
The IS NOT NULL restrictions are handled in a special way.
Instead of putting them together with other restrictions,
statement_restrictions collects all columns restricted
by IS NOT NULL and puts them in the _not_null_columns field.

Add a getter to access this set of columns.
The field is private, so it can't be accessed without
a function that explicitly exposes it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-17 16:12:10 +02:00
Jan Ciolek
7f0c64a69d test: remove invalid IS NOT NULL restrictions from tests
The IS NOT NULL restriction is currently supported
only in CREATE MATERIALIZED VIEW statements.
These restrictions work correctly for columns
that are part of the view's primary key,
but they're silently ignored on other columns.

The following commits will forbid placing
the IS NOT NULL restriction on columns
that aren't a part of the view's primary key.
The tests have to be modified in order
to pass, because some of them have
a useless IS NOT NULL restriction
on regular columns that don't belong
to the view's primary key.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-17 15:38:03 +02:00
Pavel Emelyanov
ed50fda1fe sstable: Toss tempdir extension usage
The tempdir for filesystem-based sstables is the {generation}.sstable one.
There are two places that need to know the ".sstable" extension -- the
tempdir creating code and the tempdir garbage-collecting code.

This patch simplifies the sstable class by patching the aforementioned
functions to use the newly introduced tempdir_extension string directly,
without the help of static one-line helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:19:38 +03:00
Pavel Emelyanov
e8c0ae28b5 sstable: Drop pending_delete_dir_basename()
The helper just returns the const char* value of the pending delete
dir name. Callers can use the value directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:17:33 +03:00
Pavel Emelyanov
7792479865 sstable: Drop is_pending_delete_dir() helper
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:17:32 +03:00
Pavel Emelyanov
7429205632 sstable_directory: Make garbage_collect() non-static
When non-static, the call can use the sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto the lister so that filesystem and ownership-table listers can
process dangling bits differently.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
45adf61490 sstable_directory: Move deletion log exists check
Check if the deletion log exists in the handling helper, not outside of
it. This makes the next patch shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
3d7122d2fe distributed_loader: Move garbage collecting into sstable_directory
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.

First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert the sstring dir into an fs::path dir and switch the logger.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
99f924666f distributed_loader: Collect garbage collecting in one call
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
22299a31c8 sstable: Coroutinize remove_temp_dir()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
9db5e9f77f sstable: Coroutinize touch_temp_dir()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:15:38 +03:00
Pavel Emelyanov
7e506354fd sstable: Use storage::temp_dir instead of hand-crafted path
When opening an sstable on the filesystem, it's first created in a temporary
directory whose path is saved in the storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:14:04 +03:00
Anna Stuchlik
6f4a68175b doc: fix the links to the Enterprise docs
Fixes https://github.com/scylladb/scylladb/issues/13915

This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.

This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.

Closes #13917
2023-05-17 13:56:21 +03:00
Benny Halevy
bb59687116 table: signal compaction_manager when staging sstables become eligible for cleanup
perform_cleanup may be waiting for those sstables
to become eligible for cleanup, so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:33:22 +03:00
Benny Halevy
a5a8020ecd compaction_manager: perform_cleanup: wait until all candidates are cleaned up
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.

Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.

Timeout if there is no progress within 5 minutes
to prevent hanging due to a view building bug.

Fixes #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:31:07 +03:00
Benny Halevy
be4e23437f compaction_manager: perform_cleanup: perform_offstrategy if needed
It is possible that cleanup will be executed
right after repair-based node operations,
in which case we have a 5-minute timer
before off-strategy compaction is started.

After marking the sstables that need cleanup,
perform offstrategy compaction, if needed.
This will implicitly cleanup those sstables
as part of offstrategy compaction, before
they are even passed for view update (if the table
has views/secondary index).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:31:07 +03:00
Benny Halevy
53fbf9dd32 compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
Scan all sstables to determine which of them
requires cleanup before calling perform_task_on_all_files.

This allows for cheaper no-op return when
no sstable was identified as requiring cleanup,
and also it will allow triggering offstrategy
compaction if needed, after selecting the sstables
for cleanup, in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:31:07 +03:00
Benny Halevy
ff7c9c661d sstable_set: add for_each_sstable_gently* helpers
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.

Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:31:07 +03:00
Kefu Chai
6cd745fd8b build: cmake: add missing test
string_format_test was added in 1b5d5205c8,
so let's add it to the CMake build system as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13912
2023-05-17 09:51:51 +03:00
Raphael S. Carvalho
5544d12f18 compaction: avoid excessive reallocation during input list formatting
With off-strategy, the input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.

In the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13907
2023-05-17 09:40:06 +03:00
Benny Halevy
302a89488a test: sstable_3_x_test: add test_compression_premature_eof
Reproduces #13599 and verifies the fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13903
2023-05-17 09:00:44 +03:00
Gleb Natapov
605e53e617 do not report raft as enabled before group0 is configured
Currently we may start to receive requests before group0 is configured
during boot.  If that happens those requests may try to pull schema and
issue raft read barrier which will crash the system because group0 is
not yet available. Workaround it by pretending the raft is disabled in
this case and use non raft procedure. The proper fix should make sure
that storage proxy verbs are registered only after group0 is fully
functional.

Message-Id: <ZGOZkXC/MsiWtNGu@scylladb.com>
2023-05-17 01:06:42 +02:00
Michał Chojnowski
9b0679c140 range_tombstone_change_generator: fix an edge case in flush()
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such a case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes #12462

Closes #13906
2023-05-16 17:54:08 +02:00
Nadav Har'El
24c3cbcb0b Merge 'Improve verbosity of test/pylib/minio.py' from Pavel Emelyanov
CI once failed due to mc being unable to configure the minio server. There are currently no clues as to why it could happen; let's increase minio.py's verbosity a bit.

refs: #13896

Closes #13901

* github.com:scylladb/scylladb:
  test,minio: Run mc with --debug option
  test,minio: Log mc operations to log file
2023-05-16 18:04:36 +03:00
Nadav Har'El
52e4edfd5e Merge 'cql: update permissions when creating/altering a function/keyspace' from Wojciech Mitros
Currently, when a user creates a function or a keyspace, no
permissions on functions are updated.
Instead, the user should gain all permissions on the function
that they created, or on all functions in the keyspace they have
created. This is also the behavior in Cassandra.

However, if the user is granted permissions on an function after
performing a CREATE OR REPLACE statement, they may
actually only alter the function but still gain permissions to it
as a result of the approach above, which requires another
workaround added to this series.

Lastly, as of right now, when a user is altering a function, they
need both CREATE and ALTER permissions, which is incompatible
with Cassandra - instead, only the ALTER permission should be
required.

This series fixes the mentioned issues, and the tests are already
present in the auth_roles_test dtest.

Fixes #13747

Closes #13814

* github.com:scylladb/scylladb:
  cql: adjust tests to the updated permissions on functions
  cql: fix authorization when altering a function
  cql: grant permissions on functions when creating a keyspace/function
  cql: pass a reference to query processor in grant_permissions_to_creator
  test_permissions: make tests pass on cassandra
2023-05-16 18:04:35 +03:00
Avi Kivity
d2d53fc1db Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.

get_endpoints is used here by gossiper and storage_service for iterations that may preempt,
instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`), so as to prevent use-after-free that may potentially happen if the map is rehashed while the function yields, invalidating the loop iterators.

Fixes #13899

Closes #13900

* github.com:scylladb/scylladb:
  storage_service: do not preempt while traversing endpoint_state_map
  gossiper: do not preempt while traversing endpoint_state_map
2023-05-16 18:04:35 +03:00
Botond Dénes
3ea521d21b Update tools/jmx submodule
* tools/jmx f176bcd1...1fd23b60 (1):
  > select-java: query java version using -XshowSettings
2023-05-16 18:04:35 +03:00
Kamil Braun
5a8e2153a0 Merge 'Fix heart_beat_state::force_highest_possible_version_unsafe' from Benny Halevy
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since [gms: heart_beat_state: use generation_type and version_type](4cdad8bc8b)
(merged in [Merge 'gms: define and use generation and version types'...](7f04d8231d)).

Implement min/max correctly.

Fixes #13801

Closes #13880

* github.com:scylladb/scylladb:
  storage_service: handle_state_normal: on_internal_error on "owns no tokens"
  utils: tagged_integer: implement std::numeric_limits::{min,max}
  test: add tagged_integer_test
2023-05-16 13:59:41 +02:00
Wojciech Mitros
6bc16047ba rust: update wasmtime dependency
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.

We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.

Fixes #13807

Closes #13804
2023-05-16 13:03:29 +03:00
Pavel Emelyanov
29fffaa160 schema_tables: Use sharded<database>& variable
The auto& db = proxy.local().get_db() is called a few lines above this
patch, so the &db can be reused for the invoke_on_all() call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13896
2023-05-16 12:57:47 +03:00
Benny Halevy
1da0b0ff76 storage_service: do not preempt while traversing endpoint_state_map
The map iterators might be invalidated while yielding
on insert if the map is rehashed.
See https://en.cppreference.com/w/cpp/container/unordered_map/insert

Refs #13899

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-16 12:24:44 +03:00
Benny Halevy
ba13056eba gossiper: do not preempt while traversing endpoint_state_map
The map iterators might be invalidated while yielding
on insert if the map is rehashed.
See https://en.cppreference.com/w/cpp/container/unordered_map/insert

Refs #13899

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-16 12:24:42 +03:00
Pavel Emelyanov
b58ad040d2 sstables: Switch data and index sink to use jumbo uploader
These two can grow large. The non-jumbo sink is effectively limited to
10000 parts; since each is ~5Mb, the maximum uploadable data/index
happens to be 50Gb, which is too small.

Other components shouldn't grow that big, so they continue using the
simple and a bit faster uploading sink.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
b3df2d0db0 s3/test: Tune-up multipart upload test alignment
Currently the test uses a sequence of 1024-byte buffers. This lets
the minio server actively de-duplicate those blocks by page boundary (it's a
guess, but it's plausible because minio reports back equivalent ETags
for lots of uploaded parts). Make the buffer size not a power of two so
that, when squashed together, the resulting 2^X buffers don't end up equal.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
fffa04fa67 s3/test: Add jumbo upload test
It re-uses most of the existing upload sink test, but configures the
jumbo sink with at most 3 parts in each intermediate object, so as not
to upload a 50Gb part before switching to the next one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
908d0d2e6a s3/client: Wait for background upload fiber on close-abort
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case the client needs to abort the operation,
it calls .close() without flush()ing. In this case the S3 API Abort is
issued and the sink can be terminated. It's expected that the background
fibers would resolve on their own eventually, but that's not quite the
case.

First, they hold units for the semaphore and the semaphore should be
alive by the time units are returned.

Second, the PUT (or copy) request can finish successfully and it may be
sitting in the reactor queue waiting for its continuation to get
scheduler. The continuation references sink via "this" capture to put
the part etag.

Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.

That said -- background fibers must be waited upon in .close() if the
closing is aborting (if it's a successful close, then the fibers must
have been picked up by the final flush() call).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
f9686926c2 s3/client: Implement jumbo upload sink
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.

To make it work, the new sink inherits from the uploading base class, but
instead of keeping memory_data_sink_buffers with parts, it keeps a sink
to upload a temporary intermediate object with parts. When the object is
"full", i.e. the number of parts in it hits the limit, the object is
flushed, then copied into the target object with the S3 API call, and
then the intermediate object is deleted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Pavel Emelyanov
8fa3294ae1 s3/client: Move memory buffers to upload_sink from base
All the buffer manipulations now happen in the upload_sink class and
the respective member can be removed from the base class. The base class
only messes with the buffers in its upload_part() call, but that's
unavoidable, as uploading a part implies sending its contents, which sit
in the buffers.

Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
2ac5ecd659 s3/client: Move last part upload out of finalize_upload()
This change has two reasons. First is to facilitate moving the
memory_data_sink_buffers from the base class, i.e. -- a continuation of the
previous patch. Also this fixes a corner case -- if the final sink flush
happens right after the previous part was sent for uploading, the
finalization doesn't happen and sink closing aborts the upload even if
it was successful.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
407b40c430 s3/client: Merge do_flush() with upload_part()
The do_flush() helper is practically useless because what it does can be
done by the upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from base class to uploader class by next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
a88629227f s3/client: Rename upload_sink -> upload_sink_base
There will appear another sink that would implement multipart upload
with the help of copy-part functionality. Current uploading code is
going to be partially re-used, so this patch moves all of it into the
base class in advance. Next patches will pick needed parts.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:19:50 +03:00
Pavel Emelyanov
01628ae8c1 test,minio: Run mc with --debug option
With that, if mc fails, we'll (hopefully) get some meaningful information
about why it happened.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:16:15 +03:00
Pavel Emelyanov
4041c2f30d test,minio: Log mc operations to log file
Currently everything minio.py does goes to the test.py log, while mc (and
minio) output goes to another log file. That's inconvenient; it's better to
keep minio.py's messages in the minio log file.

Also, while at it, print a message if the local alias drop fails (it's a
benign failure, but it's good to have the note anyway).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:14:49 +03:00
Kefu Chai
67dae95f58 build: cmake: add Scylla_USE_LINKER option
This option allows the user to use a specified linker instead of the
default one. This is more flexible than adding more linker
candidates to the known linkers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13874
2023-05-16 11:30:18 +03:00
Tzach Livyatan
a73fde6888 Update Azure recommended instances type from the Lsv2-series to the Lsv3-series
Closes #13835
2023-05-16 10:58:19 +03:00
Avi Kivity
3c54d5ec5e test: string_format_test: don't compare std::string with sstring
For unknown reasons, clang 16 rejects equality comparison
(operator==) where the left-hand side is an std::string and the
right-hand side is an sstring. gcc and older clang versions first
convert the left-hand side to an sstring and then call the symmetric
equality operator.

I was able to hack sstring to support this asymmetric comparison,
but the solution is quite convoluted, and it may be that clang is
at fault here. So instead this patch eliminates the three cases where
it happened. With this applied, we can build with clang 16.

Closes #13893
2023-05-16 08:56:16 +03:00
Kefu Chai
b112a3b78a api: storage_service: use string for generation
in this change, the type of the "generation" field of "sstable" in the
return value of the RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".

this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.

this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData, where "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by nodetool, which prints out the returned SSTable info list as
a pretty-formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, nor does it enforce the type -- it just
tries its best to pretty-print them as a table.

But the fields in CompositeData are typed: when the scylla JMX exporter
translates the SSTables returned from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So we should be consistent on the type of the "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
a dependency on exactly the same version when shipping deb and rpm
packages, we should be safe when it comes to the interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13834
2023-05-15 20:33:48 +03:00
Botond Dénes
646396a879 mutation/mutation_partition: append_clustered_row(): use on_internal_error()
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.

Fixes: #13876

Closes #13883
2023-05-15 20:31:44 +03:00
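The behavior change described here — logging with a backtrace before failing, instead of throwing a bare exception — can be sketched in Python. Everything below is an illustrative analogy (`ABORT_ON_INTERNAL_ERROR`, `on_internal_error`, the log list), not Scylla's actual C++ API:

```python
import traceback

# Illustrative stand-in for Scylla's abort_on_internal_error config flag.
ABORT_ON_INTERNAL_ERROR = False

class InternalError(Exception):
    """Raised in production builds instead of a bare generic exception."""

def on_internal_error(log, msg):
    # Capture the call stack so the failing call site of a very generic
    # API (like append_clustered_row()) can be identified from the log.
    backtrace = "".join(traceback.format_stack())
    log.append(f"internal error: {msg}\n{backtrace}")
    if ABORT_ON_INTERNAL_ERROR:
        raise SystemExit(1)  # stand-in for abort()
    raise InternalError(msg)

log = []
try:
    on_internal_error(log, "append_clustered_row(): invalid row position")
except InternalError:
    pass
print(log[0].splitlines()[0])  # → internal error: append_clustered_row(): invalid row position
```

The key point matches the commit: the message alone is not enough to find the bug; the captured backtrace is what lets you zero in on the caller.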
Calle Wilund
469e710caa docs: Add initial doc on commitlog segment file format
Refs #12849

Just a few lines on the file format of segments.

Closes #13848
2023-05-15 16:22:44 +03:00
Benny Halevy
502b5522ca storage_service: handle_state_normal: on_internal_error on "owns no tokens"
Although this condition should not happen,
we suspect that certain timing conditions might
lead to this state, where the node in
handle_state_normal (possibly during shutdown) has no tokens.

Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.

Change that to `on_internal_error` so as to throw an
exception in production in this unexpected state.

Refs #13801

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-15 12:49:17 +03:00
Anna Stuchlik
84ed95f86f doc: add OS support for version 2023.1
Fixes https://github.com/scylladb/scylladb/issues/13857

This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.

After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and be available in
the docs for Enterprise 2023.1.

Closes: #13858
2023-05-15 10:51:53 +03:00
Alejo Sanchez
19687b54f1 test/pytest: yaml configuration cluster section
Separate cluster_size into a cluster section and specify this value as
initial_size.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13440
2023-05-15 09:48:39 +02:00
Benny Halevy
a70b53b6e7 utils: tagged_integer: implement std::numeric_limits::{min,max}
Add a respective unit test.

It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)

Implementing min/max correctly fixes the issue.

Fixes #13801

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-15 10:19:39 +03:00
Botond Dénes
0cff0ffa08 Merge 'alternator,config: make alternator_timeout_in_ms live-updateable' from Kefu Chai
before this change, alternator_timeout_in_ms is not live-updatable:
after the executor's default timeout is set right before creating the
sharded executor instances, they never get updated with this option
again. but many users would like to set their driver timers based on
the server timers, so we need to let them configure the timeout even
while the server is running.

in this change,

* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local
   updateable_value, so it can be updated on a per-shard basis, and
   its variable name is updated accordingly. this value is set in the
   ctor of executor, and it is disconnected from the corresponding
   named_value<> option in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
   executor via sharded_parameter, so `executor::_timeout_in_ms` can
   be initialized on a per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
   the option to executor in its ctor.

Fixes #12232

Closes #13300

* github.com:scylladb/scylladb:
  alternator: split the param list of executor ctor into multi lines
  alternator,config: make alternator_timeout_in_ms live-updateable
2023-05-15 10:16:29 +03:00
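The live-update mechanism described above — an option value that pushes changes to per-shard observers instead of being read once at startup — can be sketched like this. This is an illustrative Python analogy; the class names are hypothetical and the real mechanism is Scylla's C++ `updateable_value`:

```python
class UpdateableValue:
    """A config option value that notifies subscribed observers on change."""
    def __init__(self, initial):
        self._value = initial
        self._observers = []

    def get(self):
        return self._value

    def observe(self, callback):
        # Deliver the current value immediately, then on every update.
        self._observers.append(callback)
        callback(self._value)

    def set(self, new_value):
        self._value = new_value
        for callback in self._observers:
            callback(new_value)

class Executor:
    """Per-shard executor whose timeout follows the live option,
    rather than caching it once at construction time."""
    def __init__(self, timeout_option):
        self.timeout_ms = timeout_option.get()
        timeout_option.observe(self._update_timeout)

    def _update_timeout(self, new_value):
        self.timeout_ms = new_value

option = UpdateableValue(10_000)
executor = Executor(option)
option.set(30_000)  # live update while the "server" keeps running
print(executor.timeout_ms)  # → 30000
```

The design choice mirrors the commit: the subscription happens in the executor's constructor, so every shard's executor tracks the option for its whole lifetime.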
Botond Dénes
6c27297406 Merge 'test: sstable_*test: use generator to create new generations' from Kefu Chai
in this series, instead of hardwiring to integers, we switch to a generation generator for creating new generations. this should help us migrate to a generation identifier which can also be represented by a UUID, and potentially can help improve the testing coverage once we switch over to a UUID-based generation identifier. we will need to parameterize these tests by then, for sure.

Closes #13863

* github.com:scylladb/scylladb:
  test: sstable: use generator to generate generations
  test: sstable: pass generation_type in helper functions
  test: sstable: use generator to generate generations
2023-05-15 10:04:30 +03:00
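The idea — routing every test through one generator so the generation representation can change in a single place — might look like this in Python. The class and parameter names are hypothetical illustrations, not the actual C++ test helpers:

```python
import itertools
import uuid

class GenerationGenerator:
    """Hands out fresh sstable generation identifiers.

    kind="int" mirrors today's integer-based generations; kind="uuid"
    shows how the same call sites could later switch to UUID-based
    identifiers without each test hardwiring integers.
    """
    def __init__(self, kind="int", start=1):
        self._kind = kind
        self._counter = itertools.count(start)

    def new_generation(self):
        if self._kind == "int":
            return next(self._counter)
        return uuid.uuid4()

gen = GenerationGenerator()
print([gen.new_generation() for _ in range(3)])  # → [1, 2, 3]

uuid_gen = GenerationGenerator(kind="uuid")
a, b = uuid_gen.new_generation(), uuid_gen.new_generation()
assert a != b  # every generation id is unique regardless of representation
```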
Botond Dénes
3256afe263 Update tools/jmx submodule
* tools/jmx 5f988945...f176bcd1 (1):
  > sstableinfo: change the type of generation to string

Refs: #13834
2023-05-15 09:59:40 +03:00
Asias He
93c93c69f9 repair: Add per peer node error for get_sync_boundary and friends
It is useful to know which node has the error. For example, when a node
has a corrupted sstable, with this patch, the repair master node can
tell which node has the corrupted sstable.

```
WARN  2023-05-15 10:54:50,213 [shard 0] repair -
repair[2df49b2c-219d-411d-87c6-2eae7073ba61]: get_combined_row_hash: got
error from node=127.0.0.2, keyspace=ks2a, table=tb,
range=(8992118519279586742,9031388867920791714],
error=seastar::rpc::remote_verb_error (some error)
```

Fixes #13881

Closes #13882
2023-05-15 09:52:27 +03:00
Pavel Emelyanov
07b7e9faf1 load-meter: Remove unused get_load_string
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13873
2023-05-15 09:21:08 +03:00
Piotr Dulikowski
760651b4ad error injection: allow enabling injections via config
Currently, error injections can be enabled either through HTTP or CQL.
While these mechanisms are effective for injecting errors after a node
has already started, they can't be reliably used to trigger failures
shortly after node start. In order to support this use case, this commit
adds the possibility to enable some error injections via config.

A configuration option `error_injections_at_startup` is added. This
option uses our existing configuration framework, so it is possible to
supply it either via CLI or in the YAML configuration file.

- When passed in commandline, the option is parsed as a
  semicolon-separated list of error injection names that should be
  enabled. Those error injections are enabled in non-oneshot mode.

  The CLI option is marked as not used in release mode and does not
  appear in the option list.

  Example:

      --error-injections-at-startup failure_point1;failure_point2

- When provided in YAML config, the option is parsed as a list of items.
  Each item is either a string or a map of parameters. This method is
  more flexible as it allows providing parameters for each injection
  point. At this time, the only benefit is that it allows enabling
  points in oneshot mode, but more parameters can be added in the future
  if needed.

  Explanatory example:

      error_injections_at_startup:
      - failure_point1 # enabled in non-oneshot mode
      - name: failure_point2 # enabled in oneshot mode
        one_shot: true       # due to one_shot optional parameter

The primary goal of this feature is to facilitate testing of raft-based
cluster features. An error injection will be used to enable an
additional feature to simulate node upgrade.

Tests: manual

Closes #13861
2023-05-15 09:14:07 +03:00
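The two configuration forms described above could be normalized into `(name, one_shot)` pairs roughly like this — a hypothetical sketch of the parsing logic, not Scylla's actual parser:

```python
def parse_cli(value):
    """CLI form: a semicolon-separated list; every point is non-oneshot."""
    return [(name, False) for name in value.split(";") if name]

def parse_yaml_items(items):
    """YAML form: each item is either a plain string (non-oneshot) or a
    map carrying the name and an optional one_shot parameter."""
    parsed = []
    for item in items:
        if isinstance(item, str):
            parsed.append((item, False))
        else:
            parsed.append((item["name"], bool(item.get("one_shot", False))))
    return parsed

print(parse_cli("failure_point1;failure_point2"))
# → [('failure_point1', False), ('failure_point2', False)]
print(parse_yaml_items(["failure_point1",
                        {"name": "failure_point2", "one_shot": True}]))
# → [('failure_point1', False), ('failure_point2', True)]
```

This also shows why the YAML form is the more flexible one: the map variant leaves room for more per-injection parameters later.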
Botond Dénes
1b04fc1425 Merge 'Use member initializer list for trace_state and related helper classes' from Pavel Emelyanov
Constructors of the trace_state class initialize most of the fields in the constructor body with the help of a non-inline helper method. It's possible, and better, to initialize as much as possible with initializer lists.

Closes #13871

* github.com:scylladb/scylladb:
  tracing: List-initialize trace_state::_records
  tracing: List-initialize trace_state::_props
  tracing: List-initialize trace_state::_slow_query_threshold
  tracing: Reorder trace_state fields initialization
  tracing: Remove init_session_records()
  tracing: List-initialize one_session_records::ttl
  tracing: List-initialize one_session_records
  tracing: List-initialize session_record
2023-05-15 09:06:14 +03:00
Botond Dénes
20ff122a84 Merge 'Delete S3 sstables without the help of deletion log' from Pavel Emelyanov
There are two layers of sstables deletion -- delete-atomically and wipe. The former is in fact the "API" method: it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner, so that a subsequent boot would pick up the dangling parts and proceed. The latter is a low-level removal function that can also fail in the middle, but handling that is not of _its_ concern.

Currently the atomic deletion is implemented with the help of the sstable_directory::delete_atomically() method, which commits the sstable file names into a deletion log, then calls wipe (indirectly), then drops the deletion log. On boot, all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though a deletion log is overkill for S3 -- it's better implemented with the help of the ownership table. In fact, S3 storage already implements atomic deletion in its wipe method, thus being overly careful.

So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic

fixes: #13016
note: Replaying sstable deletions from the ownership table on boot is not here, see #13024

Closes #13562

* github.com:scylladb/scylladb:
  sstables: Implement atomic deleter for s3 storage
  sstables: Get atomic deleter from underlying storage
  sstables: Move delete_atomically to manager and rename
2023-05-15 08:57:47 +03:00
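The directory-storage deletion-log scheme described above can be sketched as follows. This is illustrative Python; the log file name and layout are made up, not Scylla's on-disk format:

```python
import os
import tempfile

LOG_NAME = "pending_delete.log"  # hypothetical name

def delete_atomically(dirpath, names):
    """Commit the names to a deletion log, wipe the files, drop the log.
    A crash at any point leaves either all the files or the log behind,
    so the operation can always be finished later."""
    log = os.path.join(dirpath, LOG_NAME)
    with open(log, "w") as f:
        f.write("\n".join(names))
    for name in names:                      # the low-level "wipe" step
        path = os.path.join(dirpath, name)
        if os.path.exists(path):
            os.remove(path)
    os.remove(log)

def replay_deletion_log(dirpath):
    """On boot: finish any deletion that crashed mid-way."""
    log = os.path.join(dirpath, LOG_NAME)
    if not os.path.exists(log):
        return
    with open(log) as f:
        names = f.read().splitlines()
    for name in names:
        path = os.path.join(dirpath, name)
        if os.path.exists(path):
            os.remove(path)
    os.remove(log)

d = tempfile.mkdtemp()
for name in ("sst-1-Data.db", "sst-1-Index.db"):
    open(os.path.join(d, name), "w").close()
delete_atomically(d, ["sst-1-Data.db", "sst-1-Index.db"])
print(os.listdir(d))  # → []
```

For S3, per the PR, this log is unnecessary: the ownership-table entry plays the role of the deletion log, which is why atomic deletion becomes storage-specific.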
Benny Halevy
1b5d5205c8 test: add tagged_integer_test
Add a basic test for tagged_integer arithmetic operations.

Remove the const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-14 23:26:58 +03:00
Wojciech Mitros
96e912e1cf auth: disallow CREATE permission on a specific function
Similarly to how we handle Roles and Tables, we do not
allow permissions on non-existent objects, so the CREATE
permission on a specific function is meaningless: for
the permission to be granted to someone, the function
must already exist.
This patch removes the CREATE permission from the set of
permissions applicable to a specific function.

Fixes #13822

Closes #13824
2023-05-14 18:40:34 +03:00
Wojciech Mitros
1e18731a69 cql-pytest: translate Cassandra's UFTypesTest
This is a translation of Cassandra's CQL unit test source file
validation/entities/UFTypesTest.java into our cql-pytest framework.

There are 7 tests, which reproduce one known bug:
Refs #13746: UDF can only be used in SELECT, and aborts when used in WHERE or in INSERT/UPDATE/DELETE commands

And uncovered two previously unknown bugs:

Refs #13855: UDF with a non-frozen collection parameter cannot be called on a frozen value
Refs #13860: A non-frozen collection returned by a UDF cannot be used as a frozen one

Additionally, we encountered an issue that can be treated as either a bug or a hole in documentation:

Refs #13866: Argument and return types in UDFs can be frozen

Closes #13867
2023-05-14 15:22:03 +03:00
Avi Kivity
31e820e5a1 Merge 'Allow tombstone GC in compaction to be disabled on user request' from Raphael "Raph" Carvalho
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc that will allow disabling tombstone garbage collection (GC) in compaction.

Mimics the existing APIs /column_family/autocompaction and /storage_service/autocompaction.

The column_family variant must specify a single table only, following existing convention, whereas the storage_service one can specify an entire keyspace, or a subset of the tables in a keyspace.

column_family API usage
-----

```
    The table name must be in keyspace:name format

    Get status:
    curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

    Enable GC
    curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

    Disable GC
    curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```

storage_service API usage
-----

```
    Tables can be specified using a comma-separated list.

    Enable GC on keyspace
    curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

    Disable GC on keyspace
    curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

    Enable GC on a subset of tables
    curl -s -X POST
    "http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```

Closes #13793

* github.com:scylladb/scylladb:
  test: Test new API for disabling tombstone GC
  test: rest_api: extract common testing code into generic functions
  Add API to disable tombstone GC in compaction
  api: storage_service: restore indentation
  api: storage_service: extract code to set attribute for a set of tables
  tests: Test new option for disabling tombstone GC in compaction
  compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
  table: Allow tombstone GC in compaction to be disabled on user request
2023-05-14 14:16:16 +03:00
Tomasz Grabiec
a91e83fad6 Merge "issue raft read barrier before pulling schema" from Gleb
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance, it does not
contain a keyspace. This series changes the code to issue a raft read
barrier before the pull, which will guarantee that the keyspace is created
before the actual schema pull is performed.
2023-05-14 14:14:24 +03:00
Raphael S. Carvalho
a7ceb987f5 test: Fix sporadic failures of database_test
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.

The commit forces a subset of tests in database_test to run once
for each of the predefined x_log2_compaction_group settings.

That causes two problems:
1) the test becomes 240% slower in dev mode.
2) queries on system.auth are timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard: to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction-group-aware
sstable set is not merged yet, so even point queries will
unnecessarily scan through all the groups.

Fixes #13660.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13851
2023-05-14 14:14:24 +03:00
Avi Kivity
97694d26c4 Merge 'reader_permit: minor improvements to resource consume/release safety' from Botond Dénes
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller than the previous one)

Closes #13678

* github.com:scylladb/scylladb:
  reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
  reader_permit: split resource_units::reset()
  reader_permit: make consume()/signal() API private
2023-05-14 14:14:23 +03:00
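The safety pattern in this PR — only the RAII holder touches `consume()`/`signal()`, and `reset_to()` avoids `consume()` when shrinking — reduces to something like this. A Python analogy with made-up names; the real `reader_permit` API is C++:

```python
class Semaphore:
    """Tracks available resources; consume() can fail, signal() cannot."""
    def __init__(self, available):
        self.available = available

    def consume(self, amount):
        if amount > self.available:
            raise RuntimeError("resources exhausted")  # the throwing path
        self.available -= amount

    def signal(self, amount):
        self.available += amount  # never fails

class ResourceUnits:
    """The only code that calls consume()/signal() directly."""
    def __init__(self, semaphore, amount):
        semaphore.consume(amount)
        self._semaphore = semaphore
        self._amount = amount

    def reset_to(self, new_amount):
        # Only consume() the delta when growing; shrinking just
        # signal()s the difference back, which cannot fail.
        delta = new_amount - self._amount
        if delta > 0:
            self._semaphore.consume(delta)
        elif delta < 0:
            self._semaphore.signal(-delta)
        self._amount = new_amount

    def release(self):
        self._semaphore.signal(self._amount)
        self._amount = 0

sem = Semaphore(10)
units = ResourceUnits(sem, 4)
units.reset_to(2)    # shrink: no consume(), cannot fail
units.reset_to(5)    # grow: consumes only the delta of 3
units.release()
print(sem.available)  # → 10
```

Keeping the low-level calls private to the holder means no caller can consume without a matching release, which is the point of the PR.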
Avi Kivity
5d6f31df8e Merge 'Coroutinize sstable::read_toc()' from Pavel Emelyanov
It consists of two parts -- a call to do_read_simple() with a lambda, and handling of its results. The PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller -- and then restores indentation.

Closes #13862

* github.com:scylladb/scylladb:
  sstables: Restore indentation after previous patches
  sstables: Coroutinuze read_toc() outer part
  sstables: Coroutinuze read_toc() inner part
2023-05-14 14:14:23 +03:00
Avi Kivity
0a78995e2b Merge 'Share s3 clients between sstables' from Pavel Emelyanov
Currently an s3::client is created for each sstable::storage. It's later shared between the sstable's files and upload sink(s). Also, foreign_sstable_open_info can produce a file from a handle, making a new standalone client. Coupled with seastar's http client spawning connections on demand, this makes it impossible to control the number of connections opened to the object storage server.

In order to put some policy on top of that (as well as apply workload prioritization), s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint on each shard and then share it between all the sstables, files and sinks.

There's one difficulty, however, and solving it is most of what this PR does. The file handle that's used to transfer an sstable's file across shards should keep aboard all it needs to re-create the file on another shard. Since there's a single s3::client per shard, creating a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager, and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owned by the database independently on each shard.

1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep a sharded peering service referenced by the sstables_manager that's initialized in main and passed through the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with a "context" argument and teach the sstables foreign info opener to push sstables_manager down to the s3 file ... somehow

This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager takes over the object_storage_config updating facilities from the database (which is overloaded even without them). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections, it also needs a patch to seastar's http::client; there's a PR already doing that, and once (if) it's merged, there'll be one more fix on top.

refs: #13458
refs: #13369
refs: scylladb/seastar#1652

Closes #13859

* github.com:scylladb/scylladb:
  s3: Pick client from manager via handle
  s3: Generalize s3 file handle
  s3: Live-update clients' configs
  sstables: Keep clients shared across sstables
  storage_manager: Rewrap config map
  sstables, database: Move object storage config maintenance onto storage_manager
  sstables: Introduce sharded<storage_manager>
2023-05-14 14:14:23 +03:00
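The core idea — one shared client per configured endpoint per shard, handed out by a storage manager and live-reconfigurable — can be sketched like this. Illustrative Python with hypothetical names, not the actual seastar/Scylla classes:

```python
class S3Client:
    def __init__(self, endpoint, config):
        self.endpoint = endpoint
        self.config = dict(config)

    def update_config(self, config):
        self.config.update(config)  # live-update without replacing the client

class StorageManager:
    """Shard-local registry: every sstable/file/sink shares one client per
    endpoint, so connection limits can actually be enforced in one place."""
    def __init__(self, endpoint_configs):
        self._configs = dict(endpoint_configs)
        self._clients = {}

    def get_client(self, endpoint):
        if endpoint not in self._clients:
            self._clients[endpoint] = S3Client(endpoint, self._configs[endpoint])
        return self._clients[endpoint]

    def update_config(self, endpoint, config):
        self._configs[endpoint].update(config)
        if endpoint in self._clients:
            self._clients[endpoint].update_config(config)

mgr = StorageManager({"s3.example.com": {"max_connections": 10}})
a = mgr.get_client("s3.example.com")
b = mgr.get_client("s3.example.com")
print(a is b)  # → True: one shared client, not one per sstable
mgr.update_config("s3.example.com", {"max_connections": 20})
print(a.config["max_connections"])  # → 20
```

This also illustrates the cross-shard handle problem the PR solves: a file handle only needs to carry the endpoint name, because the shard-local manager can always resolve it to that shard's client.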
Pavel Emelyanov
8bca54902c sstables: Implement atomic deleter for s3 storage
The existing storage::wipe() method of s3 is in fact an atomic deleter --
it commits a "deleting" status into the ownership table, deletes the objects
from the server, then removes the entry from the ownership table. So the atomic
deleter does the same, and .wipe() just removes the objects, because
it's not supposed to be atomic.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 17:52:13 +03:00
Pavel Emelyanov
6a8139a4fe sstables: Get atomic deleter from underlying storage
While the driver isn't known without the sstable itself, we have a
vector of them and can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.

For now S3 storage returns the same directory-based deleter, but next
patch will change that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 17:52:13 +03:00
Pavel Emelyanov
5985f00da9 sstables: Move delete_atomically to manager and rename
This is to let the manager decide which storage driver to call for atomic
sstable deletion in the next patch. While at it -- rename the
sstable_directory's method to something more descriptive (to make the
compiler catch all its callers).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 17:52:12 +03:00
Raphael S. Carvalho
107999c990 test: Test new API for disabling tombstone GC
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:38 -03:00
Raphael S. Carvalho
c396db2e4c test: rest_api: extract common testing code into generic functions
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:38 -03:00
Raphael S. Carvalho
abc1eae1c2 Add API to disable tombstone GC in compaction
Adding new APIs /column_family/tombstone_gc and
/storage_service/tombstone_gc.

Mimics the existing APIs /column_family/autocompaction and
/storage_service/autocompaction.

The column_family variant must specify a single table only,
following existing convention, whereas the storage_service one
can specify an entire keyspace, or a subset of the tables in
a keyspace.

column_family API usage
-----

The table name must be in keyspace:name format

Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

storage_service API usage
-----

Tables can be specified using a comma-separated list.

Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:38 -03:00
Raphael S. Carvalho
07104393af api: storage_service: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:36 -03:00
Raphael S. Carvalho
501b5a9408 api: storage_service: extract code to set attribute for a set of tables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:33:50 -03:00
Pavel Emelyanov
d58bc9a797 tracing: List-initialize trace_state::_records
This field needs to call trace_state::ttl_by_type() which, in turn,
looks into _props. The latter should have been initialized already.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:15:58 +03:00
Pavel Emelyanov
5aebbedaba tracing: List-initialize trace_state::_props
It takes props from the constructor args and tunes them according to the
constructing "flavor" -- primary or secondary state. Adding two static
helpers code-documents the intent and makes list-initialization possible.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:14:32 +03:00
Raphael S. Carvalho
6c32148751 tests: Test new option for disabling tombstone GC in compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:14:28 -03:00
Raphael S. Carvalho
777af7df44 compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadow data in uncompacting
files).

This logic can be bypassed if tombstone GC was disabled by the user,
as it's a waste of effort to proceed with it until re-enabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:14:28 -03:00
Raphael S. Carvalho
3b28c26c77 table: Allow tombstone GC in compaction to be disabled on user request
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.

Not wired into an API yet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:14:28 -03:00
Pavel Emelyanov
e7978dbf98 tracing: List-initialize trace_state::_slow_query_threshold
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:14:15 +03:00
Pavel Emelyanov
3ebbc25cec tracing: Reorder trace_state fields initialization
The instance ptr and props have to be set up early, because other
members' initialization depends on them. It's currently OK, because
other members are initialized in the constructor body, but moving them
into the initializer list would require correct ordering.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:13:13 +03:00
Pavel Emelyanov
16e1315eef tracing: Remove init_session_records()
It now does nothing but wrap the make_lw_shared<one_session_records>()
call. Callers can do it on their own, thus facilitating further
list-initialization patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:11:18 +03:00
Pavel Emelyanov
dd87adadf3 tracing: List-initialize one_session_records::ttl
For that to happen, the value evaluation is moved from
init_session_records() into a private trace_state helper, as it checks
the props values initialized earlier.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:09:05 +03:00
Pavel Emelyanov
b63084237c tracing: List-initialize one_session_records
This touches session_id, parent_id and my_span_id fields

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:07:24 +03:00
Pavel Emelyanov
944b98f261 tracing: List-initialize session_record
This object is constructed via one_session_records, thus the latter needs
to pass some arguments along.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-12 16:04:01 +03:00
Botond Dénes
157fdb2f6d db/system_keyspace: remove dependency on storage_proxy
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
2023-05-12 07:27:55 -04:00
Botond Dénes
f4f757af23 db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
Use the recently introduced replica side query utility functions to
query the content of the system tables. This allows us to cut the
dependency of the system keyspace on storage proxy.
The methods still take a storage proxy parameter; this will be replaced
with replica::database in the next patch.
There is still one hidden storage proxy dependency left, via
cql3::query_processor. This will be addressed later.
2023-05-12 07:27:55 -04:00
Botond Dénes
f5d41ac88c replica: add query.hh
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay to be able to break some circular dependencies
involving storage proxy that have been plaguing us since time
immemorial.
One thing we lose with this is the smp service level used in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is increasing `replica_cross_shard_ops` storage
proxy stat. I think this is not a problem at all as these new functions
are meant to be used by internal users, which will reduce the internal
noise in this metric, which is meant to indicate users not using
shard-aware clients.
2023-05-12 07:26:18 -04:00
Wojciech Mitros
d50f048279 cql: adjust tests to the updated permissions on functions
As a result of the preceding patches, permissions on a function
are now granted to its creator, so some permissions may
appear which we did not expect before.

In the test_udf_permissions_serialization, we create a function
as the superuser, and as a result, when we compare the permissions
we specifically granted to the ones read from the LIST PERMISSIONS
result, we get more than expected - this is fixed by granting
permissions explicitly to a new user and only checking this user's
permissions list.

In the test_grant_revoke_udf_permissions case, we test whether
the DROP permission is enforced on a function that we have previously
created as the same user -- as a result, we have the DROP permission
even without granting it directly. We fix this by testing the DROP
permission on a function created by a different user.

In the test_grant_revoke_alter_udf_permissions case, we previously
tested that we require both ALTER and CREATE permissions when executing
a CREATE OR REPLACE FUNCTION statement. The new permissions required
for this statement now depend on whether we actually CREATE or REPLACE
a function, so now we test that the ALTER permission is required when
REPLACING a function, and the CREATE permission is required when
CREATING a function. After the changes, the case no longer needs to
be artificially extracted from the previous one, so they are merged
now. Analogous adjustments are made in the test case
test_grant_revoke_alter_uda_permissions.
2023-05-12 10:56:29 +02:00
Wojciech Mitros
8abed6445a cql: fix authorization when altering a function
Currently, when a user is altering a function, they need
both CREATE and ALTER permissions, instead of just ALTER.
Additionally, after altering a function, the user is
treated as an owner of this function, gaining all access
permissions to it.

This patch fixes these 2 issues by checking only the ALTER
permission when actually altering, and by not modifying the
user's permissions if the user did not actually create
the function.
2023-05-12 10:56:29 +02:00
Wojciech Mitros
1d099644d4 cql: grant permissions on functions when creating a keyspace/function
When a user creates a function, they should have all permissions on
this function.
Similarly, when a user creates a keyspace, they should have all
permissions on functions in the keyspace.
This patch introduces GRANTs on the missing permissions.
2023-05-12 10:56:29 +02:00
Wojciech Mitros
dd20621d71 cql: pass a reference to query processor in grant_permissions_to_creator
In the following patch, the grant_permissions_to_creator method is also
going to be used to grant permissions on a newly created function. The
function resource may contain user-defined types which need the
query processor to be prepared, so we add a reference to it in advance
in this patch for easier review.
2023-05-12 10:56:29 +02:00
Wojciech Mitros
f4d2cd15e9 test_permissions: make tests pass on cassandra
Despite the cql-pytests being intended to pass on both Scylla and
Cassandra, the test_permissions.py case was actually failing on
Cassandra in a few cases. The most common issue was a different
exception type returned by Scylla and Cassandra for an invalid
query. This was fixed by accepting 2 types of exceptions when
necessary.
The second issue was Java UDF code that did not compile, which was
fixed simply by debugging the code.
The last issue was a case that was scylla_only with no good reason.
The missing Java UDFs were added to that case, and the test was
adjusted so that the ALTER permission was checked in a
CREATE OR REPLACE statement only if the UDF already existed -
Scylla requires it in both cases, which will get resolved in the
next patch.
2023-05-12 10:50:12 +02:00
Kefu Chai
e89e0d4b28 test: sstable: use generator to generate generations
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.

this change uses generator to generate generations for
`make_sstable_for_all_shards()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-12 13:22:32 +08:00
Kefu Chai
e3d6dd46b7 test: sstable: pass generation_type in helper functions
always avoid using generation_type's underlying type if possible. this
helps us to hide the underlying type of the generation identifier, which
could also be a UUID in the future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-12 13:22:32 +08:00
Kefu Chai
e788bfbb43 test: sstable: use generator to generate generations
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.

this change uses generator to create generations for
`make_sstable_for_this_shard()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-12 13:22:30 +08:00
Pavel Emelyanov
613acba5d0 s3: Pick client from manager via handle
Add a global-factory handle onto the client that is

- cross-shard copyable
- able to generate a client from the local storage_manager for a given endpoint

With that, the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
8ed9716f59 s3: Generalize s3 file handle
Currently the s3 file handle tries to carry the client's info via an
explicit host name and an endpoint config pointer. This is buggy: the
latter pointer is shard-local and cannot be transferred across shards.

This patch prepares the fix by abstracting the client handle part.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
63ff6744d8 s3: Live-update clients' configs
Now that the client is accessible directly via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.

The client, in turn, can only update the AWS credentials for now. The
endpoint port and HTTPS usage are immutable for now.

Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is part of the keyspace configuration, and
updating it in object_storage.yaml will have no effect on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
e6760482b2 sstables: Keep clients shared across sstables
Currently each sstable gets its own instance of an s3::client. This patch
keeps clients in the storage_manager's endpoints map, and when creating a
storage for an sstable grabs the shared pointer from the map, thus
making one client serve all sstables there (except for those that
duplicated their files with the help of foreign-info, but that's to be
handled by the next patches).

Moving the ownership of a client to the storage_manager level also means
that the client has to be closed on manager's stop, not on sstable
destroy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
743f26040f storage_manager: Rewrap config map
Now the map is endpoint -> config_ptr. Wrap the config_ptr into an
s3_endpoint struct. The next patch will keep the client in this new
wrapper struct, thus making clients shared between sstables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:01 +03:00
Pavel Emelyanov
a59096aa70 sstables, database: Move object storage config maintenance onto storage_manager
Right now the map<endpoint, config> sits on the sstables manager and its
update is governed by the database (because the latter is peering and can
kick other shards to update it as well).

Having the sharded<storage_manager> at hand lets us free the database from
the need to update configs and keeps sstables_manager a bit smaller.
This will also allow the next patch to keep s3 clients shared between
sstables via this map.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:39:00 +03:00
Pavel Emelyanov
2153751d45 sstables: Introduce sharded<storage_manager>
The manager in question keeps track of whatever sstables_manager needs
to work with the storage (spoiler: only the S3 one). It's a main-local
sharded peering service, so that the container() call can be used by
subsequent patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 19:36:01 +03:00
Pavel Emelyanov
d7af178f20 sstables: Restore indentation after previous patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 18:40:24 +03:00
Pavel Emelyanov
54e892caf1 sstables: Coroutinize read_toc() outer part
It just needs to catch the system_error of ENOENT and re-throw it as
malformed_sstable_exception.

Indentation is deliberately left broken. Again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 18:40:14 +03:00
Pavel Emelyanov
1eb3ae2256 sstables: Coroutinize read_toc() inner part
One non-trivial change is the removal of the buf temporary variable. It
existed under the same name in the .then() lambda, generating a name
conflict after coroutinization.

Other than that it's pretty straightforward.

Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-11 18:40:07 +03:00
Gleb Natapov
7caf1d26fb migration manager: Make schema pull abortable.
Now that a schema pull may issue a raft read barrier, it may get stuck if
a majority is not available. Make the operation abortable and abort it
during queries if the timeout is reached.
2023-05-11 16:31:23 +03:00
Gleb Natapov
091ec285fe serialized_action: make serialized_action abortable
Add an ability to abort waiting for a result of a specific trigger()
invocation.
2023-05-11 16:31:23 +03:00
Asias He
7fcc403122 tombstone_gc: Fix gc_before for immediate mode
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.

With gc_before set to gc_clock::time_point::max, a row could be dropped
by compaction even if its TTL has not expired yet.

The following procedure reproduces the issue:

- Start 2 nodes

- Insert data

```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```

- Run nodetool flush and nodetool compact

- Compaction drops all data

```
~128 total partitions merged to 0.
```

Fixes #13572

Closes #13800
2023-05-11 15:10:00 +03:00
Botond Dénes
3d75158fda Merge 'Allow no owned token ranges in cleanup compaction' from Benny Halevy
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.

In this case we should go ahead with cleanup that will
effectively delete all data.

Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized by either special-casing this situation
or, better, skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.

Fixes #13634

Also, add a respective rest_api unit test

Closes #13849

* github.com:scylladb/scylladb:
  test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
  compaction_manager: perform_cleanup: handle empty owned ranges
2023-05-11 15:05:06 +03:00
Gleb Natapov
70189b60de migration manager: if raft is enabled, sync with group0 leader before pulling a schema that is not available locally
A schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance, it does not
contain a keyspace. This patch changes the code to issue a raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.

Refs: #3760
Fixes: #13211
2023-05-11 13:28:54 +03:00
Gleb Natapov
d4417442e9 service: raft_group0_client: add using_raft function
Make it easy to check if raft is enabled.
2023-05-11 13:27:58 +03:00
Anna Stuchlik
7f7ab3ae3e doc: fix the broken Glossary link
Fixes https://github.com/scylladb/scylladb/issues/13805

This commit fixes the redirection required by moving the Glossary
page from the top of the page tree to the Reference section.

As the change was only merged to master (not to branch-5.2),
it is not working for version 5.2, which is now the latest stable
version.
For this reason, "stable" in the path must be replaced with "master".

Closes #13847
2023-05-11 10:30:59 +03:00
Botond Dénes
24cb351655 Merge 'test: sstable_*test: avoid using helper using generation_type::int_t ' from Kefu Chai
the series drops some of the callers using SSTable generation as integer. as the generation of SSTable is but an identifier, we should not use it as an integer out of generation_type's implementation.

Closes #13845

* github.com:scylladb/scylladb:
  test: drop unused helper functions
  test: sstable_mutation_test: avoid using helper using generation_type::int_t
  test: sstable_move_test: avoid using helper using generation_type::int_t
  test: sstable_*test: avoid using helper using generation_type::int_t
  test: sstable_3_x_test: do not use reuseable_sst() accepting integer
2023-05-11 10:17:02 +03:00
Benny Halevy
2fc142279f compaction_manager: perform_cleanup: hold on to sstable_set around yielding
Updates to the compaction_group sstable sets are
never done in place.  Instead, the update is done
on a mutable copy of the sstable set, and the lw_shared
result is set back in the compaction_group.
(see for example compaction_group::set_main_sstables)

Therefore, there's currently a risk in the perform_cleanup
`get_sstables` lambda that if it yields while in
set.for_each_sstable, the sstable_set might be replaced
and the copy it is traversing may be destroyed.
This was introduced in c2bf0e0b72.

To prevent that, hold on to set.shared_from_this()
around set.for_each_sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13852
2023-05-11 09:46:53 +03:00
Benny Halevy
0b91bfbcc5 test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
Test cleanup on a keyspace after altering
its replication factor to 0.
Expect no sstables to remain.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-11 08:16:31 +03:00
Kefu Chai
29284d64a5 test: drop unused helper functions
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-11 12:32:37 +08:00
Kefu Chai
b036d2b50c test: sstable_mutation_test: avoid using helper using generation_type::int_t
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer. also, since
`generate_clustered()` is only used by functions in the same
compilation unit, let's take the opportunity to mark it `static`.
and there is no need to pass generation as a template parameter,
we just pass it as a regular parameter.

we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-11 12:32:22 +08:00
Kefu Chai
689e1e99d6 test: sstable_move_test: avoid using helper using generation_type::int_t
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in helper functions, we just use
`generation_type`. please note: although we'd prefer generating
the generations using the generator, the SSTables used by the tests
modified by this change are stored in the repo; to ensure that the
tests are always able to find the SSTable files, we keep the
generations unchanged instead of using generation_generator or a
random generation for the testing.

we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-11 12:32:22 +08:00
Kefu Chai
bfd6caffbb test: sstable_*test: avoid using helper using generation_type::int_t
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default parameter, which is a
generation created from 1. this preserves the existing behavior.

we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-11 12:32:22 +08:00
Kefu Chai
ab8efbf1ab test: sstable_3_x_test: do not use reuseable_sst() accepting integer
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type.

also, as no callers are using the last parameter of `make_test_sstable()`,
let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-11 12:32:21 +08:00
Nadav Har'El
f1cad230bb Merge 'cql: enable setting permissions on resources with quoted UDT names' from Wojciech Mitros
This series fixes an issue with altering permissions on UDFs with
parameter types that are UDTs with quoted names and adds
a test for it.

The issue was caused by the format of the temporary string
that represented the UDT in `auth::resource`. After parsing the
user input to a raw type, we created a string representing the
UDT using `ut_name::to_string()`. The segment of the resulting
string that represented the name of the UDT was not quoted,
making us unable to parse it again when the UDT was being
`prepare`d. Other than for this purpose, `ut_name::to_string()`
is used only for logging, so the solution was to modify it to
maybe-quote the UDT name.

Ref: https://github.com/scylladb/scylladb/pull/12869

Closes #13257

* github.com:scylladb/scylladb:
  cql-pytest: test permissions for UDTs with quoted names
  cql: maybe quote user type name in ut_name::to_string()
  cql: add a check for currently used stack in parser
  cql-pytest: add an optional name parameter to new_type()
2023-05-10 19:10:29 +03:00
Wojciech Mitros
1f45c7364c cql: check permissions for used functions when creating a UDA
Currently, when creating a UDA, we only check the permission
to create functions. However, the creator gains all permissions
to the UDA, including the EXECUTE permission. This enables the
user to also execute the state/reduce/final functions that were
used in the UDA, even if they don't have the EXECUTE permissions
on them.

This patch adds checks for the missing EXECUTE permissions, so
that the UDA can be only created if the user has all required
permissions.

The new permissions that are now required when creating a UDA
are now granted in the existing UDA test.

Fixes #13818

Closes #13819
2023-05-10 18:06:04 +03:00
Wojciech Mitros
a86b9fa0bb auth: fix formatting of function resource with no arguments
Currently, when a function has no arguments, the function_args()
method, which is supposed to return a vector of string_views
representing the arguments of the function, returns a nullopt
instead, as if it were a functions_resource on all functions
or all functions in a keyspace. As a result, the functions_resource
can't be properly formatted.
This is fixed in this patch by returning an empty vector instead,
and the fix is confirmed in a cql-pytest.

Fixes #13842

Closes #13844
2023-05-10 17:07:33 +03:00
Benny Halevy
3771d48488 sstables: mx: validate: close consumer context
data_consume_rows keeps an input_stream member that must be closed.
This matters in particular on the error path, where we might otherwise
destroy it with readaheads still in flight.

Fixes #13836

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13840
2023-05-10 17:05:43 +03:00
Benny Halevy
c720754e37 compaction_manager: perform_cleanup: handle empty owned ranges
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.

In this case we should go ahead with cleanup that will
effectively delete all data.

Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized by either special-casing this situation
or, better, skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.

Fixes #13634

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-10 15:11:53 +03:00
Avi Kivity
171a6cbbaa cql3: untyped_result_set: document performance characteristics
untyped_result_set is optimized towards convenience and safety,
so note that.

Closes #13661
2023-05-10 15:03:12 +03:00
Nadav Har'El
e57252092c Merge 'cql3: result_set, selector: change value type to managed_bytes_opt' from Avi Kivity
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.

This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance

To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the sources of the values
are views into a bytes_ostream. However, luckily bytes_ostream and managed_bytes_view
are mostly compatible so with a little effort this can be done.

The series is neutral wrt performance:

before:
```
222118.61 tps ( 61.1 allocs/op,  12.1 tasks/op,   43092 insns/op,        0 errors)
224250.14 tps ( 61.1 allocs/op,  12.1 tasks/op,   43094 insns/op,        0 errors)
224115.66 tps ( 61.1 allocs/op,  12.1 tasks/op,   43092 insns/op,        0 errors)
223508.70 tps ( 61.1 allocs/op,  12.1 tasks/op,   43107 insns/op,        0 errors)
223498.04 tps ( 61.1 allocs/op,  12.1 tasks/op,   43087 insns/op,        0 errors)
```

after:
```
220708.37 tps ( 61.1 allocs/op,  12.1 tasks/op,   43118 insns/op,        0 errors)
225168.99 tps ( 61.1 allocs/op,  12.1 tasks/op,   43081 insns/op,        0 errors)
222406.00 tps ( 61.1 allocs/op,  12.1 tasks/op,   43088 insns/op,        0 errors)
224608.27 tps ( 61.1 allocs/op,  12.1 tasks/op,   43102 insns/op,        0 errors)
225458.32 tps ( 61.1 allocs/op,  12.1 tasks/op,   43098 insns/op,        0 errors)
```

Though I expect with some more effort we can eliminate some copies.

Closes #13637

* github.com:scylladb/scylladb:
  cql3: untyped_result_set: switch to managed_bytes_view as the cell type
  cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
  cql3: untyped_result_set: always own data
  types: abstract_type: add mixed-type versions of compare() and equal()
  utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
  utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
  utils: managed_bytes: add managed_bytes_view::with_linearized()
  utils: managed_bytes: mark managed_bytes_view::is_linearized() const
2023-05-10 15:01:45 +03:00
Wojciech Mitros
9ae1b02144 service: revoke permissions on functions when a function/keyspace is dropped
Currently, when a user has permissions on a function/all functions in
keyspace, and the function/keyspace is dropped, the user keeps the
permissions. As a result, when a new function/keyspace is created
with the same name (and signature), they will be able to use it even
if no permissions on it are granted to them.

Similarly to regular UDFs, the same applies to UDAs.

After this patch, the corresponding permissions on functions are dropped
when a function/keyspace is dropped.

Fixes #13820

Closes #13823
2023-05-10 14:39:42 +03:00
Botond Dénes
bb62038119 Merge 'Scrub compaction task' from Aleksandra Martyniuk
Task manager's tasks covering scrub compaction at the top,
shard, and table levels.

For these levels we have common scrub tasks shared by all scrub
modes since they share code. Scrub modes will be differentiated
on compaction group level.

Closes #13694

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test scrub compaction
  compaction: add table_scrub_sstables_compaction_task_impl
  compaction: add shard_scrub_sstables_compaction_task_impl
  compaction: add scrub_sstables_compaction_task_impl
  api: get rid of unnecessary std::optional in scrub
  compaction: rename rewrite_sstables_compaction_task_impl
2023-05-10 14:18:20 +03:00
Anna Stuchlik
4898a20ae9 doc: add troubleshooting for failed schema sync
Fixes https://github.com/scylladb/scylladb/issues/12133

This commit adds a Troubleshooting article to support
users when schema sync failed on their cluster.

Closes #13709
2023-05-10 14:01:36 +03:00
Avi Kivity
1a3545b13d Merge 'data_dictionary: define helpers in options and define == operator only' from Kefu Chai
in this series, `data_dictionary::storage_options` is refactored so that each dedicated storage option takes care of itself, instead of putting all the logic into `storage_options`. cleaner this way. as the next step, i will add yet another set of options for the tiered_storage which is backed by the s3_storage and the local filesystem_storage. with this change, we will be able to group the per-option functionalities together by the option they are designed for, instead of sharding them by the actual function.

Closes #13826

* github.com:scylladb/scylladb:
  data_dictionary: define helpers in options
  data_dictionary: only define operator== for storage options
2023-05-10 12:59:57 +03:00
Avi Kivity
e252dbcfb8 Merge ' readers,mutation: move mutation_fragment_stream_validator to mutation/' from Botond Dénes
The validator classes have their definition in a header located in mutation/, while their implementation is located in a .cc in readers/mutation_reader.cc.
This PR fixes this inconsistency by moving the implementation into mutation/mutation_fragment_stream_validator.cc. The only change is that the validator code gets a new logger instance (but the logger variable itself is left unchanged for now).

Closes #13831

* github.com:scylladb/scylladb:
  mutation/mutation_fragment_stream_validator.cc: rename logger
  readers,mutation: move mutation_fragment_stream_validator to mutation/
2023-05-10 12:54:53 +03:00
Botond Dénes
6bea0c04cf message: match unknown tenants to the default tenant
On connection setup, the isolation cookie of the connection is matched
to the appropriate scheduling group. This is achieved by iterating over
the known statement tenant connection types as well as the system
connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and
the remote node has a scheduling group the local one doesn't have. To
avoid demoting a scheduling group of unknown importance, in this case the
default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise
version, as the scheduling groups of the enterprise service-levels will
match none of the statement tenants and will hence fall-back to the
default scheduling group. As a consequence, while the cluster is mixed,
user workload on old (OSS) nodes, will be executed under the system
scheduling group and concurrency semaphore. Not only does this mean that
user workloads are directly competing for resources with system ones,
but the two workloads are now sharing the semaphore too, reducing the
available throughput. This usually manifests in queries timing out on
the old (OSS) nodes in the cluster.

This patch proposes to fix this, by recognizing that the unknown
scheduling group is in fact a tenant this node doesn't know yet, and
matching it with the default statement tenant.
With this, order should be restored, with service-level connections
being recognized as user connections and being executed in the statement
scheduling group and the statement (user) concurrency semaphore.
2023-05-10 05:09:34 -04:00
Botond Dénes
8663b27f25 message: generalize per-tenant connection types
We have a set number of connection types for each tenant. The number of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where these will be used. To avoid duplicating these
names, making future changes error prone, centralize them in a const
array, generalizing the concept of a tenant connection type.
2023-05-10 04:28:57 -04:00
Kamil Braun
7d9ab44e81 Merge 'token_metadata: read remapping for write_both_read_new' from Gusev Petr
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.

To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`.  It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When topology state sm reaches the
point when it needs to switch reads to a new topology, it passes
`request_read_new=true` in a call to `update_pending_ranges`. This
forces `update_pending_ranges` to compute the ranges based on new
topology and store them to the `interval_map`. On the data plane, when a
read on coordinator needs to decide which endpoints to use, it first
consults this `interval_map` in `token_metadata`, and only if it doesn't
contain a range for current token it uses normal endpoints from
`effective_replication_map`.

Closes #13376

* github.com:scylladb/scylladb:
  storage_proxy, storage_service: use new read endpoints
  storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
  token_metadata: add unit test for endpoints_for_reading
  token_metadata: add endpoints for reading
  sequenced_set: add extract_set method
  token_metadata_impl: extract maybe_migration_endpoints helper function
  token_metadata_impl: introduce migration_info
  token_metadata_impl: refactor update_pending_ranges
  token_metadata: add unit tests
  token_metadata: fix indentation
  token_metadata_impl: return unique_ptr from clone functions
2023-05-10 10:03:30 +02:00
Avi Kivity
550aa01242 Merge 'Restore raft::internal::tagged_uint64 type' from Benny Halevy
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.

However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.

This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.

Fixes #13752

Closes #13774

* github.com:scylladb/scylladb:
  raft, idl: restore internal::tagged_uint64 type
  raft: define term_t as a tagged uint64_t
  idl: gossip_digest: include required headers
2023-05-09 22:51:25 +03:00
Kefu Chai
d8cd62b91a compaction/compaction: initialize local variable
the initial `validation_errors` should be zero, so let's initialize it
instead of leaving it uninitialized.

this should address following warning from Clang-16:

```
/usr/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=6 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -I/home/kefu/dev/scylladb/build/cmake/gen -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wno-error=deprecated-declarations -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -march=westmere  -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT compaction/CMakeFiles/compaction.dir/compaction.cc.o -MF compaction/CMakeFiles/compaction.dir/compaction.cc.o.d -o compaction/CMakeFiles/compaction.dir/compaction.cc.o -c /home/kefu/dev/scylladb/compaction/compaction.cc
/home/kefu/dev/scylladb/compaction/compaction.cc:1681:9: error: variable 'validation_errors' is uninitialized when used here [-Werror,-Wuninitialized]
        validation_errors += co_await sst->validate(permit, descriptor.io_priority, cdata.abort, [&schema] (sstring what) {
        ^~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/compaction/compaction.cc:1676:31: note: initialize the variable 'validation_errors' to silence this warning
    uint64_t validation_errors;
                              ^
                               = 0
```
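The fix above can be sketched in miniature (the function and its inputs are illustrative, not the actual compaction code): an accumulator that is only ever incremented must start at zero, otherwise the first `+=` reads an indeterminate value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Count failed checks; without the "= 0" below, validation_errors would
// hold an indeterminate value and the += would be undefined behavior.
uint64_t count_validation_errors(const std::vector<bool>& checks) {
    uint64_t validation_errors = 0;  // the "= 0" is the fix
    for (bool failed : checks) {
        validation_errors += failed ? 1 : 0;
    }
    return validation_errors;
}
```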

the change which introduced this local variable was 7ba5c9cc6a.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13813
2023-05-09 22:49:29 +03:00
Avi Kivity
8c6229d229 Merge 'sstable: encode value using UUID' from Kefu Chai
in this series, we encode the value of the generation using a UUID to prepare for the UUID generation identifier. it is simpler this way, as we don't need two ways to encode an integer or a timeuuid: a UUID with a zero timestamp, and a variant. also, add a `from_string()` factory method to convert a string to a generation, hiding the underlying type of the value from generation_type's users.

Closes #13782

* github.com:scylladb/scylladb:
  sstable: use generation_type::from_string() to convert from string
  sstable: encode int using UUID in generation_type
2023-05-09 22:07:23 +03:00
Avi Kivity
996f717dfc Merge 'cql3/prepare_expr: force token() receiver name to be partition key token' from Jan Ciołek
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```

After calling `prepare`, the driver receives some information about the prepared statement, including the names of values bound to each bind marker.

In case of a partition token restriction (`token(p1, p2) = ?`) there's an expectation that the name assigned to this bind marker will be `"partition key token"`.

In a recent change the code handling `token()` expressions has been unified with the code that handles generic function calls, and as a result the name has changed to `token(p1, p2)`.

It turns out that the Java driver relies on the name being `"partition key token"`, so a change to `token(p1, p2)` broke some things.

This patch sets the name back to `"partition key token"`. To achieve this we detect any restrictions that match the pattern `token(p1, p2, p3) = X` and set the receiver name for X to `"partition key token"`.

Fixes: #13769

Closes #13815

* github.com:scylladb/scylladb:
  cql-pytest: test that bind marker is partition key token
  cql3/prepare_expr: force token() receiver name to be partition key token
2023-05-09 20:44:46 +03:00
Anna Stuchlik
c64109d8c7 doc: add driver support for Serverless
Fixes https://github.com/scylladb/scylladb/issues/13453

This is V2 of https://github.com/scylladb/scylladb/pull/13710/.

This commit adds:
- the information about which ScyllaDB drivers support ScyllaDB Cloud Serverless.
- language and organization improvements to the ScyllaDB CQL Drivers
  page.

Closes #13825
2023-05-09 20:43:22 +03:00
Kefu Chai
c872ade50f sstable: use generation_type::from_string() to convert from string
in this change,

* instead of using "\d+" to match the generation, use "[^-]",
* let generation_type convert a string to a generation

before this change, we cast the matched string in the SSTable file name
to an integer and then constructed a generation identifier from that
integer. this solution makes the strong assumption that the generation is
represented by an integer; we should not encode that assumption in
sstable.cc, but rather let generation_type itself take care of it. also,
to relax the regex restriction for matching the generation, let's
just accept any characters except for the delimiter -- "-".
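The relaxed match can be sketched like this (the overall filename pattern is simplified and illustrative, not the exact one used in sstable.cc): the generation component is any run of characters not containing the "-" delimiter, rather than a run of digits.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Extract the generation component from a simplified sstable-style
// filename: anything between the first and second "-" qualifies,
// whether it is an integer or a UUID-like token.
std::string extract_generation(const std::string& name) {
    static const std::regex re(R"(^me-([^-]+)-.*$)");
    std::smatch m;
    return std::regex_match(name, m, re) ? m[1].str() : "";
}
```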

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 22:57:39 +08:00
Kefu Chai
478c13d0d4 sstable: encode int using UUID in generation_type
since we already use a UUID for encoding a bigint in the SSTable registry
table, let's use the same approach for encoding the bigint in generation_type;
it is more consistent and avoids repetition.
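A hedged sketch of the idea: a 128-bit UUID-shaped value can carry a plain integer generation by leaving the timestamp half zero. This is only an illustration; Scylla's actual bit layout may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// A UUID-like pair of 64-bit halves; integer generations keep the
// timestamp half at zero and store the value in the other half.
struct uuid_bits {
    uint64_t msb = 0;  // timestamp half, zero for integer generations
    uint64_t lsb = 0;
};

uuid_bits encode_generation(int64_t gen) {
    return uuid_bits{0, static_cast<uint64_t>(gen)};
}

int64_t decode_generation(const uuid_bits& u) {
    return static_cast<int64_t>(u.lsb);
}
```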

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 22:57:38 +08:00
Petr Gusev
08529a1c6c storage_proxy, storage_service: use new read endpoints
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
2023-05-09 18:42:03 +04:00
Petr Gusev
052b91fb1f storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
We are going to use remapped_endpoints_for_reading, so we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need, since it's used in all read code paths;
from its name, however, this was not obvious.

Also, we add the ks_name parameter, as we'll need
to pass it to remapped_endpoints_for_reading.
2023-05-09 18:42:03 +04:00
Petr Gusev
15fe4d8d69 token_metadata: add unit test for endpoints_for_reading 2023-05-09 18:42:03 +04:00
Petr Gusev
0e4e2df657 token_metadata: add endpoints for reading
In this patch we add the
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new, update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.

Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
2023-05-09 18:41:59 +04:00
Kefu Chai
d24687ea26 data_dictionary: define helpers in options
instead of dispatching and implementing the per-option handling
right in `storage_option`, define these helpers in the dedicated
options themselves, so `storage_option` is only responsible for
dispatching.

much cleaner this way. this change also makes it easier to add yet
another storage backend.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 21:51:52 +08:00
Kefu Chai
152d0224dc data_dictionary: only define operator== for storage options
as the only user of these comparison operators is
`storage_options::can_update_to()`, which just checks whether the given
`storage_options` is equal to the stored one, there is no need to define
the <=> operator.

also, no need for the `friend` specifier, as the options are plain
structs with all member variables public.

make the comparison operator a member function instead of a free
function, as in C++20 comparison operators are symmetric.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 21:51:45 +08:00
Botond Dénes
ef7b7223d5 mutation/mutation_fragment_stream_validator.cc: rename logger
This code inherited its logger variable name from mutation reader,
rename it to better match its new context.
2023-05-09 07:55:13 -04:00
Botond Dénes
8681f3e997 readers,mutation: move mutation_fragment_stream_validator to mutation/
The validator classes have their definition in a header located in mutation/,
while their implementation is located in a .cc in readers/mutation_reader.cc.
This patch fixes this inconsistency by moving the implementation into
mutation/mutation_fragment_stream_validator.cc. The only change is that
the validator code gets a new logger instance (but the logger variable itself
is left unchanged for now).
2023-05-09 07:55:13 -04:00
Botond Dénes
287ccce1cc Merge 'sstables: extract storage out ' from Kefu Chai
this change extracts the storage class and its derived classes
out into their own source files, for a couple of reasons:

- for better readability. sstables.hh is over 1005 lines
  and sstables.cc 3602 lines; it's a little difficult to figure
  out how the different parts of these sources interact with each
  other. for instance, with this change, it's clear that some of the
  helper functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
  files out, they can be compiled individually, so changing one .cc
  file does not impact the others. this could speed up compilation
  time.

Closes #13785

* github.com:scylladb/scylladb:
  sstables: storage: coroutinize idempotent_link_file()
  sstables: extract storage out
2023-05-09 14:03:40 +03:00
Jan Ciolek
9ad1c5d9f2 cql-pytest: test that bind marker is partition key token
When preparing a query, each bind marker gets a name.
For a query like:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
The bind marker's name should be `"partition key token"`.
The Java driver relies on this name; having something else,
like `"token(p1, p2)"`, as the name breaks the driver.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-09 12:33:06 +02:00
Jan Ciolek
8a256f63db cql3/prepare_expr: force token() receiver name to be partition key token
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```

After calling `prepare`, the driver receives some information
about the prepared statement, including the names of values bound
to each bind marker.

In case of a partition token restriction (`token(p1, p2) = ?`)
there's an expectation that the name assigned to this bind marker
will be `"partition key token"`.

In a recent change the code handling `token()` expressions has been
unified with the code that handles generic function calls,
and as a result the name has changed to `token(p1, p2)`.

It turns out that the Java driver relies on the name being
`"partition key token"`, so a change to `token(p1, p2)`
broke some things.

This patch sets the name back to `"partition key token"`.
To achieve this we detect any restrictions that match
the pattern `token(p1, p2, p3) = X` and set the receiver
name for X to `"partition key token"`.
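The renaming rule can be sketched as follows (the helper and its boolean input are illustrative, not Scylla's actual expression machinery): once a restriction is recognized as a full-partition-key token comparison, the receiver keeps the fixed name rather than the rendered expression.

```cpp
#include <cassert>
#include <string>

// Choose the receiver name for a bind marker: the fixed string for a
// partition-key token restriction, the rendered expression otherwise.
std::string receiver_name(bool is_partition_key_token_restriction,
                          const std::string& rendered_expression) {
    if (is_partition_key_token_restriction) {
        return "partition key token";
    }
    return rendered_expression;
}
```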

Fixes: #13769

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-05-09 12:32:57 +02:00
Petr Gusev
b2e5d8c21c sequenced_set: add extract_set method
This can be useful when we want to reuse the underlying set
after we are done with the sequenced_set instance.
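A hedged sketch of an extract_set-style method (this mirrors the idea, not Scylla's actual sequenced_set interface): the underlying set is moved out so the caller can reuse it once the wrapper is no longer needed.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// A container that preserves insertion order while deduplicating via a
// set; extract_set() moves the set out for reuse by the caller.
template <typename T>
class sequenced_set_like {
    std::set<T> _set;
    std::vector<T> _seq;
public:
    void push_back(const T& v) {
        if (_set.insert(v).second) {
            _seq.push_back(v);
        }
    }
    std::set<T> extract_set() && { return std::move(_set); }
};
```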
2023-05-09 13:56:38 +04:00
Petr Gusev
0567ab82ac token_metadata_impl: extract maybe_migration_endpoints helper function
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
2023-05-09 13:56:38 +04:00
Petr Gusev
030f0f73aa token_metadata_impl: introduce migration_info
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.

Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
2023-05-09 13:56:38 +04:00
Petr Gusev
56c2b3e893 token_metadata_impl: refactor update_pending_ranges
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.

Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.

The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.

The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.

Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
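The core of the refactoring can be sketched with plain sets (endpoint and function names are illustrative): pending endpoints for a range are those returned by the new (merged) token_metadata but absent from the current one.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Endpoints in the merged (planned) replica set that are not in the
// current one become pending for the range.
std::set<std::string> pending_endpoints(const std::set<std::string>& current,
                                        const std::set<std::string>& merged) {
    std::set<std::string> out;
    std::set_difference(merged.begin(), merged.end(),
                        current.begin(), current.end(),
                        std::inserter(out, out.begin()));
    return out;
}
```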
2023-05-09 13:56:28 +04:00
Petr Gusev
3120cabf56 token_metadata: add unit tests
We are going to refactor update_pending_ranges,
so in this commit we add some simple unit tests
to ensure we don't break it.
2023-05-09 13:56:06 +04:00
Benny Halevy
adfb79ba3e raft, idl: restore internal::tagged_uint64 type
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.

However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.

This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.

Fixes #13752

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 12:38:20 +03:00
Kamil Braun
41cac23aa4 Merge 'raft: verify RPC destination ID' from Mikołaj Grzebieluch
All Raft verbs include `dst_id`, the ID of the destination server, but
it isn't checked. `append_entries` will work even if it arrives at
completely the wrong server (but in the same group). It can cause
problems, e.g. in the scenario of replacing a dead node.

This commit adds a check that `dst_id` matches the server's ID; if it
doesn't, the Raft verb is rejected.
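The check itself is simple; a minimal sketch (the real code works with raft::server_id and logs before rejecting): a verb addressed to a different server is refused instead of being silently processed.

```cpp
#include <cassert>
#include <cstdint>

// Accept a Raft verb only if it was addressed to this server.
bool should_handle_raft_verb(uint64_t my_id, uint64_t dst_id) {
    return dst_id == my_id;
}
```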

Closes #12179

Testing
---

Testcase and scylla's configuration:
57d3ef14d8

It artificially lengthens the replacement of the old node,
which increases the chance of the new node sending an RPC command
to the already-replaced node.

In the logs of the node that replaced the old one, we can see logs in
the form:
```
DEBUG <time> [shard 0] raft_group_registry - Got message for server <dst_id>, but my id is <my_id>
```
It indicates that the Raft verb with the wrong `dst_id` was rejected.

This test isn't included in the PR because it doesn't catch any specific error.

Closes #13575

* github.com:scylladb/scylladb:
  service/raft: raft_group_registry: Add verification of destination ID
  service/raft: raft_group_registry: `handle_raft_rpc` refactor
2023-05-09 11:33:28 +02:00
Aleksandra Martyniuk
f199ec5ec3 test: extend test_compaction_task.py to test scrub compaction 2023-05-09 11:15:26 +02:00
Aleksandra Martyniuk
83d3463d10 compaction: add table_scrub_sstables_compaction_task_impl
Implementation of task_manager's task covering scrub sstables
compaction of one table.
2023-05-09 11:15:25 +02:00
Aleksandra Martyniuk
d8e4a2fee3 compaction: add shard_scrub_sstables_compaction_task_impl
Implementation of task_manager's task covering scrub sstables
compaction on one shard.
2023-05-09 11:14:36 +02:00
Aleksandra Martyniuk
8d32579fe6 compaction: add scrub_sstables_compaction_task_impl
Implementation of task_manager's task covering scrub sstables
compaction.
2023-05-09 11:13:57 +02:00
Kefu Chai
a69282e69b sstables: storage: coroutinize idempotent_link_file()
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 16:47:00 +08:00
Kefu Chai
2eefcb37eb sstables: extract storage out
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh, for a couple of reasons:

- for better readability. sstables.hh is over 1005 lines
  and sstables.cc 3602 lines; it's a little difficult to figure
  out how the different parts of these sources interact with each
  other. for instance, with this change, it's clear that some of the
  helper functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
  files out, they can be compiled individually, so changing one .cc
  file does not impact the others. this could speed up compilation
  time.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-09 16:47:00 +08:00
Aleksandra Martyniuk
79c39e4ea7 api: get rid of unnecessary std::optional in scrub
In scrub, the lambdas returning std::optional<compaction_stats> can
never return an empty value, so the std::optional wrapper isn't needed.
2023-05-09 10:31:44 +02:00
Aleksandra Martyniuk
40809c887e compaction: rename rewrite_sstables_compaction_task_impl
Rename rewrite_sstables_compaction_task_impl to
sstables_compaction_task_impl as a new name describes the class
of tasks better. Rewriting sstables is a slightly more fine-grained
type of sstable compaction task than the one needed here.
2023-05-09 10:31:44 +02:00
Botond Dénes
20f620feb9 Merge 'replica, sstable: replace generation_type::value() with generation_type::as_int()' from Kefu Chai
this series prepares for the UUID-based generation by replacing the general `value()` function with a more specifically named function: `as_int()`.

Closes #13796

* github.com:scylladb/scylladb:
  test: drop a reusable_sst() variant which accepts int as generation
  treewide: replace generation_type::value() with generation_type::as_int()
2023-05-09 07:30:54 +03:00
Benny Halevy
531ac63a8d raft: define term_t as a tagged uint64_t
It was defined as a tagged (signed) int64_t by mistake
in f5f566bdd8.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 06:51:26 +03:00
Benny Halevy
d3a59fdefd idl: gossip_digest: include required headers
Make the header self-sufficient, before the next patch
that will affect tagged_integer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-09 06:51:26 +03:00
Michał Chojnowski
0813fa1da0 database: fix reads_memory_consumption for system semaphore
The metric shows the opposite of what its name suggests.
It shows available memory rather than consumed memory.
Fix that.

Fixes #13810

Closes #13811
2023-05-09 06:42:43 +03:00
Kamil Braun
ddb5b45aef docs: update topology-over-raft.md
It was already outdated before this PR.
Describe the version of topology state machine implemented in this PR.
Fix some typos and make it proper markdown so it renders nicely on
GitHub etc.
2023-05-08 16:49:01 +02:00
Kamil Braun
f581282625 test: topology_experimental_raft: test check_and_repair_cdc API 2023-05-08 16:49:01 +02:00
Kamil Braun
372a06f735 raft topology: implement check_and_repair_cdc_streams API
The original API is gossiper-based. Since we're moving CDC generations
handling to Raft-based topology, we need to implement this API as well.

For now the API creates a new generation unconditionally, in a follow-up
I'll introduce a check to skip the creation if the current generation is
optimal.
2023-05-08 16:49:01 +02:00
Kamil Braun
a09ed01ffa raft topology: implement global request handling
The only possible request for now is creating a new CDC generation.
2023-05-08 16:49:01 +02:00
Kamil Braun
2bd333dd84 raft topology: introduce prepare_new_cdc_generation_data
Refactor the code, taking a bulk of the CDC-specific code used when
there's a bootstrap request to a separate function. We'll use it
elsewhere as well.
2023-05-08 16:48:59 +02:00
Kamil Braun
2863ef3df4 raft_topology: get_node_to_work_on_opt: return guard if no node found
We'll need the guard back.
2023-05-08 16:48:29 +02:00
Kamil Braun
afcf17f168 raft topology: remove node_to_work_on from commit_cdc_generation transition
We don't need it for anything in this state, and this change allows us
to commit CDC generations without transitioning nodes.
2023-05-08 16:47:37 +02:00
Kamil Braun
1b21a3c5ae raft topology: separate publish_cdc_generation state
Previously the generation committed in `commit_cdc_generation` state
would be published by the coordinator in `write_both_read_old` state.
This logic assumed that we only create new CDC generations during node
bootstrap.

We'll allow committing new generations without bootstrap (without any
node transitions in fact), so we need this separate state.

After publishing the generation, we check whether there is a
transitioning node; if so, we'll enter `write_both_read_old` as next
state, otherwise we'll make the topology non-transitioning.
2023-05-08 16:47:24 +02:00
Kamil Braun
6d5b8c1b7c raft topology: non-node-specific exec_global_command
This function broadcasts a command to cluster members. It takes a
`node_to_work_on`. We'll need a version which works in situations where
there is no 'node to work on'.
2023-05-08 16:47:13 +02:00
Kamil Braun
8b5237a058 raft topology: introduce start_operation()
This calls `raft_group0_client::start_operation` and checks if the term
is different from the term that the coordinator was initially created
with; if so, we must no longer continue coordinating the topology.

There was one direct call to `raft_group0_client::start_operation`
without a term check, replace it with the introduced function.
2023-05-08 16:47:13 +02:00
Kamil Braun
90770f712c raft topology: non-node-specific topology_mutation_builder
The existing `topology_mutation_builder` took a `raft::server_id` in its
constructor and immediately created a clustering row in the
`system.topology` mutation that it was building for the given node.
This does not allow building mutations which only affect the static
columns.

Split the class into two:
- `topology_mutation_builder` doesn't take `raft::server_id` in its
  constructor and contains only the methods that are used to set static
  columns. It also has a `with_node` method taking a `raft::server_id`
  which returns a `topology_node_mutation_builder&`.
- `topology_node_mutation_builder` creates the clustering row and allows
  setting its columns.

We'll use `topology_mutation_builder` when we only want to transition
the cluster-global topology state, without affecting any specific nodes'
states.
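A hedged sketch of the split described above (member and column names are illustrative, not the real builders): the global builder touches only static columns, while `with_node()` yields a node-scoped builder for the clustering-row columns of one node.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Node-scoped builder: sets columns of one node's clustering row.
struct topology_node_builder {
    std::map<std::string, std::string>& row;
    topology_node_builder& set(const std::string& k, const std::string& v) {
        row[k] = v;
        return *this;
    }
};

// Global builder: sets static columns; with_node() scopes to one node.
struct topology_builder {
    std::map<std::string, std::string> statics;
    std::map<uint64_t, std::map<std::string, std::string>> rows;
    topology_builder& set_static(const std::string& k, const std::string& v) {
        statics[k] = v;
        return *this;
    }
    topology_node_builder with_node(uint64_t id) {
        return topology_node_builder{rows[id]};
    }
};
```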
2023-05-08 16:47:11 +02:00
Kamil Braun
acfb6bf3ed topology_state_machine: introduce global_topology_request
`topology` currently contains the `requests` map, which is suitable for
node-specific requests such as "this node wants to join" or "this node
must be removed". But for requests for operations that affect the
cluster as a whole, a separate request type and field is more
appropriate. Introduce one.

The enum currently contains the option `new_cdc_generation` for requests
to create a new CDC generation in the cluster. We will implement the
whole procedure in later commits.
2023-05-08 16:46:14 +02:00
Kamil Braun
7c5056492e topology_state_machine: use uint16_t for enum_classes
16 bits ought to be enough for everyone.
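The storage-width change, in miniature (the enumerator set here is illustrative; only `new_cdc_generation` is named in the surrounding commits):

```cpp
#include <cassert>
#include <cstdint>

// An enum class with an explicit 16-bit underlying type.
enum class global_topology_request : uint16_t {
    new_cdc_generation,
};

static_assert(sizeof(global_topology_request) == 2,
              "16 bits ought to be enough for everyone");
```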
2023-05-08 16:46:14 +02:00
Kamil Braun
93dcdcd4eb raft topology: make new_cdc_generation_data_uuid topology-global
- make it a static column in `system.topology`
- move it from node-specific `ring_slice` to cluster-global `topology`

We will use it in scenarios where no node is transitioning.

Also make it `std::optional` in topology for consistency with other
fields (previously, the 'no value' state for this field was represented
using default-constructed `utils::UUID`).
2023-05-08 16:46:14 +02:00
Nadav Har'El
5f37d43ee6 Merge 'compaction: validate: validate the index too' from Botond Dénes
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.

Fixes: #9611

Closes #11405

* github.com:scylladb/scylladb:
  test/cql-pytest: add test_sstable_validation.py
  test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
  tools/scylla-sstables: write validation result to stdout
  sstables/sstable: validate(): delegate to mx validator for mx sstables
  sstables/mx/reader: add mx specific validator
  mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
  sstables/mx/reader: template data_consume_rows_context_m on the consumer
  sstables/mx/reader: move row_processing_result to namespace scope
  sstables/mx/reader: use data_consumer::proceed directly
  sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
  compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
  compaction/compaction: move away from scrub_validate_mode_validate_reader()
  tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
  test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
  sstables/sstable: add validate() method
  compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
  compaction: scrub: use error messages from validator
  mutation_fragment_stream_validator: produce error messages in low-level validator
2023-05-08 17:14:26 +03:00
Botond Dénes
b790f14456 reader_concurrency_semaphore: execution_loop(): trigger admission check when _ready_list is empty
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loop relies on admission checks, triggered by reads
releasing resources, to bring queued reads into the _ready_list
while it is executing the current read. But in some cases the current
read might not free any resources and thus fails to trigger an admission
check, so the currently queued permits sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
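A toy model of the loop shape described above (numbers and structure are illustrative, not the real semaphore): after the ready list is drained, an explicit admission check promotes any waiters that now fit, instead of relying on the executed reads to have released something.

```cpp
#include <cassert>
#include <deque>

// A toy permit queue: maybe_admit() moves waiters to the ready list
// while resources allow; execution_loop() drains the ready list and
// then runs the explicit admission check added by the fix.
struct toy_semaphore {
    int free_units = 1;
    std::deque<int> ready;    // admitted permits, 1 unit each
    std::deque<int> waiting;  // queued permits
    void maybe_admit() {
        while (!waiting.empty() && free_units > 0) {
            ready.push_back(waiting.front());
            waiting.pop_front();
            --free_units;
        }
    }
    void execution_loop() {
        while (!ready.empty()) {
            ready.pop_front();  // "execute" the permit
            ++free_units;       // this toy releases; the real read may not
        }
        maybe_admit();  // the added explicit check after draining
    }
};
```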

Fixes: #13540

Closes #13541
2023-05-08 17:11:41 +03:00
Takuya ASADA
fdceda20cc scylla_raid_setup: wipe filesystem signatures from specified disks
The discussion on the thread says that when we reformat a volume with
another filesystem, the kernel and libblkid may skip populating
/dev/disk/by-* because they detect two filesystem signatures, since
mkfs.xxx did not clear the previous one.
To avoid this, we need to run wipefs before running mkfs.

Note that this runs wipefs twice: for the target disks and also for the RAID device.
wipefs on the RAID device is needed because wipefs on the disks doesn't clear filesystem signatures on /dev/mdX (we may see a previous filesystem signature on /dev/mdX when we construct a RAID volume multiple times on the same disks).

Also dropped the -f option from mkfs.xfs; this verifies that wipefs
is working as we expected.

Fixes #13737

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #13738
2023-05-08 16:53:43 +03:00
Anna Stuchlik
98e1d7a692 doc: add the Elixir driver to the docs
This commit adds the link to the Elixir driver
to the list of the third-party drivers.
The driver actively supports ScyllaDB.

This is v2 of https://github.com/scylladb/scylladb/pull/13701

Closes #13806
2023-05-08 15:36:35 +03:00
Botond Dénes
a6ee3b25a7 test/boost/database_test: add unit test for semaphore mismatch on range scans
Check that:
* Mismatch is detected;
* Mismatch is handled without crash (e.g. due to unclosed readers);
2023-05-08 07:35:39 -04:00
Botond Dénes
71bae0c549 partition_slice_builder: add set_specific_ranges()
The builder only has a method to mutate existing specific ranges. This
patch adds one to set or overwrite it.
2023-05-08 07:35:39 -04:00
Botond Dénes
4ba7810f60 multishard_mutation_query: make reader_context::lookup_readers() exception safe
With regards to closing the looked-up querier if an exception is thrown.
In particular, this requires closing the querier if a semaphore mismatch
is detected. Move the table lookup above the line where the querier is
looked up, to avoid having to handle the exception from it.
2023-05-08 07:35:39 -04:00
Botond Dénes
227b0d3f08 multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
Needed by the next patch. Sad, but it runs once/shard/page, so it
shouldn't be noticable.
2023-05-08 07:35:33 -04:00
Kamil Braun
153cb00e9d test: test_random_tables: wait for token ring convergence before data queries
The test performs an `INSERT` followed by a `SELECT`, checking if the
previously inserted data is returned.

This may fail because we're using `ring_delay = 0` in tests and the two
queries may arrive at different nodes, whose `token_metadata` didn't
converge yet (it's eventually consistent based on gossiping).
I illustrated this here:
https://github.com/scylladb/scylladb/issues/12937#issuecomment-1536147455

Ensure that the nodes' token rings are synchronized (by waiting until
the token ring members on each node are the same as the group 0
configuration).

Fixes #12937

Closes #13791
2023-05-08 13:22:52 +02:00
Kamil Braun
3f3dcf451b test: pylib: random_tables: perform read barrier in verify_schema
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.

However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.

To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.

Fixes #13788

Closes #13789
2023-05-08 13:21:10 +02:00
Avi Kivity
198738f2b1 Merge 'build: compile wasm udfs automatically' from Wojciech Mitros
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the source codes, and build Wasm when
necessary. This series adds build commands that
compile C/Rust sources to Wasm and uses them for Wasm
programs that we're already using.

After these changes, adding a new program that should be
compiled to Rust, requires only adding the source code
of it and updating the `wasms` and `wasm_deps` lists in
`configure.py`.

All Wasm programs are built by default when building all
artifacts, artifacts in a given mode, or when building
tests. Additionally, a {mode}-wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/{mode}/wasm,
and are accessed in cql-pytests similarly to the way we're
accessing the scylla binary - using glob.

Closes #13209

* github.com:scylladb/scylladb:
  wasm: replace wasm programs with their source programs
  build: prepare rules for compiling wasm files
  build: set the type of build_artifacts
  test: extend capabilities of Wasm reading helper function
2023-05-08 13:51:53 +03:00
Petr Gusev
e5c6af17e6 token_metadata: fix indentation 2023-05-08 13:16:21 +04:00
Petr Gusev
435a7573ff token_metadata_impl: return unique_ptr from clone functions
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.

token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
2023-05-08 13:16:21 +04:00
Wojciech Mitros
6d89d718d9 wasm: replace wasm programs with their source programs
After recent changes, we are able to store only the
C/Rust source codes for Wasm programs, and only build
them when necessary. This patch utilizes this
opportunity by removing most of the currently stored
raw Wasm programs, replacing them with C/Rust sources
and adding them to the new build system.
2023-05-08 10:47:34 +02:00
Wojciech Mitros
c065ae0ded build: prepare rules for compiling wasm files
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the (C/Rust) source codes, and build Wasm
when necessary. This patch adds build commands that
compile C/Rust sources to Wasm.
After these changes, adding a new program that should be
compiled to Wasm requires only adding its source code
and updating the wasms and wasm_deps lists in
configure.py.
All Wasm programs are built by default when building all
artifacts, all artifacts in a given mode, or when building
tests. Additionally, a ninja wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/wasm.
2023-05-08 10:47:34 +02:00
Wojciech Mitros
c53d68ee3e build: set the type of build_artifacts
Currently, build_artifacts are of type set[str] | list, which prevents
us from performing set operations on it. In a future patch, we will
want to take a set difference and set intersections with it, so we
initialize the type of build_artifacts to a set in all cases.
2023-05-08 10:47:34 +02:00
Wojciech Mitros
0a34a54c73 test: extend capabilities of Wasm reading helper function
Currently, we require that the Wasm file is named the same
as the function. In the future we may want multiple functions
with the same name, which we can't currently do due to this
limitation.
This patch allows specifying the function name, so that multiple
files can have a function with the same name.
Additionally, the helper method now escapes "'" characters, so
that they can appear in future Wasm files.
2023-05-08 10:47:34 +02:00
Botond Dénes
ab5fd0f750 Merge 's3: Provide timestamps in the s3 file implementation' from Raphael "Raph" Carvalho
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.

Therefore, let's implement it.

Fixes https://github.com/scylladb/scylladb/issues/13649.

Closes #13713

* github.com:scylladb/scylladb:
  s3: Provide timestamps in the s3 file implementation
  s3: Introduce get_object_stats()
  s3: introduce get_object_header()
2023-05-08 11:43:41 +03:00
Raphael S. Carvalho
ad471e5846 s3: Provide timestamps in the s3 file implementation
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.

Fixes #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:51:12 -03:00
Raphael S. Carvalho
57661f0392 s3: Introduce get_object_stats()
get_object_stats() will be used for retrieving content size and
also last modified.

The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.

Refs #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:51:10 -03:00
Raphael S. Carvalho
da2ccc44a4 s3: introduce get_object_header()
This allows other functions to reuse the code to retrieve the
object header.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-07 19:49:52 -03:00
Kefu Chai
5fa459bd1a treewide: do not include unused header
since #13452, we switched most of the caller sites from std::regex
to boost::regex. in this change, all occurrences of `#include <regex>`
are dropped unless std::regex is used in the same source file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13765
2023-05-07 19:01:29 +03:00
Kefu Chai
468460718a utils: UUID: drop uint64_t_tri_compare()
functionality-wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so no need to keep it. in this change,
it is dropped in favor of <=>.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13794
2023-05-07 18:07:49 +03:00
Avi Kivity
380c0b0f33 cql3: untyped_result_set: switch to managed_bytes_view as the cell type
Now that result_set uses managed_bytes_opt for its internals, it's
easy to switch untyped_result_set too. This avoids large
contiguous allocations.
2023-05-07 17:17:36 +03:00
Avi Kivity
42a1ced73b cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.

Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.

The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
2023-05-07 17:17:36 +03:00
Avi Kivity
df4b7e8500 cql3: untyped_result_set: always own data
untyped_result_set is used for internal queries, where ease-of-use is more
important than performance. Currently, cells are held either by value or
by reference (managed_bytes_view). An upcoming change will cause the
result set to be built from managed_bytes_view, making it non-owning, but
the source data is not actually held, resulting in a use-after-free.

Rather than chase the source and force the data to be owned in this case,
just drop the possibility for a non-owning untyped_result_set. It's only
used in non-performance-critical paths and safety is more important than
saving a few cycles.

This also results in simplification: previously, we had a variant selecting
monostate (for NULL), managed_bytes_view (for a reference), and bytes (for
owning data); now we only have a bytes_opt since that already signifies
data-or-NULL.

Once result_set transitions to managed_bytes_opt, untyped_result_set
will follow. For now it's easier to use bytes_opt.
2023-05-07 17:17:36 +03:00
Avi Kivity
d3e9fd49a3 types: abstract_type: add mixed-type versions of compare() and equal()
compare() and equal() can compare two unfragmented values or two
fragmented values, but a mix of a fragmented value and an unfragmented
value runs afoul of C++ conversion rules. Add more overloads to
make it simpler for users.
2023-05-07 17:17:36 +03:00
Avi Kivity
11d651b606 utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).

One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.

In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.

Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.

Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
2023-05-07 17:17:34 +03:00
Avi Kivity
613f4b9858 utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
Useful, rather than open-coding the conversions.
2023-05-07 17:16:38 +03:00
Avi Kivity
1e6ef5503c utils: managed_bytes: add managed_bytes_view::with_linearized()
Becomes useful in later patches.

To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
2023-05-07 17:16:38 +03:00
Avi Kivity
08ba0935e2 utils: managed_bytes: mark managed_bytes_view::is_linearized() const
It's trivially const, mark it so.
2023-05-07 17:16:38 +03:00
Tomasz Grabiec
d8826acaa3 tablets: Fix stack smashing in tablet_map_to_mutation()
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:

   data_value(std::optional<NativeType>);

mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):

    bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
    auto i = b.begin();
    value.serialize(i);

Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too small, we end up with
stack or heap corruption after serialize().

For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).

Fixes #13717

Closes #13787
2023-05-07 14:07:50 +03:00
Botond Dénes
c1e8e86637 reader_concurrency_semaphore: reader_permit: clean-up after failed memory requests
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner-case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` field is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include the amount from the failed requests. This extra amount will not be
released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.

Fixes: #13539

Closes #13679
2023-05-07 14:06:51 +03:00
Kefu Chai
bd3e8d0460 test: drop a reusable_sst() variant which accepts int as generation
this is one of the changes to reduce the usage of integer-based generations
in tests. in the future, we will need to expand the tests to exercise the
UUID-based generation, or at least to be neutral to the underlying
generation's identifier type. so, removing the helpers which only accept
`generation_type::int_t` helps us make this happen.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-06 18:24:48 +08:00
Kefu Chai
9b35faf485 treewide: replace generation_type::value() with generation_type::as_int()
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()

because we will switch over to a UUID-based generation identifier, the member
function value() and the free function generation_value() cannot fulfill the
needs anymore. so, in this change, they are consolidated and replaced by
"as_int()", whose name is more specific and won't be misleading even after
switching to the UUID-based generation identifier, whereas `value()` would be
confusing by then: it could be an integer or a UUID.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-06 18:24:45 +08:00
Kamil Braun
aba31ad06c storage_service: use seastar::format instead of fmt::format
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```

I found this in our configure.py:
```
def get_clang_inline_threshold():
    if args.clang_inline_threshold != -1:
        return args.clang_inline_threshold
    elif platform.machine() == 'aarch64':
        # we see miscompiles with 1200 and above with format("{}", uuid)
        # also coroutine miscompiles with 600
        return 300
    else:
        return 2500
```
but reducing it to `0` didn't help.

I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
 (inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
 (inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
 (inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```

Maybe it's a bug in `fmt` library?

In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.

Do it for the entire file for consistency (and avoiding this bug).

Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.

Fixes #13707

Closes #13711
2023-05-05 19:23:22 +02:00
Kamil Braun
70f2b09397 Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.

Closes #13427

* github.com:scylladb/scylladb:
  pylib_test: add tests for read_last_line
  pytest: add pylib_test directory
  scylla_cluster.py: fix read_last_line
  scylla_cluster.py: move read_last_line to util.py
2023-05-05 13:29:15 +02:00
Botond Dénes
1e9dcaff01 Merge 'build: cmake: use Seastar API level 6' from Kefu Chai
to avoid the FTBFS after we bump up the Seastar submodule which bumped up its API level to v7. and API v7 is a breaking change. so, in order to unbreak the build, we have to hardwire the API level to 6. `configure.py` also does this.

Closes #13780

* github.com:scylladb/scylladb:
  build: cmake: disable deprecated warning
  build: cmake: use Seastar API level 6
2023-05-05 13:55:34 +03:00
Kefu Chai
05a172c7e7 build: cmake: link against Boost::unit_test_framework
we introduced the linkage to Boost::unit_test_framework in
fe70333c19; this library is used by
test/lib/test_utils.cc, so update CMake accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13781
2023-05-05 13:55:00 +03:00
Petr Gusev
8a0bcf9d9d pylib_test: add tests for read_last_line 2023-05-05 12:57:43 +04:00
Petr Gusev
7476e91d67 pytest: add pylib_test directory
We want to add tests for read_last_line,
in this commit we add a new directory for them
since there were no tests for pylib code before.
2023-05-05 12:57:43 +04:00
Petr Gusev
330d1d5163 scylla_cluster.py: fix read_last_line
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 512 chars;
* the logic of the function made simpler and more
maintainable.
2023-05-05 12:57:36 +04:00
Petr Gusev
8a5e211c30 scylla_cluster.py: move read_last_line to util.py
We want to add tests for read_last_line, so we
move it to make this simpler.
2023-05-05 12:51:25 +04:00
Botond Dénes
687a8bb2f0 Merge 'Sanitize test::filename(sstable) API' from Pavel Emelyanov
There are two of them currently with slightly different declaration. Better to leave only one.

Closes #13772

* github.com:scylladb/scylladb:
  test: Deduplicate test::filename() static overload
  test: Make test::filename return fs::path
2023-05-05 11:36:08 +03:00
Botond Dénes
b704698ba5 Merge 'Close toc file in remove_by_toc_name()' from Pavel Emelyanov
The method in question suffers from scylladb/seastar#1298. The PR fixes it and makes it a bit shorter along the way

Closes #13776

* github.com:scylladb/scylladb:
  sstable: Close file at the end
  sstables: Use read_entire_stream_cont() helper
2023-05-05 11:33:05 +03:00
Anna Stuchlik
27b0dff063 doc: make branch-5.2 latest and stable
This commit changes the configuration in the conf.py
file to make branch-5.2 the latest version and
remove it from the list of unstable versions.

As a result, the docs for version 5.2 will become
the default for users accessing the ScyllaDB Open Source
documentation.

This commit should be merged as soon as version 5.2
is released.

Closes #13681
2023-05-05 11:11:17 +03:00
Botond Dénes
0cccf9f1cc Merge 'Remove some file_writer public methods' from Pavel Emelyanov
One is unused, and the other is not really required to be public

Closes #13771

* github.com:scylladb/scylladb:
  file_writer: Remove static make() helper
  sstable: Use toc_filename() to print TOC file path
2023-05-05 10:48:46 +03:00
Pavel Emelyanov
ac305076bd test: Split test_twcs_interposer_on_memtable_flush naturally
The test case consists of two internal sub-test-cases. Making them
explicit kills three birds with one stone

- improves parallelizm
- removes env's tempdir wiping
- fixes code indentation

refs: #12707

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13768
2023-05-05 10:42:30 +03:00
Raphael S. Carvalho
1f69c46889 sstables: use version_types received from parser or writer
This is only a cosmetic change; no change in semantics

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13779
2023-05-05 10:32:14 +03:00
Kefu Chai
e4c6b0b31d build: cmake: disable deprecated warning
since Seastar now deprecates a bunch of APIs which accept io_priority_class,
we started to have deprecated warnings. before migrating to V7 API,
let's disable this warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-05 15:31:39 +08:00
Kefu Chai
3c941e8b8a build: cmake: use Seastar API level 6
to avoid the FTBFS after we bump up the Seastar submodule
which bumped up its API level to v7. and API v7 is a breaking
change. so, in order to unbreak the build, we have to hardwire
the API level to 6. `configure.py` also does this.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-05 15:21:42 +08:00
Avi Kivity
fe1cd6f477 Update seastar submodule
* seastar 02d5a0d7c...f94b1bb9c (12):
  > Merge 'Unify CPU scheduling groups and IO priority classes' from Pavel Emelyanov
  > scripts: addr2line: relax regular expression for matching kernel traces
  > add dirs for clangd to .gitignore
  > http::client: Log failed requests' body
  > build: always quote the ENVIRONMENT with quotes
  > exception_hacks: Change guard check order to work around static init fail
  > shared_future: remove support for variadic futures
  > iotune: Don't close file that wasn't opened
Fixes #13439
  > Merge 'Relax per tick IO grab threshold' from Pavel Emelyanov
  > future: simplify constraint on then() a little
  > Merge 'coroutine: generator: initialize const member variable and enable generator tests' from Kefu Chai
  > future: drop libc++ std::tuple compatibility hack

Closes #13777
2023-05-05 00:32:11 +03:00
Pavel Emelyanov
75e7187e1a sstable: Close file at the end
The thing is that when closing a file input stream the underlying file is
not .close()-d (see scylladb/seastar#1298). The remove_by_toc_name() is
buggy in this sense. Using with_closeable() fixes it and makes the code
shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 20:37:48 +03:00
Pavel Emelyanov
334383beb5 sstables: Use read_entire_stream_cont() helper
The remove_by_toc_name() wants to read the whole stream into a sstring.
There's a convenience helper to facilitate that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 20:37:09 +03:00
Avi Kivity
f125a3e315 Merge 'tree: finish the reader_permit state renames' from Botond Dénes
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date.

Closes #13573

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
  reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
  reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
  reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
  reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
2023-05-04 18:29:04 +03:00
Avi Kivity
204521b9a7 Merge 'mutation/mutation_compactor: validate range tombstone change before it is moved' from Botond Dénes
e2c9cdb576 moved the validation of the range tombstone change to the place where it is actually consumed, so we don't attempt to pass purged or discarded range tombstones to the validator. In doing so, however, the validation pass was moved to after the consume call, which moves the range tombstone change, so the validator was passed a moved-from range tombstone. Fix this by moving the validation to before the consume call.

Refs: #12575

Closes #13749

* github.com:scylladb/scylladb:
  test/boost/mutation_test: add sanity test for mutation compaction validator
  mutation/mutation_compactor: add validation level to compaction state query constructor
  mutation/mutation_compactor: validate range tombstone change before it is moved
2023-05-04 18:15:35 +03:00
Avi Kivity
1d351dde06 Merge 'Make S3 client work with real S3' from Pavel Emelyanov
The current S3 client was tested over minio, and it takes a few more touches to make it work with Amazon S3.

The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is that the request _body_ may or may not be included in the signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body must be included in the signature calculations), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.

So what this set does is makes the existing S3 client code sign requests. In order to sign the request the code needs to get AWS key and secret (and region) from somewhere and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit S3 client needs.

In order to properly support HTTPS, the PR adds a special connection factory to be used with the seastar http client. The factory performs DNS resolution of AWS endpoint names and configures the gnutls system trust.

fixes: #13425

Closes #13493

* github.com:scylladb/scylladb:
  doc: Add a document describing how to configure S3 backend
  s3/test: Add ability to run boost test over real s3
  s3/client: Sign requests if configured
  s3/client: Add connection factory with DNS resolve and configurable HTTPS
  s3/client: Keep server port on config
  s3/client: Construct it with config
  s3/client: Construct it with sstring endpoint
  sstables: Make s3_storage with endpoint config
  sstables_manager: Keep object storage configs onboard
  code: Introduce conf/object_storage.yaml configuration file
2023-05-04 18:08:54 +03:00
Avi Kivity
2d74dc0efd Merge 'sstable_directory: parallel_for_each_restricted: do not move container' from Benny Halevy
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
            query = "SELECT COUNT(*) FROM cf"
            statement = SimpleStatement(query)
            s = self.patient_cql_connection(node, 'ks')
            result = list(s.execute(statement))
    >       assert result[0].count == expected_number_of_rows, \
                "Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT *
FROM ks.cf")))
    E       AssertionError: Expected 1 rows. Got []
    E       assert 0 == 1
    E         +0
```

The reason for the regression is that the call to `do_for_each_sstable` in `collect_all_shared_sstables` to search for sstables that need cleanup caused the list of sstables in the sstable directory to be moved and cleared.

parallel_for_each_restricted moves the container passed to it into a `do_with` continuation. This is required to keep the container alive for the duration of the parallel iteration.

However, moving the container is destructive and so, the decision whether to move or not needs to be the caller's, not the callee.

This patch changes the signature of parallel_for_each_restricted to accept a container rather than a rvalue reference, allowing the callers to decide whether to move or not.

Most callers are converted to move the container, except for `do_for_each_sstable` that copies `_unshared_local_sstables`, allowing callers to call `dir.do_for_each_sstable` multiple times without moving the list contents.

Closes #13526

* github.com:scylladb/scylladb:
  sstable_directory: coroutinize parallel_for_each_restricted
  sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
  sstable_directory: parallel_for_each_restricted: do not move container
2023-05-04 17:39:05 +03:00
Pavel Emelyanov
56dfc21ba0 test: Deduplicate test::filename() static overload
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped, callers are patched to use
the non-static one making the code tiny bit shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 17:16:00 +03:00
Pavel Emelyanov
3f30a253be test: Make test::filename return fs::path
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different and sometimes
rely on it being one. For that they use a test wrapper that has access
to private members and may make assumptions about the meaning of
sstable::filename().

That said, test::filename() should return fs::path, not sstring.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 17:14:04 +03:00
Michał Chojnowski
eb5ccb7356 mutation_partition_v2: fix a minor bug in printer
Commit 1cb95b8cf caused a small regression in the debug printer.
After that commit, range tombstones are printed to stdout,
instead of the target stream.
In practice, this causes range tombstones to appear in test logs
out of order with respect to other parts of the debug message.

Fix that.

Closes #13766
2023-05-04 16:56:40 +03:00
Pavel Emelyanov
c4394a059c file_writer: Remove static make() helper
It's simply unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 16:55:41 +03:00
Pavel Emelyanov
eaf534cc4b sstable: Use toc_filename() to print TOC file path
The sstable::write_toc() gets TOC filename from file writer, while it
can get it from itself. This makes the file_writer::get_filename()
private and actually improves logging, as the writer is not required
to have the filename onboard, while sstable always has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-04 16:54:21 +03:00
Mikołaj Grzebieluch
4a8a8c153c service/raft: raft_group_registry: Add verification of destination ID
All Raft verbs include dst_id, the ID of the destination server, but it isn't checked.
`append_entries` will work even if it arrives at completely the wrong server (but in the same group).
It can cause problems, e.g. in the scenario of replacing a dead node.

This commit adds a check that `dst_id` matches the server's ID; if it
doesn't, the Raft verb is rejected.

Closes #12179
2023-05-04 15:25:23 +02:00
Tomasz Grabiec
e385ce8a2b Merge "fix stack use after free during shutdown" from Gleb
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the move to work,
some existing dependencies of raft_group0 are dropped, since they are
not really needed during the object creation.

Fixes #13522
2023-05-04 15:14:18 +02:00
Pavel Emelyanov
fe70333c19 test: Auto-skip object-storage test cases if run from shell
In case an sstable unit test case is run individually, it would fail
with an exception saying that the S3_... environment is not set. It's
better to skip the test case rather than fail. If someone wants to run it
from the shell, they will have to prepare an S3 server (minio/AWS public
bucket) and provide the proper environment for the test case.

refs: #13569

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13755
2023-05-04 14:15:18 +03:00
Mikołaj Grzebieluch
ae41d908d7 service/raft: raft_group_registry: handle_raft_rpc refactor
One-way RPC and two-way RPC have different semantics, i.e. in the first one
the client doesn't need to wait for an answer.

This commit splits the logic of `handle_raft_rpc` to enable handling the
differences in semantics, e.g. error handling.
2023-05-04 13:05:04 +02:00
Botond Dénes
0c9af10470 test/cql-pytest: add test_sstable_validation.py
This test file focuses on stressing the underlying sstable validator
with cases where the data/index has discrepancies.
2023-05-04 06:48:05 -04:00
Botond Dénes
a26224ffb8 test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
From test_tools.py, their current home. They will soon be used by more
than one test file.
2023-05-04 06:48:05 -04:00
Konstantin Osipov
e7c9ca560b test: issue a read barrier before checking ring consistency
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times; it only guarantees the
same order of events on all replicas.

When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.

To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.

Fixes #13518 (flaky topology test).

Closes #13756
2023-05-04 12:22:07 +02:00
Gleb Natapov
dc6c3b60b4 init: move raft_group0 creation before storage_service
storage_service uses raft_group0, so the latter needs to exist until
the former is stopped.
2023-05-04 13:03:18 +03:00
Gleb Natapov
e9fb885e82 service/raft: raft_group0: drop dependency on cdc::generation_service
raft_group0 does not really depend on cdc::generation_service; it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of at its creation.
2023-05-04 13:03:07 +03:00
Benny Halevy
205daf49fd sstable_directory: coroutinize parallel_for_each_restricted
Using a coroutine simplifies the function and reduces the
number of moves it performs.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-04 11:46:59 +03:00
Benny Halevy
e4acc44814 sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
We'd like the container to be a std::ranges::range.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-04 11:44:24 +03:00
Benny Halevy
e2023877f2 sstable_directory: parallel_for_each_restricted: do not move container
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
        query = "SELECT COUNT(*) FROM cf"
        statement = SimpleStatement(query)
        s = self.patient_cql_connection(node, 'ks')
        result = list(s.execute(statement))
>       assert result[0].count == expected_number_of_rows, \
            "Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT * FROM ks.cf")))
E       AssertionError: Expected 1 rows. Got []
E       assert 0 == 1
E         +0
E         -1
```

The reason for the regression is that the call to `do_for_each_sstable`
in `collect_all_shared_sstables` to search for sstables that need
cleanup caused the list of sstables in the sstable directory to be
moved and cleared.

parallel_for_each_restricted moves the container passed to it
into a `do_with` continuation, which keeps the container alive for the
duration of the parallel operation.

However, moving the container is destructive and so,
the decision whether to move or not needs to be the
caller's, not the callee.

This patch changes the signature of parallel_for_each_restricted
to accept a lvalue reference to the container rather than a rvalue reference,
allowing the callers to decide whether to move or not.

Most callers are converted to move the container, as they effectively do
today, and a new method, `filter_sstables`, was added for the
`collect_all_shared_sstables` use case; it allows the `func` that
processes each sstable to decide whether the sstable is kept
in `_unshared_local_sstables` or not.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-04 11:36:25 +03:00
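The signature change described above can be sketched as follows; `for_each_restricted` and the element type are illustrative stand-ins, not the actual Seastar/ScyllaDB API:

```cpp
#include <string>
#include <vector>

// The callee takes an lvalue reference, so it never consumes the caller's
// container on its own. A caller that wants the old consuming behaviour
// moves the container into a local copy first and passes that instead,
// making the destruction an explicit caller-side decision.
template <typename Func>
void for_each_restricted(std::vector<std::string>& names, Func func) {
    for (auto& n : names) {
        func(n);
    }
}
```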
Botond Dénes
6bc5c4acf6 tools/scylla-sstables: write validation result to stdout
Currently the validate command uses the logger to output the result of
validation. This is inconsistent with other commands which all write
their output to stdout and log any additional information/errors to
stderr. This patch updates the validate command to do the same. While at
it, remove the "Validating..." message, it is not useful.
2023-05-04 03:13:07 -04:00
Botond Dénes
c1f18cb0c1 sstables/sstable: validate(): delegate to mx validator for mx sstables
We have a more in-depth validator for the mx format, so delegate to that
if the validated sstable is of that format. For kl/la we fall back to
the reader-level validator we used before.
2023-05-04 03:13:07 -04:00
Botond Dénes
d941d38759 sstables/mx/reader: add mx specific validator
Working with the low-level sstable parser and index reader, this
validator also cross-checks the index with the data file, making sure
all partitions are located at the position and in the order the index
describes. Furthermore, if the index also has a promoted index, the order
and position of clustering elements is checked against it.
This is on top of the usual fragment kind order, partition key order and
clustering order checks that we already had with the reader-level
validator.
2023-05-04 03:13:03 -04:00
Botond Dénes
11f2d6bd0a Merge 'build: only apply -Wno-parentheses-equality to ANTLR generated sources' from Kefu Chai
it turns out the only places where we have compiler warnings of -Wparentheses-equality is the source code generated by ANTLR. strictly speaking, this is valid C++ code, just not quite readable from the hygienic point of view. so let's enable this warning in the source tree, but only disable it when compiling the sources generated by ANTLR.

please note, this warning option is supported by both GCC and Clang, so no need to test if it is supported.

for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
                            if ( (LA4_0 == '$'))
                                  ~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
                            if ( (LA4_0 == '$'))
                                 ~      ^     ~
```

Closes #13762

* github.com:scylladb/scylladb:
  build: only apply -Wno-parentheses-equality to ANTLR generated sources
  compaction: disambiguate format_to()
2023-05-04 10:09:36 +03:00
Kefu Chai
c76486c508 build: only apply -Wno-parentheses-equality to ANTLR generated sources
it turns out the only places where we have compiler warnings of
-Wparentheses-equality is the source code generated by ANTLR. strictly
speaking, this is valid C++ code, just not quite readable from the
hygienic point of view. so let's enable this warning in the source tree,
but only disable it when compiling the sources generated by ANTLR.

please note, this warning option is supported by both GCC and Clang,
so no need to test if it is supported.

for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
                            if ( (LA4_0 == '$'))
                                  ~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
                            if ( (LA4_0 == '$'))
                                 ~      ^     ~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-04 11:16:27 +08:00
Michał Chojnowski
2d1a345068 test: mvcc_test: add a test for gentle schema upgrades 2023-05-04 03:35:15 +02:00
Michał Chojnowski
80c8a6d0e6 partition_version: make partition_entry::upgrade() gentle
Preceding commits in this patch series have extended the MVCC
mechanism to allow for versions with different schemas
in the same entry/snapshot, with on-the-fly and background
schema upgrades to the most recent version in the chain.

Given that, we can perform gentle schema upgrades by simply
adding an empty version with the target schema to the front
of the entry.

This patch is intended to be the first and only behaviour-changing patch in the
series. Previous patches added code paths for multi-schema snapshots, but never
exercised them, because before this patch two different schemas within a single
MVCC chain never happened. This patch makes it happen and thus exercises all the
code in the series up until now.

Fixes #2577
2023-05-04 03:35:15 +02:00
Michał Chojnowski
fe576f8f29 partition_version: handle multi-schema snapshots in merge_partition_versions
Each partition_version is allowed to have a different schema now.
As of this patch, all versions reachable from a snapshot/entry always
have the same schema, but this will change in an upcoming patch.
This commit prepares merge_partition_versions() for that.

See code comments added in this patch for a detailed description.

The design chosen in this patch requires adding a bit of information to
partition_version. Due to alignment, it results in a regrettable waste of 8
bytes per partition. If we want, we can recover that in the future by squeezing
the bit into some free bit in other fields, for example the highest or lowest
bits of one of the pointers in partition_version.

After this patch, MVCC should be prepared for replacing the atomic schema
upgrade() of cache/memtable entries with a gentle upgrade().
2023-05-04 03:35:15 +02:00
Michał Chojnowski
152b4cd4c2 mutation_partition_v2: handle schema upgrades in apply_monotonically()
To avoid reactor stalls during schema upgrades of memtable and cache entries,
we want to do them interruptibly, not atomically. To achieve that, we want
to reuse the existing gentle version merging mechanism. If we generalize
version merging algorithms to handle `mutation_partition`s with different
schemas, a schema upgrade will boil down simply to adding a new empty MVCC
version with the new schema.

In a previous patch, we already generalized the cursor to upgrade rows
on the fly when reading.
But we still have to generalize the other MVCC algorithm: the merging of
superfluous mutation_partition_v2 objects. This patch modifies the two-version
merging algorithm: apply_monotonically(). The next patch will update its caller,
merge_partition_versions(), to make use of the updated apply_monotonically()
properly.
2023-05-04 03:35:15 +02:00
Michał Chojnowski
0273101890 partition_version: remove the unused "from" argument in partition_entry::upgrade()
partition_entry now contains a reference to its schema, so it doesn't have to
be supplied by the caller anymore.
2023-05-04 02:37:30 +02:00
Michał Chojnowski
fc4b812e62 row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
The upcoming schema upgrade change will perform the schema upgrade by adding
a new version (with the new schema) to the partition entry.

To clean a multi-version entry, eviction is not enough - the versions have
to be merged and/or cleared first. drain() does just that.
2023-05-04 02:37:30 +02:00
Michał Chojnowski
db6a35e3a8 partition_version: handle multi-schema entries in partition_entry::squashed
An upcoming patch will enable multiple schemas within a single entry,
after the entry is upgraded.
partition_entry::squashed isn't prepared for that yet.
This patch prepares it.
2023-05-04 02:37:30 +02:00
Michał Chojnowski
5f68409934 partition_snapshot_row_cursor: handle multi-schema snapshots
To support gentle schema upgrades, each version has its own schema.
Currently this facility is unused, and the schema is equal for
all versions in a snapshot. But in upcoming commits this will change.

In the new design, after an entry upgrade, there will be a transitional
period where two versions with different schemas will coexist in a snapshot.
Eventually, these versions will be merged by mutation_cleaner into one
version with the current schema, but until then reads have to merge
multi-schema snapshots on the fly.

This commit implements support for per-version schemas in the cursor.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
f4e853b32d partition_version: prepare partition_snapshot::squashed() for multi-schema snapshots
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to squashed()
2023-05-04 02:37:29 +02:00
Michał Chojnowski
a2e3cf7463 partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to the static row.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
94e4dc3d8d partition_version: add a logalloc::region argument to partition_entry::upgrade()
The argument is currently unused, but will be further propagated to
add_version() in an upcoming patch.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
98dfe3355e memtable: propagate the region to memtable_entry::upgrade_schema()
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
effd1fe70f mutation_partition: add an upgrading variant of lazy_row::apply()
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
dce1b3e820 mutation_partition: add an upgrading variant of rows_entry::rows_entry
A helper which will be used in upcoming commits.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
2fe25a5aa2 mutation_partition: switch an apply() call to apply_monotonically() 2023-05-04 02:37:29 +02:00
Michał Chojnowski
a34c5e410f mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
333e65447c mutation_fragment: add an upgrading variant of clustering_row::apply()
It will be used during upcoming changes in partition_snapshot_row_cursor
to prepare it for multi-schema snapshots.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
b488e4d541 mutation_partition: add an upgrading variant of row::row
It will be used in upcoming commits.

A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor, and that is impossible
to do with constructors alone (at least without invoking a copy/move
constructor).
2023-05-04 02:37:29 +02:00
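The constructor-vs-factory point generalizes: a constructor cannot conditionally delegate to a sibling constructor at runtime and hand back the result, but a static factory can. A toy sketch with hypothetical row/schema types, not the actual mutation code:

```cpp
#include <string>

struct schema {
    std::string version;
};

struct row {
    std::string data;

    // Existing single-schema constructor: the "easy" case.
    explicit row(const std::string& d) : data(d) {}

    // Factory: picks the delegate at runtime and returns the result by value.
    static row make(const schema& from, const schema& to, const std::string& d) {
        if (from.version == to.version) {
            return row(d);                // delegate the equal-schema case
        }
        return row("upgraded:" + d);      // stand-in for the upgrading path
    }
};
```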
Michał Chojnowski
caaf0bd6bf partition_version: remove _schema from partition_entry::operator<<
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
f6e11c95e2 partition_version: remove the schema argument from partition_entry::read()
partition_entry now contains a reference to its schema, so it no longer
needs to be supplied by the caller.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
4e4ae43a84 memtable: remove _schema from memtable_entry
After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can be always recovered
from the latest version. Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
d999e46fa5 row_cache: remove _schema from cache_entry
After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can be always recovered
from the latest version. Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
d7d6449a8f partition_version: remove the _schema field from partition_snapshot
After adding a _schema field to each partition version,
the field in partition_snapshot is redundant. It can be always recovered
from the latest version. Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
1d01a4a168 partition_version: add a _schema field to partition_version
Currently, partition_version does not reference its schema.
All partition_versions reachable from an entry/snapshot have the same schema,
which is referenced in memtable_entry/cache_entry/partition_snapshot.

To enable gentle schema upgrades, we want to use the existing background
version merging mechanism. To achieve that, we will move the schema reference
into partition_version, and we will allow neighbouring MVCC versions to have
different schemas, and we will merge them on-the-fly during reads and
persistently during background version merges.
This way, an upgrade will boil down to adding a new empty version with
the new schema.

This patch adds the _schema field to partition_version and propagates
the schema pointer to it from the version's containers (entry/snapshot).
Subsequent patches will remove the schema references from the containers,
because they are now redundant.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
bc6a07a16a mutation_partition: change schema_ptr to schema& in mutation_partition::difference
Cosmetic change. See the preceding commit for details.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
a70c5704df mutation_partition: change schema_ptr to schema& in mutation_partition constructor
Cosmetic change. See the preceding commit for details.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
781514acfe mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
We don't have a convention for when to pass `schema_ptr` and when to pass
`const schema&` around.
In general, IMHO the natural convention for such a situation is to pass the
shared pointer if the callee might extend the pointee's lifetime,
and to pass a reference otherwise. But we convert between them willy-nilly
through shared_from_this().

While passing a reference to a function which actually expects a shared_ptr
can make sense (e.g. due to the fact that smart pointers can't be passed in
registers), the other way around is rather pointless.

This patch takes one occurrence of that and modifies the parameter to a reference.

Since enable_shared_from_this makes shared pointer parameters and reference
parameters interchangeable, this is a purely cosmetic change.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
021b345832 mutation_partition: add upgrading variants of row::apply()
They will be used in upcoming patches which introduce incremental schema upgrades.

Currently, these variants always copy cells during upgrade.
This could be optimized in the future by adding a way to move them instead.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
4214f8d0de partition_version: update the comment to apply_to_incomplete()
The comment refers to "other", but it means "pe". Fix that.

The patch also adds a bit of context to the mutation_partition jargon
("evictability" and "continuity"), by recalling how it relates
to the concrete abstractions: memtable and cache.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
49a02b08de mutation_partition_v2: clean up variants of apply()
Most variants of apply() and apply_monotonically() in mutation_partition_v2
are leftovers from mutation_partition, and are unused. Thus they only
add confusion and maintenance burden. Since we will be modifying
apply_monotonically() in upcoming patches, let's clean them up, lest
the variants become stale.

This patch removes all unused variants of apply() and apply_monotonically()
and "manually inlines" the variants which aren't used often enough to carry
their own weight.

In the end, we are left with a single apply_monotonically() and two convenience
apply() helpers.

The single apply_monotonically() accepts two schema arguments. This facility
is unimplemented and unused as of this patch - the two arguments are always
the same - but it will be implemented and used in later parts of the series.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
88a0871729 mutation_partition: remove apply_weak()
apply_weak is just an alias for apply(), and most of its variants
are dead code. Get rid of it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
38d9241c30 mutation_partition_v2: remove a misleading comment in apply_monotonically()
The comment suggests that the order of sentinel insertion is meaningful because
of the resulting eviction order. But the sentinels are added to the tracker
with the two-argument version of insert(), which inserts the second argument
into the LRU right before the (more recent) first argument.
Thus the eviction order of sentinels is decided explicitly, and it doesn't
rely on insertion order.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
42c7bc0391 row_cache_test: add schema changes to test_concurrent_reads_and_eviction
Reads with multiple schema versions take a different code path now,
so add schema changes to the test, to exercise these paths too.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
fb8ae3cca4 mutation_partition: fix mixed-schema apply()
In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.

Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.
2023-05-04 02:37:29 +02:00
Kefu Chai
113fb32019 compaction: disambiguate format_to()
we should always qualify `format_to` with its namespace. otherwise
we'd have the following failure when compiling with libstdc++ from GCC-13:

```
/home/kefu/dev/scylladb/compaction/table_state.hh:65:16: error: call to 'format_to' is ambiguous
        return format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
               ^~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13760
2023-05-03 20:33:18 +03:00
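The fix boils down to always naming the intended namespace at the call site. A toy reproduction of the ambiguity, with hypothetical namespaces standing in for fmt and libstdc++'s `<format>`:

```cpp
#include <string>

// Two libraries each provide format_to...
namespace libfmt {
inline std::string format_to(const std::string& s) { return "fmt:" + s; }
}
namespace libstd {
inline std::string format_to(const std::string& s) { return "std:" + s; }
}

// ...and once both are visible, an unqualified call no longer compiles.
using namespace libfmt;
using namespace libstd;

std::string describe(const std::string& s) {
    // return format_to(s);      // error: call to 'format_to' is ambiguous
    return libfmt::format_to(s); // qualified, as the commit recommends
}
```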
Pavel Emelyanov
0b18e3bff9 doc: Add a document describing how to configure S3 backend
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:38 +03:00
Pavel Emelyanov
e00d3188ed s3/test: Add ability to run boost test over real s3
Support the AWS_S3_EXTRA environment variable, which is split on ':', with the
resulting substrings set as the endpoint AWS configuration. This makes
it possible to run the boost S3 test over real S3.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:38 +03:00
Pavel Emelyanov
98b9c205bb s3/client: Sign requests if configured
If the endpoint config specifies an AWS key, secret and region, all the
S3 requests get signed. The signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
the x-amz-date, x-amz-content-sha256 and host headers in the signing list.
The content can be left unsigned when sent over HTTPS, which is what this
patch does.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:37 +03:00
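For context, a hedged sketch of one small piece of the signing flow: AWS Signature Version 4 requires a sorted, lowercased, semicolon-joined list of the signed header names, which for the three headers named above comes out as `host;x-amz-content-sha256;x-amz-date`. This covers only that step, not the actual client code or the signature computation:

```cpp
#include <map>
#include <string>

// Builds the SigV4 "SignedHeaders" value. Header names are assumed to be
// already lowercased; std::map iterates its keys in sorted order, which is
// exactly the ordering SigV4 requires.
std::string signed_header_list(const std::map<std::string, std::string>& headers) {
    std::string out;
    for (const auto& [name, value] : headers) {
        (void)value;  // only the names go into the signed-header list
        if (!out.empty()) {
            out += ';';
        }
        out += name;
    }
    return out;
}
```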
Pavel Emelyanov
3dd82485f6 s3/client: Add connection factory with DNS resolve and configurable HTTPS
Existing seastar's factories work on socket_address, but in S3 we have
endpoint name which's a DNS name in case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.

First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).

Since s3 client creation is no-waiting code currently, the above
initialization is spawned in afiber and before creating the connection
this fiber is waited upon.

This code probably deserves to live in seastar, but for now it can land
next to utils/s3/client.cc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:23:19 +03:00
Pavel Emelyanov
3bec5ea2ce s3/client: Keep server port on config
Currently the code temporarily assumes that the endpoint port is 9000.
This is what the tests' local minio is started with. This patch keeps the
port number on the endpoint config and makes the test get the port number
from the minio starting code via the environment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
85f06ca556 s3/client: Construct it with config
Similar to the previous patch -- extend the s3::client constructor to get
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are also still unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
caf9e357c8 s3/client: Construct it with sstring endpoint
Currently the client is constructed with a socket_address, which is prepared
by the caller from the endpoint string. That's not flexible enough,
because s3 client needs to know the original endpoint string for two
reasons.

First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.

So this patch just relaxes the client constructor to accept the endpoint
string and hard-codes the 9000 port. The latter is temporary (this is how
the local tests' minio is started); the next patch will make it configurable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
711514096a sstables: Make s3_storage with endpoint config
Continuation of the previous patch. The sstables::s3_storage gets the
endpoint config instance upon creation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
bd1e3c688f sstables_manager: Keep object storage configs onboard
The user sstables manager will need to provide endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep in-sync with its updates.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:43 +03:00
Pavel Emelyanov
2f6aa5b52e code: Introduce conf/object_storage.yaml configuration file
In order to access a real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart uploading is banned for unsigned
requests on S3. Also, signed requests over plain http require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep the payload unsigned.

To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name

The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so because there is an option to use S3
URLs without a region name in them, which we may want to use some time.

To keep the described configuration the proposed place is the
object_storage.yaml file with the format

endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...

When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-03 20:19:15 +03:00
Botond Dénes
4365f004c1 test/boost/mutation_test: add sanity test for mutation compaction validator
Checking that compacted fragments are forwarded to the validator intact.
2023-05-03 04:19:42 -04:00
Botond Dénes
60e1a23864 mutation/mutation_compactor: add validation level to compaction state query constructor
Allowing the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
2023-05-03 04:17:05 -04:00
Botond Dénes
be859db112 mutation/mutation_compactor: validate range tombstone change before it is moved
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so, however, the validation pass was moved after the
consume call, which moves the range tombstone change, so the validator
was passed a moved-from range tombstone. Fix this by moving the
validation to before the consume call.

Refs: #12575
2023-05-03 03:07:31 -04:00
Botond Dénes
48b9f31a08 Merge 'db, sstable: use generation_type instead of its value when appropriate' from Kefu Chai
in this series, we try to use `generation_type` as a proxy to hide its underlying type from the consumers. this paves the road to a UUID-based generation identifier, as by then we cannot assume the type of `value()` without asking `generation_type` first; better to leave all the formatting and conversions to `generation_type`. also, this series changes the "generation" column of the sstable registry table to "uuid" and converts its value to the original generation_type when necessary, which paves the road to a world with UUID-based generation ids.

Closes #13652

* github.com:scylladb/scylladb:
  db: use uuid for the generation column in sstable registry table
  db, sstable: add operator data_value() for generation_type
  db, sstable: print generation instead of its value
2023-05-03 09:04:54 +03:00
Nadav Har'El
b5f28e2b55 Merge 'Add S3 support to sstables::test_env' from Pavel Emelyanov
Currently there are only 2 tests for S3 -- the pure client test and the compound object_store test that launches scylla, creates an s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit tests for sstables functionality, part of which can run over S3 storage too.

This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.

fixes: #13015

Closes #13569

* github.com:scylladb/scylladb:
  test: Make resharding test run over s3 too
  test: Add lambda to fetch bloom filter size
  test: Tune resharding test use of sstable::test_env
  test: Make datafile test case run over s3 too
  test: Propagate storage options to table_for_test
  test: Add support for s3 storage_options in config
  test: Outline sstables::test_env::do_with_async()
  test: Keep storage options on sstable_test_env config
  sstables: Add and call storage::destroy()
  sstables: Coroutinize sstable::destroy()
2023-05-02 21:48:05 +03:00
Botond Dénes
a6387477fa mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter 2023-05-02 09:42:42 -04:00
Botond Dénes
d79db676b1 sstables/mx/reader: template data_consume_rows_context_m on the consumer
Sadly this means all accesses of base-class members have to be qualified
with `this->`.
2023-05-02 09:42:42 -04:00
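The `this->` requirement comes from C++ two-phase lookup: members of a base class that depends on a template parameter are not found by unqualified name lookup in the derived class. A minimal illustration with hypothetical names, not the actual reader classes:

```cpp
#include <string>

// Base class now depends on the template parameter, as in the commit.
template <typename Consumer>
struct context_base {
    Consumer consumer;
    int rows_consumed = 0;
};

template <typename Consumer>
struct context : context_base<Consumer> {
    void consume_row(const std::string& row) {
        // Unqualified `consumer` / `rows_consumed` would fail to compile:
        // the dependent base is not searched until instantiation, so the
        // members must be reached through `this->` (or a using-declaration).
        this->consumer.on_row(row);
        ++this->rows_consumed;
    }
};
```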
Botond Dénes
06fb48362a sstables/mx/reader: move row_processing_result to namespace scope
Reduce `data_consume_rows_context_m`'s dependency on the
`mp_row_consumer_m` symbol, preparing the way to make the former
templated on the consumer.
2023-05-02 09:42:42 -04:00
Botond Dénes
00362754a0 sstables/mx/reader: use data_consumer::proceed directly
Currently mp_row_consumer_m creates an alias to data_consumer::proceed.
Code in the rest of the file uses both the unqualified name and
mp_row_consumer_m::proceed. Remove the alias and just use
`data_consumer::proceed` directly everywhere, which leads to cleaner code.
2023-05-02 09:42:42 -04:00
Botond Dénes
388e7ddc03 sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic) 2023-05-02 09:42:42 -04:00
Botond Dénes
10fe76a0fe compaction/compaction: remove now unused scrub_validate_mode_validate_reader() 2023-05-02 09:42:42 -04:00
Botond Dénes
f6e5be472d compaction/compaction: move away from scrub_validate_mode_validate_reader()
Use sstable::validate() directly instead.
2023-05-02 09:42:42 -04:00
Botond Dénes
3e52f0681e tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
Use sstable::validate() directly instead. Since sstables have to be
validated individually, this means the operation loses the `--merge`
option.
2023-05-02 09:42:42 -04:00
Botond Dénes
393c42d4a9 test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
Test sstable::validate() instead. Also rename the unit test testing said
method from scrub_validate_mode_validate_reader_test to
sstable_validate_test to reflect the change.
At this point this test should probably be moved to
sstable_datafile_test.cc, but not in this patch.
Sadly this transition means we lose some test scenarios. Since now we
have to write the invalid data to sstables, we have to drop scenarios
which trigger errors on either the write or read path.
2023-05-02 09:42:42 -04:00
Botond Dénes
47959454eb sstables/sstable: add validate() method
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
  doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
  add a new more low-level validation method that only the more recent
  sstable versions (mx) will support.
2023-05-02 09:42:41 -04:00
Botond Dénes
7ba5c9cc6a compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
Currently said method creates a combined reader from all the sstables
passed to it then validates this combined reader.
Change it to validate each sstable (reader) individually in preparation
of the new validate method which can handle a single sstable at a time.
Note that this is not going to make much of an impact in practice; all callers
pass a single sstable to this method already.
2023-05-02 09:42:41 -04:00
Botond Dénes
e8c7ba98f1 compaction: scrub: use error messages from validator 2023-05-02 09:42:41 -04:00
Botond Dénes
d3749b810a mutation_fragment_stream_validator: produce error messages in low-level validator
Currently, error messages for validation errors are produced in several
places:
* the high-level validator (which is built on the low-level one)
* scrub compaction and validation compaction (scrub in validate mode)
* scylla-sstable's validate operation

We plan to introduce yet another place which would use the low-level
validator and hence would have to produce its own error messages. To cut
down all this duplication, centralize the production of error messages
in the low-level validator, which now returns a `validation_result`
object instead of bool from its validate methods. This object can be
converted to bool (so it's backwards compatible) and also contains an
error message if validation failed. In the next patches we will migrate
all users of the low level validator (be that direct or indirect) to use
the error messages provided in this result object instead of coming up
with one themselves.
2023-05-02 09:42:41 -04:00
Botond Dénes
72003dc35c readers: evictable_reader: skip progress guarantee when next pos is partition start
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.

Fixes: #13491

Closes #13563
2023-05-02 16:19:32 +03:00
Botond Dénes
7baa2d9cb2 Merge 'Cleanup range printing' from Benny Halevy
This mini-series cleans up printing of ranges in utils/to_string.hh

It generalizes the helper function to work on a std::ranges::range,
with some exceptions, and adds a helper for boost::transformed_range.

It also changes the internal interface by moving `join` to the utils namespace
and using std::string rather than seastar::sstring.

Additional unit tests were added to test/boost/json_test

Fixes #13146

Closes #13159

* github.com:scylladb/scylladb:
  utils: to_string: get rid of utils::join
  utils: to_string: get rid of to_string(std::initializer_list)
  utils: to_string: get rid of to_string(const Range&)
  utils: to_string: generalize range helpers
  test: add string_format_test
  utils: chunked_vector: add std::ranges::range ctor
2023-05-02 14:55:18 +03:00
Botond Dénes
d6ed5bbc7e Merge 'alternator: fix validation of numbers' magnitude and precision' from Nadav Har'El
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:

1. Users might get used to this "unofficial" feature and start relying
    on it, not allowing us to switch to a more efficient limited-precision
    implementation later.

2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
    number with 1.0 will result in a huge number, huge allocations and
    stalls. This is highly undesirable.

This series adds more tests in this area covering additional corner cases,
and then fixes the issue by adding the missing verification where it's
needed. After the series, all 12 tests in test/alternator/test_number.py now pass.

Fixes #6794

Closes #13743

* github.com:scylladb/scylladb:
  alternator: unit test for number magnitude and precision function
  alternator: add validation of numbers' magnitude and precision
  test/alternator: more tests for limits on number precision and magnitude
  test/alternator: reproducer for DoS in unlimited-precision addition
2023-05-02 14:33:36 +03:00
Kefu Chai
74e9e6dd1a db: use uuid for the generation column in sstable registry table
* change the "generation" column of sstable registry table from
  bigint to uuid
* add a helper to convert the UUID back to the original generation

in the long run, we encourage users to use uuid-based generation
identifiers. but in the transition period, both bigint-based and
uuid-based identifiers are used for the generation. so to cater to
both needs, we use a hackish way to store the integer into a UUID. to
differentiate the was-integer UUID from the genuine UUID, we
check the UUID's most_significant_bits. because we only support
serializing UUID v1, if the timestamp in the UUID is zero,
we assume the UUID was generated from an integer when converting it
back to a generation identifier.

also, please note, the only use case of using generation as a
column is the sstable_registry table, but since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column; the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we always store the generation as a UUID in this table: if the
generation's identifier is an int64_t, the value of the integer is
used as the least significant bits of the UUID.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-02 19:23:22 +08:00
Nadav Har'El
ed34f3b5e4 cql-pytest: translate Cassandra's test for LWT with collections
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cql-pytest
framework.

This test file checks various LWT conditional updates which involve
collections or UDTs (there is a separate test file for LWT conditional
updates which do not involve collections, which I haven't translated
yet).

The tests reproduce one known bug:

Refs #5855:  lwt: comparing NULL collection with empty value in IF
             condition yields incorrect results

And also uncovered three previously-unknown bugs:

Refs #13586: Add support for CONTAINS and CONTAINS KEY in LWT expressions
Refs #13624: Add support for UDT subfields in LWT expression
Refs #13657: Misformatted printout of column name in LWT error message

Beyond those bona-fide bugs, this test also demonstrates several places
where we intentionally deviated from Cassandra's behavior, forcing me
to comment out several checks. These deviations are known, and intentional,
but some of them are undocumented and it's worth listing here the ones
re-discovered by this test:

1. On a successful conditional write, Cassandra returns just True, Scylla
   also returns the old contents of the row. This difference is officially
   documented in docs/kb/lwt-differences.rst.
2. Scylla allows the test "l = [null]" or "s = {null}" with this weird
   null element (the result is false), whereas Cassandra prints an error.
3. Scylla allows "l[null]" or "m[null]" (resulting in null), Cassandra
   prints an error.
4. Scylla allows a negative list index, "l[-2]", resulting in null.
   Cassandra prints an error in this case.
5. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
   UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
6. Scylla allows "IN null" (the condition just fails), Cassandra prints
   an error in this case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13663
2023-05-02 11:53:58 +03:00
Pavel Emelyanov
d4a72de406 test: Make resharding test run over s3 too
Now that the test case and the lib/utils code it uses take a storage-agnostic
approach, it can be extended to run over S3 storage as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:46:23 +03:00
Pavel Emelyanov
2601c58278 test: Add lambda to fetch bloom filter size
The resharding test compares bloom filter sizes before and after reshard
runs. For that it gets the filter on-disk filename and stat()s it. That
won't work with S3 as it doesn't have accessible on-disk files.

Some time ago there existed the storage::get_stats() method, but now
it's gone. The new s3::client::get_object_stat() is coming, but it will
take time to switch to it. For now, generalize filter size fetching into
a local lambda. The next patch will add a stub in it for the S3 case, and once
the get_object_stat() is there we'll be able to smoothly start using it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:43:26 +03:00
Kefu Chai
135b4fd434 db: schema_tables: capture reference to temporary value by value
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.

this also silences warning from GCC-13:

```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                              ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fixes #13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13721
2023-05-02 11:42:43 +03:00
Pavel Emelyanov
76594bf72b test: Tune resharding test use of sstable::test_env
The test case in question spawns async context then makes the test_env
instance on the stack (and stopper for it too). There's helper for the
above steps, better to use them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:30:03 +03:00
Pavel Emelyanov
439c8770aa test: Make datafile test case run over s3 too
Most of the sstable_datafile test cases are capable of running with S3
storage, so this patch makes the simplest of them do it. Patching the
rest from this file is optional, because the cases mostly test how the
datafile data manipulations work without checking the file
manipulations. So even if making them all run over S3 is possible, it
would just increase the testing time without really exercising the storage driver.

So this patch makes one test case run over local and S3 storages, more
patches to update more test cases with files manipulations are yet to
come.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:30:03 +03:00
Pavel Emelyanov
f7df238545 test: Propagate storage options to table_for_test
Teach table_for_tests to use any storage options, not just local ones. For
now the only user that passes non-local options is sstables::test_env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:30:03 +03:00
Pavel Emelyanov
fa1de16f30 test: Add support for s3 storage_options in config
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is a helper that makes these options based
on the env left by the minio launcher from test.py.

Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:30:03 +03:00
Nadav Har'El
57ffbcbb22 cql3: fix spurious token names in syntax error messages
We have known for a long time (see issue #1703) that the quality of our
CQL "syntax error" messages leave a lot to be desired, especially when
compared to Cassandra. This patch doesn't yet bring us great error
messages with great context - doing this isn't easy and it appears that
Antlr3's C++ runtime isn't as good as the Java one in this regard -
but this patch at least fixes **garbage** printed in some error messages.

Specifically, when the parser can deduce that a specific token is missing,
it used to print

    line 1:83 missing ')' at '<missing '

After this patch we get rid of the meaningless string '<missing ':

    line 1:83 : Missing ')'

Also, when the parser deduced that a specific token was unneeded, it
used to print:

    line 1:83 extraneous input ')' expecting <invalid>

Now we got rid of this silly "<invalid>" and write just:

    line 1:83 : Unexpected ')'

Refs #1703. I haven't yet marked that issue "fixed" because I think a
complete fix would also require printing the entire misparsed line and the
point of the parse failure. Scylla still prints a generic "Syntax Error"
in most cases now, and although the character number (83 in the above
example) can help, it's much more useful to see the actual failed
statement and where character 83 is.

Unfortunately some tests enshrine buggy error messages and had to be
fixed. Other tests enshrined strange text for a generic unexplained
error message, which used to say "  : syntax error..." (note the two
spaces and the ellipsis) and after this patch is " : Syntax error". So
these tests are changed. Another message, "no viable alternative at
input" is deliberately kept unchanged by this patch so as not to break
many more tests which enshrined this message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13731
2023-05-02 11:23:58 +03:00
Pavel Emelyanov
1e03733e8c test: Outline sstables::test_env::do_with_async()
It's growing larger; better to keep it in the .cc file

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:15:45 +03:00
Pavel Emelyanov
f223f5357d test: Keep storage options on sstable_test_env config
So that it can be set to s3 by the test case on demand. Default is
local storage, which uses the env's tempdir or an explicit path argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:15:45 +03:00
Pavel Emelyanov
81a1416ebf sstables: Add and call storage::destroy()
The s3_storage leaks the client when the sstable gets destroyed. So far this
went unnoticed, but a debug-mode unit test run over minio caught it. So
here's the fix.

When sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In the case of s3 storage, the cleanup is closing the
on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in a co_await-able manner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:15:44 +03:00
Avi Kivity
c0eb0d57bc install-dependencies.sh: don't use fgrep
fgrep says:

    fgrep: warning: fgrep is obsolescent; using grep -F

follow its advice.

Closes #13729
2023-05-02 11:15:40 +03:00
Pavel Emelyanov
3e0c3346a8 sstables: Coroutinize sstable::destroy()
To simplify patching by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-02 11:15:11 +03:00
Nadav Har'El
e74f69bb56 alternator: unit test for number magnitude and precision function
In the previous patch we added a limit in Alternator for the magnitude
and precision of numbers, based on a function get_magnitude_and_precision
whose implementation was, unfortunately, rather elaborate and delicate.

Although we did add in the previous patches some end-to-end tests which
confirmed that the decision made based on this function, to accept or
reject numbers, was correct in a few cases, such an elaborate
function deserves a separate unit test for checking just that function
in isolation. In fact, this unit tests uncovered some bugs in the first
implementation of get_magnitude_and_precision() which the other tests
missed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-05-02 11:04:05 +03:00
Nadav Har'El
3c0603558c alternator: add validation of numbers' magnitude and precision
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digitst are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:

1. Users might get used to this "unofficial" feature and start relying
   on it, not allowing us to switch to a more efficient limited-precision
   implementation later.

2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
   number with 1.0 will result in a huge number, huge allocations and
   stalls. This is highly undesirable.

After this patch, all tests in test/alternator/test_number.py now
pass. The various failing tests which verify magnitude and precision
limitations in different places (key attributes, non-key attributes,
and arithmetic expressions) now pass - so their "xfail" tags are removed.

Fixes #6794

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-05-02 11:04:05 +03:00
Nadav Har'El
0eccc49308 test/alternator: more tests for limits on number precision and magnitude
We already have xfailing tests for issue #6794 - the missing checks on
precision and magnitudes of numbers in Alternator - but this patch adds
checks for additional corner cases. In particular we check the case that
numbers are used in a *key* column, which goes to a different code path
than numbers used in non-key columns, so it's worth testing as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-05-02 11:04:05 +03:00
Nadav Har'El
56b8b9d670 test/alternator: reproducer for DoS in unlimited-precision addition
As already noted in issue #6794, whereas DynamoDB limits the magnitude
of numbers to between 10^-130 and 10^125, Scylla does not. In this patch
we add yet another test for this problem, but unlike previous tests
which just showed too much magnitude being allowed, which always sounded
like a benign problem, the test in this patch shows that this "feature"
can be used to DoS Scylla: a user can send a short request that
causes arbitrarily-large allocations, stalls and CPU usage.

The test is currently marked "skip" because it can cause Scylla to
take a very long time and/or run out of memory. It passes on DynamoDB
because the excessive magnitude is simply not allowed there.

Refs #6794

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-05-02 11:03:51 +03:00
Benny Halevy
959a740dac utils: to_string: get rid of utils::join
Use `fmt::format("{}", fmt::join(...))` instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:59:58 +03:00
Benny Halevy
e6bcb1c8df utils: to_string: get rid of to_string(std::initializer_list)
It's unused.

Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).

Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.

This doesn't break anything since to_string(initializer_list)
wasn't used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
ba883859c7 utils: to_string: get rid of to_string(const Range&)
Use fmt::to_string instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
15c9f0f0df utils: to_string: generalize range helpers
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kind of containers.

Modernize the implementation using templates based
on std::ranges::range and using fmt::join.

Extend unit test for formatting different types of ranges,
boost::transformed ranges, deque.

Fixes #13146

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
59e89efca6 test: add string_format_test
Test string formatting before cleaning up
utils/to_string.hh in the next patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Benny Halevy
45153b58bd utils: chunked_vector: add std::ranges::range ctor
To be used in next patch for constructing
chunked_vector from an initializer_list.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-02 10:48:46 +03:00
Wojciech Mitros
b18c21147f cql: check if the keyspace is system when altering permissions
Currently, when altering permissions on a functions resource, we
only check if it's a builtin function and not if it's all functions
in the "system" keyspace, which contains all builtin functions.
This patch adds a check of whether the function resource keyspace
is "system". This check actually covers both "single function"
and "all functions in keyspace" cases, so the additional check
for single functions is removed.

Closes #13596
2023-05-02 10:13:59 +03:00
Botond Dénes
022465d673 Merge 'Tone down offstrategy log message' from Benny Halevy
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do.  In this case we still print
to the log lots of info-level messages and call
`run_offstrategy_compaction`, which wastes more cpu cycles
on learning that it has nothing to do.

This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.

Fixes #13466

Also, add a group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.

Fixes #13467

Closes #13520

* github.com:scylladb/scylladb:
  compaction_manager: print compaction_group id
  compaction_group, table_state: add group_id member
  compaction_manager: offstrategy compaction: skip compaction if no candidates are found
2023-05-02 08:05:18 +03:00
Avi Kivity
9c37fdaca3 Revert "dht: incremental_owned_ranges_checker: use lower_bound()"
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).

A comment is added so we don't repeat the mistake.

Closes #13704
2023-05-02 08:01:44 +03:00
Benny Halevy
707bd17858 everywhere: optimize calls to make_flat_mutation_reader_from_mutations_v2 with single mutation
No point in going through the vector<mutation> entry-point
just to discover in run time that it was called
with a single-element vector, when we know that
in advance.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13733
2023-05-02 07:58:34 +03:00
Avi Kivity
72c12a1ab2 Merge 'cdc, db_clock: specialize fmt::formatter<{db_clock::time_point, generation_id}>' from Kefu Chai
this is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `cdc::generation_id` and `db_clock::time_point` without the help of `operator<<`.

the formatter of `cdc::generation_id` uses that of `db_clock::time_point` , so these two commits are posted together in a single pull request.

the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting.

Refs #13245

Closes #13703

* github.com:scylladb/scylladb:
  db_clock: specialize fmt::formatter<db_clock::time_point>
  cdc: generation: specialize fmt::formatter<generation_id>
2023-05-01 22:56:33 +03:00
Avi Kivity
7b7d9bcb14 Merge 'Do not access owned_ranges_ptr across shards in update_sstable_cleanup_state' from Benny Halevy
This series fixes a few issues caused by f1bbf705f9
(f1bbf705f9):

- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
  - Fixes #13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
  - allow the filtering reader now used during resharding to process tokens owned by other shards

Closes #13635

* github.com:scylladb/scylladb:
  compaction: make_partition_filter: do not assert shard ownership
  distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
  table, compaction_manager: prevent cross shard access to owned_ranges_ptr
2023-05-01 22:51:00 +03:00
Avi Kivity
c9dab3ac81 Merge 'treewide: fix warnings from GCC-13' from Kefu Chai
this series silences the warnings from GCC 13. some of these changes are considered as critical fixes, and posted separately.

see also #13243

Closes #13723

* github.com:scylladb/scylladb:
  cdc: initialize an optional using its value type
  compaction: disambiguate type name
  db: schema_tables: drop unused variable
  reader_concurrency_semaphore: fix signed/unsigned comparision
  locator: topology: disambiguate type names
  raft: disambiguate promise name in raft::awaited_conf_changes
2023-05-01 22:48:00 +03:00
Kefu Chai
37f1beade5 s3/client: do not allocate potentially big object on stack
when compiling using GCC-13, it warns that:

```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
  224 | sstring parse_multipart_upload_id(sstring& body) {
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~
```

so it turns out that `rapidxml::xml_document<>` could be very large,
let's allocate it on heap instead of on the stack to address this issue.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13722
2023-05-01 22:46:18 +03:00
Kefu Chai
108f20c684 cql3: capture reference to temporary value by value
`data_dictionary::database::find_keyspace()` returns a temporary
object, and `data_dictionary::keyspace::user_types()` returns a
reference pointing to a member of this temporary object. so we
cannot use the reference after the expression is evaluated. in
this change, we capture the return value of `find_keyspace()` using a
universal reference, and keep the return value of `user_types()`
with a reference, to ensure that we can use it later.

this change silences the warning from GCC-13, like:

```
/home/kefu/dev/scylladb/cql3/statements/authorization_statement.cc:68:21: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
   68 |         const auto& utm = qp.db().find_keyspace(*keyspace).user_types();
      |                     ^~~
```

Fixes #13725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13726
2023-05-01 22:41:41 +03:00
Kefu Chai
b76877fd99 transport: capture reference to temp value by value
`current_scheduling_group()` returns a temporary value, and `name()`
returns a reference, so we cannot capture the return value by reference
and use the reference after this expression is evaluated. this would
cause undefined behavior. so let's just capture it by value.

this change also silences the following warning from GCC-13:

```
/home/kefu/dev/scylladb/transport/server.cc:204:11: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
  204 |     auto& cur_sg_name = current_scheduling_group().name();
      |           ^~~~~~~~~~~
/home/kefu/dev/scylladb/transport/server.cc:204:56: note: the temporary was destroyed at the end of the full expression ‘seastar::current_scheduling_group().seastar::scheduling_group::name()’
  204 |     auto& cur_sg_name = current_scheduling_group().name();
      |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fixes #13719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13724
2023-05-01 22:40:36 +03:00
Kefu Chai
0a3a254284 cql3: do not capture reference to temporary value
`data_dictionary::database::find_column_family()` returns a temporary value,
and `data_dictionary::table::get_index_manager()` returns a reference into
this temporary value, so we cannot capture this reference and use it after
the expression is evaluated. in this change, we keep the return value
of `find_column_family()` by value, to extend the lifetime of the return
value of `get_index_manager()`.

this should address the warning from GCC-13, like:

```
/home/kefu/dev/scylladb/cql3/restrictions/statement_restrictions.cc:519:15: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
  519 |         auto& sim = db.find_column_family(_schema).get_index_manager();
      |               ^~~
```

Fixes #13727
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13728
2023-05-01 22:39:48 +03:00
Nadav Har'El
1cefb662cd Merge 'cql3/expr: remove expr::token' from Jan Ciołek
Let's remove `expr::token` and replace all of its functionality with `expr::function_call`.

`expr::token` is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`, this will be internally represented as an expression which uses `expr::token` to represent the `token(p1, p2)` part.

The situation with `expr::token` is a bit complicated.
On the one hand it's supposed to represent the partition token, but sometimes it's also assumed that it can represent a generic call to the `token()` function; for example `token(1, 2, 3)` could be a `function_call`, but it could also be `expr::token`.

The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this allowing `token(1, 2, 3)` to be represented as `expr::token` is dangerous - the query planning might think that it is `token(p1, p2, p3)` and plan the query based on this, which would be wrong.

Currently `expr::token` is created only in one specific case.
When the parser detects that the user typed in a restriction which has a call to `token` on the LHS it generates `expr::token`.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token, it stays a `function_call`. During preparation there is no check to see if a `function_call` to `token` could be turned into `expr::token`. This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented as `expr::token` and the query planner handles that, but sometimes it might be represented as `function_call`, which the query planner doesn't handle.

There is also a problem because there's a lot of code duplication between a `function_call` and `expr::token`.
All of the evaluation and preparation is the same for `expr::token` as it's for a `function_call` to the token function.
Currently it's impossible to evaluate `expr::token` and preparation has some flaws, but implementing it would basically consist of copy-pasting the corresponding code from token `function_call`.

One more aspect is multi-table queries.
With `expr::token` we turn a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple tables? The schema is different, so something that is represented as `expr::token` for one schema would be represented as `function_call` in the context of a different schema.
Translating expressions to different tables would require careful manipulation to convert `expr::token` to `function_call` and vice versa. This could cause trouble for index queries.

Overall I think it would be best to remove `expr::token`.

Although having a clear marker for the partition token is sometimes nice for query planning, in my opinion the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things, having two separate representations of the same thing without clear boundaries between them causes trouble.

Instead of having both `expr::token` and `function_call` we can just have the `function_call` and check if it represents a partition token when needed.

Refs: #12906
Refs: #12677
Closes: #12905

Closes #13480

* github.com:scylladb/scylladb:
  cql3: remove expr::token
  cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
  cql3: keep a schema inside the visitor for extract_partition_range
  cql3/prepare_expr: make get_lhs_receiver handle any function_call
  cql3/expr: properly print token function_call
  expr_test: use unresolved_identifier when creating token
  cql3/expr: split possible_lhs_values into column and token variants
  cql3/expr: fix error message in possible_lhs_values
  cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
  cql3/expr: add a schema argument to expr::replace_token
  cql3/expr: add a comment for expr::has_partition_token
  cql3/expr: add a schema argument to expr::has_token
  cql3: use statement_restrictions::has_token_restrictions() wherever possible
  cql3/expr: add expr::is_partition_token_for_schema
  cql3/expr: add expr::is_token_function
  cql3/expr: implement preparing function_call without a receiver
  cql3/functions: make column family argument optional in functions::get
  cql3/expr: make it possible to prepare expr::constant
  cql3/expr: implement test_assignment for column_value
  cql3/expr: implement test_assignment for expr::constant
2023-04-30 15:31:35 +03:00
Tomasz Grabiec
aba5667760 Merge 'raft topology: refactor the coordinator to allow non-node specific topology transitions' from Kamil Braun
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.

This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.

Closes #13690

* github.com:scylladb/scylladb:
  raft topology: rename `update_replica_state` -> `update_topology_state`
  raft topology: remove `transition_state::normal`
  raft topology: switch on `transition_state` first
  raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
  raft topology: parse replaced node in `exec_global_command`
  raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
  storage_service: extract raft topology coordinator fiber to separate class
  raft topology: rename `replication_state` to `transition_state`
  raft topology: make `replication_state` a topology-global state
2023-04-30 10:55:24 +02:00
Kefu Chai
e333bcc2da cdc: initialize an optional using its value type
as this syntax is not supported by the standard, it seems clang
just silently constructs the value with the initializer list and
calls the operator=, but GCC complains:

```
/home/kefu/dev/scylladb/cdc/split.cc:392:54: error: converting to ‘std::optional<partition_deletion>’ from initializer list would use explicit constructor ‘constexpr std::optional<_Tp>::optional(_Up&&) [with _Up = const tombstone&; typename std::enable_if<__and_v<std::__not_<std::is_same<std::optional<_Tp>, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::__not_<std::is_same<std::in_place_t, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::is_constructible<_Tp, _Up>, std::__not_<std::is_convertible<_Iter, _Iterator> > >, bool>::type <anonymous> = false; _Tp = partition_deletion]’
  392 |         _result[t.timestamp].partition_deletions = {t};
      |                                                      ^
```

to silence the error, and to be more standard-compliant,
let's use emplace() instead.
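A minimal sketch of the pattern (the `tombstone`/`partition_deletion` types here are simplified stand-ins, not the real cdc types): assigning a braced list would invoke an explicit constructor, which the standard forbids, while `emplace()` constructs in place and is fine:

```cpp
#include <cassert>
#include <optional>

struct tombstone { long timestamp; };

struct partition_deletion {
    long timestamp;
    explicit partition_deletion(const tombstone& t) : timestamp(t.timestamp) {}
};

std::optional<partition_deletion> make_deletion(const tombstone& t) {
    std::optional<partition_deletion> result;
    // result = {t};   // ill-formed: would go through the explicit constructor
    result.emplace(t); // constructs in place; explicit constructors are OK here
    return result;
}
```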

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 19:34:12 +08:00
Jan Ciolek
be8ef63bf5 cql3: remove expr::token
Let's remove expr::token and replace all of its functionality with expr::function_call.

expr::token is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`,
this will be internally represented as an expression which uses
expr::token to represent the `token(p1, p2)` part.

The situation with expr::token is a bit complicated.
On the one hand it's supposed to represent the partition token,
but sometimes it's also assumed that it can represent a generic
call to the token() function, for example `token(1, 2, 3)` could
be a function_call, but it could also be expr::token.

The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this allowing `token(1, 2, 3)` to be represented
as expr::token is dangerous - the query planning
might think that it is `token(p1, p2, p3)` and plan the query
based on this, which would be wrong.

Currently expr::token is created only in one specific case.
When the parser detects that the user typed in a restriction
which has a call to `token` on the LHS it generates expr::token.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token,
it stays a `function_call`. During preparation there is no check
to see if a `function_call` to `token` could be turned into `expr::token`.
This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented
as `expr::token` and the query planner handles that, but sometimes it might
be represented as `function_call`, which the query planner doesn't handle.

There is also a problem because there's a lot of duplication
between a `function_call` and `expr::token`. All of the evaluation
and preparation is the same for `expr::token` as it's for a `function_call`
to the token function. Currently it's impossible to evaluate `expr::token`
and preparation has some flaws, but implementing it would basically
consist of copy-pasting the corresponding code from token `function_call`.

One more aspect is multi-table queries. With `expr::token` we turn
a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple
tables? The schema is different, so something that is represented
as `expr::token` for one schema would be represented as `function_call`
in the context of a different schema.
Translating expressions to different tables would require careful
manipulation to convert `expr::token` to `function_call` and vice versa.
This could cause trouble for index queries.

Overall I think it would be best to remove expr::token.

Although having a clear marker for the partition token
is sometimes nice for query planning, in my opinion
the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things,
having two separate representations of the same thing
without clear boundaries between them causes trouble.

Instead of having expr::token and function_call we can
just have the function_call and check if it represents
a partition token when needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:11:31 +02:00
Jan Ciolek
6e0ae59c5a cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:11:31 +02:00
Jan Ciolek
551135e83f cql3: keep a schema inside the visitor for extract_partition_range
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:11:30 +02:00
Jan Ciolek
16bc1c930f cql3/prepare_expr: make get_lhs_receiver handle any function_call
get_lhs_receiver looks at the prepared LHS of a binary operator
and creates a receiver corresponding to this LHS expression.
This receiver is later used to prepare the RHS of the binary operator.

It's able to handle a few expression types - the ones that are currently
allowed to be on the LHS.
One of those types is `expr::token`, to handle restrictions like `token(p1, p2) = 3`.

Soon token will be replaced by `expr::function_call`, so the function will need
to handle `function_calls` to the token function.

Although we expect there to be only calls to the `token()` function,
as other functions are not allowed on the LHS, it can be made generic
over all function calls, which will help in future grammar extensions.

The function calls that it can currently get are calls to the token function,
but they're not validated yet, so it could also be something like `token(pk, pk, ck)`.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:53 +02:00
Jan Ciolek
d3a958490e cql3/expr: properly print token function_call
Printing for function_call is a bit strange.
When printing an unprepared function it prints
the name and then the arguments.

For a prepared function it prints <anonymous function>
as the name and then the arguments.
Prepared functions have a name() method, but printing
doesn't use it, maybe not all functions have a valid name(?).

The token() function will soon be represented as a function_call
and it should be printable in a user-readable way.
Let's add an if which prints `token(arg1, arg2)`
instead of `<anonymous function>(arg1, arg2)` when printing
a call to the token function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:53 +02:00
Jan Ciolek
289ca51ee5 expr_test: use unresolved_identifier when creating token
One test for expr::token uses raw column identifier
in the test.

Let's change it to unresolved_identifier, which is
a standard representation of unresolved column
names in expressions.

Once expr::token is removed it will be possible
to create a function_call with unresolved_identifiers
as arguments.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:53 +02:00
Jan Ciolek
096efc2f38 cql3/expr: split possible_lhs_values into column and token variants
The possible_lhs_values takes an expression and a column
and finds all possible values for the column that make
the expression true.

Apart from finding column values it's also capable of finding
all matching values for the partition key token.
When a nullptr column is passed, possible_lhs_values switches
into token values mode and finds all values for the token.

This interface isn't ideal.
It's confusing to pass a nullptr column when one wants to
find values for the token. It would be better to have a flag,
or just have a separate function.

Additionally in the future expr::token will be removed
and we will use expr::is_partition_token_for_schema
to find all occurrences of the partition token.
expr::is_partition_token_for_schema takes a schema
as an argument, which possible_lhs_values doesn't have,
so it would have to be extended to get the schema from
somewhere.

To fix these two problems let's split possible_lhs_values
into two functions - one that finds possible values for a column,
which doesn't require a schema, and one that finds possible values
for the partition token and requires a schema:

value_set possible_column_values(const column_definition* col, const expression& e, const query_options& options);
value_set possible_partition_token_values(const expression& e, const query_options& options, const schema& table_schema);

This will make the interface cleaner and enable smooth transition
once expr::token is removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:53 +02:00
Jan Ciolek
f2e5f654f2 cql3/expr: fix error message in possible_lhs_values
In possible_lhs_values there was an error message talking
about is_satisfied_by. It looks like a badly
copy-pasted message.

Change it to possible_lhs_values as it should be.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:52 +02:00
Avi Kivity
dc3c28516d cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
It calls evaluate() internally anyway.

There's a scary if () in there talking about tokens, but everything
appears to work.
2023-04-29 13:04:52 +02:00
Jan Ciolek
ad5c931102 cql3/expr: add a schema argument to expr::replace_token
Just like has_token, replace_token will use
expr::is_partition_token_for_schema to find all instances
of the partition token to replace.

Let's prepare for this change by adding a schema argument
to the function before making the big change.

It's unused at the moment, but having a separate commit
should make it easier to review.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:52 +02:00
Jan Ciolek
d50db32d14 cql3/expr: add a comment for expr::has_partition_token
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:52 +02:00
Jan Ciolek
18879aad6f cql3/expr: add a schema argument to expr::has_token
In the future expr::token will be removed and checking
whether there is a partition token inside an expression
will be done using expr::is_partition_token_for_schema.

This function takes a schema as an argument,
so all functions that will call it also need
to get the schema from somewhere.

Right now it's an unused argument, but in the future
it will be used. Adding it in a separate commit
makes it easier to review.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:52 +02:00
Jan Ciolek
90b3b85bd0 cql3: use statement_restrictions::has_token_restrictions() wherever possible
The statement_restrictions class has a method called has_token_restriction().
This method checks whether the partition key restrictions contain expr::token.

Let's use this function in all applicable places instead of manually calling has_token().

In the future has_token() will have an additional schema argument,
so eliminating calls to has_token() will simplify the transition.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:52 +02:00
Jan Ciolek
7af010095e cql3/expr: add expr::is_partition_token_for_schema
Add a function to check whether the expression
represents a partition token - that is a call
to the token function with consecutive partition
key columns as the arguments.

For example for `token(p1, p2, p3)` this function
would return `true`, but for `token(1, 2, 3)` or `token(p3, p2, p1)`
the result would be `false`.

The function has a schema argument because a schema is required
to get the list of partition columns that should be passed as
arguments to token().

Maybe it would be possible to infer the schema from the information
given earlier during prepare_expression, but it would be complicated
and a bit dangerous to do this. Sometimes we operate on multiple tables
and the schema is needed to differentiate between them - a token() call
can represent the base table's partition token, but for an index table
this is just a normal function call, not the partition token.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:51 +02:00
Jan Ciolek
694d9298aa cql3/expr: add expr::is_token_function
Add a function that can be used to check
whether a given expression represents a call
to the token() function.

Note that a call to token() doesn't mean
that the expression represents a partition
token - it could be something like token(1, 2, 3),
just a normal function_call.

The code for checking has been taken from functions::get.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:51 +02:00
Jan Ciolek
f7cac10fe0 cql3/expr: implement preparing function_call without a receiver
Currently trying to do prepare_expression(function_call)
with a nullptr receiver fails.

It should be possible to prepare function calls without
a known receiver.

When the user types in: `token(1, 2, 3)`
the code should be able to figure out that
they are looking for a function with name `token`,
which takes 3 integers as arguments.

In order to support that we need to prepare
all arguments that can be prepared before
attempting to find a function.

Prepared expressions have a known type,
which helps to find the right function
for the given arguments.

Additionally the current code for finding
a function requires all arguments to be
assignment_testable, which requires to prepare
some expression types, e.g column_values.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:04:51 +02:00
Jan Ciolek
15ed83adbc cql3/functions: make column family argument optional in functions::get
The method `functions::get` is used to get the `functions::function` object
of the CQL function called using `expr::function_call`.

Until now `functions::get` required the caller to pass both the keyspace
and the column family.

The keyspace argument is always needed, as every CQL function belongs
to some keyspace, but the column family isn't used in most cases.

The only case where having the column family is really required
is the `token()` function. Each variant of the `token()` function
belongs to some table, as the arguments to the function are the
consecutive partition key columns.

Let's make the column family argument optional. In most cases
the function will work without information about column family.
In the case of the `token()` function there will be a check
that throws an exception if the argument is nullopt.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-29 13:00:01 +02:00
Kefu Chai
0232115eaa compaction: disambiguate type name
otherwise GCC-13 complains:

```
/home/kefu/dev/scylladb/compaction/compaction_state.hh:38:22: error: declaration of ‘compaction::owned_ranges_ptr compaction::compaction_state::owned_ranges_ptr’ changes meaning of ‘owned_ranges_ptr’ [-Wchanges-meaning]
   38 |     owned_ranges_ptr owned_ranges_ptr;
      |                      ^~~~~~~~~~~~~~~~
```
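A sketch of the pattern GCC 13 rejects; the names below mirror the warning, and the fix shown (renaming the member) is one way to disambiguate — qualifying the type name is another (which one the actual patch chose is not shown here):

```cpp
#include <cassert>
#include <memory>
#include <vector>

namespace compaction {

// Stand-in for the real alias; the actual pointee type doesn't matter here.
using owned_ranges_ptr = std::shared_ptr<std::vector<int>>;

struct compaction_state {
    // owned_ranges_ptr owned_ranges_ptr;  // GCC 13: declaration changes the
    //                                     // meaning of 'owned_ranges_ptr'
    owned_ranges_ptr owned_ranges;         // distinct member name: unambiguous
};

} // namespace compaction
```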

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 17:02:25 +08:00
Kefu Chai
56511a42d0 db: schema_tables: drop unused variable
this also silence the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
 1489 |     auto ts = db_clock::now();
      |          ^~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 17:02:25 +08:00
Kefu Chai
48387a5a9a reader_concurrency_semaphore: fix signed/unsigned comparison
a signed/unsigned comparison can give wrong results, because the signed operand is converted to unsigned. GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed for greater-or-equal.

```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
  931 |     if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 17:02:25 +08:00
Kefu Chai
6d8188ad70 locator: topology: disambiguate type names
otherwise GCC-13 complains:
```
/home/kefu/dev/scylladb/locator/topology.hh:70:21: error: declaration of ‘const locator::topology* locator::node::topology() const’ changes meaning of ‘topology’ [-Wchanges-meaning]
   70 |     const topology* topology() const noexcept {
      |                     ^~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 17:02:25 +08:00
Kefu Chai
f80f638bb9 raft: disambiguate promise name in raft::awaited_conf_changes
otherwise GCC 13 complains that

```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
   42 |     promise<> promise;
      |               ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
   42 |     promise<> promise;
      |     ^~~~~~~~~
```
see also cd4af0c722

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-29 17:02:25 +08:00
Botond Dénes
f527b28174 Merge 'treewide: reenable -Wmissing-braces' from Kefu Chai
this change silences the warning of `-Wmissing-braces` from
clang. in general, an object without a user-declared constructor can be
initialized with braces; this is called aggregate initialization.
the standard allows the braces around nested subobjects to be elided,
but clang still suggests writing them out explicitly, warning like

```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
                options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
                                            ^~~~~~~~~~~~~~~~~~~~~~~~~
                                            {                        }
```

in this change, we add braces around the affected subobjects.

also, we take the opportunity to use structured binding to simplify the
related code.
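A minimal example of the warning and the fix (a generic aggregate, not the actual Scylla code): brace elision is valid, but clang's -Wmissing-braces asks for the inner braces to be spelled out:

```cpp
#include <cassert>

struct point { int x; int y; };

int sum_of_xs() {
    // point pts[2] = {1, 2, 3, 4};     // valid (brace elision), but clang
    //                                  // warns under -Wmissing-braces
    point pts[2] = {{1, 2}, {3, 4}};    // explicit braces per subobject
    return pts[0].x + pts[1].x;
}
```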

Closes #13705

* github.com:scylladb/scylladb:
  build: reenable -Wmissing-braces
  treewide: add braces around subobject
  cql3/stats: use zero-initialization
2023-04-28 16:00:14 +03:00
Kefu Chai
43e9910fa0 utils/chunked_managed_vector: use operator<=> when appropriate
instead of crafting 4 operators manually, just delegate it to <=>.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13698
2023-04-28 15:59:08 +03:00
Botond Dénes
0b5d9d94fa Merge 'Kill sstable::storage::get_stats() to help S3 provide accurate SSTable component stats' from Raphael "Raph" Carvalho
S3 wasn't providing filter size and accurate size for all SSTable components on disk.

First, filter size is provided by taking advantage of the fact that its in-memory representation is roughly the same as the on-disk one.

Second, the size of all components is provided by piggybacking on the sstable parser and writer, so there is no longer a need for a separate additional step after Scylla has either parsed or written all components.

Finally, sstable::storage::get_stats() is killed, so the burden is no longer pushed on the storage type implementation.

Refs #13649.

Closes #13682

* github.com:scylladb/scylladb:
  test: Verify correctness of sstable::bytes_on_disk()
  sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
  sstable: restore indentation in read_digest() and read_checksum()
  sstable: make all parsing of simple components go through do_read_simple()
  sstable: Add missing pragma once to random_access_reader.hh
  sstable: make all writing of simple components go through do_write_simple()
  test: sstable_utils: reuse set_values()
  sstable: Restore indentation in read_simple()
  sstable: Coroutinize read_simple()
  sstable: Use filter memory footprint in filter_size()
2023-04-28 15:58:39 +03:00
Kefu Chai
ba8402067f db, sstable: add operator data_value() for generation_type
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the way to
adding UUID based generation id to `generation_type`. as by then, we
will have both UUID based and integer based `generation_type`, so
`generation_type::value()` will not be able to represent its value
anymore. and this method will be replaced by `operator data_value()` in
this use case.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 20:39:12 +08:00
Kefu Chai
ae9aa9c4bd db, sstable: print generation instead of its value
this change prepares for the change to use `variant<UUID, int64_t>`
as the value of `generation_type`. as after this change, the "value"
of a generation would be a UUID or an integer, and we don't want to
expose the variant in generation's public interface. so the `value()`
method would be changed or removed by then.

this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse `generation_type` formatter when appropriate.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 20:39:12 +08:00
Jan Ciolek
b3d05f3525 cql3/expr: make it possible to prepare expr::constant
try_prepare_expression(constant) used to throw an error
when trying to prepare expr::constant.

It would be useful to be able to do this
and it's not hard to implement.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-28 14:34:59 +02:00
Jan Ciolek
bf36cde29a cql3/expr: implement test_assignment for column_value
Make it possible to do test_assignment for column_values.
It's implemented using the generic expression assignment
testing function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-28 14:34:59 +02:00
Jan Ciolek
fd174bda60 cql3/expr: implement test_assignment for expr::constant
test_assignment checks whether a value of some type
can be assigned to a value of different type.

There is no implementation of test_assignment
for expr::constant, but I would like to have one.

Currently there is a custom implementation
of test_assignment for each type of expression,
but generally each of them boils down to checking:
```
type1->is_value_compatible_with(type2)
```

Instead of implementing another type-specific function
I added expression_test_assignment and used it to
implement test_assignment for constant.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-04-28 14:34:56 +02:00
Kefu Chai
a34e417069 streaming: remove unused operator==
since this operator is used nowhere, let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13697
2023-04-28 12:39:17 +03:00
Kefu Chai
662f8fa66e build: reenable -Wmissing-braces
since we've addressed all the -Wmissing-braces warnings, we can
now enable this warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 16:59:29 +08:00
Kefu Chai
eb7c41767b treewide: add braces around subobject
this change helps to silence the warning of `-Wmissing-braces` from
clang. in general, an object without a user-declared constructor can be
initialized with braces; this is called aggregate initialization.
the standard allows the braces around nested subobjects to be elided,
but clang still suggests writing them out explicitly, warning like

```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
                options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
                                            ^~~~~~~~~~~~~~~~~~~~~~~~~
                                            {                        }
```

in this change, we add braces around the affected subobjects.

also, we take the opportunity to use structured binding to simplify the
related code.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 16:59:29 +08:00
Kefu Chai
91f22b0e81 cql3/stats: use zero-initialization
use {} instead of {0ul} for zero initialization. as `_query_cnt`
is a multi-dimensional array, each element in `_query_cnt` is yet
another array. so we cannot initialize it with a `{0ul}`. but
to zero-initialize this array, we can just use `{}`, as per
https://en.cppreference.com/w/cpp/language/zero_initialization

> If T is array type, each element is zero-initialized.

so this should recursively zero-initialize all arrays in `_query_cnt`.

this change should silence the following warning:

stats.hh:88:60: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
            [statements::statement_type::MAX_VALUE + 1] = {0ul};
                                                           ^~~
                                                           {  }
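A small demonstration of the difference (the `counters` type is a hypothetical stand-in, not the real cql3 stats class): `{}` recursively zero-initializes a multi-dimensional array, while `{0ul}` names only the first subobject and draws the -Wmissing-braces warning:

```cpp
#include <cassert>

struct counters {
    // {} zero-initializes every element, recursively;
    // {0ul} would name a single subobject and trigger -Wmissing-braces.
    unsigned long query_cnt[3][4] = {};
};

unsigned long total(const counters& c) {
    unsigned long sum = 0;
    for (const auto& row : c.query_cnt) {
        for (auto v : row) {
            sum += v;
        }
    }
    return sum;
}
```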

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 16:59:29 +08:00
Botond Dénes
a93e5698b0 Merge 'Adding MindsDB integration to Docs' from Guy Shtub
@annastuchlik please review

Closes #13691

* github.com:scylladb/scylladb:
  adding documentation for integration with MindsDB
  adding documentation for integration with MindsDB
2023-04-28 11:47:10 +03:00
Botond Dénes
c6be764d46 Merge 'build: cmake: pick up tablets related changes and cleanups' from Kefu Chai
this series syncs the CMake build system with `configure.py`, which was updated for introducing the tablets feature. also, this series includes a couple of cleanups.

Closes #13699

* github.com:scylladb/scylladb:
  build: cmake: remove dead code
  build: move test-perf down to test/perf
  build: cmake: pick up tablets related changes
2023-04-28 11:35:04 +03:00
Kefu Chai
066371adfa db_clock: specialize fmt::formatter<db_clock::time_point>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `db_clock::time_point` without the help of `operator<<`.

the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 15:48:06 +08:00
Kefu Chai
7863ef53ad cdc: generation: specialize fmt::formatter<generation_id>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.

the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 15:47:44 +08:00
Michał Sala
3b44ecd1e7 scripts: open-coredump.sh: suggest solib-search-path
Loading cores from Scylla executables installed in a non-standard
location can cause gdb to fail reading required libraries.

This is an example of a warning I got after trying to load a core
generated by a dtest jenkins job (using ./scripts/open-coredump.sh):
> warning: Can't open file /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/0d64f327e1af9bcbb711ee217eda6df16e517c42/libreloc/libboost_system.so.1.78.0 during file-backed mapping note processing

Invocations of `scylla threads` command ended with an error:
> (gdb) scylla threads
> Python Exception <class 'gdb.error'>: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
> Error occurred in Python: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target

An easy fix for this is to set solib-search-path to
/opt/scylladb/libreloc/.

This commit adds that set command to the suggested gdb command line
arguments. I guess it's a good idea to always suggest setting
solib-search-path to that path, as it can save other people from wasting
their time figuring out why coredump opening does not work.

Closes #13696
2023-04-28 08:11:01 +03:00
Kefu Chai
572fab37bb build: cmake: remove dead code
the removed CMake script was designed to cater to the case where
Seastar's CMake script is not included in the parent project, but
this path was never tested and is dysfunctional, as the `target_sources()`
call misses the target parameter. we can add it back when it is actually needed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 11:13:41 +08:00
Kefu Chai
d4530b023e build: move test-perf down to test/perf
so it is closer to where the sources are located.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 11:13:41 +08:00
Kefu Chai
56b99b7879 build: cmake: pick up tablets related changes
to sync with the changes in 5e89f2f5ba

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-28 11:13:41 +08:00
Benny Halevy
935ff0fcbb types: timestamp_from_string: print current_exception on error
We may catch exceptions that are not `marshal_exception`.
Print std::current_exception() in this case to provide
some context about the marshalling error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13693
2023-04-27 22:30:55 +03:00
Asias He
a8040306bb storage_service: Fix removing replace node as pending
Consider

- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2

  ```
  auto host_id = _gossiper.get_host_id(endpoint);
  auto existing = tmptr->get_endpoint_for_host_id(host_id);
  ```

  host_id = new host id
  existing = empty

  As a result, del_replacing_endpoint() will not be called.

This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.

This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced node), where
the replacing node uses a different host id than the node it replaces.

To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
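
the fix can be sketched in isolation; the types and helper names below are illustrative stand-ins, not Scylla's actual storage_service code:

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>

using host_id = std::string;
using endpoint = std::string;

struct token_metadata {
    std::map<host_id, endpoint> endpoints;
    std::set<endpoint> pending_replacing;

    std::optional<endpoint> get_endpoint_for_host_id(const host_id& id) const {
        auto it = endpoints.find(id);
        if (it == endpoints.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    void del_replacing_endpoint(const endpoint& ep) {
        pending_replacing.erase(ep);
    }
};

void handle_state_normal(token_metadata& tm, const host_id& id, const endpoint& ep) {
    auto existing = tm.get_endpoint_for_host_id(id);
    if (!existing) {
        // the fix: the replacing node carries a brand-new host id, so there
        // is no existing mapping -- clear any pending replacing state for it
        tm.del_replacing_endpoint(ep);
    }
    tm.endpoints[id] = ep;
}
```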

Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3

After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3

Tests: https://github.com/scylladb/scylla-dtest/pull/3126

Closes #13677
2023-04-27 21:03:01 +03:00
Raphael S. Carvalho
4e205650b6 test: Verify correctness of sstable::bytes_on_disk()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:48 -03:00
Raphael S. Carvalho
2dbae856f8 sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
bytes_on_disk is the sum of all sstable components.

As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than as an additional
step after all components have been parsed.

Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.

This simplifies the S3 storage code, as it no longer has to care about
feeding a bytes_on_disk value, which is currently limited to data and
index sizes only.
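
the incremental accumulation can be sketched like this (the `size_tracker` name and methods are made up for illustration, not the sstable class's API):

```cpp
#include <cstdint>

// bytes_on_disk grows as each component is read or written, so no extra
// pass over the component files is needed once parsing/writing finishes.
struct size_tracker {
    std::uint64_t bytes_on_disk = 0;

    // read path: read_simple() fetches the file size before parsing
    void add_component_read(std::uint64_t file_size) {
        bytes_on_disk += file_size;
    }
    // write path: write_simple() tracks the offsets of each new component
    void add_component_written(std::uint64_t start_offset, std::uint64_t end_offset) {
        bytes_on_disk += end_offset - start_offset;
    }
};
```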

Refs #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:48 -03:00
Raphael S. Carvalho
4d02821094 sstable: restore indentation in read_digest() and read_checksum()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:48 -03:00
Raphael S. Carvalho
75dc7b799e sstable: make all parsing of simple components go through do_read_simple()
With all parsing of simple components going through do_read_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:48 -03:00
Raphael S. Carvalho
71cd8e6b51 sstable: Add missing pragma once to random_access_reader.hh
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:48 -03:00
Raphael S. Carvalho
b783bddbdf sstable: make all writing of simple components go through do_write_simple()
With all writing of simple components going through do_write_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:06:46 -03:00
Raphael S. Carvalho
bc486b05fa test: sstable_utils: reuse set_values()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:04:52 -03:00
Raphael S. Carvalho
dcee5c4fae sstable: Restore indentation in read_simple()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:04:52 -03:00
Raphael S. Carvalho
253d9e787b sstable: Coroutinize read_simple()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:04:52 -03:00
Raphael S. Carvalho
0dcdec6a55 sstable: Use filter memory footprint in filter_size()
For S3, filter size is currently set to zero, as we want to avoid
"fstat-ing" each file.

On-disk representation of bloom filter is similar to the in-memory
one, therefore let's use memory footprint in filter_size().

The user of filter_size() is the API implementing "nodetool cfstats",
which cares about the size of the bloom filter data (that's how it's
described).

This way, we provide the filter data size regardless of the
underlying storage type.

Refs #13649.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-27 12:04:52 -03:00
Harsh Soni
84ea2f5066 raft: fsm: add empty check for max_read_id_with_quorum
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty or not. The change was made in order to maintain consistency with
the codebase and to make the empty check complete. This change has no
impact on other parts of the codebase.

Closes #13656
2023-04-27 16:04:58 +02:00
Kamil Braun
0bee872fb1 raft topology: rename update_replica_state -> update_topology_state
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).

Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
2023-04-27 15:22:19 +02:00
Kamil Braun
22ab5982e7 raft topology: remove transition_state::normal
What this state really represented is that there is currently no
transition. So remove it and make `transition_state` optional instead.
2023-04-27 15:18:32 +02:00
Kamil Braun
61c4e0ae20 raft topology: switch on transition_state first
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).

We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
2023-04-27 15:14:59 +02:00
Kamil Braun
a023ca2cf1 raft topology: handle_ring_transition: rename res to exec_command_res
A more descriptive name.
2023-04-27 15:12:12 +02:00
Kamil Braun
4ddfce8213 raft topology: parse replaced node in exec_global_command
Will make following commits easier.
2023-04-27 15:10:49 +02:00
Kamil Braun
bafce8fd28 raft topology: extract cleanup_group0_config_if_needed from get_node_to_work_on 2023-04-27 15:04:36 +02:00
Kamil Braun
98f69f52aa storage_service: extract raft topology coordinator fiber to separate class
The lambdas defined inside the fiber are now methods of this class.

Currently `handle_node_transition` is calling `handle_ring_transition`,
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)

Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
2023-04-27 15:04:35 +02:00
Kefu Chai
87e9686f61 cdc: generation: simplify std::visit() call
if the visitor clauses are the same, we can just use the generic version
of it by specifying the parameter with `auto&`. simpler this way.
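
for illustration, a minimal standalone example of the pattern (not the cdc code itself):

```cpp
#include <string>
#include <variant>

// when every clause of a visitor would do the same thing, one generic
// lambda with an `auto&` parameter replaces them all.
std::string stringify(const std::variant<int, long>& v) {
    return std::visit([](auto& x) { return std::to_string(x); }, v);
}
```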

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13626
2023-04-27 14:43:20 +02:00
Alejo Sanchez
47d7939b8f test/topology: register RF pytest marker
Register pytest marker for replication_factor.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13688
2023-04-27 12:14:28 +02:00
Guy Shtub
c4664f9b66 adding documentation for integration with MindsDB 2023-04-27 13:13:19 +03:00
Guy Shtub
7e35a07f93 adding documentation for integration with MindsDB 2023-04-27 13:12:38 +03:00
Kamil Braun
defa63dc20 raft topology: rename replication_state to transition_state
The new name is more generic - it describes the current step of a
'topology saga` (a sequence of steps used to implement a larger topology
operation such as bootstrap).
2023-04-27 11:39:38 +02:00
Kamil Braun
af1ea2bb16 raft topology: make replication_state a topology-global state
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.

The `replication_state` column in `system.topology` is now `static`.

This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
2023-04-27 11:39:38 +02:00
Kamil Braun
30cc07b40d Merge 'Introduce tablets' from Tomasz Grabiec
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.

Things achieved in this PR:

  - You can start a cluster and create a keyspace whose tables will use
    tablet-based replication. This is done by setting `initial_tablets`
    option:

    ```
        CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                        'replication_factor': 3,
                        'initial_tablets': 8};
    ```

    All tables created in such a keyspace will be tablet-based.

    Tablet-based replication is a trait, not a separate replication
    strategy. Tablets don't change the spirit of the replication strategy;
    they just alter the way in which data ownership is managed. In theory, we
    could use it for other strategies as well like
    EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
    is augmented to support tablets.

  - You can create and drop tablet-based tables (no DDL language changes)

  - DML / DQL work with tablet-based tables

    Replicas for tablet-based tables are chosen from tablet metadata
    instead of token metadata

Things which are not yet implemented:

  - handling of views, indexes, CDC created on tablet-based tables
  - sharding is done using the old method, it ignores the shard allocated in tablet metadata
  - node operations (topology changes, repair, rebuild) are not handling tablet-based tables
  - not integrated with compaction groups
  - tablet allocator piggy-backs on tokens to choose replicas.
    Eventually we want to allocate based on current load, not statically

Closes #13387

* github.com:scylladb/scylladb:
  test: topology: Introduce test_tablets.py
  raft: Introduce 'raft_server_force_snapshot' error injection
  locator: network_topology_strategy: Support tablet replication
  service: Introduce tablet_allocator
  locator: Introduce tablet_aware_replication_strategy
  locator: Extract maybe_remove_node_being_replaced()
  dht: token_metadata: Introduce get_my_id()
  migration_manager: Send tablet metadata as part of schema pull
  storage_service: Load tablet metadata when reloading topology state
  storage_service: Load tablet metadata on boot and from group0 changes
  db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
  migration_notifier: Introduce before_drop_keyspace()
  migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
  test: perf: Introduce perf-tablets
  test: Introduce tablets_test
  test: lib: Do not override table id in create_table()
  utils, tablets: Introduce external_memory_usage()
  db: tablets: Add printers
  db: tablets: Add persistence layer
  dht: Use last_token_of_compaction_group() in split_token_range_msb()
  locator: Introduce tablet_metadata
  dht: Introduce first_token()
  dht: Introduce next_token()
  storage_proxy: Improve trace-level logging
  locator: token_metadata: Fix confusing comment on ring_range()
  dht, storage_proxy: Abstract token space splitting
  Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
  db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
  db: Introduce get_non_local_vnode_based_strategy_keyspaces()
  service: storage_proxy: Avoid copying keyspace name in write handler
  locator: Introduce per-table replication strategy
  treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
  locator: Introduce effective_replication_map
  locator: Rename effective_replication_map to vnode_effective_replication_map
  locator: effective_replication_map: Abstract get_pending_endpoints()
  db: Propagate feature_service to abstract_replication_strategy::validate_options()
  db: config: Introduce experimental "TABLETS" feature
  db: Log replication strategy for debugging purposes
  db: Log full exception on error in do_parse_schema_tables()
  db: keyspace: Remove non-const replication strategy getter
  config: Reformat
2023-04-27 09:40:18 +02:00
Kefu Chai
f5b05cf981 treewide: use defaulted operator!=() and operator==()
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined; the language now understands
that the comparison is symmetric in the new standard.

fortunately, our operator!=() is always equivalent to
`! operator==()`, which matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.

in addition to the defaulted operator!=, C++20 also brings us
the defaulted operator==() -- the compiler can generate an
operator==() that performs a member-wise lexicographical
comparison. under some circumstances, this is exactly what we
need. so, in this change, if an operator==() is implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is replaced with the default
generated one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.

sometimes, we fail to mark operator== with the `const`
specifier. in this change, to fulfill the needs of the C++
standard, and to be more correct, the `const` specifier is added.

also, to generate the defaulted operator==, the operand should
be `const class_name&`, but this is not always the case: in the
`version` class, we use `version` as the parameter type. to
fulfill the needs of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and is a more idiomatic
way to pass a non-trivial struct as a function parameter.

please note, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric form of another variant; if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.

this change is a cleanup to modernize the code base with C++20
features.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13687
2023-04-27 10:24:46 +03:00
Botond Dénes
3e92bcaa20 Merge 'utils: redesign reusable_buffer' from Michał Chojnowski
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.

This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.

Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.

Closes #13324

* github.com:scylladb/scylladb:
  zstd: share buffers between compressor instances
  utils: redesign reusable_buffer
2023-04-27 09:09:09 +03:00
Pavel Emelyanov
4f93b440a5 sstables: Remove lost eptr variable from do_write_simple()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13684
2023-04-27 07:37:15 +03:00
Anna Stuchlik
c7df168059 doc: move Glossary to the Reference section
This commit moves the Glossary page to the Reference
section. In addition, it adds the redirection so that
there are no broken links because of this change
and fixes a link to a subsection of Glossary.

Closes #13664
2023-04-27 07:03:55 +03:00
Michał Chojnowski
16dd93cb7e zstd: share buffers between compressor instances
The zstd implementation of `compressor` has a separate decompression and
compression context per instance. This is unreasonably wasteful. One
decompression buffer and one compression buffer *per shard* is enough.

The waste is significant. There might exist thousands of SSTable readers, each
containing its own instance of `compressor` with several hundred KiB worth of
unneeded buffers. This adds up to gigabytes of wasted memory and gigapascals
of allocator pressure.

This patch modifies the implementation of zstd_processor so that all its
instances on the shard share their contexts.

Fixes #11733
2023-04-26 22:09:17 +02:00
Michał Chojnowski
bf26a8c467 utils: redesign reusable_buffer
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (i.e. some compression
libraries, like LZ4).

Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.

An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely it will be able
to handle future requests without reallocation, but also the more
memory it ties up.

If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.

However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.

The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.

This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.

The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
   seen in the last period.
   As long as at least one maximal (up to a power of 2) "normal" request
   appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
   power of 2.
3. The resize down period is no longer measured in number of requests
   but in real time.

Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.

Fixes #13437
2023-04-26 22:09:17 +02:00
Anna Stuchlik
1ce50faf02 doc: remove reduntant information about versions
Fixes https://github.com/scylladb/scylladb/issues/13578

Now that the documentation is versioned, we can remove
the .. versionadded:: and .. versionchanged:: information
(especially that the latter is hard to maintain and now
outdated), as well as the outdated information about
experimental features in very old releases.

This commit removes that information and nothing else.

Closes #13680
2023-04-26 17:20:52 +03:00
Botond Dénes
5aaa30b267 Merge 'treewide: stop using std::rel_ops' from Kefu Chai
std::rel_ops was deprecated in C++20, as C++20 provides a better solution for defining comparison operators, and all the use cases previously addressed by `using namespace std::rel_ops` have been addressed either by `operator<=>` or the default-generated `operator!=`.

so, in this series, to avoid using deprecated facilities, let's drop all these `using namespace std::rel_ops`. there are many more cases where we could use either `operator<=>` or the default-generated `operator!=` to simplify the implementation, but here we care more about `std::rel_ops`; we will drop most (if not all) of the explicitly defined `operator!=` and other comparison operators later.

Closes #13676

* github.com:scylladb/scylladb:
  treewide: do not use std::rel_ops
  dht: token: s/tri_compare/operator<=>/
2023-04-26 16:49:44 +03:00
Aleksandra Martyniuk
725110a035 docs: clarify the meaning of cfhistogram's sstable column
Closes #13669
2023-04-26 16:19:23 +03:00
Tomasz Grabiec
8d5467fa9c Merge 'Some minor improvements in table' from Raphael "Raph" Carvalho
Removed outdated comments and added reserve() to avoid reallocations.

Closes #13672

* github.com:scylladb/scylladb:
  table: Avoid reallocations in make_compaction_groups()
  table: Remove another outdated comment regarding sstable generation
  table: Remove outdated comment regarding automatic compaction
2023-04-26 14:43:49 +02:00
Botond Dénes
88c19b23dc reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not only avoids a potential `std::bad_alloc`, but also
helps relieve memory pressure when it is most needed, by not failing
calls that release memory.
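
the special case can be sketched with a toy resource pool (an illustrative API, not the real reader_permit):

```cpp
#include <cstdint>

// `pool` tracks resources currently consumed from a shared budget.
struct units {
    std::uint64_t amount;
    std::uint64_t& pool;

    void consume(std::uint64_t n) {
        // in the real code this path can throw (e.g. std::bad_alloc)
        pool += n;
        amount += n;
    }
    void signal(std::uint64_t n) noexcept {
        pool -= n;
        amount -= n;
    }
    void reset_to(std::uint64_t new_amount) {
        if (new_amount < amount) {
            // net reduction: only signal() the difference, so the
            // potentially-throwing consume() is never reached
            signal(amount - new_amount);
            return;
        }
        if (new_amount > amount) {
            consume(new_amount - amount);
        }
    }
};
```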
2023-04-26 07:41:57 -04:00
Botond Dénes
2449b714df reader_permit: split resource_units::reset()
Into reset_to() and reset_to_zero(). The latter replaces `reset()`
called with the default 0 resources argument, which often happened in
noexcept contexts. Splitting it out from `reset()` allows for a
specialized implementation that is indeed guaranteed to be `noexcept`,
and thus gives peace of mind.
2023-04-26 07:41:57 -04:00
Botond Dénes
21988842de reader_permit: make consume()/signal() API private
This API is dangerous, all resource consumption should happen via RAII
objects that guarantee that all consumed resources are appropriately
released.
At this point, said API is just a low-level building block for
higher-level, RAII objects. To ensure nobody thinks of using it for
other purposes, make it private and make external users friends instead.
2023-04-26 07:41:53 -04:00
Tomasz Grabiec
ce94a2a5b0 Merge 'Fixes and tests for raft-based topology changes' from Kamil Braun
Fix two issues with the replace operation introduced by recent PRs.

Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).

Fixes: #13651

Closes #13655

* github.com:scylladb/scylladb:
  test: new suite for testing raft-based topology
  test: remove topology_custom/test_custom.py
  raft topology: don't require new CDC generation UUID to always be present
  raft topology: include shard_count/ignore_msb during replace
2023-04-26 11:38:07 +02:00
Kefu Chai
951457a711 treewide: do not use std::rel_ops
std::rel_ops was deprecated in C++20, as C++20 provides a better
solution for defining comparison operators. and all the use cases
previously addressed by `using namespace std::rel_ops` have
been addressed either by `operator<=>` or the default-generated
`operator!=`.

so, in this change, to avoid using deprecated facilities, let's
drop all these `using namespace std::rel_ops`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-26 14:09:58 +08:00
Kefu Chai
5a11d67709 dht: token: s/tri_compare/operator<=>/
now that C++20 is able to generate the default comparison
operators for us, there is no need to define them manually. also,
`std::rel_ops::*` is deprecated in C++20.

also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-26 14:09:57 +08:00
Kefu Chai
20da130cdf mutation: specialize fmt::formatter<range_tombstone_{entry,list}>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `range_tombstone_list` and `range_tombstone_entry`
without the help of `operator<<`.

the corresponding `operator<<()` for `range_tombstone_entry` is moved
into the test where it is used, and the other one is dropped in this
change, as all its callers now use fmtlib for formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13627
2023-04-26 09:00:25 +03:00
Kefu Chai
c8aa7295d4 cql3: drop unused function
there are two variants of `query_processor::for_each_cql_result()`;
both of them perform pagination of the results returned by a CQL
statement. the one which accepts a function returning an immediate
value is not used now, so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13675
2023-04-26 08:43:22 +03:00
Raphael S. Carvalho
59904be5c3 table: Avoid reallocations in make_compaction_groups()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-25 11:14:33 -03:00
Raphael S. Carvalho
9f5e19224d table: Remove another outdated comment regarding sstable generation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-25 11:09:51 -03:00
Raphael S. Carvalho
2d45dd35c7 table: Remove outdated comment regarding automatic compaction
We already provide a way to disable automatic compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-25 11:09:45 -03:00
Pavel Emelyanov
9bb4ee160f gossiper: Remove features and sysks from gossiper
Now that the gossiper doesn't need those two as dependencies, they can
be removed, making the code shorter and the dependency graph simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:06:06 +03:00
Pavel Emelyanov
5cbc8fe2f9 system_keyspace: De-static save_local_supported_features()
That's, in fact, an independent change, because the feature enabler
doesn't need this method. So this patch is a "while at it" thing, but
on the other hand it ditches one more qctx usage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:04:54 +03:00
Pavel Emelyanov
a5bd6cc832 system_keyspace: De-static load_|save_local_enabled_features()
All callers now have the system keyspace instance at hand.

Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:03:09 +03:00
Pavel Emelyanov
9bfbcaa3f6 system_keyspace: Move enable_features_on_startup to feature_service (cont)
Now move the code itself. No functional changes here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:02:38 +03:00
Pavel Emelyanov
858db9f706 system_keyspace: Move enable_features_on_startup to feature_service
This code belongs to the feature service; the system keyspace shouldn't
be aware of any peculiarities of startup feature enabling, only of
loading and saving the feature lists.

For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 17:00:30 +03:00
Pavel Emelyanov
71eb4edf3c feature_service: Open-code persist_enabled_feature_info() into enabler
The method in question is only called by the enabler and is short enough
to be merged into the caller. This kills two birds with one stone -- it
makes fewer walks over the features list and will make it possible to
de-static the system keyspace feature load and save methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:58:49 +03:00
Pavel Emelyanov
474548f614 gms: Move feature enabler to feature_service.cc
No functional changes, just move the code. This makes the gossiper not
mess with enabling/persisting features, but just gossip them around.
The feature intersection code is still in the gossiper, but can be
moved to a more suitable location any time later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:57:19 +03:00
Pavel Emelyanov
dcf88b07a4 gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
This will make it possible to move the enabler to feature_service.cc

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:56:07 +03:00
Pavel Emelyanov
1461a892a6 gms: Persist features explicitly in features enabler
Nowadays features are persisted in feature_service::enable() and there
are four callers of it:

- feature enabler via gossiper notifications
- boot kicks feature enabler too
- schema loader tool
- cql test env

None but the first case needs to persist features. The loader tool in
fact doesn't do it even now, by keeping qctx uninitialized. The cql test
env wires up the qctx, but it makes no difference to the test cases
themselves whether the features are persisted or not.

Boot-time is a bit trickier -- it loads the feature list from the
system keyspace and may filter out some of the features, then enables
the rest. In that case committing the list back into the system
keyspace makes no sense, as the persisted list doesn't grow.

The enabler, in turn, can call system keyspace directly via its explicit
dependency reference. This fixes the inverse dependency between system
keyspace and feature service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:51:15 +03:00
Pavel Emelyanov
ba7af749b1 feature_service: Make persist_enabled_feature_info() return a future
It now knows that it runs inside an async context, but things are changing
and soon it will be moved out of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:50:32 +03:00
Pavel Emelyanov
1ee04e4934 system_keyspace: De-static load_peer_features()
This makes use of feature_enabler::_sys_ks dependency and gets rid of
one more global qctx usage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:50:00 +03:00
Pavel Emelyanov
e30c72109f gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
It's the enabler that's responsible for enabling the features and,
implicitly, persisting them into the system keyspace. This patch moves
this logic from gossiper to feature_enabler; further patching will make
the persisting code explicit.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:47:30 +03:00
Pavel Emelyanov
ac60d8afca gossiper: Enable features and register enabler from outside
It's a bit hairy. The maybe_enable_features() is called from two places
-- by the feature_enabler upon notifications from the gossiper, and
directly by the gossiper from wait_for_gossip_to_settle().

The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).

Next, although the feature_enabler is registered early -- when the
gossiper instance is constructed by sharded<gossiper>::start() -- it
checks that _gossip_settled is true before taking any action.

Doing both -- calling maybe_enable_features() _and_ registering the
enabler after storage_service's call to wait_for_gossip_to_settle() --
doesn't break the code logic, but makes further patching possible. In
particular, the feature_enabler will move to feature_service so as not
to pollute the gossiper code with anything that's not gossiping.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:42:17 +03:00
Pavel Emelyanov
cefcdeee1e gms: Add feature_service and system_keyspace to feature_enabler
And rename the guy. These dependencies will be used later; both are
available and started when the enabler is constructed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-25 16:41:09 +03:00
Kamil Braun
a29b8cd02b Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.

Fixes #13657

Closes #13658

* github.com:scylladb/scylladb:
  cql3: fix printing of column_specification::name in some error messages
  cql3: fix printing of column_definition::name in some error messages
2023-04-25 14:21:09 +02:00
Avi Kivity
a1b99d457f Update tools/jmx submodule (error handling when jdk not available)
* tools/jmx fdd0474...5f98894 (1):
  > install.sh: bail out if jdk is not available
2023-04-25 14:20:57 +02:00
Kefu Chai
5804eb6d81 storage_service: specialize fmt::formatter<storage_service::mode>
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.

the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers now use fmtlib for formatting.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13640
2023-04-25 14:20:57 +02:00
Tomasz Grabiec
a717c803c7 tests: row_cache: Add reproducer for reader producing missing closing range tombstone
Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c992f. It's still useful
to keep the test to avoid regressions.

The bug manifests by reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that, prior to the rework of the cache reader,
range_tombstone_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change, and it did not correctly
handle the case where there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.

Closes #13665
2023-04-25 14:20:57 +02:00
Gleb Natapov
9849409c2a service/raft: raft_group0: drop dependency on migration_manager
raft_group0 does not really depend on migration_manager; it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
2023-04-25 12:38:01 +03:00
Gleb Natapov
d5d156d474 service/raft: raft_group0: drop dependency on query_processor
raft_group0 does not really depend on query_processor; it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
2023-04-25 12:35:57 +03:00
Kamil Braun
59eb01b7a6 test: new suite for testing raft-based topology
Introduce new test suite for testing the new topology coordinator
(runs under `raft` experimental flag). Add a simple test that performs a
basic sequence of topology operations.
2023-04-25 11:04:51 +02:00
Gleb Natapov
029f1737ef service/raft: raft_group0: drop dependency on storage_service
raft_group0 does not really depend on storage_service; it needs it only
transiently, so pass it to appropriate methods of raft_group0 instead
of during its creation.
2023-04-25 11:07:47 +03:00
Botond Dénes
8765442f3f Merge 'utils: add basic_xx_hasher' from Benny Halevy
Consolidate `bytes_view_hasher` and abstract_replication_strategy's `factory_key_hasher`, which are identical, into a reusable utils::basic_xx_hasher.

To be used in a followup series for netw:msg_addr.

Closes #13530

* github.com:scylladb/scylladb:
  utils: hashing: use simple_xx_hasher
  utils: hashing: add simple_xx_hasher
  utils: hashers: add HasherReturning concept
  hashing: move static_assert to source file
2023-04-25 09:53:47 +02:00
Kefu Chai
f4016d3289 cql3: coroutinize query_processor::for_each_cql_result
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13621
2023-04-25 09:53:47 +02:00
Botond Dénes
b9491c0134 Merge 'Test the column_family rest api' from Benny Halevy
Add a test for get/enable/disable auto_compaction via to column_family api.
And add log messages for admin operations over that api.

Closes #13566

* github.com:scylladb/scylladb:
  api: column_family: add log messages for admin operation
  test: rest_api: add test_column_family
2023-04-25 09:53:47 +02:00
Wojciech Mitros
b0fa59b260 build: add tools for optimizing the Wasm binaries and translating to wat
After the addition of the rust-std-static-wasm32-wasi target, we're
able to compile the Rust programs to Wasm binaries. However, we're still
only able to handle the Wasm UDFs in the Text format, so we need a tool
to translate the .wasm files to .wat. Additionally, the .wasm files
generated by default are unnecessarily large, which can be helped
using wasm-opt and wasm-strip.
The tool for translating wasm to wat (wasm2wat), and the tool for
stripping the wasm binaries (wasm-strip) are included in the `wabt`
package, and the optimization tool (wasm-opt) is included in the
`binaryen` package. Both packages are added to install-dependencies.sh

Closes #13282

[avi: regenerate frozen toolchain]

Closes #13605
2023-04-25 09:53:47 +02:00
Pavel Emelyanov
9a9dbffce3 s3/client: Zeroify stat by default
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to sane values, most of which are constants.
However, other fields remain uninitialized, which sometimes leads to
trouble. Better to fill the stat with zeroes and revisit it later for
more sane values.

fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13650
2023-04-25 09:53:47 +02:00
Kefu Chai
b0a01d85e9 s3/test: collect log on exit
the temporary directory holding the log file collecting the scylla
subprocess's output is specified by the test itself, and it is
`test_tempdir`. but unfortunately, cql-pytest/run.py is not aware
of this, so `cleanup_all()` is not able to print out the logging
messages at exit. note that cql-pytest/run.py always collects the
"log" file under the directory created using `pid_to_dir()`, where
pid is that of the spawned subprocess, but `object_store/run` uses
the main process's pid for its reusable tempdir.

so, with this change, we also register a cleanup func to print out
the logging messages when the test exits.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-25 09:53:47 +02:00
Alejo Sanchez
c06e01cfba test/topology: log stages for concurrent test
For concurrent schema changes test, log when the different stages of the
test are finished.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13654
2023-04-25 09:53:47 +02:00
Kefu Chai
cc87e10f40 dht: print pk in decorated_key with "pk" prefix
this change ensures that `dk._key` is formatted with the "pk" prefix.
in 3738fcb, the `operator<<` for partition_key was removed, so the
compiler has to find an alternative when this operator<< is called.
from the compiler's perspective, `partition_key` has an
`operator managed_bytes_view` which is not marked explicit, and
`managed_bytes_view` does support `operator<<`. so the format of
`decorated_key` changes when it is printed using `operator<<`: the
code compiles, but unfortunately the behavior is changed, and it
breaks scylla-dtest/cdc_tracing_info_test.py, where the partition_key
is supposed to be printed like "pk{010203}" instead of "010203". the
latter is how `managed_bytes_view` is formatted.

a test is added accordingly to avoid future changes which break the
dtest.

Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13653
2023-04-25 09:53:47 +02:00
Nadav Har'El
bd09dc308c cql3: fix printing of column_specification::name in some error messages
column_specification::name is a shared pointer, so it should be
dereferenced before printing - because we want to print the name, not
the pointer.

Fix a few instances of this mistake in prepare_expr.cc. Other instances
were already correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-04-25 10:46:56 +03:00
Nadav Har'El
4eabb3f429 cql3: fix printing of column_definition::name in some error messages
Printing a column_definition::name() in an error message is wrong,
because it is "bytes" and is printed as hexadecimal ASCII codes :-(

Some error messages in cql3/operation.cc incorrectly used name()
and should be changed to name_as_text(), as was correctly done in
a few other error messages in the same file.

This patch also fixes a few places in the test/cql approval tests which
"enshrined" the wrong behavior - printing things like 666c697374696e74
in error messages - and now needs to be fixed for the right behavior.

Fixes #13657

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-04-25 10:46:47 +03:00
Kamil Braun
b1d58c3d3a test: remove topology_custom/test_custom.py
It was a temporary test just to check that the `topology_custom` suite
works. The suite now contains a real test so we can remove this one.
2023-04-24 14:41:33 +02:00
Kamil Braun
3f0498ca53 raft topology: don't require new CDC generation UUID to always be present
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
2023-04-24 14:41:33 +02:00
Kamil Braun
9ca53478ed raft topology: include shard_count/ignore_msb during replace
Fixes: #13651
2023-04-24 14:40:47 +02:00
Kefu Chai
124153d439 build: cmake: sync with configure.py
this changes updates the CMake building system with the changes
introduced by 3f1ac846d8 and
d1817e9e1b

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13648
2023-04-24 14:55:20 +03:00
Benny Halevy
b3d91cbf65 utils: hashing: use simple_xx_hasher
Use simple_xx_hasher for bytes_view and effective_replication_map::factory_key
appending hashers instead of their custom, yet identical implementations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 14:07:25 +03:00
Benny Halevy
f4fefec343 utils: hashing: add simple_xx_hasher
And a respective unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 14:06:43 +03:00
Benny Halevy
b638dddf1b utils: hashers: add HasherReturning concept
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().

HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 14:06:40 +03:00
Benny Halevy
a765472b8b hashing: move static_assert to source file
No need to check it inline in the header.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 12:23:03 +03:00
Tomasz Grabiec
03035e3675 test: topology: Introduce test_tablets.py 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
c1fdbe79b7 raft: Introduce 'raft_server_force_snapshot' error injection
Will be used by tests to force followers to catch up from the snapshot.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
819bc86f0f locator: network_topology_strategy: Support tablet replication 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
5e89f2f5ba service: Introduce tablet_allocator
Currently, responsible for injecting mutations of system.tablets to
schema changes.

Note that not all migrations are handled currently. Dependent view or
cdc table drops are not handled.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
6d4d3d8bbd locator: Introduce tablet_aware_replication_strategy
tablet_aware_replication_strategy is a trait class meant to be
inherited by replication strategies which want to work with tablets. The
trait produces a per-table effective_replication_map which looks at
tablet metadata to determine replicas.

No replication strategy is changed to use tablets yet in this patch.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
97b969224c locator: Extract maybe_remove_node_being_replaced() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
e6b76ac4b9 dht: token_metadata: Introduce get_my_id() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
46eae545ad migration_manager: Send tablet metadata as part of schema pull
This is currently used by group0 to transfer snapshot of the raft
state machine.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
a8a03ee502 storage_service: Load tablet metadata when reloading topology state
This change puts the reloading into topology_state_load(), which is a
function which reloads token_metadata from system.topology (the new
raft-based topology management). It clears the metadata, so needs to
reload tablet map too. In the future, tablet metadata could change as
part of topology transaction too, so we reload rather than preserve.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
d42685d0cb storage_service: Load tablet metadata on boot and from group0 changes 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
41e69836fd db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
b754433ac1 migration_notifier: Introduce before_drop_keyspace()
Tablet allocator will need to inject mutations on keyspace drop.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
5b046043ea migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
It will be extended with listener notification firing, which is an
async operation.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
4b4238b069 test: perf: Introduce perf-tablets
Example output:

  $ build/release/scylla perf-tablets --tables 10 --tablets-per-table $((8*1024)) --rf 3

  testlog - Total tablet count: 81920
  testlog - Size of tablet_metadata in memory: 7683 KiB
  testlog - Copied in 2.163421 [ms]
  testlog - Cleared in 0.767507 [ms]
  testlog - Saved in 774.813232 [ms]
  testlog - Read in 246.666885 [ms]
  testlog - Read mutations in 211.677292 [ms]
  testlog - Size of canonical mutations: 20.633621 [MiB]
  testlog - Disk space used by system.tablets: 0.902344 [MiB]
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
70a35f70a6 test: Introduce tablets_test 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
b4ac329367 test: lib: Do not override table id in create_table()
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
5a24984147 utils, tablets: Introduce external_memory_usage() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
f3fbfdaa37 db: tablets: Add printers
Example:

TRACE 2023-03-30 12:06:33,918 [shard 0] tablets - Read tablet metadata: {
  8cd5b560-cee2-11ed-9cd5-7f37187f2167: {
    [0]: last_token=-6917529027641081857, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [1]: last_token=-4611686018427387905, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [2]: last_token=-2305843009213693953, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [3]: last_token=-1, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [4]: last_token=2305843009213693951, replicas={3160b965-1925-4677-884b-c761e2bf4272:0},
    [5]: last_token=4611686018427387903, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [6]: last_token=6917529027641081855, replicas={4fe5c4d5-7030-4ddd-8117-ba22c29f4f57:0},
    [7]: last_token=9223372036854775807, replicas={3160b965-1925-4677-884b-c761e2bf4272:0}
  }
}
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
9d786c1ebc db: tablets: Add persistence layer 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
fa8ad9a585 dht: Use last_token_of_compaction_group() in split_token_range_msb() 2023-04-24 10:49:37 +02:00
Tomasz Grabiec
fceb5f8cf6 locator: Introduce tablet_metadata
token_metadata now stores tablet metadata with information about
tablets in the system.
2023-04-24 10:49:37 +02:00
Tomasz Grabiec
241f7febec dht: Introduce first_token() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
462e3ffd36 dht: Introduce next_token() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
27acf3b129 storage_proxy: Improve trace-level logging 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
34a9c62ae5 locator: token_metadata: Fix confusing comment on ring_range()
It could be interpreted to mean that the search token is excluded.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
e4865bd4d1 dht, storage_proxy: Abstract token space splitting
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.

This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
b769c4ee55 Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
This reverts commit 95bf8eebe0.

Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.

The unit test asks the query_ranges_to_vnodes_generator to split the range:

   [t:end, t+1:start)

around token t, and expects the generator to produce an empty range

   [t:end, t:end]

After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
94e1c7b859 db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.

This change will also make node operations like bootstrap, repair,
etc. work (not fail) in the presence of keyspaces with per-table
erms; such keyspaces will just not be replicated using those algorithms.

Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
dc04da15ec db: Introduce get_non_local_vnode_based_strategy_keyspaces()
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
8fcb320e71 service: storage_proxy: Avoid copying keyspace name in write handler 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
9b17ad3771 locator: Introduce per-table replication strategy
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.

Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.

For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.

Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
5d9bcb45de treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
bb297d86a0 locator: Introduce effective_replication_map
With tablet-based replication strategies it will represent replication
of a single table.

Current vnode_effective_replication_map can be adapted to this interface.

This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
d3c9ad4ed6 locator: Rename effective_replication_map to vnode_effective_replication_map
In preparation for introducing a more abstract
effective_replication_map which can describe replication maps which
are not based on vnodes.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
1343bfa708 locator: effective_replication_map: Abstract get_pending_endpoints() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
7b01fe8742 db: Propagate feature_service to abstract_replication_strategy::validate_options()
Some replication strategy options may be feature-dependent.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
9781d3ffc5 db: config: Introduce experimental "TABLETS" feature 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
a892e144cc db: Log replication strategy for debugging purposes 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
7543c75b62 db: Log full exception on error in do_parse_schema_tables() 2023-04-24 10:49:36 +02:00
Tomasz Grabiec
c923bdd222 db: keyspace: Remove non-const replication strategy getter
Keyspace will store replication_ptr, which is a const pointer. No user
needs a mutable reference.
2023-04-24 10:49:36 +02:00
Tomasz Grabiec
bf2ce8ff75 config: Reformat 2023-04-24 10:49:36 +02:00
Benny Halevy
9768046d7c compaction_manager: print compaction_group id
Add a formatter to compaction::table_state that
prints the table ks_name.cf_name and compaction group id.

Fixes #13467

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 10:07:03 +03:00
Benny Halevy
dabf46c37f compaction_group, table_state: add group_id member
To help identify the compaction group / table_state.

Ref #13467

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 10:06:04 +03:00
Benny Halevy
1134ca2767 compaction_manager: offstrategy compaction: skip compaction if no candidates are found
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do.  In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.

This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message at debug
level instead.

Fixes #13466

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-24 09:23:32 +03:00
Benny Halevy
2e24b05122 compaction: make_partition_filter: do not assert shard ownership
Now, with f1bbf705f9
(Cleanup sstables in resharding and other compaction types),
we may filter sstables as part of resharding compaction
and the assertion that all tokens are owned by the current
shard when filtering is no longer true.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 15:24:20 +03:00
Benny Halevy
c7d064b8b1 distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.

This is particularly important for uploaded sstables
that are resharded since they require cleanup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 15:13:16 +03:00
Benny Halevy
2f61de8f7b table, compaction_manager: prevent cross shard access to owned_ranges_ptr
Seen after f1bbf705f9 in debug mode

distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.

Since update_sstable_cleanup_state is synchronous, it can
be passed a const reference to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.

Fixes #13631

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-23 15:12:13 +03:00
Botond Dénes
ecbb118d32 reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
Update comments, test names and etc. that are still using the old terminology for
permit state names, bring them up to date with the recent state name changes.
2023-04-19 05:31:27 -04:00
Botond Dénes
e71d6566ab reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
They are still using the old terminology for permit state names, bring
them up to date with the recent state name changes.
2023-04-19 05:20:44 -04:00
Botond Dénes
804403f618 reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
They are still using the old terminology for permit state names, bring
them up to date with the recent state name changes.
2023-04-19 05:20:42 -04:00
Botond Dénes
89328ce447 reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
It is still using the old terminology for permit state names, bring it
up to date with the recent state name changes.
2023-04-19 05:18:13 -04:00
Botond Dénes
3919effe2d reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
It is still using the old terminology for permit state names, bring it
up to date with the recent state name changes.
2023-04-19 05:17:34 -04:00
Benny Halevy
456f5dfce5 api: column_family: add log messages for admin operation
Similar to the storage_service api, print a log message
for admin operations like enabling/disabling auto_compaction,
running major compaction, and setting the table compaction
strategy.

Note that there is overlap in functionality
between the storage_service and the column_family api entry points.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-18 17:11:33 +03:00
Benny Halevy
5e371e7861 test: rest_api: add test_column_family
Add a test for column_family/autocompaction

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-18 17:09:31 +03:00
Kefu Chai
37cf04818e alternator: split the param list of executor ctor into multi lines
before this change, the line was 249 chars long, so split it into
multiple lines for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-23 20:57:28 +08:00
Kefu Chai
69c21f490a alternator,config: make alternator_timeout_in_ms live-updateable
before this change, alternator_timeout_in_ms was not live-updatable:
the executor's default timeout was set once, right before creating the
sharded executor instances, and they never picked up later changes to
this option.

in this change,

* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
  so it can be updated by a per-shard updateable_value. It is now an
  updateable_value itself, so its variable name is updated accordingly.
  This value is set in the ctor of executor, and it is disconnected
  from the corresponding named_value<> option in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
  executor via sharded_parameter, so executor::_timeout_in_ms can
  be initialized on a per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
  the option to executor in its ctor.

please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing, as in production we always set the timeout with the value
specified either by the default or by the yaml conf file.

Fixes #12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-23 20:57:08 +08:00
Wojciech Mitros
b03fce524b cql-pytest: test permissions for UDTs with quoted names
Currently, we only tested whether permissions with UDFs
that have quoted names work correctly. This patch adds
the missing test that confirms that we can also use UDTs
(as UDF parameter types) when altering permissions.
2023-03-23 01:41:58 +01:00
Wojciech Mitros
169a821316 cql: maybe quote user type name in ut_name::to_string()
Currently, ut_name::to_string() is used in only 2 cases:
the first one is in logs or as part of error messages, and the
second one is during parsing, temporarily storing the user
defined type name in the auth::resource for later preparation
with database and data_dictionary context.

This patch changes the string so that the 'name' part of the
ut_name (as opposed to the 'keyspace' part) is now quoted when
needed. This does not worsen the logging set of cases, but it
does help with parsing of the resulting string, when finishing
preparing the auth::resource.

After the modification, a more fitting name for the function
is "ut_name::to_cql_string()", so the function is renamed to that.
2023-03-23 01:41:58 +01:00
Wojciech Mitros
fc8dcc1a62 cql: add a check for currently used stack in parser
While in debug mode, we may switch the default stack to
a larger one when parsing cql. We may, however, invoke
the parser recursively, causing us to switch to the big
stack while currently using it. After the reset, we
assume that the stack is empty, so after switching to
the same stack, we write over its previous contents.

This is fixed by checking if we're already using the large
stack, which is achieved by comparing the address of
a local variable to the start and end of the large stack.
2023-03-23 01:41:58 +01:00
Wojciech Mitros
a086682ecb cql-pytest: add an optional name parameter to new_type()
Currently, when creating a UDT, we're always generating
a new name for it. This patch enables setting the name
to a specific string instead.
2023-03-23 01:41:58 +01:00
Marcin Maliszkiewicz
0b5655021a alternator: remove redundant flush call in make_streamed
As output_stream's close() does a flush anyway.
2023-01-23 13:46:06 +01:00
Marcin Maliszkiewicz
f96ed4dba5 utils: yield when streaming json in print()
- removed buffer reuse to simplify the code
- added a co_await suspension point on each send(), making it yield
2023-01-23 13:46:06 +01:00
Marcin Maliszkiewicz
f2788f5391 alternator: yield during BatchGetItem operation 2023-01-20 14:44:24 +01:00
1476 changed files with 70819 additions and 51440 deletions

2
.gitignore vendored

@@ -26,8 +26,6 @@ tags
testlog
test/*/*.reject
.vscode
docs/_build
docs/poetry.lock
compile_commands.json
.ccls-cache/
.mypy_cache


@@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 3.18)
cmake_minimum_required(VERSION 3.27)
project(scylla)
@@ -8,11 +8,19 @@ list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${CMAKE_CURRENT_SOURCE_DIR}/seastar/cmake)
set(CMAKE_BUILD_TYPE "${CMAKE_BUILD_TYPE}" CACHE
STRING "Choose the type of build." FORCE)
# Set the possible values of build type for cmake-gui
set(scylla_build_types
"Debug" "Release" "Dev" "Sanitize" "Coverage")
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS
"Debug" "Release" "Dev" "Sanitize")
${scylla_build_types})
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE "Release" CACHE
STRING "Choose the type of build." FORCE)
message(WARNING "CMAKE_BUILD_TYPE not specified, using 'Release'")
elseif(NOT CMAKE_BUILD_TYPE IN_LIST scylla_build_types)
message(FATAL_ERROR "Unknown CMAKE_BUILD_TYPE: ${CMAKE_BUILD_TYPE}. "
"Following types are supported: ${scylla_build_types}")
endif()
string(TOUPPER "${CMAKE_BUILD_TYPE}" build_mode)
include(mode.${build_mode})
include(mode.common)
@@ -26,6 +34,9 @@ set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_VISIBILITY_PRESET hidden)
set(Seastar_TESTING ON CACHE BOOL "" FORCE)
set(Seastar_API_LEVEL 7 CACHE STRING "" FORCE)
set(Seastar_APPS ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
# System libraries dependencies
@@ -45,6 +56,8 @@ find_package(xxHash REQUIRED)
set(scylla_gen_build_dir "${CMAKE_BINARY_DIR}/gen")
file(MAKE_DIRECTORY "${scylla_gen_build_dir}")
include(add_version_library)
generate_scylla_version()
add_library(scylla-main STATIC)
target_sources(scylla-main
@@ -65,7 +78,6 @@ target_sources(scylla-main
debug.cc
init.cc
keys.cc
message/messaging_service.cc
multishard_mutation_query.cc
mutation_query.cc
partition_slice_builder.cc
@@ -111,8 +123,10 @@ add_subdirectory(index)
add_subdirectory(interface)
add_subdirectory(lang)
add_subdirectory(locator)
add_subdirectory(message)
add_subdirectory(mutation)
add_subdirectory(mutation_writer)
add_subdirectory(node_ops)
add_subdirectory(readers)
add_subdirectory(redis)
add_subdirectory(replica)
@@ -130,7 +144,6 @@ add_subdirectory(tracing)
add_subdirectory(transport)
add_subdirectory(types)
add_subdirectory(utils)
include(add_version_library)
add_version_library(scylla_version
release.cc)
@@ -152,6 +165,7 @@ target_link_libraries(scylla PRIVATE
index
lang
locator
message
mutation
mutation_writer
raft
@@ -180,22 +194,8 @@ target_link_libraries(scylla PRIVATE
seastar
Boost::program_options)
# Force SHA1 build-id generation
set(default_linker_flags "-Wl,--build-id=sha1")
include(CheckLinkerFlag)
foreach(linker "lld" "gold")
set(linker_flag "-fuse-ld=${linker}")
check_linker_flag(CXX ${linker_flag} "CXX_LINKER_HAVE_${linker}")
if(CXX_LINKER_HAVE_${linker})
string(APPEND default_linker_flags " ${linker_flag}")
break()
endif()
endforeach()
set(CMAKE_EXE_LINKER_FLAGS "${default_linker_flags}" CACHE INTERNAL "")
# TODO: patch dynamic linker to match configure.py behavior
target_include_directories(scylla PRIVATE
"${CMAKE_CURRENT_SOURCE_DIR}"
"${scylla_gen_build_dir}")
add_subdirectory(dist)


@@ -7,6 +7,7 @@ Options:
-h|--help show this help message.
-o|--output-dir PATH specify destination path at which the version files are to be created.
-d|--date-stamp DATE manually set date for release parameter
-v|--verbose also print out the version number
By default, the script will attempt to parse 'version' file
in the current directory, which should contain a string of
@@ -33,6 +34,7 @@ END
)
DATE=""
PRINT_VERSION=false
while [ $# -gt 0 ]; do
opt="$1"
@@ -51,6 +53,10 @@ while [ $# -gt 0 ]; do
shift
shift
;;
-v|--verbose)
PRINT_VERSION=true
shift
;;
*)
echo "Unexpected argument found: $1"
echo
@@ -72,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.3.0-dev
VERSION=5.5.0-dev
if test -f version
then
@@ -102,7 +108,9 @@ if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
fi
fi
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"
if $PRINT_VERSION; then
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"
fi
mkdir -p "$OUTPUT_DIR"
echo "$SCYLLA_VERSION" > "$OUTPUT_DIR/SCYLLA-VERSION-FILE"
echo "$SCYLLA_RELEASE" > "$OUTPUT_DIR/SCYLLA-RELEASE-FILE"


@@ -53,7 +53,7 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::strin
if (result_set->empty()) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("User not found: {}", username)));
}
const bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
const managed_bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
if (!salted_hash) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
}


@@ -76,13 +76,16 @@ future<> controller::start_server() {
_ssg = create_smp_service_group(c).get0();
rmw_operation::set_default_write_isolation(_config.alternator_write_isolation());
executor::set_default_timeout(std::chrono::milliseconds(_config.alternator_timeout_in_ms()));
net::inet_address addr = utils::resolve(_config.alternator_address, family).get0();
auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();
auto get_timeout_in_ms = [] (const db::config& cfg) -> utils::updateable_value<uint32_t> {
return cfg.alternator_timeout_in_ms;
};
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server


@@ -6,8 +6,6 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <regex>
#include "utils/base64.hh"
#include <seastar/core/sleep.hh>
@@ -40,7 +38,6 @@
#include <seastar/json/json_elements.hh>
#include <boost/algorithm/cxx11/any_of.hpp>
#include "collection_mutation.hh"
#include "db/query_context.hh"
#include "schema/schema.hh"
#include "db/tags/extension.hh"
#include "db/tags/utils.hh"
@@ -62,7 +59,28 @@ logging::logger elogger("alternator-executor");
namespace alternator {
static future<std::vector<mutation>> create_keyspace(std::string_view keyspace_name, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, api::timestamp_type);
enum class table_status {
active = 0,
creating,
updating,
deleting
};
static sstring_view table_status_to_sstring(table_status tbl_status) {
switch(tbl_status) {
case table_status::active:
return "ACTIVE";
case table_status::creating:
return "CREATING";
case table_status::updating:
return "UPDATING";
case table_status::deleting:
return "DELETING";
}
return "UNKNOWN";
}
static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type);
static map_type attrs_type() {
static thread_local auto t = map_type_impl::get_instance(utf8_type, bytes_type, true);
@@ -90,17 +108,20 @@ json::json_return_type make_streamed(rjson::value&& value) {
// move objects to coroutine frame.
auto los = std::move(os);
auto lrs = std::move(rs);
std::exception_ptr ex;
try {
co_await rjson::print(*lrs, los);
co_await los.flush();
co_await los.close();
} catch (...) {
// at this point, we cannot really do anything. HTTP headers and return code are
// already written, and quite potentially a portion of the content data.
// just log + rethrow. It is probably better the HTTP server closes connection
// abruptly or something...
elogger.error("Unhandled exception in data streaming: {}", std::current_exception());
throw;
ex = std::current_exception();
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return;
};
@@ -190,9 +211,8 @@ static std::string lsi_name(const std::string& table_name, std::string_view inde
/** Extract table name from a request.
* Most requests expect the table's name to be listed in a "TableName" field.
* This convenience function returns the name, with appropriate validation
* and api_error in case the table name is missing or not a string, or
* doesn't pass validate_table_name().
* This convenience function returns the name, or throws api_error if the
* table name is missing or not a string.
*/
static std::optional<std::string> find_table_name(const rjson::value& request) {
const rjson::value* table_name_value = rjson::find(request, "TableName");
@@ -203,7 +223,6 @@ static std::optional<std::string> find_table_name(const rjson::value& request) {
throw api_error::validation("Non-string TableName field in request");
}
std::string table_name = table_name_value->GetString();
validate_table_name(table_name);
return table_name;
}
@@ -230,6 +249,10 @@ schema_ptr executor::find_table(service::storage_proxy& proxy, const rjson::valu
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + sstring(*table_name), *table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name.value());
throw api_error::resource_not_found(
format("Requested resource not found: Table: {} not found", *table_name));
}
@@ -280,6 +303,10 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
try {
return { proxy.data_dictionary().find_schema(sstring(internal_ks_name), sstring(internal_table_name)), type };
} catch (data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(
format("Requested resource not found: Internal table: {}.{} not found", internal_ks_name, internal_table_name));
}
@@ -415,6 +442,91 @@ static rjson::value generate_arn_for_index(const schema& schema, std::string_vie
schema.ks_name(), schema.cf_name(), index_name));
}
static rjson::value fill_table_description(schema_ptr schema, table_status tbl_status, service::storage_proxy const& proxy)
{
rjson::value table_description = rjson::empty_object();
rjson::add(table_description, "TableName", rjson::from_string(schema->cf_name()));
// FIXME: take the table's creation time, not the current time!
size_t creation_date_seconds = std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch()).count();
// FIXME: In DynamoDB the CreateTable implementation is asynchronous, and
// the table may be in "Creating" state until creating is finished.
// We don't currently do this in Alternator - instead CreateTable waits
// until the table is really available. So DescribeTable returns either
// ACTIVE or doesn't exist at all (and DescribeTable returns an error).
// The states CREATING and UPDATING are not currently returned.
rjson::add(table_description, "TableStatus", rjson::from_string(table_status_to_sstring(tbl_status)));
rjson::add(table_description, "TableArn", generate_arn_for_table(*schema));
rjson::add(table_description, "TableId", rjson::from_string(schema->id().to_sstring()));
// FIXME: Instead of hardcoding, we should take into account which mode was chosen
// when the table was created. But, Spark jobs expect something to be returned
// and PAY_PER_REQUEST seems closer to reality than PROVISIONED.
rjson::add(table_description, "BillingModeSummary", rjson::empty_object());
rjson::add(table_description["BillingModeSummary"], "BillingMode", "PAY_PER_REQUEST");
rjson::add(table_description["BillingModeSummary"], "LastUpdateToPayPerRequestDateTime", rjson::value(creation_date_seconds));
// In PAY_PER_REQUEST billing mode, provisioned capacity should return 0
rjson::add(table_description, "ProvisionedThroughput", rjson::empty_object());
rjson::add(table_description["ProvisionedThroughput"], "ReadCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "WriteCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "NumberOfDecreasesToday", 0);
data_dictionary::table t = proxy.data_dictionary().find_column_family(schema);
if (tbl_status != table_status::deleting) {
rjson::add(table_description, "CreationDateTime", rjson::value(creation_date_seconds));
std::unordered_map<std::string,std::string> key_attribute_types;
// Add base table's KeySchema and collect types for AttributeDefinitions:
executor::describe_key_schema(table_description, *schema, key_attribute_types);
if (!t.views().empty()) {
rjson::value gsi_array = rjson::empty_array();
rjson::value lsi_array = rjson::empty_array();
for (const view_ptr& vptr : t.views()) {
rjson::value view_entry = rjson::empty_object();
const sstring& cf_name = vptr->cf_name();
size_t delim_it = cf_name.find(':');
if (delim_it == sstring::npos) {
elogger.error("Invalid internal index table name: {}", cf_name);
continue;
}
sstring index_name = cf_name.substr(delim_it + 1);
rjson::add(view_entry, "IndexName", rjson::from_string(index_name));
rjson::add(view_entry, "IndexArn", generate_arn_for_index(*schema, index_name));
// Add the index's KeySchema and collect types for AttributeDefinitions:
executor::describe_key_schema(view_entry, *vptr, key_attribute_types);
// Add projection type
rjson::value projection = rjson::empty_object();
rjson::add(projection, "ProjectionType", "ALL");
// FIXME: we have to get ProjectionType from the schema when it is added
rjson::add(view_entry, "Projection", std::move(projection));
// Local secondary indexes are marked by an extra '!' sign occurring before the ':' delimiter
rjson::value& index_array = (delim_it > 1 && cf_name[delim_it-1] == '!') ? lsi_array : gsi_array;
rjson::push_back(index_array, std::move(view_entry));
}
if (!lsi_array.Empty()) {
rjson::add(table_description, "LocalSecondaryIndexes", std::move(lsi_array));
}
if (!gsi_array.Empty()) {
rjson::add(table_description, "GlobalSecondaryIndexes", std::move(gsi_array));
}
}
// Use map built by describe_key_schema() for base and indexes to produce
// AttributeDefinitions for all key columns:
rjson::value attribute_definitions = rjson::empty_array();
for (auto& type : key_attribute_types) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(type.first));
rjson::add(key, "AttributeType", rjson::from_string(type.second));
rjson::push_back(attribute_definitions, std::move(key));
}
rjson::add(table_description, "AttributeDefinitions", std::move(attribute_definitions));
}
executor::supplement_table_stream_info(table_description, *schema, proxy);
// FIXME: still missing some response fields (issue #5026)
return table_description;
}
bool is_alternator_keyspace(const sstring& ks_name) {
return ks_name.find(executor::KEYSPACE_NAME_PREFIX) == 0;
}
@@ -431,85 +543,7 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
rjson::value table_description = rjson::empty_object();
rjson::add(table_description, "TableName", rjson::from_string(schema->cf_name()));
// FIXME: take the table's creation time, not the current time!
size_t creation_date_seconds = std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch()).count();
rjson::add(table_description, "CreationDateTime", rjson::value(creation_date_seconds));
// FIXME: In DynamoDB the CreateTable implementation is asynchronous, and
// the table may be in "Creating" state until creating is finished.
// We don't currently do this in Alternator - instead CreateTable waits
// until the table is really available. So DescribeTable returns either
// ACTIVE or doesn't exist at all (and DescribeTable returns an error).
// The other states (CREATING, UPDATING, DELETING) are not currently
// returned.
rjson::add(table_description, "TableStatus", "ACTIVE");
rjson::add(table_description, "TableArn", generate_arn_for_table(*schema));
rjson::add(table_description, "TableId", rjson::from_string(schema->id().to_sstring()));
// FIXME: Instead of hardcoding, we should take into account which mode was chosen
// when the table was created. But, Spark jobs expect something to be returned
// and PAY_PER_REQUEST seems closer to reality than PROVISIONED.
rjson::add(table_description, "BillingModeSummary", rjson::empty_object());
rjson::add(table_description["BillingModeSummary"], "BillingMode", "PAY_PER_REQUEST");
rjson::add(table_description["BillingModeSummary"], "LastUpdateToPayPerRequestDateTime", rjson::value(creation_date_seconds));
// In PAY_PER_REQUEST billing mode, provisioned capacity should return 0
rjson::add(table_description, "ProvisionedThroughput", rjson::empty_object());
rjson::add(table_description["ProvisionedThroughput"], "ReadCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "WriteCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "NumberOfDecreasesToday", 0);
std::unordered_map<std::string,std::string> key_attribute_types;
// Add base table's KeySchema and collect types for AttributeDefinitions:
describe_key_schema(table_description, *schema, key_attribute_types);
data_dictionary::table t = _proxy.data_dictionary().find_column_family(schema);
if (!t.views().empty()) {
rjson::value gsi_array = rjson::empty_array();
rjson::value lsi_array = rjson::empty_array();
for (const view_ptr& vptr : t.views()) {
rjson::value view_entry = rjson::empty_object();
const sstring& cf_name = vptr->cf_name();
size_t delim_it = cf_name.find(':');
if (delim_it == sstring::npos) {
elogger.error("Invalid internal index table name: {}", cf_name);
continue;
}
sstring index_name = cf_name.substr(delim_it + 1);
rjson::add(view_entry, "IndexName", rjson::from_string(index_name));
rjson::add(view_entry, "IndexArn", generate_arn_for_index(*schema, index_name));
// Add the index's KeySchema and collect types for AttributeDefinitions:
describe_key_schema(view_entry, *vptr, key_attribute_types);
// Add projection type
rjson::value projection = rjson::empty_object();
rjson::add(projection, "ProjectionType", "ALL");
// FIXME: we have to get ProjectionType from the schema when it is added
rjson::add(view_entry, "Projection", std::move(projection));
// Local secondary indexes are marked by an extra '!' sign occurring before the ':' delimiter
rjson::value& index_array = (delim_it > 1 && cf_name[delim_it-1] == '!') ? lsi_array : gsi_array;
rjson::push_back(index_array, std::move(view_entry));
}
if (!lsi_array.Empty()) {
rjson::add(table_description, "LocalSecondaryIndexes", std::move(lsi_array));
}
if (!gsi_array.Empty()) {
rjson::add(table_description, "GlobalSecondaryIndexes", std::move(gsi_array));
}
}
// Use map built by describe_key_schema() for base and indexes to produce
// AttributeDefinitions for all key columns:
rjson::value attribute_definitions = rjson::empty_array();
for (auto& type : key_attribute_types) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(type.first));
rjson::add(key, "AttributeType", rjson::from_string(type.second));
rjson::push_back(attribute_definitions, std::move(key));
}
rjson::add(table_description, "AttributeDefinitions", std::move(attribute_definitions));
supplement_table_stream_info(table_description, *schema, _proxy);
// FIXME: still missing some response fields (issue #5026)
rjson::value table_description = fill_table_description(schema, table_status::active, _proxy);
rjson::value response = rjson::empty_object();
rjson::add(response, "Table", std::move(table_description));
elogger.trace("returning {}", response);
@@ -521,10 +555,17 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
elogger.trace("Deleting table {}", request);
std::string table_name = get_table_name(request);
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
std::string keyspace_name = executor::KEYSPACE_NAME_PREFIX + table_name;
tracing::add_table_name(trace_state, keyspace_name, table_name);
auto& p = _proxy.container();
schema_ptr schema = get_table(_proxy, request);
rjson::value table_description = fill_table_description(schema, table_status::deleting, _proxy);
co_await _mm.container().invoke_on(0, [&] (service::migration_manager& mm) -> future<> {
// FIXME: the following needs to be in a loop. If mm.announce() below
// fails, we need to retry the whole thing.
@@ -534,18 +575,14 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
throw api_error::resource_not_found(format("Requested resource not found: Table: {} not found", table_name));
}
auto m = co_await mm.prepare_column_family_drop_announcement(keyspace_name, table_name, group0_guard.write_timestamp(), service::migration_manager::drop_views::yes);
auto m2 = mm.prepare_keyspace_drop_announcement(keyspace_name, group0_guard.write_timestamp());
auto m = co_await service::prepare_column_family_drop_announcement(_proxy, keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
auto m2 = co_await service::prepare_keyspace_drop_announcement(_proxy.local_db(), keyspace_name, group0_guard.write_timestamp());
std::move(m2.begin(), m2.end(), std::back_inserter(m));
co_await mm.announce(std::move(m), std::move(group0_guard));
co_await mm.announce(std::move(m), std::move(group0_guard), format("alternator-executor: delete {} table", table_name));
});
// FIXME: need more attributes?
rjson::value table_description = rjson::empty_object();
rjson::add(table_description, "TableName", rjson::from_string(table_name));
rjson::add(table_description, "TableStatus", "DELETING");
rjson::value response = rjson::empty_object();
rjson::add(response, "TableDescription", std::move(table_description));
elogger.trace("returning {}", response);
@@ -830,17 +867,6 @@ future<executor::request_return_type> executor::list_tags_of_resource(client_sta
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
}
static future<> wait_for_schema_agreement(service::migration_manager& mm, db::timeout_clock::time_point deadline) {
return do_until([&mm, deadline] {
if (db::timeout_clock::now() > deadline) {
throw std::runtime_error("Unable to reach schema agreement");
}
return mm.have_schema_agreement();
}, [] {
return seastar::sleep(500ms);
});
}
static void verify_billing_mode(const rjson::value& request) {
// Alternator does not yet support billing or throughput limitations, but
// let's verify that BillingMode is at least legal.
@@ -858,6 +884,38 @@ static void verify_billing_mode(const rjson::value& request) {
}
}
// Validate that an AttributeDefinitions parameter in CreateTable is valid,
// and throw a user-facing api_error::validation if it's not.
// In particular, verify that the same AttributeName doesn't appear more than
// once (Issue #13870).
static void validate_attribute_definitions(const rjson::value& attribute_definitions) {
if (!attribute_definitions.IsArray()) {
throw api_error::validation("AttributeDefinitions must be an array");
}
std::unordered_set<std::string> seen_attribute_names;
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value* attribute_name = rjson::find(*it, "AttributeName");
if (!attribute_name) {
throw api_error::validation("AttributeName missing in AttributeDefinitions");
}
if (!attribute_name->IsString()) {
throw api_error::validation("AttributeName in AttributeDefinitions must be a string");
}
auto [it2, added] = seen_attribute_names.emplace(rjson::to_string_view(*attribute_name));
if (!added) {
throw api_error::validation(format("Duplicate AttributeName={} in AttributeDefinitions",
rjson::to_string_view(*attribute_name)));
}
const rjson::value* attribute_type = rjson::find(*it, "AttributeType");
if (!attribute_type) {
throw api_error::validation("AttributeType missing in AttributeDefinitions");
}
if (!attribute_type->IsString()) {
throw api_error::validation("AttributeType in AttributeDefinitions must be a string");
}
}
}
static future<executor::request_return_type> create_table_on_shard0(tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper) {
assert(this_shard_id() == 0);
@@ -866,11 +924,14 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// (e.g., verify that this table doesn't already exist) - we can only
// do this further down - after taking group0_guard.
std::string table_name = get_table_name(request);
validate_table_name(table_name);
if (table_name.find(executor::INTERNAL_TABLE_PREFIX) == 0) {
co_return api_error::validation(format("Prefix {} is reserved for accessing internal tables", executor::INTERNAL_TABLE_PREFIX));
}
std::string keyspace_name = executor::KEYSPACE_NAME_PREFIX + table_name;
const rjson::value& attribute_definitions = request["AttributeDefinitions"];
validate_attribute_definitions(attribute_definitions);
tracing::add_table_name(trace_state, keyspace_name, table_name);
@@ -1060,8 +1121,9 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
std::vector<mutation> schema_mutations;
auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts);
try {
schema_mutations = co_await create_keyspace(keyspace_name, sp, mm, gossiper, ts);
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
if (sp.data_dictionary().has_schema(keyspace_name, table_name)) {
co_return api_error::resource_in_use(format("Table {} already exists", table_name));
@@ -1071,22 +1133,14 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// This should never happen, the ID is supposed to be unique
co_return api_error::internal(format("Table with ID {} already exists", schema->id()));
}
db::schema_tables::add_table_or_view_to_schema_mutation(schema, ts, true, schema_mutations);
// we must call before_create_column_family callbacks - which allow
// listeners to modify our schema_mutations. For example, CDC may add
// another table (the CDC log table) to the same keyspace.
// Unfortunately the convention is that this callback must be run in
// a Seastar thread.
co_await seastar::async([&] {
mm.get_notifier().before_create_column_family(*schema, schema_mutations, ts);
});
co_await service::prepare_new_column_family_announcement(schema_mutations, sp, *ksm, schema, ts);
for (schema_builder& view_builder : view_builders) {
db::schema_tables::add_table_or_view_to_schema_mutation(
view_ptr(view_builder.build()), ts, true, schema_mutations);
}
co_await mm.announce(std::move(schema_mutations), std::move(group0_guard));
co_await mm.announce(std::move(schema_mutations), std::move(group0_guard), format("alternator-executor: create {} table", table_name));
co_await wait_for_schema_agreement(mm, db::timeout_clock::now() + 10s);
co_await mm.wait_for_schema_agreement(sp.local_db(), db::timeout_clock::now() + 10s, nullptr);
rjson::value status = rjson::empty_object();
executor::supplement_table_info(request, *schema, sp);
rjson::add(status, "TableDescription", std::move(request));
@@ -1149,11 +1203,11 @@ future<executor::request_return_type> executor::update_table(client_state& clien
auto schema = builder.build();
auto m = co_await mm.prepare_column_family_update_announcement(schema, false, std::vector<view_ptr>(), group0_guard.write_timestamp());
auto m = co_await service::prepare_column_family_update_announcement(p.local(), schema, false, std::vector<view_ptr>(), group0_guard.write_timestamp());
co_await mm.announce(std::move(m), std::move(group0_guard));
co_await mm.announce(std::move(m), std::move(group0_guard), format("alternator-executor: update {} table", tab->cf_name()));
co_await wait_for_schema_agreement(mm, db::timeout_clock::now() + 10s);
co_await mm.wait_for_schema_agreement(p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
rjson::value status = rjson::empty_object();
supplement_table_info(request, *schema, p.local());
@@ -1365,14 +1419,11 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) co
// The DynamoDB API doesn't let the client control the server's timeout, so
// we have a global default_timeout() for Alternator requests. The value of
// s_default_timeout is overwritten in alternator::controller::start_server()
// s_default_timeout_ms is overwritten in alternator::controller::start_server()
// based on the "alternator_timeout_in_ms" configuration parameter.
db::timeout_clock::duration executor::s_default_timeout = 10s;
void executor::set_default_timeout(db::timeout_clock::duration timeout) {
s_default_timeout = timeout;
}
thread_local utils::updateable_value<uint32_t> executor::s_default_timeout_in_ms{10'000};
db::timeout_clock::time_point executor::default_timeout() {
return db::timeout_clock::now() + s_default_timeout;
return db::timeout_clock::now() + std::chrono::milliseconds(s_default_timeout_in_ms);
}
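The hunk above replaces the static `s_default_timeout` duration with a `utils::updateable_value<uint32_t>` holding milliseconds, so `default_timeout()` always tracks the live `alternator_timeout_in_ms` configuration. A minimal sketch of that updateable-value pattern, with assumed (simplified, non-Seastar, single-threaded) semantics — the real `utils::updateable_value` is per-shard and integrates with the config subsystem:

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

// A source owns the current value; observers share it and always read
// the latest update, with no re-registration needed on config reload.
template <typename T>
class updateable_value_source {
    std::shared_ptr<T> _value;
public:
    explicit updateable_value_source(T initial)
        : _value(std::make_shared<T>(initial)) {}
    void set(T v) { *_value = v; }            // config-reload path
    std::shared_ptr<const T> observe() const { return _value; }
};

template <typename T>
class updateable_value {
    std::shared_ptr<const T> _value;
public:
    explicit updateable_value(const updateable_value_source<T>& src)
        : _value(src.observe()) {}
    T operator()() const { return *_value; }  // always the live value
};

// Recompute the deadline from the live millisecond setting, mirroring
// the rewritten executor::default_timeout() above.
inline std::chrono::steady_clock::time_point
default_timeout(const updateable_value<uint32_t>& ms) {
    return std::chrono::steady_clock::now() + std::chrono::milliseconds(ms());
}
```

This is also why the new `executor::stop()` reassigns `s_default_timeout_in_ms` from its own current value: it detaches the observer from the source while keeping the last value readable.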
static future<std::unique_ptr<rjson::value>> get_previous_item(
@@ -1592,7 +1643,7 @@ static parsed::condition_expression get_parsed_condition_expression(rjson::value
throw api_error::validation("ConditionExpression must not be empty");
}
try {
return parse_condition_expression(rjson::to_string_view(*condition_expression));
return parse_condition_expression(rjson::to_string_view(*condition_expression), "ConditionExpression");
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
@@ -1607,17 +1658,16 @@ static bool check_needs_read_before_write(const parsed::condition_expression& co
// Fail the expression if it has unused attribute names or values. This is
// how DynamoDB behaves, so we do too.
static void verify_all_are_used(const rjson::value& req, const char* field,
const std::unordered_set<std::string>& used, const char* operation) {
const rjson::value* attribute_names = rjson::find(req, field);
if (!attribute_names) {
static void verify_all_are_used(const rjson::value* field,
const std::unordered_set<std::string>& used, const char* field_name, const char* operation) {
if (!field) {
return;
}
for (auto it = attribute_names->MemberBegin(); it != attribute_names->MemberEnd(); ++it) {
for (auto it = field->MemberBegin(); it != field->MemberEnd(); ++it) {
if (!used.contains(it->name.GetString())) {
throw api_error::validation(
format("{} has spurious '{}', not used in {}",
field, it->name.GetString(), operation));
field_name, it->name.GetString(), operation));
}
}
}
@@ -1644,8 +1694,8 @@ public:
resolve_condition_expression(_condition_expression,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
verify_all_are_used(_request, "ExpressionAttributeNames", used_attribute_names, "PutItem");
verify_all_are_used(_request, "ExpressionAttributeValues", used_attribute_values, "PutItem");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "PutItem");
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "PutItem");
} else {
if (expression_attribute_names) {
throw api_error::validation("ExpressionAttributeNames cannot be used without ConditionExpression");
@@ -1729,8 +1779,8 @@ public:
resolve_condition_expression(_condition_expression,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
verify_all_are_used(_request, "ExpressionAttributeNames", used_attribute_names, "DeleteItem");
verify_all_are_used(_request, "ExpressionAttributeValues", used_attribute_values, "DeleteItem");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "DeleteItem");
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "DeleteItem");
} else {
if (expression_attribute_names) {
throw api_error::validation("ExpressionAttributeNames cannot be used without ConditionExpression");
@@ -2300,14 +2350,14 @@ static std::optional<attrs_to_get> calculate_attrs_to_get(const rjson::value& re
* as before.
*/
void executor::describe_single_item(const cql3::selection::selection& selection,
const std::vector<bytes_opt>& result_row,
const std::vector<managed_bytes_opt>& result_row,
const std::optional<attrs_to_get>& attrs_to_get,
rjson::value& item,
bool include_all_embedded_attributes)
{
const auto& columns = selection.get_columns();
auto column_it = columns.begin();
for (const bytes_opt& cell : result_row) {
for (const managed_bytes_opt& cell : result_row) {
std::string column_name = (*column_it)->name_as_text();
if (cell && column_name != executor::ATTRS_COLUMN_NAME) {
if (!attrs_to_get || attrs_to_get->contains(column_name)) {
@@ -2315,7 +2365,9 @@ void executor::describe_single_item(const cql3::selection::selection& selection,
// so add() makes sense
rjson::add_with_string_name(item, column_name, rjson::empty_object());
rjson::value& field = item[column_name.c_str()];
rjson::add_with_string_name(field, type_to_string((*column_it)->type), json_key_column_value(*cell, **column_it));
cell->with_linearized([&] (bytes_view linearized_cell) {
rjson::add_with_string_name(field, type_to_string((*column_it)->type), json_key_column_value(linearized_cell, **column_it));
});
}
} else if (cell) {
auto deserialized = attrs_type()->deserialize(*cell);
@@ -2371,21 +2423,22 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
return item;
}
std::vector<rjson::value> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get) {
cql3::selection::result_set_builder builder(selection, gc_clock::now());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
describe_single_item(selection, result_row, attrs_to_get, item);
describe_single_item(*selection, result_row, *attrs_to_get, item);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
return ret;
co_return ret;
}
static bool check_needs_read_before_write(const parsed::value& v) {
@@ -2500,8 +2553,8 @@ update_item_operation::update_item_operation(service::storage_proxy& proxy, rjso
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
verify_all_are_used(_request, "ExpressionAttributeNames", used_attribute_names, "UpdateItem");
verify_all_are_used(_request, "ExpressionAttributeValues", used_attribute_values, "UpdateItem");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "UpdateItem");
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "UpdateItem");
// DynamoDB forbids having both old-style AttributeUpdates or Expected
// and new-style UpdateExpression or ConditionExpression in the same request
@@ -3110,7 +3163,8 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
std::unordered_set<std::string> used_attribute_names;
auto attrs_to_get = calculate_attrs_to_get(request, used_attribute_names);
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "GetItem");
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state)).then(
@@ -3221,7 +3275,8 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rs.cl = get_read_consistency(it->value);
std::unordered_set<std::string> used_attribute_names;
rs.attrs_to_get = ::make_shared<const std::optional<attrs_to_get>>(calculate_attrs_to_get(it->value, used_attribute_names));
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "GetItem");
const rjson::value* expression_attribute_names = rjson::find(it->value, "ExpressionAttributeNames");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
auto& keys = (it->value)["Keys"];
for (rjson::value& key : keys.GetArray()) {
rs.add(key);
@@ -3257,8 +3312,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
std::vector<rjson::value> jsons = describe_multi_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
return make_ready_future<std::vector<rjson::value>>(std::move(jsons));
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get));
});
response_futures.push_back(std::move(f));
}
@@ -3391,7 +3445,7 @@ filter::filter(const rjson::value& request, request_type rt,
throw api_error::validation("Cannot use both old-style and new-style parameters in same request: FilterExpression and AttributesToGet");
}
try {
auto parsed = parse_condition_expression(rjson::to_string_view(*expression));
auto parsed = parse_condition_expression(rjson::to_string_view(*expression), "FilterExpression");
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
const rjson::value* expression_attribute_values = rjson::find(request, "ExpressionAttributeValues");
resolve_condition_expression(parsed,
@@ -3498,7 +3552,7 @@ public:
_column_it = _columns.begin();
}
void accept_value(const std::optional<query::result_bytes_view>& result_bytes_view) {
void accept_value(managed_bytes_view_opt result_bytes_view) {
if (!result_bytes_view) {
++_column_it;
return;
@@ -3795,8 +3849,10 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
// optimized the filtering by modifying partition_ranges and/or
// ck_bounds. We haven't done this optimization yet.
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "Scan");
verify_all_are_used(request, "ExpressionAttributeValues", used_attribute_values, "Scan");
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
const rjson::value* expression_attribute_values = rjson::find(request, "ExpressionAttributeValues");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "Scan");
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "Scan");
return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
std::move(filter), query::partition_slice::option_set(), client_state, _stats.cql_stats, trace_state, std::move(permit));
@@ -4017,7 +4073,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
// sort-key range.
parsed::condition_expression p;
try {
p = parse_condition_expression(rjson::to_string_view(expression));
p = parse_condition_expression(rjson::to_string_view(expression), "KeyConditionExpression");
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
@@ -4237,13 +4293,17 @@ future<executor::request_return_type> executor::query(client_state& client_state
throw api_error::validation("Query must have one of "
"KeyConditions or KeyConditionExpression");
}
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
const rjson::value* expression_attribute_values = rjson::find(request, "ExpressionAttributeValues");
// exactly one of key_conditions or key_condition_expression
auto [partition_ranges, ck_bounds] = key_conditions
? calculate_bounds_conditions(schema, *key_conditions)
: calculate_bounds_condition_expression(schema, *key_condition_expression,
rjson::find(request, "ExpressionAttributeValues"),
expression_attribute_values,
used_attribute_values,
rjson::find(request, "ExpressionAttributeNames"),
expression_attribute_names,
used_attribute_names);
filter filter(request, filter::request_type::QUERY,
@@ -4270,8 +4330,8 @@ future<executor::request_return_type> executor::query(client_state& client_state
select_type select = parse_select(request, table_type);
auto attrs_to_get = calculate_attrs_to_get(request, used_attribute_names, select);
verify_all_are_used(request, "ExpressionAttributeValues", used_attribute_values, "Query");
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "Query");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "Query");
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "Query");
query::partition_slice::option_set opts;
opts.set_if<query::partition_slice::option::reversed>(!forward);
return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
@@ -4332,6 +4392,17 @@ future<executor::request_return_type> executor::list_tables(client_state& client
future<executor::request_return_type> executor::describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header) {
_stats.api_operations.describe_endpoints++;
// The alternator_describe_endpoints configuration can be used to disable
// the DescribeEndpoints operation, or set it to return a fixed string.
std::string override = _proxy.data_dictionary().get_config().alternator_describe_endpoints();
if (!override.empty()) {
if (override == "disabled") {
_stats.unsupported_operations++;
return make_ready_future<request_return_type>(api_error::unknown_operation(
"DescribeEndpoints disabled by configuration (alternator_describe_endpoints=disabled)"));
}
host_header = std::move(override);
}
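The new `alternator_describe_endpoints` handling above is a tri-state: an empty setting keeps the default of echoing the client's Host header back, the literal `"disabled"` rejects the operation, and any other string is returned as a fixed endpoint. A sketch of that logic in isolation (error type simplified to `std::runtime_error` in place of `api_error::unknown_operation`):

```cpp
#include <stdexcept>
#include <string>

// Resolve the endpoint DescribeEndpoints should report, given the
// configuration override and the request's Host header.
std::string resolve_endpoint(const std::string& override_cfg,
                             std::string host_header) {
    if (!override_cfg.empty()) {
        if (override_cfg == "disabled") {
            throw std::runtime_error(
                "DescribeEndpoints disabled by configuration "
                "(alternator_describe_endpoints=disabled)");
        }
        host_header = override_cfg; // fixed endpoint from configuration
    }
    return host_header;
}
```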
rjson::value response = rjson::empty_object();
// Without having any configuration parameter to say otherwise, we tell
// the user to return to the same endpoint they used to reach us. The only
@@ -4369,6 +4440,10 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
try {
schema = _proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + table_name, table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::table_not_found(
format("Table {} not found", table_name));
}
@@ -4382,25 +4457,23 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
co_return make_jsonable(std::move(response));
}
// Create the keyspace in which we put the alternator table, if it doesn't
// already exist.
// Create the metadata for the keyspace in which we put the alternator
// table if it doesn't already exist.
// Currently, we automatically configure the keyspace based on the number
// of nodes in the cluster: A cluster with 3 or more live nodes gets RF=3.
// A smaller cluster (presumably, a test only), gets RF=1. The user may
// manually create the keyspace to override this predefined behavior.
static future<std::vector<mutation>> create_keyspace(std::string_view keyspace_name, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, api::timestamp_type ts) {
sstring keyspace_name_str(keyspace_name);
int endpoint_count = gossiper.get_endpoint_states().size();
static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type ts) {
int endpoint_count = gossiper.num_endpoints();
int rf = 3;
if (endpoint_count < rf) {
rf = 1;
elogger.warn("Creating keyspace '{}' for Alternator with unsafe RF={} because cluster only has {} nodes.",
keyspace_name_str, rf, endpoint_count);
keyspace_name, rf, endpoint_count);
}
auto opts = get_network_topology_options(sp, gossiper, rf);
auto ksm = keyspace_metadata::new_keyspace(keyspace_name_str, "org.apache.cassandra.locator.NetworkTopologyStrategy", std::move(opts), true);
co_return mm.prepare_new_keyspace_announcement(ksm, ts);
return keyspace_metadata::new_keyspace(keyspace_name, "org.apache.cassandra.locator.NetworkTopologyStrategy", std::move(opts), true);
}
future<> executor::start() {


@@ -22,6 +22,7 @@
#include "alternator/error.hh"
#include "stats.hh"
#include "utils/rjson.hh"
#include "utils/updateable_value.hh"
namespace db {
class system_distributed_keyspace;
@@ -170,8 +171,16 @@ public:
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
executor(gms::gossiper& gossiper, service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, cdc::metadata& cdc_metadata, smp_service_group ssg)
: _gossiper(gossiper), _proxy(proxy), _mm(mm), _sdks(sdks), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
executor(gms::gossiper& gossiper,
service::storage_proxy& proxy,
service::migration_manager& mm,
db::system_distributed_keyspace& sdks,
cdc::metadata& cdc_metadata,
smp_service_group ssg,
utils::updateable_value<uint32_t> default_timeout_in_ms)
: _gossiper(gossiper), _proxy(proxy), _mm(mm), _sdks(sdks), _cdc_metadata(cdc_metadata), _ssg(ssg) {
s_default_timeout_in_ms = std::move(default_timeout_in_ms);
}
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -199,13 +208,16 @@ public:
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);
future<> start();
future<> stop() { return make_ready_future<>(); }
future<> stop() {
// disconnect from the value source, but keep the value unchanged.
s_default_timeout_in_ms = utils::updateable_value<uint32_t>{s_default_timeout_in_ms()};
return make_ready_future<>();
}
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
static void set_default_timeout(db::timeout_clock::duration timeout);
private:
static db::timeout_clock::duration s_default_timeout;
static thread_local utils::updateable_value<uint32_t> s_default_timeout_in_ms;
public:
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
@@ -213,30 +225,31 @@ private:
friend class rmw_operation;
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);
static std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&);
static std::vector<rjson::value> describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get);
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
bool = false);
static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};
// is_big() checks approximately if the given JSON value is "bigger" than


@@ -29,7 +29,7 @@
namespace alternator {
template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
Result do_with_parser(std::string_view input, Func&& f) {
static Result do_with_parser(std::string_view input, Func&& f) {
expressionsLexer::InputStreamType input_stream{
reinterpret_cast<const ANTLR_UINT8*>(input.data()),
ANTLR_ENC_UTF8,
@@ -43,31 +43,41 @@ Result do_with_parser(std::string_view input, Func&& f) {
return result;
}
template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
static Result parse(const char* input_name, std::string_view input, Func&& f) {
if (input.length() > 4096) {
throw expressions_syntax_error(format("{} expression size {} exceeds allowed maximum 4096.",
input_name, input.length()));
}
try {
return do_with_parser(input, f);
} catch (expressions_syntax_error& e) {
// If already an expressions_syntax_error, don't print the type's
// name (it's just ugly), just the message.
// TODO: displayRecognitionError could set a position inside the
// expressions_syntax_error in throws, and we could use it here to
// mark the broken position in 'input'.
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
input_name, input, e.what()));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
input_name, input, std::current_exception()));
}
}
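The new `parse()` helper above centralizes what the three `parse_*_expression()` functions previously duplicated: enforce the 4096-character input limit, run the parser, and rewrap any failure into one `expressions_syntax_error` naming the expression kind. A sketch with the ANTLR machinery replaced by an arbitrary callable (names and formatting here are illustrative, not the exact Scylla messages):

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

struct expressions_syntax_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Wrap a parser invocation with a length limit and uniform error text.
template <typename Func>
auto parse(const char* input_name, std::string_view input, Func&& f) {
    if (input.length() > 4096) {
        throw expressions_syntax_error(
            std::string(input_name) + " expression size " +
            std::to_string(input.length()) + " exceeds allowed maximum 4096.");
    }
    try {
        return f(input);
    } catch (const expressions_syntax_error& e) {
        // Already a syntax error: keep just the message, not the type name.
        throw expressions_syntax_error(
            std::string("Failed parsing ") + input_name + " '" +
            std::string(input) + "': " + e.what());
    }
}
```

Passing the caller's field name into `parse_condition_expression()` is what lets the same grammar rule report itself as "ConditionExpression", "FilterExpression", or "KeyConditionExpression" depending on the call site.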
parsed::update_expression
parse_update_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::update_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing UpdateExpression '{}': {}", query, std::current_exception()));
}
return parse("UpdateExpression", query, std::mem_fn(&expressionsParser::update_expression));
}
std::vector<parsed::path>
parse_projection_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::projection_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing ProjectionExpression '{}': {}", query, std::current_exception()));
}
return parse("ProjectionExpression", query, std::mem_fn(&expressionsParser::projection_expression));
}
parsed::condition_expression
parse_condition_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::condition_expression));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing ConditionExpression '{}': {}", query, std::current_exception()));
}
parse_condition_expression(std::string_view query, const char* caller) {
return parse(caller, query, std::mem_fn(&expressionsParser::condition_expression));
}
namespace parsed {
@@ -418,9 +428,14 @@ void for_condition_expression_on(const parsed::condition_expression& ce, const n
// calculate_size() is ConditionExpression's size() function, i.e., it takes
// a JSON-encoded value and returns its "size" as defined differently for the
// different types - also as a JSON-encoded number.
// It return a JSON-encoded "null" value if this value's type has no size
// defined. Comparisons against this non-numeric value will later fail.
static rjson::value calculate_size(const rjson::value& v) {
// If the value's type (e.g. number) has no size defined, there are two cases:
// 1. If from_data (the value came directly from an attribute of the data),
// it returns a JSON-encoded "null" value. Comparisons against this
// non-numeric value will later fail, so eventually the application will
// get a ConditionalCheckFailedException.
// 2. Otherwise (the value came from a constant in the query or some other
// calculation), throw a ValidationException.
static rjson::value calculate_size(const rjson::value& v, bool from_data) {
// NOTE: If v is improperly formatted for our JSON value encoding, it
// must come from the request itself, not from the database, so it makes
// sense to throw a ValidationException if we see such a problem.
@@ -449,10 +464,12 @@ static rjson::value calculate_size(const rjson::value& v) {
throw api_error::validation(format("invalid byte string: {}", v));
}
ret = base64_decoded_len(rjson::to_string_view(it->value));
} else {
} else if (from_data) {
rjson::value json_ret = rjson::empty_object();
rjson::add(json_ret, "null", rjson::value(true));
return json_ret;
} else {
throw api_error::validation(format("Unsupported operand type {} for function size()", it->name));
}
rjson::value json_ret = rjson::empty_object();
rjson::add(json_ret, "N", rjson::from_string(std::to_string(ret)));
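The `from_data` split above means `size()` of a sizeless type (e.g. a number) behaves differently by provenance: applied to a stored attribute it yields "null" and the comparison simply fails, but applied to a query constant it is rejected up front with a validation error. A sketch with the DynamoDB value encoding reduced to a small variant (the `-1` return stands in for the JSON "null" result):

```cpp
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

using dynamo_value = std::variant<std::string, std::vector<int>, double>;

// Size of a value, or -1 ("null") / an error for sizeless types,
// depending on whether the value came from stored data.
int calculate_size(const dynamo_value& v, bool from_data) {
    if (auto* s = std::get_if<std::string>(&v)) {
        return static_cast<int>(s->size());
    }
    if (auto* l = std::get_if<std::vector<int>>(&v)) {
        return static_cast<int>(l->size());
    }
    if (from_data) {
        return -1; // "null": later comparisons against it will fail
    }
    throw std::invalid_argument(
        "Unsupported operand type for function size()");
}
```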
@@ -534,7 +551,7 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
format("{}: size() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
return calculate_size(v);
return calculate_size(v, f._parameters[0].is_path());
}
},
{"attribute_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -662,7 +679,7 @@ static rjson::value extract_path(const rjson::value* item,
// objects. But today Alternator does not validate the structure
// of nested documents before storing them, so this can happen on
// read.
throw api_error::validation(format("{}: malformed item read: {}", *item));
throw api_error::validation(format("{}: malformed item read: {}", caller, *item));
}
const char* type = v->MemberBegin()->name.GetString();
v = &(v->MemberBegin()->value);


@@ -74,7 +74,22 @@ options {
*/
@parser::context {
void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {
throw expressions_syntax_error("syntax error");
const char* err;
switch (ex->getType()) {
case antlr3::ExceptionType::FAILED_PREDICATE_EXCEPTION:
err = "expression nested too deeply";
break;
default:
err = "syntax error";
break;
}
// Alternator expressions are always single line so ex->get_line()
// is always 1, so there is no sense in printing it.
// TODO: return the position as part of the exception, so the
// caller in expressions.cc that knows the expression string can
// mark the error position in the final error message.
throw expressions_syntax_error(format("{} at char {}", err,
ex->get_charPositionInLine()));
}
}
@lexer::context {
@@ -83,6 +98,23 @@ options {
}
}
/* Unfortunately, ANTLR uses recursion - not the heap - to parse recursive
* expressions. To make things even worse, ANTLR has no way to limit the
* depth of this recursion (unlike Yacc which has YYMAXDEPTH). So deeply-
* nested expression like "(((((((((((((..." can easily crash Scylla on a
* stack overflow (see issue #14477).
*
* We are lucky that in the grammar for DynamoDB expressions (below),
* only a few specific rules can recurse, so it was fairly easy to add a
* "depth" counter to a few specific rules, and then use a predicate
* "{depth<MAX_DEPTH}?" to avoid parsing if the depth exceeds this limit,
* and throw a FAILED_PREDICATE_EXCEPTION in that case, which we will
* report to the user as an "expression nested too deeply" error.
*/
@parser::members {
static constexpr int MAX_DEPTH = 400;
}
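The depth-predicate fix above (issue #14477) can be illustrated outside ANTLR: a recursive-descent rule carries an explicit depth argument and refuses to recurse past `MAX_DEPTH`, so an input like `((((((...` fails cleanly instead of overflowing the stack. A sketch with the grammar reduced to parentheses nested around `x`:

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

constexpr int MAX_DEPTH = 400;

// Parses "x", "(x)", "((x))", ...; returns the number of characters
// consumed, throwing on malformed or too-deeply-nested input.
size_t parse_value(std::string_view in, int depth) {
    if (depth >= MAX_DEPTH) {
        throw std::runtime_error("expression nested too deeply");
    }
    if (!in.empty() && in.front() == '(') {
        size_t inner = parse_value(in.substr(1), depth + 1);
        if (1 + inner >= in.size() || in[1 + inner] != ')') {
            throw std::runtime_error("syntax error");
        }
        return inner + 2; // the parens plus the nested expression
    }
    if (in.empty() || in.front() != 'x') {
        throw std::runtime_error("syntax error");
    }
    return 1;
}
```

In the grammar this shows up as the `{depth<MAX_DEPTH}?` predicate on `value[int depth]` and `boolean_expression[int depth]`, with every recursive reference passing `depth+1` and every top-level reference passing `0`.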
/*
* Lexical analysis phase, i.e., splitting the input up to tokens.
* Lexical analyzer rules have names starting in capital letters.
@@ -155,19 +187,20 @@ path returns [parsed::path p]:
| '[' INTEGER ']' { $p.add_index(std::stoi($INTEGER.text)); }
)*;
value returns [parsed::value v]:
/* See comment above why the "depth" counter was needed here */
value[int depth] returns [parsed::value v]:
VALREF { $v.set_valref($VALREF.text); }
| path { $v.set_path($path.p); }
| NAME { $v.set_func_name($NAME.text); }
'(' x=value { $v.add_func_parameter($x.v); }
(',' x=value { $v.add_func_parameter($x.v); })*
| {depth<MAX_DEPTH}? NAME { $v.set_func_name($NAME.text); }
'(' x=value[depth+1] { $v.add_func_parameter($x.v); }
(',' x=value[depth+1] { $v.add_func_parameter($x.v); })*
')'
;
update_expression_set_rhs returns [parsed::set_rhs rhs]:
v=value { $rhs.set_value(std::move($v.v)); }
( '+' v=value { $rhs.set_plus(std::move($v.v)); }
| '-' v=value { $rhs.set_minus(std::move($v.v)); }
v=value[0] { $rhs.set_value(std::move($v.v)); }
( '+' v=value[0] { $rhs.set_plus(std::move($v.v)); }
| '-' v=value[0] { $rhs.set_minus(std::move($v.v)); }
)?
;
@@ -205,7 +238,7 @@ projection_expression returns [std::vector<parsed::path> v]:
primitive_condition returns [parsed::primitive_condition c]:
v=value { $c.add_value(std::move($v.v));
v=value[0] { $c.add_value(std::move($v.v));
$c.set_operator(parsed::primitive_condition::type::VALUE); }
( ( '=' { $c.set_operator(parsed::primitive_condition::type::EQ); }
| '<' '>' { $c.set_operator(parsed::primitive_condition::type::NE); }
@@ -214,14 +247,14 @@ primitive_condition returns [parsed::primitive_condition c]:
| '>' { $c.set_operator(parsed::primitive_condition::type::GT); }
| '>' '=' { $c.set_operator(parsed::primitive_condition::type::GE); }
)
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
| BETWEEN { $c.set_operator(parsed::primitive_condition::type::BETWEEN); }
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
AND
v=value { $c.add_value(std::move($v.v)); }
v=value[0] { $c.add_value(std::move($v.v)); }
| IN '(' { $c.set_operator(parsed::primitive_condition::type::IN); }
v=value { $c.add_value(std::move($v.v)); }
(',' v=value { $c.add_value(std::move($v.v)); })*
v=value[0] { $c.add_value(std::move($v.v)); }
(',' v=value[0] { $c.add_value(std::move($v.v)); })*
')'
)?
;
@@ -231,19 +264,20 @@ primitive_condition returns [parsed::primitive_condition c]:
// common rule prefixes, and (lack of) support for operator precedence.
// These rules could have been written more clearly using a more powerful
// parser generator - such as Yacc.
boolean_expression returns [parsed::condition_expression e]:
b=boolean_expression_1 { $e.append(std::move($b.e), '|'); }
(OR b=boolean_expression_1 { $e.append(std::move($b.e), '|'); } )*
// See comment above why the "depth" counter was needed here.
boolean_expression[int depth] returns [parsed::condition_expression e]:
b=boolean_expression_1[depth] { $e.append(std::move($b.e), '|'); }
(OR b=boolean_expression_1[depth] { $e.append(std::move($b.e), '|'); } )*
;
boolean_expression_1 returns [parsed::condition_expression e]:
b=boolean_expression_2 { $e.append(std::move($b.e), '&'); }
(AND b=boolean_expression_2 { $e.append(std::move($b.e), '&'); } )*
boolean_expression_1[int depth] returns [parsed::condition_expression e]:
b=boolean_expression_2[depth] { $e.append(std::move($b.e), '&'); }
(AND b=boolean_expression_2[depth] { $e.append(std::move($b.e), '&'); } )*
;
boolean_expression_2 returns [parsed::condition_expression e]:
boolean_expression_2[int depth] returns [parsed::condition_expression e]:
p=primitive_condition { $e.set_primitive(std::move($p.c)); }
| NOT b=boolean_expression_2 { $e = std::move($b.e); $e.apply_not(); }
| '(' b=boolean_expression ')' { $e = std::move($b.e); }
| {depth<MAX_DEPTH}? NOT b=boolean_expression_2[depth+1] { $e = std::move($b.e); $e.apply_not(); }
| {depth<MAX_DEPTH}? '(' b=boolean_expression[depth+1] ')' { $e = std::move($b.e); }
;
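As the comment above notes, the three layered rules encode operator precedence: OR binds loosest, then AND, then NOT and parentheses. A minimal Python sketch of the same layering (illustrative only; `parse_or`, `parse_and` and `parse_not` mirror `boolean_expression`, `boolean_expression_1` and `boolean_expression_2`):

```python
def parse_or(toks):   # boolean_expression: OR binds loosest
    node = parse_and(toks)
    while toks and toks[0] == "OR":
        toks.pop(0)
        node = ("or", node, parse_and(toks))
    return node


def parse_and(toks):  # boolean_expression_1: AND binds tighter than OR
    node = parse_not(toks)
    while toks and toks[0] == "AND":
        toks.pop(0)
        node = ("and", node, parse_not(toks))
    return node


def parse_not(toks):  # boolean_expression_2: NOT, parentheses, or a primitive
    if toks and toks[0] == "NOT":
        toks.pop(0)
        return ("not", parse_not(toks))
    if toks and toks[0] == "(":
        toks.pop(0)
        node = parse_or(toks)
        if not toks or toks.pop(0) != ")":
            raise ValueError("missing ')'")
        return node
    return toks.pop(0)  # stands in for primitive_condition
```

With this layering, `parse_or("a OR NOT b AND c".split())` groups as `("or", "a", ("and", ("not", "b"), "c"))` — NOT applies to `b` alone, AND binds before OR.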
condition_expression returns [parsed::condition_expression e]:
boolean_expression { e=std::move($boolean_expression.e); } EOF;
boolean_expression[0] { e=std::move($boolean_expression.e); } EOF;


@@ -28,7 +28,7 @@ public:
parsed::update_expression parse_update_expression(std::string_view query);
std::vector<parsed::path> parse_projection_expression(std::string_view query);
parsed::condition_expression parse_condition_expression(std::string_view query);
parsed::condition_expression parse_condition_expression(std::string_view query, const char* caller);
void resolve_update_expression(parsed::update_expression& ue,
const rjson::value* expression_attribute_names,


@@ -50,6 +50,115 @@ type_representation represent_type(alternator_type atype) {
return it->second;
}
// Get the magnitude and precision of a big_decimal - as these concepts are
// defined by DynamoDB - to allow us to enforce limits on those as explained
// in issue #6794. The "magnitude" of 9e123 is 123 and of -9e-123 is -123,
// the "precision" of 12.34e56 is the number of significant digits - 4.
//
// Unfortunately it turned out to be quite difficult to take a big_decimal and
// calculate its magnitude and precision from its scale() and unscaled_value().
// So in the following ugly implementation we calculate them from the string
// representation instead. We assume the number was already parsed
// successfully to a big_decimal, so it follows its syntax rules.
//
// FIXME: rewrite this function to take a big_decimal, not a string.
// Maybe a snippet like this can help:
// boost::multiprecision::cpp_int digits = boost::multiprecision::log10(num.unscaled_value().convert_to<boost::multiprecision::mpf_float_50>()).convert_to<boost::multiprecision::cpp_int>() + 1;
internal::magnitude_and_precision internal::get_magnitude_and_precision(std::string_view s) {
size_t e_or_end = s.find_first_of("eE");
std::string_view base = s.substr(0, e_or_end);
if (s[0]=='-' || s[0]=='+') {
base = base.substr(1);
}
int magnitude = 0;
int precision = 0;
size_t dot_or_end = base.find_first_of(".");
size_t nonzero = base.find_first_not_of("0");
if (dot_or_end != std::string_view::npos) {
if (nonzero == dot_or_end) {
// 0.000031 => magnitude = -5 (like 3.1e-5), precision = 2.
std::string_view fraction = base.substr(dot_or_end + 1);
size_t nonzero2 = fraction.find_first_not_of("0");
if (nonzero2 != std::string_view::npos) {
magnitude = -nonzero2 - 1;
precision = fraction.size() - nonzero2;
}
} else {
// 000123.45678 => magnitude = 2, precision = 8.
magnitude = dot_or_end - nonzero - 1;
precision = base.size() - nonzero - 1;
}
// trailing zeros don't count to precision, e.g., precision
// of 1000.0, 1.0 or 1.0000 are just 1.
size_t last_significant = base.find_last_not_of(".0");
if (last_significant == std::string_view::npos) {
precision = 0;
} else if (last_significant < dot_or_end) {
// e.g., 1000.00 reduce 5 = 7 - (0+1) - 1 from precision
precision -= base.size() - last_significant - 2;
} else {
// e.g., 1235.60 reduce 5 = 7 - (5+1) from precision
precision -= base.size() - last_significant - 1;
}
} else if (nonzero == std::string_view::npos) {
// all-zero integer 000000
magnitude = 0;
precision = 0;
} else {
magnitude = base.size() - 1 - nonzero;
precision = base.size() - nonzero;
// trailing zeros don't count to precision, e.g., precision
// of 1000 is just 1.
size_t last_significant = base.find_last_not_of("0");
if (last_significant == std::string_view::npos) {
precision = 0;
} else {
// e.g., 1000 reduce 3 = 4 - (0+1)
precision -= base.size() - last_significant - 1;
}
}
if (precision && e_or_end != std::string_view::npos) {
std::string_view exponent = s.substr(e_or_end + 1);
if (exponent.size() > 4) {
// don't even bother atoi(), exponent is too large
magnitude = exponent[0]=='-' ? -9999 : 9999;
} else {
try {
magnitude += boost::lexical_cast<int32_t>(exponent);
} catch (...) {
magnitude = 9999;
}
}
}
return magnitude_and_precision {magnitude, precision};
}
// Parse a number read from user input, validating that it has a valid
// numeric format and also in the allowed magnitude and precision ranges
// (see issue #6794). Throws an api_error::validation if the validation
// failed.
static big_decimal parse_and_validate_number(std::string_view s) {
try {
big_decimal ret(s);
auto [magnitude, precision] = internal::get_magnitude_and_precision(s);
if (magnitude > 125) {
throw api_error::validation(format("Number overflow: {}. Attempting to store a number with magnitude larger than supported range.", s));
}
if (magnitude < -130) {
throw api_error::validation(format("Number underflow: {}. Attempting to store a number with magnitude lower than supported range.", s));
}
if (precision > 38) {
throw api_error::validation(format("Number too precise: {}. Attempting to store a number with more significant digits than supported.", s));
}
return ret;
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", s));
}
}
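The FIXME above asks for a version that works on the big_decimal itself rather than the string. As a sanity check of the rules described in the comment, here is an illustrative Python sketch (not the Scylla code) using the standard decimal module, together with the three limits parse_and_validate_number enforces:

```python
from decimal import Decimal


def magnitude_and_precision(s):
    # normalize() drops trailing zeros (1000.0 -> 1E+3), leaving only
    # significant digits in as_tuple().
    d = Decimal(s).normalize()
    if d == 0:  # an all-zero input has no significant digits
        return 0, 0
    sign, digits, exponent = d.as_tuple()
    precision = len(digits)                  # number of significant digits
    magnitude = exponent + len(digits) - 1   # exponent of the leading digit
    return magnitude, precision


def validate_number(s):
    # The same magnitude/precision limits enforced above (issue #6794).
    magnitude, precision = magnitude_and_precision(s)
    if magnitude > 125:
        raise ValueError(f"Number overflow: {s}")
    if magnitude < -130:
        raise ValueError(f"Number underflow: {s}")
    if precision > 38:
        raise ValueError(f"Number too precise: {s}")
    return Decimal(s)
```

This reproduces the examples from the comment: `magnitude_and_precision("9e123")` is `(123, 1)`, `"0.000031"` gives `(-5, 2)`, and `"1000.0"` gives `(3, 1)` since trailing zeros do not count toward precision.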
struct from_json_visitor {
const rjson::value& v;
bytes_ostream& bo;
@@ -67,11 +176,7 @@ struct from_json_visitor {
bo.write(boolean_type->decompose(v.GetBool()));
}
void operator()(const decimal_type_impl& t) const {
try {
bo.write(t.from_string(rjson::to_string_view(v)));
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", v));
}
bo.write(decimal_type->decompose(parse_and_validate_number(rjson::to_string_view(v))));
}
// default
void operator()(const abstract_type& t) const {
@@ -203,6 +308,8 @@ bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column
// FIXME: it's difficult at this point to get information if value was provided
// in request or comes from the storage, for now we assume it's user's fault.
return *unwrap_bytes(value, true);
} else if (column.type == decimal_type) {
return decimal_type->decompose(parse_and_validate_number(rjson::to_string_view(value)));
} else {
return column.type->from_string(value_view);
}
@@ -295,16 +402,13 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
if (it->name != "N") {
throw api_error::validation(format("{}: expected number, found type '{}'", diagnostic, it->name));
}
try {
if (!it->value.IsString()) {
// We shouldn't reach here. Callers normally validate their input
// earlier with validate_value().
throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));
}
return big_decimal(rjson::to_string_view(it->value));
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", it->value));
if (!it->value.IsString()) {
// We shouldn't reach here. Callers normally validate their input
// earlier with validate_value().
throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));
}
big_decimal ret = parse_and_validate_number(rjson::to_string_view(it->value));
return ret;
}
std::optional<big_decimal> try_unwrap_number(const rjson::value& v) {
@@ -316,8 +420,8 @@ std::optional<big_decimal> try_unwrap_number(const rjson::value& v) {
return std::nullopt;
}
try {
return big_decimal(rjson::to_string_view(it->value));
} catch (const marshal_exception& e) {
return parse_and_validate_number(rjson::to_string_view(it->value));
} catch (api_error&) {
return std::nullopt;
}
}


@@ -94,5 +94,12 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
// Returns a null value if one of the arguments is not actually a list.
rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2);
namespace internal {
struct magnitude_and_precision {
int magnitude;
int precision;
};
magnitude_and_precision get_magnitude_and_precision(std::string_view);
}
}


@@ -424,7 +424,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
co_await client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace(trace_state, op);
tracing::trace(trace_state, "{}", op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));


@@ -1096,7 +1096,7 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
}
}
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp) {
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp) {
auto& opts = schema.cdc_options();
if (opts.enabled()) {
auto db = sp.data_dictionary();


@@ -241,7 +241,7 @@ static bool is_expired(const rjson::value& expiration_time, gc_clock::time_point
// understands it is an expiration event - not a user-initiated deletion.
static future<> expire_item(service::storage_proxy& proxy,
const service::query_state& qs,
const std::vector<bytes_opt>& row,
const std::vector<managed_bytes_opt>& row,
schema_ptr schema,
api::timestamp_type ts) {
// Prepare the row key to delete
@@ -260,7 +260,7 @@ static future<> expire_item(service::storage_proxy& proxy,
// FIXME: log or increment a metric if this happens.
return make_ready_future<>();
}
exploded_pk.push_back(*row_c);
exploded_pk.push_back(to_bytes(*row_c));
}
auto pk = partition_key::from_exploded(exploded_pk);
mutation m(schema, pk);
@@ -280,7 +280,7 @@ static future<> expire_item(service::storage_proxy& proxy,
// FIXME: log or increment a metric if this happens.
return make_ready_future<>();
}
exploded_ck.push_back(*row_c);
exploded_ck.push_back(to_bytes(*row_c));
}
auto ck = clustering_key::from_exploded(exploded_ck);
m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));
@@ -387,7 +387,7 @@ class token_ranges_owned_by_this_shard {
class ranges_holder_primary {
const dht::token_range_vector _token_ranges;
public:
ranges_holder_primary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
ranges_holder_primary(const locator::vnode_effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(erm->get_primary_ranges(ep)) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
@@ -430,6 +430,7 @@ class token_ranges_owned_by_this_shard {
size_t _range_idx;
size_t _end_idx;
std::optional<dht::selective_token_range_sharder> _intersecter;
locator::effective_replication_map_ptr _erm;
public:
token_ranges_owned_by_this_shard(replica::database& db, gms::gossiper& g, schema_ptr s)
: _s(s)
@@ -437,6 +438,7 @@ public:
g, utils::fb_utilities::get_broadcast_address())
, _range_idx(random_offset(0, _token_ranges.size() - 1))
, _end_idx(_range_idx + _token_ranges.size())
, _erm(s->table().get_effective_replication_map())
{
tlogger.debug("Generating token ranges starting from base range {} of {}", _range_idx, _token_ranges.size());
}
@@ -469,7 +471,7 @@ public:
return std::nullopt;
}
}
_intersecter.emplace(_s->get_sharder(), _token_ranges[_range_idx % _token_ranges.size()], this_shard_id());
_intersecter.emplace(_erm->get_sharder(*_s), _token_ranges[_range_idx % _token_ranges.size()], this_shard_id());
}
}
@@ -593,7 +595,7 @@ static future<> scan_table_ranges(
continue;
}
for (const auto& row : rows) {
const bytes_opt& cell = row[*expiration_column];
const managed_bytes_opt& cell = row[*expiration_column];
if (!cell) {
continue;
}


@@ -14,6 +14,7 @@ set(swagger_files
api-doc/hinted_handoff.json
api-doc/lsa.json
api-doc/messaging_service.json
api-doc/metrics.json
api-doc/storage_proxy.json
api-doc/storage_service.json
api-doc/stream_manager.json


@@ -437,6 +437,68 @@
}
]
},
{
"path":"/column_family/tombstone_gc/{name}",
"operations":[
{
"method":"GET",
"summary":"Check if tombstone GC is enabled for a given table",
"type":"boolean",
"nickname":"get_tombstone_gc",
"produces":[
"application/json"
],
"parameters":[
{
"name":"name",
"description":"The table name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
},
{
"method":"POST",
"summary":"Enable tombstone GC for a given table",
"type":"void",
"nickname":"enable_tombstone_gc",
"produces":[
"application/json"
],
"parameters":[
{
"name":"name",
"description":"The table name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
},
{
"method":"DELETE",
"summary":"Disable tombstone GC for a given table",
"type":"void",
"nickname":"disable_tombstone_gc",
"produces":[
"application/json"
],
"parameters":[
{
"name":"name",
"description":"The table name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/column_family/estimate_keys/{name}",
"operations":[


@@ -34,6 +34,14 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"parameters",
"description":"dict of parameters to pass to the injection (json format)",
"required":false,
"allowMultiple":false,
"type":"dict",
"paramType":"body"
}
]
},
@@ -58,6 +66,30 @@
}
]
},
{
"path":"/v2/error_injection/injection/{injection}/message",
"operations":[
{
"method":"POST",
"summary":"Send message to trigger an event in injection's code",
"type":"void",
"nickname":"message_injection",
"produces":[
"application/json"
],
"parameters":[
{
"name":"injection",
"description":"injection name, should correspond to an injection added in code",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/v2/error_injection/injection",
"operations":[
@@ -86,5 +118,15 @@
}
]
}
]
],
"components":{
"schemas": {
"dict": {
"type": "object",
"additionalProperties": {
"type": "string"
}
}
}
}
}


@@ -245,7 +245,7 @@
"GOSSIP_SHUTDOWN",
"DEFINITIONS_UPDATE",
"TRUNCATE",
"REPLICATION_FINISHED",
"UNUSED__REPLICATION_FINISHED",
"MIGRATION_REQUEST",
"PREPARE_MESSAGE",
"PREPARE_DONE_MESSAGE",


@@ -0,0 +1,34 @@
"metrics_config": {
"id": "metrics_config",
"summary": "An entry in the metrics configuration",
"properties": {
"source_labels": {
"type": "array",
"items": {
"type": "string"
},
"description": "The source labels; a match is based on the concatenation of the labels"
},
"action": {
"type": "string",
"description": "The action to perform on a match",
"enum": ["skip_when_empty", "report_when_empty", "replace", "keep", "drop", "drop_label"]
},
"target_label": {
"type": "string",
"description": "The label to write the replacement value to"
},
"replacement": {
"type": "string",
"description": "The replacement string to use when replacing a value"
},
"regex": {
"type": "string",
"description": "The regex string to use when replacing a value"
},
"separator": {
"type": "string",
"description": "The separator string to use when concatenating the labels"
}
}
}
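For illustration, one hypothetical entry in this format (the metric and label names below are invented; the field semantics follow the Prometheus-style relabelling the model describes):

```json
{
  "source_labels": ["__name__", "shard"],
  "separator": ";",
  "regex": "scylla_reactor_utilization;0",
  "action": "drop"
}
```

Concatenating `__name__` and `shard` with `;` and matching the regex drops that series from the report.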

api/api-doc/metrics.json Normal file

@@ -0,0 +1,66 @@
"/v2/metrics-config/":{
"get":{
"description":"Return the metrics layer configuration",
"operationId":"get_metrics_config",
"produces":[
"application/json"
],
"tags":[
"metrics"
],
"parameters":[
],
"responses":{
"200":{
"schema": {
"type":"array",
"items":{
"$ref":"#/definitions/metrics_config",
"description":"metrics Config value"
}
}
},
"default":{
"description":"unexpected error",
"schema":{
"$ref":"#/definitions/ErrorModel"
}
}
}
},
"post": {
"description":"Set the metrics layer relabel configuration",
"operationId":"set_metrics_config",
"produces":[
"application/json"
],
"tags":[
"metrics"
],
"parameters":[
{
"in":"body",
"name":"conf",
"description":"An array of relabel_config objects",
"schema": {
"type":"array",
"items":{
"$ref":"#/definitions/metrics_config",
"description":"metrics Config value"
}
}
}
],
"responses":{
"200":{
"description": "OK"
},
"default":{
"description":"unexpected error",
"schema":{
"$ref":"#/definitions/ErrorModel"
}
}
}
}
}


@@ -465,7 +465,7 @@
"operations":[
{
"method":"GET",
"summary":"Retrieve the mapping of endpoint to host ID",
"summary":"Retrieve the mapping of endpoint to host ID of all nodes that own tokens",
"type":"array",
"items":{
"type":"mapper"
@@ -1114,6 +1114,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ranges_parallelism",
"description":"An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
@@ -1946,7 +1954,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",
"type":"void",
"nickname":"reset_local_schema",
"produces":[
@@ -2110,6 +2118,65 @@
}
]
},
{
"path":"/storage_service/tombstone_gc/{keyspace}",
"operations":[
{
"method":"POST",
"summary":"Enable tombstone GC",
"type":"void",
"nickname":"enable_tombstone_gc",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Disable tombstone GC",
"type":"void",
"nickname":"disable_tombstone_gc",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/deliver_hints",
"operations":[
@@ -2428,7 +2495,23 @@
]
}
]
}
},
{
"path":"/storage_service/raft_topology/reload",
"operations":[
{
"method":"POST",
"summary":"Reload Raft topology state from disk.",
"type":"void",
"nickname":"reload_raft_topology_state",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{
"mapper":{
@@ -2631,7 +2714,7 @@
"description":"File creation time"
},
"generation":{
"type":"long",
"type":"string",
"description":"SSTable generation"
},
"level":{


@@ -16,7 +16,7 @@
}
},
"host": "{{Host}}",
"basePath": "/v2",
"basePath": "/",
"schemes": [
"http"
],


@@ -1,182 +1,182 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager/list_modules",
"operations":[
{
"method":"GET",
"summary":"Get all module names",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_modules",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/list_module_tasks/{module}",
"operations":[
{
"method":"GET",
"summary":"Get a list of tasks",
"type":"array",
"items":{
"type":"task_stats"
},
"nickname":"get_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"internal",
"description":"Boolean flag indicating whether internal tasks should be shown (false by default)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager/task_status/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get task status",
"type":"task_status",
"nickname":"get_task_status",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/abort_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Abort running task and its descendants",
"type":"void",
"nickname":"abort_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to abort",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/wait_task/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Wait for a task to complete",
"type":"task_status",
"nickname":"wait_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to wait for",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/task_status_recursive/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get statuses of the task and all its descendants",
"type":"array",
"items":{
"type":"task_status"
},
"nickname":"get_task_status_recursively",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager/list_modules",
"operations":[
{
"method":"GET",
"summary":"Get all module names",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_modules",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/list_module_tasks/{module}",
"operations":[
{
"method":"GET",
"summary":"Get a list of tasks",
"type":"array",
"items":{
"type":"task_stats"
},
"nickname":"get_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"internal",
"description":"Boolean flag indicating whether internal tasks should be shown (false by default)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager/task_status/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get task status",
"type":"task_status",
"nickname":"get_task_status",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/abort_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Abort running task and its descendants",
"type":"void",
"nickname":"abort_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to abort",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/wait_task/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Wait for a task to complete",
"type":"task_status",
"nickname":"wait_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to wait for",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/task_status_recursive/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get statuses of the task and all its descendants",
"type":"array",
"items":{
"type":"task_status"
},
"nickname":"get_task_status_recursively",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/ttl",
"operations":[
{
@@ -199,88 +199,96 @@
]
}
]
}
],
"models":{
"task_stats" :{
"id": "task_stats",
"description":"A task statistics object",
"properties":{
"task_id":{
"type":"string",
"description":"The uuid of a task"
},
"state":{
"type":"string",
"enum":[
}
],
"models":{
"task_stats" :{
"id": "task_stats",
"description":"A task statistics object",
"properties":{
"task_id":{
"type":"string",
"description":"The uuid of a task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
],
"description":"The state of a task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"keyspace":{
"type":"string",
"description":"The keyspace the task is working on (if applicable)"
},
"table":{
"type":"string",
"description":"The table the task is working on (if applicable)"
},
"entity":{
"type":"string",
"description":"Task-specific entity description"
},
"sequence_number":{
"type":"long",
"description":"The running sequence number of the task"
}
}
},
"task_status":{
"id":"task_status",
"description":"A task status object",
"properties":{
"id":{
"type":"string",
"description":"The uuid of the task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"state":{
],
"description":"The state of a task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"scope":{
"type":"string",
"description":"The scope of the task"
},
"keyspace":{
"type":"string",
"description":"The keyspace the task is working on (if applicable)"
},
"table":{
"type":"string",
"description":"The table the task is working on (if applicable)"
},
"entity":{
"type":"string",
"description":"Task-specific entity description"
},
"sequence_number":{
"type":"long",
"description":"The running sequence number of the task"
}
}
},
"task_status":{
"id":"task_status",
"description":"A task status object",
"properties":{
"id":{
"type":"string",
"description":"The uuid of the task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"scope":{
"type":"string",
"description":"The scope of the task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
"created",
"running",
"done",
"failed"
],
"description":"The state of the task"
},
"is_abortable":{
"type":"boolean",
"description":"Boolean flag indicating whether the task can be aborted"
},
"start_time":{
"type":"datetime",
"description":"The start time of the task"
},
"end_time":{
"type":"datetime",
"description":"The end time of the task (unspecified when the task is not completed)"
},
"error":{
"type":"string",
"description":"Error string, if the task failed"
},
"parent_id":{
"description":"The state of the task"
},
"is_abortable":{
"type":"boolean",
"description":"Boolean flag indicating whether the task can be aborted"
},
"start_time":{
"type":"datetime",
"description":"The start time of the task"
},
"end_time":{
"type":"datetime",
"description":"The end time of the task (unspecified when the task is not completed)"
},
"error":{
"type":"string",
"description":"Error string, if the task failed"
},
"parent_id":{
"type":"string",
"description":"The uuid of the parent task"
},
@@ -318,12 +326,12 @@
},
"children_ids":{
"type":"array",
"items":{
"type":"string"
},
"items":{
"type":"string"
},
"description":"Task IDs of children of this task"
}
}
}
}
}
}
}
}
}


@@ -1,153 +1,153 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager_test",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager_test/test_module",
"operations":[
{
"method":"POST",
"summary":"Register test module in task manager",
"type":"void",
"nickname":"register_test_module",
"produces":[
"application/json"
],
"parameters":[
]
},
{
"method":"DELETE",
"summary":"Unregister test module in task manager",
"type":"void",
"nickname":"unregister_test_module",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager_test/test_task",
"operations":[
{
"method":"POST",
"summary":"Register test task",
"type":"string",
"nickname":"register_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"shard",
"description":"The shard of the task",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"parent_id",
"description":"The uuid of a parent task",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"entity",
"description":"Task-specific entity description",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Unregister test task",
"type":"void",
"nickname":"unregister_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/finish_test_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Finish test task",
"type":"void",
"nickname":"finish_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to finish",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"error",
"description":"The error with which task fails (if it does)",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
}
]
}
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager_test",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager_test/test_module",
"operations":[
{
"method":"POST",
"summary":"Register test module in task manager",
"type":"void",
"nickname":"register_test_module",
"produces":[
"application/json"
],
"parameters":[
]
},
{
"method":"DELETE",
"summary":"Unregister test module in task manager",
"type":"void",
"nickname":"unregister_test_module",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager_test/test_task",
"operations":[
{
"method":"POST",
"summary":"Register test task",
"type":"string",
"nickname":"register_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"shard",
"description":"The shard of the task",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"parent_id",
"description":"The uuid of a parent task",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"entity",
"description":"Task-specific entity description",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Unregister test task",
"type":"void",
"nickname":"unregister_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/finish_test_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Finish test task",
"type":"void",
"nickname":"finish_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to finish",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"error",
"description":"The error with which task fails (if it does)",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -60,8 +60,10 @@ future<> set_server_init(http_context& ctx) {
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
rb02->register_api_file(r, "metrics");
rb->register_function(r, "system",
"The system related API");
rb02->add_definitions_file(r, "metrics");
set_system(ctx, r);
});
}
@@ -69,7 +71,7 @@ future<> set_server_init(http_context& ctx) {
future<> set_server_config(http_context& ctx, const db::config& cfg) {
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([&ctx, &cfg, rb02](routes& r) {
set_config(rb02, ctx, r, cfg);
set_config(rb02, ctx, r, cfg, false);
});
}
@@ -100,12 +102,16 @@ future<> unset_rpc_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs, &sys_ks] (http_context& ctx, routes& r) {
set_storage_service(ctx, r, ss, g.local(), cdc_gs, sys_ks);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &group0_client] (http_context& ctx, routes& r) {
set_storage_service(ctx, r, ss, group0_client);
});
}
future<> unset_server_storage_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_storage_service(ctx, r); });
}
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {
return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });
}
@@ -187,10 +193,10 @@ future<> unset_server_messaging_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_messaging_service(ctx, r); });
}
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss) {
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_proxy>& proxy) {
return register_api(ctx, "storage_proxy",
"The storage proxy API", [&ss] (http_context& ctx, routes& r) {
set_storage_proxy(ctx, r, ss);
"The storage proxy API", [&proxy] (http_context& ctx, routes& r) {
set_storage_proxy(ctx, r, proxy);
});
}
@@ -214,10 +220,10 @@ future<> set_server_cache(http_context& ctx) {
"The cache service API", set_cache_service);
}
future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g) {
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy) {
return register_api(ctx, "hinted_handoff",
"The hinted handoff API", [&g] (http_context& ctx, routes& r) {
set_hinted_handoff(ctx, r, g.local());
"The hinted handoff API", [&proxy] (http_context& ctx, routes& r) {
set_hinted_handoff(ctx, r, proxy);
});
}
@@ -264,28 +270,36 @@ future<> set_server_done(http_context& ctx) {
});
}
future<> set_server_task_manager(http_context& ctx, lw_shared_ptr<db::config> cfg) {
future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &cfg = *cfg](routes& r) {
return ctx.http_server.set_routes([rb, &ctx, &tm, &cfg = *cfg](routes& r) {
rb->register_function(r, "task_manager",
"The task manager API");
set_task_manager(ctx, r, cfg);
set_task_manager(ctx, r, tm, cfg);
});
}
future<> unset_server_task_manager(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_task_manager(ctx, r); });
}
#ifndef SCYLLA_BUILD_MODE_RELEASE
future<> set_server_task_manager_test(http_context& ctx) {
future<> set_server_task_manager_test(http_context& ctx, sharded<tasks::task_manager>& tm) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) mutable {
return ctx.http_server.set_routes([rb, &ctx, &tm](routes& r) mutable {
rb->register_function(r, "task_manager_test",
"The task manager test API");
set_task_manager_test(ctx, r);
set_task_manager_test(ctx, r, tm);
});
}
future<> unset_server_task_manager_test(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_task_manager_test(ctx, r); });
}
#endif
void req_params::process(const request& req) {

View File

@@ -22,6 +22,7 @@ namespace service {
class load_meter;
class storage_proxy;
class storage_service;
class raft_group0_client;
} // namespace service
@@ -51,7 +52,6 @@ class system_keyspace;
}
namespace netw { class messaging_service; }
class repair_service;
namespace cdc { class generation_service; }
namespace gms {
@@ -61,6 +61,10 @@ class gossiper;
namespace auth { class service; }
namespace tasks {
class task_manager;
}
namespace api {
struct http_context {
@@ -68,15 +72,12 @@ struct http_context {
sstring api_doc;
httpd::http_server_control http_server;
distributed<replica::database>& db;
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
const sharded<locator::shared_token_metadata>& shared_token_metadata;
sharded<tasks::task_manager>& tm;
http_context(distributed<replica::database>& _db,
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm, sharded<tasks::task_manager>& _tm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm), tm(_tm) {
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)
: db(_db), lmeter(_lm), shared_token_metadata(_stm) {
}
const locator::token_metadata& get_token_metadata();
@@ -86,7 +87,8 @@ future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx, const db::config& cfg);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);
@@ -106,17 +108,19 @@ future<> set_server_load_sstable(http_context& ctx, sharded<db::system_keyspace>
future<> unset_server_load_sstable(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> unset_server_messaging_service(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_service>& ss);
future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_proxy>& proxy);
future<> unset_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm);
future<> unset_server_stream_manager(http_context& ctx);
future<> set_hinted_handoff(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p);
future<> unset_hinted_handoff(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx, lw_shared_ptr<db::config> cfg);
future<> set_server_task_manager_test(http_context& ctx);
future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg);
future<> unset_server_task_manager(http_context& ctx);
future<> set_server_task_manager_test(http_context& ctx, sharded<tasks::task_manager>& tm);
future<> unset_server_task_manager_test(http_context& ctx);
}

View File

@@ -11,6 +11,7 @@
#include "api/authorization_cache.hh"
#include "api/api.hh"
#include "auth/common.hh"
#include "auth/service.hh"
namespace api {
using namespace json;

View File

@@ -43,7 +43,7 @@ std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {
return std::make_tuple(name.substr(0, pos), name.substr(end));
}
const table_id& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
table_id get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
try {
return db.find_uuid(ks, cf);
} catch (replica::no_such_column_family& e) {
@@ -51,7 +51,7 @@ const table_id& get_uuid(const sstring& ks, const sstring& cf, const replica::da
}
}
const table_id& get_uuid(const sstring& name, const replica::database& db) {
table_id get_uuid(const sstring& name, const replica::database& db) {
auto [ks, cf] = parse_fully_qualified_cf_name(name);
return get_uuid(ks, cf, db);
}
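The hunk above changes `get_uuid` to return `table_id` by value instead of `const table_id&`. Returning a reference into the database's table registry risks dangling once the entry is dropped; a copy stays valid. A minimal sketch of the same pattern (the `registry` type and names here are hypothetical stand-ins, not Scylla's API):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for table_id and the database's table registry.
using table_id = std::string;

struct registry {
    std::map<std::string, table_id> ids;

    // Returning by value copies the id out of the map, so the caller's
    // copy stays valid even if the entry is later erased.
    table_id get_uuid(const std::string& name) const {
        auto it = ids.find(name);
        if (it == ids.end()) {
            throw std::out_of_range("no such column family: " + name);
        }
        return it->second;
    }
};
```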
@@ -135,9 +135,9 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, const
static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
std::function<utils::ihistogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::ihistogram res;
for (auto i : db.get_column_families()) {
res += (i.second->get_stats().*f).hist;
}
db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> table) mutable {
res += (table->get_stats().*f).hist;
});
return res;
};
return ctx.db.map(fun).then([](const std::vector<utils::ihistogram> &res) {
@@ -162,9 +162,9 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& c
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
std::function<utils::rate_moving_average_and_histogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::rate_moving_average_and_histogram res;
for (auto i : db.get_column_families()) {
res += (i.second->get_stats().*f).rate();
}
db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> table) {
res += (table->get_stats().*f).rate();
});
return res;
};
return ctx.db.map(fun).then([](const std::vector<utils::rate_moving_average_and_histogram> &res) {
@@ -306,21 +306,21 @@ ratio_holder filter_recent_false_positive_as_ratio_holder(const sstables::shared
void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace>& sys_ks) {
cf::get_column_family_name.set(r, [&ctx] (const_req req){
std::vector<sstring> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
res.push_back(i.first.first + ":" + i.first.second);
}
ctx.db.local().get_tables_metadata().for_each_table_id([&] (const std::pair<sstring, sstring>& kscf, table_id) {
res.push_back(kscf.first + ":" + kscf.second);
});
return res;
});
cf::get_column_family.set(r, [&ctx] (std::unique_ptr<http::request> req){
std::list<cf::column_family_info> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
std::list<cf::column_family_info> res;
ctx.db.local().get_tables_metadata().for_each_table_id([&] (const std::pair<sstring, sstring>& kscf, table_id) {
cf::column_family_info info;
info.ks = i.first.first;
info.cf = i.first.second;
info.ks = kscf.first;
info.cf = kscf.second;
info.type = "ColumnFamilies";
res.push_back(info);
}
});
return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
});
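The hunks above replace direct iteration over `get_column_families_mapping()` with a `for_each_table_id` visitor callback, so the registry no longer exposes its underlying container. A self-contained sketch of that visitor style, with hypothetical names (not Scylla's actual types):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry that hides its container and exposes a visitor,
// mirroring the for_each_table_id style used in the hunk above.
class tables_metadata {
    std::vector<std::pair<std::string, std::string>> _tables; // (keyspace, table)
public:
    void add(std::string ks, std::string cf) {
        _tables.emplace_back(std::move(ks), std::move(cf));
    }
    void for_each_table_id(
            const std::function<void(const std::pair<std::string, std::string>&)>& f) const {
        for (const auto& kscf : _tables) {
            f(kscf);
        }
    }
};

// Builds the "ks:cf" name list the same way the handler above does.
std::vector<std::string> list_table_names(const tables_metadata& tm) {
    std::vector<std::string> res;
    tm.for_each_table_id([&](const auto& kscf) {
        res.push_back(kscf.first + ":" + kscf.second);
    });
    return res;
}
```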
@@ -871,6 +871,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_auto_compaction: name={}", req->param["name"]);
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
@@ -882,6 +883,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_auto_compaction: name={}", req->param["name"]);
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
@@ -892,6 +894,30 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
});
cf::get_tombstone_gc.set(r, [&ctx] (const_req req) {
auto uuid = get_uuid(req.param["name"], ctx.db.local());
replica::table& t = ctx.db.local().find_column_family(uuid);
return t.tombstone_gc_enabled();
});
cf::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
t.set_tombstone_gc_enabled(true);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
cf::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
t.set_tombstone_gc_enabled(false);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
cf::get_built_indexes.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {
auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);
auto&& ks = std::get<0>(ks_cf);
@@ -955,6 +981,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<http::request> req) {
sstring strategy = req->get_query_param("class_name");
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->param["name"], strategy);
return foreach_column_family(ctx, req->param["name"], [strategy](replica::column_family& cf) {
cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));
}).then([] {
@@ -990,11 +1017,12 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (replica::database& db) {
return db.find_column_family(uuid).get_sstables_by_partition_key(key);
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
co_return boost::copy_range<std::unordered_set<sstring>>(sstables | boost::adaptors::transformed([] (auto s) { return s->get_filename(); }));
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.insert(b.begin(),b.end());
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.merge(b);
return a;
}).then([](const std::unordered_set<sstring>& res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
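The reducer above switches from `a.insert(b.begin(), b.end())` to `a.merge(b)`: since C++17, `unordered_set::merge` splices nodes from `b` into `a` without copying the elements, leaving only duplicates behind in `b`. A standalone sketch of that reducer shape:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>

// Reducer in the style of the map_reduce0 call above: merge() moves nodes
// from b into a (no element copies), leaving duplicates behind in b.
std::unordered_set<std::string> reduce_sets(std::unordered_set<std::string> a,
                                            std::unordered_set<std::string>&& b) {
    a.merge(b);
    return a;
}
```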
@@ -1023,9 +1051,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
fail(unimplemented::cause::API);
}
apilog.info("column_family/force_major_compaction: name={}", req->param["name"]);
auto [ks, cf] = parse_fully_qualified_cf_name(req->param["name"]);
auto keyspace = validate_keyspace(ctx, ks);
std::vector<table_id> table_infos = {ctx.db.local().find_uuid(ks, cf)};
std::vector<table_info> table_infos = {table_info{
.name = cf,
.id = ctx.db.local().find_uuid(ks, cf)
}};
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, std::move(table_infos));

View File

@@ -23,7 +23,7 @@ namespace api {
void set_column_family(http_context& ctx, httpd::routes& r, sharded<db::system_keyspace>& sys_ks);
void unset_column_family(http_context& ctx, httpd::routes& r);
const table_id& get_uuid(const sstring& name, const replica::database& db);
table_id get_uuid(const sstring& name, const replica::database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);
@@ -68,9 +68,10 @@ struct map_reduce_column_families_locally {
std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;
future<std::unique_ptr<std::any>> operator()(replica::database& db) const {
auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));
return do_for_each(db.get_column_families(), [res, this](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) {
*res = reducer(std::move(*res), mapper(*i.second.get()));
}).then([res] {
return db.get_tables_metadata().for_each_table_gently([res, this] (table_id, seastar::lw_shared_ptr<replica::table> table) {
*res = reducer(std::move(*res), mapper(*table.get()));
return make_ready_future();
}).then([res] () {
return std::move(*res);
});
}

View File

@@ -68,8 +68,8 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return ctx.db.map_reduce0([](replica::database& db) {
return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) -> future<> {
replica::table& cf = *i.second.get();
return db.get_tables_metadata().for_each_table_gently([&tasks] (table_id, lw_shared_ptr<replica::table> table) {
replica::table& cf = *table.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.estimate_pending_compactions();
return make_ready_future<>();
}).then([&tasks] {
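The pending-tasks map above is keyed by a `(keyspace, table)` pair, which `std::unordered_map` cannot hash out of the box; the code supplies `utils::tuple_hash` for that. A minimal equivalent pair hasher (a sketch, not Scylla's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// std::unordered_map needs a hasher for std::pair keys.
struct pair_hash {
    std::size_t operator()(const std::pair<std::string, std::string>& p) const {
        auto h1 = std::hash<std::string>{}(p.first);
        auto h2 = std::hash<std::string>{}(p.second);
        // boost-style hash combine
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

using pending_tasks_map =
    std::unordered_map<std::pair<std::string, std::string>, uint64_t, pair_hash>;
```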

View File

@@ -45,7 +45,7 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc
} else {
ss <<',';
};
ss << "\"/config/" << name <<"\": {"
ss << "\"/v2/config/" << name <<"\": {"
"\"get\": {"
"\"description\": \"" << boost::replace_all_copy(boost::replace_all_copy(boost::replace_all_copy(description,"\n","\\n"),"\"", "''"), "\t", " ") <<"\","
"\"operationId\": \"find_config_"<< name <<"\","
@@ -76,9 +76,9 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc
namespace cs = httpd::config_json;
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg) {
rb->register_function(r, [&cfg] (output_stream<char>& os) {
return do_with(true, [&os, &cfg] (bool& first) {
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg, bool first) {
rb->register_function(r, [&cfg, first] (output_stream<char>& os) {
return do_with(first, [&os, &cfg] (bool& first) {
auto f = make_ready_future();
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
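The `get_config_swagger_entry` hunk above embeds each config option's description into generated JSON after a three-step escape: newlines become literal `\n`, double quotes become `''`, and tabs become spaces. A plain-std sketch of that `boost::replace_all_copy` chain:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Plain-std stand-in for boost::replace_all_copy.
std::string replace_all(std::string s, const std::string& from, const std::string& to) {
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
    return s;
}

// Escapes a config description for embedding in the generated swagger JSON:
// newlines -> literal "\n", double quotes -> '', tabs -> spaces.
std::string escape_description(const std::string& desc) {
    return replace_all(replace_all(replace_all(desc, "\n", "\\n"), "\"", "''"), "\t", " ");
}
```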

View File

@@ -13,5 +13,5 @@
namespace api {
void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, const db::config& cfg);
void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, const db::config& cfg, bool first = false);
}

View File

@@ -12,7 +12,9 @@
#include <seastar/http/exception.hh>
#include "log.hh"
#include "utils/error_injection.hh"
#include "utils/rjson.hh"
#include <seastar/core/future-util.hh>
#include <seastar/util/short_streams.hh>
namespace api {
using namespace seastar::httpd;
@@ -24,10 +26,27 @@ void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
bool one_shot = req->get_query_param("one_shot") == "True";
auto& errinj = utils::get_local_injector();
return errinj.enable_on_all(injection, one_shot).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
auto params = req->content;
const size_t max_params_size = 1024 * 1024;
if (params.size() > max_params_size) {
// This is a hard limit, because we don't want to allocate
// too much memory or block the thread for too long.
throw httpd::bad_param_exception(format("Injection parameters are too long, max length is {}", max_params_size));
}
try {
auto parameters = params.empty()
? utils::error_injection_parameters{}
: rjson::parse_to_map<utils::error_injection_parameters>(params);
auto& errinj = utils::get_local_injector();
return errinj.enable_on_all(injection, one_shot, std::move(parameters)).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
} catch (const rjson::error& e) {
throw httpd::bad_param_exception(format("Failed to parse injections parameters: {}", e.what()));
}
});
hf::get_enabled_injections_on_all.set(r, [](std::unique_ptr<request> req) {
@@ -52,6 +71,13 @@ void set_error_injection(http_context& ctx, routes& r) {
});
});
hf::message_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
auto& errinj = utils::get_local_injector();
return errinj.receive_message_on_all(injection).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
});
}
} // namespace api
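The `enable_injection` handler above now reads the request body, rejects payloads over 1 MiB, and parses the rest into a parameter map (via `rjson::parse_to_map` in the real code). A simplified sketch of that size guard and error path, with a trivial `key=value` line format standing in for the JSON parse:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Sketch of the request-body handling in the hunk above. The real code
// parses JSON with rjson::parse_to_map; a simple "key=value" line format
// stands in here, since only the size guard and error path are illustrated.
constexpr std::size_t max_params_size = 1024 * 1024;

std::map<std::string, std::string> parse_injection_params(const std::string& body) {
    if (body.size() > max_params_size) {
        // Hard limit: avoid allocating too much memory for one request.
        throw std::length_error("Injection parameters are too long");
    }
    std::map<std::string, std::string> params;
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        auto eq = line.find('=');
        if (eq == std::string::npos) {
            throw std::invalid_argument("Failed to parse injection parameters: " + line);
        }
        params.emplace(line.substr(0, eq), line.substr(eq + 1));
    }
    return params;
}
```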

View File

@@ -19,24 +19,25 @@ namespace fd = httpd::failure_detector_json;
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {
std::vector<fd::endpoint_state> res;
for (auto i : g.get_endpoint_states()) {
res.reserve(g.num_endpoints());
g.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& eps) {
fd::endpoint_state val;
val.addrs = fmt::to_string(i.first);
val.is_alive = i.second.is_alive();
val.generation = i.second.get_heart_beat_state().get_generation().value();
val.version = i.second.get_heart_beat_state().get_heart_beat_version().value();
val.update_time = i.second.get_update_timestamp().time_since_epoch().count();
for (auto a : i.second.get_application_state_map()) {
val.addrs = fmt::to_string(addr);
val.is_alive = g.is_alive(addr);
val.generation = eps.get_heart_beat_state().get_generation().value();
val.version = eps.get_heart_beat_state().get_heart_beat_version().value();
val.update_time = eps.get_update_timestamp().time_since_epoch().count();
for (const auto& [as_type, app_state] : eps.get_application_state_map()) {
fd::version_value version_val;
// We return the enum index and not its name to stay compatible with Origin's
// method: the state indexes are static, but the names can change.
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);
version_val.value = a.second.value();
version_val.version = a.second.version().value();
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(as_type);
version_val.value = app_state.value();
version_val.version = app_state.version().value();
val.application_state.push(version_val);
}
res.push_back(val);
}
res.emplace_back(std::move(val));
});
return make_ready_future<json::json_return_type>(res);
});
@@ -56,9 +57,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {
nodes_status.emplace(node.to_sstring(), g.is_alive(node) ? "UP" : "DOWN");
});
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
@@ -70,7 +71,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
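The `get_simple_states` hunk above replaces direct iteration over `get_endpoint_states()` with a `for_each_endpoint_state` visitor plus a separate `is_alive` query. A self-contained sketch of that handler shape (the `fake_gossiper` type is a hypothetical stand-in):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>

// Hypothetical gossiper-like source: a per-endpoint visitor plus a
// liveness query, mirroring the refactored handler above.
struct fake_gossiper {
    std::set<std::string> endpoints;
    std::set<std::string> alive;

    void for_each_endpoint_state(const std::function<void(const std::string&)>& f) const {
        for (const auto& ep : endpoints) {
            f(ep);
        }
    }
    bool is_alive(const std::string& ep) const {
        return alive.count(ep) != 0;
    }
};

// Builds the node -> "UP"/"DOWN" status map the same way the handler does.
std::map<std::string, std::string> simple_states(const fake_gossiper& g) {
    std::map<std::string, std::string> nodes_status;
    g.for_each_endpoint_state([&](const std::string& node) {
        nodes_status.emplace(node, g.is_alive(node) ? "UP" : "DOWN");
    });
    return nodes_status;
}
```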

View File

@@ -6,8 +6,11 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "gossiper.hh"
#include "api/api-doc/gossiper.json.hh"
#include "gms/endpoint_state.hh"
#include "gms/gossiper.hh"
namespace api {
@@ -15,9 +18,9 @@ using namespace seastar::httpd;
using namespace json;
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_unreachable_members();
return container_to_vec(res);
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto res = co_await g.get_unreachable_members_synchronized();
co_return json::json_return_type(container_to_vec(res));
});
@@ -27,9 +30,11 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
});
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {
gms::inet_address ep(req.param["addr"]);
return g.get_endpoint_downtime(ep);
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->param["addr"]);
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {
@@ -59,7 +64,7 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
return g.force_remove_endpoint(ep).then([] {
return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});

View File

@@ -13,7 +13,6 @@
#include "api/api-doc/hinted_handoff.json.hh"
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "service/storage_proxy.hh"
namespace api {
@@ -22,38 +21,33 @@ using namespace json;
using namespace seastar::httpd;
namespace hh = httpd::hinted_handoff_json;
void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g) {
hh::create_hints_sync_point.set(r, [&ctx, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto parse_hosts_list = [&g] (sstring arg) {
void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy) {
hh::create_hints_sync_point.set(r, [&proxy] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto parse_hosts_list = [] (sstring arg) {
std::vector<sstring> hosts_str = split(arg, ",");
std::vector<gms::inet_address> hosts;
hosts.reserve(hosts_str.size());
if (hosts_str.empty()) {
// No target_hosts specified means that we should wait for hints for all nodes to be sent
const auto members_set = g.get_live_members();
std::copy(members_set.begin(), members_set.end(), std::back_inserter(hosts));
} else {
for (const auto& host_str : hosts_str) {
try {
gms::inet_address host;
host = gms::inet_address(host_str);
hosts.push_back(host);
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));
}
for (const auto& host_str : hosts_str) {
try {
gms::inet_address host;
host = gms::inet_address(host_str);
hosts.push_back(host);
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));
}
}
return hosts;
};
std::vector<gms::inet_address> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));
-return ctx.sp.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {
+return proxy.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {
return json::json_return_type(sync_point.encode());
});
});
-hh::get_hints_sync_point.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
+hh::get_hints_sync_point.set(r, [&proxy] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
db::hints::sync_point sync_point;
const sstring encoded = req->get_query_param("id");
try {
@@ -87,7 +81,7 @@ void set_hinted_handoff(http_context& ctx, routes& r, gms::gossiper& g) {
using return_type = hh::ns_get_hints_sync_point::get_hints_sync_point_return_type;
using return_type_wrapper = hh::ns_get_hints_sync_point::return_type_wrapper;
-return ctx.sp.local().wait_for_hint_sync_point(std::move(sync_point), deadline).then([] {
+return proxy.local().wait_for_hint_sync_point(std::move(sync_point), deadline).then([] {
return json::json_return_type(return_type_wrapper(return_type::DONE));
}).handle_exception_type([] (const timed_out_error&) {
return json::json_return_type(return_type_wrapper(return_type::IN_PROGRESS));


@@ -8,17 +8,14 @@
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
-namespace gms {
-class gossiper;
-}
+namespace service { class storage_proxy; }
namespace api {
-void set_hinted_handoff(http_context& ctx, httpd::routes& r, gms::gossiper& g);
+void set_hinted_handoff(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& p);
void unset_hinted_handoff(http_context& ctx, httpd::routes& r);
}


@@ -10,7 +10,6 @@
#include "service/storage_proxy.hh"
#include "api/api-doc/storage_proxy.json.hh"
#include "api/api-doc/utils.json.hh"
-#include "service/storage_service.hh"
#include "db/config.hh"
#include "utils/histogram.hh"
#include "replica/database.hh"
@@ -116,17 +115,17 @@ utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimat
return res;
}
-static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
-return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
+static future<json::json_return_type> sum_estimated_histogram(sharded<service::storage_proxy>& proxy, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
+return two_dimensional_map_reduce(proxy, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).histogram();
}, utils::time_estimated_histogram_merge, utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(val));
});
}
-static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::estimated_histogram service::storage_proxy_stats::stats::*f) {
-return two_dimensional_map_reduce(ctx.sp, f, utils::estimated_histogram_merge,
+static future<json::json_return_type> sum_estimated_histogram(sharded<service::storage_proxy>& proxy, utils::estimated_histogram service::storage_proxy_stats::stats::*f) {
+return two_dimensional_map_reduce(proxy, f, utils::estimated_histogram_merge,
utils::estimated_histogram()).then([](const utils::estimated_histogram& val) {
utils_json::estimated_histogram res;
res = val;
@@ -134,8 +133,8 @@ static future<json::json_return_type> sum_estimated_histogram(http_context& ctx
});
}
-static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
-return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
+static future<json::json_return_type> total_latency(sharded<service::storage_proxy>& proxy, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
+return two_dimensional_map_reduce(proxy, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist.mean * (stats.*f).hist.count;
}, std::plus<double>(), 0.0).then([](double val) {
int64_t res = val;
@@ -184,43 +183,43 @@ sum_timer_stats_storage_proxy(distributed<proxy>& d,
});
}
-void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_service>& ss) {
+void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy) {
sp::get_total_hints.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(0);
});
-sp::get_hinted_handoff_enabled.set(r, [&ctx](std::unique_ptr<http::request> req) {
-const auto& filter = ctx.sp.local().get_hints_host_filter();
+sp::get_hinted_handoff_enabled.set(r, [&proxy](std::unique_ptr<http::request> req) {
+const auto& filter = proxy.local().get_hints_host_filter();
return make_ready_future<json::json_return_type>(!filter.is_disabled_for_all());
});
-sp::set_hinted_handoff_enabled.set(r, [&ctx](std::unique_ptr<http::request> req) {
+sp::set_hinted_handoff_enabled.set(r, [&proxy](std::unique_ptr<http::request> req) {
auto enable = req->get_query_param("enable");
auto filter = (enable == "true" || enable == "1")
? db::hints::host_filter(db::hints::host_filter::enabled_for_all_tag {})
: db::hints::host_filter(db::hints::host_filter::disabled_for_all_tag {});
-return ctx.sp.invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
+return proxy.invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
return sp.change_hints_host_filter(filter);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
-sp::get_hinted_handoff_enabled_by_dc.set(r, [&ctx](std::unique_ptr<http::request> req) {
+sp::get_hinted_handoff_enabled_by_dc.set(r, [&proxy](std::unique_ptr<http::request> req) {
std::vector<sstring> res;
-const auto& filter = ctx.sp.local().get_hints_host_filter();
+const auto& filter = proxy.local().get_hints_host_filter();
const auto& dcs = filter.get_dcs();
res.reserve(res.size());
std::copy(dcs.begin(), dcs.end(), std::back_inserter(res));
return make_ready_future<json::json_return_type>(res);
});
-sp::set_hinted_handoff_enabled_by_dc_list.set(r, [&ctx](std::unique_ptr<http::request> req) {
+sp::set_hinted_handoff_enabled_by_dc_list.set(r, [&proxy](std::unique_ptr<http::request> req) {
auto dcs = req->get_query_param("dcs");
auto filter = db::hints::host_filter::parse_from_dc_list(std::move(dcs));
-return ctx.sp.invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
+return proxy.invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
return sp.change_hints_host_filter(filter);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -342,144 +341,131 @@ void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_se
return make_ready_future<json::json_return_type>(json_void());
});
-sp::get_read_repair_attempted.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_attempts);
+sp::get_read_repair_attempted.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::read_repair_attempts);
});
-sp::get_read_repair_repaired_blocking.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_blocking);
+sp::get_read_repair_repaired_blocking.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::read_repair_repaired_blocking);
});
-sp::get_read_repair_repaired_background.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read_repair_repaired_background);
+sp::get_read_repair_repaired_background.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::read_repair_repaired_background);
});
-sp::get_schema_versions.set(r, [&ss](std::unique_ptr<http::request> req) {
-return ss.local().describe_schema_versions().then([] (auto result) {
-std::vector<sp::mapper_list> res;
-for (auto e : result) {
-sp::mapper_list entry;
-entry.key = std::move(e.first);
-entry.value = std::move(e.second);
-res.emplace_back(std::move(entry));
-}
-return make_ready_future<json::json_return_type>(std::move(res));
-});
-});
-sp::get_cas_read_timeouts.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &proxy::stats::cas_read_timeouts);
+sp::get_cas_read_timeouts.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &proxy::stats::cas_read_timeouts);
});
-sp::get_cas_read_unavailables.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &proxy::stats::cas_read_unavailables);
+sp::get_cas_read_unavailables.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &proxy::stats::cas_read_unavailables);
});
-sp::get_cas_write_timeouts.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &proxy::stats::cas_write_timeouts);
+sp::get_cas_write_timeouts.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &proxy::stats::cas_write_timeouts);
});
-sp::get_cas_write_unavailables.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &proxy::stats::cas_write_unavailables);
+sp::get_cas_write_unavailables.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &proxy::stats::cas_write_unavailables);
});
-sp::get_cas_write_metrics_unfinished_commit.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats(ctx.sp, &proxy::stats::cas_write_unfinished_commit);
+sp::get_cas_write_metrics_unfinished_commit.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats(proxy, &proxy::stats::cas_write_unfinished_commit);
});
-sp::get_cas_write_metrics_contention.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_estimated_histogram(ctx, &proxy::stats::cas_write_contention);
+sp::get_cas_write_metrics_contention.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_estimated_histogram(proxy, &proxy::stats::cas_write_contention);
});
-sp::get_cas_write_metrics_condition_not_met.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats(ctx.sp, &proxy::stats::cas_write_condition_not_met);
+sp::get_cas_write_metrics_condition_not_met.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats(proxy, &proxy::stats::cas_write_condition_not_met);
});
-sp::get_cas_write_metrics_failed_read_round_optimization.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats(ctx.sp, &proxy::stats::cas_failed_read_round_optimization);
+sp::get_cas_write_metrics_failed_read_round_optimization.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats(proxy, &proxy::stats::cas_failed_read_round_optimization);
});
-sp::get_cas_read_metrics_unfinished_commit.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_stats(ctx.sp, &proxy::stats::cas_read_unfinished_commit);
+sp::get_cas_read_metrics_unfinished_commit.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_stats(proxy, &proxy::stats::cas_read_unfinished_commit);
});
-sp::get_cas_read_metrics_contention.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_estimated_histogram(ctx, &proxy::stats::cas_read_contention);
+sp::get_cas_read_metrics_contention.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_estimated_histogram(proxy, &proxy::stats::cas_read_contention);
});
-sp::get_read_metrics_timeouts.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::read_timeouts);
+sp::get_read_metrics_timeouts.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::read_timeouts);
});
-sp::get_read_metrics_unavailables.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::read_unavailables);
+sp::get_read_metrics_unavailables.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::read_unavailables);
});
-sp::get_range_metrics_timeouts.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::range_slice_timeouts);
+sp::get_range_metrics_timeouts.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::range_slice_timeouts);
});
-sp::get_range_metrics_unavailables.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::range_slice_unavailables);
+sp::get_range_metrics_unavailables.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::range_slice_unavailables);
});
-sp::get_write_metrics_timeouts.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::write_timeouts);
+sp::get_write_metrics_timeouts.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::write_timeouts);
});
-sp::get_write_metrics_unavailables.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_long(ctx.sp, &service::storage_proxy_stats::stats::write_unavailables);
+sp::get_write_metrics_unavailables.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_long(proxy, &service::storage_proxy_stats::stats::write_unavailables);
});
-sp::get_read_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::read_timeouts);
+sp::get_read_metrics_timeouts_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::read_timeouts);
});
-sp::get_read_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::read_unavailables);
+sp::get_read_metrics_unavailables_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::read_unavailables);
});
-sp::get_range_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::range_slice_timeouts);
+sp::get_range_metrics_timeouts_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::range_slice_timeouts);
});
-sp::get_range_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::range_slice_unavailables);
+sp::get_range_metrics_unavailables_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::range_slice_unavailables);
});
-sp::get_write_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::write_timeouts);
+sp::get_write_metrics_timeouts_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::write_timeouts);
});
-sp::get_write_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timed_rate_as_obj(ctx.sp, &service::storage_proxy_stats::stats::write_unavailables);
+sp::get_write_metrics_unavailables_rates.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timed_rate_as_obj(proxy, &service::storage_proxy_stats::stats::write_unavailables);
});
-sp::get_range_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
+sp::get_range_metrics_latency_histogram_depricated.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_histogram_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::range);
});
-sp::get_write_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::write);
+sp::get_write_metrics_latency_histogram_depricated.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_histogram_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::write);
});
-sp::get_read_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_histogram_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read);
+sp::get_read_metrics_latency_histogram_depricated.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_histogram_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::read);
});
-sp::get_range_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
+sp::get_range_metrics_latency_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::range);
});
-sp::get_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::write);
+sp::get_write_metrics_latency_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::write);
});
-sp::get_cas_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats(ctx.sp, &proxy::stats::cas_write);
+sp::get_cas_write_metrics_latency_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats(proxy, &proxy::stats::cas_write);
});
-sp::get_cas_read_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats(ctx.sp, &proxy::stats::cas_read);
+sp::get_cas_read_metrics_latency_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats(proxy, &proxy::stats::cas_read);
});
sp::get_view_write_metrics_latency_histogram.set(r, [](std::unique_ptr<http::request> req) {
@@ -490,31 +476,31 @@ void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_se
return make_ready_future<json::json_return_type>(get_empty_moving_average());
});
-sp::get_read_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::read);
+sp::get_read_metrics_latency_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::read);
});
-sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::read);
+sp::get_read_estimated_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_estimated_histogram(proxy, &service::storage_proxy_stats::stats::read);
});
-sp::get_read_latency.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return total_latency(ctx, &service::storage_proxy_stats::stats::read);
+sp::get_read_latency.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return total_latency(proxy, &service::storage_proxy_stats::stats::read);
});
-sp::get_write_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::write);
+sp::get_write_estimated_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_estimated_histogram(proxy, &service::storage_proxy_stats::stats::write);
});
-sp::get_write_latency.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return total_latency(ctx, &service::storage_proxy_stats::stats::write);
+sp::get_write_latency.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return total_latency(proxy, &service::storage_proxy_stats::stats::write);
});
-sp::get_range_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return sum_timer_stats_storage_proxy(ctx.sp, &service::storage_proxy_stats::stats::range);
+sp::get_range_estimated_histogram.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return sum_timer_stats_storage_proxy(proxy, &service::storage_proxy_stats::stats::range);
});
-sp::get_range_latency.set(r, [&ctx](std::unique_ptr<http::request> req) {
-return total_latency(ctx, &service::storage_proxy_stats::stats::range);
+sp::get_range_latency.set(r, [&proxy](std::unique_ptr<http::request> req) {
+return total_latency(proxy, &service::storage_proxy_stats::stats::range);
});
}
@@ -547,7 +533,6 @@ void unset_storage_proxy(http_context& ctx, routes& r) {
sp::get_read_repair_attempted.unset(r);
sp::get_read_repair_repaired_blocking.unset(r);
sp::get_read_repair_repaired_background.unset(r);
-sp::get_schema_versions.unset(r);
sp::get_cas_read_timeouts.unset(r);
sp::get_cas_read_unavailables.unset(r);
sp::get_cas_write_timeouts.unset(r);


@@ -11,11 +11,11 @@
#include <seastar/core/sharded.hh>
#include "api.hh"
-namespace service { class storage_service; }
+namespace service { class storage_proxy; }
namespace api {
-void set_storage_proxy(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss);
+void set_storage_proxy(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& proxy);
void unset_storage_proxy(http_context& ctx, httpd::routes& r);
}


@@ -8,6 +8,7 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
+#include "api/api-doc/storage_proxy.json.hh"
#include "db/config.hh"
#include "db/schema_tables.hh"
#include "utils/hash.hh"
@@ -42,7 +43,6 @@
#include "thrift/controller.hh"
#include "locator/token_metadata.hh"
#include "cdc/generation_service.hh"
-#include "service/storage_proxy.hh"
#include "locator/abstract_replication_strategy.hh"
#include "sstables_loader.hh"
#include "db/view/view_builder.hh"
@@ -52,22 +52,10 @@ using namespace std::chrono_literals;
extern logging::logger apilog;
-namespace std {
-std::ostream& operator<<(std::ostream& os, const api::table_info& ti) {
-fmt::print(os, "table{{name={}, id={}}}", ti.name, ti.id);
-return os;
-}
-} // namespace std
namespace api {
-const locator::token_metadata& http_context::get_token_metadata() {
-return *shared_token_metadata.local().get();
-}
namespace ss = httpd::storage_service_json;
+namespace sp = httpd::storage_proxy_json;
using namespace json;
sstring validate_keyspace(http_context& ctx, sstring ks_name) {
@@ -220,32 +208,47 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition
});
}
-future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
+static future<json::json_return_type> set_tables(http_context& ctx, const sstring& keyspace, std::vector<sstring> tables, std::function<future<>(replica::table&)> set) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
-apilog.info("set_tables_autocompaction: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
-return do_with(keyspace, std::move(tables), [&ctx, enabled] (const sstring &keyspace, const std::vector<sstring>& tables) {
-return ctx.db.invoke_on(0, [&ctx, &keyspace, &tables, enabled] (replica::database& db) {
-auto g = replica::database::autocompaction_toggle_guard(db);
-return ctx.db.invoke_on_all([&keyspace, &tables, enabled] (replica::database& db) {
-return parallel_for_each(tables, [&db, &keyspace, enabled] (const sstring& table) {
-replica::column_family& cf = db.find_column_family(keyspace, table);
-if (enabled) {
-cf.enable_auto_compaction();
-} else {
-return cf.disable_auto_compaction();
-}
-return make_ready_future<>();
-});
-});
-}).finally([g = std::move(g)] {});
+return do_with(keyspace, std::move(tables), [&ctx, set] (const sstring& keyspace, const std::vector<sstring>& tables) {
+return ctx.db.invoke_on_all([&keyspace, &tables, set] (replica::database& db) {
+return parallel_for_each(tables, [&db, &keyspace, set] (const sstring& table) {
+replica::table& t = db.find_column_family(keyspace, table);
+return set(t);
+});
+});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
+future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
+apilog.info("set_tables_autocompaction: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
+return ctx.db.invoke_on(0, [&ctx, keyspace, tables = std::move(tables), enabled] (replica::database& db) {
+auto g = replica::database::autocompaction_toggle_guard(db);
+return set_tables(ctx, keyspace, tables, [enabled] (replica::table& cf) {
+if (enabled) {
+cf.enable_auto_compaction();
+} else {
+return cf.disable_auto_compaction();
+}
+return make_ready_future<>();
+}).finally([g = std::move(g)] {});
+});
+}
+future<json::json_return_type> set_tables_tombstone_gc(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
+apilog.info("set_tables_tombstone_gc: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
+return set_tables(ctx, keyspace, std::move(tables), [enabled] (replica::table& t) {
+t.set_tombstone_gc_enabled(enabled);
+return make_ready_future<>();
+});
+}
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl) {
ss::start_native_transport.set(r, [&ctl](std::unique_ptr<http::request> req) {
return smp::submit_to(0, [&] {
@@ -314,7 +317,7 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {
ss::repair_async.set(r, [&ctx, &repair](std::unique_ptr<http::request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
-"startToken", "endToken" };
+"startToken", "endToken", "ranges_parallelism"};
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
auto s = req->get_query_param(o);
@@ -459,29 +462,21 @@ static future<json::json_return_type> describe_ring_as_json(sharded<service::sto
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring(keyspace), token_range_endpoints_to_json));
}
-static std::vector<table_id> get_table_ids(const std::vector<table_info>& table_infos) {
-std::vector<table_id> table_ids{table_infos.size()};
-boost::transform(table_infos, table_ids.begin(), [] (const auto& ti) {
-return ti.id;
-});
-return table_ids;
-}
-void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks) {
-ss::local_hostid.set(r, [&ctx](std::unique_ptr<http::request> req) {
-auto id = ctx.db.local().get_config().host_id;
+void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
+ss::local_hostid.set(r, [&ss](std::unique_ptr<http::request> req) {
+auto id = ss.local().get_token_metadata().get_my_id();
return make_ready_future<json::json_return_type>(id.to_sstring());
});
-ss::get_tokens.set(r, [&ctx] (std::unique_ptr<http::request> req) {
-return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.get_token_metadata().sorted_tokens(), [](const dht::token& i) {
+ss::get_tokens.set(r, [&ss] (std::unique_ptr<http::request> req) {
+return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_token_metadata().sorted_tokens(), [](const dht::token& i) {
return fmt::to_string(i);
}));
});
-ss::get_node_tokens.set(r, [&ctx] (std::unique_ptr<http::request> req) {
+ss::get_node_tokens.set(r, [&ss] (std::unique_ptr<http::request> req) {
gms::inet_address addr(req->param["endpoint"]);
-return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.get_token_metadata().get_tokens(addr), [](const dht::token& i) {
+return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_token_metadata().get_tokens(addr), [](const dht::token& i) {
return fmt::to_string(i);
}));
});
@@ -549,8 +544,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
});
-ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
-return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());
+ss::get_leaving_nodes.set(r, [&ss](const_req req) {
+return container_to_vec(ss.local().get_token_metadata().get_leaving_endpoints());
});
ss::get_moving_nodes.set(r, [](const_req req) {
@@ -558,8 +553,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return container_to_vec(addr);
});
-ss::get_joining_nodes.set(r, [&ctx](const_req req) {
-auto points = ctx.get_token_metadata().get_bootstrap_tokens();
+ss::get_joining_nodes.set(r, [&ss](const_req req) {
+auto points = ss.local().get_token_metadata().get_bootstrap_tokens();
std::unordered_set<sstring> addr;
for (auto i: points) {
addr.insert(fmt::to_string(i.second));
@@ -619,7 +614,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::describe_any_ring.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) {
// Find an arbitrary non-system keyspace.
-auto keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();
+auto keyspaces = ctx.db.local().get_non_local_vnode_based_strategy_keyspaces();
if (keyspaces.empty()) {
throw std::runtime_error("No keyspace provided and no non system kespace exist");
}
@@ -631,9 +626,9 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return describe_ring_as_json(ss, validate_keyspace(ctx, req->param));
});
-ss::get_host_id_map.set(r, [&ctx](const_req req) {
+ss::get_host_id_map.set(r, [&ss](const_req req) {
std::vector<ss::mapper> res;
-return map_to_key_value(ctx.get_token_metadata().get_endpoint_to_host_id_map_for_reading(), res);
+return map_to_key_value(ss.local().get_token_metadata().get_endpoint_to_host_id_map_for_reading(), res);
});
ss::get_load.set(r, [&ctx](std::unique_ptr<http::request> req) {
@@ -653,9 +648,9 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
});
-ss::get_current_generation_number.set(r, [&g](std::unique_ptr<http::request> req) {
+ss::get_current_generation_number.set(r, [&ss](std::unique_ptr<http::request> req) {
gms::inet_address ep(utils::fb_utilities::get_broadcast_address());
-return g.get_current_generation_number(ep).then([](gms::generation_type res) {
+return ss.local().gossiper().get_current_generation_number(ep).then([](gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
@@ -666,11 +661,10 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
req.get_query_param("key")));
});
-ss::cdc_streams_check_and_repair.set(r, [&cdc_gs] (std::unique_ptr<http::request> req) {
-if (!cdc_gs.local_is_initialized()) {
-throw std::runtime_error("get_cdc_generation_service: not initialized yet");
-}
-return cdc_gs.local().check_and_repair_cdc_streams().then([] {
+ss::cdc_streams_check_and_repair.set(r, [&ss] (std::unique_ptr<http::request> req) {
+return ss.invoke_on(0, [] (service::storage_service& ss) {
+return ss.check_and_repair_cdc_streams();
+}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
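The hunk above reroutes cdc_streams_check_and_repair through `ss.invoke_on(0, ...)`: instead of touching the cdc generation service directly, the handler dispatches the work to the storage_service instance owned by shard 0. A minimal plain-C++ analogy of that shard-dispatch shape (this is not Seastar's real `sharded<>`, which runs each instance on its own reactor and returns futures; all names here are illustrative):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a Seastar sharded<T>: one service instance per shard.
template <typename Service>
class sharded {
    std::vector<Service> instances_;
public:
    explicit sharded(size_t shards) : instances_(shards) {}

    // Run `fn` against the instance owned by `shard` (cf. ss.invoke_on(0, ...)).
    template <typename Fn>
    auto invoke_on(size_t shard, Fn fn) {
        if (shard >= instances_.size()) {
            throw std::out_of_range("no such shard");
        }
        return fn(instances_[shard]);
    }
};

struct storage_service {
    int repairs_run = 0;
    std::string check_and_repair_cdc_streams() {
        ++repairs_run;  // cluster-wide operations run on shard 0 only
        return "ok";
    }
};

std::string handle_cdc_check_and_repair(sharded<storage_service>& ss) {
    // Mirrors the new handler shape: dispatch to shard 0, then report success.
    return ss.invoke_on(0, [](storage_service& s) {
        return s.check_and_repair_cdc_streams();
    });
}
```

The point of the refactor is that the handler no longer needs a separate `sharded<cdc::generation_service>&` parameter (and its "not initialized yet" check): storage_service owns the operation.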
@@ -682,7 +676,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
apilog.debug("force_keyspace_compaction: keyspace={} tables={}", keyspace, table_infos);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), db, get_table_ids(table_infos));
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos);
try {
co_await task->done();
} catch (...) {
@@ -705,7 +699,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, get_table_ids(table_infos));
auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos);
try {
co_await task->done();
} catch (...) {
@@ -720,7 +714,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);
bool res = false;
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, get_table_ids(table_infos), res);
auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, res);
try {
co_await task->done();
} catch (...) {
@@ -738,7 +732,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, get_table_ids(table_infos), exclude_current_version);
auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
try {
co_await task->done();
} catch (...) {
@@ -779,21 +773,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::remove_node.set(r, [&ss](std::unique_ptr<http::request> req) {
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
std::vector<sstring> ignore_nodes_strs = utils::split_comma_separated_list(req->get_query_param("ignore_nodes"));
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
for (std::string n : ignore_nodes_strs) {
for (const sstring& n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto hoep = locator::host_id_or_endpoint(n);
if (!ignore_nodes.empty() && hoep.has_host_id() != ignore_nodes.front().has_host_id()) {
throw std::runtime_error("All nodes should be identified using the same method: either Host IDs or ip addresses.");
}
ignore_nodes.push_back(std::move(hoep));
auto hoep = locator::host_id_or_endpoint(n);
if (!ignore_nodes.empty() && hoep.has_host_id() != ignore_nodes.front().has_host_id()) {
throw std::runtime_error("All nodes should be identified using the same method: either Host IDs or ip addresses.");
}
ignore_nodes.push_back(std::move(hoep));
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}: {}", ignore_nodes_strs, n, std::current_exception()));
}
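The remove_node hunk above replaces the hand-rolled quote-stripping and trimming with `utils::split_comma_separated_list`, then validates that every entry in `ignore_nodes` uses the same identification method. A small sketch of both steps in plain C++ (the splitter is a hypothetical stand-in for the real utility, and "host ID" detection is crudely approximated; `locator::host_id_or_endpoint` does real parsing):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for utils::split_comma_separated_list: split on
// commas, trim surrounding whitespace, drop empty items.
std::vector<std::string> split_comma_separated_list(const std::string& s) {
    std::vector<std::string> out;
    std::istringstream in(s);
    std::string item;
    while (std::getline(in, item, ',')) {
        auto b = item.find_first_not_of(" \t");
        auto e = item.find_last_not_of(" \t");
        if (b != std::string::npos) {
            out.push_back(item.substr(b, e - b + 1));
        }
    }
    return out;
}

// Mirrors the handler's rule: all entries must use the same method, either
// host IDs or IP addresses. Here "host ID" is approximated as "contains a
// dash" (UUID-ish) purely for illustration.
bool all_same_method(const std::vector<std::string>& nodes) {
    if (nodes.empty()) return true;
    bool first_is_id = nodes.front().find('-') != std::string::npos;
    for (const auto& n : nodes) {
        if ((n.find('-') != std::string::npos) != first_is_id) return false;
    }
    return true;
}
```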
@@ -906,11 +895,11 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(json_void());
});
ss::is_initialized.set(r, [&ss, &g](std::unique_ptr<http::request> req) {
return ss.local().get_operation_mode().then([&g] (auto mode) {
ss::is_initialized.set(r, [&ss](std::unique_ptr<http::request> req) {
return ss.local().get_operation_mode().then([&ss] (auto mode) {
bool is_initialized = mode >= service::storage_service::mode::STARTING;
if (mode == service::storage_service::mode::NORMAL) {
is_initialized = g.is_enabled();
is_initialized = ss.local().gossiper().is_enabled();
}
return make_ready_future<json::json_return_type>(is_initialized);
});
@@ -979,10 +968,9 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ks.set_incremental_backups(value);
}
for (auto& pair: db.get_column_families()) {
auto cf_ptr = pair.second;
cf_ptr->set_incremental_backups(value);
}
db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> table) {
table->set_incremental_backups(value);
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
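The incremental-backups hunk above switches from iterating the raw `get_column_families()` map to the visitor-style `get_tables_metadata().for_each_table()`. A toy sketch of that registry-with-visitor shape (all names simplified; the real table registry is sharded and far richer):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct table {
    bool incremental_backups = false;
};

// Instead of exposing its internal map (the old get_column_families() style),
// the registry offers a visitor, like the new for_each_table().
class tables_metadata {
    std::map<std::string, std::shared_ptr<table>> tables_;
public:
    void add(std::string id, std::shared_ptr<table> t) {
        tables_[std::move(id)] = std::move(t);
    }
    void for_each_table(const std::function<void(const std::string&, const std::shared_ptr<table>&)>& fn) const {
        for (const auto& [id, t] : tables_) {
            fn(id, t);
        }
    }
};

// Mirrors the handler body: flip the flag on every table via the visitor.
void set_incremental_backups(tables_metadata& md, bool value) {
    md.for_each_table([&](const std::string&, const std::shared_ptr<table>& t) {
        t->incremental_backups = value;
    });
}
```

The visitor keeps callers decoupled from the container type, so the registry's internals can change without touching every iteration site.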
@@ -1023,13 +1011,11 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(res);
});
ss::reset_local_schema.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {
ss::reset_local_schema.set(r, [&ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& fs = ctx.sp.local().features();
apilog.info("reset_local_schema");
return db::schema_tables::recalculate_schema_version(sys_ks, ctx.sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
co_await ss.local().reload_schema();
co_return json_void();
});
ss::set_trace_probability.set(r, [](std::unique_ptr<http::request> req) {
@@ -1111,6 +1097,22 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, true);
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, keyspace, tables, false);
});
ss::deliver_hints.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
@@ -1118,12 +1120,12 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(json_void());
});
ss::get_cluster_name.set(r, [&g](const_req req) {
return g.get_cluster_name();
ss::get_cluster_name.set(r, [&ss](const_req req) {
return ss.local().gossiper().get_cluster_name();
});
ss::get_partitioner_name.set(r, [&g](const_req req) {
return g.get_partitioner_name();
ss::get_partitioner_name.set(r, [&ss](const_req req) {
return ss.local().gossiper().get_partitioner_name();
});
ss::get_tombstone_warn_threshold.set(r, [](std::unique_ptr<http::request> req) {
@@ -1241,7 +1243,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto& ext = db.get_config().extensions();
for (auto& t : db.get_column_families() | boost::adaptors::map_values) {
db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> t) {
auto& schema = t->schema();
if ((ks.empty() || ks == schema->ks_name()) && (cf.empty() || cf == schema->cf_name())) {
// at most N sstables long
@@ -1257,7 +1259,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::sstable info;
info.timestamp = t;
info.generation = sstables::generation_value(sstable->generation());
info.generation = fmt::to_string(sstable->generation());
info.level = sstable->get_sstable_level();
info.size = sstable->bytes_on_disk();
info.data_size = sstable->ondisk_data_size();
@@ -1322,7 +1324,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
res.emplace_back(std::move(tst));
}
}
});
std::sort(res.begin(), res.end(), [](const ss::table_sstables& t1, const ss::table_sstables& t2) {
return t1.keyspace() < t2.keyspace() || (t1.keyspace() == t2.keyspace() && t1.table() < t2.table());
});
@@ -1332,6 +1334,123 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
});
});
ss::reload_raft_topology_state.set(r,
[&ss, &group0_client] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await ss.invoke_on(0, [&group0_client] (service::storage_service& ss) -> future<> {
apilog.info("Waiting for group 0 read/apply mutex before reloading Raft topology state...");
auto holder = co_await group0_client.hold_read_apply_mutex();
apilog.info("Reloading Raft topology state");
// Using topology_transition() instead of topology_state_load(), because the former notifies listeners
co_await ss.topology_transition();
apilog.info("Reloaded Raft topology state");
});
co_return json_void();
});
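The new reload_raft_topology_state handler above takes the group 0 read/apply mutex before reloading, logs progress, and releases the lock on scope exit. A synchronous sketch of that shape (a plain `std::lock_guard` stands in for the future-returning `hold_read_apply_mutex()`; names and the version counter are illustrative only):

```cpp
#include <mutex>
#include <string>
#include <vector>

struct raft_state {
    std::mutex read_apply_mutex;
    std::vector<std::string> log;
    int version = 0;
};

void reload_topology_state(raft_state& st) {
    st.log.push_back("waiting for read/apply mutex");
    // Blocks concurrent state appliers while we reload, then releases at
    // scope exit -- the RAII analogue of holding the group0 mutex.
    std::lock_guard<std::mutex> holder(st.read_apply_mutex);
    st.log.push_back("reloading");
    ++st.version;  // stands in for topology_transition(), which also notifies listeners
    st.log.push_back("reloaded");
}
```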
sp::get_schema_versions.set(r, [&ss](std::unique_ptr<http::request> req) {
return ss.local().describe_schema_versions().then([] (auto result) {
std::vector<sp::mapper_list> res;
for (auto e : result) {
sp::mapper_list entry;
entry.key = std::move(e.first);
entry.value = std::move(e.second);
res.emplace_back(std::move(entry));
}
return make_ready_future<json::json_return_type>(std::move(res));
});
});
}
void unset_storage_service(http_context& ctx, routes& r) {
ss::local_hostid.unset(r);
ss::get_tokens.unset(r);
ss::get_node_tokens.unset(r);
ss::get_commitlog.unset(r);
ss::get_token_endpoint.unset(r);
ss::toppartitions_generic.unset(r);
ss::get_leaving_nodes.unset(r);
ss::get_moving_nodes.unset(r);
ss::get_joining_nodes.unset(r);
ss::get_release_version.unset(r);
ss::get_scylla_release_version.unset(r);
ss::get_schema_version.unset(r);
ss::get_all_data_file_locations.unset(r);
ss::get_saved_caches_location.unset(r);
ss::get_range_to_endpoint_map.unset(r);
ss::get_pending_range_to_endpoint_map.unset(r);
ss::describe_any_ring.unset(r);
ss::describe_ring.unset(r);
ss::get_host_id_map.unset(r);
ss::get_load.unset(r);
ss::get_load_map.unset(r);
ss::get_current_generation_number.unset(r);
ss::get_natural_endpoints.unset(r);
ss::cdc_streams_check_and_repair.unset(r);
ss::force_keyspace_compaction.unset(r);
ss::force_keyspace_cleanup.unset(r);
ss::perform_keyspace_offstrategy_compaction.unset(r);
ss::upgrade_sstables.unset(r);
ss::force_keyspace_flush.unset(r);
ss::decommission.unset(r);
ss::move.unset(r);
ss::remove_node.unset(r);
ss::get_removal_status.unset(r);
ss::force_remove_completion.unset(r);
ss::set_logging_level.unset(r);
ss::get_logging_levels.unset(r);
ss::get_operation_mode.unset(r);
ss::is_starting.unset(r);
ss::get_drain_progress.unset(r);
ss::drain.unset(r);
ss::truncate.unset(r);
ss::get_keyspaces.unset(r);
ss::stop_gossiping.unset(r);
ss::start_gossiping.unset(r);
ss::is_gossip_running.unset(r);
ss::stop_daemon.unset(r);
ss::is_initialized.unset(r);
ss::join_ring.unset(r);
ss::is_joined.unset(r);
ss::set_stream_throughput_mb_per_sec.unset(r);
ss::get_stream_throughput_mb_per_sec.unset(r);
ss::get_compaction_throughput_mb_per_sec.unset(r);
ss::set_compaction_throughput_mb_per_sec.unset(r);
ss::is_incremental_backups_enabled.unset(r);
ss::set_incremental_backups_enabled.unset(r);
ss::rebuild.unset(r);
ss::bulk_load.unset(r);
ss::bulk_load_async.unset(r);
ss::reschedule_failed_deletions.unset(r);
ss::sample_key_range.unset(r);
ss::reset_local_schema.unset(r);
ss::set_trace_probability.unset(r);
ss::get_trace_probability.unset(r);
ss::get_slow_query_info.unset(r);
ss::set_slow_query.unset(r);
ss::enable_auto_compaction.unset(r);
ss::disable_auto_compaction.unset(r);
ss::enable_tombstone_gc.unset(r);
ss::disable_tombstone_gc.unset(r);
ss::deliver_hints.unset(r);
ss::get_cluster_name.unset(r);
ss::get_partitioner_name.unset(r);
ss::get_tombstone_warn_threshold.unset(r);
ss::set_tombstone_warn_threshold.unset(r);
ss::get_tombstone_failure_threshold.unset(r);
ss::set_tombstone_failure_threshold.unset(r);
ss::get_batch_size_failure_threshold.unset(r);
ss::set_batch_size_failure_threshold.unset(r);
ss::set_hinted_handoff_throttle_in_kb.unset(r);
ss::get_metrics_load.unset(r);
ss::get_exceptions.unset(r);
ss::get_total_hints_in_progress.unset(r);
ss::get_total_hints.unset(r);
ss::get_ownership.unset(r);
ss::get_effective_ownership.unset(r);
ss::sstable_info.unset(r);
ss::reload_raft_topology_state.unset(r);
sp::get_schema_versions.unset(r);
}
enum class scrub_status {
@@ -1494,27 +1613,12 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
throw httpd::bad_param_exception(fmt::format("Unknown argument for 'quarantine_mode' parameter: {}", quarantine_mode_str));
}
const auto& reduce_compaction_stats = [] (const compaction_manager::compaction_stats_opt& lhs, const compaction_manager::compaction_stats_opt& rhs) {
sstables::compaction_stats stats{};
stats += lhs.value();
stats += rhs.value();
return stats;
};
sstables::compaction_stats stats;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<scrub_sstables_compaction_task_impl>({}, std::move(keyspace), db, column_families, opts, stats);
try {
auto opt_stats = co_await db.map_reduce0([&] (replica::database& db) {
return map_reduce(column_families, [&] (sstring cfname) -> future<std::optional<sstables::compaction_stats>> {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
sstables::compaction_stats stats{};
co_await cf.parallel_foreach_table_state([&] (compaction::table_state& ts) mutable -> future<> {
auto r = co_await cm.perform_sstable_scrub(ts, opts);
stats += r.value_or(sstables::compaction_stats{});
});
co_return stats;
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
if (opt_stats && opt_stats->validation_errors) {
co_await task->done();
if (stats.validation_errors) {
co_return json::json_return_type(static_cast<int>(scrub_status::validation_errors));
}
} catch (const sstables::compaction_aborted_exception&) {


@@ -25,7 +25,6 @@ class system_keyspace;
}
namespace netw { class messaging_service; }
class repair_service;
namespace cdc { class generation_service; }
class sstables_loader;
namespace gms {
@@ -51,11 +50,6 @@ sstring validate_keyspace(http_context& ctx, const httpd::parameters& param);
// If the parameter is found and empty, returns a list of all table names in the keyspace.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
struct table_info {
sstring name;
table_id id;
};
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
@@ -63,7 +57,8 @@ struct table_info {
// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ls);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, service::raft_group0_client&);
void unset_storage_service(http_context& ctx, httpd::routes& r);
void set_sstables_loader(http_context& ctx, httpd::routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, httpd::routes& r);
void set_view_builder(http_context& ctx, httpd::routes& r, sharded<db::view::view_builder>& vb);
@@ -79,9 +74,3 @@ void unset_snapshot(http_context& ctx, httpd::routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
} // namespace api
namespace std {
std::ostream& operator<<(std::ostream& os, const api::table_info& ti);
} // namespace std


@@ -7,10 +7,18 @@
*/
#include "api/api-doc/system.json.hh"
#include "api/api-doc/metrics.json.hh"
#include "api/api.hh"
#include <seastar/core/reactor.hh>
#include <seastar/core/metrics_api.hh>
#include <seastar/core/relabel_config.hh>
#include <seastar/http/exception.hh>
#include <seastar/util/short_streams.hh>
#include <seastar/http/short_streams.hh>
#include "utils/rjson.hh"
#include "log.hh"
#include "replica/database.hh"
@@ -20,8 +28,77 @@ namespace api {
using namespace seastar::httpd;
namespace hs = httpd::system_json;
namespace hm = httpd::metrics_json;
void set_system(http_context& ctx, routes& r) {
hm::get_metrics_config.set(r, [](const_req req) {
std::vector<hm::metrics_config> res;
res.resize(seastar::metrics::get_relabel_configs().size());
size_t i = 0;
for (auto&& r : seastar::metrics::get_relabel_configs()) {
res[i].action = r.action;
res[i].target_label = r.target_label;
res[i].replacement = r.replacement;
res[i].separator = r.separator;
res[i].source_labels = r.source_labels;
res[i].regex = r.expr.str();
i++;
}
return res;
});
hm::set_metrics_config.set(r, [](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
rapidjson::Document doc;
doc.Parse(req->content.c_str());
if (!doc.IsArray()) {
throw bad_param_exception("Expected a json array");
}
std::vector<seastar::metrics::relabel_config> relabels;
relabels.resize(doc.Size());
for (rapidjson::SizeType i = 0; i < doc.Size(); i++) {
const auto& element = doc[i];
if (element.HasMember("source_labels")) {
std::vector<std::string> source_labels;
source_labels.resize(element["source_labels"].Size());
for (size_t j = 0; j < element["source_labels"].Size(); j++) {
source_labels[j] = element["source_labels"][j].GetString();
}
relabels[i].source_labels = source_labels;
}
if (element.HasMember("action")) {
relabels[i].action = seastar::metrics::relabel_config_action(element["action"].GetString());
}
if (element.HasMember("replacement")) {
relabels[i].replacement = element["replacement"].GetString();
}
if (element.HasMember("separator")) {
relabels[i].separator = element["separator"].GetString();
}
if (element.HasMember("target_label")) {
relabels[i].target_label = element["target_label"].GetString();
}
if (element.HasMember("regex")) {
relabels[i].expr = element["regex"].GetString();
}
}
return do_with(std::move(relabels), false, [](const std::vector<seastar::metrics::relabel_config>& relabels, bool& failed) {
return smp::invoke_on_all([&relabels, &failed] {
return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {
if (result.metrics_relabeled_due_to_collision > 0) {
failed = true;
}
return;
});
}).then([&failed](){
if (failed) {
throw bad_param_exception("conflicts found during relabeling");
}
return make_ready_future<json::json_return_type>(seastar::json::json_void());
});
});
});
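The set_metrics_config handler above copies only the JSON fields that are actually present in each array element (`HasMember(...)` guards), so absent keys keep the relabel config's defaults. A sketch of that optional-field pattern with a string map standing in for the parsed rapidjson object (field names follow the handler; the defaults here are assumptions, not Seastar's documented ones):

```cpp
#include <map>
#include <string>

// Simplified relabel config; the real seastar::metrics::relabel_config has
// the same optional-field flavor but different member types.
struct relabel_config {
    std::string action = "report";   // assumed default
    std::string target_label;
    std::string replacement = "$1";  // assumed default
    std::string regex;
};

// Stand-in for one parsed JSON object: only keys present in the request
// appear in the map, so absent keys keep the struct's defaults.
using json_object = std::map<std::string, std::string>;

relabel_config from_json(const json_object& element) {
    relabel_config cfg;
    if (auto it = element.find("action"); it != element.end()) cfg.action = it->second;
    if (auto it = element.find("target_label"); it != element.end()) cfg.target_label = it->second;
    if (auto it = element.find("replacement"); it != element.end()) cfg.replacement = it->second;
    if (auto it = element.find("regex"); it != element.end()) cfg.regex = it->second;
    return cfg;
}
```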
hs::get_system_uptime.set(r, [](const_req req) {
return std::chrono::duration_cast<std::chrono::milliseconds>(engine().uptime()).count();
});


@@ -44,6 +44,7 @@ struct task_stats {
: task_id(task->id().to_sstring())
, state(task->get_status().state)
, type(task->type())
, scope(task->get_status().scope)
, keyspace(task->get_status().keyspace)
, table(task->get_status().table)
, entity(task->get_status().entity)
@@ -53,6 +54,7 @@ struct task_stats {
sstring task_id;
tasks::task_manager::task_state state;
std::string type;
std::string scope;
std::string keyspace;
std::string table;
std::string entity;
@@ -69,6 +71,7 @@ tm::task_status make_status(full_task_status status) {
tm::task_status res{};
res.id = status.task_status.id.to_sstring();
res.type = status.type;
res.scope = status.task_status.scope;
res.state = status.task_status.state;
res.is_abortable = bool(status.abortable);
res.start_time = st;
@@ -108,18 +111,23 @@ future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task
co_return s;
}
void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
tm::get_modules.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
std::vector<std::string> v = boost::copy_range<std::vector<std::string>>(ctx.tm.local().get_modules() | boost::adaptors::map_keys);
void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg) {
tm::get_modules.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
std::vector<std::string> v = boost::copy_range<std::vector<std::string>>(tm.local().get_modules() | boost::adaptors::map_keys);
co_return v;
});
tm::get_tasks.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tm::get_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
using chunked_stats = utils::chunked_vector<task_stats>;
auto internal = tasks::is_internal{req_param<bool>(*req, "internal", false)};
std::vector<chunked_stats> res = co_await ctx.tm.map([&req, internal] (tasks::task_manager& tm) {
std::vector<chunked_stats> res = co_await tm.map([&req, internal] (tasks::task_manager& tm) {
chunked_stats local_res;
auto module = tm.find_module(req->param["module"]);
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->param["module"]);
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
const auto& filtered_tasks = module->get_tasks() | boost::adaptors::filtered([&params = req->query_parameters, internal] (const auto& task) {
return (internal || !task.second->is_internal()) && filter_tasks(task.second, params);
});
@@ -148,57 +156,76 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
co_return std::move(f);
});
tm::get_task_status.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tm::get_task_status.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
auto state = task->get_status().state;
if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {
task->unregister_task();
}
co_return std::move(task);
}));
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
if (task->is_complete()) {
task->unregister_task();
}
co_return std::move(task);
}));
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
auto s = co_await retrieve_status(task);
co_return make_status(s);
});
tm::abort_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tm::abort_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
});
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
co_return json_void();
});
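The task-manager hunks above all apply the same fix: wrap `invoke_on_task` in a try/catch and rethrow `task_not_found` as `bad_param_exception`, so an unknown task ID reaches the client as a 400-style error instead of leaking as an internal failure. The pattern in isolation (exception types here are hypothetical stand-ins for the real ones):

```cpp
#include <functional>
#include <stdexcept>
#include <string>

struct task_not_found : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct bad_param_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Run a task operation, translating the internal "no such task" error into
// a client-facing bad-parameter error, as get_task_status/abort_task/
// wait_task now do.
std::string run_task_op(const std::function<std::string()>& op) {
    try {
        return op();
    } catch (const task_not_found& e) {
        throw bad_param_exception(e.what());
    }
}
```

Other exception types still propagate unchanged, so genuine server-side failures are not masked as client errors.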
tm::wait_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tm::wait_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
f.get();
return make_foreign(task);
});
}));
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
return make_foreign(task);
});
}));
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
auto s = co_await retrieve_status(task);
co_return make_status(s);
});
tm::get_task_status_recursively.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& _ctx = ctx;
tm::get_task_status_recursively.set(r, [&_tm = tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& tm = _tm;
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
std::queue<tasks::task_manager::foreign_task_ptr> q;
utils::chunked_vector<full_task_status> res;
// Get requested task.
auto task = co_await tasks::task_manager::invoke_on_task(_ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
auto state = task->get_status().state;
if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {
task->unregister_task();
}
co_return task;
}));
tasks::task_manager::foreign_task_ptr task;
try {
// Get requested task.
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
if (task->is_complete()) {
task->unregister_task();
}
co_return task;
}));
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
// Push children's statuses in BFS order.
q.push(co_await task.copy()); // Task cannot be moved since we need it to be alive during whole loop execution.
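As the comment above notes, get_task_status_recursively keeps the root task alive and walks its children with a queue, emitting statuses in BFS order. The traversal itself, stripped of the cross-shard foreign_task_ptr machinery (toy node type, illustrative only):

```cpp
#include <queue>
#include <string>
#include <vector>

// Toy task node; the real code walks foreign_task_ptr children across shards.
struct task {
    std::string id;
    std::vector<const task*> children;
};

// Collect ids in BFS order: the requested task first, then its children
// level by level, mirroring the handler's queue-based loop.
std::vector<std::string> collect_bfs(const task& root) {
    std::vector<std::string> out;
    std::queue<const task*> q;
    q.push(&root);
    while (!q.empty()) {
        const task* t = q.front();
        q.pop();
        out.push_back(t->id);
        for (const task* c : t->children) {
            q.push(c);
        }
    }
    return out;
}
```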
@@ -228,9 +255,23 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
tm::get_and_update_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
uint32_t ttl = cfg.task_ttl_seconds();
co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);
try {
co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
co_return json::json_return_type(ttl);
});
}
void unset_task_manager(http_context& ctx, routes& r) {
tm::get_modules.unset(r);
tm::get_tasks.unset(r);
tm::get_task_status.unset(r);
tm::abort_task.unset(r);
tm::wait_task.unset(r);
tm::get_task_status_recursively.unset(r);
tm::get_and_update_ttl.unset(r);
}
}


@@ -8,11 +8,17 @@
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/config.hh"
namespace tasks {
class task_manager;
}
namespace api {
void set_task_manager(http_context& ctx, httpd::routes& r, db::config& cfg);
void set_task_manager(http_context& ctx, httpd::routes& r, sharded<tasks::task_manager>& tm, db::config& cfg);
void unset_task_manager(http_context& ctx, httpd::routes& r);
}


@@ -20,17 +20,17 @@ namespace tmt = httpd::task_manager_test_json;
using namespace json;
using namespace seastar::httpd;
void set_task_manager_test(http_context& ctx, routes& r) {
tmt::register_test_module.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) {
void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm) {
tmt::register_test_module.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await tm.invoke_on_all([] (tasks::task_manager& tm) {
auto m = make_shared<tasks::test_module>(tm);
tm.register_module("test", m);
});
co_return json_void();
});
tmt::unregister_test_module.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) -> future<> {
tmt::unregister_test_module.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await tm.invoke_on_all([] (tasks::task_manager& tm) -> future<> {
auto module_name = "test";
auto module = tm.find_module(module_name);
co_await module->stop();
@@ -38,8 +38,8 @@ void set_task_manager_test(http_context& ctx, routes& r) {
co_return json_void();
});
tmt::register_test_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
sharded<tasks::task_manager>& tms = ctx.tm;
tmt::register_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
sharded<tasks::task_manager>& tms = tm;
auto it = req->query_parameters.find("task_id");
auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();
it = req->query_parameters.find("shard");
@@ -54,7 +54,7 @@ void set_task_manager_test(http_context& ctx, routes& r) {
tasks::task_info data;
if (it != req->query_parameters.end()) {
data.id = tasks::task_id{utils::UUID{it->second}};
auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(ctx.tm, data.id);
auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(tm, data.id);
data.shard = parent_ptr->get_status().shard;
}
@@ -69,34 +69,50 @@ void set_task_manager_test(http_context& ctx, routes& r) {
co_return id.to_sstring();
});
tmt::unregister_test_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tmt::unregister_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
});
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
co_return json_void();
});
tmt::finish_test_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
tmt::finish_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto it = req->query_parameters.find("error");
bool fail = it != req->query_parameters.end();
std::string error = fail ? it->second : "";
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
}
return make_ready_future<>();
});
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
}
return make_ready_future<>();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
co_return json_void();
});
}
void unset_task_manager_test(http_context& ctx, routes& r) {
tmt::register_test_module.unset(r);
tmt::unregister_test_module.unset(r);
tmt::register_test_task.unset(r);
tmt::unregister_test_task.unset(r);
tmt::finish_test_task.unset(r);
}
}
#endif


@@ -10,11 +10,17 @@
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
namespace tasks {
class task_manager;
}
namespace api {
void set_task_manager_test(http_context& ctx, httpd::routes& r);
void set_task_manager_test(http_context& ctx, httpd::routes& r, sharded<tasks::task_manager>& tm);
void unset_task_manager_test(http_context& ctx, httpd::routes& r);
}


@@ -7,6 +7,7 @@ target_sources(scylla_auth
allow_all_authorizer.cc
authenticated_user.cc
authenticator.cc
certificate_authenticator.cc
common.cc
default_authorizer.cc
password_authenticator.cc
@@ -30,6 +31,7 @@ target_link_libraries(scylla_auth
PRIVATE
cql3
idl
wasmtime_bindings)
wasmtime_bindings
libxcrypt::libxcrypt)
add_whole_archive(auth scylla_auth)


@@ -35,16 +35,9 @@ public:
///
authenticated_user() = default;
explicit authenticated_user(std::string_view name);
friend bool operator==(const authenticated_user&, const authenticated_user&) noexcept = default;
};
inline bool operator==(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return u1.name == u2.name;
}
inline bool operator!=(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return !(u1 == u2);
}
const authenticated_user& anonymous_user() noexcept;
inline bool is_anonymous(const authenticated_user& u) noexcept {


@@ -18,3 +18,7 @@
const sstring auth::authenticator::USERNAME_KEY("username");
const sstring auth::authenticator::PASSWORD_KEY("password");
future<std::optional<auth::authenticated_user>> auth::authenticator::authenticate(session_dn_func) const {
return make_ready_future<std::optional<auth::authenticated_user>>(std::nullopt);
}


@@ -15,6 +15,8 @@
#include <set>
#include <stdexcept>
#include <unordered_map>
#include <optional>
#include <functional>
#include <seastar/core/enum.hh>
#include <seastar/core/future.hh>
@@ -36,6 +38,16 @@ namespace auth {
class authenticated_user;
// Query alt name info as a single (subject style) string
using alt_name_func = std::function<future<std::string>()>;
struct certificate_info {
std::string subject;
alt_name_func get_alt_names;
};
using session_dn_func = std::function<future<std::optional<certificate_info>>()>;
///
/// Abstract client for authenticating role identity.
///
@@ -87,6 +99,13 @@ public:
///
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const = 0;
///
/// Authenticate (early) using transport info
///
/// \returns nullopt if not supported/required. exceptional future if failed
///
virtual future<std::optional<authenticated_user>> authenticate(session_dn_func) const;
///
/// Create an authentication record for a new user. This is required before the user can log-in.
///


@@ -39,10 +39,6 @@ inline bool operator==(const permission_details& pd1, const permission_details&
== std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions.mask());
}
inline bool operator!=(const permission_details& pd1, const permission_details& pd2) {
return !(pd1 == pd2);
}
inline bool operator<(const permission_details& pd1, const permission_details& pd2) {
return std::forward_as_tuple(pd1.role_name, pd1.resource, pd1.permissions)
< std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions);


@@ -0,0 +1,181 @@
/*
* Copyright (C) 2022-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "auth/certificate_authenticator.hh"
#include <regex>
#include "utils/class_registrator.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
static const auto CERT_AUTH_NAME = "com.scylladb.auth.CertificateAuthenticator";
const std::string_view auth::certificate_authenticator_name(CERT_AUTH_NAME);
static logging::logger clogger("certificate_authenticator");
static const std::string cfg_source_attr = "source";
static const std::string cfg_query_attr = "query";
static const std::string cfg_source_subject = "SUBJECT";
static const std::string cfg_source_altname = "ALTNAME";
static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::migration_manager&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::migration_manager&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();
if (queries.empty()) {
throw std::invalid_argument("No role extraction queries specified.");
}
std::vector<std::pair<query_source, boost::regex>> res;
for (auto& map : queries) {
// first, check for any invalid config keys
if (map.size() == 2) {
try {
auto& source = map.at(cfg_source_attr);
std::string query = map.at(cfg_query_attr);
std::transform(source.begin(), source.end(), source.begin(), ::toupper);
boost::regex ex(query);
if (ex.mark_count() != 1) {
throw std::invalid_argument("Role query must have exactly one mark expression");
}
clogger.debug("Append role query: {} : {}", source, query);
if (source == cfg_source_subject) {
res.emplace_back(query_source::subject, std::move(ex));
} else if (source == cfg_source_altname) {
res.emplace_back(query_source::altname, std::move(ex));
} else {
throw std::invalid_argument(fmt::format("Invalid source: {}", map.at(cfg_source_attr)));
}
continue;
} catch (std::out_of_range&) {
// just fallthrough
} catch (std::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}
throw std::invalid_argument(fmt::format("Invalid query: {}", map));
}
return res;
}())
{}
auth::certificate_authenticator::~certificate_authenticator() = default;
future<> auth::certificate_authenticator::start() {
co_return;
}
future<> auth::certificate_authenticator::stop() {
co_return;
}
std::string_view auth::certificate_authenticator::qualified_java_name() const {
return certificate_authenticator_name;
}
bool auth::certificate_authenticator::require_authentication() const {
return true;
}
auth::authentication_option_set auth::certificate_authenticator::supported_options() const {
return {};
}
auth::authentication_option_set auth::certificate_authenticator::alterable_options() const {
return {};
}
future<std::optional<auth::authenticated_user>> auth::certificate_authenticator::authenticate(session_dn_func f) const {
if (!f) {
co_return std::nullopt;
}
auto dninfo = co_await f();
if (!dninfo) {
throw exceptions::authentication_exception("No valid certificate found");
}
auto& subject = dninfo->subject;
std::optional<std::string> altname;
const std::string* source_str = nullptr;
for (auto& [source, expr] : _queries) {
switch (source) {
default:
case query_source::subject:
source_str = &subject;
break;
case query_source::altname:
if (!altname) {
altname = dninfo->get_alt_names ? co_await dninfo->get_alt_names() : std::string{};
}
source_str = &*altname;
break;
}
clogger.debug("Checking {}: {}", int(source), *source_str);
boost::smatch m;
if (boost::regex_search(*source_str, m, expr)) {
auto username = m[1].str();
clogger.debug("Return username: {}", username);
co_return username;
}
}
throw exceptions::authentication_exception(format("Subject '{}'/'{}' does not match any query expression", subject, altname));
}
future<auth::authenticated_user> auth::certificate_authenticator::authenticate(const credentials_map&) const {
throw exceptions::authentication_exception("Cannot authenticate using attribute map");
}
future<> auth::certificate_authenticator::create(std::string_view role_name, const authentication_options& options) const {
// TODO: should we keep track of roles/enforce existence? Role manager should deal with this...
co_return;
}
future<> auth::certificate_authenticator::alter(std::string_view role_name, const authentication_options& options) const {
co_return;
}
future<> auth::certificate_authenticator::drop(std::string_view role_name) const {
co_return;
}
future<auth::custom_options> auth::certificate_authenticator::query_custom_options(std::string_view) const {
co_return auth::custom_options{};
}
const auth::resource_set& auth::certificate_authenticator::protected_resources() const {
static const resource_set resources;
return resources;
}
::shared_ptr<auth::sasl_challenge> auth::certificate_authenticator::new_sasl_challenge() const {
throw exceptions::authentication_exception("Login authentication not supported");
}


@@ -0,0 +1,62 @@
/*
* Copyright (C) 2022-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <boost/regex.hpp>
#include "auth/authenticator.hh"
namespace cql3 {
class query_processor;
} // namespace cql3
namespace service {
class migration_manager;
}
namespace auth {
extern const std::string_view certificate_authenticator_name;
class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::migration_manager&);
~certificate_authenticator();
future<> start() override;
future<> stop() override;
std::string_view qualified_java_name() const override;
bool require_authentication() const override;
authentication_option_set supported_options() const override;
authentication_option_set alterable_options() const override;
future<authenticated_user> authenticate(const credentials_map& credentials) const override;
future<std::optional<authenticated_user>> authenticate(session_dn_func) const override;
future<> create(std::string_view role_name, const authentication_options& options) const override;
future<> alter(std::string_view role_name, const authentication_options& options) const override;
future<> drop(std::string_view role_name) const override;
future<custom_options> query_custom_options(std::string_view role_name) const override;
const resource_set& protected_resources() const override;
::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
private:
};
}


@@ -71,7 +71,8 @@ static future<> create_metadata_table_if_missing_impl(
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
try {
co_return co_await mm.announce(co_await mm.prepare_new_column_family_announcement(table, ts), std::move(group0_guard));
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (exceptions::already_exists_exception&) {}
}
}
@@ -84,20 +85,6 @@ future<> create_metadata_table_if_missing(
return futurize_invoke(create_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const replica::database& db, seastar::abort_source& as) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db, &as] {
as.check();
return db.get_version() != replica::database::empty_version;
}, pause).then([&mm, &as] {
return do_until([&mm, &as] {
as.check();
return mm.have_schema_agreement();
}, pause);
});
}
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.


@@ -22,7 +22,6 @@
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
#include "service/query_state.hh"
using namespace std::chrono_literals;
@@ -32,6 +31,7 @@ class database;
namespace service {
class migration_manager;
class query_state;
}
namespace cql3 {
@@ -67,8 +67,6 @@ future<> create_metadata_table_if_missing(
std::string_view cql,
::service::migration_manager&) noexcept;
future<> wait_for_schema_agreement(::service::migration_manager&, const replica::database&, seastar::abort_source&);
///
/// Time-outs for internal, non-local CQL queries.
///


@@ -129,7 +129,7 @@ future<> default_authorizer::start() {
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().real_database(), _as).get0();
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get0();
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {


@@ -29,6 +29,7 @@
#include "utils/class_registrator.hh"
#include "replica/database.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
namespace auth {
@@ -50,14 +51,23 @@ static const class_registrator<
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return !row.get_or<sstring>(SALTED_HASH, "").empty();
@@ -106,13 +116,17 @@ future<> password_authenticator::migrate_legacy_metadata() const {
}
future<> password_authenticator::create_default_if_missing() const {
return default_role_row_satisfies(_qp, &has_salted_hash).then([this](bool exists) {
return default_role_row_satisfies(_qp, &has_salted_hash, _superuser).then([this](bool exists) {
if (!exists) {
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);
}
return _qp.execute_internal(
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME},
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
@@ -132,9 +146,9 @@ future<> password_authenticator::start() {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().real_database(), _as).get0();
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash, _superuser).get0()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
@@ -161,6 +175,8 @@ future<> password_authenticator::stop() {
}
db::consistency_level password_authenticator::consistency_for_user(std::string_view role_name) {
// TODO: this is plain dung. Why treat hardcoded default special, but for example a user-created
// super user uses plain LOCAL_ONE?
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}


@@ -14,6 +14,10 @@
#include "auth/authenticator.hh"
namespace db {
class config;
}
namespace cql3 {
class query_processor;
@@ -33,9 +37,11 @@ class password_authenticator : public authenticator {
::service::migration_manager& _migration_manager;
future<> _stopped;
seastar::abort_source _as;
std::string _superuser;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::migration_manager&);


@@ -79,6 +79,13 @@ static permission_set applicable_permissions(const service_level_resource_view &
}
static permission_set applicable_permissions(const functions_resource_view& fv) {
if (fv.function_name() || fv.function_signature()) {
return permission_set::of<
permission::ALTER,
permission::DROP,
permission::AUTHORIZE,
permission::EXECUTE>();
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
@@ -292,7 +299,7 @@ std::optional<std::vector<std::string_view>> functions_resource_view::function_a
std::vector<std::string_view> parts;
if (_resource._parts[3] == "") {
return {};
return parts;
}
for (size_t i = 3; i < _resource._parts.size(); i++) {
parts.push_back(_resource._parts[i]);


@@ -117,20 +117,12 @@ private:
friend class functions_resource_view;
friend bool operator<(const resource&, const resource&);
friend bool operator==(const resource&, const resource&);
friend bool operator==(const resource&, const resource&) = default;
friend resource parse_resource(std::string_view);
};
bool operator<(const resource&, const resource&);
inline bool operator==(const resource& r1, const resource& r2) {
return (r1._kind == r2._kind) && (r1._parts == r2._parts);
}
inline bool operator!=(const resource& r1, const resource& r2) {
return !(r1 == r2);
}
std::ostream& operator<<(std::ostream&, const resource&);
class resource_kind_mismatch : public std::invalid_argument {


@@ -17,10 +17,6 @@ std::ostream& operator<<(std::ostream& os, const role_or_anonymous& mr) {
return os;
}
bool operator==(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return mr1.name == mr2.name;
}
bool is_anonymous(const role_or_anonymous& mr) noexcept {
return !mr.name.has_value();
}


@@ -26,16 +26,11 @@ public:
role_or_anonymous() = default;
role_or_anonymous(std::string_view name) : name(name) {
}
friend bool operator==(const role_or_anonymous&, const role_or_anonymous&) noexcept = default;
};
std::ostream& operator<<(std::ostream&, const role_or_anonymous&);
bool operator==(const role_or_anonymous&, const role_or_anonymous&) noexcept;
inline bool operator!=(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return !(mr1 == mr2);
}
bool is_anonymous(const role_or_anonymous&) noexcept;
}


@@ -46,59 +46,43 @@ constexpr std::string_view qualified_name("system_auth.roles");
future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.execute_internal(
query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return make_ready_future<bool>(false);
}
return make_ready_future<bool>(p(results->one()));
});
}
return make_ready_future<bool>(p(results->one()));
});
});
for (auto cl : { db::consistency_level::ONE, db::consistency_level::QUORUM }) {
auto results = co_await qp.execute_internal(query, cl
, internal_distributed_query_state()
, {rolename.value_or(std::string(meta::DEFAULT_SUPERUSER_NAME))}
, cql3::query_processor::cache_internal::yes
);
if (!results->empty()) {
co_return p(results->one());
}
}
co_return false;
}
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
static const sstring query = format("SELECT * FROM {}", meta::roles_table::qualified_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}
auto results = co_await qp.execute_internal(query, db::consistency_level::QUORUM
, internal_distributed_query_state(), cql3::query_processor::cache_internal::no
);
if (results->empty()) {
co_return false;
}
static const sstring col_name = sstring(meta::roles_table::role_col_name);
static const sstring col_name = sstring(meta::roles_table::role_col_name);
return boost::algorithm::any_of(*results, [&p](const cql3::untyped_result_set_row& row) {
const bool is_nondefault = row.get_as<sstring>(col_name) != meta::DEFAULT_SUPERUSER_NAME;
return is_nondefault && p(row);
});
});
co_return boost::algorithm::any_of(*results, [&](const cql3::untyped_result_set_row& row) {
auto superuser = rolename ? std::string_view(*rolename) : meta::DEFAULT_SUPERUSER_NAME;
const bool is_nondefault = row.get_as<sstring>(col_name) != superuser;
return is_nondefault && p(row);
});
}


@@ -43,13 +43,17 @@ constexpr std::string_view role_col_name{"role", 4};
///
future<bool> default_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
///
/// Check that any nondefault role satisfies a predicate. `false` if no nondefault roles exist.
///
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
}


@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include "auth/resource.hh"
#include "auth/service.hh"
#include <algorithm>
@@ -20,6 +21,7 @@
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "auth/role_or_anonymous.hh"
#include "cql3/functions/function_name.hh"
#include "cql3/functions/functions.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
@@ -66,6 +68,7 @@ private:
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_update_tablet_metadata() override {}
void on_drop_keyspace(const sstring& ks_name) override {
// Do it in the background.
@@ -75,6 +78,12 @@ private:
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped keyspace: {}", e);
});
(void)_authorizer.revoke_all(
auth::make_functions_resource(ks_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on functions in dropped keyspace: {}", e);
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
@@ -89,8 +98,22 @@ private:
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {
(void)_authorizer.revoke_all(
auth::make_functions_resource(ks_name, function_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped function: {}", e);
});
}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {
(void)_authorizer.revoke_all(
auth::make_functions_resource(ks_name, aggregate_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped aggregate: {}", e);
});
}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
@@ -155,7 +178,8 @@ future<> service::create_keyspace_if_missing(::service::migration_manager& mm) c
opts,
true);
co_return co_await mm.announce(mm.prepare_new_keyspace_announcement(ksm, ts), std::move(group0_guard));
co_return co_await mm.announce(::service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts),
std::move(group0_guard), format("auth_service: create {} keyspace", meta::AUTH_KS));
}
}
}


@@ -28,6 +28,8 @@
#include "log.hh"
#include "utils/class_registrator.hh"
#include "replica/database.hh"
#include "service/migration_manager.hh"
#include "password_authenticator.hh"
namespace auth {
@@ -127,6 +129,13 @@ static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
}
standard_role_manager::standard_role_manager(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>())
, _superuser(password_authenticator::default_superuser(qp.db().get_config()))
{}
std::string_view standard_role_manager::qualified_java_name() const noexcept {
return "org.apache.cassandra.auth.CassandraRoleManager";
}
@@ -168,7 +177,7 @@ future<> standard_role_manager::create_metadata_tables_if_missing() const {
}
future<> standard_role_manager::create_default_role_if_missing() const {
return default_role_row_satisfies(_qp, &has_can_login).then([this](bool exists) {
return default_role_row_satisfies(_qp, &has_can_login, _superuser).then([this](bool exists) {
if (!exists) {
static const sstring query = format("INSERT INTO {} ({}, is_superuser, can_login) VALUES (?, true, true)",
meta::roles_table::qualified_name,
@@ -178,9 +187,9 @@ future<> standard_role_manager::create_default_role_if_missing() const {
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::no).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
{_superuser},
cql3::query_processor::cache_internal::no).then([this](auto&&) {
log.info("Created default superuser role '{}'.", _superuser);
return make_ready_future<>();
});
}
@@ -232,7 +241,7 @@ future<> standard_role_manager::start() {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().real_database(), _as).get0();
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {


@@ -34,13 +34,10 @@ class standard_role_manager final : public role_manager {
::service::migration_manager& _migration_manager;
future<> _stopped;
seastar::abort_source _as;
std::string _superuser;
public:
standard_role_manager(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
standard_role_manager(cql3::query_processor&, ::service::migration_manager&);
virtual std::string_view qualified_java_name() const noexcept override;


@@ -37,10 +37,8 @@
// The constants q1 and q2 are used to determine the proportional factor at each stage.
class backlog_controller {
public:
struct scheduling_group {
seastar::scheduling_group cpu = default_scheduling_group();
seastar::io_priority_class io = default_priority_class();
};
using scheduling_group = seastar::scheduling_group;
future<> shutdown() {
_update_timer.cancel();
return std::move(_inflight_update);
@@ -58,11 +56,11 @@ protected:
};
scheduling_group _scheduling_group;
timer<> _update_timer;
std::vector<control_point> _control_points;
std::function<float()> _current_backlog;
timer<> _update_timer;
// updating shares for an I/O class may contact another shard and returns a future.
future<> _inflight_update;
@@ -82,9 +80,9 @@ protected:
std::vector<control_point> control_points, std::function<float()> backlog,
float static_shares = 0)
: _scheduling_group(std::move(sg))
, _update_timer([this] { adjust(); })
, _control_points()
, _current_backlog(std::move(backlog))
, _update_timer([this] { adjust(); })
, _inflight_update(make_ready_future<>())
, _static_shares(static_shares)
{

bin/cqlsh Executable file

@@ -0,0 +1,8 @@
#!/bin/bash
# Copyright (C) 2023-present ScyllaDB
# SPDX-License-Identifier: AGPL-3.0-or-later
here=$(dirname "$0")
exec "$here/../tools/cqlsh/bin/cqlsh" "$@"


@@ -17,7 +17,7 @@
#include <functional>
#include <compare>
#include "utils/mutable_view.hh"
#include <xxhash.h>
#include "utils/simple_hashers.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::basic_string_view<int8_t>;
@@ -160,18 +160,7 @@ struct appending_hash<bytes_view> {
}
};
struct bytes_view_hasher : public hasher {
XXH64_state_t _state;
bytes_view_hasher(uint64_t seed = 0) noexcept {
XXH64_reset(&_state, seed);
}
void update(const char* ptr, size_t length) noexcept {
XXH64_update(&_state, ptr, length);
}
size_t finalize() {
return static_cast<size_t>(XXH64_digest(&_state));
}
};
using bytes_view_hasher = simple_xx_hasher;
namespace std {
template <>


@@ -53,6 +53,10 @@ public:
using difference_type = std::ptrdiff_t;
using pointer = bytes_view*;
using reference = bytes_view&;
struct implementation {
blob_storage* current_chunk;
};
private:
chunk* _current = nullptr;
public:
@@ -75,11 +79,11 @@ public:
++(*this);
return tmp;
}
bool operator==(const fragment_iterator& other) const {
return _current == other._current;
}
bool operator!=(const fragment_iterator& other) const {
return _current != other._current;
bool operator==(const fragment_iterator&) const = default;
implementation extract_implementation() const {
return implementation {
.current_chunk = _current,
};
}
};
using const_iterator = fragment_iterator;
@@ -432,10 +436,6 @@ public:
return true;
}
bool operator!=(const bytes_ostream& other) const {
return !(*this == other);
}
// Makes this instance empty.
//
// The first buffer is not deallocated, so callers may rely on the

View File

@@ -110,6 +110,9 @@ class cache_flat_mutation_reader final : public flat_mutation_reader_v2::impl {
flat_mutation_reader_v2* _underlying = nullptr;
flat_mutation_reader_v2_opt _underlying_holder;
gc_clock::time_point _read_time;
gc_clock::time_point _gc_before;
future<> do_fill_buffer();
future<> ensure_underlying();
void copy_from_cache_to_buffer();
@@ -178,6 +181,20 @@ class cache_flat_mutation_reader final : public flat_mutation_reader_v2::impl {
const schema& table_schema() {
return *_snp->schema();
}
gc_clock::time_point get_read_time() {
return _read_context.tombstone_gc_state() ? gc_clock::now() : gc_clock::time_point::min();
}
gc_clock::time_point get_gc_before(const schema& schema, dht::decorated_key dk, const gc_clock::time_point query_time) {
auto gc_state = _read_context.tombstone_gc_state();
if (gc_state) {
return gc_state->get_gc_before_for_key(schema.shared_from_this(), dk, query_time);
}
return gc_clock::time_point::min();
}
public:
cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
@@ -196,6 +213,8 @@ public:
, _read_context_holder()
, _read_context(ctx) // ctx is owned by the caller, who's responsible for closing it.
, _next_row(*_schema, *_snp, false, _read_context.is_reversed())
, _read_time(get_read_time())
, _gc_before(get_gc_before(*_schema, dk, _read_time))
{
clogger.trace("csm {}: table={}.{}, reversed={}, snap={}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name(), _read_context.is_reversed(),
fmt::ptr(&*_snp));
@@ -730,9 +749,51 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
}
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
bool remove_row = false;
if (_read_context.tombstone_gc_state() // do not compact rows when tombstone_gc_state is not set (used in some unit tests)
&& !_next_row.dummy()
&& _snp->at_latest_version()
&& _snp->at_oldest_version()) {
deletable_row& row = _next_row.latest_row();
tombstone range_tomb = _next_row.range_tombstone_for_row();
auto t = row.deleted_at();
t.apply(range_tomb);
auto row_tomb_expired = [&](row_tombstone tomb) {
return (tomb && tomb.max_deletion_time() < _gc_before);
};
auto is_row_dead = [&](const deletable_row& row) {
auto& m = row.marker();
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before);
};
if (row_tomb_expired(t) || is_row_dead(row)) {
can_gc_fn always_gc = [&](tombstone) { return true; };
const schema& row_schema = _next_row.latest_row_schema();
_read_context.cache()._tracker.on_row_compacted();
with_allocator(_snp->region().allocator(), [&] {
deletable_row row_copy(row_schema, row);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, always_gc, _gc_before, nullptr);
std::swap(row, row_copy);
});
remove_row = row.empty();
auto tomb_expired = [&](tombstone tomb) {
return (tomb && tomb.deletion_time < _gc_before);
};
auto latests_range_tomb = _next_row.get_iterator_in_latest_version()->range_tombstone();
if (tomb_expired(latests_range_tomb)) {
_next_row.get_iterator_in_latest_version()->set_range_tombstone({});
}
}
}
if (_next_row.range_tombstone_for_row() != _current_tombstone) [[unlikely]] {
auto tomb = _next_row.range_tombstone_for_row();
auto new_lower_bound = position_in_partition::before_key(_next_row.position());
@@ -742,8 +803,31 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
_current_tombstone = tomb;
_read_context.cache()._tracker.on_range_tombstone_read();
}
add_to_buffer(_next_row);
move_to_next_entry();
if (remove_row) {
_read_context.cache()._tracker.on_row_compacted_away();
_lower_bound = position_in_partition::after_key(*_schema, _next_row.position());
partition_snapshot_row_weakref row_ref(_next_row);
move_to_next_entry();
with_allocator(_snp->region().allocator(), [&] {
cache_tracker& tracker = _read_context.cache()._tracker;
if (row_ref->is_linked()) {
tracker.get_lru().remove(*row_ref);
}
row_ref->on_evicted(tracker);
});
_snp->region().allocator().invalidate_references();
_next_row.force_valid();
} else {
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
add_to_buffer(_next_row);
move_to_next_entry();
}
} else {
move_to_next_range();
}
@@ -894,7 +978,7 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
if (!row.dummy()) {
_read_context.cache().on_row_hit();
if (_read_context.digest_requested()) {
row.latest_row().cells().prepare_hash(table_schema(), column_kind::regular_column);
row.latest_row_prepare_hash();
}
add_clustering_row_to_buffer(mutation_fragment_v2(*_schema, _permit, row.row()));
} else {

View File

@@ -68,7 +68,6 @@ public:
_pos = -1;
}
bool operator==(const iterator& o) const { return _pos == o._pos; }
bool operator!=(const iterator& o) const { return _pos != o._pos; }
};
public:
cartesian_product(const std::vector<std::vector<T>>& vec_of_vecs) : _vec_of_vecs(vec_of_vecs) {}

View File

@@ -65,7 +65,6 @@ public:
void ttl(int v) { _ttl = v; }
bool operator==(const options& o) const;
bool operator!=(const options& o) const;
};
} // namespace cdc

View File

@@ -13,6 +13,7 @@
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include "gms/endpoint_state.hh"
#include "keys.hh"
#include "schema/schema_builder.hh"
#include "replica/database.hh"
@@ -25,6 +26,7 @@
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
#include "cdc/generation.hh"
@@ -66,10 +68,10 @@ static constexpr auto stream_id_index_shift = stream_id_version_shift + stream_i
static constexpr auto stream_id_random_shift = stream_id_index_shift + stream_id_index_bits;
/**
* Responsibilty for encoding stream_id moved from factory method to
* this constructor, to keep knowledge of composition in a single place.
* Note this is private and friended to topology_description_generator,
* because he is the one who defined the "order" we view vnodes etc.
* Responsibility for encoding stream_id moved from the create_stream_ids
* function to this constructor, to keep knowledge of composition in a
* single place. Note the make_new_generation_description function
* defines the "order" in which we view vnodes etc.
*/
stream_id::stream_id(dht::token token, size_t vnode_index)
: _value(bytes::initialized_later(), 2 * sizeof(int64_t))
@@ -153,18 +155,18 @@ bool token_range_description::operator==(const token_range_description& o) const
&& sharding_ignore_msb == o.sharding_ignore_msb;
}
topology_description::topology_description(std::vector<token_range_description> entries)
topology_description::topology_description(utils::chunked_vector<token_range_description> entries)
: _entries(std::move(entries)) {}
bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const& {
const utils::chunked_vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
utils::chunked_vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
@@ -183,98 +185,48 @@ static std::vector<stream_id> create_stream_ids(
return result;
}
class topology_description_generator final {
const std::unordered_set<dht::token>& _bootstrap_tokens;
const locator::token_metadata_ptr _tmptr;
const noncopyable_function<std::pair<size_t, uint8_t> (dht::token)>& _get_sharding_info;
// Compute a set of tokens that split the token ring into vnodes
auto get_tokens() const {
auto tokens = _tmptr->sorted_tokens();
auto it = tokens.insert(
tokens.end(), _bootstrap_tokens.begin(), _bootstrap_tokens.end());
std::sort(it, tokens.end());
std::inplace_merge(tokens.begin(), it, tokens.end());
tokens.erase(std::unique(tokens.begin(), tokens.end()), tokens.end());
return tokens;
}
token_range_description create_description(size_t index, dht::token start, dht::token end) const {
token_range_description desc;
desc.token_range_end = end;
auto [shard_count, ignore_msb] = _get_sharding_info(end);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.sharding_ignore_msb = ignore_msb;
return desc;
}
public:
topology_description_generator(
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata_ptr tmptr,
// This function must return sharding parameters for a node that owns the vnode ending with
// the given token. Returns <shard_count, ignore_msb> pair.
const noncopyable_function<std::pair<size_t, uint8_t> (dht::token)>& get_sharding_info)
: _bootstrap_tokens(bootstrap_tokens)
, _tmptr(std::move(tmptr))
, _get_sharding_info(get_sharding_info)
{}
/*
* Generate a set of CDC stream identifiers such that for each shard
* and vnode pair there exists a stream whose token falls into this vnode
* and is owned by this shard. It is sometimes not possible to generate
* a CDC stream identifier for some (vnode, shard) pair because not all
shards have to own tokens in a vnode. A small vnode can be entirely owned
by a single shard. In such a case, a stream identifier that maps to the
end of the vnode is generated.
*
* Then build a cdc::topology_description which maps tokens to generated
* stream identifiers, such that if token T is owned by shard S in vnode V,
* it gets mapped to the stream identifier generated for (S, V).
*/
// Run in seastar::async context.
topology_description generate() const {
const auto tokens = get_tokens();
std::vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
vnode_descriptions.push_back(
create_description(0, tokens.back(), tokens.front()));
for (size_t idx = 1; idx < tokens.size(); ++idx) {
vnode_descriptions.push_back(
create_description(idx, tokens[idx - 1], tokens[idx]));
}
return {std::move(vnode_descriptions)};
}
};
bool should_propose_first_generation(const gms::inet_address& me, const gms::gossiper& g) {
auto my_host_id = g.get_host_id(me);
auto& eps = g.get_endpoint_states();
return std::none_of(eps.begin(), eps.end(),
[&] (const std::pair<gms::inet_address, gms::endpoint_state>& ep) {
return my_host_id < g.get_host_id(ep.first);
});
return g.for_each_endpoint_state_until([&] (const gms::inet_address& node, const gms::endpoint_state& eps) {
return stop_iteration(my_host_id < g.get_host_id(node));
}) == stop_iteration::no;
}
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations(
bool is_cdc_generation_optimal(const cdc::topology_description& gen, const locator::token_metadata& tm) {
if (tm.sorted_tokens().size() != gen.entries().size()) {
// We probably have garbage streams from old generations
cdc_log.info("Generation size does not match the token ring");
return false;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen.entries()) {
gen_ends.insert(entry.token_range_end);
}
for (const auto& metadata_token : tm.sorted_tokens()) {
if (!gen_ends.contains(metadata_token)) {
cdc_log.warn("CDC generation missing token {}", metadata_token);
return false;
}
}
return true;
}
}
static future<utils::chunked_vector<mutation>> get_common_cdc_generation_mutations(
schema_ptr s,
utils::UUID id,
const partition_key& pkey,
noncopyable_function<clustering_key (dht::token)>&& get_ckey_from_range_end,
const cdc::topology_description& desc,
size_t mutation_size_threshold,
api::timestamp_type ts) {
utils::chunked_vector<mutation> res;
res.emplace_back(s, partition_key::from_singular(*s, id));
res.back().set_static_cell(to_bytes("num_ranges"), int32_t(desc.entries().size()), ts);
res.emplace_back(s, pkey);
size_t size_estimate = 0;
size_t total_size_estimate = 0;
for (auto& e : desc.entries()) {
if (size_estimate >= mutation_size_threshold) {
res.emplace_back(s, partition_key::from_singular(*s, id));
total_size_estimate += size_estimate;
res.emplace_back(s, pkey);
size_estimate = 0;
}
@@ -285,16 +237,60 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations(
}
size_estimate += e.streams.size() * 20;
auto ckey = clustering_key::from_singular(*s, dht::token::to_int64(e.token_range_end));
auto ckey = get_ckey_from_range_end(e.token_range_end);
res.back().set_cell(ckey, to_bytes("streams"), make_set_value(db::cdc_streams_set_type, std::move(streams)), ts);
res.back().set_cell(ckey, to_bytes("ignore_msb"), int8_t(e.sharding_ignore_msb), ts);
co_await coroutine::maybe_yield();
}
total_size_estimate += size_estimate;
// Copy mutations n times, where n is picked so that the memory size of all mutations together exceeds `max_command_size`.
utils::get_local_injector().inject("cdc_generation_mutations_replication", [&res, total_size_estimate, mutation_size_threshold] {
utils::chunked_vector<mutation> new_res;
size_t number_of_copies = (mutation_size_threshold / total_size_estimate + 1) * 2;
for (size_t i = 0; i < number_of_copies; ++i) {
std::copy(res.begin(), res.end(), std::back_inserter(new_res));
}
res = std::move(new_res);
});
co_return res;
}
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v2(
schema_ptr s,
utils::UUID id,
const cdc::topology_description& desc,
size_t mutation_size_threshold,
api::timestamp_type ts) {
auto pkey = partition_key::from_singular(*s, id);
auto get_ckey = [s] (dht::token range_end) {
return clustering_key::from_singular(*s, dht::token::to_int64(range_end));
};
auto res = co_await get_common_cdc_generation_mutations(s, pkey, std::move(get_ckey), desc, mutation_size_threshold, ts);
res.back().set_static_cell(to_bytes("num_ranges"), int32_t(desc.entries().size()), ts);
co_return res;
}
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
schema_ptr s,
utils::UUID id,
const cdc::topology_description& desc,
size_t mutation_size_threshold,
api::timestamp_type ts) {
auto pkey = partition_key::from_singular(*s, CDC_GENERATIONS_V3_KEY);
auto get_ckey = [&] (dht::token range_end) {
return clustering_key::from_exploded(*s, {timeuuid_type->decompose(id), long_type->decompose(dht::token::to_int64(range_end))});
};
co_return co_await get_common_cdc_generation_mutations(s, pkey, std::move(get_ckey), desc, mutation_size_threshold, ts);
}
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
@@ -327,13 +323,47 @@ topology_description limit_number_of_streams_if_needed(topology_description&& de
return topology_description(std::move(entries));
}
std::pair<utils::UUID, cdc::topology_description> make_new_generation_data(
// Compute a set of tokens that split the token ring into vnodes.
static auto get_tokens(const std::unordered_set<dht::token>& bootstrap_tokens, const locator::token_metadata_ptr tmptr) {
auto tokens = tmptr->sorted_tokens();
auto it = tokens.insert(tokens.end(), bootstrap_tokens.begin(), bootstrap_tokens.end());
std::sort(it, tokens.end());
std::inplace_merge(tokens.begin(), it, tokens.end());
tokens.erase(std::unique(tokens.begin(), tokens.end()), tokens.end());
return tokens;
}
static token_range_description create_token_range_description(
size_t index,
dht::token start,
dht::token end,
const noncopyable_function<std::pair<size_t, uint8_t> (dht::token)>& get_sharding_info) {
token_range_description desc;
desc.token_range_end = end;
auto [shard_count, ignore_msb] = get_sharding_info(end);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.sharding_ignore_msb = ignore_msb;
return desc;
}
cdc::topology_description make_new_generation_description(
const std::unordered_set<dht::token>& bootstrap_tokens,
const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,
const locator::token_metadata_ptr tmptr) {
auto gen = topology_description_generator(bootstrap_tokens, tmptr, get_sharding_info).generate();
auto uuid = utils::make_random_uuid();
return {uuid, std::move(gen)};
const auto tokens = get_tokens(bootstrap_tokens, tmptr);
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
vnode_descriptions.push_back(create_token_range_description(0, tokens.back(), tokens.front(), get_sharding_info));
for (size_t idx = 1; idx < tokens.size(); ++idx) {
vnode_descriptions.push_back(create_token_range_description(idx, tokens[idx - 1], tokens[idx], get_sharding_info));
}
return {std::move(vnode_descriptions)};
}
db_clock::time_point new_generation_timestamp(bool add_delay, std::chrono::milliseconds ring_delay) {
@@ -365,7 +395,9 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};
}
};
auto [uuid, gen] = make_new_generation_data(bootstrap_tokens, get_sharding_info, tmptr);
auto uuid = utils::make_random_uuid();
auto gen = make_new_generation_description(bootstrap_tokens, get_sharding_info, tmptr);
// Our caller should ensure that there are normal tokens in the token ring.
auto normal_token_owners = tmptr->count_normal_token_owners();
@@ -419,8 +451,12 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
auto gen_id_string = g.get_application_state_value(endpoint, gms::application_state::CDC_GENERATION_ID);
static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_address& endpoint, const gms::endpoint_state& eps) {
const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!gen_id_ptr) {
return std::nullopt;
}
auto gen_id_string = gen_id_ptr->value();
cdc_log.trace("endpoint={}, gen_id_string={}", endpoint, gen_id_string);
return gms::versioned_value::cdc_generation_id_from_string(gen_id_string);
}
@@ -624,21 +660,21 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
for (auto& [_, cf] : _db.get_column_families()) {
auto& s = *cf->schema();
_db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> t) {
auto& s = *t->schema();
auto base = cdc::get_base_table(_db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
continue;
return;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
continue;
return;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id().uuid()), cdc_opts.ttl()});
}
});
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
@@ -726,8 +762,8 @@ future<> generation_service::stop() {
cdc_log.error("CDC stream rewrite failed: {}", std::current_exception());
}
if (this_shard_id() == 0) {
co_await _gossiper.unregister_(shared_from_this());
if (_joined && (this_shard_id() == 0)) {
co_await leave_ring();
}
_stopped = true;
@@ -739,7 +775,6 @@ generation_service::~generation_service() {
future<> generation_service::after_join(std::optional<cdc::generation_id>&& startup_gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
assert(_sys_ks.local().bootstrap_complete());
_gen_id = std::move(startup_gen_id);
_gossiper.register_(shared_from_this());
@@ -757,18 +792,24 @@ future<> generation_service::after_join(std::optional<cdc::generation_id>&& star
_cdc_streams_rewrite_complete = maybe_rewrite_streams_descriptions();
}
future<> generation_service::on_join(gms::inet_address ep, gms::endpoint_state ep_state) {
future<> generation_service::leave_ring() {
assert_shard_zero(__PRETTY_FUNCTION__);
_joined = false;
co_await _gossiper.unregister_(shared_from_this());
}
future<> generation_service::on_join(gms::inet_address ep, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto val = ep_state.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
auto val = ep_state->get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!val) {
return make_ready_future();
}
return on_change(ep, gms::application_state::CDC_GENERATION_ID, *val);
return on_change(ep, gms::application_state::CDC_GENERATION_ID, *val, pid);
}
future<> generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v) {
future<> generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v, gms::permit_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (app_state != gms::application_state::CDC_GENERATION_ID) {
@@ -788,22 +829,21 @@ future<> generation_service::check_and_repair_cdc_streams() {
}
std::optional<cdc::generation_id> latest = _gen_id;
const auto& endpoint_states = _gossiper.get_endpoint_states();
for (const auto& [addr, state] : endpoint_states) {
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& state) {
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
continue;
return;
}
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto gen_id = get_generation_id_for(addr, _gossiper);
const auto gen_id = get_generation_id_for(addr, state);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
}
});
auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
@@ -858,24 +898,9 @@ future<> generation_service::check_and_repair_cdc_streams() {
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else {
if (tmptr->sorted_tokens().size() != gen->entries().size()) {
// We probably have garbage streams from old generations
cdc_log.info("Generation size does not match the token ring, regenerating");
should_regenerate = true;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen->entries()) {
gen_ends.insert(entry.token_range_end);
}
for (const auto& metadata_token : tmptr->sorted_tokens()) {
if (!gen_ends.contains(metadata_token)) {
cdc_log.warn("CDC generation {} missing token {}. Regenerating.", latest, metadata_token);
should_regenerate = true;
break;
}
}
}
} else if (!is_cdc_generation_optimal(*gen, *tmptr)) {
should_regenerate = true;
cdc_log.info("CDC generation {} needs repair, regenerating", latest);
}
}
@@ -935,17 +960,13 @@ future<> generation_service::legacy_handle_cdc_generation(std::optional<cdc::gen
co_return;
}
if (!_sys_ks.local().bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// The service should not be listening for generation changes until after the node
// is bootstrapped. Therefore we would previously assume that this condition
// can never become true and call on_internal_error here, but it turns out that
// it may become true on decommission: the node enters NEEDS_BOOTSTRAP
// state before leaving the token ring, so bootstrap_complete() becomes false.
// In that case we can simply return.
co_return;
if (!_sys_dist_ks.local_is_initialized() || !_sys_dist_ks.local().started()) {
on_internal_error(cdc_log, "Legacy handle CDC generation with sys.dist.ks. down");
}
// The service should not be listening for generation changes until after the node
// is bootstrapped, and the node leaves the ring on decommission before this can fire.
if (co_await container().map_reduce(and_reducer(), [ts = get_ts(*gen_id)] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
@@ -1008,12 +1029,12 @@ future<> generation_service::legacy_scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<cdc::generation_id> latest;
for (const auto& ep: _gossiper.get_endpoint_states()) {
auto gen_id = get_generation_id_for(ep.first, _gossiper);
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(node, eps);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
}
});
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
@@ -1090,19 +1111,8 @@ shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks(
return _sys_dist_ks.local_shared();
}
std::ostream& operator<<(std::ostream& os, const generation_id& gen_id) {
std::visit(make_visitor(
[&os] (const generation_id_v1& id) { os << id.ts; },
[&os] (const generation_id_v2& id) { os << "(" << id.ts << ", " << id.id << ")"; }
), gen_id);
return os;
}
db_clock::time_point get_ts(const generation_id& gen_id) {
return std::visit(make_visitor(
[] (const generation_id_v1& id) { return id.ts; },
[] (const generation_id_v2& id) { return id.ts; }
), gen_id);
return std::visit([] (auto& id) { return id.ts; }, gen_id);
}
} // namespace cdc

View File

@@ -92,13 +92,13 @@ struct token_range_description {
* in the `_entries` vector. See the comment above `token_range_description` for explanation.
*/
class topology_description {
std::vector<token_range_description> _entries;
utils::chunked_vector<token_range_description> _entries;
public:
topology_description(std::vector<token_range_description> entries);
topology_description(utils::chunked_vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
const utils::chunked_vector<token_range_description>& entries() const&;
utils::chunked_vector<token_range_description>&& entries() &&;
};
/**
@@ -133,7 +133,28 @@ public:
*/
bool should_propose_first_generation(const gms::inet_address& me, const gms::gossiper&);
std::pair<utils::UUID, cdc::topology_description> make_new_generation_data(
/*
* Checks if the CDC generation is optimal, which is true if its `topology_description` is consistent
* with `token_metadata`.
*/
bool is_cdc_generation_optimal(const cdc::topology_description& gen, const locator::token_metadata& tm);
/*
* Generate a set of CDC stream identifiers such that for each shard
* and vnode pair there exists a stream whose token falls into this vnode
* and is owned by this shard. It is sometimes not possible to generate
* a CDC stream identifier for some (vnode, shard) pair because not all
shards have to own tokens in a vnode. A small vnode can be entirely owned
by a single shard. In such a case, a stream identifier that maps to the
end of the vnode is generated.
*
* Then build a cdc::topology_description which maps tokens to generated
* stream identifiers, such that if token T is owned by shard S in vnode V,
* it gets mapped to the stream identifier generated for (S, V).
*
* Run in seastar::async context.
*/
cdc::topology_description make_new_generation_description(
const std::unordered_set<dht::token>& bootstrap_tokens,
const noncopyable_function<std::pair<size_t, uint8_t> (dht::token)>& get_sharding_info,
const locator::token_metadata_ptr);
@@ -144,9 +165,20 @@ db_clock::time_point new_generation_timestamp(bool add_delay, std::chrono::milli
// using `mutation_size_threshold` to decide on the mutation sizes. The partition key of each mutation
// is given by `gen_uuid`. The timestamp of each cell in each mutation is given by `mutation_timestamp`.
//
// Works for only specific schemas: CDC_GENERATIONS_V2 (in system_distributed_keyspace)
// and CDC_GENERATIONS_V3 (in system_keyspace).
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations(
// Works only for the CDC_GENERATIONS_V2 schema (in system_distributed keyspace).
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v2(
schema_ptr, utils::UUID gen_uuid, const cdc::topology_description&,
size_t mutation_size_threshold, api::timestamp_type mutation_timestamp);
// The partition key of all rows in the single-partition CDC_GENERATIONS_V3 schema (in system keyspace).
static constexpr auto CDC_GENERATIONS_V3_KEY = "cdc_generations";
// Translates the CDC generation data given by a `cdc::topology_description` into a vector of mutations,
// using `mutation_size_threshold` to decide on the mutation sizes. The first clustering key column is
// given by `gen_uuid`. The timestamp of each cell in each mutation is given by `mutation_timestamp`.
//
// Works only for the CDC_GENERATIONS_V3 schema (in system keyspace).
future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
schema_ptr, utils::UUID gen_uuid, const cdc::topology_description&,
size_t mutation_size_threshold, api::timestamp_type mutation_timestamp);

View File

@@ -28,7 +28,35 @@ struct generation_id_v2 {
using generation_id = std::variant<generation_id_v1, generation_id_v2>;
std::ostream& operator<<(std::ostream&, const generation_id&);
db_clock::time_point get_ts(const generation_id&);
} // namespace cdc
template <>
struct fmt::formatter<cdc::generation_id_v1> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id_v1& gen_id, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{}", gen_id.ts);
}
};
template <>
struct fmt::formatter<cdc::generation_id_v2> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id_v2& gen_id, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "({}, {})", gen_id.ts, gen_id.id);
}
};
template <>
struct fmt::formatter<cdc::generation_id> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id& gen_id, FormatContext& ctx) const {
return std::visit([&ctx] (auto& id) {
return fmt::format_to(ctx.out(), "{}", id);
}, gen_id);
}
};

View File

@@ -98,19 +98,20 @@ public:
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<cdc::generation_id>&& startup_gen_id);
future<> leave_ring();
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
virtual future<> before_change(gms::inet_address, gms::endpoint_state, gms::application_state, const gms::versioned_value&) override { return make_ready_future(); }
virtual future<> on_alive(gms::inet_address, gms::endpoint_state) override { return make_ready_future(); }
virtual future<> on_dead(gms::inet_address, gms::endpoint_state) override { return make_ready_future(); }
virtual future<> on_remove(gms::inet_address) override { return make_ready_future(); }
virtual future<> on_restart(gms::inet_address, gms::endpoint_state) override { return make_ready_future(); }
virtual future<> before_change(gms::inet_address, gms::endpoint_state_ptr, gms::application_state, const gms::versioned_value&) override { return make_ready_future(); }
virtual future<> on_alive(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_dead(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_remove(gms::inet_address, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_restart(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_join(gms::inet_address, gms::endpoint_state) override;
virtual future<> on_change(gms::inet_address, gms::application_state, const gms::versioned_value&) override;
virtual future<> on_join(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, gms::application_state, const gms::versioned_value&, gms::permit_id) override;
future<> check_and_repair_cdc_streams();


@@ -160,7 +160,7 @@ public:
});
}
void on_before_create_column_family(const schema& schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
void on_before_create_column_family(const keyspace_metadata& ksm, const schema& schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
if (schema.cdc_options().enabled()) {
auto& db = _ctxt._proxy.get_db().local();
auto logname = log_name(schema.cf_name());
@@ -395,9 +395,6 @@ bool cdc::options::operator==(const options& o) const {
return enabled() == o.enabled() && _preimage == o._preimage && _postimage == o._postimage && _ttl == o._ttl
&& _delta_mode == o._delta_mode;
}
bool cdc::options::operator!=(const options& o) const {
return !(*this == o);
}
namespace cdc {
@@ -635,9 +632,6 @@ public:
bool operator==(const collection_iterator& x) const {
return _v == x._v;
}
bool operator!=(const collection_iterator& x) const {
return !(*this == x);
}
private:
void next() {
--_rem;


@@ -40,7 +40,7 @@ static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
const utils::chunked_vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {
on_internal_error(cdc_log, "get_stream: entries empty");


@@ -389,7 +389,7 @@ struct extract_changes_visitor {
}
void partition_delete(const tombstone& t) {
_result[t.timestamp].partition_deletions = {t};
_result[t.timestamp].partition_deletions = partition_deletion{t};
}
constexpr bool finished() const { return false; }


@@ -93,9 +93,6 @@ public:
bool operator==(const iterator& other) const {
return _position == other._position;
}
bool operator!=(const iterator& other) const {
return !(*this == other);
}
};
public:
explicit partition_cells_range(const mutation_partition& mp) : _mp(mp) { }


@@ -21,27 +21,27 @@ public:
: file_impl(*get_file_impl(f)), _error_handler(error_handler), _file(f) {
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, io_intent* intent) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, buffer, len, pc);
return get_file_impl(_file)->write_dma(pos, buffer, len, intent);
});
}
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, io_intent* intent) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, iov, pc);
return get_file_impl(_file)->write_dma(pos, iov, intent);
});
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, io_intent* intent) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, buffer, len, pc);
return get_file_impl(_file)->read_dma(pos, buffer, len, intent);
});
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, io_intent* intent) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, iov, pc);
return get_file_impl(_file)->read_dma(pos, iov, intent);
});
}
@@ -99,9 +99,9 @@ public:
});
}
virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override {
virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, io_intent* intent) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->dma_read_bulk(offset, range_size, pc);
return get_file_impl(_file)->dma_read_bulk(offset, range_size, intent);
});
}
private:
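Every `write_dma`/`read_dma` override in the diff above follows the same shape: delegate to the wrapped `file_impl` inside `do_io_check`, which routes failures through an error handler. A hedged, stripped-down sketch of that delegation pattern (no Seastar here; synchronous calls and `std::exception_ptr` stand in for futures and the real handler type):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Illustrative stand-in for the do_io_check wrapper above: run the I/O
// operation, hand any exception to the error handler, then rethrow so the
// caller still sees the failure.
template <typename Handler, typename Op>
auto do_io_check(Handler&& handler, Op&& op) {
    try {
        return op();
    } catch (...) {
        handler(std::current_exception());
        throw;
    }
}
```

The point of the wrapper is that each override stays a one-liner: the forwarding call changes (here, `pc` becomes `io_intent*`), while the error-handling policy lives in one place.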


@@ -15,12 +15,6 @@
std::atomic<int64_t> clocks_offset;
std::ostream& operator<<(std::ostream& os, db_clock::time_point tp) {
auto t = db_clock::to_time_t(tp);
::tm t_buf;
return os << std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T");
}
std::string format_timestamp(api::timestamp_type ts) {
auto t = std::time_t(std::chrono::duration_cast<std::chrono::seconds>(api::timestamp_clock::duration(ts)).count());
::tm t_buf;
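Both formatting functions in this diff pass a caller-owned `::tm` buffer to `gmtime_r` rather than using `std::gmtime`, whose shared static buffer is not thread-safe. A small self-contained sketch of the same idiom:

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Format a UTC timestamp like the db_clock operator<< above. gmtime_r writes
// into the caller's tm buffer, so concurrent calls cannot clobber each other.
std::string format_utc(std::time_t t) {
    ::tm t_buf;
    std::ostringstream os;
    os << std::put_time(::gmtime_r(&t, &t_buf), "%Y/%m/%d %T");
    return os.str();
}
```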


@@ -75,8 +75,7 @@ public:
const interval::interval_type& iv = *_i;
return position_range{iv.lower().position(), iv.upper().position()};
}
bool operator==(const position_range_iterator& other) const { return _i == other._i; }
bool operator!=(const position_range_iterator& other) const { return _i != other._i; }
bool operator==(const position_range_iterator& other) const = default;
position_range_iterator& operator++() {
++_i;
return *this;

cmake/Findrapidxml.cmake (new file, 27 lines)

@@ -0,0 +1,27 @@
#
# Copyright 2023-present ScyllaDB
#
#
# SPDX-License-Identifier: AGPL-3.0-or-later
#
find_path(rapidxml_INCLUDE_DIR
NAMES rapidxml.h rapidxml/rapidxml.hpp)
mark_as_advanced(
rapidxml_INCLUDE_DIR)
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(rapidxml
REQUIRED_VARS
rapidxml_INCLUDE_DIR)
if(rapidxml_FOUND)
if(NOT TARGET rapidxml::rapidxml)
add_library(rapidxml::rapidxml INTERFACE IMPORTED)
set_target_properties(rapidxml::rapidxml
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES ${rapidxml_INCLUDE_DIR})
endif()
endif()


@@ -1,20 +1,31 @@
###
### Generate version file and supply appropriate compile definitions for release.cc
###
function(add_version_library name source)
function(generate_scylla_version)
set(version_file ${CMAKE_CURRENT_BINARY_DIR}/SCYLLA-VERSION-FILE)
set(release_file ${CMAKE_CURRENT_BINARY_DIR}/SCYLLA-RELEASE-FILE)
set(product_file ${CMAKE_CURRENT_BINARY_DIR}/SCYLLA-PRODUCT-FILE)
execute_process(
COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN --output-dir "${CMAKE_CURRENT_BINARY_DIR}"
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR})
file(STRINGS ${version_file} scylla_version)
file(STRINGS ${release_file} scylla_release)
file(STRINGS ${product_file} scylla_product)
string(REPLACE "-" "~" scylla_version_tilde ${scylla_version})
set(Scylla_VERSION "${scylla_version_tilde}" CACHE INTERNAL "")
set(Scylla_RELEASE "${scylla_release}" CACHE INTERNAL "")
set(Scylla_PRODUCT "${scylla_product}" CACHE INTERNAL "")
endfunction(generate_scylla_version)
function(add_version_library name source)
add_library(${name} OBJECT ${source})
target_compile_definitions(${name}
PRIVATE
SCYLLA_VERSION=\"${scylla_version}\"
SCYLLA_RELEASE=\"${scylla_release}\")
SCYLLA_VERSION=\"${Scylla_VERSION}\"
SCYLLA_RELEASE=\"${Scylla_RELEASE}\")
target_link_libraries(${name}
PRIVATE
Seastar::seastar)


@@ -5,15 +5,6 @@
# actually compiling a sample program.
function(add_whole_archive name library)
add_library(${name} INTERFACE)
if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.24)
target_link_libraries(${name} INTERFACE
"$<LINK_LIBRARY:WHOLE_ARCHIVE,${library}>")
else()
add_dependencies(${name} ${library})
target_include_directories(${name} INTERFACE
${CMAKE_SOURCE_DIR})
target_link_options(auth INTERFACE
"$<$<CXX_COMPILER_ID:Clang>:SHELL:LINKER:-force_load $<TARGET_LINKER_FILE:${library}>>"
"$<$<CXX_COMPILER_ID:GNU>:SHELL:LINKER:--whole-archive $<TARGET_LINKER_FILE:${library}> LINKER:--no-whole-archive>")
endif()
target_link_libraries(${name} INTERFACE
"$<LINK_LIBRARY:WHOLE_ARCHIVE,${library}>")
endfunction()


@@ -0,0 +1,50 @@
function(build_submodule name dir)
cmake_parse_arguments(parsed_args "NOARCH" "" "" ${ARGN})
set(version_release "${Scylla_VERSION}-${Scylla_RELEASE}")
set(product_version_release
"${Scylla_PRODUCT}-${Scylla_VERSION}-${Scylla_RELEASE}")
set(working_dir ${CMAKE_CURRENT_SOURCE_DIR}/${dir})
if(parsed_args_NOARCH)
set(arch "noarch")
else()
set(arch "${CMAKE_SYSTEM_PROCESSOR}")
endif()
set(reloc_args ${parsed_args_UNPARSED_ARGUMENTS})
set(reloc_pkg "${working_dir}/build/${Scylla_PRODUCT}-${name}-${version_release}.${arch}.tar.gz")
add_custom_command(
OUTPUT ${reloc_pkg}
COMMAND reloc/build_reloc.sh --version ${product_version_release} --nodeps ${reloc_args}
WORKING_DIRECTORY "${working_dir}"
JOB_POOL submodule_pool)
add_custom_target(dist-${name}-tar
DEPENDS ${reloc_pkg})
add_custom_target(dist-${name}-rpm
COMMAND reloc/build_rpm.sh --reloc-pkg ${reloc_pkg}
DEPENDS ${reloc_pkg}
WORKING_DIRECTORY "${working_dir}")
add_custom_target(dist-${name}-deb
COMMAND reloc/build_deb.sh --reloc-pkg ${reloc_pkg}
DEPENDS ${reloc_pkg}
WORKING_DIRECTORY "${working_dir}")
add_custom_target(dist-${name}
DEPENDS dist-${name}-tar dist-${name}-rpm dist-${name}-deb)
endfunction()
macro(dist_submodule name dir pkgs)
# defined as a macro, so that we can append the path to the dist tarball to
# specified "pkgs"
cmake_parse_arguments(parsed_args "NOARCH" "" "" ${ARGN})
if(parsed_args_NOARCH)
set(arch "noarch")
else()
set(arch "${CMAKE_SYSTEM_PROCESSOR}")
endif()
set(pkg_name "${Scylla_PRODUCT}-${name}-${Scylla_VERSION}-${Scylla_RELEASE}.${arch}.tar.gz")
set(reloc_pkg "${CMAKE_SOURCE_DIR}/tools/${dir}/build/${pkg_name}")
set(dist_pkg "${CMAKE_CURRENT_BINARY_DIR}/${pkg_name}")
add_custom_command(
OUTPUT ${dist_pkg}
COMMAND ${CMAKE_COMMAND} -E copy ${reloc_pkg} ${dist_pkg}
DEPENDS dist-${name}-tar)
list(APPEND ${pkgs} "${dist_pkg}")
endmacro()


@@ -1,7 +1,5 @@
find_program (ANTLR3 antlr3)
if(NOT ANTLR3)
message(FATAL "antlr3 is required")
endif()
find_program (ANTLR3 antlr3
REQUIRED)
# Parse antlr3 grammar files and generate C++ sources
function(generate_cql_grammar)

cmake/mode.COVERAGE.cmake (new file, 23 lines)

@@ -0,0 +1,23 @@
set(Seastar_OptimizationLevel_COVERAGE "g")
set(CMAKE_CXX_FLAGS_COVERAGE
""
CACHE
INTERNAL
"")
string(APPEND CMAKE_CXX_FLAGS_COVERAGE
" -O${Seastar_OptimizationLevel_SANITIZE}")
set(Seastar_DEFINITIONS_COVERAGE
SCYLLA_BUILD_MODE=debug
DEBUG
SANITIZE
DEBUG_LSA_SANITIZER
SCYLLA_ENABLE_ERROR_INJECTION)
set(CMAKE_CXX_FLAGS_COVERAGE
" -O${Seastar_OptimizationLevel_COVERAGE} -fprofile-instr-generate -fcoverage-mapping -g -gz")
set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE
"-fprofile-instr-generate -fcoverage-mapping")
set(stack_usage_threshold_in_KB 40)


@@ -12,16 +12,15 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64|aarch64")
else()
set(clang_inline_threshold 2500)
endif()
string(APPEND CMAKE_CXX_FLAGS_RELEASE
" $<$<CXX_COMPILER_ID:GNU>:--param inline-unit-growth=300"
" $<$<CXX_COMPILER_ID:Clang>:-mllvm -inline-threshold=${clang_inline_threshold}>"
add_compile_options(
"$<$<CXX_COMPILER_ID:GNU>:--param;inline-unit-growth=300>"
"$<$<CXX_COMPILER_ID:Clang>:-mllvm;-inline-threshold=${clang_inline_threshold}>"
# clang generates 16-byte loads that break store-to-load forwarding
# gcc also has some trouble: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554
" -fno-slp-vectorize")
set(Seastar_DEFINITIONS_DEBUG
"-fno-slp-vectorize")
set(Seastar_DEFINITIONS_RELEASE
SCYLLA_BUILD_MODE=release)
set(CMAKE_STATIC_LINKER_FLAGS_RELEASE
"-Wl,--gc-sections")
add_link_options("LINKER:--gc-sections")
set(stack_usage_threshold_in_KB 13)

cmake/mode.SANITIZE.cmake (new file, 17 lines)

@@ -0,0 +1,17 @@
set(Seastar_OptimizationLevel_SANITIZE "s")
set(CMAKE_CXX_FLAGS_SANITIZE
""
CACHE
INTERNAL
"")
string(APPEND CMAKE_CXX_FLAGS_SANITIZE
" -O${Seastar_OptimizationLevel_SANITIZE}")
set(Seastar_DEFINITIONS_SANITIZE
SCYLLA_BUILD_MODE=sanitize
DEBUG
SANITIZE
DEBUG_LSA_SANITIZER
SCYLLA_ENABLE_ERROR_INJECTION)
set(stack_usage_threshold_in_KB 50)


@@ -1,9 +1,7 @@
set(disabled_warnings
c++11-narrowing
mismatched-tags
missing-braces
overloaded-virtual
parentheses-equality
unsupported-friend)
include(CheckCXXCompilerFlag)
foreach(warning ${disabled_warnings})
@@ -13,27 +11,117 @@ foreach(warning ${disabled_warnings})
endif()
endforeach()
list(TRANSFORM _supported_warnings PREPEND "-Wno-")
string(JOIN " " CMAKE_CXX_FLAGS "-Wall" "-Werror" ${_supported_warnings})
add_compile_options(
"-Wall"
"-Werror"
"-Wno-error=deprecated-declarations"
"-Wimplicit-fallthrough"
${_supported_warnings})
function(default_target_arch arch)
set(x86_instruction_sets i386 i686 x86_64)
if(CMAKE_SYSTEM_PROCESSOR IN_LIST x86_instruction_sets)
set(${arch} "westmere" PARENT_SCOPE)
elseif(CMAKE_SYSTEM_PROCESSOR EQUAL "aarch64")
elseif(CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
# we always use intrinsics like vmull.p64 for speeding up crc32 calculations
# on the aarch64 architectures, and they require the crypto extension, so
# we have to add "+crypto" in the architecture flags passed to -march. the
# same applies to crc32 instructions, which need the ARMv8-A CRC32 extension
# please note, Seastar also sets -march when compiled with DPDK enabled.
set(${arch} "armv8-a+crc+crypto" PARENT_SCOPE)
else()
set(${arch} "" PARENT_SCOPE)
endif()
endfunction()
function(pad_at_begin output fill str length)
# pad the given `${str}` with `${fill}`, right aligned. With the syntax of
# fmtlib:
# fmt::print("{:#>{}}", str, length)
# where `#` is the `${fill}` char
string(LENGTH "${str}" str_len)
math(EXPR padding_len "${length} - ${str_len}")
if(padding_len GREATER 0)
string(REPEAT ${fill} ${padding_len} padding)
endif()
set(${output} "${padding}${str}" PARENT_SCOPE)
endfunction()
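The `pad_at_begin` CMake function above mirrors fmt's right-aligned fill (`fmt::print("{:#>{}}", str, length)`). In C++ terms, what it computes is simply:

```cpp
#include <cassert>
#include <string>

// C++ equivalent of the cmake pad_at_begin function above: right-align str in
// a field of `length` characters, filling on the left with `fill`. If str is
// already long enough, it is returned unchanged.
std::string pad_at_begin(char fill, const std::string& str, std::size_t length) {
    if (str.size() >= length) {
        return str;
    }
    return std::string(length - str.size(), fill) + str;
}
```

In the build this is used to prefix the dynamic-linker path with `/` characters, which leaves the path's meaning unchanged while reserving room for patchelf.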
# The relocatable package includes its own dynamic linker. We don't
# know the path it will be installed to, so for now use a very long
# path so that patchelf doesn't need to edit the program headers. The
# kernel imposes a limit of 4096 bytes including the null. The other
# constraint is that the build-id has to be in the first page, so we
# can't use all 4096 bytes for the dynamic linker.
# In here we just guess that 2000 extra / should be enough to cover
# any path we get installed to but not so large that the build-id is
# pushed to the second page.
# At the end of the build we check that the build-id is indeed in the
# first page. At install time we check that patchelf doesn't modify
# the program headers.
function(get_padded_dynamic_linker_option output length)
set(dynamic_linker_option "-dynamic-linker")
# capture the driver-generated command line first
execute_process(
COMMAND ${CMAKE_C_COMPILER} "-###" /dev/null -o t
ERROR_VARIABLE driver_command_line
ERROR_STRIP_TRAILING_WHITESPACE)
# extract the argument for the "-dynamic-linker" option
if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"? \"?([^ \"]*)\"? .*")
set(dynamic_linker ${CMAKE_MATCH_1})
else()
message(FATAL_ERROR "Unable to find ${dynamic_linker_option} in driver-generated command: "
"${driver_command_line}")
endif()
# prefixing a path with "/"s does not actually change what it means
pad_at_begin(padded_dynamic_linker "/" "${dynamic_linker}" ${length})
set(${output} "${dynamic_linker_option}=${padded_dynamic_linker}" PARENT_SCOPE)
endfunction()
add_compile_options("-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.")
default_target_arch(target_arch)
if(target_arch)
string(APPEND CMAKE_CXX_FLAGS " -march=${target_arch}")
add_compile_options("-march=${target_arch}")
endif()
math(EXPR _stack_usage_threshold_in_bytes "${stack_usage_threshold_in_KB} * 1024")
set(_stack_usage_threshold_flag "-Wstack-usage=${_stack_usage_threshold_in_bytes}")
check_cxx_compiler_flag(${_stack_usage_threshold_flag} _stack_usage_flag_supported)
if(_stack_usage_flag_supported)
string(APPEND CMAKE_CXX_FLAGS " ${_stack_usage_threshold_flag}")
add_compile_options("${_stack_usage_threshold_flag}")
endif()
# Force SHA1 build-id generation
add_link_options("LINKER:--build-id=sha1")
include(CheckLinkerFlag)
set(Scylla_USE_LINKER
""
CACHE
STRING
"Use specified linker instead of the default one")
if(Scylla_USE_LINKER)
set(linkers "${Scylla_USE_LINKER}")
else()
set(linkers "lld" "gold")
endif()
foreach(linker ${linkers})
set(linker_flag "-fuse-ld=${linker}")
check_linker_flag(CXX ${linker_flag} "CXX_LINKER_HAVE_${linker}")
if(CXX_LINKER_HAVE_${linker})
add_link_options("${linker_flag}")
break()
elseif(Scylla_USE_LINKER)
message(FATAL_ERROR "${Scylla_USE_LINKER} is not supported.")
endif()
endforeach()
if(DEFINED ENV{NIX_CC})
get_padded_dynamic_linker_option(dynamic_linker_option 0)
else()
# gdb has a SO_NAME_MAX_PATH_SIZE of 512, so limit the path size to
that. The 512 includes the null at the end, hence the 511 below.
get_padded_dynamic_linker_option(dynamic_linker_option 511)
endif()
add_link_options("${dynamic_linker_option}")

File diff suppressed because it is too large.


@@ -13,8 +13,8 @@
#include "compaction/compaction_descriptor.hh"
#include "gc_clock.hh"
#include "compaction_weight_registration.hh"
#include "service/priority_manager.hh"
#include "utils/UUID.hh"
#include "utils/pretty_printers.hh"
#include "table_state.hh"
#include <seastar/core/thread.hh>
#include <seastar/core/abort_source.hh>
@@ -25,21 +25,6 @@ namespace sstables {
bool is_eligible_for_compaction(const sstables::shared_sstable& sst) noexcept;
class pretty_printed_data_size {
uint64_t _size;
public:
pretty_printed_data_size(uint64_t size) : _size(size) {}
friend std::ostream& operator<<(std::ostream&, pretty_printed_data_size);
};
class pretty_printed_throughput {
uint64_t _size;
std::chrono::duration<float> _duration;
public:
pretty_printed_throughput(uint64_t size, std::chrono::duration<float> dur) : _size(size), _duration(std::move(dur)) {}
friend std::ostream& operator<<(std::ostream&, pretty_printed_throughput);
};
// Return the name of the compaction type
// as used over the REST api, e.g. "COMPACTION" or "CLEANUP".
sstring compaction_name(compaction_type type);
@@ -63,6 +48,7 @@ struct compaction_info {
};
struct compaction_data {
uint64_t compaction_size = 0;
uint64_t total_partitions = 0;
uint64_t total_keys_written = 0;
sstring stop_requested;
@@ -92,12 +78,15 @@ struct compaction_stats {
uint64_t start_size = 0;
uint64_t end_size = 0;
uint64_t validation_errors = 0;
// Bloom filter checks during max purgeable calculation
uint64_t bloom_filter_checks = 0;
compaction_stats& operator+=(const compaction_stats& r) {
ended_at = std::max(ended_at, r.ended_at);
start_size += r.start_size;
end_size += r.end_size;
validation_errors += r.validation_errors;
bloom_filter_checks += r.bloom_filter_checks;
return *this;
}
friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {
@@ -112,12 +101,27 @@ struct compaction_result {
compaction_stats stats;
};
class read_monitor_generator;
class compaction_progress_monitor {
std::unique_ptr<read_monitor_generator> _generator = nullptr;
uint64_t _progress = 0;
public:
void set_generator(std::unique_ptr<read_monitor_generator> generator);
void reset_generator();
// Returns number of bytes processed with _generator.
uint64_t get_progress() const;
friend class compaction;
friend future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor, compaction_data&, table_state&, compaction_progress_monitor&);
};
// Compact a list of N sstables into M sstables.
// Returns info about the finished compaction, which includes vector to new sstables.
//
// compaction_descriptor is responsible for specifying the type of compaction, and influencing
// compaction behavior through its available member fields.
future<compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s);
future<compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s, compaction_progress_monitor& progress_monitor);
// Return list of expired sstables for column family cf.
// A sstable is fully expired *iff* its max_local_deletion_time precedes gc_before and its
@@ -130,7 +134,4 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
// For tests, can drop after we virtualize sstables.
flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors);
// For tests, can drop after we virtualize sstables.
future<uint64_t> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 rd, const compaction_data& info);
}
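The `compaction_stats` hunk above accumulates per-compaction counters (now including `bloom_filter_checks`) with an `operator+=`, and the friend `operator+` is written on top of it so the two can never disagree. A minimal sketch of that aggregation idiom with a couple of illustrative fields:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stats aggregate in the style of compaction_stats above:
// operator+ is defined in terms of operator+=, keeping one source of truth
// for how fields combine.
struct stats {
    uint64_t start_size = 0;
    uint64_t end_size = 0;
    stats& operator+=(const stats& r) {
        start_size += r.start_size;
        end_size += r.end_size;
        return *this;
    }
    friend stats operator+(stats l, const stats& r) {
        l += r;  // take l by value, reuse +=, return the sum
        return l;
    }
};
```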


@@ -12,7 +12,6 @@
#include <memory>
#include <seastar/core/shared_ptr.hh>
#include "sstables/shared_sstable.hh"
#include "sstables/progress_monitor.hh"
#include "timestamp.hh"
class compaction_backlog_manager;
@@ -60,18 +59,20 @@ public:
using ongoing_compactions = std::unordered_map<sstables::shared_sstable, backlog_read_progress_manager*>;
struct impl {
virtual void replace_sstables(std::vector<sstables::shared_sstable> old_ssts, std::vector<sstables::shared_sstable> new_ssts) = 0;
// FIXME: Should provide strong exception safety guarantees
virtual void replace_sstables(const std::vector<sstables::shared_sstable>& old_ssts, const std::vector<sstables::shared_sstable>& new_ssts) = 0;
virtual double backlog(const ongoing_writes& ow, const ongoing_compactions& oc) const = 0;
virtual ~impl() { }
};
compaction_backlog_tracker(std::unique_ptr<impl> impl) : _impl(std::move(impl)) {}
compaction_backlog_tracker(compaction_backlog_tracker&&);
compaction_backlog_tracker& operator=(compaction_backlog_tracker&&) noexcept;
compaction_backlog_tracker& operator=(compaction_backlog_tracker&&) = delete;
compaction_backlog_tracker(const compaction_backlog_tracker&) = delete;
~compaction_backlog_tracker();
double backlog() const;
// FIXME: Should provide strong exception safety guarantees
void replace_sstables(const std::vector<sstables::shared_sstable>& old_ssts, const std::vector<sstables::shared_sstable>& new_ssts);
void register_partially_written_sstable(sstables::shared_sstable sst, backlog_write_progress_manager& wp);
void register_compacting_sstable(sstables::shared_sstable sst, backlog_read_progress_manager& rp);


@@ -18,7 +18,6 @@
#include "sstables/sstable_set.hh"
#include "utils/UUID.hh"
#include "dht/i_partitioner.hh"
#include "compaction_weight_registration.hh"
#include "compaction_fwd.hh"
namespace sstables {
@@ -73,6 +72,12 @@ public:
only, // scrub only quarantined sstables
};
quarantine_mode quarantine_operation_mode = quarantine_mode::include;
using quarantine_invalid_sstables = bool_class<class quarantine_invalid_sstables_tag>;
// Should invalid sstables be moved into quarantine.
// Only applies to validate-mode.
quarantine_invalid_sstables quarantine_sstables = quarantine_invalid_sstables::yes;
};
struct reshard {
};
@@ -109,8 +114,8 @@ public:
return compaction_type_options(upgrade{});
}
static compaction_type_options make_scrub(scrub::mode mode) {
return compaction_type_options(scrub{mode});
static compaction_type_options make_scrub(scrub::mode mode, scrub::quarantine_invalid_sstables quarantine_sstables = scrub::quarantine_invalid_sstables::yes) {
return compaction_type_options(scrub{.operation_mode = mode, .quarantine_sstables = quarantine_sstables});
}
template <typename... Visitor>
@@ -118,6 +123,11 @@ public:
return std::visit(std::forward<Visitor>(visitor)..., _options);
}
template <typename OptionType>
const auto& as() const {
return std::get<OptionType>(_options);
}
const options_variant& options() const { return _options; }
compaction_type type() const;
@@ -151,12 +161,12 @@ struct compaction_descriptor {
compaction_type_options options = compaction_type_options::make_regular();
// If engaged, compaction will cleanup the input sstables by skipping non-owned ranges.
compaction::owned_ranges_ptr owned_ranges;
// Required for reshard compaction.
const dht::sharder* sharder;
compaction_sstable_creator_fn creator;
compaction_sstable_replacer_fn replacer;
::io_priority_class io_priority = default_priority_class();
// Denotes if this compaction task is comprised solely of completely expired SSTables
sstables::has_only_fully_expired has_only_fully_expired = has_only_fully_expired::no;
@@ -166,7 +176,6 @@ struct compaction_descriptor {
static constexpr uint64_t default_max_sstable_bytes = std::numeric_limits<uint64_t>::max();
explicit compaction_descriptor(std::vector<sstables::shared_sstable> sstables,
::io_priority_class io_priority,
int level = default_level,
uint64_t max_sstable_bytes = default_max_sstable_bytes,
run_id run_identifier = run_id::create_random_id(),
@@ -178,18 +187,15 @@ struct compaction_descriptor {
, run_identifier(run_identifier)
, options(options)
, owned_ranges(std::move(owned_ranges_))
, io_priority(io_priority)
{}
explicit compaction_descriptor(sstables::has_only_fully_expired has_only_fully_expired,
std::vector<sstables::shared_sstable> sstables,
::io_priority_class io_priority)
std::vector<sstables::shared_sstable> sstables)
: sstables(std::move(sstables))
, level(default_level)
, max_sstable_bytes(default_max_sstable_bytes)
, run_identifier(run_id::create_random_id())
, options(compaction_type_options::make_regular())
, io_priority(io_priority)
, has_only_fully_expired(has_only_fully_expired)
{}

File diff suppressed because it is too large.

Some files were not shown because too many files have changed in this diff.