Commit Graph

40928 Commits

Author SHA1 Message Date
Kefu Chai
819fc95a67 reader: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17036
2024-01-29 16:21:42 +02:00
Kefu Chai
43094d2023 db: add formatter for db::read_repair_decision
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `db::read_repair_decision`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17033
2024-01-29 15:43:51 +02:00
Botond Dénes
d202d32f81 Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb/scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call
2024-01-29 15:06:04 +02:00
Beni Peled
8009170d3a docs: update the installation instructions with the new gpg 2024 key
Closes scylladb/scylladb#17019
2024-01-29 14:37:25 +02:00
Kefu Chai
6f55d68dd9 .git: add more skip words
these words are either

* shortened words: strategy => strat, read_from_primary => fro
* or acronyms: node_or_data => nd

before we rename them with better names, let's just add them to the
ignore word list.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17002
2024-01-29 14:37:03 +02:00
Kefu Chai
0cbf8f75f0 db: add formatter for dht::decorated_key and repair_sync_boundary
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for dht::decorated_key and
repair_sync_boundary.

please note, before this change, repair_sync_boundary was using
the operator<< based formatter of `dht::decorated_key`, so we are
updating both of them in a single commit.

because we still use the homebrew generic formatter of vector<>
in to format vector<repair_sync_boundary> and vector<dht::decorated_key>,
so their operator<< are preserved.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16994
2024-01-29 11:11:41 +02:00
Tzach Livyatan
06a9a925a5 Update link to sizing / pricing calc
Closes scylladb/scylladb#17015
2024-01-29 11:07:20 +02:00
Kefu Chai
b5ff098f28 thrift: add formatter for cassandra::ConsistencyLevel::type
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for
cassandra::ConsistencyLevel::type.
please note, the operator<< for `cassandra::ConsistencyLevel::type`
is generated using `thrift` command line tool, which does not emit
specialization for fmt::formatter yet, so we need to use
`fmt::ostream_formatter` to implement the formatter for this type.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17013
2024-01-29 10:10:35 +02:00
Pavel Emelyanov
3abdb3c7ee tablets: Remove tablet_aware_replication_strategy::parse_initial_tablets
It's now unused, string with initial tablets its parsed elsewhere

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17010
2024-01-29 10:03:38 +02:00
Kefu Chai
912c588975 thrift: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17012
2024-01-29 10:02:30 +02:00
Kefu Chai
abb12979f8 raft: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17011
2024-01-29 10:00:56 +02:00
Kefu Chai
8f38bd5376 commitlog: add formatter for db::replay_position
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `db::replay_position`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17014
2024-01-29 09:59:30 +02:00
Botond Dénes
d3c1be9107 Merge 'alternator: enable tablets by default if experimental feature is enabled' from Nadav Har'El
This series does a similar change to Alternator as was done recently to CQL:

1. If the "tablets" experimental feature in enabled, new Alternator tables will use tablets automatically, without requiring an option on each new table. A default choice of initial_tablets is used. These choices can still be overridden per-table if the user wants to.
3. In particular, all test/alternator tests will also automatically run with tablets enabled
4. However, some tests will fail on tablets because they use features that haven't yet been implemented with tablets - namely Alternator Streams (Refs #16317) and Alternator TTL (Refs #16567). These tests will - until those features are implemented with tablets - continue to be run without tablets.
5. An option is added to the test/alternator/run to allow developers to manually run tests without tablets enabled, if they wish to (this option will be useful in the short term, and can be removed later).

Fixes #16355

Closes scylladb/scylladb#16900

* github.com:scylladb/scylladb:
  test/alternator: add "--vnodes" option to run script
  alternator: use tablets by default, if available
  test/alternator: run some tests without tablets
2024-01-29 09:22:13 +02:00
Kefu Chai
cb5453d534 .git: only allow codespell to run on master branch
so that non-master branches are not read by 3rd-party tools unless
they are audited.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16999
2024-01-29 09:04:20 +02:00
Kefu Chai
f96d25a0a7 tool: check for existence of keyspace before getting it
in general, user should save output of `DESC foo.bar` to a file,
and pass the path to the file as the argument of `--schema-file`
option of `scylla sstable` commands. the CQL statement generated
from `DESC` command always include the keyspace name of the table.
but in case user create the CQL statement manually and misses
the keyspace name. he/she would have following assertion failure
```
scylla: cql3/statements/cf_statement.cc:49: virtual const sstring &cql3::statements::raw::cf_statement::keyspace() const: Assertion `_cf_name->has_keyspace()' failed.
```
this is not a great user experience.

so, in this change, we check for the existence of keyspace before
looking it up. and throw a runtime error with a better error mesage.
so when the CQL statement does not have the keyspace name, the new
error message would look like:
```
error processing arguments: could not load schema via schema-file: std::runtime_error (tools::do_load_schemas(): CQL statement does not have keyspace specified)
```

since this check is only performed by `do_load_schemas()` which
care about the existence of keyspace, and it only expects the
CQL statement to create table/keyspace/type, we just override the
new `has_keyspace()` method of the corresponding types derived
from `cf_statement`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16981
2024-01-29 09:02:01 +02:00
Anna Stuchlik
dfa88ccc28 doc: document nodetool resetlocalschema
This adds the documentation for the nodetool resetlocalschema
command.
The syntax description is based on the description for Cassandra
and the ScyllaDB help for nodetool.

Fixes https://github.com/scylladb/scylladb/issues/16286

Closes scylladb/scylladb#16790
2024-01-28 21:09:02 +01:00
Kefu Chai
fe3bc00045 topology_coordinator: fix misspellings in log
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17006
2024-01-26 16:50:39 +02:00
Dawid Medrek
b92fb3537a main: Postpone start-up of hint manager
In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.

When we start the hint managers, one of the
things that happen is creating endpoint
managers -- structures managed by
db::hints::manager. Whether we create
an instance of endpoint manager depends on
the value returned by host_filter::can_hint_for,
which, in turn, may depend on the current state
of locator::topology.

If locator::topology is incomplete, some endpoint
managers may not be started even though they
should (because the target node IS part of the
cluster and we SHOULD send hints to it if there
are some).

The situation like that can happen because we
start the hint managers too early. This commit
aims to solve that problem. We only start
the hint managers when we've gathered information
about the other nodes in the cluster and created
the locator::topology using it.

Hinted Handoff is not negatively affected by these
changes since in between the previous point of
starting the hint managers and the current one,
all of the mutations performed by
service::storage_proxy target the local node, so
no hints would need to be generated anyway.

Fixes scylladb/scylladb#11870
Closes scylladb/scylladb#16511
2024-01-26 12:49:40 +01:00
Botond Dénes
c6fd4dffbb Merge 'Remove anonymous namespaces from headers' from Patryk Wróbel
Anonymous namespace implies internal linkage for its members.
When it is defined in a header, then each translation unit,
which includes such header defines its own unique instance
of members of the unnamed namespace that are ODR-used within
that translation unit.

This can lead to unexpected results including code bloat
or undefined behavior due to ODR violations.

This PR removes unnamed namespaces from header files.

References:

- [CppCoreGuidelines: "SF.21: Don’t use an unnamed (anonymous) namespace in a header"](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#sf21-dont-use-an-unnamed-anonymous-namespace-in-a-header)

- [SEI CERT C++: "DCL59-CPP. Do not define an unnamed namespace in a header file"](https://wiki.sei.cmu.edu/confluence/display/cplusplus/DCL59-CPP.+Do+not+define+an+unnamed+namespace+in+a+header+file)

Closes scylladb/scylladb#16998

* github.com:scylladb/scylladb:
  utils/config_file_impl.hh: remove anonymous namespace from header
  mutation/mutation.hh: remove anonymous namespace from header
2024-01-26 13:20:17 +02:00
Kefu Chai
a9d781d70f test/nodetool: only test "storage_service/cleanup_all" with scylla
this RESTful API is a scylla specific extension and is only used
by scylla-nodetool. currently, the java-based nodetool does not use
it at all, so mark it with "scylla_only".

one can verify this change with:
```
pytest --mode=debug --nodetool=cassandra test_cleanup.py::test_cleanup
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17001
2024-01-26 13:19:15 +02:00
Botond Dénes
582ddc70ec Merge 'test/nodetool: return a randomized address if not running with unshare' from Kefu Chai
we should allow user to run nodetool tests without `test.py`. but there
are good chance that the host could be reused by multiple tests or
multiple users who could be using port 12345. by randomizing the IP and
port, they would have better chance to complete the test without running
into used port problem.

Closes scylladb/scylladb#16996

* github.com:scylladb/scylladb:
  test/nodetool: return a randomized address if not running with unshare
  test/nodetool: return an address from loopback_network fixture
2024-01-26 13:15:58 +02:00
Kefu Chai
9ee6c00c84 docs: fix misspellings
these misspellings are identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17005
2024-01-26 13:14:21 +02:00
Kefu Chai
72cec22932 repair: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16993
2024-01-26 13:12:38 +02:00
Kamil Braun
4f736894e1 Merge 'Add maintenance mode' from Mikołaj Grzebieluch
In this mode, the node is not reachable from the outside, i.e.
* it refuses all incoming RPC connections,
* it does not join the cluster, thus
  * all group0 operations are disabled (e.g. schema changes),
  * all cluster-wide operations are disabled for this node (e.g. repair),
  * other nodes see this node as dead,
  * cannot read or write data from/to other nodes,
* it does not open Alternator and Redis transport ports and the TCP CQL port.

The only way to make CQL queries is to use the maintenance socket. The node serves only local data.

To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file.

REST API works as usual, but some routes are disabled:
* authorization_cache
* failure_detector
* hinted_hand_off_manager

This PR also updates the maintenance socket documentation:
* add cqlsh usage to the documentation
* update the documentation to use `WhiteListRoundRobinPolicy`

Fixes #5489.

Closes scylladb/scylladb#15346

* github.com:scylladb/scylladb:
  test.py: add test for maintenance mode
  test.py: generalize usage of cluster_con
  test.py: when connecting to node in maintenance mode use maintenance socket
  docs: add maintenance mode documentation
  main: add maintenance mode
  main: move some REST routes initialization before joining group0
  message_service: add sanity check that rpc connections are not created in the maintenance mode
  raft_group0_client: disable group0 operations in the maintenance mode
  service/storage_service: add start_maintenance_mode() method
  storage_service: add MAINTENANCE option to mode enum
  service/maintenance_mode: add maintenance_mode_enabled bool class
  service/maintenance_mode: move maintenance_socket_enabled definition to seperate file
  db/config: add maintenance mode flag
  docs: add cqlsh usage to maintenance socket documentation
  docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy
2024-01-26 11:02:34 +01:00
Kefu Chai
637dd73079 sstable/storage: use fs::path to represent _dir and _temp_dir
they are directories, and we are concating strings to build the paths
to the sstable components. so it would be more elegant to use fs::path
for manipulating paths.

this change was inspired by the discussion on passing the relative
path to sstable to `scylla sstables`, where we use the
`path::parent_path()` as the dir of sstable, and then concatenate
it with the filename component. but if the `parent_path()` method
returns an empty string, we end up with a path like
"/me-42-big-TOC.txt", which is not reachable. what we should be
reading is "me-42-big-TOC.txt". so, we should better off either
using `fs::path` or enforcing the absolute path.

since we already using "/" as separator, and concatenating strings,
this is an opportunity to switch over to `fs::path` to address
the problem and to avoid the string concatenating.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16982
2024-01-26 09:54:41 +02:00
Patryk Wrobel
6faa178f10 utils/config_file_impl.hh: remove anonymous namespace from header
Anonymous namespace implies internal linkage for its members.
When it is defined in a header, then each translation unit,
which includes such header defines its own unique instance
of members of the unnamed namespace that are ODR-used within
that translation unit.

This can lead to unexpected results including code bloat
or undefined behavior due to ODR violations.

This change aligns the code with the following guidelines:
 - CppCoreGuidelines: "SF.21: Don’t use an unnamed (anonymous)
                       namespace in a header"
 - SEI CERT C++: "DCL59-CPP. Do not define an unnamed namespace
                  in a header file"

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-26 08:44:44 +01:00
Patryk Wrobel
c218333afb cql3/type_json.cc: move stringstream content instead of copying it
C++20 introduced a new overload of std::ofstringstream::str()
that is selected when the mentioned member function is called
on r-value.

The new overload returns a string, that is move-constructed
from the underlying string instead of being copy-constructed.

This change applies std::move() on stringstream objects before
calling str() member function to avoid copying of the underlying
buffer.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#16990
2024-01-26 09:41:09 +02:00
Kefu Chai
36e81f93d2 .git: do not apply codespell to licenses
we should keep the licenses as they are, even with misspellings.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16992
2024-01-26 09:39:27 +02:00
Patryk Wrobel
ba488b10ec mutation/mutation.hh: remove anonymous namespace from header
Anonymous namespace implies internal linkage for its members.
When it is defined in a header, then each translation unit,
which includes such header defines its own unique instance
of members of the unnamed namespace that are ODR-used within
that translation unit.

This can lead to unexpected results including code bloat
or undefined behavior due to ODR violations.

This change aligns the code with the following guidelines:
 - CppCoreGuidelines: "SF.21: Don’t use an unnamed (anonymous)
                       namespace in a header"
 - SEI CERT C++: "DCL59-CPP. Do not define an unnamed namespace
                  in a header file"

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-26 08:38:39 +01:00
Kefu Chai
01727a5399 test/nodetool: return a randomized address if not running with unshare
we should allow user to run nodetool tests without `test.py`. but there
are good chance that the host could be reused by multiple tests or
multiple users who could be using port 12345. by randomizing the IP and
port, they would have better chance to complete the test without running
into used port problem.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-26 13:32:47 +08:00
Kefu Chai
358d30fd29 test/nodetool: return an address from loopback_network fixture
* rename "maybe_setup_loopback_network" to "server_address"
* return an address from the fixture

this change prepares for bringing back the randomized IP and port,
in case users run this test without test.py, by randomizing the
IP and port, they would have better chance to complete the test
without running into used port problem.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-26 13:20:37 +08:00
Avi Kivity
03313d359e Merge ' db: commitlog_replayer: ignore mutations affected by (tablet) cleanups ' from Michał Chojnowski
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.

This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.

Fixes #16752

Closes scylladb/scylladb#16888

* github.com:scylladb/scylladb:
  test: test_tablets: add a test for cleanup after migration
  test: pylib: add ScyllaCluster.wipe_sstables
  test: boost: add commitlog_cleanup_test
  db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
  replica: table: garbage-collect irrelevant system.commitlog_cleanups records
  db: commitlog: add min_position()
  replica: table: populate system.commitlog_cleanups on tablet cleanup
  db: system_keyspace: add system.commitlog_cleanups
  replica: table: refresh compound sstable set after tablet cleanup
2024-01-25 20:51:03 +02:00
Patryk Wrobel
a858daf038 service/client_state.cc: remove redundant copying
db::schema_tables::all_table_names() returns std::vector<sstring>.
Usage of range-for loop without reference results in copying each
of the elements of the traversed container. Such copying is redundant.

This change introduces usage of const reference to avoid copying.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#16983
2024-01-25 20:35:05 +02:00
Kamil Braun
543ad0987a Merge 'raft topology: send barrier_and_drain to a decommissioning node' from Patryk Jędrzejczak
We didn't send the `barrier_and_drain` command to a
decommissioning node that could still be coordinating requests. It
could happen that a decommissioning node sent a request with an
old topology version after normal nodes received the new fence
version. Then, the request would fail on replicas with the stale
topology exception.

This PR fixes this problem by modifying `exec_global_command`.
From now on, it sends `barrier_and_drain` to a decommissioning
node.

We also stop filtering stale topology exceptions in
`test_topology_ops`. We added this filter after detecting the bug
fixed by this PR.

Fixes scylladb/scylladb#15804
Fixes scylladb/scylladb#16579
Fixes scylladb/scylladb#16642

Closes scylladb/scylladb#16797

* github.com:scylladb/scylladb:
  test: test_topology_ops: remove failed mutations filter
  raft topology: send barrier_and_drain to a decommissioning node
  raft topology: ensure at most one transitioning node
2024-01-25 16:09:02 +01:00
Kefu Chai
ee28cf2285 test.py: s/defalt/default/
this typo was identified by codespell

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16980
2024-01-25 16:54:07 +02:00
Botond Dénes
6d5ee6d48a Merge 'test/nodetool: run nodetool tests using "unshare"' from Kefu Chai
before this change, we use a random address when launching
rest_api_mock server, but there are chances that the randomly
picked address conflicts with an already-used address on the
host. and the subprocess fails right away with the returncode of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. actually, we've seen test failures
caused by the EADDRINUSE failure, and when we checked the readiness
of the rest_api_mock by sending HTTP request and reading the
response, what we had is not a JSON encoded response but a webpage,
which was likely the one returned by a minio server.

in this change, we

* specify the "launcher" option of nodetool
  test suite to "unshare", so that all its tests are launched
  in separated namespaces.
* do not use a random address for the mock server, as the network
  namespaces are separated.

Fixes #16542

Closes scylladb/scylladb#16773

* github.com:scylladb/scylladb:
  test/nodetool: run nodetool tests using "unshare"
  test.py: add "launcher" option support
2024-01-25 16:53:49 +02:00
Mikołaj Grzebieluch
763911af5b test.py: add test for maintenance mode
The test checks that in maintenance mode server A is not available for other
nodes and for clients. It is possible to connect by the maintenance socket
to server A and perform local CQL operations.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
ca35e352f5 test.py: generalize usage of cluster_con
Add option to pass load_balancing policy.
Change hosts type to list of IPs or cassandra.Endpoint.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
77a656bfd6 test.py: when connecting to node in maintenance mode use maintenance socket
A node in the maintenance socket hasn't an opened regular CQL port.
To connect to the node, the scylla cluster needs to use the node's maintenance socket.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
9c07a189e8 docs: add maintenance mode documentation 2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
0bdbd6e8f5 main: add maintenance mode
In maintenance mode:
* Group0 doesn't start and the node doesn't join the token ring to behave as a dead
node to others,
* Group0 operations are disabled and result in an error,
* Only the maintenance socket listens for CQL requests,
* The storage service initialises token_metadata with the local node as the only node
on the token ring.

Maintenance mode is enabled by passing the --maintenance-mode flag.

Maintenance mode starts before the group0 is initialised.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
617adde9c9 main: move some REST routes initialization before joining group0
Move REST endpoints that don't need connection with other nodes, before joining the group0.
This way, they can be initialized in the maintenance mode.

Move `snapshot_ctl` along with routes because of snapshots API and tasks API.
Its constructor is a noop, so it is safe to move it.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
d8de209dcf message_service: add sanity check that rpc connections are not created in the maintenance mode
In maintenance mode, a node shouldn't be able to communicate with other nodes.

To make sure this does not happen, the sanity check is added.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
c08266cfe5 raft_group0_client: disable group0 operations in the maintenance mode
In maintenance mode, the node doesn't communicate with other nodes, so it doesn't
start or apply group0 operations. Users can still try to start it, e.g. change
the schema, and the node can't allow it.

Init _upgrade_state with recovery in the maintenance mode.
Throw an error if the group0 operation is started in maintenance mode.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
97641f646a service/storage_service: add start_maintenance_mode() method
In the maintenance mode, other nodes won't be available thus we disabled joining
the token ring and the token metadata won't be populated with the local node's endpoint.
When a CQL query is executed it checks the `token_metadata` structure and fails if it is empty.

Add a method that initialises `token_metadata` with the local node as the only node in the token ring.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
c530756837 storage_service: add MAINTENANCE option to mode enum
join_cluster and start_maintenance_mode are incompatible.
To make sure that only one is called when the node starts, add the MAINTENANCE option.

start_maintenance_mode sets _operation_mode to MAINTENANCE.
join_cluster sets _operation_mode to STARTING.

set_mode will result in an internal error if:
* it tries to set MAINTENANCE mode when the _operation_mode is other than NONE,
  i.e. start_maintenance_mode is called after join_cluster (or it is called during
  the drain, but it also shouldn't happen).
* it tries to set STARTING mode when the mode is set to MAINTENANCE,
  i.e. join_cluster is called after start_maintenance_mode.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
d4c22fc86c service/maintenance_mode: add maintenance_mode_enabled bool class 2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
8b2f0e38d9 service/maintenance_mode: move maintenance_socket_enabled definition to seperate file 2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
e6a83b9819 db/config: add maintenance mode flag 2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
81ef9fc91e docs: add cqlsh usage to maintenance socket documentation
After https://github.com/scylladb/scylla-cqlsh/pull/67, the user can use
cqlsh to connect to the node by maintenance socket.
2024-01-25 15:27:53 +01:00