Commit Graph

40181 Commits

Author SHA1 Message Date
Mikołaj Grzebieluch
e327478bb5 test.py: enable maintenance socket in tests by default 2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
21b3ba4927 docs: add maintenance socket documentation 2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
f96d30c2b5 main: add maintenance socket
Add initialization of maintenance_auth_service and cql_maintenance_server_ctl.

Create maintenance socket which enables interaction with the node through
CQL protocol without authentication. The maintenance socket is exposed
as a Unix domain socket. It gives full-permission access.
It is created before the node joins the cluster.
2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
16ab2c28e4 main: refactor initialization of cql controller and auth service
Move initialization of cql controller and auth service to functions.
It will make it easier to create a new cql controller with a separate auth service,
for example for the maintenance socket.

Make it possible to initialize new services before joining group0.
2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
999be1d14b auth/service: don't create system_auth keyspace when used by maintenance socket
The maintenance socket is created before joining the cluster. When maintenance auth service
is started it creates system_auth keyspace if it's missing. It is not synchronized
with other nodes, because this node hasn't joined the group0 yet. Thus a node has
a mismatched schema and is unable to join the cluster.

The maintenance socket doesn't use role management, thus the problem is solved
by not creating system_auth keyspace when maintenance auth service is created.

The logic of the regular CQL port's auth service is unchanged. A new, separate auth
service will be created for the maintenance socket.
2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
2b9a88d17a cql_controller: maintenance socket: fix indentation 2023-12-18 17:58:13 +01:00
Mikołaj Grzebieluch
ac61d0f695 cql_controller: add option to start maintenance socket
Add an option to listen on the maintenance socket. It is set up on a Unix domain socket
and the metrics are disabled.
This enables having an independent authentication mechanism for this socket.

To start the maintenance socket, a new cql_controller has to be created
with
`db::maintenance_socket_enabled::yes` argument.

Creating maintenance socket will raise an exception if
* the path is longer than 107 chars (due to Linux limits),
* a file or a directory already exists in the path.

The indentation is fixed in the next commit.
2023-12-18 17:58:13 +01:00
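
The two failure conditions listed in the commit above can be sketched as a small pre-flight check. This is an illustration only, not the code from the commit: the function name `check_maintenance_socket_path` and the constant `MAX_UNIX_SOCKET_PATH` are hypothetical, and the sketch measures the limit in bytes of the encoded path.

```python
import os

# Assumption: 107 usable bytes, the limit quoted in the commit message
# (on Linux, sun_path is 108 bytes including the terminating NUL).
MAX_UNIX_SOCKET_PATH = 107


def check_maintenance_socket_path(path: str) -> None:
    """Raise ValueError if `path` cannot be used for a new Unix domain socket."""
    if len(os.fsencode(path)) > MAX_UNIX_SOCKET_PATH:
        raise ValueError(
            f"socket path is longer than {MAX_UNIX_SOCKET_PATH} bytes: {path}"
        )
    # lexists() also reports dangling symlinks, which would still block bind().
    if os.path.lexists(path):
        raise ValueError(f"a file or directory already exists at: {path}")
```

Checking before binding keeps the error message specific; the bind call itself would only report a generic OS error.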
Mikołaj Grzebieluch
cf43787295 db/config: add maintenance_socket_enabled bool class 2023-12-18 11:42:40 +01:00
Mikołaj Grzebieluch
11a2748d7f auth: add maintenance_socket_role_manager
Add `maintenance_socket_role_manager`, which disables all operations
associated with roles so that it does not depend on the system_auth keyspace,
which may not yet be created when the maintenance socket starts listening.
2023-12-18 11:42:40 +01:00
Mikołaj Grzebieluch
e682e362a3 db/config: add maintenance_socket variable
If set to "ignore", maintenance socket will be disabled.
If set to "workdir", maintenance socket will be opened on <scylla's
workdir>/cql.m.
Otherwise it will be opened on path provided by maintenance_socket
variable.

It is set by default to 'ignore'.
2023-12-18 11:42:05 +01:00
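
The three cases described for the maintenance_socket variable can be sketched as a small resolver. This is a sketch under assumptions: the function name `resolve_maintenance_socket` is hypothetical, and only the values "ignore" and "workdir" and the file name `cql.m` come from the commit message.

```python
import posixpath
from typing import Optional


def resolve_maintenance_socket(setting: str, workdir: str) -> Optional[str]:
    """Return the socket path for a maintenance_socket setting, or None if disabled."""
    if setting == "ignore":
        # Default value: the maintenance socket is not opened at all.
        return None
    if setting == "workdir":
        # Open the socket inside the node's working directory.
        return posixpath.join(workdir, "cql.m")
    # Any other value is treated as an explicit path.
    return setting
```

For example, `resolve_maintenance_socket("workdir", "/var/lib/scylla")` yields `/var/lib/scylla/cql.m`, where the working directory shown is only an example value.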
Alexander Turetskiy
f30b5473ab cql: Reject empty options while altering a keyspace
Reject ALTER KEYSPACE requests for NetworkTopologyStrategy when
replication options are missing.

Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).

fixes #10036

Closes scylladb/scylladb#16221
2023-12-10 17:44:35 +02:00
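
The rule from the commit above can be illustrated with a simplified check. This is not the Scylla implementation: the function name `validate_replication_options` is hypothetical, the replication map is modeled as a plain dict, and the check is applied regardless of strategy, which is a simplification of what the commit describes.

```python
def validate_replication_options(options: dict) -> None:
    """Reject a replication map that names a strategy but no replication factor."""
    # Everything except the strategy class is treated as a replication option here.
    factors = {k: v for k, v in options.items() if k != "class"}
    if not factors:
        raise ValueError(
            "replication options must specify at least one replication factor"
        )
```

With this check, `{"class": "NetworkTopologyStrategy"}` is rejected, while `{"class": "NetworkTopologyStrategy", "dc1": "3"}` is accepted.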
Kefu Chai
818343b57d build: build session.cc in CMake building system
this source file was added in d3d83869. so let's update cmake
as well.

sessions_tests was added in the same commit, so add it as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16344
2023-12-09 22:14:47 +02:00
Avi Kivity
d62a5fc60b Merge 'tools/scylla-nodetool: implement additional commands, part 5/N ' from Botond Dénes
This PR implements the following new nodetool commands:
* decommission
* rebuild
* removenode
* getlogginglevels
* setlogginglevel
* move
* refresh

All commands come with tests and all tests pass with both the new and the current nodetool implementations.

Refs: https://github.com/scylladb/scylladb/issues/15588

Closes scylladb/scylladb#16348

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement the refresh command
  tools/scylla-nodetool: implement the move command
  tools/scylla-nodetool: implement setlogginglevel command
  tools/sclla-sstable: implement the getlogginglevels command
  tools/scylla-nodetool: implement the removenode command
  tools/scylla-nodetool: implement the rebuild command
  tools/scylla-nodetool: implement the decommission command
2023-12-09 21:47:22 +02:00
Pavel Emelyanov
5e69415387 guardrails: Do not validate initial_tablets as replication factor
When checking replication strategy options the code assumes (and it's
stated in the preceding code comment) that all options are replication
factors. This is no longer the case: the initial_tablets option is not a
replication factor and should be skipped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#16335
2023-12-09 15:56:41 +02:00
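
The fix described above amounts to filtering out non-replication-factor options before applying guardrail limits. A minimal sketch, assuming hypothetical names (`NON_RF_OPTIONS`, `rf_violations`) and hypothetical threshold parameters `min_rf` and `max_rf`; only the `initial_tablets` option name comes from the commit message.

```python
# Options that appear in the replication map but are not replication factors.
NON_RF_OPTIONS = {"class", "initial_tablets"}


def rf_violations(options: dict, min_rf: int, max_rf: int) -> list:
    """Return the option names whose replication factor is outside [min_rf, max_rf]."""
    violations = []
    for name, value in options.items():
        if name in NON_RF_OPTIONS:
            continue  # e.g. initial_tablets=128 must not be read as RF=128
        rf = int(value)
        if rf < min_rf or rf > max_rf:
            violations.append(name)
    return violations
```

Without the skip, a keyspace created with `initial_tablets` set to a large value would be flagged as if it had an excessive replication factor.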
Botond Dénes
496459165e tools/scylla-nodetool: implement the refresh command 2023-12-08 08:58:16 -05:00
Botond Dénes
ad148a9dbc tools/scylla-nodetool: implement the move command
In the java nodetool, this command ends up calling an API endpoint which
just throws an exception saying moving tokens is not supported. So in
the native implementation we just throw an exception to the same effect
in scylla-nodetool itself.
2023-12-08 08:29:39 -05:00
Botond Dénes
58d3850da1 tools/scylla-nodetool: implement setlogginglevel command 2023-12-08 08:18:56 -05:00
Botond Dénes
3a8590e1af tools/sclla-sstable: implement the getlogginglevels command 2023-12-08 07:32:45 -05:00
Botond Dénes
c35ed794de tools/scylla-nodetool: implement the removenode command 2023-12-08 07:32:31 -05:00
Botond Dénes
9a484cb145 tools/scylla-nodetool: implement the rebuild command 2023-12-08 07:05:30 -05:00
Botond Dénes
ea62f7c848 tools/scylla-nodetool: implement the decommission command 2023-12-08 06:14:36 -05:00
Kefu Chai
893f319004 sstables: add formatter for index_consume_entry_context_state
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we

* move the enum definition out of the class and before the
  class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().

as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start building with fmt v10.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16204
2023-12-08 12:45:38 +02:00
Kurashkin Nikita
c071cd92b5 cql3:statement_restrictions.cc add more conditions to prevent "allow filtering" error to pop up in delete/update statements
Modified Cassandra tests to check for Scylla's error messages
Fixes #12474

Closes scylladb/scylladb#15811
2023-12-07 21:25:18 +02:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In that case it should not be blocked for long, because the streaming process frequently checks whether the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
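
The session mechanism described in the merge above (create a session, admit work under it, close it to fence new work, then wait for admitted work to drain) can be sketched in a few lines. This is a single-process illustration with hypothetical names (`SessionManager`, `StaleSessionError`, `enter`, `leave`, `is_drained`); it does not model group0, RPC, or the barrier command.

```python
import uuid
from collections import defaultdict


class StaleSessionError(Exception):
    """Raised when work tries to start under a session that is not open."""


class SessionManager:
    def __init__(self):
        self._open = set()                 # session ids that admit new work
        self._inflight = defaultdict(int)  # admitted, unfinished work per session

    def create_session(self, session_id):
        self._open.add(session_id)

    def close_session(self, session_id):
        # Fences the session: no new work is admitted after this point.
        self._open.discard(session_id)

    def enter(self, session_id):
        if session_id not in self._open:
            raise StaleSessionError(session_id)
        self._inflight[session_id] += 1

    def leave(self, session_id):
        self._inflight[session_id] -= 1

    def is_drained(self, session_id):
        # A barrier would wait until this becomes true for every closed session.
        return session_id not in self._open and self._inflight[session_id] == 0
```

A session id here would be created with `uuid.uuid4()`; in the design described above the identifier is assigned by the topology change coordinator and stored in system.tablets.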
Avi Kivity
4b1ef00dbb Merge 'File stream for tablet preparation' from Asias He
This series adds preparation patches for file stream tablet implementation in enterprise branch. It minimizes the differences between those two branches.

Closes scylladb/scylladb#16297

* github.com:scylladb/scylladb:
  messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
  compaction_group_for_token: Handle minimum_token and maximum_token token
  serializer: Add temporary_buffer support
  cql_test_env: Allow messaging_service to start listen
2023-12-07 16:26:22 +02:00
Avi Kivity
ed2a9b8750 Merge 'Commitlog: Fix reading/writing position calculations and allocation size checks' from Calle Wilund
Fixes #16298

The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.

However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.

Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead too much, leading to us
more or less getting the same, erroneous results in both ends.

However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.

Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementation terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.

Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.

Added test for initial entry position, as well as data replay consistency for single
entry_writer paths.

Fixes #16301

The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which led us to writing past allowed
segment end, which in turn also leads to metrics overflows.

Closes scylladb/scylladb#16302

* github.com:scylladb/scylladb:
  commitlog: Fix allocation size check to take sector overhead into account.
  commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
2023-12-07 12:27:54 +02:00
Botond Dénes
fb9379edf1 test/cql-pytest: test_select_from_mutation_fragments: bump timeout for slow test
The test test_many_partitions is very slow, as it tests a slow scan over
a lot of partitions. This was observed to time out on the slower ARM
machines, making the test flaky. To prevent this, create an
extra-patient cql connection with a 10 minutes timeout for the scan
itself.

Fixes: #16145

Closes scylladb/scylladb#16303
2023-12-07 11:55:53 +02:00
Yaniv Kaul
862909ee4f Typos: fix typos in documentation
Using codespell, went over the docs and fixed some typos.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#16275
2023-12-07 11:10:17 +02:00
Anna Stuchlik
8b01cb7fb8 doc: set 5.4 as the latest stable version
This commit updates the configuration for
ScyllaDB documentation so that:
- 5.4 is the latest version.
- 5.4 is removed from the list of unstable versions.

It must be merged when ScyllaDB 5.4 is released.

No backport is required.

Closes scylladb/scylladb#16308
2023-12-07 10:04:26 +02:00
Calle Wilund
dba39b47bd commitlog: Fix allocation size check to take sector overhead into account.
Fixes #16301

The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which led us to writing past allowed
segment end, which in turn also leads to metrics overflows.
2023-12-07 07:36:27 +00:00
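
The difference between a size check that ignores sector overhead and one that accounts for it can be shown with a small worked example. All numbers and names here are assumptions for illustration: `SECTOR_SIZE`, `SECTOR_OVERHEAD`, and the placement of the overhead at the end of each sector are hypothetical and are not taken from the commitlog file format.

```python
SECTOR_SIZE = 512      # hypothetical on-disk sector size in bytes
SECTOR_OVERHEAD = 12   # hypothetical per-sector bookkeeping bytes
PAYLOAD_PER_SECTOR = SECTOR_SIZE - SECTOR_OVERHEAD


def file_position(data_pos: int) -> int:
    """Map a position in the data stream (no overhead) to a position in the file."""
    full_sectors, remainder = divmod(data_pos, PAYLOAD_PER_SECTOR)
    return full_sectors * SECTOR_SIZE + remainder


def fits_in_segment(data_pos: int, size: int, max_segment_size: int) -> bool:
    """Check whether `size` more data bytes still end within the segment on disk."""
    return file_position(data_pos + size) <= max_segment_size
```

With these numbers, 1001 bytes of data look like they fit into a 1024-byte segment if overhead is ignored, but `file_position(1001)` is 1025, so the write would end past the segment boundary.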
Calle Wilund
0d35c96ef4 commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
Fixes #16298

The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.

However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.

Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead too much, leading to us
more or less getting the same, erroneous results in both ends.

However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.

Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementation terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.

Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.

Added test for initial entry position, as well as data replay consistency for single
entry_writer paths.
2023-12-07 07:36:27 +00:00
Asias He
6beadab9e6 messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
They will be used to implement file stream for tablet in the future. Reserve
the verb ID.
2023-12-07 14:54:12 +08:00
Asias He
67cfa12c7d compaction_group_for_token: Handle minimum_token and maximum_token token
The following error was seen:

[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token

Since minimum_token and maximum_token are never inside a token range, skip
the in-range check for them.
2023-12-07 14:54:12 +08:00
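
The reason the check fails for the sentinel tokens, and the shape of the fix, can be sketched as follows. This is a model, not the Scylla code: tokens are plain numbers, the sentinels are modeled as infinities, and the names `MINIMUM_TOKEN`, `MAXIMUM_TOKEN`, and `check_group_for_token` are hypothetical. The range is treated as start-exclusive and end-inclusive, matching the `(minimum token, -6917529027641081857]` notation in the log line.

```python
import math

# Sentinels modeled as infinities for this sketch only.
MINIMUM_TOKEN = -math.inf
MAXIMUM_TOKEN = math.inf


def range_contains(start, end, token) -> bool:
    """True if token lies in the half-open range (start, end]."""
    return start < token <= end


def check_group_for_token(start, end, token) -> None:
    """Verify that a compaction group's range contains token, skipping sentinels."""
    if token in (MINIMUM_TOKEN, MAXIMUM_TOKEN):
        return  # sentinels are never inside any range; skip the check
    if not range_contains(start, end, token):
        raise ValueError(f"range ({start}, {end}] does not contain token={token}")
```

Because the start bound is exclusive, a range that begins at the minimum token can never contain the minimum token itself, which is exactly the situation in the logged error.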
Asias He
974b28a750 serializer: Add temporary_buffer support
It will be used by file stream for tablet.
2023-12-07 09:46:37 +08:00
Asias He
faaf58f62c cql_test_env: Allow messaging_service to start listen
This is needed for rpc calls to work in the tests. With this patch, by
default, messaging_service does not listen as it was before.

This is useful for file stream for tablet test.
2023-12-07 09:46:36 +08:00
Avi Kivity
92d61def57 Merge 'scylla_swap_setup: run error check before allocating swap and increase swap allocation speed' from Takuya ASADA
This patch fixes the error check and speeds up swap allocation.

The following patches are included:
 - scylla_swap_setup: run error check before allocating swap
   avoids creating the swapfile before running the error check
 - scylla_swap_setup: use fallocate on ext4
   this increases swap allocation speed on ext4

Closes scylladb/scylladb#12668

* github.com:scylladb/scylladb:
  scylla_swap_setup: use fallocate on ext4
  scylla_swap_setup: run error check before allocating swap
2023-12-06 21:40:10 +02:00
Avi Kivity
55dacb8480 Merge 'Generalize atomic sstables deletion' from Pavel Emelyanov
The current implementation starts in sstables_manager that gets the deletion function from storage which, in turn, should atomically do sst.unlink() over a list of sstables (s3 driver is still not atomic though #13567).

This PR generalizes the atomic deletion inside sstables_manager method and removes the atomic deletor function that nobody liked when it was introduced (#13562)

Closes scylladb/scylladb#16290

* github.com:scylladb/scylladb:
  sstables/storage: Drop atomic deleter
  sstables/storage: Reimplement atomic deletion in sstables_manager
  sstables/storage: Add prepare/complete skaffold for atomic deletion
2023-12-06 19:48:07 +02:00
Tomasz Grabiec
7d0f4c10a2 test: tablets: Add test for failed streaming being fenced away 2023-12-06 18:37:01 +01:00
Tomasz Grabiec
083a0279a9 error_injection: Introduce poll_for_message()
To allow more complex waiting, which involves other exit conditions.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
ce0dc9e940 error_injection: Make is_enabled() public 2023-12-06 18:36:17 +01:00
Tomasz Grabiec
733eb21601 api: Add API to kill connection to a particular host
For testing failure scenarios.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
9dac0febce range_streamer: Do not block topology change barriers around streaming
Streaming was keeping effective_replication_map_ptr around the whole
process, which blocks topology change barriers.

This will inhibit progress of tablet load balancer or concurrent
migrations, resulting in worse performance.

Fix by switching to the most recent erm on sharder
calls. multishard_writer calls shard_of() for each new partition.

A better way would be to switch immediately when topology version
changes, but this is left for later.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
c228f2c940 range_streamer, tablets: Do not keep token metadata around streaming
It holds back global token metadata barrier during streaming, which
limits parallelism of load balancing.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
7a59acf248 tablets: Fail gracefully when migrating tablet has no pending replica
Before the patch we SIGSEGV trying to access pending replica in this
case. Fail early instead.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
31c995332c storage_service, raft topology: Run streaming under session topology guard
Prevents stale streaming operations from running beyond the topology
operation they were started in. After the session field is cleared, or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced and the next barrier will join with any
remaining work.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
080169cad6 storage_service, tablets: Use session to guard tablet streaming 2023-12-06 18:36:17 +01:00
Tomasz Grabiec
5381792401 tablets: Add per-tablet session id field to tablet metadata
range_streamer will pick it up when creating topology_guard.

It's materialized in memory only for migrating tablets in
tablet_transition_info.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
fd3c089ccc service: range_streamer: Propagate topology_guard to receivers 2023-12-06 18:36:16 +01:00