Add initialization of maintenance_auth_service and cql_maintenance_server_ctl.
Create maintenance socket which enables interaction with the node through
CQL protocol without authentication. The maintenance port is available
by Unix domain socket. It gives full-permission access.
It is created before the node joins the cluster.
Move initialization of cql controller and auth service to functions.
It will make it easier to create a new cql controller with a seperate auth service,
for example for the maintenance socket.
Make it possible to initialize new services before joining group0.
The maintenance socket is created before joining the cluster. When maintenance auth service
is started it creates system_auth keyspace if it's missing. It is not synchronized
with other nodes, because this node hasn't joined the group0 yet. Thus a node has
a mismatched schema and is unable to join the cluster.
The maintenance socket doesn't use role management, thus the problem is solved
by not creating system_auth keyspace when maintenance auth service is created.
The logic of regular CQL port's auth service won't be changed. For the maintenance
socket will be created a new separate auth service.
Add an option to listen on the maintenance socket. It is set up on an unix domain socket
and the metrics are disabled.
This enables having an independent authentication mechanism for this socket.
To start the maintenance socket, a new cql_controller has to be created
with
`db::maintenance_socket_enabled::yes` argument.
Creating maintenance socket will raise an exception if
* the path is longer than 107 chars (due to linux limits),
* a file or a directory already exists in the path.
The indentation is fixed in the next commit.
Add `maintenance_socket_role_manager` which will disable all operations
associated with roles to not depend on system_auth keyspace, which may
be not yet created when the maintenance socket starts listening
If set to "ignore", maintenance socket will be disabled.
If set to "workdir", maintenance socket will be opened on <scylla's
workdir>/cql.m.
Otherwise it will be opened on path provided by maintenance_socket
variable.
It is set by default to 'ignore'.
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missed.
Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).
fixes#10036Closesscylladb/scylladb#16221
this source file was added in d3d83869. so let's update cmake
as well.
sessions_tests was added in the same commit, so add it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16344
This PR implements the following new nodetool commands:
* decomission
* rebuild
* removenode
* getlogginglevels
* setlogginglevel
* move
* refresh
All commands come with tests and all tests pass with both the new and the current nodetool implementations.
Refs: https://github.com/scylladb/scylladb/issues/15588Closesscylladb/scylladb#16348
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement the refresh command
tools/scylla-nodetool: implement the move command
tools/scylla-nodetool: implement setlogginglevel command
tools/sclla-sstable: implement the getlogginglevels command
tools/scylla-nodetool: implement the removenode command
tools/scylla-nodetool: implement the rebuild command
tools/scylla-nodetool: implement the decommission command
When checking replication strategy options the code assumes (and it's
stated in the preceeding code comment) that all options are replication
factors. Nowadays it's no longer so, the initial_tablets one is not
replication factor and should be skipped
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16335
In the java nodetool, this command ends up calling an API endpoint which
just throws an exception saying moving tokens is not supported. So in
the native implementation we just throw an exception to the same effect
in scylla-nodetool itself.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we
* move the enum definition out of the class and before the
class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().
as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16204
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
This series adds preparation patches for file stream tablet implementation in enterprise branch. It minimizes the differences between those two branches.
Closesscylladb/scylladb#16297
* github.com:scylladb/scylladb:
messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
compaction_group_for_token: Handle minimum_token and maximum_token token
serializer: Add temporary_buffer support
cql_test_env: Allow messaging_service to start listen
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Closesscylladb/scylladb#16302
* github.com:scylladb/scylladb:
commitlog: Fix allocation size check to take sector overhead into account.
commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
The test test_many_partitions is very slow, as it tests a slow scan over
a lot of partitions. This was observed to time out on the slower ARM
machines, making the test flaky. To prevent this, create an
extra-patient cql connection with a 10 minutes timeout for the scan
itself.
Fixes: #16145Closesscylladb/scylladb#16303
This commit updates the configuration for
ScyllaDB documentation so that:
- 5.4 is the latest version.
- 5.4 is removed from the list of unstable versions.
It must be merged when ScyllaDB 5.4 is released.
No backport is required.
Closesscylladb/scylladb#16308
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
The following error was seen:
[shard 0] table - compaction_group_for_token: compaction_group idx=0 range=(minimum
token,-6917529027641081857] does not contain token=minimum token
Since minimum_token or maximum_token will not be inside a token range. Skip
the in token range check.
This is needed for rpc calls to work in the tests. With this patch, by
default, messaging_service does not listen as it was before.
This is useful for file stream for tablet test.
This patch fixes error check and speed up swap allocation.
Following patches are included:
- scylla_swap_setup: run error check before allocating swap
avoid create swapfile before running error check
- scylla_swap_setup: use fallocate on ext4
this inclease swap allocation speed on ext4
Closesscylladb/scylladb#12668
* github.com:scylladb/scylladb:
scylla_swap_setup: use fallocate on ext4
scylla_swap_setup: run error check before allocating swap
The current implementation starts in sstables_manager that gets the deletion function from storage which, in turn, should atomically do sst.unlink() over a list of sstables (s3 driver is still not atomic though #13567).
This PR generalizes the atomic deletion inside sstables_manager method and removes the atomic deletor function that nobody liked when it was introduced (#13562)
Closesscylladb/scylladb#16290
* github.com:scylladb/scylladb:
sstables/storage: Drop atomic deleter
sstables/storage: Reimplement atomic deletion in sstables_manager
sstables/storage: Add prepare/complete skaffold for atomic deletion
Streaming was keeping effective_replication_map_ptr around the whole
process, which blocks topology change barriers.
This will inhibit progress of tablet load balancer or concurrent
migrations, resulting in worse performance.
Fix by switching to the most recent erm on sharder
calls. multishard_writer calls shard_of() for each new partition.
A better way would be to switch immediately when topology version
changes, but this is left for later.
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.
Also will be used in tests to ensure tablets stick to expected locations.
Prevents stale streaming operation from running beyond topology
operation they were started in. After the session field is cleared, or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced and the next barrier will join with any
remaining work.