In this PR, we ensure unpublished CDC generation's data is
never removed, which was theoretically possible. If it happened,
it could cause problems. CDC generation publisher would then try
to publish the generation with its data removed. In particular, the
precondition of calling `_sys_ks.read_cdc_generation` wouldn't be
satisfied.
We also add a test that passes only after the fix. However, this test
needs to block execution of the CDC generation publisher's loop
twice. Currently, error injections with handlers do not allow it
because handlers always share received messages. Apart from the
first created handler, all handlers would be instantly unblocked by
a message from the past that has already unblocked the first
handler. This seems like a general limitation that could cause
problems in the future, so in this PR, we extend injections with
handlers to solve it once and for all. We add the `share_messages`
parameter to the `inject` (with handler) function. Depending on its
value, handlers will share messages (as before) or not.
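The difference between the two modes can be sketched with a toy model. This is not the code from `error_injection.hh`: the `Injection` and `Handler` classes, the `try_receive` method and the cursor fields are hypothetical names used only for illustration, and we assume that non-sharing handlers consume messages from one common cursor.

```python
class Injection:
    """Toy model of a single named injection that receives messages."""

    def __init__(self):
        self.received = []   # every message received so far
        self.consumed = 0    # common cursor, used only by non-sharing handlers

    def send(self, msg):
        self.received.append(msg)

    def handler(self, share_messages=True):
        return Handler(self, share_messages)


class Handler:
    def __init__(self, injection, share_messages):
        self.injection = injection
        self.share_messages = share_messages
        self.position = 0    # private cursor, used only by sharing handlers

    def try_receive(self):
        """Return the next message, or None if the handler would block."""
        inj = self.injection
        if self.share_messages:
            # Every sharing handler replays the whole history from the start,
            # so a message from the past unblocks a newly created handler.
            if self.position < len(inj.received):
                msg = inj.received[self.position]
                self.position += 1
                return msg
            return None
        # Non-sharing handlers: each message is consumed by exactly one handler.
        if inj.consumed < len(inj.received):
            msg = inj.received[inj.consumed]
            inj.consumed += 1
            return msg
        return None
```

In this model, a second sharing handler created after the first message is unblocked immediately, while a second non-sharing handler stays blocked until the test sends a new message.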
Fixes scylladb/scylladb#17497
Closes scylladb/scylladb#17934
* github.com:scylladb/scylladb:
topology_coordinator: clean_obsolete_cdc_generations: fix log
topology_coordinator: do not clear unpublished CDC generation's data
topology_coordinator: cdc_generation_publisher_fiber injection: make handlers share messages
error_injection: allow injection handlers to not share messages
In this PR we add timeout support to the raft group registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
and exposes its interface with an additional `raft_timeout` parameter. If
it's set, the wrapper cancels the `abort_source` after a certain amount of
time. The value of the timeout can be specified either in the
`raft_timeout` parameter, or the default value can be set in the
`raft_server_with_timeouts` class constructor.
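The wrapper idea can be sketched in Python with asyncio. This is only an analogy, not the C++ implementation: `RaftTimeout`, `ServerWithTimeouts` and `SlowServer` are hypothetical names, and `asyncio.wait_for` cancelling the call stands in for the wrapper cancelling the `abort_source`.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftTimeout:
    # None means "use the default configured in the wrapper".
    seconds: Optional[float] = None


class SlowServer:
    """Hypothetical stand-in for a raft server whose calls take `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay

    async def add_entry(self, entry):
        await asyncio.sleep(self.delay)
        return entry


class ServerWithTimeouts:
    """Wraps a server and adds an optional timeout to its calls."""

    def __init__(self, server, default_timeout):
        self._server = server
        self._default_timeout = default_timeout

    async def add_entry(self, entry, timeout=None):
        call = self._server.add_entry(entry)
        if timeout is None:
            # The caller did not ask for a timeout: wait indefinitely.
            return await call
        seconds = self._default_timeout if timeout.seconds is None else timeout.seconds
        # wait_for cancels the underlying call when the timeout expires.
        return await asyncio.wait_for(call, seconds)
```

A caller that passes no `RaftTimeout` keeps the old blocking behaviour, which mirrors how call sites opt in to timeouts one by one.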
The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with
`group0-raft-op-timeout-in-ms` parameter.
The new API allows the client to decide whether to use timeouts or not.
In this PR we review all the group0 call sites and add
`raft_timeout` if that makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.
Fixes scylladb/scylladb#16604
Closes scylladb/scylladb#17590
* github.com:scylladb/scylladb:
migration_manager: use raft_timeout{}
storage_service::join_node_response_handler: use raft_timeout{}
storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
storage_service::set_tablet_balancing_enabled: use raft_timeout{}
storage_service::move_tablet: use raft_timeout{}
raft_check_and_repair_cdc_streams: use raft_timeout{}
raft_timeout: test that node operations fail properly
raft_rebuild: use raft_timeout{}
do_cluster_cleanup: use raft_timeout{}
raft_initialize_discovery_leader: use raft_timeout{}
update_topology_with_local_metadata: use with_timeout{}
raft_decommission: use raft_timeout{}
raft_removenode: use raft_timeout{}
join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
raft_group0: make_raft_config_nonvoter: add abort_source parameter
manager_client: server_add with start=false shouldn't call driver_connect
scylla_cluster: add seeds parameter to the add_server and servers_add
raft_server_with_timeouts: report the lost quorum
join_node_request_handler: add raft_timeout{} for start_operation
skip_mode: add platform_key
auth: use raft_timeout{}
raft_group0_client: add raft_timeout parameter
raft_group_registry: add group0_with_timeouts
utils: add composite_abort_source.hh
error_injection: move api registration to set_server_init
error_injection: add inject_parameter method
error_injection: move injection_name string into injection_shared_data
error_injection: pass injection parameters at startup
Reduce the sprawl of sstables::test_env in .cc and .hh files, to ease
maintenance and reduce recompilations.
Closes scylladb/scylladb#17965
* github.com:scylladb/scylladb:
test: sstables::test_env: complete pimplification
test/lib: test_env: move test_env::reusable_sst() to test_services.cc
When a keyspace uses tablets, then effective ownership
can be obtained per table. If the user passes only a
keyspace, then /storage_service/ownership/{keyspace}
returns an error.
This change:
- adds an additional positional parameter to the 'status'
command that allows a user to query the status of a table
in a keyspace
- makes usage of /storage_service/ownership/{keyspace}
optional to avoid errors when user tries to obtain
effective ownership of a keyspace that uses tablets
- implements new frontend tests in 'test_status.py'
that verify the new logic
Refs: scylladb#17405
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#17827
Migrate data from `system_distributed.service_levels` to
`system.service_levels_v2` during raft topology upgrade.
The migration process reads data from the old table with CL ALL
and inserts the data into the new table via raft.
Adjust service_level_controller and
service_level_controller::service_level_distributed_data_accessor
interfaces to take `group0_guard` while adding/altering/dropping a
service level.
sstables::test_env uses the pimpl idiom, but incompletely. This
prevents reaping some of the benefits.
Complete the pimplification:
- the `impl` nested struct is moved out-of-line
- all non-template member functions are moved out-of-line
- a destructor is declared and defined out-of-line
- the move constructor is also defined (necessary after the destructor is
defined)
After this, we can forward-declare more components.
The test_env implementation is scattered across two .cc files; concentrate it
in test_services.cc, which happens to be the file that doesn't cause
link errors.
Move toc_filename with it, as it is its only caller and it is static.
Fixes #16912
By default, ScyllaDB stores the maintenance socket in the workdir. By default, test.py uses testlog/{mode}/scylla-# as the ScyllaDB workdir. The usual location for cloning the repo is the user's home folder. In some cases, this can lead to the socket path being too long, and the test will start to fail. The simple fix is to move the maintenance socket to the /tmp folder to eliminate such a possibility.
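The length problem can be illustrated with a short check. It assumes Linux, where the `sun_path` field of `sockaddr_un` is 108 bytes including the terminating NUL. The helper name `socket_path_fits` is hypothetical and is not part of test.py.

```python
import os

# Size of sockaddr_un.sun_path on Linux, including the trailing NUL byte.
UNIX_PATH_MAX = 108


def socket_path_fits(path):
    """Return True if `path` is short enough to be bound as a Unix socket."""
    return len(os.fsencode(path)) < UNIX_PATH_MAX
```

A deep checkout under the home folder plus testlog/{mode}/scylla-# can exceed this limit, while a path under /tmp is likely to stay well below it.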
Closes scylladb/scylladb#17941
In this commit, we ensure unpublished CDC generation's data is
never removed, which was theoretically possible. If it happened,
it could cause problems. CDC generation publisher would then try
to publish the generation with its data removed. In particular, the
precondition of calling `_sys_ks.read_cdc_generation` wouldn't be
satisfied.
We also add a test that passes only after the fix.
In the following commit, we add a test that needs to block the CDC
generation publisher's loop twice. We allow it in this commit by
making handlers of the `cdc_generation_publisher_fiber` injection
share messages. From now on, unblocking every step of the loop will
require sending a new message from the test.
This change breaks the test already using the
`cdc_generation_publisher_fiber` injection, so we adjust the test.
For a single injection, all created injection handlers share all
received messages. In particular, it means that one received message
unblocks all handlers waiting for the first message. This behavior
is often desired, for example, if multiple fibers execute the
injected code and we want to unblock them all with a single message.
However, there is a problem if we want to block every execution
of the injected code. Apart from the first created handler, all
handlers will be instantly unblocked by messages from the past that
have already unblocked the first handler.
In one of the following commits, we add a test that needs to block
the CDC generation publisher's loop twice. Since it looks like there
are no good workarounds for this arguably general problem, we extend
injections with handlers in a way that solves it. We introduce the
new `share_messages` parameter. Depending on its value, handlers
will share messages or not. The details are described in the new
comments in `error_injection.hh`.
We also add some basic unit tests for the new functionality.
Checking all the call sites of the migration manager shows
that all of them are initiated by user requests,
not background activities. Therefore, we add a global
raft_timeout{} here.
We also add a specific test_quorum_lost_during_node_join. It
exercises the case when the quorum is lost after start_operation
but before these methods are called.
If the server is not started, there is no point
in starting the driver; it would fail because there
are no nodes to connect to. On the other hand, we
should connect the driver in server_start()
if it's not connected yet.
If this parameter is set, we use its value for
the scylla.yaml of the new node, otherwise we
use IPs of all running nodes as before.
We'll need this parameter in subsequent commits to
restrict the communication between nodes.
We remove default values for _create_server_add_data parameters
since they are redundant - in the two call sites we pass all
of them.
In this commit we extend the timeout error message with
additional context - if we see that there is no quorum of
available nodes, we report this as the most likely
cause of the error.
We adjust the test by adding this new information to the
expected_error. We need raft-group-registry-fd-threshold-in-ms
to make _direct_fd threshold less than
group0-raft-op-timeout-in-ms.
In the test, we use the group0-raft-op-timeout-in-ms parameter to
reduce the timeout to one second so as not to waste time.
The join_node_request_handler method contains other group0 calls
which should have timeouts (make_nonvoters and add_entry). They
will be handled in a separate commit.
In subsequent commits we are going to add test.py
tests for the raft_timeout{} feature. The problem is that the
aarch64/debug configuration is infamously slow. The timeout
settings used in tests work for all platforms but aarch64/debug.
In this commit we extend the skip_mode attribute with the
platform_key property. We'll use @skip_mode('debug', platform_key='aarch64')
to skip the tests for this specific configuration.
The tests will still be run for aarch64/release.
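The skip decision can be sketched as a small predicate. This is an illustration only: `should_skip` is a hypothetical helper, and matching `platform_key` as a substring of `platform.machine()` is an assumption about how the key is compared.

```python
import platform


def should_skip(current_mode, skip_mode, platform_key=None, machine=None):
    """Decide whether a test marked with skip_mode should be skipped.

    `machine` defaults to the architecture reported by the running host.
    """
    if machine is None:
        machine = platform.machine()
    if current_mode != skip_mode:
        return False
    # Without a platform_key the mode alone decides; with one, both must match.
    return platform_key is None or platform_key in machine
```

With this predicate, a test marked for 'debug' with platform_key='aarch64' is skipped only on aarch64 debug builds and still runs on aarch64 release builds and on x86_64 debug builds.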
There were two more things missing:
* Allow global options to be positioned before the operation/command option (https://github.com/scylladb/scylladb/issues/16695)
* Ignore JVM args (https://github.com/scylladb/scylladb/issues/16696)
This PR fixes both. With this, hopefully we are fully compatible with nodetool as far as command line parsing is concerned.
After this PR goes in, we will need another fix to tools/java/bin/nodetool-wrapper, to allow users to benefit from this fix. Namely, after this PR, we can just try to invoke scylla-nodetool first with all the command-line args as-is. If it returns with exit-code 100, we fall back to nodetool. We will not need the current trick with `--help $1`. In fact, this trick doesn't work currently, because `$1` is not guaranteed to be the command in the first place.
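The fallback described above can be sketched as follows. This is not the actual nodetool-wrapper script: `run_nodetool` and its injectable `run` parameter are hypothetical, and the command names are used only as placeholders for the native and legacy tools.

```python
import subprocess

# Exit code with which the native tool asks the wrapper to fall back.
FALLBACK_EXIT_CODE = 100


def run_nodetool(args, run=None):
    """Try the native nodetool first, fall back to the legacy one on exit code 100.

    `run` takes a command list and returns its exit code; it is injectable
    so that the logic can be exercised without the real binaries.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    code = run(["scylla-nodetool", *args])
    if code == FALLBACK_EXIT_CODE:
        code = run(["nodetool", *args])
    return code
```

The arguments are passed through unchanged to both tools, which is what makes the `--help $1` trick unnecessary.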
In addition to the above, this PR also introduces a new option, to help us in the switching process. This is `--rest-api-port`, which can also be provided as `-Dcom.scylladb.apiPort`. When provided, this option takes precedence over `--port|-p`. This is intended as a bridge for `scylla-ccm`, which currently provides the JMX port as `--port`. With this change, it can also provide the REST API port as `-Dcom.scylladb.apiPort`. The legacy nodetool will ignore this, while the native nodetool will use it to connect to the correct REST API address. After the switch we can ditch these options.
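The precedence rule can be sketched with a minimal argument scan. This is not the parser used by the native nodetool: `resolve_api_port` is a hypothetical helper, and both the space-separated form of `--rest-api-port` and the `-Dcom.scylladb.apiPort=<port>` form are assumptions about the exact syntax.

```python
def resolve_api_port(args):
    """Return the port to connect to, or None if no port option is given.

    The REST API port option wins over --port/-p when both are present.
    """
    port = None
    rest_api_port = None
    it = iter(args)
    for arg in it:
        if arg in ("-p", "--port"):
            port = int(next(it))
        elif arg == "--rest-api-port":
            rest_api_port = int(next(it))
        elif arg.startswith("-Dcom.scylladb.apiPort="):
            rest_api_port = int(arg.split("=", 1)[1])
    return rest_api_port if rest_api_port is not None else port
```

In this sketch, a caller that passes the JMX port as `--port` and the REST API port as the `-D` option ends up with the REST API port, regardless of the order of the two options.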
Fixes: https://github.com/scylladb/scylladb/issues/16695
Fixes: https://github.com/scylladb/scylladb/issues/16696
Refs: https://github.com/scylladb/scylladb/issues/16679
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#17168
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add --rest-api-port option
tools/scylla-nodetool: ignore JVM args
tools/utils: make finding the operation command line option more flexible
tools/utils: get_selected_operation(): remove alias param
tools: add constant with current help command-line arguments
The recently added test_tablets_migration dominates with its run-time (10
minutes). Also update other tests, e.g. test_read_repair is not in top-7
for any mode, test_replace and test_raft_recovery_majority_loss are both
not notably slower than most of other tests (~40 sec both). On the other
hand, the test_raft_recovery_basic and test_group0_schema_versioning are
both 1+ minute.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#17927
The initial version used a redundant method, and it did not cover all
cases. This led to flakiness in the test that used this method.
Switching to the cluster_con() method removes flakiness since it's
written more robustly.
Fixes scylladb/scylladb#17914
Closes scylladb/scylladb#17932
The group0 state machine calls `merge_topology_snapshot` from
`transfer_snapshot`. It feeds it with `raft_topology_snapshot` returned
from `raft_pull_topology_snapshot`. This snapshot includes the entire
`system.cdc_generations_v3` table. It can be huge and break the
commitlog `max_record_size` limit.
The `system.cdc_generations_v3` is a single-partition table, so all the
data is contained in one mutation object. To fit the commitlog limit we
split this mutation into many smaller ones and apply them in separate
`database::apply` calls. That means we give up the atomicity guarantee,
but we actually don't need it for `system.cdc_generations_v3` and
`system.topology_requests`.
This PR fixes the dtest
`update_cluster_layout_tests.py::TestLargeScaleCluster::test_add_many_nodes_under_load`
Fixes scylladb/scylladb#17545
Closes scylladb/scylladb#17632
* github.com:scylladb/scylladb:
test_cdc_generation_data: test snapshot transfer
storage_service::merge_topology_snapshot: handle big cdc_generations_v3 mutations
mutation: add split_mutation function
storage_service::merge_topology_snapshot: fix indentation
sstables::test_env is intended for sstable unit tests, but to satisfy its
dependency of an sstables_registry we instantiate an entire database.
Remove the dependency by having a mock implementation of sstables_registry
and using that instead.
Closes scylladb/scylladb#17895
If there is a bug in the tablet scheduler which makes it never
converge for a given state of topology, rebalance_tablets() will never
complete and will generate a huge amount of logs. This patch adds a
sanity limit so that we fail earlier.
This was observed in one of the test_load_balancing_with_random_load runs in CI.
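The idea of a sanity limit can be sketched as a bounded loop. This is not the code of rebalance_tablets(): the `rebalance` function, its `make_plan` and `apply_plan` callbacks, and the default limit of 1000 are hypothetical choices made for illustration.

```python
def rebalance(make_plan, apply_plan, max_iterations=1000):
    """Apply migration plans until the scheduler produces an empty plan.

    Returns the number of plans applied. Raises if the scheduler still
    produces work after max_iterations rounds, instead of looping forever.
    """
    for applied in range(max_iterations):
        plan = make_plan()
        if not plan:
            return applied
        apply_plan(plan)
    raise RuntimeError(
        f"rebalance did not converge in {max_iterations} iterations")
```

Failing with an explicit error keeps a non-converging scheduler from filling the logs, at the cost of also rejecting a run that would need more rounds than the chosen limit.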
Fixes scylladb/scylladb#17894.
Closes scylladb/scylladb#17916
The series marks nodes to be non expiring in the address map earlier, when
they are placed in the topology.
Fixes: scylladb/scylladb#16849
* 'gleb/16849-fix-v2' of github.com:scylladb/scylla-dev:
test: add test to check that address cannot expire between join request placement and its processing
topology_coordinator: set address map entry to nonexpiring when a node is added to the topology
raft_group0: add modifiable_address_map() function
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.
in this change, we define formatters for
* raft_call
* raft_read
* network_majority_grudge
* reconfiguration
* stop_crash
* operation::thread_id
* append_seq
* AppendReg::append
* AppendReg::ret
* operation::either_of<Ops...>
* operation::exceptional_result<Op>
* operation::completion<Op>
* operation::invocable<Op>
and drop their operator<<:s.
in which,
* `operator<<` for append_entry is never used. so it is removed.
* `operator<<` for `std::monostate` and `std::variant` are dropped. as we are now using their counterparts in {fmt}.
* stop_crash::result_type 's `fmt::formatter` is not added, as we cannot define a partial specialization of `fmt::formatter` for a nested class for a template class. we will tackle this struct in another change.
Refs #13245
Closes scylladb/scylladb#17884
* github.com:scylladb/scylladb:
test: raft: generator: add fmt::formatter:s
test: randomized_nemesis_test: add fmt::formatter for some types
test: randomized_nemesis_test: add fmt::formatter for seastar::timed_out_error
raft: add fmt::formatter for error classes
The test only looked at the initial CDC
generation. It made the changes bigger to go
past the raft max_command_size limit.
It then made sure this large mutation set is saved
in several raft commands.
In this commit we enhance the test to check that the
mutations are properly handled during snapshot transfer.
The problem is that the entire system.cdc_generations_v3
table is read into the topology_snapshot and its total
size can exceed the commitlog max_record_size limit.
We need a separate injection since the compaction
could nullify the effects of the previous injection.
The test fails without the fix from the previous commit.
The function splits the source mutation into multiple
mutations so that their size does not exceed the
max_size limit. The size of a mutation is calculated
as the sum of the memory_usage() of its constituent
mutation_fragments.
The implementation is taken from view_updating_consumer.
We use mutation_rebuilder_v2 to reconstruct mutations from
a stream of mutation fragments and recreate the output
mutation whenever we reach the limit.
We'll need this function in the next commit.
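The splitting strategy can be sketched on plain Python objects. This is not the C++ function: `split_by_size` is a hypothetical name, fragments are modelled as arbitrary items, and `size_of` stands in for memory_usage(). The sketch closes a batch before it would exceed the limit, which may differ in detail from when the real code recreates the output mutation.

```python
def split_by_size(fragments, max_size, size_of=len):
    """Group consecutive fragments into batches of at most max_size.

    A single fragment larger than max_size still gets a batch of its own,
    since a fragment cannot be split any further in this model.
    """
    batches = []
    current = []
    current_size = 0
    for fragment in fragments:
        fragment_size = size_of(fragment)
        if current and current_size + fragment_size > max_size:
            # Adding this fragment would exceed the limit: close the batch.
            batches.append(current)
            current = []
            current_size = 0
        current.append(fragment)
        current_size += fragment_size
    if current:
        batches.append(current)
    return batches
```

Concatenating the batches in order gives back the original fragment stream, so applying them one by one loses atomicity but not data.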
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* operation::either_of<Ops...>
* operation::exceptional_result<Op>
* operation::completion<Op>
* operation::invocable<Op>
and drop their operator<<:s.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* raft_call
* raft_read
* network_majority_grudge
* reconfiguration
* stop_crash
* operation::thread_id
* append_seq
* append_entry
* AppendReg::append
* AppendReg::ret
and drop their operator<<:s.
in which,
* `operator<<` for `std::monostate` and `std::variant` are dropped.
as we are now using their counterparts in {fmt}.
* stop_crash::result_type 's `fmt::formatter` is not added, as we
cannot define a partial specialization of `fmt::formatter` for
a nested class for a template class. we will tackle this struct
in another change.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatter for `seastar::timed_out_error`,
which will be used by the `fmt::formatter` for `std::variant<...>`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The token ring table is a virtual table (`system.token_ring`), which contains the ring information for all keyspaces in the system. This is essentially an alternative to `nodetool describering`, but since it is a virtual table, it allows for all the usual filtering/aggregation/etc. that CQL supports.
Up until now, this table only supported keyspaces which use vnodes. This PR adds support for tablet keyspaces. To accommodate these keyspaces a new `table_name` column is added, which is set to `ALL` for vnodes keyspaces. For tablet keyspaces, this contains the name of the table.
Simple sanity tests are added for this virtual table (it had none).
Fixes: #16850
Closes scylladb/scylladb#17351
* github.com:scylladb/scylladb:
test/cql-pytest: test_virtual_tables: add test for token_ring table
db/virtual_tables: token_ring_table: add tablet support
db/virtual_tables: token_ring_table: add table_name column
db/virtual_tables: token_ring_table: extract ring emit
service/storage_service: describe_ring_for_table(): use topology to map hostid to ip
In topology on raft, management of CDC generations is moved to the topology coordinator.
We need to verify that CDC keeps working correctly during the upgrade to topology on raft.
A similar change will be made in the topology recovery test. It will reuse
the `start_writes_to_cdc_table` function.
Ref #17409
Closes scylladb/scylladb#17828
Affects load-and-stream for tablets only.
The intention is that only this loop is responsible for detecting
exhausted sstables and then discarding them for next iterations:
    while (sstable_it != _sstables.rend() && exhausted(*sstable_it)) {
        sstable_it++;
    }
But the loop which consumes non exhausted sstables, on behalf of
each tablet, was incorrectly advancing the iterator, even though the
sstable wasn't considered exhausted.
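The intended iteration can be sketched with a simplified model. This is not the load-and-stream code: `sstables_per_tablet` is a hypothetical function, sstables and tablets are modelled as inclusive (first_token, last_token) integer ranges, and the sketch iterates forward over sstables sorted by first token instead of using reverse iterators.

```python
def sstables_per_tablet(sstables, tablets):
    """For each tablet, list the sstables that overlap its token range.

    Both inputs are lists of (first_token, last_token) pairs; sstables are
    sorted by first token and tablets are processed in token order.
    """
    result = []
    i = 0  # persistent cursor: advanced only past exhausted sstables
    for t_first, t_last in tablets:
        # Only this loop discards sstables for the following iterations.
        while i < len(sstables) and sstables[i][1] < t_first:
            i += 1
        selected = []
        j = i  # local cursor: consuming must not move the persistent one
        while j < len(sstables) and sstables[j][0] <= t_last:
            if sstables[j][1] >= t_first:
                selected.append(sstables[j])
            j += 1
        result.append(selected)
    return result
```

An sstable that spans several tablets is offered to each of them here. Advancing the persistent cursor in the consuming loop, as in the bug, would hide such an sstable from the later tablets.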
Fixes #17733.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#17899