Override the methods that return the expected number of children and the job
size in repair tasks. With them, the get_progress method can return a more
precise progress value.
The cql_test_env has a virtual require_column_has_value() helper that better fits the cql_assertions crowd. Also, the helper in question duplicates some existing code, so it can be made shorter (and one helper of the table class gets removed afterwards)
Closes #15208
* github.com:scylladb/scylladb:
cql_assertions: Make permit from env
table: Remove find_partition_slow() helper
sstable_compaction_test: Do not re-decorate key
cql_test_env: Move .require_column_has_value
cql_test_env: Use table.find_row() shortcut
Motivation:
The user can bootstrap 3 different clusters and then connect them
(#14448). When these clusters start gossiping, their token rings will be
merged, but there will be 3 different group 0s among them, resulting in a
corrupted cluster.
We need to prevent such situations from happening in clusters which
don't use Raft-based topology.
-------
The gossiper service sets its group0 id on startup if it is stored in
`scylla_local`, or sets it while joining group0.
The node sends its group0_id (if it is set) when it tries to initiate a
gossip round. When a node receives gossip_digest_syn, it checks whether the
sender's group0 id equals the local one, and if not, the message is discarded.
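A minimal sketch of the discard rule described above; all names are illustrative (not Scylla's actual API), and the treatment of an unset id -- skipping the check -- is an assumption for mixed-version compatibility, not taken from the actual code:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the gossip SYN message carrying the new field.
struct gossip_digest_syn {
    std::optional<std::string> group0_id; // unset for nodes predating the patch
};

// Returns true when the incoming syn should be discarded: both sides carry a
// group0 id and the ids differ, i.e. the sender belongs to a different group 0.
bool should_discard_syn(const std::optional<std::string>& local_group0_id,
                        const gossip_digest_syn& syn) {
    if (!local_group0_id || !syn.group0_id) {
        return false; // assumed: no id on either side means no check
    }
    return *local_group0_id != *syn.group0_id;
}
```

Nodes that predate the patch send no group0_id, so under this sketch the check is skipped for them and mixed-version gossip keeps working.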
Fixes #14448
Performed manual tests with the following scenario:
1. set up a cluster of two nodes (one compiled with this patch and one without)
2. set up a new node
3. create a basic keyspace and table
4. execute simple select and insert queries
Tested 4 scenarios: the seed node was with or without this patch, and the third node was with or without this patch.
These tests didn't detect any errors.
Closes #15004
* github.com:scylladb/scylladb:
tests: raft: cluster of nodes with different group0 ids
gossip: add group0_id attribute to gossip_digest_syn
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb
(see also ae6fdf1599)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15203
To call table::find_row() one needs to provide a permit. Tests have a
short and neat helper to create one from cql_test_env
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These methods are only used by the public methods of this class and
its derived class "memtable_encoding_stats_collector".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15190
The is_partition_dead() local helper accepts a partition key argument and
decorates it. However, its caller gets the partition key from a decorated
key in the first place, and can just pass that along
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The require_column_has_value() helper finds the cell in three steps -- it
finds the partition, then the row, then the cell. The table class already has
a method to facilitate row finding by partition and clustering key
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal of this PR is to stop cdc_generation_service from calling
system_keyspace::bootstrap_complete(). The reason it's there is that the
gen. service doesn't want to handle generations before the node has joined
the ring or after it was decommissioned. The cleanup is done with the help
of the storage_service->cdc_generation_service explicit dependency brought
back, and this, in turn, freed the raft and API code from the need to carry
a cdc gen. service reference around.
Closes #15047
* github.com:scylladb/scylladb:
cdc: Remove bootstrap state assertion from after_join()
cdc: Rework gen. service check for bootstrap state
api: Don't carry cdc gen. service over
storage_service: Use local cdc gen. service in join_cluster()
storage_service: Remove cdc gen. service from raft_state_monitor_fiber()
raft: Do not carry cdc gen. service over
storage_service: Use local cdc gen. service in topo calls
storage_service: Bring cdc_generation_service dependency back
The reproducer for #14448.
The test starts two nodes with different group0_ids. The second node
is restarted and tries to join the cluster consisting of the first node.
gossip_digest_syn message should be rejected by the first node, so
the second node will not be able to join the cluster.
This test uses repair-based node operations to make it easier.
If the second node successfully joins the cluster, their token metadata
will be merged and the repair service will allow decommissioning the second node.
If not, decommissioning the second node will fail with an exception
"zero replica after the removal" thrown by the repair service.
Gossiper service sets its group0 id on startup if it is stored in `scylla_local`
or sets it during joining group0.
Send group0_id (if it is set) when the node tries to initiate the gossip round.
When a node gets gossip_digest_syn it checks if its group0 id equals the local
one and if not, the message is discarded.
Fixes #14448.
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes #12261
Fixes #15088
Also, add rest_api unit test for those apis
Closes #15093
* github.com:scylladb/scylladb:
test: rest_api: add test_gossiper
gossiper: add get_unreachable_members_synchronized
As was described in the previous patch, this method is explicitly called
by storage service after updating the bootstrap state, so it's unneeded
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The legacy_handle_cdc_generation() checks if the node had bootstrapped
with the help of system_keyspace method. The former is called in two
cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().
The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still the case when it may be stepped on -- decommission. When performed
it calls storage_service::leave_ring() which updates the bootstrap state
to be "needed", thus preventing the cdc gen. service from doing anything
inside gossiper's on_...() notifications.
It would be more correct to stop the cdc gen. service from handling gossiper
notifications by unsubscribing it, not by adding fragile implicit
dependencies on the bootstrap state.
Checks for sys.dist.ks in the legacy_handle_...() are kept in the form
of an on-internal-error. The system distributed keyspace is activated by
storage service even before the bootstrap state is updated and is
never deactivated, but it's anyway good to have this assertion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a storage_service/cdc_streams_check_and_repair endpoint that
needs to provide cdc gen. service to call storage_service method on. Now
the latter has its own reference to the former and API can stop taking
care of that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts a cdc_generation_service ref argument from
main and cql_test_env, but storage service now has a local cdc gen.
service reference, so this argument and its propagation down the stack
can be removed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a cdc_generation_service ref sitting on group0_state_machine, and
the only reason it's there is to call the storage_service::topology_...()
methods. Now that storage service can access the cdc gen. service on its
own, the raft code can forget about cdc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology_state_load() and topology_transition() both take the cdc gen.
service as an argument, but can work with the local reference. This
makes the next patch possible
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This sort of reverts commit 5a97ba7121: storage service now uses the cdc
generation service to serve raft topology updates, which previously meant
carrying the cdc gen. service all over the raft code _just_ to pass it as
an argument to the storage service topology calls.
Also, the API was carrying the cdc gen. service for a single call, and
there's an implicit need to kick the cdc gen. service on decommission, which
also needs storage service to reference the cdc gen. service after boot is
complete
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sending a mutation to a remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.
Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again; it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138
The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
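The idea can be sketched with hypothetical types (none of these match Scylla's actual API): the endpoints and the replication map they were computed from travel down the stack as one unit, so a later stage can never pair them with a different topology snapshot:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the effective replication map.
struct effective_replication_map { int version = 0; };
using erm_ptr = std::shared_ptr<const effective_replication_map>;

// Carries the map together with the endpoints selected from it.
struct write_context {
    erm_ptr erm;                      // the map used to select targets
    std::vector<std::string> targets; // endpoints computed from *erm
};

// Targets are computed from one specific map, which is kept alongside them.
write_context select_targets(erm_ptr erm) {
    return write_context{std::move(erm), {"n1", "n2", "n3"}};
}

// Downstream code consults the same map; no re-fetch that could return a
// newer, mismatched topology.
void send_to_live_endpoints(const write_context& ctx) {
    assert(ctx.erm != nullptr);
}
```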
See also https://github.com/scylladb/scylladb/pull/15141
Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15142
This function is too flaky with the 30-second timeout.
For example, the following was seen locally with
`test_updated_shards_during_add_decommission_node` in dev mode:
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node6.log:
```
INFO 2023-08-27 15:47:25,753 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:30,754 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:35,761 [shard 0] gossip - (rate limiting dropped 495 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:40,766 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:45,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:50,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
ERROR 2023-08-27 15:47:55,758 [shard 0] gossip - Timed out waiting for 2 live nodes to show up in gossip
INFO 2023-08-27 15:47:55,759 [shard 0] init - Shutting down group 0 service
```
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node1.log:
```
INFO 2023-08-27 15:48:02,532 [shard 0] gossip - InetAddress 127.0.43.6 is now UP, status = UNKNOWN
...
WARN 2023-08-27 15:48:03,552 [shard 0] gossip - failure_detector_loop: Send echo to node 127.0.43.6, status = failed: seastar::rpc::closed_error (connection is closed)
```
Note that node1 saw node6 as UP after node6 already timed out
and was shutting down.
Increase the timeout to 3 minutes in all modes to reduce flakiness.
Fixes #15185
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15186
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to let `execution_loop()` exit only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
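A minimal model of the fixed shutdown ordering, using plain C++ threads in place of seastar primitives; all names are illustrative, and the permit gate is represented only by the comment in stop():

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Illustrative model: execution_loop() never consults a _stopped flag; it
// exits only once stop() "breaks" the condition variable, which stop() does
// only after no new permits can appear.
struct semaphore_sketch {
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> ready_list; // permits waiting to be executed
    bool broken = false;        // models _ready_list_cv being broken
    int executed = 0;

    void add_permit(int p) {
        { std::lock_guard<std::mutex> lk(m); ready_list.push_back(p); }
        cv.notify_one();
    }

    void execution_loop() {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return broken || !ready_list.empty(); });
            while (!ready_list.empty()) { // drain everything that is ready
                ready_list.pop_front();
                ++executed;
            }
            if (broken) {
                return;                   // the only exit path
            }
        }
    }

    void stop(std::thread& loop) {
        // In the real code, _permit_gate.close() completes here first,
        // guaranteeing no permit exists and ready_list cannot grow anymore.
        { std::lock_guard<std::mutex> lk(m); broken = true; }
        cv.notify_one();
        loop.join();
    }
};
```

Because the loop drains `ready_list` before honoring `broken`, every permit added before stop() is executed, and nothing can leak after the gate is closed.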
Fixes #15198
Closes #15199
It is too early to require that all nodes in normal state
have a non-null host_id.
The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.
In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.
Fixes scylladb/scylladb#15181
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15184
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Closes #15141
* github.com:scylladb/scylladb:
storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
if the endpoint specified when creating a KEYSPACE is not found,
when flushing a memtable, we would throw an `std::out_of_range`
exception when looking up the client in `storage_manager::_s3_endpoints`
by the name of endpoint. and scylla would crash because of it. so
far, we don't have a good way to error out early. since the
storage option for keyspace is still experimental, we can live
with this, but it would be better if we could spot this error in the
logging messages when testing this feature.
also, in this change, `std::invalid_argument` is thrown instead of
`std::out_of_range`. it's more appropriate in this circumstance.
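A sketch of the changed lookup, with hypothetical names standing in for `storage_manager::_s3_endpoints` and its client type; the point is throwing a descriptive `std::invalid_argument` instead of letting `map::at()` raise a bare `std::out_of_range`:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the S3 client type.
struct client {};

// Look up the client for a named endpoint; an unknown endpoint name is a
// configuration error, so report it as std::invalid_argument with context
// rather than a bare std::out_of_range from map::at().
client& get_endpoint_client(std::map<std::string, client>& endpoints,
                            const std::string& name) {
    auto it = endpoints.find(name);
    if (it == endpoints.end()) {
        throw std::invalid_argument("unknown S3 endpoint: " + name);
    }
    return it->second;
}
```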
Refs #15074
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15075
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Closes #15149
* github.com:scylladb/scylladb:
storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
Since 75d1dd3a76
gossiper::convict will no longer call `mark_dead`
(e.g. when called from the failure detection loop
after a node is stopped following decommission)
and therefore the on_dead notification won't get called.
To make that explicit, if the node was alive before
remove_endpoint erased it from _live_endpoint,
and it has an endpoint_state, call the on_dead notifications.
These are important to clean up after the node is dead,
e.g. in storage_proxy::on_down, which cancels all
respective write handlers.
This is preferred over going through `mark_dead`, as the latter
marks the endpoint as unreachable, which is wrong in this
case since the node left the cluster.
Fixes #15178
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15179
Disabling fstrim.timer was meant to avoid running fstrim on /var/lib/scylla
from both scylla-fstrim.timer and fstrim.timer, but fstrim.timer never
actually does that, since it only looks at fstab entries, not our systemd unit.
To run fstrim correctly on the rootfs and other filesystems not related to
scylla, we should stop disabling fstrim.timer.
Fixes #15176
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #15177
seastar::format() just forwards the parameters to be formatted to
`fmt::format_to()`, which is able to format `std::string`, so there is
no need to cast the `std::string` instance to `std::string_view` for
formatting it.
in this change, the cast is dropped. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15143
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" to the current position of the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.
Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.
Fixes #14782
Closes #14834
New file streaming for tablets will require integration with compaction
groups. So this patch introduces a way for streaming to take a storage
snapshot of a given tablet using its token range. The memtable is flushed
first, so all data of a tablet can be streamed through its sstables.
The interface is compaction group / tablet agnostic, but a user can
easily pick the data of a single tablet by using the range in the
tablet metadata for that tablet.
E.g.:
```
auto erm = table.get_effective_replication_map();
auto& tm = erm->get_token_metadata();
auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id());
for (auto tid : tablet_map.tablet_ids()) {
    auto tr = tablet_map.get_token_range(tid);
    auto ssts = co_await table.take_storage_snapshot(tr);
    ...
}
```
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15128
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` wants to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystems
implement the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on a local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data loss if the application / machine /
network breaks when the data is considered persisted but it is
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped sink.
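The delegation can be sketched as follows (hypothetical names; the real classes wrap seastar's data_sink machinery, which is not reproduced here):

```cpp
#include <cassert>
#include <memory>

// Hypothetical base sink; flush() pushes buffered data to the backing store.
struct sink {
    bool flushed = false;
    virtual void flush() { flushed = true; }
    virtual ~sink() = default;
};

// A checksumming wrapper around another sink. It must forward flush() to the
// wrapped sink instead of inheriting a default that short-circuits it;
// otherwise a multipart-upload sink never sees the flush before close() and
// aborts the upload.
struct checksumming_sink : sink {
    std::unique_ptr<sink> out;
    explicit checksumming_sink(std::unique_ptr<sink> s) : out(std::move(s)) {}
    void flush() override { out->flush(); } // delegate, don't swallow
};
```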
Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15134
It's possible that compaction task is preempted after completion and
before reevaluation, causing pending_tasks to be > 1.
Let's only exit the loop if there are no pending tasks, and also
reduce the 100ms sleep, which is an eternity for this test.
Fixes #14809.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15059
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places,
do that once and keep it in the context object.
It will be used in the next patch to get a respective
effective_replication_map_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>