The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we furturize the function to get local ranges and sstables.
In addition, this patch removes the dependency to global storage_service object.
Fixes#6662
(cherry picked from commit 07e253542d)
The get token range API can become big which can cause large allocation
and stalls.
This patch replace the implementation so it would stream the results
using the http stream capabilities instead of serialization and sending
one big buffer.
Fixes#6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d532)
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
(cherry picked from commit 02e046608f)
Fixes#6336.
The global get_highest_supported_format helper and its declaration
are scattered all over the code, so clean this up and prepare the
ground for moving _sstables_format from the storage_service onto
the sstables_manager (not this set).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are cases when it is useful to delete specific table from a
snapshot.
An example is when a snapshot is used for backup. Backup can take a long
period of time, during that time, each of the tables can be deleted once
it was backup without waiting for the entire backup process to
completed.
This patch adds such an option to the database and to the storage_service
wrapping method that calls it.
If a table is specified a filter function is created that filter only
the column family with that given name.
This is similar to the filtering at the keyspace level.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
" from Botond
Nodetool scrub rewrites all sstables, validating their data. If corrupt
data is found the scrub is aborted. If the skip-corrupted flag is set,
corrupt data is instead logged (just the keys) and skipped.
The scrubbing algorithm itself is fairly simple, especially that we
already have a mutation stream validator that we can use to validate the
data. However currently scrub is piggy-backed on top of cleanup
compaction. To implement this flag, we have to make scrub a separate
compaction type and propagate down the flag. This required some
massaging of the code:
* Add support for more than two (cleanup or not) compaction types.
* Allow passing custom options for each compaction type.
* Allow stopping a compaction without the manager retrying it later.
Additionally the validator itself needed some changes to allow different
ways to handle errors, as needed by the scrub.
Fixes: #5487
* https://github.com/denesb/nodetool-scrub-skip-corrupted/v7:
table: cleanup_sstables(): only short-circuit on actual cleanup
compaction: compaction_type: add Upgrade
compaction: introduce compaction_options
compaction: compaction_descriptor: use compaction options instead of
cleanup flag
compaction_manager: collect all cleanup related logic in
perform_cleanup()
sstables: compaction_stop_exception: add retry flag
mutation_fragment_stream_validator: split into low-level and
high-level API
compaction: introduce scrub_compaction
compaction_manager: scrub: don't piggy-back on upgrade_sstables()
test: sstable_datafile_test: add scrub unit test
Now that we have the necessary infrastructure to do actual scrubbing,
don't rely on `upgrade_sstables()` anymore behind the scenes, instead do
an actual scrub.
Also, use the skip-corrupted flag.
The list_snapshot API, uses http stream to stream the result to the
caller.
It needs to keep all objects and stream alive until the stream is closed.
This patch adds do_with to hold these objects during the lifetime of the
function.
Fixes#5752
get_snapshot should use http stream to reduce memory allocation and
stalls.
This patch change the implementation so it would stream each of the
snapshot object instead of creating a single response and return it.
Fixes#5468
Depends on scylladb/seastar#723
In storage_service's snapshot code there are checks for
_operation_mode being _not_ JOINING to proceed. The intention
is apparently to allow for snapshots only after the cluster
join. However, here's how the start-up code looks like
- _operation_mode = STARTING in storage_service::constructor
- snapshot API registered in api::set_server_storage_service
- _operation_mode = JOINING in storage_service::join_token_ring
So in between steps 2 and 3 snapshots can be taken.
Although there's a quick and simple fix for that (check for the
_operation_mode to be not STARTING either) I think it's better
to register the snapshot API later instead. This will help
greatly to de-bload the storage_service, in particular -- to
incapsulate the _operation_mode properly.
Note, though the check for _operation_mode is made only for
taking snapshot, I move all snapshot ops registration to the
later phase.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparation for the next patch -- the lambda in
question (and the used type) will be needed in two
functions, so make the lambda a "real" function.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a lonely get_load_map() call on storage_service that
needs only load broadcaster, always runs on shard 0 and that's it.
Next patch will move this whole stuff into its own helper no-shard
container and this is preparation for this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The option in question apparently does not work, several sharded objects
are start()-ed (and thus instanciated) in join_roken_ring, while instances
themselves of these objects are used during init of other stuff.
This leads to broken seastar local_is_initialized assertion on sys_dist_ks,
but reading the code shows more examples, e.g. the auth_service is started
on join, but is used for thrift and cql servers initialization.
The suggestion is to remove the option instead of fixing. The is_joined
logic is kept since on-start joining still can take some time and it's safer
to report real status from the API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191203140717.14521-1-xemul@scylladb.com>
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
Assembles information and attributes of sstables in one or more
column families.
v2:
* Use (not really legal) nested "type" in json
* Rename "table" param to "cf" for consistency
* Some comments on data sizes
* Stream result to avoid huge string allocations on final json
ignore_ready_future in load_new_ss_tables broke
migration_test:TestMigration_with_*.migrate_sstable_with_counter_test_expect_fail dtests.
The java.io.NotSerializableException in nodetool was caused by exceptions that
were too long.
This fix prints the problematic file names onto the node system log
and includes the casue in the resulting exception so to provide the user
with information about the nature of the error.
Fixes#4375
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190331154006.12808-1-bhalevy@scylladb.com>
Fixes#4245
Implemented as a compation barrier (forcing previous compactions to
finish) + parameterized "cleanup", with sstable list based on
parameters.
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
This patch changes how list of tokens returned from the storage_service
API.
Instead of create a vector and construct a json object of it, use the
streaming capabilities of the http.
This is important for large cluster and prevent large allocations.
Fixes#3701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180820195631.26792-1-amnon@scylladb.com>
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes#3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series adds an API to return the active repairs by their IDs.
After this series a call to:
curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/"
Will return an array with the ids of the active repairs.
Fixes#3193"
* 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev:
API: Add get active repair api
repair: Add a get_active_repairs function to return the active repair
The token_to_endpoint map can get big that trying to convert it to a
vector will cause large allocation warning.
This patch replace the implementation, so the return json array will be
created directly from the map by using stream_range_as_array helper
function.
Fixes#3185
Message-Id: <20180207153306.30921-1-amnon@scylladb.com>
Returning token_endpoints when there are many tokens and end points can
take a long time.
This patch uses output stream to return the result.
Instead of returning a vector, it uses the streaming functionality in
json layer.
Fixes#2476
Message-Id: <20180103081907.5175-1-amnon@scylladb.com>
This patch change the implementation of storage_service
repair_async_status to throw an exception, this way a 400 return code
will be returned.
Fixes#2794
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917080533.6612-1-amnon@scylladb.com>
This patch implements the missing API to terminate all repairs.
For example:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/storage_service/force_terminate_repair"
With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.
Fixes#2105
The api /storage_service/force_terminate is supposed to be
/storage_service/force_terminate_repair.
scylla-jmx uses /storage_service/force_terminate api.
So instead of renaming it, it is better to add a new name for it.
Following C* API there are two APIs for getting the load from
storage_service:
/storage_service/metrics/load
/storage_service/load
This patch adds the implementation for
/storage_service/metrics/load
The alternative, is to drop on of the API and modify the JMX
implementation to use the same API.
Fixes#2245
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170401181520.19506-1-amnon@scylladb.com>
Commit d41cd48a made the is_joined() method a future<bool> because
only cpu 0 knows its real value. This makes this function inconvenient
to use. So this patch reverts commit d41cd48a, and instead sets this
flag's value on all shards, so each shard can read its value locally
(and immediately).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161228160450.5831-1-nyh@scylladb.com>
- Change an exception type thrown by a tracing::tracing::set_trace_probability()
to make it different from the one thrown by an std::stod() when it fails to
parse a given string.
- Catch the std::out_of_range exception thrown by a tracing::tracing::set_trace_probability() and
wrap the exception string into the httpd::bad_param_exception() object.
- Throw a httpd::bad_param_exception() with a
"Bad format in a probability value: <a user given probability string value>"
message if std::invalid_argument is caught.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465300738-1557-1-git-send-email-vladz@cloudius-systems.com>
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.
If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
They are used by nodetool removenode:
$ nodetool removenode force
$ nodetool removenode status
For example:
$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
$ nodetool removenode force
RemovalStatus: No token removals in process.
Tested with:
1)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
2)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
will never send the replication confirmation to node1
- run nodetool removenode force on node1
nodetool removenode completes with the following error:
$ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
nodetool: Scylla API server HTTP POST to URL
'/storage_service/remove_node' failed: nodetool removenode force is called by user
nodetool removenode force completes sucessfully
$ nodetool removenode force
RemovalStatus: Removing token (-9171569494049085776). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
Fixes 1135.
'nodetool enable/disablebackup' callback was modifying only the
existing keyspaces and column families configurations.
However new keyspaces/column families were using
the original 'incremental_backups' configuration value which could
be different from the value configured by 'nodetool enable/disablebackup'
user command.
This patch updates the database::_enable_incremental_backups per-shard
value in addition to updating the existing keyspaces and column families
configurations.
Fixes#845
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>