Currently, removenode works as follows:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn about the node in REMOVING_TOKEN status
- Existing nodes sync data for the ranges they own
- Existing nodes send a notification to the coordinator
- The coordinator node waits for the notifications and announces the
node in REMOVED_TOKEN status
Current problems:
- Existing nodes do not tell the coordinator whether the data sync
succeeded or failed.
- The coordinator cannot abort the removenode operation in case of error.
- A failed removenode operation leaves the node to be removed in
REMOVING_TOKEN status forever.
- The removenode operation runs in best-effort mode, which may cause
data consistency issues.
That is, if a node that owns the range after the removenode
operation is down during the operation, the removenode operation
will still succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example, with five nodes in the cluster and RF = 3, for a given
range n1, n2, n3 are the old replicas and n2 is being removed; after the
removenode operation, the new replicas are n1, n5, n3. If n3 is down
during the removenode operation, only n1 will be used to sync data with
the new owner n5. This will break QUORUM read consistency if n1 happens
to miss some writes.
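A toy model of the scenario above (node names as in the example; this is an illustration, not Scylla code): counting how many of the new replicas still hold a given write shows how the best-effort mode can drop below the QUORUM threshold of 2.

```cpp
#include <cassert>
#include <set>
#include <string>

// Counts how many of a range's replicas hold a particular write.
// If the count falls below QUORUM (2 with RF=3), a QUORUM read can miss it.
int replicas_holding_write(const std::set<std::string>& holders,
                           const std::set<std::string>& replicas) {
    int n = 0;
    for (const auto& r : replicas) {
        n += static_cast<int>(holders.count(r));  // 0 or 1 per replica
    }
    return n;
}
```

In the example, the write lived on n2 and n3 (n1 missed it); n5 syncs only from n1 because n3 is down, so after removenode only n3 holds the write: 1 replica, below QUORUM.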
Improvements in this patch:
- This patch makes removenode safe by default.
We require all nodes in the cluster to participate in the removenode
operation and sync data if needed. We fail the removenode operation if
any of them is down or fails.
If the user wants the removenode operation to succeed even if some of
the nodes are not available, the user has to explicitly pass a list of
nodes that can be skipped for the operation:
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
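The strict-by-default check could be sketched as follows (function and parameter names are hypothetical, not Scylla's actual code): every cluster member must be alive unless it appears in the explicit ignore list, otherwise the operation is aborted up front.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of the strict removenode admission check:
// all members must participate unless explicitly listed as ignorable.
void check_removenode_participants(
        const std::vector<std::string>& members,
        const std::unordered_set<std::string>& alive,
        const std::unordered_set<std::string>& ignore_dead_nodes) {
    for (const auto& node : members) {
        if (ignore_dead_nodes.count(node)) {
            continue;  // the user explicitly allowed skipping this node
        }
        if (!alive.count(node)) {
            throw std::runtime_error("removenode aborted: node " + node +
                                     " is down and not in the ignore list");
        }
    }
}
```

With an empty ignore list, any dead node fails the operation; passing the node in --ignore-dead-nodes lets it proceed.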
- The coordinator can abort data sync on existing nodes.
For example, if one of the nodes fails to sync data, it makes no sense
for the other nodes to continue syncing, because the whole operation
will fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes.
Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best-effort mode. Users had to modify the
config file or run a RESTful API command on all the nodes to select
strict or best-effort mode. With this patch, that cluster-wide
configuration is eliminated.
Fixes #7359
Closes #7626
This change enhances the toppartitions API to also return
the cardinality of the read and write sample sets. It now uses
the size() method of the space_saving_top_k class, counting the unique
operations in the sampled set, up to the given capacity.
Fixes #4089
Closes #7766
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint API at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming
back. This is because nodes that have not run the force_remove_endpoint
API command can gossip the removed node's information around to other
nodes for 2 * ring_delay (2 * 30 seconds by default).
For instance, in a 3-node cluster where node 3 is decommissioned, to
remove node 3 from gossip membership prior to the automatic removal (3
days by default), run the API command on both node 1 and node 2 at the
same time:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check that the
removed node is no longer present.
Fixes #2134
Closes #5436
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. This return type is
expected by JMX.
Uses db::hints::host_filter as the type of the hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has the type db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate this, don't
use structured bindings for variables that are later
captured.
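An illustration of the workaround (variable names are just for the example): instead of capturing structured bindings, copy them into plain variables first.

```cpp
#include <cassert>
#include <utility>

// Clang without p1091r3 rejects capturing structured bindings in lambdas,
// so we destructure into plain variables before capturing.
int sum_via_lambda(std::pair<int, int> p) {
    // Rejected by such clang versions:
    //   auto [a, b] = p;
    //   auto f = [a, b] { return a + b; };
    auto a = p.first;   // plain variables capture portably
    auto b = p.second;
    auto f = [a, b] { return a + b; };
    return f();
}
```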
The get_range_to_endpoint_map method takes a keyspace and returns a map
between the token ranges and the endpoints.
It is used by some external tools for repair.
Token ranges are encoded as size-2 arrays; if the start or end is empty,
it is added as an empty string.
The implementation uses get_range_to_address_map and re-packs the result
accordingly.
The use of stream_range_as_array is to reduce the risk of large
allocations and stalls.
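The size-2-array encoding described above could be sketched like this (function name and use of std::optional are hypothetical, for illustration only):

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>

// Hypothetical sketch of the encoding: a token range becomes a size-2
// array of strings, with an empty (open) start or end encoded as "".
std::array<std::string, 2> encode_token_range(
        const std::optional<std::string>& start,
        const std::optional<std::string>& end) {
    return { start.value_or(""), end.value_or("") };
}
```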
Relates to scylladb/scylla-jmx#36
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes #7329
"
There's one last call to the global storage service left in the
compaction code; it comes from cleanup_compaction, to get local token
ranges for filtering.
The call in question is a pure wrapper over database, so this set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).
tests: unit(dev), nodetool upgradesstables
"
* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
storage_service: Remove get_local_ranges helper
compaction: Use database from options to get local ranges
compaction: Keep database reference on upgrade options
compaction: Keep database reference on cleanup options
db: Factor out get_local_ranges helper
"
We keep references to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.
Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure the token_metadata object they are traversing won't change
mid-loop.
This series is a first step in ensuring that updates to the shared
token metadata are serialized with respect to reading it.
Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"
* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
api/http_context: keep a const sharded<locator::token_metadata>&
gossiper: keep a const token_metadata&
storage_service: separate get_mutable_token_metadata
range_streamer: keep a const token_metadata&
storage_proxy: delete unused get_restricted_ranges declaration
storage_proxy: keep a const token_metadata&
storage_proxy: get rid of mutable get_token_metadata getter
database: keep const token_metadata&
database: keyspace_metadata: pass const locator::token_metadata& around
everywhere_replication_strategy: move methods out of line
replication_strategy: keep a const token_metadata&
abstract_replication_strategy: get_ranges: accept const token_metadata&
token_metadata: rename calculate_pending_ranges to update_pending_ranges
token_metadata: mark const methods
token_ranges: pending_endpoints_for: return empty vector if keyspace not found
token_ranges: get_pending_ranges: return empty vector if keyspace not found
token_ranges: get rid of unused get_pending_ranges variant
replication_strategy: calculate_natural_endpoints: make token_metadata& param const
token_metadata: add get_datacenter_racks() const variant
The only place that creates them is the API upgrade_sstables call.
The created options object doesn't outlive the returned
future, so it's safe to keep this reference there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the unset (roll-back) counterpart of the corresponding _set-s.
The messaging service will be (already is, but implicitly) used in
repair API callbacks, so make sure they are unset before the messaging
service is stopped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The unset part will come soon; this is the preparation. No functional
changes in api/storage_server.cc, just moving the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API is one of the subsystems that work with the messaging service.
To keep the dependencies correct, the related API stuff should be
stopped before the messaging service stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously,
the `count` function was often used for this in various ways.
`contains` not only expresses the intent of the code better but also
does so in a more unified way.
This commit replaces all the occurrences of `count` with
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
"
This series adds support for the "md" sstable format.
Support is based on the following:
* Do not use clustering-based filtering in the presence
of static rows or tombstones.
* Disable min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does,
until we process them accurately).
* Fix the way we maintain min/max column names by
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes #4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intent of the code
more clearly.
This commit replaces all the occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Besides changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following stall was seen during a cleanup operation:
scylla: Reactor stalled for 16262 ms on shard 4.
| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
| (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
| (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
| (inlined by) operator() at ./sstables/compaction_manager.cc:691
| (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
| (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158
To fix, we futurize the function to get local ranges and sstables.
In addition, this patch removes the dependency on the global
storage_service object.
Fixes #6662
The result of get_snapshot_details() is saved in do_with, then is
captured in the json callback by reference, then the do_with's
future returns, so by the time the callback is called the map is already
freed and empty.
Fix by capturing the result directly in the callback.
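The bug and fix follow a general C++ pattern, illustrated below outside the Seastar context (this is not the actual Scylla code): a callback capturing a local result by reference outlives it, while capturing by value keeps the data alive inside the closure.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Generic illustration of the use-after-free pattern: `details` plays the
// role of the do_with-held snapshot map, the returned std::function plays
// the role of the json callback that runs after the owning scope ends.
std::function<std::size_t()> make_callback(bool by_value) {
    std::map<std::string, int> details{{"snapshot1", 1}};
    if (by_value) {
        return [details] { return details.size(); };  // safe: copied in
    }
    return [&details] { return details.size(); };     // dangles on return
}
```

Only the by-value variant may be invoked after make_callback returns; the by-reference one is exactly the bug being fixed.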
Fixes recently merged b6086526.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes
- rename namespace in snapshot-ctl.[cc|hh]
- move methods from storage_service to snapshot_ctl
- move snapshot_details struct
- temporarily make storage_service._snapshot_lock and ._snapshot_ops public
- replace two get_local_storage_service() occurrences with this._db
The latter is not 100% clear as the code that does this references "this"
from another shard, but the _db in question is the distributed object, so
they are all the same on all instances.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A placeholder for snapshotting code that will be moved into it
from the storage_service.
Also -- pass it through the API for future use.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now, with the seastar httpd routes unset() at hand, we
can shut down individual API endpoints. Do this for
the snapshot calls; this will make stopping the snapshot
controller safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda calls the core snapshot method deep inside the
json marshalling callback. This will bring problems with
stopping the snapshot controller in the next patches.
To prepare for this -- call .get_snapshot_details()
first, then keep the result in the do_with() context. This
change doesn't affect the issue the lambda in question is
meant to solve, as the whole result set is anyway kept in
memory while being streamed outside.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.
There should be no reason why a user would want to set that up this
early, so we'll disallow it.
To do that properly, it is better if the check of whether or not
the storage service is ready to accommodate this request is local
to the storage service itself. We therefore move the logic of
set_tables_autocompaction from the API to the storage service. The API
layer now merely translates the table names and passes them along.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:
The command regenerates streams when:
generations corresponding to a gossiped timestamp cannot be
fetched from the system_distributed table,
or when generation token ranges do not align with token metadata.
In such cases the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether the streams needed regeneration or not.
Fixes #6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
"
This series adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram, a histogram for
latency tracking, which is then used in storage_proxy as a more
efficient, higher-resolution histogram.
A follow-up series will use the new histogram in other places in the
system and will add an implementation that supports lower values.
Fixes #5815
Fixes #4746
"
* amnonh-quicker_estimated_histogram:
storage_proxy: use time_estimated_histogram for latencies
test/boost/estimated_histogram_test
utils/histogram_metrics_helper Adding histogram converter
utils/estimated_histogram: Adding approx_exponential_histogram
Remove the on-storage_service instance and make everybody use
the standalone one.
Stopping thrift is done by registering the controller in the
client service shutdown hooks. This automatically wires the
stopping into the drain, decommission and isolation code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the relevant endpoints work on standalone
thrift controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Remove the on-storage_service instance and make everybody use
the standalone one.
Stopping the server is done by registering the controller in the
client service shutdown hooks. This automatically wires the
stopping into the drain, decommission and isolation code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the relevant endpoints work on standalone
cql controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, the API endpoints to start and stop cql_server and thrift
are registered right after the storage service is started, but
much earlier than those services themselves. In between these two
points a lot of other stuff gets initialized. This opens a small
window during which cql_server and thrift can be started by
hand too early.
The most obvious problem is -- storage_service::join_cluster()
may not yet have been called, so the auth service is not started, but
starting cql/thrift needs auth.
Another problem is that those endpoints are not unregistered on stop,
thus creating another way to start cql/thrift at the wrong time.
Also, the endpoint registration change helps further patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
The command regenerates streams when:
- generations corresponding to a gossiped timestamp cannot be
fetched from the `system_distributed` table,
- or when generation token ranges do not align with token metadata.
In such cases the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether the streams needed regeneration or not.
"
This series changes the describe_ring API to use an HTTP stream instead
of serializing the results and sending them as a single buffer.
While testing the change I hit a 4-year-old issue inside
service/storage_proxy.cc that causes a use-after-free, so I fixed it
along the way.
Fixes #6297
"
* amnonh-stream_describe_ring:
api/storage_service.cc: stream result of token_range
storage_service: get_range_to_address_map prevent use after free
The response of the get token range API can become big, which can cause
large allocations and stalls.
This patch replaces the implementation so that it streams the results
using the HTTP stream capabilities instead of serializing and sending
one big buffer.
Fixes #6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This series adds support for taking a snapshot of multiple tables.
Fixes #6333
* amnonh-snapshot_keyspace_table:
api/storage_service.cc: Snapshot, support multiple tables
service/storage_service: Take snapshot of multiple tables