Commit Graph

496 Commits

Author SHA1 Message Date
Amnon Heiman
4498bb0a48 API: Fix aggregation in column_family
A few methods in the column_family API were doing the aggregation wrong,
most notably bloom filter disk size.

The issue is not always visible; it only shows up when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007
2021-02-08 12:11:30 +02:00
Piotr Sarna
d395305ddd api: fix retrieving replied RPC messages
The API call referred to a nonexistent callback,
which is now renamed to better match the API path
and actually implemented.

Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>
2021-02-03 09:42:17 +02:00
Wojciech Mitros
a1f93e4297 api: use a list instead of a vector to remove a large allocation in api handler
Follow-up to #7917

The size of a cf::column_family_info is 224 bytes, so an std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Since the vector is only ever iterated over, it can be replaced with a
non-contiguous list instead.
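
As a rough illustration of the change (hypothetical types; the real
cf::column_family_info has more fields):

    #include <list>
    #include <string>

    struct column_family_info {          // hypothetical stand-in, ~224 bytes in the real code
        std::string ks_name;
        std::string cf_name;
    };

    // Before: one contiguous allocation that grows with the number of tables.
    // std::vector<column_family_info> infos;   // thousands of tables -> >1MB allocation

    // After: per-node allocations, still perfectly fine for iteration-only use.
    std::list<column_family_info> infos;

    int count_in_keyspace(const std::list<column_family_info>& infos, const std::string& ks) {
        int n = 0;
        for (const auto& i : infos) {    // iteration is all the handler needs
            if (i.ks_name == ks) {
                ++n;
            }
        }
        return n;
    }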

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7973
2021-01-27 16:02:07 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assuming 1TB of sstables per node, 40MB/s per shard, and 32 shards per node,
each node can finish loading and streaming its 1TB of data in about 13
minutes:

1 TB / (40 MB/s per shard * 32 shards) ~ 780 s ~ 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

=== Notes ===

Note that with the old nodetool refresh, the node will not pick up data
that does not belong to it, but it will not delete that data either. One
has to run nodetool cleanup to remove it manually, which is a surprise to
me and probably to users as well. With load and stream, the process
deletes the sstables once it finishes streaming, so no nodetool cleanup
is needed.

The name of this feature, load and stream, follows load and store in the CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Asias He
4d32d03172 storage_service: Introduce load_and_stream
=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assuming 1TB of sstables per node, 40MB/s per shard, and 32 shards per node,
each node can finish loading and streaming its 1TB of data in about 13
minutes:

1 TB / (40 MB/s per shard * 32 shards) ~ 780 s ~ 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

=== Notes ===

Note that with the old nodetool refresh, the node will not pick up data
that does not belong to it, but it will not delete that data either. One
has to run nodetool cleanup to remove it manually, which is a surprise to
me and probably to users as well. With load and stream, the process
deletes the sstables once it finishes streaming, so no nodetool cleanup
is needed.

The name of this feature, load and stream, follows load and store in the CPU world.

Fixes #7831
2021-01-18 16:32:33 +08:00
Wojciech Mitros
93613e20a3 api: remove potential large allocation in /column_family/ GET request handler
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
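
A minimal sketch of the chunking idea, using std::ostream as a stand-in for
the HTTP output stream (the actual Seastar API differs):

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    struct cf_info { std::string ks; std::string name; };

    // Instead of building one big string for the whole reply, hand the server a
    // writer that emits the reply piece by piece.
    using body_writer = std::function<void(std::ostream&)>;

    body_writer make_cf_list_writer(std::vector<cf_info> cfs) {
        return [cfs = std::move(cfs)](std::ostream& out) {
            out << "[";
            for (std::size_t i = 0; i < cfs.size(); ++i) {
                if (i) { out << ","; }
                // each element is written as its own small chunk
                out << "{\"ks\":\"" << cfs[i].ks << "\",\"cf\":\"" << cfs[i].name << "\"}";
            }
            out << "]";
        };
    }

    int main() {
        auto w = make_cf_list_writer({{"ks1", "t1"}, {"ks1", "t2"}});
        w(std::cout);   // the real handler streams to the HTTP connection instead
    }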

Fixes #7916

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7917
2021-01-13 12:04:18 +02:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently, removenode works as follows:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the range it owns

- Existing nodes send notification to the coordinator

- The coordinator node waits for the notifications and announces the node in
  REMOVED_TOKEN status

Current problems:

- Existing nodes do not tell the coordinator if the data sync is ok or failed.

- The coordinator cannot abort the removenode operation in case of error

- A failed removenode operation leaves the node being removed in
  REMOVING_TOKEN status forever.

- removenode runs in best-effort mode, which may cause data
  consistency issues.

  If a node that owns the range after the removenode operation is down
  during the operation, the removenode operation will still succeed
  without requiring that node to perform data syncing. This can cause
  data consistency issues.

  For example, with five nodes in the cluster and RF = 3, suppose that for a
  range n1, n2, n3 are the old replicas and n2 is being removed, so after the
  removenode operation the new replicas are n1, n5, n3. If n3 is down during
  the removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example RESTful API call:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense for
other nodes to continue syncing because the whole operation will
fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best-effort mode; users had to modify the
config file or run a RESTful API command on all the nodes to select strict
or best-effort mode. With this patch, that cluster-wide configuration is
eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Piotr Wojtczak
c09ab3b869 api: Add cardinality to toppartitions results
This change enhances the toppartitions API to also return
the cardinality of the read and write sample sets. It now uses
the size() method of the space_saving_top_k class, counting the unique
operations in the sampled set, up to the given capacity.
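
A simplified, hypothetical sketch of the Space-Saving structure behind
this: size() is just the number of distinct keys currently tracked, i.e.
the observed cardinality capped at the capacity.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>

    class space_saving_top_k_sketch {       // simplified stand-in, not the real class
        std::unordered_map<std::string, unsigned long> _counts;
        std::size_t _capacity;
    public:
        explicit space_saving_top_k_sketch(std::size_t capacity) : _capacity(capacity) {}

        void append(const std::string& key) {
            auto it = _counts.find(key);
            if (it != _counts.end()) {
                ++it->second;
                return;
            }
            if (_counts.size() < _capacity) {
                _counts.emplace(key, 1);
                return;
            }
            // Replace the minimum-count entry and inherit its count (Space-Saving rule).
            auto min_it = std::min_element(_counts.begin(), _counts.end(),
                    [](const auto& a, const auto& b) { return a.second < b.second; });
            auto count = min_it->second + 1;
            _counts.erase(min_it);
            _counts.emplace(key, count);
        }

        // Number of distinct keys sampled, up to the capacity -- the "cardinality"
        // value now exposed by the toppartitions API.
        std::size_t size() const { return _counts.size(); }
    };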

Fixes #4089
Closes #7766
2020-12-08 09:38:59 +01:00
Asias He
0a3a2a82e1 api: Add force_remove_endpoint for gossip
It is used to force remove a node from gossip membership if something
goes wrong.

Note: run the force_remove_endpoint API at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming
back, because nodes that have not run the force_remove_endpoint API
command can gossip the removed node's information to other nodes for up
to 2 * ring_delay (2 * 30 seconds by default).

For instance, in a 3-node cluster where node 3 is decommissioned, to remove
node 3 from gossip membership before the automatic removal (3 days by
default), run the API command on both node 1 and node 2 at the same time.

$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"

Then run 'nodetool gossipinfo' on all the nodes to verify that the removed
nodes are no longer present.

Fixes #2134

Closes #5436
2020-11-29 13:58:46 +02:00
Tomasz Grabiec
d3a5814f4f api: Connect nodetool resetlocalschema to schema version recalculation
It doesn't really do what the nodetool command is documented to do,
which is to truncate local schema tables, but it is still an
improvement.

Message-Id: <1605740190-30332-1-git-send-email-tgrabiec@scylladb.com>
2020-11-19 13:55:09 +02:00
Piotr Dulikowski
0fd36e2579 api: allow changing hinted handoff configuration
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.

To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if hints are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through the HTTP API.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
6465dd160b storage_proxy: fix wrong return type in swagger
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation yet, it was
supposed to return a list of strings with the DC names for which generating
hints is enabled, not a list of string pairs. That return type is what
the JMX layer expects.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
cefe5214ff config: plug in hints::host_filter object into configuration
Uses db::hints::host_filter as the type of the hinted_handoff_enabled
configuration option.

Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.

Now, hinted_handoff_enabled has the type db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
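
As a hedged sketch of what plugging a custom type into a string-based
config parser looks like (the real db::hints::host_filter is more
elaborate; the names here are illustrative):

    #include <istream>
    #include <sstream>
    #include <string>
    #include <unordered_set>

    struct host_filter_sketch {                  // illustrative only
        bool enabled_for_all = false;
        std::unordered_set<std::string> dcs;     // empty + !enabled_for_all => disabled
    };

    // With an extraction operator the config framework can construct the option
    // directly from its string value, so no separate post-processing pass is
    // needed at startup.
    std::istream& operator>>(std::istream& in, host_filter_sketch& f) {
        std::string token;
        in >> token;
        if (token == "true") {
            f.enabled_for_all = true;
        } else if (token != "false") {
            std::stringstream ss(token);
            std::string dc;
            while (std::getline(ss, dc, ',')) {
                f.dcs.insert(dc);                // e.g. "dc1,dc2" enables hints for those DCs
            }
        }
        return in;
    }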
2020-11-17 10:24:42 +01:00
Benny Halevy
29ed59f8c4 main: start a shared_token_metadata
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Avi Kivity
82f79c0077 api: column_family: don't capture structured bindings in lambdas
Clang does not yet implement P1091R3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
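
An illustrative reproduction of the limitation and the workaround
(hypothetical names):

    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> sizes{{"cf1", 224}};
        for (auto& entry : sizes) {              // no structured binding here
            // With `auto& [name, size] = entry;` Clang (pre-P1091R3) rejects `[&name] { ... }`.
            auto& name = entry.first;            // plain reference, capturable everywhere
            auto f = [&name] { return name.size(); };
            (void)f();
        }
    }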
2020-10-16 15:25:05 +03:00
Amnon Heiman
48c3c94aa6 api/storage_service.cc: Add the get_range_to_endpoint_map
The get_range_to_endpoint_map method takes a keyspace and returns a map
between the token ranges and their endpoints.

It is used by some external tools for repair.

Token ranges are encoded as size-2 arrays; if the start or end is empty, it
is added as an empty string.

The implementation uses get_range_to_address_map and re-packs it
accordingly.

stream_range_as_array is used to reduce the risk of large
allocations and stalls.

Relates to scylladb/scylla-jmx#36

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #7329
2020-10-08 12:09:09 +03:00
Avi Kivity
907b775523 Merge "Free compaction from storage service" from Pavel E
"
There's one last call to the global storage service left in the compaction
code; it comes from cleanup_compaction to get local token ranges for filtering.

The call in question is a pure wrapper over database, so this patch set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).

tests: unit(dev), nodetool upgradesstables
"

* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
  storage_service: Remove get_local_ranges helper
  compaction: Use database from options to get local ranges
  compaction: Keep database reference on upgrade options
  compaction: Keep database reference on cleanup options
  db: Factor out get_local_ranges helper
2020-08-23 17:58:32 +03:00
Avi Kivity
0dcb16c061 Merge "Constify access to token_metadata" from Benny
"
We keep references to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.

Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure the token_metadata object they are traversing won't change
mid-loop.

This series is a first step toward serializing updates to the shared
token metadata with respect to readers.

Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"

* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
  api/http_context: keep a const sharded<locator::token_metadata>&
  gossiper: keep a const token_metadata&
  storage_service: separate get_mutable_token_metadata
  range_streamer: keep a const token_metadata&
  storage_proxy: delete unused get_restricted_ranges declaration
  storage_proxy: keep a const token_metadata&
  storage_proxy: get rid of mutable get_token_metadata getter
  database: keep const token_metadata&
  database: keyspace_metadata: pass const locator::token_metadata& around
  everywhere_replication_strategy: move methods out of line
  replication_strategy: keep a const token_metadata&
  abstract_replication_strategy: get_ranges: accept const token_metadata&
  token_metadata: rename calculate_pending_ranges to update_pending_ranges
  token_metadata: mark const methods
  token_ranges: pending_endpoints_for: return empty vector if keyspace not found
  token_ranges: get_pending_ranges: return empty vector if keyspace not found
  token_ranges: get rid of unused get_pending_ranges variant
  replication_strategy: calculate_natural_endpoints: make token_metadata& param const
  token_metadata: add get_datacenter_racks() const variant
2020-08-22 20:47:45 +03:00
Pavel Emelyanov
8333fed8aa compaction: Keep database reference on upgrade options
The only place that creates them is the API upgrade_sstables call.

The created options object doesn't outlive the returned
future, so it's safe to keep this reference there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Benny Halevy
436babdb3d api/http_context: keep a const sharded<locator::token_metadata>&
It has no need to change token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Pavel Emelyanov
285648620b repair: Keep sharded messaging service reference on repair_info
This reference comes from the API that already has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
8b4820b520 repair: Keep sharded messaging service in API
The reference will be needed in repair_start, so prepare one in advance

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
126dac8ad1 repair: Unset API endpoints on stop
This is the roll-back counterpart of the corresponding _set-s. The messaging
service will be (already is, but implicitly) used in repair API
callbacks, so make sure they are unset before the messaging service
is stopped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
fe2c479c04 repair: Setup API endpoints in separate helper
The unset part will come soon; this is the preparation. No functional
changes in api/storage_server.cc, just code movement.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
b895c2971a api: Use local reference to messaging_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
d477bd562d api: Unregister messaging endpoints on stop
The API is one of the subsystems that work with the messaging service. To keep
the dependencies correct, the related API endpoints should be stopped before
the messaging service stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the `count` function was often used in various ways.

`contains` not only expresses the intent of the code better but also
does so in a more uniform way.

This commit replaces all such occurrences of `count` with
`contains`.
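
A before/after example of the replaced pattern:

    #include <unordered_set>

    int main() {
        std::unordered_set<int> live_shards{0, 1, 2};
        // before: relies on count() returning 0 or 1 for sets/maps
        bool old_way = live_shards.count(1);
        // after (C++20): states the intent directly
        bool new_way = live_shards.contains(1);
        return old_way == new_way ? 0 : 1;
    }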

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Avi Kivity
3530e80ce1 Merge "Support md format" from Benny
"
This series adds support for the "md" sstable format.

Support is based on the following:

* do not use clustering-based filtering in the presence
  of a static row or tombstones.
* Disabling min/max column names in the metadata for
  formats older than "md".
* When updating the metadata, reset and disable min/max
  in the presence of range tombstones (like Cassandra does
  and until we process them accurately).
* Fix the way we maintain min/max column names by:
  keeping whole clustering key prefixes as min/max
  rather than calculating min/max independently for
  each component, like Cassandra does in the "md" format.

Fixes #4442

Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"

* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
  config: enable_sstables_md_format by default
  test: cql_query_test: add test_clustering_filtering unit tests
  table: filter_sstable_for_reader: allow clustering filtering md-format sstables
  table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
  table: filter_sstable_for_reader: adjust to md-format
  table: filter_sstable_for_reader: include non-scylla sstables with tombstones
  table: filter_sstable_for_reader: do not filter if static column is requested
  table: filter_sstable_for_reader: refactor clustering filtering conditional expression
  features: add MD_SSTABLE_FORMAT cluster feature
  config: add enable_sstables_md_format
  database: add set_format_by_config
  test: sstable_3_x_test: test both mc and md versions
  test: Add support for the "md" format
  sstables: mx/writer: use version from sstable for write calls
  sstables: mx/writer: update_min_max_components for partition tombstone
  sstables: metadata_collector: support min_max_components for range tombstones
  sstable: validate_min_max_metadata: drop outdated logic
  sstables: rename mc folder to mx
  sstables: may_contain_rows: always true for old formats
  sstables: add may_contain_rows
  ...
2020-08-11 13:29:11 +03:00
Piotr Jastrzebski
80e3923b3c codebase wide: replace find(...) != end() with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:

<collection>.find(<element>) != <collection>.end()

In C++20 the same can be expressed with:

<collection>.contains(<element>)

This is not only more concise but also expresses the intent of the code
more clearly.

This commit replaces all occurrences of the old pattern with the new
approach.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
2020-08-11 13:28:50 +03:00
Pekka Enberg
a37eaaa022 sstables: Add support for the "md" format enum value
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Asias He
271fac56a3 repair: Add synchronous API to query repair status
This new API blocks until the repair job either finishes, fails, or times out.

E.g.,

- Without timeout
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

- With timeout
curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5"

The timeout is in seconds.

The current asynchronous api returns immediately even if the repair is in progress.

E.g., curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

Users can use the new synchronous API to avoid repeatedly polling to
check whether the repair job is finished.

Fixes #6445
2020-07-14 11:20:15 +03:00
Amnon Heiman
186301aff8 per table metrics: change estimated_histogram to time_estimated_histogram
This patch changes the per-table latency histograms: read, write,
cas_prepare, cas_accept, and cas_learn.

Besides changing the definition type and the insertion method, the API
was changed to support the new metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-07-14 11:17:43 +03:00
Asias He
07e253542d compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix this, we futurize the function that gets local ranges and sstables.

In addition, this patch removes the dependency on the global storage_service object.

Fixes #6662
2020-07-01 15:03:50 +08:00
Pavel Emelyanov
d0d2da6ccb api: Remove excessive capture
The "result" in this lambda is already not used and can be removed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:59 +03:00
Pavel Emelyanov
4f5ffa980d api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:59 +03:00
Pavel Emelyanov
d99969e0e0 api: Fix wrongly captured map of snapshots
The result of get_snapshot_details() is saved in do_with, then
captured by reference in the json callback; the do_with's
future then returns, so by the time the callback is called the map has
already been freed and is empty.

Fix by capturing the result directly in the callback.
Fixes the recently merged b6086526.
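
The same class of lifetime bug in plain C++, as a hedged sketch (the real
code uses seastar's do_with and futures, but the dangling-capture pattern
is the same):

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    std::function<void()> make_callback_buggy() {
        std::map<std::string, int> details{{"snap1", 3}};
        return [&details] {                      // dangling: details dies when this function returns
            std::cout << details.size() << "\n";
        };
    }

    std::function<void()> make_callback_fixed() {
        std::map<std::string, int> details{{"snap1", 3}};
        return [details = std::move(details)] {  // owns the data: capture by value/move
            std::cout << details.size() << "\n";
        };
    }

    int main() {
        make_callback_fixed()();                 // prints 1
        // make_callback_buggy()();              // undefined behaviour, like the bug described above
    }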

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:21 +03:00
Pavel Emelyanov
d674baacef snapshot: Move all code into db::snapshot_ctl class
This includes
- rename namespace in snapshot-ctl.[cc|hh]
- move methods from storage_service to snapshot_ctl
- move snapshot_details struct
- temporarily make storage_service._snapshot_lock and ._snapshot_ops public
- replace two get_local_storage_service() occurrences with this._db

The latter is not 100% clear as the code that does this references "this"
from another shard, but the _db in question is the distributed object, so
they are all the same on all instances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 19:59:53 +03:00
Pavel Emelyanov
d989d9c1c7 snapshots: Initial skeleton
A placeholder for snapshotting code that will be moved into it
from the storage_service.

Also -- pass it through the API for future use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 19:54:14 +03:00
Pavel Emelyanov
9a8a1635b7 snapshots: Properly shutdown API endpoints
Now that seastar httpd routes can be unset(), we
can shut down individual API endpoints. Do this for
snapshot calls; this makes stopping the snapshot controller
safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 17:27:45 +03:00
Pavel Emelyanov
b608652622 api: Rewrap set_server_snapshot lambda
The lambda calls the core snapshot method deep inside the
json marshalling callback. This will cause problems with
stopping the snapshot controller in the next patches.

To prepare for this, call .get_snapshot_details()
first, then keep the result in a do_with() context. This
change doesn't affect the issue the lambda in question is
meant to solve, as the whole result set is kept in
memory anyway while being streamed out.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 17:27:45 +03:00
Avi Kivity
e5be3352cf database, streaming, messaging: drop streaming memtables
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.

Remove this code.
2020-06-25 15:25:54 +02:00
Glauber Costa
bb07678346 api: do not allow user to meddle with auto compaction too early
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.

There should be no reason for a user to change this setting this
early, so we disallow it.

To do that properly, it is better if the check of whether or not
the storage service is ready to accommodate this request is local
to the storage service itself. We therefore move the logic of set_tables_autocompaction
from the API to the storage service. The API layer now merely translates
the table names and passes them along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:25 -04:00
Nadav Har'El
86a4dfcd29 merge: api: Command to check and repair cdc streams
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:

The command regenerates streams when:

    generations corresponding to a gossiped timestamp cannot be
    fetched from system_distributed table,
    or when generation token ranges do not align with token metadata.

In such cases the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether the streams needed regeneration or not.

Fixes #6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
2020-06-15 14:17:35 +03:00
Avi Kivity
d17b05e911 Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon
"
This series adds a pseudo-floating-point histogram implementation.
The histogram backs time_estimated_histogram, a histogram for latency tracking, which is then used in storage_proxy as a more efficient, higher-resolution histogram.

A follow-up series will use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes #5815
Fixes #4746
"

* amnonh-quicker_estimated_histogram:
  storage_proxy: use time_estimated_histogram for latencies
  test/boost/estimated_histogram_test
  utils/histogram_metrics_helper Adding histogram converter
  utils/estimated_histogram: Adding approx_exponential_histogram
2020-06-15 10:19:36 +03:00
Amnon Heiman
6e1f042b93 storage_proxy: use time_estimated_histogram for latencies
This patch changes storage_proxy to use time_estimated_histogram.

Besides the type, it changes how values are inserted and how the
histogram is used by the API.
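
A rough, hypothetical sketch of the bucketing such a pseudo-floating-point
histogram uses: each power-of-two range is split into four linear
sub-buckets, matching the upper bounds visible in the sample below
(640, 768, 896, 1024, 1280, ...). The constants and names are assumptions
for illustration:

    #include <cstdint>
    #include <cstdio>

    constexpr int subbuckets = 4;           // assumed; inferred from the sample output
    constexpr uint64_t min_value = 512;     // assumed lowest tracked value

    int bucket_index(uint64_t v) {          // maps a latency value to its bucket
        if (v <= min_value) {
            return 0;
        }
        int octave = 63 - __builtin_clzll(v);               // floor(log2(v))
        uint64_t base = uint64_t(1) << octave;              // start of the power-of-two range
        int sub = int(((v - base) * subbuckets) / base);    // 0..3 within the range
        int min_octave = 63 - __builtin_clzll(min_value);
        return (octave - min_octave) * subbuckets + sub;
    }

    int main() {
        std::printf("%d %d %d\n", bucket_index(600), bucket_index(1500), bucket_index(3000));
    }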

An example of how a metric looks after the change:
scylla_storage_proxy_coordinator_write_latency_bucket{le="640.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="896.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1024.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1280.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1792.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2048.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2560.000000",scheduling_group_name="statement",shard="0",type="histogram"} 3
scylla_storage_proxy_coordinator_write_latency_bucket{le="3072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="3584.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="4096.000000",scheduling_group_name="statement",shard="0",type="histogram"} 7
scylla_storage_proxy_coordinator_write_latency_bucket{le="5120.000000",scheduling_group_name="statement",shard="0",type="histogram"} 8
scylla_storage_proxy_coordinator_write_latency_bucket{le="6144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9
scylla_storage_proxy_coordinator_write_latency_bucket{le="7168.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="8192.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="10240.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19
scylla_storage_proxy_coordinator_write_latency_bucket{le="12288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 49
scylla_storage_proxy_coordinator_write_latency_bucket{le="14336.000000",scheduling_group_name="statement",shard="0",type="histogram"} 132
scylla_storage_proxy_coordinator_write_latency_bucket{le="16384.000000",scheduling_group_name="statement",shard="0",type="histogram"} 294
scylla_storage_proxy_coordinator_write_latency_bucket{le="20480.000000",scheduling_group_name="statement",shard="0",type="histogram"} 1035
scylla_storage_proxy_coordinator_write_latency_bucket{le="24576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2790
scylla_storage_proxy_coordinator_write_latency_bucket{le="28672.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5788
scylla_storage_proxy_coordinator_write_latency_bucket{le="32768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9815
scylla_storage_proxy_coordinator_write_latency_bucket{le="40960.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19821
scylla_storage_proxy_coordinator_write_latency_bucket{le="49152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 30063
scylla_storage_proxy_coordinator_write_latency_bucket{le="57344.000000",scheduling_group_name="statement",shard="0",type="histogram"} 38642
scylla_storage_proxy_coordinator_write_latency_bucket{le="65536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 44987
scylla_storage_proxy_coordinator_write_latency_bucket{le="81920.000000",scheduling_group_name="statement",shard="0",type="histogram"} 51821
scylla_storage_proxy_coordinator_write_latency_bucket{le="98304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 54197
scylla_storage_proxy_coordinator_write_latency_bucket{le="114688.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55054
scylla_storage_proxy_coordinator_write_latency_bucket{le="131072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55363
scylla_storage_proxy_coordinator_write_latency_bucket{le="163840.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55520
scylla_storage_proxy_coordinator_write_latency_bucket{le="196608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55545
scylla_storage_proxy_coordinator_write_latency_bucket{le="229376.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="262144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="327680.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="393216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="458752.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="524288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="655360.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="786432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="917504.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1048576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1310720.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1572864.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1835008.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2097152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2621440.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3145728.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3670016.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="4194304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="5242880.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="6291456.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="7340032.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="8388608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="10485760.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="12582912.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="14680064.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="16777216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="20971520.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="25165824.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="29360128.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="33554432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="+Inf",scheduling_group_name="statement",shard="0",type="histogram"} 55549

Fixes #4746

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:23:02 +03:00
Pavel Emelyanov
a1df24621c thrift_controller: Switch on standalone
Remove the on-storage_service instance and make everybody use
the standalone one.

Stopping the Thrift server is done by registering the controller in
the client service shutdown hooks. This automatically wires the
stopping into the drain, decommission and isolation code paths.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
c26943e7b5 thrift_controller: Pass one through management API
The goal is to make the relevant endpoints work on the standalone
thrift controller instead of the storage_service's one, so
prepare this controller (a dummy for now) and pass it all the
way down to the API code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
1d5cdfe3c6 cql_controller: Switch on standalone
Remove the on-storage_service instance and make everybody use
the standalone one.

Stopping the server is done by registering the controller in
the client service shutdown hooks. This automatically wires the
stopping into the drain, decommission and isolation code paths.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:09 +03:00
Pavel Emelyanov
7ebe44f33d cql_controller: Pass one through management API
The goal is to make the relevant endpoints work on the standalone
cql controller instead of the storage_service's one, so
prepare this controller (a dummy for now) and pass it all the
way down to the API code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:09 +03:00
Pavel Emelyanov
6a89c987e4 api: Tune reg/unreg of client services control endpoints
Currently, API endpoints to start and stop cql_server and thrift
are registered right after the storage service is started, but
much earlier than those services themselves. In between these two
points a lot of other state gets initialized. This opens a small
window during which cql_server and thrift can be started by
hand too early.

The most obvious problem is that storage_service::join_cluster()
may not have been called yet, so the auth service is not started,
but starting cql/thrift needs auth.

Another problem is that those endpoints are not unregistered on stop,
creating another way to start cql/thrift at the wrong time.

Also, the endpoint registration change helps further patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 18:47:24 +03:00