Perform offstrategy compaction via the REST API with
a new `keyspace_offstrategy_compaction` option.
This is useful for performing offstrategy compaction
post repair, after repairing all token ranges.
Otherwise, offstrategy compaction will only be
auto-triggered after a 5 minutes idle timeout.
Like major compaction, the api call returns the offstrategy
compaction task future, so it's waited on.
The `long` result counts the number of tables that required
offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow stopping compaction by type on a given keyspace
and list of tables.
Add respective rest_api test.
Fixes#9700
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In definition for /column_family/major_compaction/{name} there is an
incorrect use of "bool" instead of "boolean".
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#9516
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.
To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against using the also
suggested `--dry-mode` flag as I feel that a new mode is a more natural
choice, we don't have to define how it interacts with all the other
modes, unlike with a `--dry-mode` flag.
Fixes: #7736
Tests: unit(dev), manual(REST API)
"
* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
compaction/compaction_descriptor: add comment to Validation compaction type
compaction/compaction_descriptor: compaction_options: remove validate
api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
compaction/compaction_manager: hide perform_sstable_validation()
compaction: validation compaction -> scrub compaction (validate mode)
compaction/compaction_descriptor: compaction_options: add options() accessor
compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
Adds HTTP endpoints for manipulating hint sync points:
- /hinted_handoff/sync_point (POST) - creates a new sync point for
hints towards nodes listed in the `target_hosts` parameter
- /hinted_handoff/sync_point (GET) - checks the status of the sync
point. If a non-zero `timeout` parameter is given, it waits until the
sync point is reached or the timeout expires.
Note: I tried adding the option and calling it "skip_flush"
but I couldn't make it work with scylla-jmx, hence it's
called by the abbreviated name - "sf".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add direct support to the newly added scrub mode enum. Instead of the
legacy `skip_corrupted` flag, one can now select the desired mode from
the mode enum. `skip_corrupted` is still supported for backwards
compatibility but it is ignored when the mode enum is set.
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.
Fixes#4520Closes#7864
In some cases, user may want to repair the cluster, ignoring the node
that is down. For example, run repair before run removenode operation to
remove a dead node.
Currently, repair will ignore the dead node and keep running repair
without the dead node but report the repair is partial and report the
repair is failed. It is hard to tell if the repair is failed only due to
the dead node is not present or some other errors.
In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It will be much simpler, if one can
just specify the node to exclude explicitly.
In addition, we support ignore nodes option in other node operations
like removenode. This change makes the interface to ignore a node
explicitly more consistent.
Refs: #7806Closes#8233
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.
Unlike lsa/compact, doesn't cause reactor stalls.
The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.
Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
=== Introduction ===
This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.
From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.
This can make restores and migrations much easier.
=== Performance ===
I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB
Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.
1TB / 40 MB per shard * 32 shard / 60 s = 13 mins
=== Tests ===
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.
=== Usage ===
curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true
=== Notes ===
Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.
The name of this feature load and stream follows load and store in CPU world.
Fixes#7831
Currently removenode works like below:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn the node in REMOVING_TOKEN status
- Existing nodes sync data for the range it owns
- Existing nodes send notification to the coordinator
- The coordinator node waits for notification and announce the node in
REMOVED_TOKEN
Current problems:
- Existing nodes do not tell the coordinator if the data sync is ok or failed.
- The coordinator can not abort the removenode operation in case of error
- Failed removenode operation will make the node to be removed in
REMOVING_TOKEN forever.
- The removenode runs in best effort mode which may cause data
consistency issues.
It means if a node that owns the range after the removenode
operation is down during the operation, the removenode node operation
will continue to succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example, Five nodes in the cluster, RF = 3, for a range, n1, n2,
n3 is the old replicas, n2 is being removed, after the removenode
operation, the new replicas are n1, n5, n3. If n3 is down during the
removenode operation, only n1 will be used to sync data with the new
owner n5. This will break QUORUM read consistency if n1 happens to
miss some writes.
Improvements in this patch:
- This patch makes the removenode safe by default.
We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.
If the user want the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
- The coordinator can abort data sync on existing nodes
For example, if one of the nodes fails to sync data. It makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes
Previously, there is no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users will have to modify
config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
Fixes#7359Closes#7626
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.
Fixes#4089Closes#7766
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes come back.
Becasue nodes without running the force_remove_endpoint api cmd can
gossip around the removed node information to other nodes in 2 *
ring_delay (2 * 30 seconds by default) time.
For instance, in a 3 nodes cluster, node 3 is decommissioned, to remove
node 3 from gossip membership prior the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.
Fixes#2134Closes#5436
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation, yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. Such return type is
expected by the JMX.
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
The patch implements:
- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint
Those APIs allow to control and request the status of background
compaction jobs for the existing tables.
The implementation introduces the table::_compaction_disabled_by_user.
Then the CompactionManager checks if it can push the background
compaction job for the corresponding table.
New members
===
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
$ ninja build/dev/test/boost/sstable_datafile_test
$ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000
The test tries to submit a compaction job after playing
with autocompaction control table switch. However, there is
no reliable way to hook pending compaction task. The code
assumed that with_scheduling_group() closure will never
preempt execution of the stats check.
Revert
===
Reverts commit c8247ac. In previous version the execution
sometimes resulted into the following error:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
This version adds a few sstables to the cf, starts
the compaction and awaits until it is finished.
API change
===
- `/column_family/autocompaction/` always returned `true` while answering to the question: if the autocompaction disabled (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). now it answers to the question: if the autocompaction for specific table is enabled. The question logic is inverted. The patch to the JMX is required. However, the change is decent because all old values were invalid (it always reported all compactions are disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table
Fixes
===
Fixes#1488Fixes#1808Fixes#440
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
This reverts commit 1c444b7e1e. The test
it adds sometimes fails as follows:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
Ivan is working on a fix, but let's revert this commit to avoid blocking
next promotion failing from time to time.
This patch adds API endpoint /column_family/autocompaction/{name}
that listen to GET and POST requests to pick and control table
background compactions.
To implement that the patch introduces "_compaction_disabled_by_user"
flag that affects if CompactionManager is allowed to push background
compactions jobs into the work.
It introduces
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
to control auto compaction state.
Fixes#1488Fixes#1808Fixes#440
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
This implements support for triggering major compations through the REST
API. Please note that "split_output" is not supported and Glauber Costa
confirmed this this is fine:
"We don't support splits, nor do I think we should."
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Simple REST API for error injection is implemented.
The API allow the following operations:
* injecting an error at given injection name
* listing injections
* disabling an injection
* disabling all injections
Currently the API enables/disables on all shards.
Closes#3295
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Presently lightweight transactions piggy back the old
row value on prepare round response. If one of the participants
did not provide the old value or the values from peers don't match,
we perform a full read round which will repair the Paxos table and the
base table, if necessary, at all participants.
Capture the fact that read optimization has failed in a metric.
Message-Id: <20200304192955.84208-2-kostja@scylladb.com>
In swagger 1.2 int is defined as int32.
We originally used int following the jmx definition, in practice
internally we use uint and int64 in many places.
While the API format the type correctly, an external system that uses
swagger-based code generator can face a type issue problem.
This patch replace all use of int in a return type with long that is defined as int64.
Changing the return type, have no impact on the system, but it does help
external systems that use code generator from swagger.
Fixes#5347
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Assembles information and attributes of sstables in one or more
column families.
v2:
* Use (not really legal) nested "type" in json
* Rename "table" param to "cf" for consistency
* Some comments on data sizes
* Stream result to avoid huge string allocations on final json
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.
Without this patch, listsnapshots is broken in all versions.
Fixes: #3845
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This mean that the swagger will contain an entry with description and
type for each of the config value.
To get the v2 of the swager file:
http://localhost:10000/v2
If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"
* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
Explanation about the API V2
API: add the config API as part of the v2 API.
Defining the config api
The config API is created dynamically from the config. This mean that
the swagger definition file will contain the description and types based on the
configuration.
The config.json file is used by the code generator to define a path that is
used to register the handler function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In Swagger 2.0 all the API is exported as a single file.
The header part of the file, contains general information. It is stored
as an external file so it will be easy to modify when needed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>