For major compacting all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For flushing all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip flushing
of tables right before they are major-compacted.
This is useful to prevent creation of small sstables
due to excessive memtable flushing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
*) Problem:
We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.
*) Solution:
To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.
- No token range to repair is needed by the user. It will repair all token
ranges automatically.
- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.
- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.
*) Performance:
1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```
The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X
2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}
Same test as above except 5ms delay is added to simulate multiple dc
network latency:
The time to repair is reduced from 333s to 0.2s.
333.26758 s / 0.22625381s = 1472.98
3)
3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency
Before:
```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```
After:
```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```
The time to repair is reduced from 986s (16 minutes) to 1.6s
*) Summary
So, a more than 1000X difference is observed for this common usage of
system table repair procedure.
Fixes#16011
Refs #15159
Some tests may want to modify system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.
The API can also be called outside tests, it should not have any
observable effects unless the user modifies `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schemaFixes#15380Closes#15381
This patch adds the ranges_parallelism option to repair restful API.
Users can use this option to optionally specify the number of ranges
to repair in parallel per repair job to a smaller number than the Scylla
core calculated default max_repair_ranges_in_parallel.
Scylla manager can also use this option to provide more ranges (>N) in
a single repair job but only repairing N ranges_parallelism in parallel,
instead of providing N ranges in a repair job.
To make it safer, unlike the PR #4848, this patch does not allow user to
exceed the max_repair_ranges_in_parallel.
Fixes#4847
This patch changes the base path of the V2 of the API to be '/'. That
means that the v2 prefix will be part of the path definition.
Currently, it only affect the config API that is created from code.
The motivation for the change is for Swagger definitions that are read
from a file. Currently, when using the swagger-ui with a doc path set
to http://localhost:10000/v2 and reading the Swagger from a file swagger
ui will concatenate the path and look for
http://localhost:10000/v2/v2/{path}
Instead, the base path is now '/' and the /v2 prefix will be added by
each endpoint definition.
From the user perspective, there is no change in current functionality.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".
this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.
this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, neither does it enforce the type -- it just
tries it best to pretty print them as a tabular.
But the fields in CompositeData is typed, when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13834
Task ttl can be set with task manager test api, which is disabled
in release mode.
Move get_and_update_ttl from task manager test api to task manager
api, so that it can be used in release mode.
Closes#12894
Sometimes to debug some task manager module, we may want to inspect
the whole tree of descendants of some task.
To make it easier, an api call getting a list of statuses of the requested
task and all its descendants in BFS order is added.
The PR adds changes to task manager that allow more convenient integration with modules.
Introduced changes:
- adds internal flag in task::impl that allows user to filter too specific tasks
- renames `parent_data` to more appropriate name `task_info`
- creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager
- adds more flexible version of `make_task` method
Closes#11821
* github.com:scylladb/scylladb:
tasks: add alternative make_task method
tasks: rename parent_data to task_info and move it
tasks: move task_id to tasks/types.hh
tasks: add internal flag for task_manager::task::impl
The current summary of the operation is obscure.
It refers to a token in the ring and the endpoint associated with it,
while the operation uses a host_id to identify a whole node.
Instead, clarify the summary to refer to a node in the cluster,
consistent with the description for the host_id parameter.
Also, describe the effect the call has on the data the removed node
logically owned.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the api is inconsistent: requiring a uuid for the
host_id of the node to be removed, while the ignored nodes list
is given as comma-separated ip addresses.
Instead, support identifying the ignored_nodes either
by their host_id (uuid) or ip address.
Also, require all ignore_nodes to be of the same kind:
either UUIDs or ip addresses, as a mix of the 2 is likely
indicating a user error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is convenient to create many different tasks implementations
representing more and more specific parts of the operation in
a module. Presenting all of them through the api makes it cumbersome
for user to navigate and track, though.
Flag internal is added to task_manager::task::impl so that the tasks
could be filtered before they are sent to user.
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).
In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.
I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.
Fixes#11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783Closes#11384
* github.com:scylladb/scylladb:
test/alternator: insert test names into Scylla logs
rest api: add a new /system/log operation
alternator ttl: log warning if scan took too long.
alternator,ttl: allow sub-second TTL scanning period, for tests
test/alternator: skip fewer Alternator TTL tests
test/alternator: test Alternator TTL metrics
Add a new REST API operation, taking a log level and a message, and
printing it into the Scylla log.
This can be useful when a test wants to mark certain positions in the
log (e.g., to see which other log messages we get between the two
positions). An alternative way to achieve this could have been for the
test to write directly into the log file - but an on-disk log file is
only one of the logging options that Scylla support, and the approach
in this patch allows to add log message regardless of how Scylla keeps
the logs.
In motivation of this feature is that in the following patch the
test/alternator framework will add log messages when starting and
ending tests, which can help debug test failures.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test api that helps testing task manager api. It can be used
to simulate the operations that can happen on modules and theirs
task. Through the api user can: register and unregister the test
module and the tasks belonging to the module, and finish the tasks
with success or custom error.
The task manager api layer. It can be used to list the modules
registered in task_manager, list tasks belonging to the given
module, abort, wait for or retrieve a status of the given task.
For cases where we have very high values set to permissions_cache validity and
update interval (E.g.: 1 day), whenever a change to permissions is made it's
necessary to update scylla config and decrease these values, since waiting for
all this time to pass wouldn't be viable.
This patch adds an API for resetting the authorization cache so that changing
the config won't be mandatory for these cases.
Usage:
$ curl -X POST http://localhost:10000/authorization_cache/reset
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
Perform offstrategy compaction via the REST API with
a new `keyspace_offstrategy_compaction` option.
This is useful for performing offstrategy compaction
post repair, after repairing all token ranges.
Otherwise, offstrategy compaction will only be
auto-triggered after a 5 minutes idle timeout.
Like major compaction, the api call returns the offstrategy
compaction task future, so it's waited on.
The `long` result counts the number of tables that required
offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow stopping compaction by type on a given keyspace
and list of tables.
Add respective rest_api test.
Fixes#9700
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In definition for /column_family/major_compaction/{name} there is an
incorrect use of "bool" instead of "boolean".
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#9516
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.
To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against using the also
suggested `--dry-mode` flag as I feel that a new mode is a more natural
choice, we don't have to define how it interacts with all the other
modes, unlike with a `--dry-mode` flag.
Fixes: #7736
Tests: unit(dev), manual(REST API)
"
* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
compaction/compaction_descriptor: add comment to Validation compaction type
compaction/compaction_descriptor: compaction_options: remove validate
api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
compaction/compaction_manager: hide perform_sstable_validation()
compaction: validation compaction -> scrub compaction (validate mode)
compaction/compaction_descriptor: compaction_options: add options() accessor
compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
Adds HTTP endpoints for manipulating hint sync points:
- /hinted_handoff/sync_point (POST) - creates a new sync point for
hints towards nodes listed in the `target_hosts` parameter
- /hinted_handoff/sync_point (GET) - checks the status of the sync
point. If a non-zero `timeout` parameter is given, it waits until the
sync point is reached or the timeout expires.
Note: I tried adding the option and calling it "skip_flush"
but I couldn't make it work with scylla-jmx, hence it's
called by the abbreviated name - "sf".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add direct support to the newly added scrub mode enum. Instead of the
legacy `skip_corrupted` flag, one can now select the desired mode from
the mode enum. `skip_corrupted` is still supported for backwards
compatibility but it is ignored when the mode enum is set.
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.
Fixes#4520Closes#7864