Commit Graph

106 Commits

Author SHA1 Message Date
Taras Veretilnyk
fa98239ed8 rest_api: add endpoint which drops all quarantined sstables
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.

Fixes scylladb/scylladb#19061
2025-07-28 16:55:17 +02:00
Gleb Natapov
c8ce9d1c60 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.
2025-07-07 10:46:03 +03:00
Robert Bindar
ca1a9c8d01 Add support for nodetool refresh --skip-reshape
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24409
Fixes: #24365
2025-06-10 12:52:13 +03:00
Robert Bindar
c570941692 Add nodetool refresh --scope option
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.

The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23861
2025-05-29 16:12:09 +03:00
Pavel Emelyanov
1b1f653699 api: Introduce skip_cleanup query parameter
Just copy the load_and_stream and primary_replica_only logic, this new
option is the same in this sense.

Throw if it's specified with the load_and_stream one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-05-13 17:06:28 +03:00
Michał Chojnowski
a19d6d95f7 api: add the estimate_compression_ratios API call
Add an API call which estimates the effectiveness of possible
compression config changes.

This can be used to make an informed decision about whether to
change the compression method, without actually recompressing
any SSTables.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
58ae278d10 api: add the retrain_dict API call
Add an API call which will retrain the SSTable compression dictionary
for a given table.

Currently, it needs all nodes to be alive to succeed. We can relax this later.
2025-04-01 00:07:29 +02:00
Asias He
3f59a89e85 repair: Fix return type for storage_service/tablets/repair API
The API returns the repair task UUID. For example:

{"tablet_task_id":"3597e990-dc4f-11ef-b961-95d5ead302a7"}

Fixes #23032

Closes scylladb/scylladb#23050
2025-02-27 12:38:12 +02:00
Asias He
5545289bfa repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417
2025-02-14 09:13:11 +01:00
Asias He
fb318d0c81 repair: Add await_completion option for tablet_repair api
Set true to wait for the repair to complete. Set false to skip waiting
for the repair to complete. When the option is not provided, it defaults
to false.

It is useful for management tool that wants the api to be async.

Fixes #22418

Closes scylladb/scylladb#22436
2025-02-11 12:49:12 +02:00
Kefu Chai
8080658df7 backup_task: remove a component once it is uploaded
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).

This change

- adds an optional query parameter named "move_files" to
  "/storage_service/backup" API. if it is set to "true", SSTable
  components are removed once they are backed up to object storage.
- conditionally removes SSTable components from the snapshot directory once
  they are successfully uploaded to the target location. This prevents
  re-uploading the same files and reduces disk usage.

This change only "Refs" #20655, because, we can move further optimize
the backup process, consider:

- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.

Fixes #21799
Refs #20655

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-22 11:17:01 +08:00
Łukasz Paszkowski
adef719c43 api/storage_service: Remove unimplemented truncate API
The API /storage_service/truncate/{ks} returns an unimplemented
error when invoked. As we already have a CQL command,
`TRUNCATE TABLE ks.cf` that causes the table to be truncated on all
nodes, the API can be dropped. Due to the error, it is unused.

Fixes https://github.com/scylladb/scylladb/issues/10520

No backport is required. A small cleanup of not working API.

Closes scylladb/scylladb#22258
2025-01-17 11:21:05 +03:00
Pavel Emelyanov
a24dc02255 api: New "scope" API param to load-and-stream calls
There are two of those -- the POST /storage_service/keyspace that loads
and streams new sstables from /upload and POST /storage_service/restore
that does the same, but gets sstables from object store.

The new optional parameter allow users to tun the streaming phase
behavior. The test/pylib client part is also updated here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:28:05 +03:00
Asias He
844129227e repair: Add restful API for tablet repair
It allows user to add and del a tablet repair request. The request is
executed by the tablet repair scheduler.
2024-11-20 09:42:41 +08:00
Kefu Chai
787ea4b1d4 treewide: accept list of sstables in "restore" API
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.

in order to fill the gap, in this change:

* add the "prefix" parameter for specifying the shared prefix of
  sstables
* add the "sstables" parameter for specifying the list of  TOC
  components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
  on scylla's end anymore.
* make the "table" parameter mandatory.

Fixes scylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-01 23:24:56 +08:00
Kefu Chai
d663b6c13b treewide: add "table" parameter to "backup" API
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.

in this change:

* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
  `snapshot_ctl::start_backup`. since we have all the information
  for composing the snapshot directory, and what the `backup_task_impl`
  class is interested is but the snapshot directory, we just pass
  the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
  only scan the specified snapshot directory.

Fixes scylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-25 09:11:26 +08:00
Ernest Zaslavsky
924325fd25 treewide: add "prefix" parameter to backup API
Allow the caller to pass the prefix when performing backup and restore

Fixes scylladb/scylladb#20335

Closes scylladb/scylladb#20413
2024-09-18 08:25:00 +03:00
Lakshmi Narayanan Sreethar
84d06a13c7 api: compaction: add consider_only_existing_data option
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:

- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
  checking the SSTables being compacted. Memtables, commit logs, and
  other SSTables not part of the compaction will not be checked, as they
  will only contain newer data that arrived after the compaction
  started.

The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.

The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.

Fixes #19728

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Pavel Emelyanov
11a04bfb66 code: Introduce restore API method
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-28 15:42:49 +03:00
Pavel Emelyanov
a812f13ddd code: Introduce backup API method
The method starts a task that uploads all files from the given
keyspace's snapshot to the requested endpoint/bucket. The task runs in
the background, its task_id is returned from the method once it's
spawned and it should be used via /task_manager API to track the task
execution and completion (hint: it's good to have non-zero TTL value to
make sure fast backups don't finish before the caller manages to call
wait_task API).

If snapshot doesn't exist, nothing happens (FIXME, need to return back
an error in that case).

If endpoint is not configured locally, the API call resolves with
bad-request instantly.

Sstables components are scanned for all tables in the keyspace and are
uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.

Task is not abortable (FIXME -- to be added) and doesn't really report
its progress other than running/done state (FIXME -- to be added too).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Benny Halevy
0419b1d522 nodetool: rebuild: add force option
To be used to force usage of source_dc, even
when it is unsafe for rebuild.

Update docs and add test/nodetool/test_rebuild.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:20:12 +03:00
Kefu Chai
c75442bc2a api: s/rpc/thrift/
replace all occurrences of "rpc" in function names and debugging
messages to "thrift", as "rpc" is way too general, and since we
are removing "thrift" support, let's take this opportunity to
use a more specific name.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-07 09:23:10 +08:00
Kefu Chai
ad649be1bf treewide: drop thrift support
thrift support was deprecated since ScyllaDB 5.2

> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.

so let's drop it. in this change,

* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is
  preserved for backward compatibility, as we could load
  from an existing system.local table which still contains
  this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for
  backward compatibility with java-based nodetool.
* `rpc_port` and `start_rpc` options are preserved, but
  they are marked as "Unused". so that the new release
  of scylladb can consume existing scylla.yaml configurations
  which might contain these settings. by making them
  deprecated, user will be able get warned, and update
  their configurations before we actually remove them
  in the next major release.

Fixes #3811
Fixes #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-07 06:44:59 +08:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Botond Dénes
fd25bb6f9f api/storage_service: add tablet support for /storage_service/tokens_endpoint
Add a keyspace and cf parameter. When specified, the endpoint will
return token -> primary replica mapping for the table's tablet tokens,
not the vnodes.
2024-05-13 07:09:20 -04:00
Pavel Emelyanov
8bad828208 api: Add method to delete replica from tablet
Copied from the add_replica counterpart

TODO: Generalize common parts of move_tablet and add_|del_tablet_replica

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Pavel Emelyanov
79ad760e95 api: Add method to add replica to a tablet
The new API submits rebuild transition with new replicas set to be old
(current) replicas plus the provided one. It looks and acts like the
move_tablet API call with several changes:

- lacks the "source" replica argument
- submits "rebuild" transition kind
- cross racks checks are not performed

The 'force' argument is inherited from move_tablet, but is unused now
and is left for future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-04 09:22:16 +03:00
Benny Halevy
530d270828 api: /storage_service/tablets/balancing: fix incorrect operation summary
It was probably copy-pasted from /storage_service/tablets/move

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#17811
2024-03-14 22:52:57 +01:00
Patryk Wrobel
51da80da7d storage_service/ownership/{keyspace}: handle requests when tablets are enabled
Before this change, when user tried to utilize
'storage_service/ownership/{keyspace}' API with
keyspace parameter that uses tablets, then internal
error was thrown. The code was calling a function,
that is intended for vnodes: get_vnode_effective_replication_map().

This commit introduces graceful handling of such scenario and
extends the API to allow passing 'cf' parameter that denotes
table name.

Now, when keyspace uses tablets and cf parameter is not passed
a descriptive error message is returned via BAD_REQUEST.
Users cannot query ownership for keyspace that uses tablets,
but they can query ownership for a table in a given keyspace that uses tablets.

Also, new tests have been added to test/rest_api/test_storage_service.py and
to test/topology_experimental_raft/test_tablets.py in order to verify the behavior
with and without tablets enabled.

Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-03-11 09:52:23 +01:00
Botond Dénes
050c6dcad7 api: storage_service/keyspaces: add replication filter
To allow to filter the returned keyspaces based by the replication they
use: tablets or vnodes.
The filter can be disabled by omitting the parameter or passing "all".
The default is "all".

Fixes: #16509

Closes scylladb/scylladb#17319
2024-02-20 09:04:41 +01:00
Patryk Wrobel
3842bf18a7 storage_service/range_to_endpoint_map: allow API to properly handle tablets
This API endpoint was failing when tablets were enabled
because of usage of get_vnode_effective_replication_map().
Moreover, it was providing an error message that was not
user-friendly.

This change extends the handler to properly service the incoming requests.
Furthermore, it introduces two new test cases that verify the behavior of
storage_service/range_to_endpoint_map API. It also adjusts the test case
of this endpoint for vnodes to succeed when tablets are enabled by default.

The new logic is as follows:
 - when tablets are disabled then users may query endpoints
   for a keyspace or for a given table in a keyspace
 - when tablets are enabled then users have to provide
   table name, because effective replication map is per-table

When user does not provide table name when tablets are enabled
for a given keyspace, then BAD_REQUEST is returned with a
meaningful error message.

Fixes: scylladb#17343

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#17372
2024-02-18 19:21:53 +02:00
Kamil Braun
e9e24f47ec Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.

Upgrade is triggered by sending `POST
/storage_service/raft_topology/upgrade` request. This causes the
topology coordinator to start who drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, topology
coordinator runs normally.

Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.

The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, perform `nodetool
removenode` on the dead nodes, clean up some state on the nodes and
restart them so that they automatically rebuild the group 0. Raft
topology fits into existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding the group 0, upgrade
needs to be triggered again.

Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. In order to allow it, nodes
now ask the cluster about the mode they should use to join before
proceeding by using a new `JOIN_NODE_QUERY` RPC.

The procedure is explained in more detail in `topology-over-raft.md`.

Fixes: https://github.com/scylladb/scylladb/issues/15008

Closes scylladb/scylladb#17077

* github.com:scylladb/scylladb:
  test/topology_custom: upgrade/recovery tests for topology on raft
  cdc/generation_service: in legacy mode, fall back to raft tables
  system_keyspace: add read_cdc_generation_opt
  cdc/generation_service: turn off gossip notifications in raft topo mode
  cql_test_env: move raft_topology_change_enabled var earlier
  group0_state_machine: pull snapshot after raft topology feature enabled
  storage_service: disable persistent feature enabler on upgrade
  storage_service: replicate raft features to system.peers
  storage_service: gossip tokens and cdc generation in raft topology mode
  API: add api for triggering and monitoring topology-on-raft upgrade
  storage_service: infer which topology operations to use on startup
  storage_service: set the topology kind value based on group 0 state
  raft_group0: expose link to the upgrade doc in the header
  feature_service: fall back to checking legacy features on startup
  storage_service: add fiber for tracking the topology upgrade progress
  gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
  topology_coordinator: implement core upgrade logic
  topology_coordinator: extract top-level error handling logic
  storage_service: initialize discovery leader's state earlier
  topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
  topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
  topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
  topology_state_machine: introduce upgrade_state
  storage_service: disallow topology ops when upgrade is in progress
  raft_group0_client: add in_recovery method
  storage_service: introduce join_node_query verb
  raft_group0: make discover_group0 public
  raft_group0: filter current node's IP in discover_group0
  raft_group0: remove my_id arg from discover_group0
  storage_service: make _raft_topology_change_enabled more advanced
  docs: document raft topology upgrade and recovery
2024-02-09 11:54:53 +01:00
Kefu Chai
c1c96bbc16 api/storage_service: drop /storage_service/describe_ring/ API
per its description, "`/storage_service/describe_ring/`" returns the
token ranges of an arbitrary keyspace. actually, it returns the
first keyspace which is of non-local-vnode-based-strategy. this API
is not used by nodetool, neither is it exercised in dtest.
scylla-manager has a wrapper for this API though, but that wrapper
is not used anywhere.

in this change, this API is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17197
2024-02-09 12:49:21 +02:00
Piotr Dulikowski
a672383c2a API: add api for triggering and monitoring topology-on-raft upgrade
Implements the /storage_service/raft_topology/upgrade route. The route
supports two methods: POST, which triggers the cluster-wide upgrade to
topology-on-raft, and GET which reports the status of the upgrade.
2024-02-08 19:12:28 +01:00
Botond Dénes
35da9551fb Merge 'storage_service: Add describe_ring support for tablet table' from Asias He
The table query param is added to get the describe_ring result for a
given table.

Both vnode table and tablet table can use this table param, so it is
easier for users to user.

If the table param is not provided by user and the keyspace contains
tablet table, the request will be rejected.

E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"

Refs #16509

Closes scylladb/scylladb#17118

* github.com:scylladb/scylladb:
  tablets: Convert to use the new version of for_each_tablet
  storage_service: Add describe_ring support for tablet table
  storage_service: Mark host2ip as const
  tablets: Add for_each_tablet_gently
2024-02-07 10:41:36 +02:00
Asias He
04773bd1df storage_service: Add describe_ring support for tablet table
The table query param is added to get the describe_ring result for a
given table.

Both vnode table and tablet table can use this table param, so it is
easier for users to user.

If the table param is not provided by user and the keyspace contains
tablet table, the request will be rejected.

E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"

Refs #16509
2024-02-05 18:11:07 +08:00
Aleksandra Martyniuk
89c683f51a api: service: add force param to move_tablet api
Force flag is added to /storage_service/tablets/move. If force is set
to true, replication strategy constraints regarding racks and dcs can
be broken.
2024-02-02 19:08:01 +01:00
Botond Dénes
1a0300dba6 Merge 'compaction_manager: flush tables before cleanup' from Kefu Chai
according to the document "nodetool cleanup"

> Triggers removal of data that the node no longer owns

currently, scylla performs cleanup by rewriting the sstables. but
commitlog segments may still contain the mutations to the tables
which are dropped during sstable rewriting. when scylla server
restarts, the dirty mutations are replayed to the memtable. if
any of these dirty mutations changes the tables cleaned up. the
stale data are reapplied. this would lead to data resurrection.

so, in this change we following the same model of major compaction
where we

1. forcing new active segment,
2. flushing tables being cleaned up
3. perform cleanup using compaction

Fixes #4734

Closes scylladb/scylladb#16757

* github.com:scylladb/scylladb:
  storage_service: fall back to local cleanup in cleanup_all
  compaction: format flush_mode without the helper
  compaction_manager: flush all tables before cleanup
  replica: table: pass do_flush to table::perform_cleanup_compaction()
  api, compaction: promote flush_mode
2024-02-01 13:47:45 +02:00
Kefu Chai
4ec104e086 api: storage_service: correct a typo
s/a any keyspace/a given keyspace/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17098
2024-02-01 10:55:58 +02:00
Kefu Chai
5e0b3671d3 storage_service: fall back to local cleanup in cleanup_all
before this change, if no keyspaces are specified,
scylla-nodetool just enumerate all non-local keyspaces, and
call "/storage_service/keyspace_cleanup" on them one after another.
this is not quite efficient, as each this RESTful API call
force a new active commitlog segment, and flushes all tables.
so, if the target node of this command has N non-local keyspaces,
it would repeat the steps above for N times. this is not necessary.
and after a topology change, we would like to run a global
"nodetool cleanup" without specifying the keyspace, so this
is a typical use case which we do care about.

to address this performance issue, in this change, we improve
an existing RESTful API call "/storage_service/cleanup_all", so
if the topology coordinator is not enabled, we fall back to
a local cleanup to cleanup all non-local keyspaces.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-01 11:25:53 +08:00
Kefu Chai
0dbb0ed09f api: storage_service: correct a typo
s/trough/through/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16870
2024-01-19 10:21:41 +02:00
Gleb Natapov
97ab3f6622 storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
2024-01-14 15:45:53 +02:00
Kefu Chai
8c4576f55d api: storage_service: correct the descriptions of two APIs
this change is more about documentation of the RESTful API of
storage_service. as we define the API using Swagger 2.0 format, and
generate the API document from the definitions. so would be great
if the document matches with the API.

in this change, since the keyspace is not queried but mutated. so
changed to a more accurate description.

from the code perspective, it is but cosmetic. as we don't read the
description fields or verify them in our tests.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16637
2024-01-11 08:28:14 +02:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Yaniv Kaul
ae2ab6000a Typos: fix typos in code
Fixes some more typos as found by codespell run on the code.
In this commit, there are more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255
2023-12-05 15:18:11 +02:00
Benny Halevy
b12b142232 api: add /storage_service/compact
For major compacting all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Benny Halevy
1b576f358b api: add /storage_service/flush
For flushing all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00