If the get_snapshot_details() lambda throws, the output stream remains
unclosed, which is bad. Close it regardless of whether the lambda succeeds.
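A minimal synchronous sketch of the "close regardless" pattern. Seastar code would normally chain the close onto the future (for instance with `finally()`); `fake_output_stream` and `write_then_close` are hypothetical stand-ins, not Scylla's types.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an output stream that must always be closed.
struct fake_output_stream {
    bool closed = false;
    std::string buf;
    void write(const std::string& s) { buf += s; }
    void close() { closed = true; }
};

// Runs `body` against `out` and closes `out` whether or not `body` throws.
// An exception from `body` is rethrown only after the stream is closed.
void write_then_close(fake_output_stream& out,
                      const std::function<void(fake_output_stream&)>& body) {
    std::exception_ptr ex;
    try {
        body(out);
    } catch (...) {
        ex = std::current_exception();
    }
    out.close();
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```

The caller still sees the original exception, but never a leaked open stream.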
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a0c1552cea)
Otherwise close() may throw, which is what the next patch wants to
avoid.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1fd886608)
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is defined in the .cc file, while
the fmt formatter is in the header and is equipped with constexprs, so
that converting an address to a string is done at compile time as much
as possible.
Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18712
There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under the api/ directory and duplicate each other's code. This PR generalizes those handlers, places them next to each other, fixes a leak on stop and, as a nice side effect, lightens the database.hh header.
Closes scylladb/scylladb#18703
* github.com:scylladb/scylladb:
api,database: Move auto-compaction toggle guard
api: Move some table manipulation helpers from storage_service
api: Move table-related calls from storage_service domain
api: Reimplement some endpoints using existing helpers
api: Lost unset of tombstone-gc endpoints
Continuation of the previous patch -- helpers toggling tombstone_gc and
auto_compaction on tables should live in the same file that uses them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_service/(enable|disable)_(tombstone_gc|auto_compaction)
endpoints are not handled by the storage_service _service_ and should
rather live in the column_family/ domain, which is handled by replica::database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The (enable|disable)_(tombstone_gc|auto_compaction) endpoints living in
column_family domain can benefit from the helpers that do the same in
the storage_service domain. The "difference" is that c.f. endpoints do
it per-table, while s.s. ones operate on a vector of tables, so the
former is a corner case of the latter.
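A minimal sketch of the "per-table is a corner case of a vector of tables" idea, assuming a hypothetical `table_registry` keyed by "ks.cf" strings; the real helpers operate on replica::table objects.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-table flags; the real code toggles state on the table itself.
struct table_flags {
    bool auto_compaction = true;
    bool tombstone_gc = true;
};
using table_registry = std::map<std::string, table_flags>;

// Generic helper operating on a list of tables (storage_service style).
void set_auto_compaction(table_registry& tables,
                         const std::vector<std::string>& names, bool enabled) {
    for (const auto& n : names) {
        tables.at(n).auto_compaction = enabled;  // throws if the table is unknown
    }
}

// Per-table endpoint (column_family style): the one-element case.
void set_table_auto_compaction(table_registry& tables, const std::string& name,
                               bool enabled) {
    set_auto_compaction(tables, {name}, enabled);
}
```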
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a keyspace and cf parameter. When specified, the endpoint will
return token -> primary replica mapping for the table's tablet tokens,
not the vnodes.
The API req->param["name"] to access parameters in the path part of the
URL was buggy - it forgot to do URL decoding and the result of our use
of it in Scylla was bugs like #5883 - where special characters in certain
REST API requests got botched up (encoded by the client, then not
decoded by the server).
The solution is to replace all uses of req->param["name"] by the new
req->get_path_param("name"), which does the decoding correctly.
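To illustrate what the decoding step does, here is a standalone percent-decoder for a path segment. It is a hypothetical sketch, not Seastar's implementation of get_path_param(); it only shows why an encoded name such as `my%20table` must be decoded before it is used as a keyspace or table name.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Decodes %XX escapes in a URL path segment. Returns std::nullopt on a
// malformed escape. '+' is left as-is: it has no special meaning in paths.
std::optional<std::string> url_decode_path(const std::string& in) {
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') {
            out += in[i];
            continue;
        }
        if (i + 2 >= in.size()) {
            return std::nullopt;  // truncated escape such as "%4"
        }
        int hi = hex(in[i + 1]);
        int lo = hex(in[i + 2]);
        if (hi < 0 || lo < 0) {
            return std::nullopt;  // non-hex digits such as "%zz"
        }
        out += static_cast<char>(hi * 16 + lo);
        i += 2;
    }
    return out;
}
```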
Unfortunately we needed to change 104 (!) callers in this patch, but the
transformation is mostly mechanical and there are no functional changes in
this patch. Another set of changes was to pass req, rather than req->param,
to a few functions that want to get the path param.
This patch avoids the numerous deprecation warnings we had before, and
more importantly, it fixes #5883.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
in `set_repair()`, despite that the repair is performed asynchronously,
we check the options specified by client immediately, and throw
`std::runtime_error`, if any of them is not supported.
before this change, these unhandled exceptions are translated to HTTP
500 errors by the underlying HTTP router. but this is misleading, as
these errors are caused by the client, not the server. and the error
message is lost when performing the translation.
in this change, we handle the `runtime_error` and translate it
into `httpd::bad_param_exception`, so that the client gets
HTTP 400 (Bad Request) instead of HTTP 500 (Internal Server Error),
along with an informative error message.
for instance, if we apply repair with "small_table_optimization" enabled
on a keyspace with tablets enabled, we should get an HTTP 400 error
with "The small_table_optimization option is not supported for tablet repair"
as the body of the error. this would be much more helpful.
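A simplified sketch of the translation, assuming hypothetical names: `bad_param_error` stands in for `httpd::bad_param_exception`, and `http_status()` is a toy model of the router's exception-to-status mapping.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for httpd::bad_param_exception (reported as 400).
struct bad_param_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical option check: throws std::runtime_error for an unsupported
// combination, like the validation described above.
void check_repair_options(bool keyspace_uses_tablets, bool small_table_optimization) {
    if (keyspace_uses_tablets && small_table_optimization) {
        throw std::runtime_error(
            "The small_table_optimization option is not supported for tablet repair");
    }
}

// Handler: client-caused validation errors are rethrown as bad_param_error,
// keeping the original message for the response body.
void handle_repair(bool keyspace_uses_tablets, bool small_table_optimization) {
    try {
        check_repair_options(keyspace_uses_tablets, small_table_optimization);
    } catch (const std::runtime_error& e) {
        throw bad_param_error(e.what());
    }
    // ... the asynchronous repair would be started here ...
}

// Toy router: map exceptions to HTTP status codes.
int http_status(const std::function<void()>& handler, std::string& body) {
    try {
        handler();
        return 200;
    } catch (const bad_param_error& e) {
        body = e.what();
        return 400;
    } catch (...) {
        return 500;
    }
}
```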
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, `set_repair()` uses a lambda for handling
the client-side requests. and this works great. but the underlying
`repair_start()` throws if any of the given options is not sane.
and we don't handle any of these exceptions in `set_repair()`, so
from the client's point of view, it gets an HTTP 500 error code,
which implies an "Internal Server Error". but actually, we should
blame the client for the error, not the server.
so, to prepare the error handling, let's take the opportunity to
coroutinize the lambda handling the request, so that we can handle
the exception in a more elegant way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Walking the per-table snapshot directory without a lock is racy. There's
snapshot-ctl locking that's used to get db-wide snapshot details; it
should be used to get per-table snapshot details too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that they are collected in one place, and to facilitate the next patch,
which is going to use snapshot-ctl for the per-table API too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a database::get_snapshot_details() method that returns a collection of all snapshots for all ks.cf out there, and there are several *snapshot_details* auxiliary structures around it. This PR keeps only one "details" structure and cleans up the way it propagates from the database up to the respective API calls.
Closes scylladb/scylladb#18317
* github.com:scylladb/scylladb:
snapshot_ctl: Brush up true_snapshots_size() internals
snapshot_ctl: Remove unused details struct
snapshot_ctl: No double recoding of details
database,snapshots: Move database::snapshot_details into snapshot_ctl
database,snapshots: Make database::get_snapshot_details() return map, not vector
table,snapshots: Move table::snapshot_details into snapshot_ctl
Currently database::get_snapshot_details() returns a collection of
snapshots. The snapshot_ctl converts this collection into similarly
looking one with slightly different structures inside. The resulting
collection is converted one more time on the API layer into another
similarly looking map.
This patch removes the intermediate conversion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting container types, like vector, map,
optional and variant, using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, together with the changes adding fmt::formatter and
the changes using the ostream formatter explicitly, we are
able to drop the `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.
When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.
Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.
Normally, since the DOWN nodes to ignore already joined the ring, the
remaining nodes will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state for them,
which is loaded from system.peers.
Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.
The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.
But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.
The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`. If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).
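The last rule can be sketched as follows, assuming nodes are identified by plain strings and that `check_down_nodes_ignored` is a hypothetical helper; the real code works with the endpoint states learned in the gossip shadow round.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Every DOWN node other than the one being replaced must be listed in
// ignore_dead_nodes_for_replace; otherwise the replace is refused.
void check_down_nodes_ignored(const std::set<std::string>& down_nodes,
                              const std::string& replaced_node,
                              const std::set<std::string>& ignored) {
    for (const auto& n : down_nodes) {
        if (n != replaced_node && ignored.count(n) == 0) {
            throw std::runtime_error("node " + n + " is down but not ignored");
        }
    }
}
```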
Fixes scylladb/scylladb#15787
Closes scylladb/scylladb#15788
* github.com:scylladb/scylladb:
storage_service: join_token_ring: load ignored nodes state if replacing
storage_service: replacement_info: return ignore_nodes state
locator: host_id_or_endpoint: keep value as variant
gms: endpoint_state: add getters for host_id, dc_rack, and tokens
storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
gossiper: add_saved_endpoint: set dc and rack
gossiper: add_saved_endpoint: fixup indentation
gossiper: add_saved_endpoint: make host_id mandatory
gossiper: add load_endpoint_state
gossiper: start_gossiping: log local state
Copied from the add_replica counterpart
TODO: Generalize common parts of move_tablet and add_|del_tablet_replica
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rather than allowing both host_id and endpoint
to be kept, keep only one of them
and provide resolve functions that use the
token_metadata to resolve the host_id into
an inet_address or vice versa.
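A minimal sketch of a value that holds exactly one of the two identifiers, assuming simplified `host_id`/`inet_address` structs and a hypothetical `fake_token_metadata` with two lookup maps in place of the real token_metadata.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <variant>

struct host_id { std::string uuid; };
struct inet_address { std::string ip; };

// Hypothetical stand-in for token_metadata's id <-> address lookups.
struct fake_token_metadata {
    std::map<std::string, std::string> id_to_ip;
    std::map<std::string, std::string> ip_to_id;
};

// Holds exactly one of host_id or inet_address, never both.
class host_id_or_endpoint {
    std::variant<host_id, inet_address> _value;
public:
    explicit host_id_or_endpoint(host_id id) : _value(std::move(id)) {}
    explicit host_id_or_endpoint(inet_address ep) : _value(std::move(ep)) {}

    bool has_host_id() const { return std::holds_alternative<host_id>(_value); }

    // Returns the stored host_id, or looks it up from the stored address.
    host_id resolve_id(const fake_token_metadata& tm) const {
        if (const auto* id = std::get_if<host_id>(&_value)) {
            return *id;
        }
        return host_id{tm.ip_to_id.at(std::get<inet_address>(_value).ip)};
    }

    // Returns the stored address, or looks it up from the stored host_id.
    inet_address resolve_endpoint(const fake_token_metadata& tm) const {
        if (const auto* ep = std::get_if<inet_address>(&_value)) {
            return *ep;
        }
        return inet_address{tm.id_to_ip.at(std::get<host_id>(_value).uuid)};
    }
};
```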
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When altering RF for a keyspace, all tablets in this ks will get more replicas. Part of this process is rebuilding tablets onto new node(s). This PR extends the tablets transition code to support rebuilding a tablet on a new replica.
fixes: #18030
Closes scylladb/scylladb#18082
* github.com:scylladb/scylladb:
test: Check data presense as well
test: Test how tablets are copied between nodes
test: Add sanity test for tablet migration
api: Add method to add replica to a tablet
tablet: Make leaving replica optional
`database::find_column_family()` throws no_such_column_family
if an unknown ks.cf is fed to it. and we call into this function
without checking for the existence of ks.cf first. since
"/storage_service/tablets/move" is a public interface, we should
translate this error to a better http error.
in this change, we check for the existence of the given ks.cf, and
throw an exception so that it can be caught by seastar::httpd::routers,
and converted to an HTTP error.
Fixes #17198
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17217
The new API submits rebuild transition with new replicas set to be old
(current) replicas plus the provided one. It looks and acts like the
move_tablet API call with several changes:
- lacks the "source" replica argument
- submits "rebuild" transition kind
- cross-rack checks are not performed
The 'force' argument is inherited from move_tablet, but is unused now
and is left for future use.
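A sketch of how the target replica set could be computed, assuming a simplified `tablet_replica` type and a hypothetical `add_replica` helper. Rejecting a host that already holds a replica is an assumption of this sketch; as stated above, no cross-rack check is made.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical simplified replica: a host plus a shard on that host.
struct tablet_replica {
    std::string host;
    unsigned shard;
};

// New replica set for a "rebuild" transition: the current replicas plus the
// provided one. There is no "source" argument, since no replica leaves.
std::vector<tablet_replica> add_replica(const std::vector<tablet_replica>& current,
                                        const tablet_replica& added) {
    for (const auto& r : current) {
        if (r.host == added.host) {
            throw std::invalid_argument("tablet already has a replica on host " + added.host);
        }
    }
    std::vector<tablet_replica> next = current;
    next.push_back(added);
    return next;
}
```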
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Upgrading raft topology is an important api call
that should be logged.
When failed, it is also important to log the
exception to get better visibility into why
the call failed.
Indentation will be fixed in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This change introduces logic that is responsible
for checking whether tablets are enabled for any
keyspace when get_ownership() is invoked.
Without it, the result would be calculated
based solely on sorted_tokens(), which is
invalid for tablet keyspaces.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Before this change, when a user tried to use the
'storage_service/ownership/{keyspace}' API with
a keyspace parameter that uses tablets, an internal
error was thrown. The code was calling a function
intended for vnodes: get_vnode_effective_replication_map().
This commit introduces graceful handling of such a scenario and
extends the API to allow passing a 'cf' parameter that denotes
the table name.
Now, when the keyspace uses tablets and the cf parameter is not passed,
a descriptive error message is returned via BAD_REQUEST.
Users cannot query ownership for a keyspace that uses tablets,
but they can query ownership for a table in a given keyspace that uses tablets.
Also, new tests have been added to test/rest_api/test_storage_service.py and
to test/topology_experimental_raft/test_tablets.py in order to verify the behavior
with and without tablets enabled.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
To allow filtering the returned keyspaces based on the replication they
use: tablets or vnodes.
The filter can be disabled by omitting the parameter or passing "all".
The default is "all".
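A sketch of the filter, assuming hypothetical names (`replication_filter`, `keyspace_info`) and assuming the accepted values are spelled "all", "tablets" and "vnodes"; an empty string models the omitted parameter.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

enum class replication_filter { all, tablets, vnodes };

// An omitted parameter (empty string here) or "all" disables filtering.
std::optional<replication_filter> parse_replication_filter(const std::string& value) {
    if (value.empty() || value == "all") return replication_filter::all;
    if (value == "tablets") return replication_filter::tablets;
    if (value == "vnodes") return replication_filter::vnodes;
    return std::nullopt;  // unknown value: the caller would reject the request
}

struct keyspace_info {
    std::string name;
    bool uses_tablets;
};

std::vector<std::string> filter_keyspaces(const std::vector<keyspace_info>& keyspaces,
                                          replication_filter f) {
    std::vector<std::string> out;
    for (const auto& ks : keyspaces) {
        bool keep = f == replication_filter::all
                 || (f == replication_filter::tablets && ks.uses_tablets)
                 || (f == replication_filter::vnodes && !ks.uses_tablets);
        if (keep) {
            out.push_back(ks.name);
        }
    }
    return out;
}
```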
Fixes: #16509
Closes scylladb/scylladb#17319
This API endpoint was failing when tablets were enabled
because of usage of get_vnode_effective_replication_map().
Moreover, it was providing an error message that was not
user-friendly.
This change extends the handler to properly service the incoming requests.
Furthermore, it introduces two new test cases that verify the behavior of
storage_service/range_to_endpoint_map API. It also adjusts the test case
of this endpoint for vnodes to succeed when tablets are enabled by default.
The new logic is as follows:
- when tablets are disabled then users may query endpoints
for a keyspace or for a given table in a keyspace
- when tablets are enabled then users have to provide
table name, because effective replication map is per-table
When user does not provide table name when tablets are enabled
for a given keyspace, then BAD_REQUEST is returned with a
meaningful error message.
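The rule can be sketched as a small validation step; the function name and the error text are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns an error message to send with 400 Bad Request, or std::nullopt
// if the request can be served.
std::optional<std::string> check_replication_map_scope(
        bool keyspace_uses_tablets, const std::optional<std::string>& table) {
    if (keyspace_uses_tablets && !table) {
        // With tablets, the effective replication map is per-table.
        return "table name is required for a keyspace that uses tablets";
    }
    return std::nullopt;
}
```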
Fixes: scylladb#17343
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#17372
when we just want read access to `http_context`, there
is no need to use a non-const reference. so let's add the `const` specifier
to make this explicit. this should help with readability and
maintainability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17219
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.
Upgrade is triggered by sending a `POST
/storage_service/raft_topology/upgrade` request. This starts the
topology coordinator, which drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, the topology
coordinator runs normally.
Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.
The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, performing `nodetool
removenode` on the dead nodes, cleaning up some state on the nodes and
restarting them so that they automatically rebuild group 0. Raft
topology fits into the existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding group 0, the upgrade
needs to be triggered again.
Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. In order to allow it, nodes
now ask the cluster about the mode they should use to join before
proceeding by using a new `JOIN_NODE_QUERY` RPC.
The procedure is explained in more detail in `topology-over-raft.md`.
Fixes: https://github.com/scylladb/scylladb/issues/15008
Closes scylladb/scylladb#17077
* github.com:scylladb/scylladb:
test/topology_custom: upgrade/recovery tests for topology on raft
cdc/generation_service: in legacy mode, fall back to raft tables
system_keyspace: add read_cdc_generation_opt
cdc/generation_service: turn off gossip notifications in raft topo mode
cql_test_env: move raft_topology_change_enabled var earlier
group0_state_machine: pull snapshot after raft topology feature enabled
storage_service: disable persistent feature enabler on upgrade
storage_service: replicate raft features to system.peers
storage_service: gossip tokens and cdc generation in raft topology mode
API: add api for triggering and monitoring topology-on-raft upgrade
storage_service: infer which topology operations to use on startup
storage_service: set the topology kind value based on group 0 state
raft_group0: expose link to the upgrade doc in the header
feature_service: fall back to checking legacy features on startup
storage_service: add fiber for tracking the topology upgrade progress
gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
topology_coordinator: implement core upgrade logic
topology_coordinator: extract top-level error handling logic
storage_service: initialize discovery leader's state earlier
topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
topology_state_machine: introduce upgrade_state
storage_service: disallow topology ops when upgrade is in progress
raft_group0_client: add in_recovery method
storage_service: introduce join_node_query verb
raft_group0: make discover_group0 public
raft_group0: filter current node's IP in discover_group0
raft_group0: remove my_id arg from discover_group0
storage_service: make _raft_topology_change_enabled more advanced
docs: document raft topology upgrade and recovery
per its description, "`/storage_service/describe_ring/`" returns the
token ranges of an arbitrary keyspace. actually, it returns the token ranges
of the first keyspace whose replication strategy is non-local and vnode-based.
this API is not used by nodetool, nor is it exercised in dtest.
scylla-manager has a wrapper for this API though, but that wrapper
is not used anywhere.
in this change, this API is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17197
Implements the /storage_service/raft_topology/upgrade route. The route
supports two methods: POST, which triggers the cluster-wide upgrade to
topology-on-raft, and GET which reports the status of the upgrade.
The table query param is added to get the describe_ring result for a
given table.
Both vnode tables and tablet tables can use this table param, so it is
easier for users to use.
If the table param is not provided by the user and the keyspace contains
a tablet table, the request will be rejected.
E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"
Refs #16509
Closes scylladb/scylladb#17118
* github.com:scylladb/scylladb:
tablets: Convert to use the new version of for_each_tablet
storage_service: Add describe_ring support for tablet table
storage_service: Mark host2ip as const
tablets: Add for_each_tablet_gently
Validate replication strategy constraints in /storage_service/tablets/move API:
- replicas are not on the same node
- replicas don't move across DC (violates RF in each DC)
- availability is not reduced due to rack overloading
Add flag to force tablet move even though dc/rack constraints aren't fulfilled.
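A simplified sketch of the three checks and the force flag, using a hypothetical `replica_location` type. The rack rule here (reject a destination rack that already holds another replica) is an assumption approximating "availability is not reduced due to rack overloading"; the real check may be more nuanced. In this sketch, force bypasses the DC and rack checks but not the same-node check.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

struct replica_location {
    std::string host;
    std::string dc;
    std::string rack;
};

// Returns a reason to reject moving `src` to `dst`, or std::nullopt if allowed.
std::optional<std::string> check_tablet_move(const std::vector<replica_location>& replicas,
                                             const replica_location& src,
                                             const replica_location& dst,
                                             bool force) {
    // Two replicas must never end up on the same node.
    for (const auto& r : replicas) {
        if (r.host == dst.host && r.host != src.host) {
            return "destination already holds a replica";
        }
    }
    if (force) {
        return std::nullopt;  // skip DC/rack constraints
    }
    if (src.dc != dst.dc) {
        return "move across DCs would change per-DC replica counts";
    }
    if (src.rack != dst.rack) {
        for (const auto& r : replicas) {
            if (r.host != src.host && r.rack == dst.rack) {
                return "destination rack already holds a replica";
            }
        }
    }
    return std::nullopt;
}
```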
Test for the change: https://github.com/scylladb/scylla-dtest/pull/3911.
Fixes: #16379.
Closes scylladb/scylladb#16648
* github.com:scylladb/scylladb:
api: service: add force param to move_tablet api
service: validate replication strategy constraints
get0() dates back to the days when Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
before this change, if no keyspaces are specified,
scylla-nodetool just enumerates all non-local keyspaces and
calls "/storage_service/keyspace_cleanup" on them one after another.
this is not quite efficient, as each such RESTful API call
forces a new active commitlog segment and flushes all tables.
so, if the target node of this command has N non-local keyspaces,
it would repeat the steps above N times. this is not necessary.
and after a topology change, we would like to run a global
"nodetool cleanup" without specifying the keyspace, so this
is a typical use case which we do care about.
to address this performance issue, in this change, we improve
an existing RESTful API call "/storage_service/cleanup_all", so
if the topology coordinator is not enabled, we fall back to
a local cleanup to cleanup all non-local keyspaces.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
according to the document "nodetool cleanup"
> Triggers removal of data that the node no longer owns
currently, scylla performs cleanup by rewriting the sstables. but
commitlog segments may still contain the mutations to the tables
which are dropped during sstable rewriting. when scylla server
restarts, the dirty mutations are replayed to the memtable. if
any of these dirty mutations touches the tables that were cleaned up,
the stale data is reapplied. this would lead to data resurrection.
so, in this change we follow the same model as major compaction:
1. force new active segment,
2. flush all tables
3. perform cleanup using compaction, which rewrites the sstables
of specified tables
because we already `flush()` all tables in
`cleanup_keyspace_compaction_task_impl::run()`, there is no need to
call `flush()` again, in `table::perform_cleanup_compaction()`, so
the `flush()` call is dropped in this function, and the tests using
this function are updated to call `flush()` manually to preserve
the existing behavior.
there are two callers of `cleanup_keyspace_compaction_task_impl`,
* one is `storage_service::sstable_cleanup_fiber()`, which listens
for the events fired by topology_state_machine, which is in turn
driven by, for instance, the "/storage_service/cleanup_all" API,
which cleans up all keyspaces one after another.
* another is "/storage_service/keyspace_cleanup", which cleans up
the specified keyspace.
in the first use case, we only need to force a new active segment
once, so another parameter to the ctor of
`cleanup_keyspace_compaction_task_impl` is introduced to specify whether
the `db.flush_all_tables()` call should be skipped.
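The resulting call sequence can be sketched with a hypothetical `fake_db` that records each step; `cleanup_keyspace` and `cleanup_all` only mirror the roles of the real task implementations, not their code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical database stand-in that records the operations performed.
struct fake_db {
    std::vector<std::string> log;
    void force_new_active_segment() { log.push_back("new_segment"); }
    void flush_all_tables() { log.push_back("flush_all"); }
    void rewrite_sstables(const std::string& ks) { log.push_back("cleanup:" + ks); }
};

// Steps 1-3 above; skip_flush lets a caller that already flushed skip 1-2.
void cleanup_keyspace(fake_db& db, const std::string& ks, bool skip_flush) {
    if (!skip_flush) {
        db.force_new_active_segment();
        db.flush_all_tables();
    }
    db.rewrite_sstables(ks);
}

// cleanup_all-style caller: switch segment and flush once, then clean up
// each keyspace without flushing again.
void cleanup_all(fake_db& db, const std::vector<std::string>& keyspaces) {
    db.force_new_active_segment();
    db.flush_all_tables();
    for (const auto& ks : keyspaces) {
        cleanup_keyspace(db, ks, /*skip_flush=*/true);
    }
}
```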
please note, there are two possible optimizations,
1. force a new active segment only if the mutations in it touch the
tables being cleaned up
2. after forcing a new active segment, only flush the (mem)tables
mutated by the non-active segments
but let's leave them for follow-up changes. this change is a
minimal fix for the data resurrection issue.
Fixes #16757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
This makes scylla-manager misinterpret the data_file_directories
somehow, issue #17078.
This change replaces usage of db::config with usage
of utils::directories in api/storage_service.cc in
order to get the paths of directories.
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
join_cluster and start_maintenance_mode are incompatible.
To make sure that only one is called when the node starts, add the MAINTENANCE option.
start_maintenance_mode sets _operation_mode to MAINTENANCE.
join_cluster sets _operation_mode to STARTING.
set_mode will result in an internal error if:
* it tries to set MAINTENANCE mode when the _operation_mode is other than NONE,
i.e. start_maintenance_mode is called after join_cluster (or it is called during
the drain, but it also shouldn't happen).
* it tries to set STARTING mode when the mode is set to MAINTENANCE,
i.e. join_cluster is called after start_maintenance_mode.
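A sketch of the two rules, modeling only the modes this message mentions and using std::logic_error in place of Scylla's internal-error reporting.

```cpp
#include <cassert>
#include <stdexcept>

// Only the modes named above; the real mode enum has more values.
enum class operation_mode { NONE, STARTING, MAINTENANCE };

void set_mode(operation_mode& current, operation_mode next) {
    // start_maintenance_mode must be the first mode change.
    if (next == operation_mode::MAINTENANCE && current != operation_mode::NONE) {
        throw std::logic_error("MAINTENANCE can only be entered from NONE");
    }
    // join_cluster must not run after start_maintenance_mode.
    if (next == operation_mode::STARTING && current == operation_mode::MAINTENANCE) {
        throw std::logic_error("cannot start joining after entering MAINTENANCE");
    }
    current = next;
}
```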
Local keyspaces do not need cleanup, and
keyspaces configured with tablets, whose
replication strategy is per-table, do not support
cleanup.
In both cases, just skip their cleanup via the API.
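A sketch of the skip rule, assuming a hypothetical `keyspace_desc` with two flags.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct keyspace_desc {
    std::string name;
    bool is_local;      // local keyspaces never need cleanup
    bool uses_tablets;  // per-table replication: cleanup not supported
};

// Names of the keyspaces the API would actually clean up.
std::vector<std::string> keyspaces_to_clean_up(const std::vector<keyspace_desc>& all) {
    std::vector<std::string> out;
    for (const auto& ks : all) {
        if (ks.is_local || ks.uses_tablets) {
            continue;  // skipped rather than failing the request
        }
        out.push_back(ks.name);
    }
    return out;
}
```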
Fixes #16738
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#16785