There's a database::get_snapshot_details() method that returns collection of all snapshots for all ks.cf out there and there are several *snapshot_details* aux structures around it. This PR keeps only one "details" and cleans up the way it propagates from database up to the respective API calls.
Closesscylladb/scylladb#18317
* github.com:scylladb/scylladb:
snapshot_ctl: Brush up true_snapshots_size() internals
snapshot_ctl: Remove unused details struct
snapshot_ctl: No double recoding of details
database,snapshots: Move database::snapshot_details into snapshot_ctl
database,snapshots: Make database::get_snapshot_details() return map, not vector
table,snapshots: Move table::snapshot_details into snapshot_ctl
We are going to make the `consistent-topology-changes` experimental
feature unused in 6.0. However, the topology upgrade procedure will
be manual and voluntary, so some 6.0 clusters will be using the
gossip-based topology. Therefore, we need to continue testing the
gossip-based topology. The solution is introducing a new flag,
`force-gossip-topology-changes`, that will enforce the gossip-based
topology in a fresh cluster.
In this patch, we only introduce the parameter without any effect.
Here is the explanation. Making `consistent-topology-changes` unused
and introducing `force-gossip-topology-changes` requires adjustments
in scylla-dtest. We want to merge changes to scylladb and scylla-dtest
in a way that ensures all tests are run correctly during the whole
process. If we merged all changes to scylladb first, before merging
the scylla-dtest changes, all tests would run with the raft-based
topology and the ones excluded in the raft-based topology would fail.
We also can't merge all changes to scylla-dtest first. However, we
can follow this plan:
1. scylladb: merge this patch
2. scylla-dtest: start using `force-gossip-topology-changes`
in jobs that run without the raft-based topology
3. scylladb: merge the rest of the changes
4. scylla-dtest: merge the rest of the changes
Ref scylladb/scylladb#17802Closesscylladb/scylladb#18284
Fixes#18329
named_file::assign call uses old object "known_size" after a move
of the object. While this is wholly ok, since the attribute accessed
will not be modified/destroyed by the move, it causes warnings in
"tidy" runs, and might confuse or cause real errors should impl. change.
Closesscylladb/scylladb#18337
Previous patches broke indentation in this method. Fix it by shortening
the summation loop with the help of std::accumulate()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently database::get_snapshot_details() returns a collection of
snapshots. The snapshot_ctl converts this collection into similarly
looking one with slightly different structures inside. The resulting
collection is converted one more time on the API layer into another
similarly looking map.
This patch removes the intermediate conversion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that it's in-sync with table::get_snapshot_details(). Next patches
will improve this place even further.
Also, there can be many snapshots and vector can grow large, but that's
less of an issue here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.c, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<uint8_t>`, as it considers it
as a string-like type, but `basic_sstring<uint8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to assure that {fmt} format this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but operator<< based formatter does not
provide these method, we have to include this change in the change
switching to {fmt}, otherwise the change specializing
`fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add "#include "utils/to_string.hh" where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18299
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.
When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.
Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.
Normally, since the DOWN nodes to ignore already joined the ring, the
remaining node will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.
Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.
The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.
But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.
The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`. If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).
Fixesscylladb/scylladb#15787Closesscylladb/scylladb#15788
* github.com:scylladb/scylladb:
storage_service: join_token_ring: load ignored nodes state if replacing
storage_service: replacement_info: return ignore_nodes state
locator: host_id_or_endpoint: keep value as variant
gms: endpoint_state: add getters for host_id, dc_rack, and tokens
storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
gossiper: add_saved_endpoint: set dc and rack
gossiper: add_saved_endpoint: fixup indentation
gossiper: add_saved_endpoint: make host_id mandatory
gossiper: add load_endpoint_state
gossiper: start_gossiping: log local state
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.
refs: #2737
refs: #2795Closesscylladb/scylladb#16558
* github.com:scylladb/scylladb:
storage_service: Drain view builder on drain too
view_builder: Generalize mark_as_built(view_ptr) method
view_builder: Move mark_existing_views_as_built from storage service
storage_service: Add view_builder& reference
main,cql_test_env: Move view_builder start up (and make unconditional)
The partitions_bigger_than_threshold is incremented only if the previous
check detects that the partition exceeds a threshold by its size. It's
done with an extra if, but it can be done without (explicit) condition
as bool type is guaranteed by the standard to convert into integers as
true = 1 and false = 0
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18217
Pack the topology-related data loaded from system.peers
in `gms::load_endpoint_state`, to be used in a following
patch for `add_saved_endpoint`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
Before the patch selection of auth version depended
on consistent topology feature but during raft recovery
procedure this feature is disabled so we need to persist
the version somewhere to not switch back to v1 as this
is not supported.
During recovery auth works in read-only mode, writes
will fail.
Fixes https://github.com/scylladb/scylladb/issues/17736Closesscylladb/scylladb#18039
* github.com:scylladb/scylladb:
auth: keep auth version in scylla_local
auth: coroutinize service::start
Before the patch selection of auth version depended
on consistent topology feature but during raft recovery
procedure this feature is disabled so we need to persist
the version somewhere to not switch back to v1 as this
is not supported.
During recovery auth works in read-only mode, writes
will fail.
It now happens in initialize_virtual_tables(), but this function is split into sub-calls and iterates over virtual tables map several times to do its work. This PR squashes it into a straightforward code which is shorter and, hopefully, easier to read.
Closesscylladb/scylladb#18133
* github.com:scylladb/scylladb:
virtual_tables: Open-code install_virtual_readers_and_writers()
virtual_tables: Move readers setup loop into add_table()
virtual_tables: Move tables creation loop into add_table()
virtual_tables: Make add_tablet() a coroutine
virtual_tables: Open-code register_virtual_tables()
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
It's pretty short already and is naturally a "part" of
initialize_virtual_tables(). Neither it installs writers any longer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patch, after virtual tables are registered the
registry is iterated over to install virtual readers onto each entry.
Again, this can happen at the time of registering, no need in dedicated
loop for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Once virtual_tables map is populated, it's iterated over to create
replica::table entries for each virtual table. This can be done in the
same place where the virtual table is created, no need in dedicated loop
for it nowadays.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's naturally a "part" of initialize_virtual_tables(). Further patching
gets possible with it being open-coded.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixesscylladb/scylladb#17893
* 'gleb/initial-token-v1' of github.com:scylladb/scylla-dev:
dht: drop unused parameter from get_random_bootstrap_tokens() function
test: add test for initial_token parameter
topology coordinator: use provided initial_token parameter to choose bootstrap tokens
topology cooordinator: propagate initial_token option to the coordinator
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds. Specifically, when retiring features.
Fix by removing feature_service.hh from headers, and supply forward
declarations and includes in .cc where needed.
Closesscylladb/scylladb#18005
When loading topology state, nodes are checked to have or not to have
"tokens" field set. The check is done based on node state and it's
spread across the loading method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#17957
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change:
* add `format_as()` for `segment` so we can use it as a fallback
after upgrading to {fmt} v10
* use fmt::streamed() when formatting `segment`, this will be used
the intermediate solution before {fmt} v10 after dropping
`FMT_DEPRECATED_OSTREAM` macro
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18019
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
Fixes#17890Closesscylladb/scylladb#17891
This patch introduces raft-based service levels.
The difference to the current method of working is:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation
Service levels are migrated to v2 in topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to `system.scylla_local` table. If this row is present, raft data accessor is created from the beginning and it handles recovery mode procedure (service levels will be read from v2 table even if consistent topology is disabled then)
Fixes#17926Closesscylladb/scylladb#16585
* github.com:scylladb/scylladb:
test: test service levels v2 works in recovery mode
test: add test for service levels migration
test: add test for service levels snapshot
test:topology: extract `trigger_snapshot` to utils
main: create raft dda if sl data was migrated
service:qos: store information about sl data migration
service:qos: service levels migration
main: assign standard service level DDA before starting group0
service:qos: fix `is_v2()` method
service:qos: add a method to upgrade data accessor
test: add unit_test_raft_service_levels_accessor
service:storage_service: add support for service levels raft snapshot
service:qos: add abort_source for group0 operations
service:qos: raft service level distributed data accessor
service:qos: use group0_guard in data accessor
cql3:statements: run service level statements on shard0 with raft guard
test: fix overrides in unit_test_service_levels_accessor
service:qos: fix indentation
service:qos: coroutinize some of the methods
db:system_keyspace: add `SERVICE_LEVELS_V2` table
service:qos: extract common service levels' table functions
In this PR we add timeouts support to raft groups registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
add exposes its interface with additional `raft_timeout` parameter. If
it's set, the wrapper cancels the `abort_source` after certain amount of
time. The value of the timeout can be specified either in the
`raft_timeout` parameter, or the default value can be set in `the
raft_server_with_timeouts` class constructor.
The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with
`group0-raft-op-timeout-in-ms` parameter.
The new api allows the client to decide whether to use timeouts or not.
In this PR we are reviewing all the group0 call sites and add
`raft_timeout` if that makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.
Fixesscylladb/scylladb#16604Closesscylladb/scylladb#17590
* github.com:scylladb/scylladb:
migration_manager: use raft_timeout{}
storage_service::join_node_response_handler: use raft_timeout{}
storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
storage_service::set_tablet_balancing_enabled: use raft_timeout{}
storage_service::move_tablet: use raft_timeout{}
raft_check_and_repair_cdc_streams: use raft_timeout{}
raft_timeout: test that node operations fail properly
raft_rebuild: use raft_timeout{}
do_cluster_cleanup: use raft_timeout{}
raft_initialize_discovery_leader: use raft_timeout{}
update_topology_with_local_metadata: use with_timeout{}
raft_decommission: use raft_timeout{}
raft_removenode: use raft_timeout{}
join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
raft_group0: make_raft_config_nonvoter: add abort_source parameter
manager_client: server_add with start=false shouldn't call driver_connect
scylla_cluster: add seeds parameter to the add_server and servers_add
raft_server_with_timeouts: report the lost quorum
join_node_request_handler: add raft_timeout{} for start_operation
skip_mode: add platform_key
auth: use raft_timeout{}
raft_group0_client: add raft_timeout parameter
raft_group_registry: add group0_with_timeouts
utils: add composite_abort_source.hh
error_injection: move api registration to set_server_init
error_injection: add inject_parameter method
error_injection: move injection_name string into injection_shared_data
error_injection: pass injection parameters at startup
Save information whether service levels data was migrated to v2 table.
The information is stored in `system.scylla_local` table. It's
written with raft command and included in raft snapshot.
The table has the same schema as `system_distributed.service_levels`.
However it's created entirely at once (unlike old table which creates
base table first and then it adds other columns) because `system` tables
are local to the node.
Getting a service level(s) will be done the same way in raft-based
service levels as it's done in standard service levels, so those
funtions are extracted to reused it.
The token ring table is a virtual table (`system.token_ring`), which contains the ring information for all keyspaces in the system. This is essentially an alternative to `nodetool describering`, but since it is a virtual table, it allows for all the usual filtering/aggregation/etc. that CQL supports.
Up until now, this table only supported keyspaces which use vnodes. This PR adds support for tablet keyspaces. To accommodate these keyspaces a new `table_name` column is added, which is set to `ALL` for vnodes keyspaces. For tablet keyspaces, this contains the name of the table.
Simple sanity tests are added for this virtual table (it had none).
Fixes: #16850Closesscylladb/scylladb#17351
* github.com:scylladb/scylladb:
test/cql-pytest: test_virtual_tables: add test for token_ring table
db/virtual_tables: token_ring_table: add tablet support
db/virtual_tables: token_ring_table: add table_name column
db/virtual_tables: token_ring_table: extract ring emit
service/storage_service: describe_ring_for_table(): use topology to map hostid to ip
Injection parameters can be used in the lambda passed to
inject_with_handler method to take some values from
the test. However, there was no way to set values to these
parameters on node startup, only through
the error injection REST api. Therefore, we couldn't rely
on this when inject_with_handler is used during
node startup, it could trigger before we call the api
from the test.
In this commit with solve this problem by allowing these
parameters to be assigned through scylla.yaml config.
The defer.hh header was added to error_injection.hh to fix
compilation after adding error_injection.hh to config.hh,
defer function is used in error_injection.hh.
sstables_manager now depends on system_keyspace for access to the
system.sstables table, needed by object storage. This violates
modularity, since sstables_manager is a relatively low-level leaf
module while system_keyspace integrates large parts of the system
(including, indirectly, sstables_manager).
One area where this is grating is sstables::test_env, which has
to include the much higher level cql_test_env to accommodate it.
Fix this by having sstables_manager expose its dependency on
system_keyspace as an interface, sstables_registry, and have
system_keyspace implement the glue logic in
system_keyspace_sstables_manager.
Closesscylladb/scylladb#17868
This PR fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because tablet replica allocation for
rebuild will not be able to find a viable destination, as the replacing node
is not considered to be a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.
The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is in left state and replacing node is in normal state.
The replacing node waits for this draining to be complete on boot
before the node is considered booted.
Fixes https://github.com/scylladb/scylladb/issues/17025
Nodes in the left state will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet replica sets.
Currently left nodes are never removed from topology, so will
accumulate in memory. We could garbage-collect them from topology
coordinator if a left node is absent in any replica set. That means we
need a new state - left_for_real.
Closesscylladb/scylladb#17388
* github.com:scylladb/scylladb:
test: py: Add test for view replica pairing after replace
raft, api: Add RESTful API to query current leader of a raft group
test: test_tablets_removenode: Verify replacing when there is no spare node
doc: topology-on-raft: Document replace behavior with tablets
tablets, raft topology: Rebuild tablets after replacing node is normal
tablets: load_balancer: Access node attributes via node struct
tablets: load_balancer: Extract ensure_node()
mv: Switch to using host_id-based replica set
effective_replication_map: Introduce host_id-based get_replicas()
raft topology: Keep nodes in the left state to topology
tablets: Introduce read_required_hosts()
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.
Fixes#17854Closesscylladb/scylladb#17855
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::functions::function`.
please note, because we use `std::ostream` as the parameter of
the polymorphism implementation of `function::print()`.
without an intrusive change, we have to use `fmt::ostream_formatter`
or at least use similar technique to format the `function` instance
into an instance of `ostream` first. so instead of implementing
a "native" `fmt::formatter`, in this change, we just use
`fmt::ostream_formatter`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17832
This is necessary to not break replica pairing between base and
view. After replacing a node, tablet replica set contains for a while
the replaced node which is in the left state. This node is not
returned by the IP-based get_natural_endpoints() so the replica
indexes would shift, changing the pairing with the view.
The host_id-based replica set always has stable indexes for replicas.
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on
their location. For example materialized views code which pairs base
replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order
to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.
We load topology infromation only for left nodes which are actually
referenced by any tablet. To achieve that, topology loading code
queries system.tablet for the set of hosts. This set is then passed to
system.topology loading method which decides whether to load
replica_state for a left node or not.