locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.
All the boost and topology tests pass with this change.
In this commit we switch the function
calculate_effective_replication_map to use the new
token_metadata. We do this by employing our new helper
calculate_natural_ips function. We can't use this helper for
current_endpoints/target_endpoints though,
since in that case we won't add the IP to the
pending_endpoints in the replace-with-same-ip scenario
The token_metadata_test is migrated to host_ids in the same
commit to make it pass. Other tests work because they fill
both versions of the token_metadata, but for this test it was
simpler to just migrate it straight away. The test constructs
the old token_metadata over the new token_metadata,
this means only the get_new() method will work on it. That's
why we also need to switch some other functions
(maybe_remove_node_being_replaced, do_get_natural_endpoints,
get_replication_factor) to the new version in the same commit.
All the boost and topology tests pass with this change.
We've updated all the places where token_metadata
is mutated, and now we can progress to the next stage
of the refactoring - gradually switching the read
code paths.
The calculate_natural_endpoints function
is at the core of all of them. It decides to what nodes
the given token should be replicated to for the given
token_metadata. It has a lot of usages in various contexts,
we can't switch them all in one commit, so instead we
allowed the function to behave in both ways. If
use_host_id parameter is false, the function uses the provided
token_metadata as is and returns endpoint_set as a result.
If it's true, it uses get_new() on the provided token_metadata
and returns host_id_set as a result.
The scope of the whole refactoring is limited to the erm data
structure, its interface will be kept inet_address based for now.
This means we'll often need to resolve host_ids to inet_address-es
as soon as we got a result from calculated_natural_endpoints.
A new calculate_natural_ips function is added for convenience.
It uses the new token_metadata and immediately resolves
returned host_id-s to inet_address-es.
The auxiliary declarations natural_ep_type, set_type, vector_type,
get_self_id, select_tm are introduced only for the sake of
migration, they will be removed later.
With this commit, we begin the next stage of the
refactoring - updating the new version of the token_metadata
in all places where the old version is currently being updated.
In this commit we assign host_id of this node, both in main.cc
and in boost tests.
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.
We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
In the next commits token_metadata will be
made a template with NodeId=inet_address|host_id
parameter. This parameter will be passed to dc_rack_fn
function, so it also should be made a template.
For the host_id-based token_metadata we want host_id
to be the main node key, meaning it should be used
in add_or_update_endpoint to find the node to update.
For the inet_address-based token_metadata version
we want to retain the old behaviour during transition period.
In this commit we introduce key_kind parameter and use
key_kind::inet_address in all current topology usages.
Later we'll use key_kind::host_id for the new token_metadata.
In the last commits of the series, when the new token_metadata
version is used everywhere, we will remove key_kind enum.
Reject ALTER KEYSPACE request for NetworkTopologyStrategy when
replication options are missed.
Also reject CREATE KEYSPACE with no replication factor options.
Cassandra has a default_keyspace_rf configuration that may allow such
CREATE KEYSPACE commands, but Scylla doesn't have this option (refs #16028).
fixes#10036Closesscylladb/scylladb#16221
this source file was added in d3d83869. so let's update cmake
as well.
sessions_tests was added in the same commit, so add it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16344
This PR implements the following new nodetool commands:
* decomission
* rebuild
* removenode
* getlogginglevels
* setlogginglevel
* move
* refresh
All commands come with tests and all tests pass with both the new and the current nodetool implementations.
Refs: https://github.com/scylladb/scylladb/issues/15588Closesscylladb/scylladb#16348
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement the refresh command
tools/scylla-nodetool: implement the move command
tools/scylla-nodetool: implement setlogginglevel command
tools/sclla-sstable: implement the getlogginglevels command
tools/scylla-nodetool: implement the removenode command
tools/scylla-nodetool: implement the rebuild command
tools/scylla-nodetool: implement the decommission command
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
This series adds preparation patches for file stream tablet implementation in enterprise branch. It minimizes the differences between those two branches.
Closesscylladb/scylladb#16297
* github.com:scylladb/scylladb:
messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
compaction_group_for_token: Handle minimum_token and maximum_token token
serializer: Add temporary_buffer support
cql_test_env: Allow messaging_service to start listen
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Closesscylladb/scylladb#16302
* github.com:scylladb/scylladb:
commitlog: Fix allocation size check to take sector overhead into account.
commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
The test test_many_partitions is very slow, as it tests a slow scan over
a lot of partitions. This was observed to time out on the slower ARM
machines, making the test flaky. To prevent this, create an
extra-patient cql connection with a 10 minutes timeout for the scan
itself.
Fixes: #16145Closesscylladb/scylladb#16303
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
This is needed for rpc calls to work in the tests. With this patch, by
default, messaging_service does not listen as it was before.
This is useful for file stream for tablet test.
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.
Also will be used in tests to ensure tablets stick to expected locations.
Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors.
Refs: https://github.com/scylladb/scylladb/issues/16255Closesscylladb/scylladb#16289
* github.com:scylladb/scylladb:
Update unified/build_unified.sh
Update main.cc
Update dist/common/scripts/scylla-housekeeping
Typos: fix typos in code
utils::fb_utilities is a global in-memory registry for storing and retrieving broadcast_address and broadcat_rpc_address.
As part of the effort to get rid of all global state, this series gets rid of fb_utilities.
This will eventually allow e.g. cql_test_env to instantiate multiple scylla server nodes, each serving on its own address.
Closesscylladb/scylladb#16250
* github.com:scylladb/scylladb:
treewide: get rid of now unused fb_utilities
tracing: use locator::topology rather than fb_utilities
streaming: use locator::topology rather than fb_utilities
raft: use locator::topology/messaging rather than fb_utilities
storage_service: use locator::topology rather than fb_utilities
storage_proxy: use locator::topology rather than fb_utilities
service_level_controller: use locator::topology rather than fb_utilities
misc_services: use locator::topology rather than fb_utilities
migration_manager: use messaging rather than fb_utilities
forward_service: use messaging rather than fb_utilities
messaging_service: accept broadcast_addr in config rather than via fb_utilities
messaging_service: move listen_address and port getters inline
test: manual: modernize message test
table: use gossiper rather than fb_utilities
repair: use locator::topology rather than fb_utilities
dht/range_streamer: use locator::topology rather than fb_utilities
db/view: use locator::topology rather than fb_utilities
database: use locator::topology rather than fb_utilities
db/system_keyspace: use topology via db rather than fb_utilities
db/system_keyspace: save_local_info: get broadcast addresses from caller
db/hints/manager: use locator::topology rather than fb_utilities
db/consistency_level: use locator::topology rather than fb_utilities
api: use locator::topology rather than fb_utilities
alternator: ttl: use locator::topology rather than fb_utilities
gossiper: use locator::topology rather than fb_utilities
gossiper: add get_this_endpoint_state_ptr
test: lib: cql_test_env: pass broadcast_address in cql_test_config
init: get_seeds_from_db_config: accept broadcast_address
locator: replication strategies: use locator::topology rather than fb_utilities
locator: topology: add helpers to retrieve this host_id and address
snitch: pass broadcast_address in snitch_config
snitch: add optional get_broadcast_address method
locator: ec2_multi_region_snitch: keep local public address as member
ec2_multi_region_snitch: reindent load_config
ec2_multi_region_snitch: coroutinize load_config
ec2_snitch: reindent load_config
ec2_snitch: coroutinize load_config
thrift: thrift_validation: use std::numeric_limits rather than fb_utilities
Currently, scylla.yaml is read conditionally, if either the user
provided `--scylla-yaml-file` command line parameter, or if deducing the
data dir location from the sstable path failed.
We want the scylla.yaml file to be always read, so that when working
with encrypted file (enterprise), scylla-sstable can pick up the
configuration for the encryption.
This patch makes scylla-sstable always attempt to read the scylla-yaml
file, whether the user provided a location for it or not. When not, the
default location is used (also considering the `SCYLLA_CONF` and
`SCYLLA_HOME` environment variables.
Failing to find the scylla.yaml file is not considered an error. The
rational is that the user will discover this if they attempt to do an
operation that requires this anyway.
There is a debug-level log about whether it was successfully read or
not.
Fixes: #16132Closesscylladb/scylladb#16174
Storage service uses group0 internally, but group0 is create long after
storage service is initialized and passed to it using ss::set_group0()
function. What it means is that during shutdown group0 is destroyed
before ss::stop() is called and thus storage service is left with a
dangling reference. Fix it by introducing a function that cancels all
group0 operations and waits for background fibers to complete. For that
we need separate abort source for group0 operation which the patch
series also introduces.
* 'gleb/group0-ss-shutdown' of github.com:scylladb/scylla-dev:
storage_service: topology coordinator: ignore abort_requested_exception in background fibers
storage_service: fix de-initialization order between storage service and group0_service
For getting rid of fb_utilities.
In the future, that could be used to instantiate
multiple scylla node instances.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
in 7a1fbb38, a new test is added to an existing test for
comparing the UUIDs with different time stamps, but we should tighten
the test a little bit to reflect the intention of the test:
the timestamp of "2023-11-24 23:41:56" should be less than
"2023-11-24 23:41:57".
in this change, we replace LE with LT to correct it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16245
Using consistent cluster management and not using schema commitlog
ends with a bad configuration throw during bootstrap. Soon, we
will make consistent cluster management mandatory. This forces us
to also make schema commitlog mandatory, which we do in this patch.
A booting node decides to use schema commitlog if at least one of
the two statements below is true:
- the node has `force_schema_commitlog=true` config,
- the node knows that the cluster supports the `SCHEMA_COMMITLOG`
cluster feature.
The `SCHEMA_COMMITLOG` cluster feature has been added in version
5.1. This patch is supposed to be a part of version 6.0. We don't
support a direct upgrade from 5.1 to 6.0 because it skips two
versions - 5.2 and 5.4. So, in a supported upgrade we can assume
that the version which we upgrade from has schema commitlog. This
means that we don't need to check the `SCHEMA_COMMITLOG` feature
during an upgrade.
The reasoning above also applies to Scylla Enterprise. Version
2024.2 will be based on 6.0. Probably, we will only support
an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even
if we support an upgrade from 2023.x, this patch won't break
anything because 2023.1 is based on 5.2, which has schema
commitlog. Upgrades from 2022.x definitely won't be supported.
When we populate a new cluster, we can use the
`force_schema_commitlog=true` config to use schema commitlog
unconditionally. Then, the cluster feature check is irrelevant.
This check could fail because we initiate schema commitlog before
we learn about the features. The `force_schema_commitlog=true`
config is especially useful when we want to use consistent cluster
management. Failing feature checks would lead to crashes during
initial bootstraps. Moreover, there is no point in creating a new
cluster with `consistent_cluster_management=true` and
`force_schema_commitlog=false`. It would just cause some initial
bootstraps to fail, and after successful restarts, the result would
be the same as if we used `force_schema_commitlog=true` from the
start.
In conclusion, we can unconditionally use schema commitlog without
any checks in 6.0 because we can always safely upgrade a cluster
and start a new cluster.
Apart from making schema commitlog mandatory, this patch adds two
changes that are its consequences:
- making the unneeded `force_schema_commitlog` config unused,
- deprecating the `SCHEMA_COMMITLOG` feature, which is always
assumed to be true.
Closesscylladb/scylladb#16254
Refs #15269
Unit test to check that trying to skip past EOF in a borked segment
will not crash the process. file_data_input_impl asserts iff caller
tries this.