Change the token_metadata type to token_metadata2 in
the signatures of CDC-related methods in
storage_service and cdc/generation. Use
get_new_strong to get a pointer to the new host_id-based
token_metadata from the inet_address-based one,
living in the shared_token_metadata.
The starting point of the patch is in
storage_service::handle_global_request. We change the
tmptr type to token_metadata2 and propagate the change
down the call chains. This includes token-related methods
of the boot_strapper class.
Returns this node's endpoint_state_ptr.
With this entry point, the caller doesn't need to
call get_broadcast_address.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.
Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match the id
column in CDC_GENERATIONS_V3.
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
In a future commit, we change how we initialize the uuid of the
new CDC generation in the Raft-based topology. This forces us to
move this initialization out of the make_new_generation_data
function shared between Raft-based and gossiper-based topologies.
We also rename make_new_generation_data to
make_new_generation_description since it only returns
cdc::topology_description now.
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.
After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
In the following commit, we implement the
get_cdc_generation_mutations_v3 function, which is very similar to
get_cdc_generation_mutations_v2. The only differences in creating
mutations between CDC_GENERATIONS_V2 and CDC_GENERATIONS_V3 are:
- a need to set the num_ranges cell for CDC_GENERATIONS_V2,
- different partition keys,
- different clustering keys.
To avoid code duplication, we introduce
get_common_cdc_generation_mutations, which does most of the work
shared by both functions.
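A minimal sketch of this factoring, with hypothetical, heavily simplified signatures (the real functions build schema-bound mutations and take a schema and timestamp):
```cpp
// Illustration only: the shared-helper pattern described above. Types and
// names are simplified stand-ins, not the actual Scylla declarations.
#include <string>
#include <vector>

struct row_sketch {
    std::string partition_key;
    std::string clustering_key;
};

// Work common to both layouts: emit one row per token range end.
static std::vector<row_sketch> get_common_cdc_generation_mutations_sketch(
        const std::string& partition_key, const std::vector<std::string>& range_ends) {
    std::vector<row_sketch> rows;
    for (const auto& range_end : range_ends) {
        rows.push_back({partition_key, range_end});
    }
    return rows;
}

std::vector<row_sketch> get_cdc_generation_mutations_v2_sketch(
        const std::string& generation_uuid, const std::vector<std::string>& range_ends) {
    // v2: partition key is the generation uuid, clustering key is range_end;
    // it additionally sets the num_ranges cell (elided here).
    return get_common_cdc_generation_mutations_sketch(generation_uuid, range_ends);
}

std::vector<row_sketch> get_cdc_generation_mutations_v3_sketch(
        const std::string& generation_id, const std::vector<std::string>& range_ends) {
    // v3: a single partition (illustrative key below) with clustering key (id, range_end).
    auto rows = get_common_cdc_generation_mutations_sketch("cdc_generations", range_ends);
    for (auto& r : rows) {
        r.clustering_key = generation_id + "," + r.clustering_key;
    }
    return rows;
}
```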
In the following commits, we modify the CDC_GENERATIONS_V3 schema
to enable efficient clearing of obsolete CDC generation data.
These modifications make the current get_cdc_generation_mutations
work only for the CDC_GENERATIONS_V2 schema, and we need a new
function for CDC_GENERATIONS_V3, so we add the "_v2" suffix.
Now that the endpoint_state isn't changed in place,
we do not need to copy it to each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.
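A rough sketch of the idea, with illustrative types (the real code passes gms::endpoint_state via its endpoint_state_ptr alias):
```cpp
// Sketch only: subscribers share one immutable snapshot through a
// seastar::lw_shared_ptr instead of each receiving a copy of the state.
#include <seastar/core/shared_ptr.hh>
#include <functional>
#include <vector>

struct endpoint_state_sketch {
    // heartbeat state, application states, ...
};

using endpoint_state_ptr_sketch = seastar::lw_shared_ptr<const endpoint_state_sketch>;

void notify_subscribers_sketch(
        const std::vector<std::function<void(endpoint_state_ptr_sketch)>>& subscribers,
        endpoint_state_ptr_sketch snapshot) {
    for (const auto& subscriber : subscribers) {
        subscriber(snapshot);   // no per-subscriber copy; all see the same snapshot
    }
}
```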
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to look up the application_state again using the
endpoint, as both callers already have a reference to
the endpoint_state handy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).
With that, gossiper::get_endpoint_states() can be made private.
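A possible shape for such a helper, sketched with stand-in types (the actual gossiper signatures may differ):
```cpp
// Sketch: synchronous, copy-free traversal over the endpoint state map.
// The callback receives const references and must not yield, so the map
// cannot be mutated underneath the loop.
#include <functional>
#include <map>
#include <string>

struct endpoint_state_sketch { /* ... */ };

class gossiper_sketch {
    // keyed by endpoint address (simplified to a string here)
    std::map<std::string, endpoint_state_sketch> _endpoint_state_map;
public:
    void for_each_endpoint_state(
            const std::function<void(const std::string&, const endpoint_state_sketch&)>& func) const {
        for (const auto& [endpoint, state] : _endpoint_state_map) {
            func(endpoint, state);  // no copies made
        }
    }
};
```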
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As was described in the previous patch, this method is explicitly called
by the storage service after updating the bootstrap state, so it's unneeded.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The legacy_handle_cdc_generation() checks whether the node has bootstrapped
with the help of a system_keyspace method. It is called in two
cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().
The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still one case where it may be hit -- decommission. When performed,
it calls storage_service::leave_ring(), which updates the bootstrap state
to be "needed", thus preventing the cdc gen. service from doing anything
inside gossiper's on_...() notifications.
It would be more correct to stop the cdc gen. service from handling gossiper
notifications by unsubscribing it, rather than by adding fragile implicit
dependencies on the bootstrap state.
Checks for sys.dist.ks in the legacy_handle_...() are kept in the form
of an on-internal-error. The system distributed keyspace is activated by
the storage service even before the bootstrap state is updated and is
never deactivated, but it's good to have this assertion anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.
We make sure that all notifications (except for `before_change`, which
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.
An endpoint lock gets a unique permit_id that is passed to the
notifications. The subscribers pass it back when they call the gossiper
again for the same endpoint on paths that modify the endpoint_state and
may acquire the same endpoint lock, preventing a deadlock.
Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471
Closes #14845
* github.com:scylladb/scylladb:
gossiper: replicate: ensure non-null permit
gossiper: add_saved_endpoint: lock_endpoint
gossiper: mark_as_shutdown: lock_endpoint
gossiper: real_mark_alive: lock_endpoint
gossiper: advertise_token_removed: lock_endpoint
gossiper: do_status_check: lock_endpoint
gossiper: remove_endpoint: lock_endpoint if needed
gossiper: force_remove_endpoint: lock_endpoint if needed
storage_service: lock_endpoint when removing node
gossiper: use permit_id to serialize state changes while preventing deadlocks
gossiper: lock_endpoint: add debug messages
utils: UUID: make default tagged_uuid ctor constexpr
gossiper: lock_endpoint must be called on shard 0
gossiper: replicate: simplify interface
gossiper: mark_as_shutdown: make private
gossiper: convict: make private
gossiper: mark_as_shutdown: do not call convict
Pass permit_id to subscribers when we acquire one
via lock_endpoint. The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.
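A simplified sketch of the permit idea, with illustrative types (the real lock_endpoint is asynchronous, per-endpoint, and must run on shard 0):
```cpp
// Illustration only: an endpoint lock that remembers the permit_id of its
// holder, so a nested acquisition with the same permit is recognized
// instead of deadlocking. Assumes single-threaded (per-shard) execution.
#include <optional>
#include <stdexcept>

using permit_id_sketch = unsigned long;

struct endpoint_lock_sketch {
    std::optional<permit_id_sketch> holder;

    void lock(permit_id_sketch permit) {
        if (holder == permit) {
            // The caller already holds this endpoint's lock under the same
            // permit (a notification called back into the gossiper): no-op.
            return;
        }
        if (holder) {
            // Held under a different permit: a real implementation waits here.
            throw std::runtime_error("would wait for the current holder");
        }
        holder = permit;
    }

    void unlock(permit_id_sketch permit) {
        if (holder == permit) {
            holder.reset();
        }
    }
};
```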
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We add the CDC generation optimality check in
`storage_service::raft_check_and_repair_cdc_streams` so that it doesn't
create new generations when unnecessary. Since
`generation_service::check_and_repair_cdc_streams` already has this
check, we extract it into the new `is_cdc_generation_optimal` function to
avoid duplicating the code.
After this change, multiple tasks could wait for a single generation
change. Calling `signal` on `topology_state_machine.event` wouldn't wake
them all. Moreover, we must ensure the topology coordinator wakes when
its logic expects it. Therefore, we change all `signal` calls on
`topology_state_machine.event` to `broadcast`.
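For reference, a short sketch of the difference, assuming the event is backed by a seastar::condition_variable:
```cpp
// Sketch: with several fibers waiting on the same event, signal() wakes
// only one of them, while broadcast() wakes them all.
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

seastar::condition_variable event;

seastar::future<> wait_for_topology_event() {
    co_await event.wait();   // multiple tasks may be parked here
    // re-check the observed state and proceed
}

void on_topology_change() {
    // event.signal();       // would wake a single waiter only
    event.broadcast();       // wakes every waiting task
}
```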
We delay the deletion of the `new_cdc_generation` request to the moment
when the topology transition reaches the `publish_cdc_generation` state.
We need this change to ensure the added CDC generation optimality check
in the next commit has the intended effect. Without this change, it
would be possible that a task makes the `new_cdc_generation` request,
and then, after this request was removed but before committing the new
generation, another task also makes the `new_cdc_generation` request. In
such a scenario, two generations are created, but only one should be. After
delaying the deletion of `new_cdc_generation` requests, the second
request would have no effect.
Additionally, we modify the `test_topology_ops.py` test in a way that
verifies the new changes. We call
`storage_service::raft_check_and_repair_cdc_streams` multiple times
concurrently and verify that exactly one generation has been created.
Fixes #14055
Closes #14789
* github.com:scylladb/scylladb:
storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
storage_service: delay deletion of the new_cdc_generation request
raft topology: broadcast on topology_state_machine.event instead of signal
cdc: implement the is_cdc_generation_optimal function
In the following commits, we add the CDC generation optimality
check to storage_service::raft_check_and_repair_cdc_streams so
that it doesn't create new CDC generations when unnecessary. Since
generation_service::check_and_repair_cdc_streams already has
this check, we extract it into the new is_cdc_generation_optimal
function to avoid duplicating the code.
In preparation for ensuring safe access to the column-family-related
maps, add tables_metadata, whose members will be protected by an rwlock.
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies mutations
generated by `get_cdc_generation_mutations` n times, where n is picked so that
the memory size of all mutations exceeds `max_command_size`.
This test passes if cdc generation data is committed by raft in multiple commands.
If all the data is committed in a single command, the leader node will loop trying
to send the raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting.
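a typical shape for this (sketch; the actual formatter in the tree may differ) is a fmt::formatter specialization replacing the stream operator:
```cpp
// sketch only: generation_id is represented here by a plain struct
// wrapping an integer timestamp, not the real cdc::generation_id.
#include <fmt/core.h>
#include <cstdint>

namespace cdc {
struct generation_id_sketch {
    int64_t ts;
};
}

template <>
struct fmt::formatter<cdc::generation_id_sketch> {
    constexpr auto parse(fmt::format_parse_context& ctx) { return ctx.begin(); }
    auto format(const cdc::generation_id_sketch& id, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{}", id.ts);
    }
};

// usage: fmt::print("generation {}\n", cdc::generation_id_sketch{42});
```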
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
if the visitor clauses are the same, we can just use a generic visitor
by specifying the parameter as `auto&`. simpler this way.
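for example (a sketch, not the actual visitor touched by this change):
```cpp
// sketch: identical per-alternative clauses collapsed into one generic lambda.
#include <iostream>
#include <variant>

int main() {
    std::variant<int, double> v = 1.5;

    // before: one lambda per alternative, all with the same body
    // after: a single clause taking `auto&`
    std::visit([](auto& x) { std::cout << x << '\n'; }, v);
}
```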
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13626
now that C++20 generates operator== for us, there is no need to
handcraft it manually. also, since C++17, the standard library offers
a default implementation of operator== for `std::variant<>`, so no need
to implement it by ourselves.
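a minimal sketch of the pattern (not the actual classes changed here):
```cpp
// sketch: a defaulted operator== replaces the handcrafted member-wise
// comparison; the std::variant member compares via the variant's own
// operator==, available since C++17.
#include <cstdint>
#include <string>
#include <variant>

struct generation_id_sketch {
    std::variant<int64_t, std::string> id;

    // before: bool operator==(const generation_id_sketch& o) const { return id == o.id; }
    bool operator==(const generation_id_sketch&) const = default;
};
```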
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13625
and provide accessor functions to get them.
1. So they can't be modified by mistake, as the versioned value is
immutable. A new value must have a higher version.
2. In preparation for making the version a strong gms::version_type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data
structures that are used when coordinating writes to CDC log tables.
`cdc::generation_service::make_new_cdc_generation` currently creates a new
CDC generation and inserts it into the `CDC_GENERATIONS_V2` table. For
Raft-based topology changes we'll do the data insertion somewhere else - in
the topology coordinator code. So extract the parts that calculate the CDC
generation into free-standing functions (these are almost pure calculations,
modulo accessing the RNG).
The function used to generate a mutation timestamp by itself; take it as
a parameter instead. We'll use timestamps provided by Group 0 APIs when
creating CDC generations during Group 0-based topology changes.
The function used to obtain the sharding info for a given node (its
number of shards and ignore_msb_bits) was using gossiper application
states.
We want to reuse `topology_description_generator` to build CDC
generations when doing Raft Group 0-based topology changes, so make
`get_sharding_info` a parameter.
It was a `static` function inside system_distributed_keyspace. Later it
will be used for another table living in system_keyspace, so move it
outside, to the CDC generations module, and make it accessible from
other places.
the default-generated operator<=> is exactly the same as the
handcrafted one. so let the compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
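roughly (a sketch, not the struct in this change):
```cpp
// sketch: a defaulted operator<=> replaces the handcrafted comparison, and
// it also implicitly gives us a defaulted operator==, so both can go.
#include <compare>

struct version_sketch {
    int major_;
    int minor_;

    // before: hand-written operator< and operator== over (major_, minor_)
    auto operator<=>(const version_sketch&) const = default;
};
```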
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes #12858
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
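The idea, sketched with standalone stand-ins (the real utils::tagged_uuid differs in detail):
```cpp
// Sketch: identical underlying UUID storage, but distinct, non-convertible
// types per tag, so a table_id cannot be passed where a
// table_schema_version is expected.
#include <compare>
#include <cstdint>

struct uuid_sketch {
    uint64_t msb = 0, lsb = 0;
    auto operator<=>(const uuid_sketch&) const = default;
};

template <typename Tag>
struct tagged_uuid_sketch {
    uuid_sketch id;
    auto operator<=>(const tagged_uuid_sketch&) const = default;
};

struct table_id_tag {};
struct table_schema_version_tag {};

using table_id_sketch = tagged_uuid_sketch<table_id_tag>;
using table_schema_version_sketch = tagged_uuid_sketch<table_schema_version_tag>;

// The two aliases are unrelated types: mixing them up is a compile error.
```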
Fixes #11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When update_streams_description() fails, it spawns a fiber and retries
the update in the background once every 60s. If the sleep between
attempts is aborted, the resulting exceptional future ends up being
ignored, which is warned about in the logs.
fixes: #11192
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220802132148.20688-1-xemul@scylladb.com>
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.
To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).
References throughout the codebase are adjusted.
The users of get_/set_bootstrap_state and aux helpers are CDC and
the storage service. Both have local system_keyspace references and can
just use them. This removes some users of the global system ks cache
and the qctx thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service uses system keyspace to, e.g., manage the generation id,
thus it depends on the system_keyspace instance and deserves the
explicit reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
If the number of streams exceeds the number of token ranges
it indicates that some spurious streams from decommissioned
nodes are present.
In such a situation - simply regenerate.
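Conceptually (a sketch; the real check operates on the streams description and the current set of token ranges):
```cpp
// Sketch: more streams than token ranges means some streams are left over
// from decommissioned nodes, so the description should be regenerated.
#include <cstddef>

bool should_regenerate_streams(std::size_t num_streams, std::size_t num_token_ranges) {
    return num_streams > num_token_ranges;
}
```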
Fixes #9772
Closes #9780
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.