"
The series fixes a bogus assertion during topology state load and adds a
test that runs rebuild to make sure the code will not regress again.
Fixes#14958
"
* 'gleb/rebuilding_fix_v1' of github.com:scylladb/scylla-dev:
test: add rebuild test
system_keyspace: fix assertion for missing transition_state
On boot several manipulations with system.local are performed:
1. The host_id value is selected from it with key = local.
   If not found, system_keyspace generates a new host_id, inserts the
   new value into the table and returns it.
2. The cluster_name is selected from it with key = local.
   Then system_keyspace either checks that the name matches the one
   from db::config, or inserts the db::config value into the table.
3. The row with key = local is unconditionally updated with various
   info like versions, listen, rpc and bcast addresses, dc, rack, etc.
All three steps are scattered over main: step 1 is called directly, while
steps 2 and 3 are executed via system_keyspace::setup(), which happens
rather late. The cql_test_env startup code also touches this table.
The proposal is to collect this setup into one place and execute it
early -- as soon as the system.local table is populated. This frees the
system_keyspace code from the logic of selecting the host id and cluster
name, leaving that logic to main and keeping only the select/insert work
in system_keyspace.
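A rough sketch of the consolidated flow (all names below are illustrative stand-ins, not the actual ScyllaDB functions):
```
#include <optional>
#include <stdexcept>
#include <string>

// Stand-in for the system.local row with key = local.
struct local_row {
    std::optional<std::string> host_id;
    std::optional<std::string> cluster_name;
};

// Step 1: select host_id, generate and persist one if missing.
std::string load_or_make_host_id(local_row& row) {
    if (!row.host_id) {
        row.host_id = "freshly-generated-host-id";
    }
    return *row.host_id;
}

// Step 2: check the stored cluster_name against the configured one, or persist it.
void check_or_save_cluster_name(local_row& row, const std::string& configured) {
    if (!row.cluster_name) {
        row.cluster_name = configured;
    } else if (*row.cluster_name != configured) {
        throw std::runtime_error("cluster name mismatch");
    }
}

// Step 3: unconditionally refresh versions, addresses, dc/rack, etc.
void update_local_info(local_row&) { /* ... */ }

// The proposal: run all three steps together, right after system.local is populated.
void setup_system_local(local_row& row, const std::string& configured_name) {
    load_or_make_host_id(row);
    check_or_save_cluster_name(row, configured_name);
    update_local_info(row);
}

int main() {
    local_row row;
    setup_system_local(row, "test-cluster");
    return row.host_id && row.cluster_name ? 0 : 1;
}
```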
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#15082
The code assumes that if there is no transition_state there should be no
nodes currently in transition in a state other than left_token_ring,
but the rebuild operation also creates such nodes, so add a check for
that case as well.
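A hypothetical, simplified sketch of the relaxed check (the names are made up; the real check happens while loading the topology state):
```
#include <cassert>

// Simplified stand-in for the per-node state stored in system.topology.
enum class node_state { normal, decommissioning, rebuilding, left_token_ring };

bool in_transition(node_state s) {
    return s == node_state::decommissioning
        || s == node_state::rebuilding
        || s == node_state::left_token_ring;
}

// With no transition_state loaded, a node in transition is only acceptable in
// left_token_ring or (the case this series adds) rebuilding.
bool acceptable_without_transition_state(node_state s) {
    return !in_transition(s)
        || s == node_state::left_token_ring
        || s == node_state::rebuilding;
}

int main() {
    assert(acceptable_without_transition_state(node_state::rebuilding));
    assert(!acceptable_without_transition_state(node_state::decommissioning));
    return 0;
}
```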
In force_blocking_flush() there's an invoke-on-all invocation of
replica::database::flush() and a FIXME to get the replica database from
somewhere other than via query-processor -> data_dictionary.
Now that force_blocking_flush() is non-static, the invoke-on-all can
happen via system_keyspace's container and the database can be obtained
directly from the local sys.ks. instance.
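As a rough standalone Seastar sketch of the pattern (fake_keyspace and flush_local() are invented names; the real code flushes replica::database):
```
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>
#include <fmt/core.h>

// Invented stand-in for the sharded db::system_keyspace service.
class fake_keyspace : public seastar::peering_sharded_service<fake_keyspace> {
public:
    // Per-shard work; the real code calls replica::database::flush() here.
    seastar::future<> flush_local() {
        fmt::print("flushing on shard {}\n", seastar::this_shard_id());
        return seastar::make_ready_future<>();
    }

    // Non-static member: reach every shard through the owning container()
    // instead of smp::invoke_on_all() plus a global context.
    seastar::future<> force_blocking_flush() {
        return container().invoke_on_all([] (fake_keyspace& ks) {
            return ks.flush_local();
        });
    }
};

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] () -> seastar::future<> {
        seastar::sharded<fake_keyspace> ks;
        co_await ks.start();
        co_await ks.local().force_blocking_flush();
        co_await ks.stop();
    });
}
```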
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
`enabled_features` and `supported_features` are now moved to a new
`topology::features` struct. This will allow loading this information
independently from the `topology` struct, which will be needed for
feature checking during startup.
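Roughly, with std containers standing in for the real types:
```
#include <map>
#include <set>
#include <string>

// Simplified sketch of the split; the real struct lives in topology_state_machine.
struct topology_features {
    // static column: features considered enabled by the whole cluster
    std::set<std::string> enabled_features;
    // per-node clustering column: features each node advertises as supported
    std::map<std::string, std::set<std::string>> supported_features;
};

struct topology {
    topology_features features;  // can now be loaded on its own, before the rest
    // ... node states, transition state, requests, ...
};

int main() {
    topology t;
    t.features.enabled_features.insert("example_feature");
    return t.features.enabled_features.count("example_feature") == 1 ? 0 : 1;
}
```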
Now those methods are non-static and can use this object's reference to
the query processor instead of the global qctx.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_scylla_local_param_as() wants to flush replica::database on all
shards. For that it uses smp::invoke_on_all() and qctx, but since the
method is now a non-static member of system_keyspace it can use
container().invoke_on_all() and this->_db (on the target shard).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These are currently two .cc-local templatized helpers, but they are only
called by non-static system_keyspace:: methods, so they can become such
methods as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The callers are raft_group0_client, which has a sys.ks. dependency
reference, and group0_state_machine, with raft_group0_client exporting
its sys.ks. This makes it possible to immediately drop one more qctx
reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is raft_group0_client with a sys.ks. dependency reference.
This allows dropping one qctx reference right away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The db.get_version() called that early returns the value the database got
at construction time, i.e. the empty_version. It makes little sense to
commit it into the system keyspace, all the more so since the "real"
version is calculated and updated a few steps after .setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14833
Fortunately, this is pretty simple -- the only caller is storage_service,
which has a sharded<system_keyspace> dependency reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14824
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes#14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Population of the messaging service preferred IPs cache happens inside
the system keyspace setup() call, and it needs the m.s. per se and
additionally the snitch. Moving preferred IP cache population to initial
configuration keeps the m.s. start more self-contained and keeps
system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.
Closes#14758
* github.com:scylladb/scylladb:
db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
db/system_keyspace: Make paxos methods non-static
service/paxos: Add db::system_keyspace& argument to some methods
test: Optionally initialize proxy remote for cql_test_env
proxy/remote: Keep sharded<db::system_keyspace>& dependency
This template call is only used by system keyspace paxos methods. All
those methods are no longer static and can use the system_keyspace::_qp
reference to the real query processor instead of the global qctx. The
execute_cql_with_timeout() wrapper is moved to system_keyspace to make
this work.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this change is a follow-up of 3129ae3c8c.
since in both cases in this change, `num_ranges` should always be
greater than zero, there is no need to use `int` for its type,
and the "num_ranges" returned by the CQL query should always be greater
than or equal to zero, so there is no need to check if it is positive.
in this change, we
* change the type of `num_ranges` to `size_t`
* change std::cmp_not_equal() to !=
to avoid using the verbose `std::cmp_not_equal()` helper, for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14754
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type --
so they can be compared. but sometimes this matters: after the
promotion, the comparison yields the wrong result. this can be
demonstrated with a short sample like:
```
#include <fmt/core.h>

int main() {
    int x = -1;
    unsigned y = 2;
    // -1 is promoted to a huge unsigned value, so this prints "false"
    fmt::print("{}\n", x < y);
    return 0;
}
```
this error can be identified by `-Werror=sign-compare`, but until we
enable that compile option, let's use `std::cmp_*()` to compare them.
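for reference, a minimal standalone example (not part of the patch) showing the `std::cmp_*()` helpers giving the mathematically correct answer:
```
#include <fmt/core.h>
#include <utility>

int main() {
    int x = -1;
    unsigned y = 2;
    // std::cmp_less() (C++20, <utility>) compares the values themselves,
    // so this prints "true", while the promoted `x < y` prints "false".
    fmt::print("{}\n", std::cmp_less(x, y));
    return 0;
}
```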
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The definitions of virtual tables make up approximately a quarter of the
huge system_keyspace.cc file (almost 4K lines), pulling in a lot of
headers only used by them.
Move them to a separate source file to make system_keyspace.cc easier
for humans and compilers to digest.
This patch also moves the `register_virtual_tables()`,
`install_virtual_readers()` as well as the `virtual_tables` global.
Closes#14308
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.
This works not only for nodes that restart, but also for nodes that were running behind a network partition when we removed them. Even when the partition resolves, the existing nodes will effectively firewall off that node.
Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.
Closes#13850
* github.com:scylladb/scylladb:
test: pylib: increase checking period for `get_alive_endpoints`
test: add node banning test
test: pylib: manager_client: `get_cql()` helper
test: pylib: ScyllaCluster: server pause/unpause API
raft topology: ban left nodes
raft topology: skip `left_token_ring` state during `removenode`
raft topology: prepare decommission path for node banning
raft topology: introduce `left_token_ring` state
raft topology: `raft_topology_cmd` implicit constructor
messaging_service: implement host banning
messaging_service: exchange host IDs and map them to connections
messaging_service: store the node's host ID
messaging_service: don't use parameter defaults in constructor
main: move messaging_service init after system_keyspace init
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes#10459
Querying from virtual table system.token_ring fails if there is a
tablet-based table, due to an attempt to obtain a per-keyspace erm.
Fix by not showing such keyspaces.
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.
Two new columns are added to `system.topology`:
- `supported_features set<text>` is a new clustering column which holds the features that a given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation, the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.
These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.
During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.
Closes#14232
* github.com:scylladb/scylladb:
storage_service: update supported cluster features in group0 on start
storage_service: add methods for features to topology mutation builder
storage_service: use explicit ::set overload instead of a template
storage_service: reimplement mutation builder setters
storage_service: introduce topology_mutation_builder_base
topology_state_machine: include information about features
system_keyspace: introduce deserialize_set_column
db/system_keyspace: add storage for cluster features managed in group 0
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, therefore they can be
abstracted away.
This commit adds `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type` which can be
then passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
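The shape of the helper, as a simplified standalone sketch (std types stand in for `set_type_impl::native_type` and `dht::token`; the real helper works on a query result column):
```
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for the raw deserialized set column (e.g. a set<text> of tokens).
using native_set = std::vector<std::string>;

// One place that turns the raw set into a decoded unordered set, instead of
// repeating the same loop in three call sites.
template <typename Decoded, typename Decode>
std::unordered_set<Decoded> deserialize_set_column(const native_set& raw, Decode decode) {
    std::unordered_set<Decoded> out;
    out.reserve(raw.size());
    for (const auto& element : raw) {
        out.insert(decode(element));   // e.g. decode_tokens() for dht::token
    }
    return out;
}

int main() {
    native_set raw{"-42", "0", "42"};
    auto tokens = deserialize_set_column<long long>(raw, [] (const std::string& s) {
        return std::stoll(s);
    });
    return tokens.size() == 3 ? 0 : 1;
}
```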
We want the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that node after it has already shut down,
leading to unnecessary failures on the data path (e.g. for CL=ALL writes).
Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.
After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds do
we allow the node to shut down.
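In rough pseudo-C++ (the state names follow the commit; `global_barrier()` is a stand-in for the cluster-wide read barrier):
```
enum class node_state { decommissioning, left_token_ring, left };

// Hypothetical stand-in for the barrier that makes every node observe the
// latest topology before we proceed.
bool global_barrier() { return true; }

node_state finish_decommission() {
    // Tokens are gone and the transition state is cleared, but the node is
    // still part of the topology and keeps serving until everyone has seen it.
    node_state s = node_state::left_token_ring;
    if (global_barrier()) {
        // Only now is it safe for the decommissioned node to shut down:
        // no coordinator will route writes to it anymore.
        s = node_state::left;
    }
    return s;
}

int main() {
    return finish_decommission() == node_state::left ? 0 : 1;
}
```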
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.
Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.
Refs: #14283
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.
This will allow us to move the `make` step right after starting
`system_keyspace`.
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).
Call it from `main`/`cql_test_env` instead.
Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.
None of these steps apply to virtual tables.
There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.
Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.
This will allow, in a later commit, making `system_keyspace`
initialization independent of services used for distributed
communication such as `gossiper`. See further commits for details.
`system_keyspace::make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directory for `system` and
`system_schema`).
Extract this part into `database::create_local_system_table`.
Make `database::add_column_family` private.
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on were also already started.
Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
The `system.topology` table is extended with two new columns that will
be used to manage cluster features:
- `supported_features set<text>` is a new clustering column which holds
the features that a given node advertises as supported. It will be first
initialized when the node joins the cluster, and then updated every
time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the
features that are considered enabled by the cluster. Unlike in the
current gossip-based implementation the features will not be enabled
implicitly when all nodes support a feature, but rather via an
explicit action of the topology coordinator.
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).
* the topology version (int64) was introduced; it's stored in the topology table and updated through RAFT at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster; if some node is unavailable we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; once this round of RPCs has finished, the topology change coordinator can be sure that there are no in-flight requests using previous versions and that such requests can't appear in the future.
* after `barrier_and_drain` the topology change coordinator issues the `fence` command, which stores the current version in a local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with an older version was started before the fence, its reply will also be fenced (see the sketch after this list).
* the fencing part of the PR is for the future, when we relax the requirement that all nodes be available during a topology change; it should protect the cluster from stale-topology requests coming from nodes which were unavailable during the topology change and which were not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for the `mutation` and `read` RPCs, other RPCs will be handled in follow-ups; since currently all nodes are supposed to be alive, the missing parts of the fencing don't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling; these will also be added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change the topology table schema or the `raft_topology_cmd_handler` RPC method signature, since the raft topology code is currently hidden behind the `--experimental raft` flag and is not accessible to users. Compatibility is maintained for the other affected RPCs (mutation, read) -- the new `fencing_token` parameter is `rpc::optional`, and we skip the fencing check if it's not present.
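A simplified, standalone sketch of the fencing check on the replica side (the types are stand-ins for the real storage_proxy fencing code):
```
#include <cstdint>
#include <optional>
#include <stdexcept>

struct fencing_token {
    int64_t topology_version;
};

struct stale_topology_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct shared_topology_state {
    int64_t version = 1;        // current topology version (0 = not yet supported)
    int64_t fence_version = 0;  // requests older than this are rejected
};

// Called when a mutation/read RPC arrives (and again before sending the reply,
// so a request started before the fence gets its reply fenced as well).
void check_fence(const shared_topology_state& st, std::optional<fencing_token> token) {
    if (!token) {
        return;  // rpc::optional: an old sender didn't pass a token, skip the check
    }
    if (token->topology_version < st.fence_version) {
        throw stale_topology_exception("request is based on a stale topology version");
    }
}

int main() {
    shared_topology_state st{.version = 3, .fence_version = 3};
    try {
        check_fence(st, fencing_token{2});   // stale sender
        return 1;
    } catch (const stale_topology_exception&) {
        return 0;                            // correctly fenced
    }
}
```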
Closes#13884
* github.com:scylladb/scylladb:
storage_service: warn if can't find ip for server
storage_proxy.cc: add and use global_token_metadata_barrier
storage_service: exec_global_command: bool result -> exceptions
raft_topology: add cmd_index to raft commands
storage_proxy.cc: add fencing to read RPCs
storage_proxy.cc: extract handle_read
storage_proxy.cc: refactor encode_replica_exception_for_rpc
storage_proxy: fix indentation
storage_proxy: add fencing for mutation
storage_servie: fix indentation
storage_proxy: add fencing_token and related infrastructure
raft topology: add fence_version
raft_topology: add barrier_and_drain cmd
token_metadata: add topology version
It's stored outside of the topology table, since it's updated not
through RAFT, but with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main during storage_service
initialisation.
It's stored as a static column in the topology table and will be updated
at various steps of the topology change state machine.
The initial value is 1; zero means that topology versions are not yet
supported, which will be used in RPC handling.
this change generalizes the value of generation_type so it also
supports UUID-based identifiers.
* sstables/generation_type.h:
- add a formatter and a parser for UUID. please note, Cassandra uses
a different format for the stringified SSTable identifier, and that
formatter suits our needs: it uses underscore "_" as the delimiter,
while the file names of components use dash "-" as the delimiter.
instead of reinventing the formatting or just using another
delimiter in the stringified UUID, we choose to use Cassandra's
formatting.
- add accessors for accessing the type and value of generation_type
- add constructors for constructing generation_type from a UUID and
from a string.
- use a hash for placing sstables with uuid identifiers into shards,
for a more uniform distribution of sstables across shards (see the
sketch after this list).
* replica/table.cc:
- only update the generator if the given generation contains an
integer
* test/boost:
- add a simple test to verify the generation_type is able to
parse and format
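a rough standalone sketch of the shape of the change (fake_uuid and the hash mixing are illustrative, not the real implementation):
```
#include <cstdint>
#include <functional>
#include <variant>

struct fake_uuid {
    uint64_t msb = 0, lsb = 0;
};

// a generation is now either the legacy integer or a uuid-based identifier.
struct generation_type {
    std::variant<int64_t, fake_uuid> value;

    bool is_int() const { return std::holds_alternative<int64_t>(value); }

    // hash used when placing sstables into shards, so uuid-based generations
    // still distribute uniformly.
    size_t shard_hash() const {
        if (is_int()) {
            return static_cast<size_t>(std::get<int64_t>(value));
        }
        const auto& id = std::get<fake_uuid>(value);
        return std::hash<uint64_t>{}(id.msb) ^ (std::hash<uint64_t>{}(id.lsb) << 1);
    }
};

int main() {
    generation_type legacy{int64_t{42}};
    generation_type fresh{fake_uuid{0x1234, 0x5678}};
    // as in replica/table.cc: only integer generations bump the legacy generator.
    return legacy.is_int() && !fresh.is_int() ? 0 : 1;
}
```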
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.
It also works when you change the number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.
This PR implements the API on top of raft topology; it was originally
implemented using the gossiper. It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.
Closes#13683
* github.com:scylladb/scylladb:
docs: update topology-over-raft.md
test: topology_experimental_raft: test `check_and_repair_cdc` API
raft topology: implement `check_and_repair_cdc_streams` API
raft topology: implement global request handling
raft topology: introduce `prepare_new_cdc_generation_data`
raft_topology: `get_node_to_work_on_opt`: return guard if no node found
raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
raft topology: separate `publish_cdc_generation` state
raft topology: non-node-specific `exec_global_command`
raft topology: introduce `start_operation()`
raft topology: non-node-specific `topology_mutation_builder`
topology_state_machine: introduce `global_topology_request`
topology_state_machine: use `uint16_t` for `enum_class`es
raft topology: make `new_cdc_generation_data_uuid` topology-global