Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.
This will allow us to move the `make` step right after starting
`system_keyspace`.
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).
Call it from `main`/`cql_test_env` instead.
Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.
None of these steps apply to virtual tables.
There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.
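To illustrate the point about table-specific write handling, here is a minimal, self-contained C++ sketch (hypothetical names, not Scylla's actual virtual-table interface): a base class leaves `apply` up to each table, and a config-backed table implements it by only mutating in-memory state.

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct mutation { std::string key; std::string value; };

class virtual_table {
public:
    virtual ~virtual_table() = default;
    // Write handling is table-specific; by default a virtual table is read-only.
    virtual void apply(const mutation&) {
        throw std::runtime_error("read-only virtual table");
    }
};

class config_table : public virtual_table {
    std::map<std::string, std::string> _live_config; // in-memory only, no sstables
public:
    void apply(const mutation& m) override {
        _live_config[m.key] = m.value; // analogous to db_config_table updating live config
    }
};
```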
Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.
This will allow us, in a later commit, to make `system_keyspace`
initialization independent of services used for distributed
communication, such as `gossiper`. See further commits for details.
`system_keyspace::make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directories for `system` and
`system_schema`).
Extract this part into `database::create_local_system_table`.
Make `database::add_column_family` private.
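A rough sketch of the resulting shape, under assumed names (not the real ScyllaDB classes): directory creation and the in-memory bookkeeping sit behind one public entry point, while `add_column_family` becomes an implementation detail.

```cpp
#include <filesystem>
#include <string>

struct schema { std::string ks_name; std::string cf_name; };

class database {
public:
    void create_local_system_table(const schema& s, const std::filesystem::path& data_dir) {
        // create the on-disk directory for the table
        std::filesystem::create_directories(data_dir / s.ks_name / s.cf_name);
        add_column_family(s); // set up the in-memory keyspace/table structures
    }
private:
    void add_column_family(const schema&) { /* register the table in memory */ }
};
```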
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on have already been started.
Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
This reverts commit 7dadd38161.
The latest revert cited debuggability trumping performance, but the
performance loss is so huge here that debug builds are unusable and
next promotions time out.
In the interest of progress, pick the lesser of two evils.
The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes and then restarts. Apparently, the python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes managed to start successfully.
More context can be found on the python driver issue.
This PR works around this issue by using the existing `reconnect_driver` function (which is a workaround for a _different_ python driver issue already) to help the driver reconnect after the full cluster restart.
Refs: scylladb/python-driver#230
Closes #14276
* github.com:scylladb/scylladb:
tests/topology: work around python driver issue in cluster feature tests
test/topology{_raft_disabled}: move reconnect_driver to topology utils
In https://github.com/scylladb/scylladb/pull/14231 we split `storage_proxy` initialization into two phases: for local and remote parts. Here we do the same with `query_processor`. This allows performing queries for local tables early in the Scylla startup procedure, before we initialize services used for cluster communication such as `messaging_service` or `gossiper`.
Fixes: #14202
As a follow-up we will simplify `system_keyspace` initialization, making it available earlier as well.
Closes #14256
* github.com:scylladb/scylladb:
main, cql_test_env: start `query_processor` early
cql3: query_processor: split `remote` initialization step
cql3: query_processor: move `migration_manager&`, `forwarder&`, `group0_client&` to a `remote` object
cql3: query_processor: make `forwarder()` private
cql3: query_processor: make `get_group0_client()` private
cql3: strongly_consistent_modification_statement: fix indentation
cql3: query_processor: make `get_migration_manager` private
tracing: remove `qp.get_migration_manager()` calls
table_helper: remove `qp.get_migration_manager()` calls
thrift: handler: move implementation of `execute_schema_command` to `query_processor`
data_dictionary: add `get_version`
cql3: statements: schema_altering_statement: move `execute0` to `query_processor`
cql3: statements: pass `migration_manager&` explicitly to `prepare_schema_mutations`
main: add missing `supervisor::notify` message
The `topology_coordinator::cleanup_group0_config_if_needed` function first checks whether the number of group 0 members is larger than the number of non-left entries in the topology table, then attempts to remove nodes in left state from group 0 and prints a warning if no such nodes are found. There are some problems with this check:
- Currently, a node is added to group 0 before it inserts its entry to the topology table. Such a node may cause the check to succeed but no nodes will be removed, which will cause the warning to be printed needlessly.
- Cluster features on raft will reverse the situation and it will be possible for an entry in system.topology to exist without the corresponding node being a part of group 0. This, in turn, may cause the check not to pass when it should and nodes could be removed later than necessary.
This commit gets rid of the optimization and the warning, and the topology coordinator will always compute the set of nodes that should be removed. Additionally, the set of nodes to remove is now computed differently: instead of iterating over left nodes and including only those that are in group 0, we now iterate over group 0 members and include those that are in `left` state. As the number of left nodes can potentially grow unbounded and the number of group 0 members is more likely to be bounded, this should give better performance in long-running clusters.
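A self-contained sketch of the new computation (illustrative only, not Scylla's actual types): iterate over group 0 members, whose count stays bounded, and pick the ones whose topology state is `left`.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

enum class node_state { normal, left };

std::vector<std::string> members_to_remove(
        const std::unordered_set<std::string>& group0_members,
        const std::unordered_map<std::string, node_state>& topology) {
    std::vector<std::string> to_remove;
    for (const auto& id : group0_members) { // bounded by the group 0 size
        auto it = topology.find(id);
        if (it != topology.end() && it->second == node_state::left) {
            to_remove.push_back(id);
        }
    }
    return to_remove;
}
```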
Closes #14238
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous commit
storage_service: remove optimization in cleanup_group0_config_if_needed
The test `test_downgrade_after_successful_upgrade_fails` stops all
nodes, reconfigures them to support the test-only feature and restarts
them. Unfortunately, it looks like the python driver sometimes does not
handle this properly and might not reconnect after all nodes are shut
down.
This commit adds a workaround for scylladb/python-driver#230 - the test
re-creates the python driver session right after the nodes are restarted.
The `reconnect_driver` function will be useful outside the
`topology_raft_disabled` test suite - namely, for cluster feature tests
in `topology`. The best course of action for this function would be to
put it into pylib utils; however, the function depends on ManagerClient,
which is defined in `test.pylib.manager_client`, which in turn depends on
`test.pylib.utils` - therefore we cannot put it there, as it would cause
an import cycle. The `topology.utils` module sounds like the next best
thing.
In addition, the docstring comment is updated to reflect that this
function will now be used to work around another issue as well.
Pass `migration_manager&`, `forward_service&` and `raft_group0_client&`
in the remote init step which happens after the constructor.
Add a corresponding uninit remote step.
Make sure that any use of the `remote` services is finished before we
destroy the `remote` object by using a gate.
Thanks to this, in a later commit we'll be able to move the construction
of `query_processor` earlier in the Scylla initialization procedure.
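The gate idea can be illustrated with a minimal stand-alone analogue (seastar has its own gate type; this sketch only shows the pattern): every use of the remote services enters the gate, and the uninit step closes it and waits for in-flight users before the remote object is destroyed.

```cpp
#include <condition_variable>
#include <mutex>
#include <stdexcept>

class gate {
    std::mutex _m;
    std::condition_variable _cv;
    int _in_flight = 0;
    bool _closed = false;
public:
    void enter() { // called before each use of the remote services
        std::lock_guard<std::mutex> l(_m);
        if (_closed) { throw std::runtime_error("gate closed"); }
        ++_in_flight;
    }
    void leave() { // called when the use is finished
        std::lock_guard<std::mutex> l(_m);
        --_in_flight;
        _cv.notify_all();
    }
    void close_and_wait() { // called from the uninit step, before destroying `remote`
        std::unique_lock<std::mutex> l(_m);
        _closed = true;
        _cv.wait(l, [this] { return _in_flight == 0; });
    }
};
```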
These services are used for performing distributed queries, which
require remote calls. As a preparation for 2-phase initialization of
`query_processor` (for local queries vs for distributed queries), move
them to a separate `remote` object which will be constructed in the
second phase.
Replace the getters for the different services with a single `remote()`
getter. Once we split the initialization into two phases, `remote()`
will include a safety protection.
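A hedged sketch of that shape (names approximate, not the actual `query_processor` code): the remote services are bundled into one object that is only populated in the second phase, and `remote()` refuses to hand it out before then.

```cpp
#include <memory>
#include <stdexcept>

struct migration_manager {};
struct forward_service {};
struct raft_group0_client {};

class query_processor {
    struct remote_services {
        migration_manager& mm;
        forward_service& fwd;
        raft_group0_client& group0;
    };
    std::unique_ptr<remote_services> _remote; // set only in the second init phase
public:
    void start_remote(migration_manager& mm, forward_service& fwd, raft_group0_client& g0) {
        _remote.reset(new remote_services{mm, fwd, g0});
    }
    remote_services& remote() {
        if (!_remote) { // the safety protection: fail loudly before phase two
            throw std::runtime_error("query_processor: remote services not initialized yet");
        }
        return *_remote;
    }
};
```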
After previous commits it's no longer used outside `query_processor`.
Also remove the `const` version - not needed for anything.
Use the getter instead of directly accessing `_mm` in `query_processor`
methods. Later we will put `_mm` in a separate object.
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).
* the topology version (int64) was introduced; it's stored in the topology table and updated through RAFT at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster; if some node is unavailable, we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; once this round of RPCs is finished, the topology change coordinator can be sure that there are no in-flight requests using previous versions and that such requests can't appear in the future.
* after `barrier_and_drain` the topology change coordinator issues the `fence` command, which stores the current version in a local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with an older version was started before the fence, its reply will also be fenced (see the sketch after this list).
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from nodes which were unavailable during the topology change and which were not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for `mutation` and `read` RPCs; other RPCs will be handled in follow-ups. since currently all nodes are supposed to be alive, the missing parts of the fencing don't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling; these will also be added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present.
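As a rough, self-contained illustration of the fencing check mentioned above (the real logic lives in storage_proxy and uses Scylla's own types), a replica rejects a request whose token carries a version older than its local fence:

```cpp
#include <cstdint>
#include <stdexcept>

struct fencing_token { int64_t topology_version; };

struct stale_topology_exception : std::runtime_error {
    stale_topology_exception()
        : std::runtime_error("request carries a stale topology version") {}
};

// Replica-side check: reject requests older than the locally stored fence version.
void apply_fence(const fencing_token& token, int64_t local_fence_version) {
    if (token.topology_version < local_fence_version) {
        throw stale_topology_exception();
    }
}
```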
Closes #13884
* github.com:scylladb/scylladb:
storage_service: warn if can't find ip for server
storage_proxy.cc: add and use global_token_metadata_barrier
storage_service: exec_global_command: bool result -> exceptions
raft_topology: add cmd_index to raft commands
storage_proxy.cc: add fencing to read RPCs
storage_proxy.cc: extract handle_read
storage_proxy.cc: refactor encode_replica_exception_for_rpc
storage_proxy: fix indentation
storage_proxy: add fencing for mutation
storage_service: fix indentation
storage_proxy: add fencing_token and related infrastructure
raft topology: add fence_version
raft_topology: add barrier_and_drain cmd
token_metadata: add topology version
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what a sched group is can be relaxed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14266
memory_footprint_test fails with:
`sstable - writing sstables with too old format`
because it attempts to write obsolete sstable formats,
for which the writer code has long been removed.
Fix that.
Closes #14265
repair_node_state::state is only for debugging purposes, see
ab57cea783 which introduced it.
so this change does not impact the behavior of scylla, but can
improve the debugging experience by reflecting a more accurate state
of repair when we are actually inspecting it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14255
CQL statements carry expressions in many contexts: the SELECT, WHERE, SET, and IF clauses, plus various attributes. Previously, each of these contexts had its own representation for an expression, and another one for the same expression but before preparation. We have been gradually moving towards a uniform representation of expressions.
This series tackles SELECT clause elements (selectors), in their unprepared phase. It's relatively simple since there are only five types of expression components (column references, writetime/ttl modifiers, function calls, casts, and field selections). Nevertheless, there isn't much commonality with previously converted expression elements so quite a lot of code is involved.
After the series, we are still left with a custom post-prepare representation of expressions. It's quite complicated since it deals with two passes, for aggregation, so it will be left for another series.
Closes #14219
* github.com:scylladb/scylladb:
cql3: selector: drop inheritance from assignment_testable
cql3: selection: rely on prepared expressions
cql3: selection: prepare selector expressions
cql3: expr: match counter arguments to function parameters expecting bigint
cql3: expr: avoid function constant-folding if a thread is needed
cql3: add optional type annotation to assignment_testable
cql3: expr: wire unresolved_identifier to test_assignment()
cql3: expr: support preparing column_mutation_attribute
cql3: expr: support preparing SQL-style casts
cql3: expr: support preparing field_selection expressions
cql3: expr: make the two styles of cast expressions explicit
cql3: error injection functions: mark enabled_injections() as impure
cql3: eliminate dynamic_cast<selector> from functions::get()
cql3: test_assignment: pass optional schema everywhere
cql3: expr: prepare_expr(): allow aggregate functions
cql3: add checks for aggregation functions after prepare
cql3: expr: add verify_no_aggregate_functions() helper
test: add regression test for rejection of aggregates in the WHERE clause
cql3: expr: extract column_mutation_attribute_type
cql3: expr: add fmt formatter for column_mutation_attribute_kind
cql3: statements: select_statement: reuse to_selectable() computation in SELECT JSON
currently, even though we are moving from Java-8 to
Java-11, we still support both Java versions. and the
docker image used for testing the DataStax driver has
not been updated to install java-11.
the "java" executable provided by openjdk-java-8 does
not support the "--version" command line argument. java-11
accepts both "-version" and "--version". so, to cater to
the needs of the outdated docker image, we pass
"-version" to the selected java, so the test passes
even if java-8 is found. a better fix would be to update the
docker image to install java-11, though.
the output of "java -version" and "java --version" is
attached here as a reference:
```console
$ /usr/lib/jvm/java-1.8.0/bin/java --version
Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
```console
$ /usr/lib/jvm/java-1.8.0/bin/java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
```
```console
$ /usr/lib/jvm/jre-11/bin/java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
```console
$ /usr/lib/jvm/jre-11/bin/java -version
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
Fixes #14253
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14254
this series adds an option named "uuid_sstable_identifiers_enabled", and the related cluster feature bit, which is set once all nodes in this cluster set this option to "true". and the sstable subsystem will start using timeuuid instead of a plain integer for the identifier of sstables. timeuuid should be a better choice for identifiers as we don't need to worry about id conflicts anymore. but we still have quite a few tests using static sstables with integers in their names; these tests are not changed in this series. we will create some tests to exercise the sstable subsystem with this option set.
a very simple inter-op test with Cassandra 4.1.1 was also performed to verify that the generated sstables can be read by Cassandra:
1. start scylla, connect it with cqlsh, run the following commands, and stop it
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
2. enable Cassandra's `uuid_sstable_identifiers_enabled`, start Cassandra 4.1.1, connect it with cqlsh, run the following commands, and stop it
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
3. move away the sstables generated by Cassandra, and replace them with the sstables generated by scylladb:
```console
$ mv ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49 /tmp
$ mv ~/scylla/ks/cf-db47a12009d611eea6b8b179df3a2d5d ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49
```
4. start Cassandra 4.1.1 again, connect it with cqlsh, and run the following commands
```
cqlsh> SELECT * FROM ks.cf;
name | value
------+-------
1 | one
```
Fixes https://github.com/scylladb/scylladb/issues/10459
Closes #13932
* github.com:scylladb/scylladb:
replica,sstable: introduce invalid generation id
sstables, replica: pass uuid_sstable_identifiers to generation generator
gms/feature_service: introduce UUID_SSTABLE_IDENTIFIERS cluster feature
db: config: add uuid_sstable_identifiers_enabled option
sstables, replica: support UUID in generation_type
This allows us to reflect cause-and-effect
relationships in the logs: if some command
failed, we write to the log at the top level
of the topology state machine. The log message
includes the current state of the state
machine and a description of what
exactly went wrong.
Note that in the exec_global_command overload
returning node_to_work_on we don't call retake_node()
if the nested exec_global_command failed.
This is fine, since all the callers
just log/break in this case.
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.
The topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against the topology change coordinator migrating
to another node by the already existing
term check: if the term from the command
doesn't match the current term, we return an error.
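A simplified, stand-alone sketch of this guard (illustrative names only): a command is accepted only if it belongs to the current term and carries an index greater than the last one seen in that term.

```cpp
#include <cstdint>
#include <stdexcept>

struct topology_cmd { uint64_t term; uint64_t cmd_index; };

class cmd_handler {
    uint64_t _current_term = 0;
    uint64_t _last_index_seen = 0;
public:
    void check(const topology_cmd& cmd) {
        if (cmd.term != _current_term) {
            throw std::runtime_error("command from a different term");
        }
        if (cmd.cmd_index <= _last_index_seen) {
            throw std::runtime_error("reordered or duplicate topology command");
        }
        _last_index_seen = cmd.cmd_index;
    }
    void on_new_term(uint64_t term) { _current_term = term; _last_index_seen = 0; }
};
```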
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.
Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when the query coordinator was isolated from
the topology change coordinator and didn't receive
barrier_and_drain.
We are going to add fencing to read RPCs; it would be easier
to do it once for all three of them. This refactoring
enables that, since it allows using
encode_replica_exception_for_rpc for handle_read_digest.
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.
Fencing was also added to mutate_locally calls
on the request coordinator, in case
this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
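The double check can be summarized with a minimal sketch (not the real storage_proxy code): verify the fence before the local write and again after it, so a fence bump that races with the write still fails the request.

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>

void fenced_local_write(int64_t request_version,
                        const std::function<int64_t()>& current_fence_version,
                        const std::function<void()>& do_write) {
    if (request_version < current_fence_version()) {
        throw std::runtime_error("stale topology: rejected before the write");
    }
    do_write();
    if (request_version < current_fence_version()) {
        throw std::runtime_error("stale topology: fence moved during the write");
    }
}
```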
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.
An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed, the
corresponding phased_barrier::operation is
released.
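A rough stand-alone analogue of this bookkeeping (Scylla uses utils::phased_barrier; this only sketches the idea): each token_metadata instance holds an operation for the phase it was created in, and a phase is complete once all of its operations have been released.

```cpp
#include <memory>
#include <utility>

class phased_barrier {
    struct phase { int pending = 0; };
    std::shared_ptr<phase> _current = std::make_shared<phase>();
public:
    class operation { // held by a token_metadata instance, released on destruction
        std::shared_ptr<phase> _p;
    public:
        explicit operation(std::shared_ptr<phase> p) : _p(std::move(p)) { ++_p->pending; }
        operation(operation&& o) noexcept : _p(std::move(o._p)) {}
        operation(const operation&) = delete;
        ~operation() { if (_p) { --_p->pending; } }
    };
    operation start_operation() { return operation(_current); }
    // advancing the barrier would swap in a fresh phase and wait until the
    // previous phase's pending count drops to zero (waiting omitted here).
};
```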
It's stored as a static column in the topology table;
it will be updated at various steps of the topology
change state machine.
The initial value is 1; zero means that topology
versions are not yet supported and will be
used in RPC handling.
the invalid sstable id is the NULL of an sstable identifier. with
this concept, it is a lot simpler to find/track the greatest
generation. the complexity is hidden in generation_type, which
compares a) integer-based identifiers, b) uuid-based identifiers, and
c) the invalid identifier in different ways.
so, in this change:
* the default constructor of generation_type is
now public.
* we don't check for an empty generation anymore when loading
SSTables or enumerating them.
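A stand-alone sketch of the idea (the real sstables::generation_type differs in detail): one value type that can hold "no generation", an integer generation, or a UUID generation, ordered so that the invalid value acts as the natural floor when looking for the greatest generation.

```cpp
#include <array>
#include <cstdint>
#include <variant>

using uuid_bytes = std::array<uint8_t, 16>;

class generation_type {
    // monostate == the invalid ("NULL") generation; it compares lower than any real one.
    std::variant<std::monostate, int64_t, uuid_bytes> _value;
public:
    generation_type() = default;                       // invalid generation
    explicit generation_type(int64_t v) : _value(v) {}
    explicit generation_type(const uuid_bytes& v) : _value(v) {}
    bool is_valid() const { return _value.index() != 0; }
    friend bool operator<(const generation_type& a, const generation_type& b) {
        return a._value < b._value; // variant orders by alternative index, then value
    }
};
```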
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we assumed that the generation is always integer-based.
in order to enable the UUID-based generation identifier if the related
option is set, we have to pass this option down to the generation generator.
because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
the sstable manager is added.
Fixes #10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
UUID_SSTABLE_IDENTIFIERS is a new cluster-wide feature. if it is
enabled, all nodes will generate new sstables with a UUID as their
generation identifier. this feature is configured using the
"uuid_sstable_identifiers_enabled" config option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
unlike in Cassandra 4.1, this option is true by default; it will be used
for enabling the "UUID_SSTABLE_IDENTIFIERS" cluster feature. not wired up yet.
please note, because we are still using sstableloader and
sstabledump based on the 3.x branch, while Cassandra upstream
introduced the uuid sstable identifier in its 4.x branch, these tools
fail to work with sstables with uuid identifiers, so this option
is disabled when performing these tests. we will enable it once
these tools are updated to support the uuid-based sstable identifiers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change generalizes the value of generation_type so it also
supports UUID-based identifiers.
* sstables/generation_type.h:
- add formatter and parser for UUID. please note, Cassandra uses
a different format for formatting the SSTable identifier, and
this formatter suits our needs as it uses underscore "_" as the
delimiter, while the file names of components use dash "-" as the
delimiter. instead of reinventing the formatting or just using
another delimiter in the stringified UUID, we choose to use
Cassandra's formatting.
- add accessors for accessing the type and value of generation_type
- add constructors for constructing generation_type from a UUID and
from a string.
- use a hash for placing sstables with uuid identifiers into shards
for a more uniform distribution of sstables across shards.
* replica/table.cc:
- only update the generator if the given generation contains an
integer
* test/boost:
- add a simple test to verify the generation_type is able to
parse and format
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Push those calls up the call stack, to the `trace_keyspace_helper` module.
Pass `migration_manager` reference around together with
`query_processor` reference.