Thrift support has been deprecated since ScyllaDB 5.2:
> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.
So let's drop it. In this change:
* thrift protocol support is dropped
* all references to thrift support in the documentation are dropped
* the "thrift_version" column in the system.local table is
preserved for backward compatibility: we could load an
existing system.local table which still contains this
column, so we need to keep writing this column as well.
* "/storage_service/rpc_server" is only preserved for
backward compatibility with java-based nodetool.
* `rpc_port` and `start_rpc` options are preserved, but
they are marked as "Unused". so that the new release
of scylladb can consume existing scylla.yaml configurations
which might contain these settings. by making them
deprecated, user will be able get warned, and update
their configurations before we actually remove them
in the next major release.
Fixes #3811
Fixes #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Since ALTER KS requires creating a topology_change Raft command, some
functions need to be extended to handle it. Raft commands are recognized
by their type, so some functions are simply parameterized by type,
i.e. made into templates.
These templates are explicitly instantiated, so that only one instance of
each template exists across the whole code base, avoiding compiling them
in each translation unit.
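A minimal sketch of the pattern, assuming illustrative names rather than the actual ScyllaDB symbols:
```cpp
#include <cstdint>
#include <vector>

struct schema_change { std::vector<uint8_t> mutations; };    // hypothetical stand-ins
struct topology_change { std::vector<uint8_t> mutations; };

// header: declaration only, no template body visible to users
template <typename Command>
void prepare_group0_command(const Command& cmd);

// one .cc file: the definition plus explicit instantiations
template <typename Command>
void prepare_group0_command(const Command& cmd) {
    (void)cmd.mutations;   // ... shared logic depending only on the command type ...
}

// The template body is compiled here once, instead of in every
// translation unit that uses it.
template void prepare_group0_command<schema_change>(const schema_change&);
template void prepare_group0_command<topology_change>(const topology_change&);
```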
Schema propagation is asynchronous, so one node can still see a table
while it has already been dropped on another node. If the nodes then
stream the table's data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on that node as well.
Add a method which waits until all Raft changes are applied and then
checks whether a given table exists.
Add a function which uses the above to determine whether an operation
failed because of a dropped table (e.g. on the remote node, where the
exact exception type is unknown). If so, the exception isn't rethrown.
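A rough sketch of the resulting error handling, assuming hypothetical helper names (wait_for_raft_and_check_table_exists, stream_table_data) rather than the real API:
```cpp
#include <string>

// Hypothetical stand-ins for the real services.
bool wait_for_raft_and_check_table_exists(const std::string& ks, const std::string& cf);
void stream_table_data(const std::string& ks, const std::string& cf);  // may throw

void stream_or_ignore_dropped(const std::string& ks, const std::string& cf) {
    try {
        stream_table_data(ks, cf);
    } catch (...) {
        // The exception may have crossed the wire, so its concrete type
        // (e.g. no_such_column_family) is not reliable. Decide based on
        // the Raft-applied schema instead.
        if (!wait_for_raft_and_check_table_exists(ks, cf)) {
            return;   // table was dropped concurrently -- not an error
        }
        throw;        // table still exists -- a genuine failure, rethrow
    }
}
```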
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, this allows storage_service::on_change
to call update_peer_info once for all changed states.
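A simplified sketch of the interface change (types reduced to standard-library stand-ins; the real code uses gms::application_state, versioned_value and endpoint_state::map_type):
```cpp
#include <map>
#include <string>

enum class application_state { STATUS, DC, RACK, HOST_ID };   // abbreviated
using versioned_value = std::string;                           // stand-in
using changed_states_map = std::map<application_state, versioned_value>;
struct inet_address { std::string addr; };

struct endpoint_state_subscriber {
    // Before: one callback per changed application_state.
    // virtual void on_change(inet_address ep, application_state st, const versioned_value& v) = 0;

    // After: one callback with the whole batch, so a subscriber such as
    // storage_service can call update_peer_info exactly once.
    virtual void on_change(inet_address ep, const changed_states_map& states) = 0;
    virtual ~endpoint_state_subscriber() = default;
};
```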
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
None of the subscribers does anything in before_change.
This is done before changing `on_change` in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In the following commits, we reduce the number of the
migration_manager::announce calls by merging some of them in a way
that logically makes sense. Some of these merges are similar --
we announce a new keyspace and its tables together. However,
we cannot use the current prepare_new_column_family_announcement
there because it assumes that the keyspace has already been created
(when it loads the keyspace from the database). Luckily, this
assumption is not necessary as this function only needs
keyspace_metadata. Instead of loading it from the database, we can
pass it as a parameter.
This reverts commit 4b80130b0b, reversing
changes made to a5519c7c1f. It's suspected
of causing dtest failures due to a bug in coroutine::parallel_for_each.
In the following commits, we reduce the number of the
migration_manager::announce calls by merging some of them in a way
that logically makes sense. Some of these merges are similar --
we announce a new keyspace and its tables together. However,
we cannot use the current prepare_new_column_family_announcement
there because it assumes that the keyspace has already been created
(when it loads the keyspace from the database). Luckily, this
assumption is not necessary as this function only needs
keyspace_metadata. Instead of loading it from the database, we can
pass it as a parameter.
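A sketch of the resulting signature, with parameter types as assumptions rather than the exact ScyllaDB declaration:
```cpp
#include <memory>
#include <vector>

struct keyspace_metadata;   // metadata of the keyspace being announced
struct schema;              // the new table
struct mutation;            // schema mutation to announce
using mutations = std::vector<mutation>;

// Before: the keyspace was loaded from the database internally, so it had
// to exist already.
// mutations prepare_new_column_family_announcement(const schema& cf, long ts);

// After: the caller passes the keyspace_metadata it is about to announce,
// so a new keyspace and its tables can go into a single announcement.
mutations prepare_new_column_family_announcement(
        std::shared_ptr<keyspace_metadata> ksm,
        const schema& cf,
        long timestamp);
```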
There's a dedicated call to register the migration manager's verbs somewhere
in the middle of main. However, until the messaging service starts listening,
it makes no difference when the verbs are registered.
This patch moves the verb registration into the migration_manager constructor,
so it happens with sharded<migration_manager>::start().
Unregistration happens in migration_manager::drain() and is not
touched here.
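A minimal sketch of the register-in-constructor / unregister-in-drain pattern, with simplified stand-ins for the messaging service and verb names:
```cpp
#include <functional>
#include <map>
#include <string>

struct messaging_service {     // simplified stand-in
    std::map<std::string, std::function<void()>> verbs;
    void register_verb(std::string name, std::function<void()> h) { verbs[std::move(name)] = std::move(h); }
    void unregister_verb(const std::string& name) { verbs.erase(name); }
};

class migration_manager {
    messaging_service& _ms;
public:
    explicit migration_manager(messaging_service& ms) : _ms(ms) {
        // Registered at construction time; harmless because the messaging
        // service is not listening yet at this point of startup.
        _ms.register_verb("MIGRATION_REQUEST", [] { /* serve a schema pull */ });
    }
    void drain() {
        _ms.unregister_verb("MIGRATION_REQUEST");
    }
};
```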
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15367
Now that the endpoint_state isn't changed in place,
we do not need to copy it for each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.
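A simplified sketch of the idea (std::shared_ptr stands in for seastar::lw_shared_ptr, and the gossip types are reduced to placeholders):
```cpp
#include <memory>
#include <string>
#include <vector>

struct endpoint_state { std::string status; /* gossip state for one endpoint */ };
using endpoint_state_ptr = std::shared_ptr<const endpoint_state>;

struct subscriber {
    virtual void on_join(const std::string& endpoint, endpoint_state_ptr state) = 0;
    virtual ~subscriber() = default;
};

void notify_all(const std::vector<subscriber*>& subs,
                const std::string& endpoint,
                endpoint_state_ptr snapshot) {
    for (auto* s : subs) {
        s->on_join(endpoint, snapshot);   // shared read-only snapshot, no copies
    }
}
```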
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We remove the default value for the description parameter of
migration_manager::announce to avoid using it in the future.
Thanks to this, all commands in system.group0_history will
have a non-null description.
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.
We make sure that all notifications (except for `before_change`, which
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.
An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.
Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471
Closes #14845
* github.com:scylladb/scylladb:
gossiper: replicate: ensure non-null permit
gossiper: add_saved_endpoint: lock_endpoint
gossiper: mark_as_shutdown: lock_endpoint
gossiper: real_mark_alive: lock_endpoint
gossiper: advertise_token_removed: lock_endpoint
gossiper: do_status_check: lock_endpoint
gossiper: remove_endpoint: lock_endpoint if needed
gossiper: force_remove_endpoint: lock_endpoint if needed
storage_service: lock_endpoint when removing node
gossiper: use permit_id to serialize state changes while preventing deadlocks
gossiper: lock_endpoint: add debug messages
utils: UUID: make default tagged_uuid ctor constexpr
gossiper: lock_endpoint must be called on shard 0
gossiper: replicate: simplify interface
gossiper: mark_as_shutdown: make private
gossiper: convict: make private
gossiper: mark_as_shutdown: do not call convict
Pass the permit_id to subscribers when we acquire one
via lock_endpoint. The subscribers then pass it back to
the gossiper on paths that acquire lock_endpoint for
the same endpoint, so that nested locking of the endpoint
with the same permit_id can be detected.
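A minimal sketch of the permit mechanism, assuming simplified names and structure (the actual gossiper code is asynchronous and shard-aware):
```cpp
#include <cstdint>
#include <mutex>
#include <optional>

using permit_id = uint64_t;

class endpoint_lock {
    std::mutex _mtx;
    std::optional<permit_id> _holder;
    int _depth = 0;
    uint64_t _next_permit = 1;
public:
    // Returns the permit to pass down to notification subscribers.
    permit_id lock(std::optional<permit_id> pid = std::nullopt) {
        if (pid && _holder == pid) {
            ++_depth;               // nested call from a subscriber: already held
            return *pid;
        }
        _mtx.lock();
        _holder = _next_permit++;
        _depth = 1;
        return *_holder;
    }
    void unlock(permit_id pid) {
        if (_holder == pid && --_depth == 0) {
            _holder.reset();
            _mtx.unlock();
        }
    }
};
```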
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The `migration_manager` service is responsible for schema convergence in
the cluster - pushing schema changes to other nodes and pulling schema
when a version mismatch is observed. However, there is also a part of
`migration_manager` that doesn't really belong there - creating
mutations for schema updates. These are the functions with `prepare_`
prefix. They don't modify any state and don't exchange any messages.
They only need to read the local database.
We take these functions out of `migration_manager` and make them
separate functions to reduce the dependency of other modules (especially
`query_processor` and CQL statements) on `migration_manager`. Since all
of these functions only need access to `storage_proxy` (or even only
`replica::database`), doing such a refactor is not complicated. We just
have to add one parameter, either `storage_proxy` or `database` and both
of them are easily accessible in the places where these functions are
called.
This refactor makes `migration_manager` unneeded in a few functions:
- `alternator::executor::create_keyspace`,
- `cql3::statements::alter_type_statement::prepare_announcement_mutations`,
- `cql3::statements::schema_altering_statement::prepare_schema_mutations`,
- `cql3::query_processor::execute_thrift_schema_command`,
- `thrift::handler::execute_schema_command`.
We remove the `migration_manager&` parameter from all these functions.
Fixes #14339
Closes #14875
* github.com:scylladb/scylladb:
cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
cql3: alter_type_statement::prepare_announcement_mutations: change parameters
alternator: executor::create_keyspace: remove an unused parameter
service: migration_manager: change the prepare_ methods to functions
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not ideal.
Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them access the private `query_processor& _qp` member.
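A rough sketch of the friend-based access, assuming illustrative function names rather than the exact ones in db::schema_tables:
```cpp
struct query_processor { /* runs internal CQL queries */ };

class system_keyspace {
    query_processor& _qp;   // private: only friends may use it
public:
    explicit system_keyspace(query_processor& qp) : _qp(qp) {}

    // Querying helpers living in db::schema_tables get friend access.
    friend void store_column_mapping(system_keyspace& sys_ks);
    friend void load_column_mapping(system_keyspace& sys_ks);
};

void store_column_mapping(system_keyspace& sys_ks) {
    query_processor& qp = sys_ks._qp;   // allowed thanks to the friend declaration
    (void)qp;  // ... INSERT into system.scylla_table_schema_history ...
}

void load_column_mapping(system_keyspace& sys_ks) {
    query_processor& qp = sys_ks._qp;
    (void)qp;  // ... SELECT from system.scylla_table_schema_history ...
}
```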
Closes #14876
* github.com:scylladb/scylladb:
schema_tables: Extract query_processor from system_keyspace for querying
schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
migration_manager: Add system_keyspace argument to get_schema_mapping()
It will need one to pass to the db::schema_tables code. The caller is the
Paxos code, which has a sys_ks local variable at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The migration_manager service is responsible for schema convergence
in the cluster - pushing schema changes to other nodes and pulling
schema when a version mismatch is observed. However, there is also
a part of migration_manager that doesn't really belong there -
creating mutations for schema updates. These are the functions with
prepare_ prefix. They don't modify any state and don't exchange any
messages. They only need to read the local database.
We take these functions out of migration_manager and make them
separate functions to reduce the dependency of other modules
(especially query_processor and CQL statements) on
migration_manager. Since all of these functions only need access
to storage_proxy (or even only replica::database), doing such a
refactor is not complicated. We just have to add one parameter,
either storage_proxy or database and both of them are easily
accessible in the places where these functions are called.
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key into the digest. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
a schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
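A simplified sketch of the new behaviour, assuming a plain hash as a stand-in for the real digest code:
```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct schema_mutation {                       // simplified stand-in
    std::string partition_key;
    std::vector<std::string> live_cells;       // empty if everything was tombstoned away
    bool is_empty_after_compaction() const { return live_cells.empty(); }
};

size_t per_table_schema_digest(const std::vector<schema_mutation>& muts,
                               bool digest_insensitive_to_expiry /* cluster feature */) {
    size_t digest = 0;
    for (const auto& m : muts) {
        if (digest_insensitive_to_expiry && m.is_empty_after_compaction()) {
            continue;   // skip: whether this partition still shows up depends only
                        // on tombstone expiry, not on the actual schema
        }
        digest ^= std::hash<std::string>{}(m.partition_key);
        for (const auto& cell : m.live_cells) {
            digest ^= std::hash<std::string>{}(cell);
        }
    }
    return digest;
}
```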
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes #4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes #14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
It will recreate schema_ptrs from the schema tables, like during a table
alter. It will be needed when the digest calculation changes in reaction to
a cluster feature at run time.
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).
We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
the bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.
Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
feature enabled (which means we should start upgrading to use Raft),
or restarting in the middle of Raft upgrade procedure, schema pulls must
remain enabled until the Raft upgrade procedure finishes.
This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
perform a schema pull.
The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`; we just need
to set the bool. Also print a log message when schema pulls are
disabled.
Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.
Correct some typos in comments and update the comments.
Fixes #12870
Add a test that submits 3 large commands, each a little larger
than 1/3 of the maximum mutation size. Check that in the end 2 commands were
executed (the first 2 were merged and the third was executed separately).
Now that a schema pull may issue a Raft read barrier, it may get stuck if
a majority is not available. Make the operation abortable and abort it
during queries if the timeout is reached.
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This patch changes the code to issue a Raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.
Refs: #3760
Fixes: #13211
A migration_manager holds a reference to a storage_proxy, and uses it
internally a lot - e.g., to gain access to the data_dictionary. Users
of migration_manager might also benefit from this storage_proxy - we
will see such a case in the next patches. So let's provide a getter
for the storage_proxy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes #12858
Now, whether a 'group 0 operation' (today it means a schema change) is
performed using the old or new method no longer depends on the local RAFT
feature being enabled, but on the state of the upgrade procedure.
In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.
The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.
To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.
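A condensed sketch of the locking scheme, using standard-library primitives as stand-ins for the Seastar ones and omitting the other upgrade states:
```cpp
#include <shared_mutex>
#include <utility>

enum class group0_upgrade_state { use_pre_raft_procedures, synchronize, use_post_raft_procedures };

class upgrade_state_holder {
    std::shared_mutex _upgrade_lock;
    group0_upgrade_state _state = group0_upgrade_state::use_pre_raft_procedures;
public:
    // The returned lock is only engaged when the caller must use the old path;
    // the group0_guard keeps it for the whole duration of the operation.
    std::pair<group0_upgrade_state, std::shared_lock<std::shared_mutex>>
    get_group0_upgrade_state() {
        std::shared_lock<std::shared_mutex> holder(_upgrade_lock);
        if (_state == group0_upgrade_state::use_pre_raft_procedures) {
            return {_state, std::move(holder)};
        }
        holder.unlock();
        return {_state, std::shared_lock<std::shared_mutex>{}};
    }

    // The upgrade procedure takes the exclusive side, so it cannot switch to
    // `synchronize` while an old-style group 0 operation is still in flight.
    void set_upgrade_state(group0_upgrade_state s) {
        std::unique_lock<std::shared_mutex> exclusive(_upgrade_lock);
        _state = s;
    }
};
```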
We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).
Some additional comments were written.
Define table_schema_version as a distinct tagged_uuid class,
so it can be differentiated from other uuid-class types,
in particular table_id.
Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.
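A minimal sketch of the tagged-UUID idea (heavily simplified; the real utils::tagged_uuid wraps utils::UUID and offers more operations):
```cpp
#include <array>
#include <cstdint>

struct uuid_bytes {
    std::array<uint8_t, 16> bytes{};
    bool operator==(const uuid_bytes&) const = default;
};

template <typename Tag>
class tagged_uuid {
    uuid_bytes _id;
public:
    constexpr tagged_uuid() = default;           // constexpr default ctor, as in this series
    explicit tagged_uuid(uuid_bytes id) : _id(id) {}
    const uuid_bytes& uuid() const { return _id; }
    bool operator==(const tagged_uuid&) const = default;
};

struct table_id_tag {};
struct table_schema_version_tag {};
using table_id = tagged_uuid<table_id_tag>;
using table_schema_version = tagged_uuid<table_schema_version_tag>;

// void lookup(table_id id);
// lookup(table_schema_version{});   // would not compile: the types are distinct
```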
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
Fixes #11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Writing into the group0 raft group on a client side involves locking
the state machine, choosing a state id and checking for its presence
after operation completes. The code that does it resides now in the
migration manager since the currently it is the only user of group0. In
the near future we will have more client for group0 and they all will
have to have the same logic, so the patch moves it to a separate class
raft_group0_client that any future user of group0 can use to write
into it.
Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes #10562
The main target here is system_keyspace::update_schema_version(), which
is currently static but needs to have a system_keyspace as "this". The
migration manager is one of the places that calls that method indirectly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
`group0_concurrent_modification` exception.
Catch the exception in CQL DDL statement execution function and retry.
In addition, the description of CQL DDL statements in group 0 history
table was improved.
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
When performing a change through group 0 (which right now only covers
schema changes), clear entries from group 0 history table which are older
than one week.
This is done by including an appropriate range tombstone in the group 0
history table mutation.
The description parameter is used for the group 0 history mutation.
The default is empty, in which case the mutation will leave
the description column as `null`.
I filled the parameter in some easy places as an example and left the
rest for a follow-up.
This is how it looks now in a fresh cluster with a single statement
performed by the user:
cqlsh> select * from system.group0_history ;
key | state_id | description
---------+--------------------------------------+------------------------------------------------------
history | 9ec29cac-7547-11ec-cfd6-77bb9e31c952 | CQL DDL statement
history | 9beb2526-7547-11ec-7b3e-3b198c757ef2 | null
history | 9be937b6-7547-11ec-3b19-97e88bd1ca6f | null
history | 9be784ca-7547-11ec-f297-f40f0073038e | null
history | 9be52e14-7547-11ec-f7c5-af15a1a2de8c | null
history | 9be335dc-7547-11ec-0b6d-f9798d005fb0 | null
history | 9be160c2-7547-11ec-e0ea-29f4272345de | null
history | 9bdf300e-7547-11ec-3d3f-e577a2e31ffd | null
history | 9bdd2ea8-7547-11ec-c25d-8e297b77380e | null
history | 9bdb925a-7547-11ec-d754-aa2cc394a22c | null
history | 9bd8d830-7547-11ec-1550-5fd155e6cd86 | null
history | 9bd36666-7547-11ec-230c-8702bc785cb9 | Add new columns to system_distributed.service_levels
history | 9bd0a156-7547-11ec-a834-85eac94fd3b8 | Create system_distributed(_everywhere) tables
history | 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a | Create system_distributed_everywhere keyspace
history | 9bcec89a-7547-11ec-e1b4-34e0010b4183 | Create system_distributed keyspace
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.
To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.
The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.
The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.
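A condensed sketch of the flow described above, with simplified types and declarations only (the real API lives in migration_manager and the raft group 0 code and returns Seastar futures):
```cpp
#include <string>
#include <vector>

struct state_id { std::string value; };
struct mutation {};

struct group0_guard {
    state_id observed_state_id;   // last entry of the history table at read time
    state_id new_state_id;        // appended to the history table if the command applies
    long write_timestamp;         // derived from new_state_id, greater than the previous one
};

struct group0_command {
    state_id prev_state_id;
    state_id new_state_id;
    std::vector<mutation> changes;
};

// Read barrier + read the history table + pick a new state ID and timestamp.
group0_guard start_group0_operation();

// Build a group0_command from the mutations and the guard, then submit it.
void announce(std::vector<mutation> changes, group0_guard guard);

// On the apply side (all applies are serialized):
//   if (cmd.prev_state_id != last_history_entry()) { /* concurrent change won: no-op */ }
//   else { apply_mutations(cmd.changes); append_history(cmd.new_state_id); }
```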
We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consists of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.
The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.
The guard will be a method of ensuring linearizability for group 0
operations.
This object will be used to "guard" group 0 operations. Obtaining it
will be necessary to perform a group 0 change (such as modifying the
schema), which will be enforced by the type system.
The initial implementation is a stub and only provides a timestamp which
will be used by callers to create mutations for group 0 changes. The
next commit will change all call sites to use the guard as intended.
The final implementation, coming later, will ensure linearizability of
group 0 operations.
1. Generalize the name so it mentions group 0, which schema will be a
strict subset of.
2. Remove the fact that it performs a "read barrier" from the name. The
function will be used in general to ensure linearizability of group0
operations - both reads and writes. "Read barrier" is Raft-specific
terminology, so it can be thought of as an implementation detail.
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
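A small sketch of the API change, with illustrative signatures rather than the exact ScyllaDB declarations:
```cpp
#include <vector>

struct schema {};
struct mutation {};
using api_timestamp = long;

// Before: the timestamp was generated inside the prepare_ function.
// std::vector<mutation> prepare_new_column_family_announcement(const schema& cf);

// After: the call site supplies the timestamp; under group 0 it comes from
// the guard, so timestamps applied through Raft stay monotonic.
std::vector<mutation> prepare_new_column_family_announcement(const schema& cf, api_timestamp ts);

// Typical use (pseudocode):
//   auto guard = start_group0_operation();
//   auto muts  = prepare_new_column_family_announcement(cf, guard.write_timestamp);
//   announce(std::move(muts), std::move(guard));
```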