There are async timeouts for ALTER queries. Seems related to other issues
with the driver and async.
Make these queries synchronous for now.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #11394
This commit introduces the following changes to the Alternator compatibility doc:
* As of https://github.com/scylladb/scylladb/pull/11298 Alternator will return ProvisionedThroughput in DescribeTable API calls. We add the fact that tables will default to a BillingMode of PAY_PER_REQUEST (this wasn't made explicit anywhere in the docs), and that the values for RCUs/WCUs are hardcoded to 0.
* Mention that the ScyllaDB (and thus Alternator) hashing function differs from AWS's proprietary implementation for DynamoDB. This is an implementation detail rather than a bug, but it may confuse users who compare the ResultSet returned from Table Scans between DynamoDB and Alternator.
Refs: https://github.com/scylladb/scylladb/issues/11222
Fixes: https://github.com/scylladb/scylladb/issues/11315
Closes #11360
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
Closes #11318
* github.com:scylladb/scylladb:
raft, use max_command_size to satisfy commitlog limit
raft, limit for command size
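A minimal sketch of the idea, not the actual implementation: check the
serialized command against `max_command_size` before it is stored, and fail
with an error that names both sizes. The `CommandTooLargeError` class and the
`check_command_size` helper are hypothetical names used only for illustration.

```python
class CommandTooLargeError(Exception):
    """Hypothetical error raised when a command exceeds the size limit."""


def check_command_size(command: bytes, max_command_size: int) -> None:
    # Reject oversized commands up front with a message naming both sizes,
    # instead of letting a later storage-layer failure surface as a timeout.
    if len(command) > max_command_size:
        raise CommandTooLargeError(
            f"command size {len(command)} exceeds the limit of "
            f"{max_command_size} bytes"
        )
```

The point of the early check is that the caller sees the cause directly.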
Currently SCYLLA_BUILD_MODE is defined as a string by the cxxflags
generated by configure.py. This is not very useful since one cannot use
it in an #if preprocessor directive.
Instead, use -DSCYLLA_BUILD_MODE=release, for example, and define
SCYLLA_BUILD_MODE_STR as the string representation of it.
In addition define the respective
SCYLLA_BUILD_MODE_{RELEASE,DEV,DEBUG,SANITIZE} macros that can be easily
used in #ifdef (or #ifndef :)) for conditional compilation.
The planned use case for it is to enable a task_manager test module only
in non-release modes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11357
Currently, if a keyspace has an aggregate and the keyspace
is dropped, the keyspace becomes corrupted and another keyspace
with the same name cannot be created again.
This is caused by the fact that when removing an aggregate, we
call create_aggregate() to get values for its name and signature.
In the create_aggregate(), we check whether the row and final
functions for the aggregate exist.
Normally, that's not an issue, because when dropping an existing
aggregate alone, we know that its UDFs also exist. But when dropping
an entire keyspace, we first drop the UDFs, making us unable to drop
the aggregate afterwards.
This patch fixes this behavior by removing the create_aggregate()
from the aggregate dropping implementation and replacing it with
specific calls for getting the aggregate name and signature.
Additionally, a test that would previously fail is added to
cql-pytest/test_uda.py where we drop a keyspace with an aggregate.
Fixes #11327
Closes #11375
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
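The abort handling can be illustrated with a small asyncio sketch. It is an
analogy only, not the server code: `AbortSource`, `AbortRequested` and
`wait_for_non_joint_entry` are hypothetical stand-ins, and an asyncio future
plays the role of the promise.

```python
import asyncio


class AbortRequested(Exception):
    """Hypothetical error used to fail a wait that was aborted."""


class AbortSource:
    """Tiny stand-in for an abort source: runs callbacks once on abort."""

    def __init__(self) -> None:
        self._callbacks = []
        self.aborted = False

    def subscribe(self, callback) -> None:
        # If the abort already happened, fire immediately.
        if self.aborted:
            callback()
        else:
            self._callbacks.append(callback)

    def request_abort(self) -> None:
        self.aborted = True
        for callback in self._callbacks:
            callback()
        self._callbacks.clear()


async def wait_for_non_joint_entry(committed: asyncio.Future, abort: AbortSource):
    # Subscribe to the abort source so the wait fails instead of hanging
    # when the entry is never committed.
    def on_abort() -> None:
        if not committed.done():
            committed.set_exception(AbortRequested())

    abort.subscribe(on_abort)
    return await committed
```

Without the subscription, a waiter on `committed` would hang forever if
nothing ever resolved it.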
"
On token_metadata there are two update_normal_tokens() overloads --
one updates tokens for a single endpoint, another one -- for a set
(well -- std::map) of them. Other than updating the tokens both
methods also may add an endpoint to the t.m.'s topology object.
There's an ongoing effort in moving the dc/rack information from
snitch to topology, and one of the changes made in it is -- when
adding an entry to topology, the dc/rack info should be provided
by the caller (which in 99% of the cases is the storage service).
The batched tokens update is extremely unfriendly to the latter
change. Fortunately, this helper is only used by tests, the core
code always uses fine-grained tokens updating.
"
* 'br-tokens-update-relax' of https://github.com/xemul/scylla:
token_metadata: Indentation fix after previous patch
token_metadata: Remove excessive empty tokens check
token_metadata: Remove batch tokens updating method
tests: Use one-by-one tokens updating method
Some cases in test_wasm.py assumed that all cases
are run in the same order every time and depended
on values that should have been added to tables in
previous cases. Because of that, they were sometimes
failing. This patch removes this assumption by
adding the missing inserts to the affected cases.
Additionally, the assert that confirms the low miss
rate of UDFs is made more precise, and a comment is added
to explain it clearly.
Closes #11367
It could happen that we accessed failure detector service after it was
stopped if a reconfiguration happened in the 'right' moment. This would
result in an assertion failure. Fix this.
Closes #11326
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.
Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
schema operations.
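The wait condition for leaving the `synchronize` state can be sketched as a
predicate. This is an illustration under assumptions: `UpgradeState` and
`can_leave_synchronize` are hypothetical names, and `NOT_STARTED` is a
placeholder for whatever state precedes `synchronize`.

```python
from enum import Enum


class UpgradeState(Enum):
    # `synchronize` and `use_new_procedures` are named in the text;
    # NOT_STARTED is a hypothetical placeholder for any earlier state.
    NOT_STARTED = "not_started"
    SYNCHRONIZE = "synchronize"
    USE_NEW_PROCEDURES = "use_new_procedures"


def can_leave_synchronize(member_states: list[UpgradeState]) -> bool:
    # Proceed as soon as any member has already reached the final state...
    if any(s is UpgradeState.USE_NEW_PROCEDURES for s in member_states):
        return True
    # ...or once every (known) group 0 member is in `synchronize`.
    return bool(member_states) and all(
        s is UpgradeState.SYNCHRONIZE for s in member_states
    )
```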
With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).
This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.
Read the design doc, then read the commits in topological order
for best reviewing experience.
---
I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.
As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).
Closes #10835
* github.com:scylladb/scylladb:
service/raft: raft_group0: implement upgrade procedure
service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
service/raft: raft_group0: introduce local loggers for group 0 and upgrade
service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
service/raft: raft_group0_client: prepare for upgrade procedure
service/raft: introduce `group0_upgrade_state`
db: system_keyspace: introduce `load_peers`
idl-compiler: introduce cancellable verbs
message: messaging_service: cancellable version of `send_schema_check`
After the previous patch, empty passed tokens make the helper co_return
early, so this `if` is dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No users left.
The endpoint_tokens.empty() check is removed, only tests could trigger
it, but they didn't and are patched out.
Indentation is left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tests are the only users of batch tokens updating "sugar" which
actually makes things more complicated
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_pending_address_ranges() accepting a single token is not in use,
its peer that accepts a set of tokens is
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11358
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the fewer globals, the better) this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.
Closes #11284
* github.com:scylladb/scylladb:
utils/logalloc: remove reclaim_timer:: globals
utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
utils/logalloc: move global stat accessors to tracker
utils/logalloc: allocating_section: don't use the global tracker
utils/logalloc: pass down tracker::impl reference to segment_pool
utils/logalloc: move segment pool into tracker
utils/logalloc: add tracker member to basic_region_impl
utils/logalloc: make segment independent of segment pool
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.
Fixes #11245
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11246
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by dropping any remaining entry waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes #11235.
Closes #11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
Fixes #11349
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This patch set does so, adapted
to the recent version of this file.
Closes #11350
* github.com:scylladb/scylladb:
distributed_loader: Restore separate processing of keyspace init prio/normal
Revert "distributed_loader: Remove unused load-prio manipulations"
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).
The listener starts the `upgrade_to_group0` procedure.
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
`system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
for schema operations (only those for now).
The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.
`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.
If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.
We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.
To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
(the keys are `raft_group0_id` and `raft_server_id`),
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.
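The per-node steps above could be written down as CQL statements roughly as
follows. This is a sketch under assumptions: it presumes that
`system.scylla_local` is a key/value table with columns named `key` and
`value`, which the text does not state, and the helper names are hypothetical.

```python
def recovery_statements() -> list[str]:
    # Assumption: `system.scylla_local` has columns named `key` and `value`;
    # the text only names the tables and the keys.
    return [
        "UPDATE system.scylla_local SET value = 'recovery' "
        "WHERE key = 'group0_upgrade_state'",
        "TRUNCATE TABLE system.discovery",
        "TRUNCATE TABLE system.group0_history",
        "DELETE FROM system.scylla_local WHERE key = 'raft_group0_id'",
        "DELETE FROM system.scylla_local WHERE key = 'raft_server_id'",
    ]


def leave_recovery_statement() -> str:
    # Run on every node after the downed nodes were removed, followed by
    # another rolling restart.
    return "DELETE FROM system.scylla_local WHERE key = 'group0_upgrade_state'"
```

The first list is run on every node before the first rolling restart; the
second statement is run on every node before the final one.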
Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consists of schema tables, and the upgrade
procedure makes sure to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
whenever we extend group 0 with new stuff...)
Add some more logging to `randomized_nemesis_test` such as logging the
start and end of a reconfiguration operation in a way that makes it easy
to find one given the other in the logs.
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes #11288.
Fixes #11349
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This patch and the revert before it do so, adapted
to the recent version of this file.
This reverts commit 7396de72b1.
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This reverts the actual commit; the patch after it
ensures we use the prio set.
This series turns plan_id from a generic UUID into a strong type so it can't be used interchangeably with other UUIDs.
While at it, streaming/stream_fwd.hh was added for forward declarations and the definition of plan_id.
Also, the `stream_manager::update_progress` parameter was renamed to plan_id to represent its assumed content, before changing its type to `streaming::plan_id`.
Closes #11338
* github.com:scylladb/scylladb:
streaming: define plan_id as a strong tagged_uuid type
stream_manager: update_progress: rename cf_id param to plan_id
streaming: add forward declarations in stream_fwd.hh
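The benefit of a strong id type can be shown with a small Python analogy.
This is not the series' C++ code: `PlanId`, `TableId` and this
`update_progress` are hypothetical illustrations of wrapping a UUID in a
distinct type.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanId:
    """Wraps a UUID so a plan id is a distinct type from other UUIDs."""
    value: uuid.UUID


@dataclass(frozen=True)
class TableId:
    """Hypothetical second UUID-backed id, used only for contrast."""
    value: uuid.UUID


def update_progress(plan_id: PlanId) -> str:
    # Reject anything that is not a PlanId, including a bare UUID.
    if not isinstance(plan_id, PlanId):
        raise TypeError(f"expected PlanId, got {type(plan_id).__name__}")
    return f"progress updated for plan {plan_id.value}"
```

Two ids built from the same UUID but of different wrapper types do not
compare equal, so mixing them up is caught rather than silently accepted.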
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.
Instead, iterate over the range_tombstone_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.
While at it, this series contains some other cleanups on this path
to improve the code readability and maybe make the compiler's life
easier as for optimizing the cleaned-up code.
Closes #11271
* github.com:scylladb/scylladb:
mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
mutation: consume_clustering_fragments: reindent
mutation: consume_clustering_fragments: shuffle emit_rt logic around
mutation: consume, consume_gently: simplify partition_start logic
mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
mutation: consume_clustering_fragments: keep the reversed schema in cookie
mutation: clustering_iterators: get rid of current_rt
mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
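The lazy reversal described in this series can be sketched with a generator.
The `RangeTombstone` class below is a simplified, hypothetical model (integer
bounds, a `flipped` helper), not the real type.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class RangeTombstone:
    # Simplified model: integer bounds instead of clustering positions.
    start: int
    end: int
    timestamp: int

    def flipped(self) -> "RangeTombstone":
        # In reversed clustering order the bounds swap roles.
        return RangeTombstone(self.end, self.start, self.timestamp)


def iter_reversed(tombstones: Sequence[RangeTombstone]) -> Iterator[RangeTombstone]:
    # Walk the list back to front and flip one element at a time,
    # so no reversed copy of the whole list is ever built.
    for rt in reversed(tombstones):
        yield rt.flipped()
```

Each step does a constant amount of work, which is what avoids one long
uninterrupted pass over a large list.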
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
These pretend to be free functions but access globals in the background.
Make them members of the tracker instead, which has everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.
Pass a 60-second timeout (default is 10) to match the session's.
Fixes https://github.com/scylladb/scylladb/issues/11289
Closes #11348
* github.com:scylladb/scylladb:
test.py: bump schema agreement timeout for topology tests
test.py: bump timeout of async requests for topology
test.py: fix bad indent
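A sketch of passing the timeout explicitly on each async request. The helper
name `execute_async_with_timeout` is hypothetical, and the sketch assumes the
session object's `execute_async` accepts a `timeout` keyword argument.

```python
def execute_async_with_timeout(session, query, timeout: float = 60.0):
    # Pass the timeout explicitly: per the text, the driver's async API
    # does not pick up the session-level timeout on its own.
    # Assumption: `session.execute_async` accepts a `timeout` keyword.
    return session.execute_async(query, timeout=timeout)
```

Routing every async request through one helper keeps the value in a single
place if it needs to be bumped again.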
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
Fixes #11198
Closes #11269
* github.com:scylladb/scylladb:
mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
frozen_mutation: consume and consume_gently in-order
frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
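The ordering requirement can be illustrated by merging two already-sorted
streams by position. This is a simplified model, not the actual visitor code:
positions are plain integers and `consume_in_order` is a hypothetical helper.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (position, kind); positions are simplified to integers.
Fragment = Tuple[int, str]


def consume_in_order(
    range_tombstones: Iterable[Fragment], rows: Iterable[Fragment]
) -> Iterator[Fragment]:
    # Both inputs are assumed to be sorted by position already; merging
    # them yields a single stream whose positions never decrease.
    return heapq.merge(range_tombstones, rows, key=lambda f: f[0])
```

Consuming all range tombstones first and all rows afterwards would instead
produce positions that jump backwards at the boundary between the two.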
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.
Pass a 60-second timeout (default is 10) to match the session's.
This will hopefully fix timeout failures in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
So that the frozen_mutation consumer can return
stop_iteration::yes if it wishes to stop consuming at
some clustering position.
In this case, on_end_of_partition must still be called
so a closing range_tombstone_change can be emitted to the consumer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
Fixes #11198
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Improve the randomness of this test, making it a bit easier to
reproduce the scenarios that the test aims to catch.
Increase timeouts a bit to account for this additional randomness.