The currently used "<unknown>" marker for invalid values/types is
undistinguishable from a normal value in some cases. Use the much more
distinct and unique json Null instead.
Instead of trying to be clever and switching the output on the type of
collection, use the same format always: a list of objects, where the
object has a key and value attribute, containing to the respective
collection item key and values. This makes processing much easier for
machines (and humans too since the previous system wasn't working well).
There are several slightly different cell types in scylla: regular
cells, collection cells (frozen and non-frozen) and counter cells
(update and shards). In C++ code the type of the cell is always
available for code wishing to make out exactly what kind of cell a cell
is. In the JSON output of the dump-data this is currently really hard to
do as there is not enough information to disambiguate all the different
cell types. We wish to make the JSON output self-sufficient so in this
patch we introduce a "type" field which contains one of:
* regular
* counter-update
* counter-shards
* frozen-collection
* collection
Furthermore, we bring the different types closer by also printing the
counter shards under the 'value' key, not under the 'shards' key as
before. The separate 'shards' is no longer needed to disambiguate.
The documentation and the write operation is also updated to reflect the
changes.
task_manager::task::impl contains an abort source which can
be used to check whether it is aborted and an abort method
which aborts the task (request_abort on abort_source) and all
its descendants recursively.
When the start method is called after the task was aborted,
then its state is set to failed and the task does not run.
Fixes: #11995Closes#11996
* github.com:scylladb/scylladb:
tasks: do not run tasks that are aborted
tasks: delete unused variable
tasks: add abort_source to task_manager::task::impl
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780Closes#11942
* github.com:scylladb/scylladb:
message: messaging_service: check for known topology before calling is_same_dc/rack
test: reenable test_topology::test_decommission_node_add_column
test/pylib: util: configurable period in wait_for
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs
As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization.
This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation.
While at at, the series includes other cleanups in this area in repair and view that are not fixes as the calls happen in synchronous functions that do not yield.
Fixes#11993Closes#11994
* github.com:scylladb/scylladb:
repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
repair: pass effective_replication_map down to repair_info
repair: coroutinize sync_data_using_repair
repair: futurize do_repair_start
effective_replication_map: add global_effective_replication_map
shared_token_metadata: get_lock is const
repair: sync_data_using_repair: require to run on shard 0
repair: require all node operations to be called on shard 0
repair: repair_info: keep effective_replication_map
repair: do_repair_start: use keyspace erm to get keyspace local ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
repair: do_repair_start: check_in_shutdown first
repair: get_db().local() where needed
repair: get topology from erm/token_metdata_ptr
view: get_view_natural_endpoint: get topology from erm
This series moves the topology code from locator/token_metadata.{cc,hh} out to localtor/topology.{cc,hh}
and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987
While at it, the token_metadata functions are turned into coroutines
and topology copy constructor is deleted. The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology.
Closes#12001
* github.com:scylladb/scylladb:
locator: refactor topology out of token_metadata
locator: add types.hh
topology: delete copy constructor
token_metadata: coroutinize clone functions
compaction_manager::task (and thus compaction_data) can be stopped
because of many different reasons. Thus, abort can be requested more
than once on compaction_data abort source causing a crash.
To prevent this before each request_abort() we check whether an abort
was requested before.
Closes#12004
An unordered_set is more efficient and there is no need
to return an ordered set for this purpose.
This change facilitates a follow-up change of adding
topology::get_datacenters(), returning an unordered_set
of datacenter names.
Refs #11987
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12003
Fix a typo introduced by the the recent patch fixing the spelling of
Barrett. The patch introduced a typo in the aarch64 version of the code,
which wasn't found by promotion, as that only builds on X86_64.
Closes#12006
And make sure the token_metadata ring version is same as the
reference one (from the erm on shard 0), when starting the
repair on each shard.
Refs #11993
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Turn it into a coroutine to prepare for the next path
that will co_await make_global_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Class to hold a coherent view of a keyspace
effective replication map on all shards.
To be used in a following patch to pass the sharded
keyspace e_r_m:s to repair.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that our toolchain is based on Fedora 37, we can rely on its
libdeflate rather than have to carry our own in a submodule.
Frozen toolchain is regenerated. As a side effect clang is updated
from 15.0.0 to 15.0.4.
Closes#12000
And with that do_sync_data_using_repair can be folded into
sync_data_using_repair.
This will simplify using the effective_replication_map
throughout the operation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling db.get_keyspace_local_ranges that
looks up the keyspace and its erm again.
We want all the inforamtion derived from the erm to
be based on the same source.
The function is synchronous so this changes doesn't
fix anything, just cleans up the code.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Ensure that the primary ranges are in sync with the
keyspace erm.
The function is synchronous so this change doesn't fix anything,
it just cleans up the code.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Ensure the erm and topology are in sync.
The function is synchronous so this change doesn't fix
anything, just cleans up the code.
Fix mistake in comment while at it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In several places we get the sharded database using get_db()
and then we only use db.local(). Simplify the code by keeping
reference only to the local database upfront.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want the topology to be synchronized with the respective
effective_replication_map / token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get the topology for the effective replication map rather
than from the storage_proxy to ensure its synchronized
with the natural endpoints.
Since there's no preemption between the two calls
currently there is no issue, so this is merely a clean up
of the code and not supposed to fix anything.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a reproducer for issue #11954: Attempting an
"IF NOT EXISTS" (LWT) write with a null key crashes Scylla,
instead of producing a simple error message (like happens
without the "IF NOT EXISTS" after #7852 was fixed).
The test passed on Cassandra, but crashes Scylla. Because of this
crash, we can't just mark the test "xfail" and it's temporarily
marked "skip" instead.
Refs #11954.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11982
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in case of
`modify_config`, other `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could led to a tight loop.
The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.
Fixes: #11564Closes#11769
* github.com:scylladb/scylladb:
raft: rafactor: remove duplicate code on retries delays
raft: use wait_for_next_tick in read_barrier
raft: wait for the next tick before retrying
Currently in start() method a task is run even if it was already
aborted.
When start() is called on an aborted task, its state is set to
task_manager::task_state::failed and it doesn't run.
As P. T. Barnoom famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.
Closes#11989
Topology is copied only from token_metadata_impl::clone_only_token_map
which copies the token_metadata_impl with yielding to prevent reactor
stalls. This should apply to topology as well, so
add a clone_gently function for cloning the topology
from token_metadata_impl::clone_only_token_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.
When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).
Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.
Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is not entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.
Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when topology was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).
The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.
However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix#11780.
To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).
This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.
Fixes#11992.
Inspired by xemul/scylla@45d48f3d02.
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwords. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
Fixes#10026Fixes#11542
This PR is V2 of the[ PR created by @psarna.](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.
Fixes https://github.com/scylladb/scylladb/issues/11378Closes#11984
* github.com:scylladb/scylladb:
doc: label user-defined functions as Experimental
doc: restore the note for the Count function (removed by mistatke)
doc: document user defined functions (UDFs)
Despite docs discourage from using INADDR_ANY as listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in.
Closes#11846
* github.com:scylladb/scylladb:
messaging_service: Deny putting INADD_ANY as preferred ip
messaging_service: Toss preferred ip cache management
gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
gossiping_property_file_snitch: Make _listen_address optional
When we translate from docker/go arch names to the kernel arch
names, we use an associative array hack using computed variable
names "{$!variable_name}". But it turns out bash has real
associative arrays, introduced with "declare -A". Use the to make
the code a little clearer.
Closes#11985
Fix https://github.com/scylladb/scylla-docs/issues/4144Closes#11226
* github.com:scylladb/scylladb:
Update docs/getting-started/system-requirements.rst
doc: specify the recommended AWS instance types
doc: replace the tables with a generic description of support for Im4gn and Is4gen instances
doc: add support for AWS i4g instances
doc: extend the list of supported CPUs