`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780Closes#11942
* github.com:scylladb/scylladb:
message: messaging_service: check for known topology before calling is_same_dc/rack
test: reenable test_topology::test_decommission_node_add_column
test/pylib: util: configurable period in wait_for
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs
This patch adds a reproducer for issue #11954: Attempting an
"IF NOT EXISTS" (LWT) write with a null key crashes Scylla,
instead of producing a simple error message (like happens
without the "IF NOT EXISTS" after #7852 was fixed).
The test passed on Cassandra, but crashes Scylla. Because of this
crash, we can't just mark the test "xfail" and it's temporarily
marked "skip" instead.
Refs #11954.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11982
As P. T. Barnoom famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.
Closes#11989
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.
Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwords. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
Fixes#10026Fixes#11542
As indicated in #11816, we'd like to enable deserializing vectors in reverse.
The forward deserialization is achieved by reading from an input_stream. The
input stream internally is a singly linked list with complicated logic. In order to
allow for going through it in reverse, instead when creating the reverse vector
initializer, we scan the stream and store substreams to all the places that are a
starting point for a next element. The iterator itself just deserializes elements
from the remembered substreams, this time in reverse.
Fixes#11816Closes#11956
* github.com:scylladb/scylladb:
test/boost/serialization_test.cc: add test for reverse vector deserializer
serializer_impl.hh: add reverse vector serializer
serializer_impl: remove unneeded generic parameter
As noted in issue #11979, Scylla inconsistently (and unlike Cassandra)
requires "IS NOT NULL" one some but not all materialized-view key
columns. Specifically, Scylla does not require "IS NOT NULL" on the
base's partition key, while Cassandra does.
This patch is a test which demonstrates this inconsistency. It currently
passes on Cassandra and fails on Scylla, so is marked xfail.
Refs #11979
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11980
The helper is already widely used, one (last) test case can benefit from using it too
Closes#11978
* github.com:scylladb/scylladb:
test: Indentation fix after previous patch
test: Wse with_sstable_directory() helper
We use Barrett tables (misspelled in the code unfortunately) to fold
crc computations of multiple buffers into a single crc. This is important
because it turns out to be faster to compute crc of three different buffers
in parallel rather than compute the crc of one large buffer, since the crc
instruction has latency 3.
Currently, we have a separate code generation step to compute the
fold tables. The step generates a new C++ source files with the tables.
But modern C++ allows us to do this computation at compile time, avoiding
the code generation step. This simplifies the build.
This series does that. There is some complication in that the code uses
compiler intrinsics for the computation, and these are not constexpr friendly.
So we first introduce constexpr-friendly alternatives and use them.
To prove the transformation is correct, I compared the generated code from
before the series and from just before the last step (where we use constexpr
evaluation but still retain the generated file) and saw no difference in the values.
Note that constexpr is not strictly needed - we could have run the code in the
global variables' initializer. But that would cause a crash if we run on a pre-clmul
machine, and is not as fun.
Closes#11957
* github.com:scylladb/scylladb:
test: crc: add unit tests for constexpr clmul and barrett fold
utils: crc combine table: generate at compile time
utils: barrett: inline functions in header
utils: crc combine table: generate tables at compile time
utils: crc combine table: extract table generation into a constexpr function
utils: crc combine table: extract "pow table" code into constexpr function
utils: crc combine table: store tables std::arrray rather than C array
utils: barrett: make the barrett reduction constexpr friendly
utils: clmul: add 64-bit constexpr clmul
utils: barrett: extract barrett reduction constants
utils: barrett: reorder functions
utils: make clmul() constexpr
It's already used everywhere, but one test case wires up the
sstable_directory by hand. Fix it too, but keep in mind, that the caller
fn stops the directory early.
(indentation is deliberately left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().
But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction_group will be allowed to
create its own tracker and manage it by itself.
Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
Closes#11762
* github.com:scylladb/scylladb:
replica: Allow one compaction_backlog_tracker for each compaction_group
compaction: Make compaction_state available for compaction tasks being stopped
compaction: Implement move assignment for compaction_backlog_tracker
compaction: Fix compaction_backlog_tracker move ctor
compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
compaction: kill undefined get_unimplemented_backlog_tracker()
replica: Refactor table::set_compaction_strategy for multiple groups
Fix exception safety when transferring ongoing charges to new backlog tracker
replica: move_sstables_from_staging: Use tracker from group owning the SSTable
replica: Move table::backlog_tracker_adjust_charges() to compaction_group
replica: table::discard_sstables: Use compaction_group's backlog tracker
replica: Disable backlog tracker in compaction_group::stop()
replica: database_sstable_write_monitor: use compaction_group's backlog tracker
replica: Move table::do_add_sstable() to compaction_group
test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
compaction/table_state: Introduce get_backlog_tracker()
Check that the constexpr variants indeed match the runtime variants.
I verified manually that exactly one computation in each test is
executed at run time (and is compared against a constant).
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().
But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.
On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.
Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface will be helpful for allowing replica::table, unit
tests and sstables::compaction to access the compaction group's tracker
which will be managed by the compaction manager, once we complete
the decoupling work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In commit 544ef2caf3 we fixed a bug where
a reveresed clustering-key order caused problems using a secondary index
because of incorrect type comparison. That commit also included a
regression test for this fix.
However, that fix was incomplete, and improved later in commit
c8653d1321. That later fix was labeled
"better safe than sorry", and did not include a test demonstrating
any actual bug, so unsurprisingly we never backported that second
fix to any older branches.
Recently we discovered that missing the second patch does cause real
problems, and this patch includes a test which fails when the first
patch is in, but the second patch isn't (and passes when both patches
are in, and also passes on Cassandra).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11943
Instead of using assigned IP addresses, use a local integer ID for
managing servers. IP address can be reused by a different server.
While there, get host ID (UUID). This can also be reused with `node
replace` so it's not good enough for tracking.
Closes#11747
* github.com:scylladb/scylladb:
test.py: use internal id to manage servers
test.py: rename hostname to ip_addr
test.py: get host id
test.py: use REST api client in ScyllaCluster
test.py: remove unnecessary reference to web app
test.py: requests without aiohttp ClientSession
Instead of using assigned IP addresses, use an internal server id.
Define types to distinguish local server id, host ID (UUID), and IP
address.
This is needed to test servers changing IP address and for node replace
(host UUID).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The code explicitly manages an IP as string, make it explicit in the
variable name.
Define its type and test for set in the instance instead of using an
empty string as placeholder.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When initializing a ScyllaServer, try to get the host id instead of only
checking the REST API is up.
Use the existing aiohttp session from ScyllaCluster.
In case of HTTP error check the status was not an internal error (500+).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move the REST api client to ScyllaCluster. This will allow the cluster
to query its own servers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The aiohttp.web.Application only needs to be passed, so don't store a
reference in ScyllaCluster object.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Simplify REST helper by doing requests without a session.
Reusing an aiohttp.ClientSession causes knock-on effects on
`rest_api/test_task_manager` due to handling exceptions outside of an
async with block.
Requests for cluster management and Scylla REST API don't need session,
anyway.
Raise HTTPError with status code, text reason, params, and json.
In ScyllaCluster.install_and_start() instead of adding one more custom
exception, just catch all exceptions as they will be re-raised later.
While there avoid code duplication and improve sanity, type checking,
and lint score.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.
However, the latter broadcasts all value to all shards
so it's terribly inefficient.
Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.
Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.
Refs #7316Closes#11893
* github.com:scylladb/scylladb:
tests: check ttl on different shards
utils: config_src: add set_value_on_all_shards functions
utils: config_file: add config_source::API
This reverts commit ba6186a47f.
Said commit violates the widely held assumption that sstables
generations can be used as sstable identity. One known problem caused
this is potential OOO partition emitted when reading from sstables
(#11843). We now also have a better fix for #11789 (the bug this commit
was meant to fix): 4aa0b16852. So we can
revert without regressions.
Fixes: #11843Closes#11886
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).
Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.
To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).
This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.
To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
an IP address, because now the stored `_alive_set` already contains
`raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
`gms::inet_address`. To send the ping message, we need to translate
this to IP address; we do it by the `raft_address_map` pointer
introduced in an earlier commit.
Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
Closes#11759
* github.com:scylladb/scylladb:
direct_failure_detector: get rid of complex `endpoint_id` translations
service/raft: ping `raft::server_id`s, not `gms::inet_address`es
service/raft: store `raft_address_map` reference in `direct_fd_pinger`
gms: gossiper: move `direct_fd_pinger` out to a separate service
gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
The test runs remove_node command with background ddl workload.
It was written in an attempt to reproduce scylladb#11228 but seems to have
value on its own.
The if_exists parameter has been added to the add_table
and drop_table functions, since the driver could retry
the request sent to a removed node, but that request
might have already been completed.
Function wait_for_host_known waits until the information
about the node reaches the destination node. Since we add
new nodes at each iteration in main, this can take some time.
A number of abort-related options was added
SCYLLA_CMDLINE_OPTIONS as it simplifies
nailing down problems.
Closes#11734
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Closes#11791
* github.com:scylladb/scylladb:
test/raft: raft_address_map_test: add replication test
service/raft: raft_address_map: replicate non-expiring entries to other shards
service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
service/raft: turn raft_address_map into a service
We capture `key` by reference, but it is in a another continuation.
Capture it by value, and avoid the default capture specification.
Found by clang 15 + asan + aarch64.
Closes#11884
This patch adds a reproducing test for issue #11588, which is still open
so the test is expected to fail on Scylla ("xfail), and passes on Cassandra.
The test shows that Scylla allows an out-of-range value to be written to
timestamp column, but then it can't be read back.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11864
Rename to scylla tables. Less typing and more up-to-date.
By default it now only lists tables from local shard. Added flag -a
which brings back old behaviour (lists on all shards).
Added -u (only list user tables) and -k (list tables of provided
keyspace only) filtering options.
Since the end bound is exclusive, the end position should be
before_key(), not after_key().
Affects only tests, as far as I know, only there we can get an end
bound which is a clustering row position.
Would cause failures once row cache is switched to v2 representation
because of violated assumptions about positions.
Introduced in 76ee3f029cCloses#11823
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
In this series fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes#11801Closes#11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
The test wants to see that no allocations larger than 128k are present,
but sets the warning threshold to exactly 128k. Due to an off-by-one in
Seastar, this went unnoticed. However, now that the off-by-one in Seastar
is fixed [1], this test starts to fail.
Fix by setting the warning threshold to 128k + 1.
[1] 429efb5086Closes#11817
This PR adds some unit tests for the `expr::evaluate()` function.
At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request.
I found a few places where I think it would be better to behave in a different way, but nothing serious.
Closes#11815
* github.com:scylladb/scylladb:
test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
cql3: expr: Add unit tests for bind_variable validation of collections
cql3: expr: Add test for subscripted list and map
cql3: expr: Add test for usertype_constructor
cql3: expr: Add test for tuple_constructor
cql3: expr: Add tests for evaluation of collection constructors
cql3: expr: Add tests for evaluation of column_values and bind_variables
cql3: expr: Add constant evaluation tests
test/boost: Add expr_test_utils.hh
cql3: Add ostream operator for raw_value
cql3: add is_empty_value() to raw_value and raw_value_view
expr_test_utils.hh was a header file with helper methods for
expression tests. All functions were inline, because I didn't
know how to create and link a .cc file in test/boost.
Now the header is split into expr_test_utils.hh and expr_test_utils.cc
and moved to test/lib, which is designed to keep this kind of files.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
"
Snitch was the junction of several services' deps because it was the
holder of endpoint->dc/rack mappings. Now this information is all on
topology object, so snitch can be finally made main-local
"
* 'br-deglobalize-snitch' of https://github.com/xemul/scylla:
code: Deglobalize snitch
tests: Get local reference on global snitch instance once
gossiper: Pass current snitch name into checker
snitch: Add sharded<snitch_ptr> arg to reset_snitch()
api: Move update_snitch endpoint
api: Use local snitch reference
api: Unset snitch endpoints on stop
storage_service: Keep local snitch reference
system_keyspace: Don't use global snitch instance
snitch: Add const snitch_ptr::operator->()