This patch adds a simple reproducer for a regression in Scylla 5.4 caused
by commit 432cb02, breaking LIMIT support in GROUP BY.
Refs #17237
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17275
Reimplements stop/start sequence using rolling_restart() which is safe
with regards to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.
The test was observed to fail in CI like this:
```
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.157.135.26:9042 datacenter1>: ConnectionException('Pool for 127.157.135.26:9042 is shutdown')})
...
@pytest.mark.repair
@pytest.mark.asyncio
async def test_tablet_missing_data_repair(manager: ManagerClient):
...
for idx in range(0,3):
s = servers[idx].server_id
await manager.server_stop_gracefully(s, timeout=120)
> await check()
```
Hopefully: Fixes#17107Closesscylladb/scylladb#17252
* github.com:scylladb/scylladb:
test: py: tablets: Fix flakiness of test_tablet_missing_data_repair
test: pylib: manager_client: Wait for driver to catch up in rolling_restart()
test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.
So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.
Fixes: #17241Closesscylladb/scylladb#17242
This change introduces a new test that verifies the
functionality related to tablet_count metric.
It checks if tablet_count metric is correctly reported
and updated when new tables are created, when tables
are dropped and when `move_tablet` is executed.
Refs: scylladb#16131
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#17165
Recently we added a trick to allow running cql-pytests either with or
without tablets. A single fixture test_keyspace uses two separate
fixtures test_keyspace_tablets or test_keyspace_vnodes, as requested.
The problem is that even if test_keyspace doesn't use its
test_keyspace_tablets fixture (it doesn't, if the test isn't
parameterized to ask for tablets explicitly), it's still a fixture,
and it causes the test to be skipped. This causes every test to be
skipped when running on Cassandra or old Scylla which doesn't support
tablets.
The fix is simple - the internal fixture test_keyspace_tablets should
yield None instead of skipping. It is the caller, test_keyspace, which
now skips the test if tablets are requested but test_keyspace_tablets
is None.
Fixes#17266
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17267
In particular, `inet_address(const sstring& addr)` is
dangerous, since a function like
`topology::get_datacenter(inet_address ep)`
might accidentally convert a `sstring` argument
into an `inet_address` (which would most likely
throw an obscure std::invalid_argument if the datacenter
name does not look like an inet_address).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#17260
Reimplement stop/start sequence using rolling_restart() which is safe
with regards to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.
Hopefully: Fixes#17107
This patch adds a reproducer test for an issue #16382.
See scylladb/seastar#2044 for details of the problem.
The test is enabled only in dev mode since it requires
error injection mechanism. The patch adds a new injection
into storage_proxy::handle_read to simulate the problem
scenario - the node is shutting down and there are some
unfinished pending replica requests.
Closesscylladb/scylladb#16776
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.
Upgrade is triggered by sending `POST
/storage_service/raft_topology/upgrade` request. This causes the
topology coordinator to start who drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, topology
coordinator runs normally.
Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.
The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, perform `nodetool
removenode` on the dead nodes, clean up some state on the nodes and
restart them so that they automatically rebuild the group 0. Raft
topology fits into existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding the group 0, upgrade
needs to be triggered again.
Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. In order to allow it, nodes
now ask the cluster about the mode they should use to join before
proceeding by using a new `JOIN_NODE_QUERY` RPC.
The procedure is explained in more detail in `topology-over-raft.md`.
Fixes: https://github.com/scylladb/scylladb/issues/15008Closesscylladb/scylladb#17077
* github.com:scylladb/scylladb:
test/topology_custom: upgrade/recovery tests for topology on raft
cdc/generation_service: in legacy mode, fall back to raft tables
system_keyspace: add read_cdc_generation_opt
cdc/generation_service: turn off gossip notifications in raft topo mode
cql_test_env: move raft_topology_change_enabled var earlier
group0_state_machine: pull snapshot after raft topology feature enabled
storage_service: disable persistent feature enabler on upgrade
storage_service: replicate raft features to system.peers
storage_service: gossip tokens and cdc generation in raft topology mode
API: add api for triggering and monitoring topology-on-raft upgrade
storage_service: infer which topology operations to use on startup
storage_service: set the topology kind value based on group 0 state
raft_group0: expose link to the upgrade doc in the header
feature_service: fall back to checking legacy features on startup
storage_service: add fiber for tracking the topology upgrade progress
gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
topology_coordinator: implement core upgrade logic
topology_coordinator: extract top-level error handling logic
storage_service: initialize discovery leader's state earlier
topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
topology_state_machine: introduce upgrade_state
storage_service: disallow topology ops when upgrade is in progress
raft_group0_client: add in_recovery method
storage_service: introduce join_node_query verb
raft_group0: make discover_group0 public
raft_group0: filter current node's IP in discover_group0
raft_group0: remove my_id arg from discover_group0
storage_service: make _raft_topology_change_enabled more advanced
docs: document raft topology upgrade and recovery
per its description, "`/storage_service/describe_ring/`" returns the
token ranges of an arbitrary keyspace. actually, it returns the
first keyspace which is of non-local-vnode-based-strategy. this API
is not used by nodetool, neither is it exercised in dtest.
scylla-manager has a wrapper for this API though, but that wrapper
is not used anywhere.
in this change, this API is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17197
Adds three tests for the new upgrade procedure:
- test_topology_upgrade - upgrades a cluster operating in legacy mode to
use raft topology operations,
- test_topology_recovery_basic - performs recovery on a three-node
cluster, no node removal is done,
- test_topology_majority_loss - simulates a majority loss scenario, i.e.
removed two nodes out of three, performs recovery to rebuild the
raft topology state and re-add two nodes back.
When a node enters recovery after being in raft topology mode, topology
operations switch back to legacy mode. We want CDC to keep working when
that happens, so we need for the legacy code to be able to access
generations created back in raft mode - so that the node can still
properly serve writes to CDC log tables.
In order to make this possible, modify the legacy logic to also look for
a cdc generation in raft tables, if it is not found in legacy tables.
In raft topology mode CDC information is propagated through group 0.
Prevent the generation service from reacting to gossiper notifications
after we made the switch to raft mode.
Also implementing tablet support, which basically just means that a new
table parameter is also accepted and forwarded to the API, in addition
to the existing keyspace one.
Currently, error handling is done via catching
http::unexpected_status_error and re-throwing an std::runtime_error.
Turns out this no longer works, because this error will only be thrown
by the http client, if the request had an expected reply code set.
The scylla_rest_client doesn't set an expected reply code, so this
exception was never thrown for some time now.
Furthermore, even when the above worked, it was not too user-friendly as
the error message only included the reply-code, but not the reply
itself.
So in this patch this is fixed:
* The handling of http::unexpected_status_error is removed, we don't
want to use this mechanism, because it yields very terse error
messages.
* Instead, the status code of the request is checked explicitely and all
cases where it is not 200 are handled.
* A new api_request_failed exception is added, which is throw for all
non-200 statuses with the extracted error message from the server (if
any).
* This exception is caught by main, the error message is printed and
scylla-nodetool returns with a new distinct error-code: 4.
With this, all cases where the request fails on ScyllaDB are handled and
we shouldn't hit cases where a nodetool command fails with some
obscure JSON parsing error, because the error reply has different JSON
schema than the expected happy-path reply.
Similar to the existing check_nodetool_fails_with() but checks that all
error messages from expected_errors are contained in stderr.
While at it, use list as the typing hint, instead of typing.List.
To facilitate testing the state of cache after the update is preempted
at various points, pass a preemption_source& to update() instead of
calling the reactor directly.
In release builds, the calls to preemption_source methods should compile
to the same direct reactor calls as today. Only in dev mode they should
add an extra branch. (However, the `preemption_source&` argument has
to be shoveled in any mode).
The table query param is added to get the describe_ring result for a
given table.
Both vnode table and tablet table can use this table param, so it is
easier for users to user.
If the table param is not provided by user and the keyspace contains
tablet table, the request will be rejected.
E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"
Refs #16509Closesscylladb/scylladb#17118
* github.com:scylladb/scylladb:
tablets: Convert to use the new version of for_each_tablet
storage_service: Add describe_ring support for tablet table
storage_service: Mark host2ip as const
tablets: Add for_each_tablet_gently
Reduces footprint from hundreds of MB to a very few MB.
Issue could be reproduced with:
./build/dev/test/boost/mutation_writer_test --run_test=test_token_group_based_splitting_mutation_writer -- -m 500M --smp 1 --random-seed 1848215131
Fixes#17076.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17187
When creating a keyspace, scylla allows setting RF value smaller than there are nodes in the DC. With vnodes, when new nodes are bootstrapped, new tokens are inserted thus catching up with RF. With tablets, it's not the case as replica set remains unchanged.
With tablets it's good chance not to mimic the vnodes behavior and require as many nodes to be up and running as the requested RF is. This patch implementes this in a lazy manned -- when creating a keyspace RF can be any, but when a new table is created the topology should meet RF requirements. If not met, user can bootstrap new nodes or ALTER KEYSPACE.
closes: #16529Closesscylladb/scylladb#17079
* github.com:scylladb/scylladb:
tablets: Make sure topology has enough endpoints for RF
cql-pytest: Disable tablets when RF > nodes-in-DC
test: Remove test that configures RF larger than the number of nodes
keyspace_metadata: Include tablets property in DESCRIBE
In this PR we add the tests for two scenarios, related to the use of IPs in raft topology.
* When the replaced node transitions to the `LEFT` state we used to
remove the IP of such node from gossiper. If we replace with same IP,
this caused the IP of the new node to be removed from gossiper. This
problem was fixed by #16820, this PR adds a regression test for it.
* When a node is restarted after decommissioning some other node, the
restarting node tries to apply the raft log, this log contains a
record about the decommissioned node, and we got stuck trying to resolve
its IP. This was fixed by #16639 - we excluded IPs from the RAFt log
application code and moved it entirely to host_id-s. This PR adds a
regression test for this case.
Closesscylladb/scylladb#15967Closesscylladb/scylladb#14803Closesscylladb/scylladb#17180
* github.com:scylladb/scylladb:
test_topology_ops: check node restart after decommission
test_replace_reuse_ip: check other servers see the IP
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incrmented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes#17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17164
also disable some more warnings which are failing the build after `-Wextra` is enabled. we can fix them on a case-by-case basis, if they are geniune issues. but before that, we just disable them.
this goal of this change is to reduce the discrepancies between the compile options used by CMake and those used by configure.py. the side effect is that we enable some more warning enabeld by `-Wextra`, for instance, `-Wsign-compare` is enable now. for the full list of the enabled warnings when building with Clang, please see https://clang.llvm.org/docs/DiagnosticsReference.html#wextra.
Closesscylladb/scylladb#17131
* github.com:scylladb/scylladb:
configure.py: add -Wextra to cflags
test/tablets: do not compare signed and unsigned
There used to be a problem with restarting a node after
decommissioning some other node - the restarting node
tries to apply the raft log, this log contains a record
about the decommissioned node, and we got stuck trying
to resolve its IP.
This was fixed in #16639 - we excluded IPs from
the RAFt log application code and moved it entirely
to host_id-s.
In this commit we add a regression test
for this case. We move the decommission_node
call before server_stop/server_start. We need
to add one more server to retain majority when
the node is decommissioned, otherwise the topology
coordinator won't migrate from the stopped node
before replacing it, and we'll get an error.
closes#14803
The replaced node transitions to LEFT state, and
we used to remove the IPs of such nodes from gossiper.
If we replace with same IP, this caused the IP of the
new node to be removed from gossiper.
This problem was fixed by #16820, this commit
adds a regression test for it.
closes#15967
After changing `left_token_ring` from a node state to a transition
state in scylladb/scylladb#17009, we do the same for
`rollback_to_normal`. `rollback_to_normal` was created as a node
state because `left_token_ring` was a node state.
This change will allow us to distinguish a failed removenode from
a failed decommission in the `rollback_to_normal` handler.
Currently, we use the same logic for both of them, so it's not
required. However, this might change, as it has happened with the
decommission and the failed bootstrap/replace in the
`left_token_ring` state (scylladb/scylladb#16797). We are making
this change now because it would be much harder after branching.
Fixesscylladb/scylladb#17032Closesscylladb/scylladb#17136
* github.com:scylladb/scylladb:
docs: dev: topology-over-raft: align indentation
docs: dev: topology-over-raft: document the rollback_to_normal state
topology_coordinator: improve logs in rollback_to_normal handler
raft topology: make rollback_to_normal a transition state
In a previous PR (https://github.com/scylladb/scylladb/pull/16840), we enabled tablets by default when running the cql-pytest suite. To handle tests which are failing with tablets enabled, we used a new fixture, `xfail_tablets` to mark these as xfail. This means that we effectively lost test coverage, as these tests can now freely fail and no-one will notice if this is due to a new regression. To restore test coverage, this PR re-enables all the previously disabled tests, by parametrizing each one of them to run with both vnodes and tablets, and targetedly mark as xfail, only the tablet variant. After these tests are fixed with tablets (or the underlying functionality they test is fixed to work with tablets), we will run them with both vnodes and tablets, because these tests apparently *do* care which replication method is used.
Together with https://github.com/scylladb/scylladb/pull/16802, this means all previously disabled test is re-enabled and no coverage is lost.
Closesscylladb/scylladb#16945
* github.com:scylladb/scylladb:
test/cql-pytest: conftest.py: remove xfail_tablets fixture
test/cql-pytest: test_tombstone_limit.py: re-enable disabled tests
test/cql-pytest: test_describe.py: re-enable disabled tests
test/cql-pytest: test_cdc.py: re-enable disabled tests
test/cql-pytest: add parameter support to test_keyspace
When creating a keyspace, scylla allows setting RF value smaller than
there are nodes in the DC. With vnodes, when new nodes are bootstrapped,
new tokens are inserted thus catching up with RF. With tablets, it's not
the case as replica set remains unchanged.
With tablets it's good chance not to mimic the vnodes behavior and
require as many nodes to be up and running as the requested RF is. This
patch implementes this in a lazy manned -- when creating a keyspace RF
can be any, but when a new table is created the topology should meet RF
requirements. If not met, user can bootstrap new nodes or ALTER KEYSPACE.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the cql-pytest-s run agains single scylla node, but
new_random_keyspace() helper may request RF in the rage of 1 through 6,
so tablets need to be explicitly disabled when the RF is too large
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When tablets are enabled and a keyspace being described has them
explicitly disabled or non-automatic initial value of zero, include this
into the returned describe statement too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
Closesscylladb/scylladb#17130
* github.com:scylladb/scylladb:
treewide: replace seastar::future::get0() with seastar::future::get()
sstable: capture return value of get0() using auto
utils: result_loop: define result_type with decayed type
[avi: add another one that snuck in while this was cooking]
After changing `left_token_ring` from a node state to a transition
state in scylladb/scylladb#17009, we do the same for
`rollback_to_normal`. `rollback_to_normal` was created as a node
state because `left_token_ring` was a node state.
This change will allow us to distinguish a failed removenode from
a failed decommission in the `rollback_to_normal` handler.
Currently, we use the same logic for both of them, so it's not
required. However, this might change, as it has happened with the
decommission and the failed bootstrap/replace in the
`left_token_ring` state (scylladb/scylladb#16797). We are making
this change now because it would be much harder after branching.
The change also simplifies the code in
`topology_coordinator:rollback_current_topology_op`.
Moving the `rollback_to_normal` handler from
`handle_node_transition` to `handle_topology_transition` created a
large diff. There is only one change - adding
`auto node = get_node_to_work_on(std::move(guard));`.
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
this change should silence following warning:
```
test/boost/tablets_test.cc:1600:27: error: comparison of integers of different signs: 'int' and 'unsigned int' [-Werror,-Wsign-compare]
19:47:04 for (int i = 0; i < smp::count * 20; i++) {
19:47:04 ~ ^ ~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The latter suite is now tablets-aware and tablets cases from the former one can happily work with single shared scylla instance
Closesscylladb/scylladb#17101
* github.com:scylladb/scylladb:
test/topology_custom: Remove test_tablets.py
test/topology: Move test_tablet_change_initial_tablets
test/topology: Move test_tablet_explicit_disabling
test/topology: Move test_tablet_default_initialization
test/topology: Move test_tablet_change_replication_strategy
test/topology: Move test_tablet_change_replication_vnode_to_tablets
cql-pytest: Add skip_without_tablets fixture
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.
To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.
Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.
But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.
Closesscylladb/scylladb#17106