Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.
(cherry picked from commit c1d0ee2bce)
Extract the function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement expected_error feature.
(cherry picked from commit a4411e9ec4)
The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.
(cherry picked from commit 21b505e67c)
Instead of decommission of initial cluster, use custom cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13589
(cherry picked from commit ce87aedd30)
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.
Add a proof-of-concept test to be removed later.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13342
(cherry picked from commit e3b462507d)
Move long running topology tests out of `test_topology.py` and into their own files, so they can be run in parallel.
While there, merge simple schema tests.
Closes#12804
* github.com:scylladb/scylladb:
test/topology: rename topology test file
test/topology: lint and type for topology tests
test/topology: move topology ip tests to own file
test/topology: move topology test remove garbaje...
test/topology: move topology rejoin test to own file
test/topology: merge topology schema tests and...
test/topology: isolate topology smp params test
test/topology: move topology helpers to common file
(cherry picked from commit a24600a662)
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.
The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
Closes#12765
* github.com:scylladb/scylladb:
test/pylib: use larger timeout for decommission/removenode
test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
(cherry picked from commit e55f475db1)
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.
Fix tests using the previous helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 9ceb6aba81)
There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.
Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.
Closes#12686
(cherry picked from commit 40142a51d0)
After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.
Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.
Fixes: #12153Closes#12619
(cherry picked from commit fa9cf81af2)
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.
Fixes#12218Closes#12630
* github.com:scylladb/scylladb:
pytest: test start after ungraceful stop
test.py: enable force_schema_commit_log
(cherry picked from commit 5eadea301e)
Improve logging by printing the cluster at the end of each test.
Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.
Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.
Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.
Closes#12652
* github.com:scylladb/scylladb:
test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
test/topology: don't drop random_tables keyspace after a failed test
test/pylib: mark cluster as dirty after a failed test
test: pylib, topology: don't perform operations after test on a dirty cluster
test/pylib: print cluster at the end of test
(cherry picked from commit 2653865b34)
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784Closes#13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes#14139Closes#14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes#13545.
Closes#14134
(cherry picked from commit f51312e580)
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes#13917
(cherry picked from commit 6f4a68175b)
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes#12798)
Closes#13531
* github.com:scylladb/scylladb:
Merge 'Do not mask node operation errors' from Benny Halevy
Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
storage_service: Wait for normal state handler to finish in replace
storage_service: Wait for normal state handler to finish in bootstrap
storage_service: Send heartbeat earlier for node ops
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1
Closes: #13858
(cherry picked from commit 84ed95f86f)
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes https://github.com/scylladb/scylladb/issues/12462Closes#13894
* github.com:scylladb/scylladb:
tests: row_cache: Add reproducer for reader producing missing closing range tombstone
range_tombstone_change_generator: fix an edge case in flush()
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);
Fixes#13394.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes#13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
Fixes#13396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
Variant used by
streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))
as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.
static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());
Fixes#13397.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
Adds a reproducer for #12462.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the fix range_tombstone_change_generator::flush()
was used with end_of_range=true to produce the closing range_tombstone_change
and it did not handle correctly the case when there are two adjacent range
tombstones and flush(pos, end_of_range=true) is called such that pos is the
boundary between the two.
Cherry-picked from a717c803c7.
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes#12462
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes#13572Closes#13800
(cherry picked from commit 7fcc403122)
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes#13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13462
(cherry picked from commit cc42f00232)
Courtersy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB, as for it to happen, depends on evaluation order.
We haven't hit it yet as clang is left-to-right.
Fixes#13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13401
(cherry picked from commit d2d151ae5b)
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.
Fixes: #12939Closes#12963
* github.com:scylladb/scylladb:
test:boost: counter column parallelized aggregation test
service:forward_service: use long type when column is counter
(cherry picked from commit 61e67b865a)
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126Closes#13677
Fixes: https://github.com/scylladb/scylla-enterprise/issues/2852
(cherry picked from commit a8040306bb)
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is garanteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491Closes#13563
(cherry picked from commit 72003dc35c)
I've no idea why the quotes are there at all, it works even without
them. However, with quotes gdb-13 fails to find the _all_threads static
thread-local variable _unless_ it's printed with gdb "p" command
beforehand.
fixes: #13125
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13132
(cherry picked from commit 537510f7d2)
This series handles errors when aborting node operations and prints them rather letting them leak and be exposed to the user.
Also, cleanup the node_ops logging formats when aborting different node ops
and add more error logging around errors in the "worker" nodes.
Closes#12799
* github.com:scylladb/scylladb:
storage_service: node_ops_signal_abort: print a warning when signaling abort
storage_service: s/node_ops_singal_abort/node_ops_signal_abort/
storage_service: node_ops_abort: add log messages
storage_service: wire node_ops_ctl for node operations
storage_service: add node_ops_ctl class to formalize all node_ops flow
repair: node_ops_cmd_request: add print function
repair: do_decommission_removenode_with_repair: log ignore_nodes
repair: replace_with_repair: get ignore_nodes as unordered_set
gossiper: get_generation_for_nodes: get nodes as unordered_set
storage_service: don't let node_ops abort failures mask the real error
(cherry picked from commit 6373452b31)
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.
If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.
It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.
It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.
There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.
We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.
The race can still happen, but the time window is now much smaller.
The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.
Fixes#12989
Refs #12969Closes#13028
* github.com:scylladb/scylladb:
storage_service: node ops: Extract node_ops_insert() to reduce code duplication
storage_service: Make node operations safer by detecting asymmetric abort
storage_service: node ops: Add error injections
service: node_ops: Make watchdog and heartbeat intervals configurable
(cherry picked from commit 2b44631ded)
Similar to "storage_service: Wait for normal state handler to finish in
bootstrap", this patch enables the check on the replace procedure.
(cherry picked from commit 5856e69462)
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.
Consider this:
- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
bootstrap
- n2 fails to bootstrap
For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.
repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})
This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.
Fixes#12764Fixes#12956
(cherry picked from commit 53636167ca)