Debug mode shuffles task position in the queue. So the following is possible:
1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there
2) the smp::submit_to(0, ...) from shard 1 called by the test submits the call
3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor
4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..)
5) shard 0 inserts the completion from (4) before the task from (1)
6) the check on shard 0: m.find(id1) fails because the timer is not expired yet
To fix that, wait for timer expiration on shard 0, so that the test
doesn't depend on task execution order.
Note: I was not able to reproduce the problem locally using test.py --mode
debug --repeat 1000.
It happens in Jenkins very rarely, which is expected since the scenario
that leads to it is quite unlikely.
Fixes SCYLLADB-1265
Closes scylladb/scylladb#29290
More efficient than 100 pings.
There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and the
system will appear unresponsive to mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.
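The idea can be sketched in plain C++ as below. This is only an
illustration, not the code in this patch: primary_map, replica_map,
sync_from and the version counter are hypothetical names, and a direct
function call stands in for the real cross-shard call. The point it
shows is that one call which copies the current state covers every
update made before it, so a slow shard needs one call to catch up
rather than one call per queued update.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical types for illustration; not ScyllaDB's address map.
using host_id = std::uint64_t;
using ip_addr = std::string;

// Authoritative copy (think "shard 0"). Every update bumps the version.
struct primary_map {
    std::map<host_id, ip_addr> entries;
    std::uint64_t version = 0;

    void set(host_id id, ip_addr ip) {
        entries[id] = std::move(ip);
        ++version;
    }
};

// Per-shard copy. Remembers the last version it has caught up to.
struct replica_map {
    std::map<host_id, ip_addr> entries;
    std::uint64_t version = 0;
    std::uint64_t cross_shard_calls = 0;

    // Stand-in for a single cross-shard call. It copies the whole
    // current state, so it covers all prior updates at once. If the
    // replica is already up to date, it does nothing.
    void sync_from(const primary_map& p) {
        if (version == p.version) {
            return;
        }
        entries = p.entries;
        version = p.version;
        ++cross_shard_calls;
    }
};
```

In this sketch a sync that fails or is delayed loses nothing: the next
sync_from() still copies the latest state, so later updates do not
depend on earlier calls having succeeded.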
Fixes #26865
Fixes #26835