test/cluster: fix server_add/server_start hanging when starting in maintenance mode

When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering
maintenance mode") instead of sd_notify("STATUS=serving"), and does not
open the standard CQL port. This caused two independent bugs once the
default expected server-up state was changed to ServerUpState.SERVING:

1. poll_status() resolved serving_signal to False on the maintenance
   notification, so check_serving_notification() would never return True,
   and start() would time out waiting for SERVING.

2. The readiness check in start() was guarded by
   `server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in
   maintenance mode (the standard CQL port is not open). Even if bug 1
   were fixed, SERVING would never be recognized.
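
The effect of the old guard in bug 2 can be sketched as follows. This is an
illustration only: the enum is a simplified stand-in whose numeric values are
assumed (only the ordering implied by the `>=` comparison matters), and
`old_check` is a hypothetical helper, not a function in the test framework.

```python
from enum import IntEnum


class ServerUpState(IntEnum):
    # Simplified stand-in; the numeric values are assumptions for illustration.
    PROCESS_STARTED = 1
    HOST_ID_QUERIED = 2
    CQL_ALTERNATOR_QUERIED = 3
    SERVING = 4


def old_check(state: ServerUpState, serving_notified: bool) -> ServerUpState:
    """Old guard: the sd_notify result is consulted only once CQL/Alternator was queried."""
    if state >= ServerUpState.CQL_ALTERNATOR_QUERIED and serving_notified:
        return ServerUpState.SERVING
    return state
```

In maintenance mode the state cannot advance to CQL_ALTERNATOR_QUERIED, so a
call such as `old_check(ServerUpState.HOST_ID_QUERIED, True)` leaves the state
unchanged even though the notification has arrived.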

Fix both:

- Treat STATUS=entering maintenance mode as a successful readiness signal
  in poll_status(), resolving serving_signal to True just like
  STATUS=serving. Both mean "all configured ports are now open".

- Remove the CQL_ALTERNATOR_QUERIED precondition from the
  check_serving_notification() call in start(). The sd_notify signal is
  authoritative: Scylla sends it only when fully ready, regardless of
  which ports it opened. No CQL precondition is needed.
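
A minimal sketch of the first fix, assuming the notifications arrive as
datagrams on a Unix socket. The names `READY_STATUSES`,
`is_ready_notification` and `wait_for_ready` are hypothetical; the real logic
lives in poll_status() and resolves a future instead of returning a value.

```python
import socket

# Both statuses are treated as "startup complete" (hypothetical constant name).
READY_STATUSES = ("STATUS=serving", "STATUS=entering maintenance mode")


def is_ready_notification(message: str) -> bool:
    """Return True if an sd_notify payload signals that startup is complete."""
    return any(status in message for status in READY_STATUSES)


def wait_for_ready(sock: socket.socket, timeout: float = 1.0) -> bool:
    """Read datagrams until a readiness status arrives or a receive times out."""
    sock.settimeout(timeout)
    while True:
        try:
            data = sock.recv(4096)
        except socket.timeout:
            return False
        if is_ready_notification(data.decode("utf-8", errors="replace")):
            return True
```

Keeping the accepted statuses in one tuple means the serving and maintenance
cases cannot drift apart the way the two separate branches did.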

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Nadav Har'El
2026-05-05 18:12:51 +03:00
parent 597838c501
commit af03f0e8c4

@@ -835,8 +835,9 @@ class ScyllaServer:
         loop.call_soon_threadsafe(f.set_result, True)
         return
     if 'STATUS=entering maintenance mode' in message:
-        logger.debug("Receive sd_notify 'entering maintenance mode'")
-        break
+        logger.debug("Received sd_notify 'entering maintenance mode' message")
+        loop.call_soon_threadsafe(f.set_result, True)
+        return
 except socket.timeout:
     pass
 except Exception as e:
@@ -972,8 +973,11 @@ class ScyllaServer:
 if server_up_state == ServerUpState.PROCESS_STARTED:
     server_up_state = ServerUpState.HOST_ID_QUERIED
 server_up_state = await self.get_cql_alternator_up_state() or server_up_state
-# Check for SERVING state (sd_notify "serving" message)
-if server_up_state >= ServerUpState.CQL_ALTERNATOR_QUERIED and self.check_serving_notification():
+# Check for SERVING state via sd_notify. This is authoritative: Scylla sends
+# STATUS=serving once all configured listeners are ready, and
+# STATUS=entering maintenance mode once the maintenance socket is ready.
+# Both mean the server is fully started and we don't need to wait further.
+if self.check_serving_notification():
     server_up_state = ServerUpState.SERVING
 if server_up_state >= expected_server_up_state:
     if expected_error is not None: