Node state transitions

introduction:

The lifecycle of a node comprises instantiation and runtime state. The runtime state of an uninstantiated node is UNDEFINED. An instantiated node is typically either STOPPED, RUNNING, or SUSPENDED. The other states reflect transitions, which are made explicit because there can be a lag between receipt of a control command and completion of the resulting state transition. This document describes valid runtime states, valid transitions, and the use of relative control state information by cooperating nodes.

node state transitions:

When a node has not yet been instantiated, or its control state does not map to one of the known states, it is UNDEFINED. The FAILED state is reserved for catastrophic failure of a node (the node cannot continue without manual intervention). A node can transition to a FAILED state from any other state, even though these are not all diagrammed.

When transitioning to the FAILED state, the intermediate FAILING state is optional. If the node wishes to perform some action before failure is complete, it must first set its state to FAILING, then take the action, and finally set its state to FAILED. Actions may take time to complete, especially under failure circumstances, and the transitional state gives the control system information in the interim.
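
The ordering described above can be sketched as follows. This is a minimal illustration, not SAND code: the `Node` class, its `fail` method, and the `history` list are hypothetical names introduced here, and treating the cleanup action as best effort is an assumption.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FAILING = "failing"
    FAILED = "failed"


class Node:
    """Hypothetical node used only to illustrate the FAILING/FAILED ordering."""

    def __init__(self):
        self.state = State.RUNNING
        self.history = [self.state]  # record of states, for illustration

    def _set_state(self, state):
        self.state = state
        self.history.append(state)

    def fail(self, cleanup=None):
        """Move to FAILED, passing through FAILING only if there is work to do."""
        if cleanup is not None:
            # Publish FAILING first so the control system can see that
            # failure handling is in progress while the action runs.
            self._set_state(State.FAILING)
            try:
                cleanup()
            except Exception:
                pass  # assumption: cleanup is best effort during failure
        self._set_state(State.FAILED)
```

A node with nothing to clean up goes straight to FAILED; a node that passes a `cleanup` callable is visible as FAILING for as long as that callable runs.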

Node state transitions:

    images/StateTrans.gif

Meanings inherent in the states and valid transitions:

  • undefined: no state information for this node is available
  • undefined -> stopped: standard node instantiation
  • undefined -> failing: catastrophic failure on instantiation
  • stopped: the node is instantiated but must be initialized to become operational. It is not processing input or producing output of any kind.
  • stopped -> starting: initialization begins
  • starting -> failing
  • starting -> running: successful initialization
  • running: the node is fully operational
  • running -> stopping: the node is being shut down normally; no new input is being accepted, outstanding work is being completed, and resources are being relinquished.
  • stopping -> failing
  • stopping -> stopped: all outstanding operations have been completed and resources have been relinquished.
  • running -> failing: a catastrophic failure occurred during processing
  • running -> suspending: the node has been suspended and is cleaning up its remaining outstanding synchronous input
  • suspending -> failing
  • suspending -> suspended: processing of remaining synchronous input has been completed
  • suspended: synchronous input is not accepted; resources requiring extensive initialization continue to be held
  • suspended -> failing
  • suspended -> resuming: setting up for acceptance of synchronous input.
  • resuming -> failing
  • resuming -> running: all suspended functions restored
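
The transitions listed above can be captured in a small lookup table. The sketch below is illustrative only: `State`, `NORMAL_TRANSITIONS`, and `is_valid_transition` are names introduced here, not part of SAND. It assumes that FAILED is terminal (the text says recovery needs manual intervention) and that any other state may move to FAILING or directly to FAILED, since FAILING is optional.

```python
from enum import Enum


class State(Enum):
    UNDEFINED = "undefined"
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    FAILING = "failing"
    FAILED = "failed"


# Non-failure transitions, taken from the list above.
NORMAL_TRANSITIONS = {
    State.UNDEFINED: {State.STOPPED},
    State.STOPPED: {State.STARTING},
    State.STARTING: {State.RUNNING},
    State.RUNNING: {State.STOPPING, State.SUSPENDING},
    State.STOPPING: {State.STOPPED},
    State.SUSPENDING: {State.SUSPENDED},
    State.SUSPENDED: {State.RESUMING},
    State.RESUMING: {State.RUNNING},
    State.FAILING: set(),
    State.FAILED: set(),
}


def is_valid_transition(current, target):
    """Return True if moving from `current` to `target` is permitted."""
    if target in (State.FAILING, State.FAILED):
        # Failure may begin from any state except FAILED itself
        # (assumption: FAILED is terminal), and FAILING may only
        # advance to FAILED, never repeat.
        return current is not State.FAILED and current is not target
    return target in NORMAL_TRANSITIONS[current]
```

Keeping the failure rule in the function rather than in the table mirrors the text: failure transitions exist from every state even though they are not all diagrammed.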

For brevity, the transition of any state xyz -> failing is understood to mean:

A suspended node does not process synchronous input (input declared using @receive). This input is automatically suspended when the node changes to the SUSPENDING state. A node may optionally process asynchronous input (input declared via @subscribe). This input is not automatically suspended; the node can choose to continue processing this input or ignore it as application logic dictates.
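
As a rough illustration of this distinction, the sketch below models the two kinds of input on a single node. In SAND the suspension of synchronous input happens automatically; here it is modeled inside the node purely for clarity. The `Node` class, `on_receive`, `on_subscribe`, and the `async_while_suspended` flag are hypothetical stand-ins for handlers declared with @receive and @subscribe, and applying the same asynchronous policy during SUSPENDING as during SUSPENDED is an assumption.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"


# States in which new synchronous input is held back.
SYNC_BLOCKED = {State.SUSPENDING, State.SUSPENDED}


class Node:
    def __init__(self, async_while_suspended=True):
        self.state = State.RUNNING
        # Application choice: keep handling asynchronous input while suspended?
        self.async_while_suspended = async_while_suspended
        self.processed = []

    def on_receive(self, message):
        """Hypothetical stand-in for a handler declared with @receive."""
        if self.state in SYNC_BLOCKED:
            return False  # synchronous input is not processed
        self.processed.append(("sync", message))
        return True

    def on_subscribe(self, message):
        """Hypothetical stand-in for a handler declared with @subscribe."""
        if self.state in SYNC_BLOCKED and not self.async_while_suspended:
            return False  # the application chose to ignore it
        self.processed.append(("async", message))
        return True
```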

notes on heartbeat requirements:

Heartbeat monitoring is not explicitly supported within SAND, but is generally expected to be available as part of the underlying control technology. The reasoning behind this is:

  1. You don't need a heartbeat for asynchronous communication. Message delivery is guaranteed, so if the subscriber is up it will get the message. If the sender goes down, and it is expected to be producing regular output, then its output is already a heartbeat and can be detected. If the sender does not produce regular output then its health can be monitored using the control system directly.
  2. You don't need a heartbeat for synchronous communication. The call either succeeds or not, and state information can be pulled from the call failure details.
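
The first point, that regular output already serves as a heartbeat, can be sketched as a simple watchdog on the receiving side. Everything here is hypothetical: `OutputWatchdog`, its method names, and the `tolerance` factor are introduced for illustration, and timestamps are passed in explicitly (in seconds) rather than read from a clock.

```python
class OutputWatchdog:
    """Flags a sender as overdue when its regular output stops arriving."""

    def __init__(self, expected_interval, tolerance=2.0):
        self.expected_interval = expected_interval  # seconds between outputs
        self.tolerance = tolerance  # assumption: allow this many intervals of slack
        self.last_seen = None

    def record_output(self, timestamp):
        """Call whenever a message from the sender arrives."""
        self.last_seen = timestamp

    def is_overdue(self, now):
        """Return True if the sender has been silent for too long."""
        if self.last_seen is None:
            return False  # nothing received yet, so there is no baseline
        return (now - self.last_seen) > self.expected_interval * self.tolerance
```

A subscriber would call `record_output` from its message handler and check `is_overdue` periodically; when it reports True, the control system can be queried for the sender's actual state.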

With the timer utility, any node can be programmed to send out a regular signal (as with Stats messages). This is useful in cases where the application itself needs to be reactive to specific situations.

helper nodes and state:

When a node requires other nodes to perform its work, some degree of cross-node state checking is necessary for due diligence during the initialization process. The amount of checking is application dependent, but tends to break along two categories of node relationships:

The common logic to verify child nodes is supported within SAND:

implementation notes:

If the underlying control system provides comprehensive state management using a standard that differs from the SAND state specification, the SAND environment is expected to provide both the SAND state management and a mapping to and from the additional standard. A best-effort mapping between the standards provides the least astonishment and the greatest utility to those responsible for system maintenance, regardless of the relative expressive power of the two standards.
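
A best-effort mapping of this kind might look like the sketch below. The external standard shown (`ExternalState` with five coarse states) is entirely hypothetical and stands in for whatever the underlying control system defines; the choice of which coarse state each SAND transition state collapses to is likewise an assumption.

```python
from enum import Enum


class SandState(Enum):
    UNDEFINED = "undefined"
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    FAILING = "failing"
    FAILED = "failed"


class ExternalState(Enum):
    """Hypothetical, less expressive state set of an underlying control system."""
    UNKNOWN = "unknown"
    DOWN = "down"
    UP = "up"
    PAUSED = "paused"
    ERROR = "error"


# SAND -> external: transition states collapse onto the nearest coarse state.
TO_EXTERNAL = {
    SandState.UNDEFINED: ExternalState.UNKNOWN,
    SandState.STOPPED: ExternalState.DOWN,
    SandState.STARTING: ExternalState.DOWN,   # not yet operational
    SandState.RUNNING: ExternalState.UP,
    SandState.STOPPING: ExternalState.DOWN,   # no longer accepting input
    SandState.SUSPENDING: ExternalState.PAUSED,
    SandState.SUSPENDED: ExternalState.PAUSED,
    SandState.RESUMING: ExternalState.PAUSED,
    SandState.FAILING: ExternalState.ERROR,
    SandState.FAILED: ExternalState.ERROR,
}

# External -> SAND: each coarse state maps back to a steady SAND state.
FROM_EXTERNAL = {
    ExternalState.UNKNOWN: SandState.UNDEFINED,
    ExternalState.DOWN: SandState.STOPPED,
    ExternalState.UP: SandState.RUNNING,
    ExternalState.PAUSED: SandState.SUSPENDED,
    ExternalState.ERROR: SandState.FAILED,
}


def to_external(state):
    return TO_EXTERNAL[state]


def from_external(state):
    return FROM_EXTERNAL[state]
```

The mapping is lossy in one direction: steady states round-trip exactly, while transition states come back as the steady state they were collapsed onto.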

© 2002 SAND Services Inc.
All Rights Reserved.