Apache Mesos - Task State Reasons

Task State Reasons

Some TaskStatus messages will arrive with the reason field set to a value that can allow frameworks to display better error messages and to implement special behaviour for some of the reasons.

For most reasons, the message field of the TaskStatus message will give a more detailed, human-readable error description.

Not all status updates will contain a reason.

Guidelines for Framework Authors

Frameworks that implement their own executors are free to set the reason field on any status messages they produce.

Note that executors can not generally rely on the fact that the scheduler will see the status update with the reason set by the executor, since only the latest update for each different task state is stored and re-transmitted. See in particular the description of REASON_RECONCILIATION below.

Most reasons describe conditions that can only be detected in the master or agent code, and will accompany automatically generated status updates from either of these.

For consistency with the existing usages of the different task reasons, we recommend that executors restrict themselves to the following subset if they use a non-default reason in their status updates.

REASON_TASK_CHECK_STATUS_UPDATED For executors that support running task checks, it is recommended to generate a status update with this reason every time the task check status changes, together with a human-readable description of the change in the message field.
REASON_TASK_HEALTH_CHECK_STATUS_UPDATED For executors that support running task health checks, it is recommended to generate a status update with this reason every time the health check status changes, together with a human-readable description of the change in the message field. Note: The built-in executors additionally send an update with this reason every time a health check is unhealthy.
REASON_TASK_INVALID For executors that implement their own task validation logic, this reason can be used when the validation check fails, together with a human-readable description of the failed check in the message field.
REASON_TASK_UNAUTHORIZED For executors that implement their own authorization logic, this reason can be used when authorization fails, together with a human-readable description in the message field.

Reference of Reasons Currently Used in Mesos

Deprecated Reasons

The reason REASON_COMMAND_EXECUTOR_FAILED is deprecated and will be removed in the future. It should not be referenced by newly written code.

Unused Reasons

The reasons REASON_CONTAINER_LIMITATION, REASON_INVALID_FRAMEWORKID, REASON_SLAVE_UNKNOWN, REASON_TASK_UNKNOWN and REASON_EXECUTOR_UNREGISTERED are not used as of Mesos 1.4.

Reasons for Terminal Status Updates

For these status updates, the reason indicates why the task state changed. Typically, a given reason will always appear together with the same state.

Typically they are generated by mesos when an error occurs that prevents the executor from sending its own status update messages.

Below, a partition-aware framework means a framework which has the Capability::PARTITION_AWARE capability bit set in its FrameworkInfo. Messages generated on the master will have the source field set to SOURCE_MASTER and messages generated on the agent will have it set to SOURCE_AGENT in the v1 API or SOURCE_SLAVE in the v0 API.

As of Mesos 1.4, the following reasons are being used.

For state TASK_FAILED

In status updates generated on the agent:

REASON_CONTAINER_LAUNCH_FAILED The task could not be launched because its container failed to launch.
REASON_CONTAINER_LIMITATION_MEMORY The container in which the task was running exceeded its memory allocation.
REASON_CONTAINER_LIMITATION_DISK The container in which the task was running exceeded its disk quota.
REASON_IO_SWITCHBOARD_EXITED The I/O switchboard server terminated unexpectedly.
REASON_EXECUTOR_REGISTRATION_TIMEOUT The executor for this task didn’t register with the agent within the allowed time limit.
REASON_EXECUTOR_REREGISTRATION_TIMEOUT The executor for this task lost connection and didn’t reregister within the allowed time limit.
REASON_EXECUTOR_TERMINATED The tasks’ executor terminated abnormally, and no more specific reason could be determined.

For state TASK_KILLED

In status updates generated on the master:

REASON_FRAMEWORK_REMOVED The framework to which this task belonged was removed.
Note: The status update will be sent out before the task is actually killed.
REASON_TASK_KILLED_DURING_LAUNCH This task, or a task within this task group, was killed before delivery to the agent.

In status updates generated on the agent:

REASON_TASK_KILLED_DURING_LAUNCH This task, or a task within this task group, was killed before delivery to the executor.
Note: Prior to version 1.5, the agent would in this situation sometimes send status updates with reason set to REASON_EXECUTOR_UNREGISTERED and sometimes without any reason set, depending on details of the timing of the executor launch and the kill command.

For state TASK_ERROR

In status updates generated on the master:

REASON_TASK_INVALID Task or resource validation checks failed.
REASON_TASK_GROUP_INVALID Task group or resource validation checks failed.
REASON_TASK_UNAUTHORIZED Task authorization failed on the master.
REASON_TASK_GROUP_UNAUTHORIZED Task group authorization failed on the master.

In status updates generated on the agent:

REASON_TASK_UNAUTHORIZED Task authorization failed on the agent.
REASON_TASK_GROUP_UNAUTHORIZED Task group authorization failed on the agent.

For state TASK_LOST

In status updates generated on the master:

REASON_SLAVE_DISCONNECTED The agent on which the task was running disconnected, and didn’t reconnect in time.
The task was part of an accepted offer, but the agent sending the offer disconnected in the meantime.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.
REASON_MASTER_DISCONNECTED The task was part of an accepted offer which couldn’t be sent to the master, because it was disconnected.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.
Note: Despite the source being set to SOURCE_MASTER , the message is not sent from the master but locally from the scheduler driver. Note: This reason is only used in the v0 API.
REASON_SLAVE_REMOVED The agent on which the task was running was removed.
The task was part of an accepted offer, but the agent sending the offer was disconnected in the meantime.
Note: For partition-aware frameworks, the state will be to TASK_DROPPED instead.
The agent on which the task was running was marked unreachable.
Note: For partition-aware frameworks, the state will be TASK_UNREACHABLE instead.
REASON_RESOURCES_UNKNOWN The task was part of an accepted offer which used checkpointed resources that are not known to the master.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.

In status updates generated on the agent:

REASON_SLAVE_RESTARTED The task was launched during an agent restart, and never got forwarded to the executor.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.
REASON_CONTAINER_PREEMPTED The container in which the task was running was pre-empted by a QoS correction.
Note: For partition-aware frameworks, the state will be changed to TASK_GONE instead.
REASON_CONTAINER_UPDATE_FAILED The container in which the task was running was discarded because a resource update failed.
Note: For partition-aware frameworks, the state will be TASK_GONE instead.
REASON_EXECUTOR_TERMINATED The executor which was supposed to execute this task was already terminated, or the agent receives an instruction to kill the task before the executor was started.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.
REASON_GC_ERROR A directory to be used by this task was scheduled for GC and it could not be unscheduled.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.
REASON_INVALID_OFFERS This task belonged to an accepted offer that didn’t pass validation checks.
Note: For partition-aware frameworks, the state will be TASK_DROPPED instead.

For state TASK_DROPPED:

In status updates generated on the master:

REASON_SLAVE_DISCONNECTED See TASK_LOST
REASON_SLAVE_REMOVED See TASK_LOST
REASON_RESOURCES_UNKNOWN See TASK_LOST

In status updates generated on the agent:

REASON_SLAVE_RESTARTED See TASK_LOST
REASON_GC_ERROR See TASK_LOST
REASON_INVALID_OFFERS See TASK_LOST

For state TASK_UNREACHABLE:

In status updates generated on the master:

REASON_SLAVE_REMOVED See TASK_LOST

For state TASK_GONE

In status updates generated on the agent:

REASON_CONTAINER_UPDATE_FAILED See TASK_LOST
REASON_CONTAINER_PREEMPTED See TASK_LOST
REASON_EXECUTOR_PREEMPTED Renamed to REASON_CONTAINER_PREEMPTED in Mesos 0.26.

Reasons for Non-Terminal Status Updates

These reasons do not cause a state change, and will be sent along with the last known state of the task. The reason field indicates why the status update was sent.

REASON_RECONCILIATION A framework requested implicit or explicit reconciliation for this task.
Note: Status updates with this reason are not the original ones, but rather a modified copy that is re-sent from the master. In particular, the original data and message fields are erased and the original reason field is overwritten by REASON_RECONCILIATION .
REASON_TASK_CHECK_STATUS_UPDATED A task check notified the agent that its state changed.
Note: This reason is set by the executor, so for tasks that are running with a custom executor, whether or not status updates with this reasons are sent depends on that executors implementation. Note: Currently, when using one of the built-in executors, this reason is only used within status updates with task state TASK_RUNNING .
REASON_TASK_HEALTH_CHECK_STATUS_UPDATED A task health check notified the agent that its state changed.
Note: This reason is set by the executor, so for tasks that are running with a custom executor, whether or not status updates with this reasons are sent depends on that executors implementation. Note: Currently, when using one of the built-in executors, this reason is only used within status updates with task state TASK_RUNNING .
REASON_SLAVE_REREGISTERED The agent on which the task was running has reregistered after being marked unreachable by the master.
Note: Due to garbage collection of the unreachable and gone agents in the registry and master state Mesos also sends such status updates for agents unknown to the master. Note: Status updates with this reason are modified copies re-sent by the master which reflect the states of the tasks reported by the agent upon its re-registration. See comments for REASON_RECONCILIATION .