Messages between framework schedulers and the Mesos master may be dropped due to failures and network partitions. This may cause a framework scheduler and the master to have different views of the current state of the cluster. For example, consider a launch task request sent by a framework. There are many ways that failures can prevent the task launch operation from succeeding, such as:
In these cases, the framework believes the task to be staging but the task is unknown to the master. To cope with such situations, Mesos frameworks should use reconciliation to ask the master for the current state of their tasks.
Frameworks can use the scheduler driver’s reconcileTasks method to send a reconciliation request to the master:
// Allows the framework to query the status for non-terminal tasks.
// This causes the master to send back the latest task status for
// each task in 'statuses', if possible. Tasks that are no longer
// known will result in a TASK_LOST update. If statuses is empty,
// then the master will send the latest status for each task
// currently known.
virtual Status reconcileTasks(const std::vector<TaskStatus>& statuses);
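Currently, the master examines only two fields in TaskStatus:
TaskID: This is required.
SlaveID: Optional but recommended. This leads to faster reconciliation in the presence of agents that are transitioning between states.
Mesos provides two forms of reconciliation: explicit reconciliation, in which the scheduler sends a list of task statuses and the master replies with the latest state of each listed task, if possible; and implicit reconciliation, in which the scheduler sends an empty list and the master replies with the latest state of every task it currently knows.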
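The two kinds of request differ only in the list passed to reconcileTasks. The following is a minimal sketch of how a scheduler might assemble that list. StatusStub and LocalTask are hypothetical stand-ins invented for illustration (the real TaskStatus is a protobuf message), and an empty slave_id string is assumed to mean "not set".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the two TaskStatus fields the master examines;
// the real TaskStatus is a protobuf message with many more fields.
struct StatusStub {
  std::string task_id;   // required
  std::string slave_id;  // optional but recommended; empty means "not set"
};

// Hypothetical record of a task in the framework's own bookkeeping.
struct LocalTask {
  std::string task_id;
  std::string agent_id;  // may be empty if the agent is not known
  bool terminal;
};

// Explicit reconciliation: one entry per non-terminal task.
std::vector<StatusStub> explicitRequest(const std::vector<LocalTask>& tasks) {
  std::vector<StatusStub> statuses;
  for (const LocalTask& task : tasks) {
    if (task.terminal) continue;     // terminal tasks need no reconciliation
    assert(!task.task_id.empty());   // the task ID is required
    statuses.push_back(StatusStub{task.task_id, task.agent_id});
  }
  return statuses;
}

// Implicit reconciliation: an empty list asks for every task the master knows.
std::vector<StatusStub> implicitRequest() {
  return std::vector<StatusStub>();
}
```

In a real scheduler, each entry would be converted to a TaskStatus before the vector is handed to reconcileTasks.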
Reconciliation results are returned as task status updates (e.g., via the scheduler driver’s statusUpdate callback). Status updates that result from reconciliation requests will have their reason field set to REASON_RECONCILIATION. Note that most of the other fields in the returned TaskStatus message will not be set: for example, reconciliation cannot be used to retrieve the labels or data fields associated with a running task.
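As a rough illustration of how a status update handler might distinguish reconciliation results from ordinary updates, consider the sketch below. The Reason enum and StatusUpdate struct are hypothetical simplifications of the real TaskStatus message, and task states are modeled as plain strings.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins; only the fields needed for the sketch are modeled.
enum class Reason { NONE, RECONCILIATION };

struct StatusUpdate {
  std::string task_id;
  std::string state;  // e.g. "TASK_RUNNING"
  Reason reason;
};

// Records the reported state in the framework's local view and returns true
// when the update was produced by a reconciliation request.
bool applyUpdate(std::map<std::string, std::string>& states,
                 const StatusUpdate& update) {
  assert(!update.task_id.empty());
  states[update.task_id] = update.state;
  return update.reason == Reason::RECONCILIATION;
}
```

Because fields such as labels and data are not populated in reconciliation results, the sketch deliberately records nothing beyond the task state.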
Framework schedulers should periodically reconcile all of their tasks (for example, every fifteen minutes). This serves two purposes:
As an optimization, framework schedulers should reconcile more frequently when they have reason to suspect that their local state differs from that of the master. For example, after a framework launches a task, it should expect to receive a TASK_RUNNING status update for the new task fairly promptly. If no such update is received, the framework should perform explicit reconciliation more quickly than usual.
Similarly, frameworks should initiate reconciliation after both framework failovers and master failovers. Note that the scheduler driver notifies frameworks when master failover has occurred (via the reregistered() callback). For more information, see the guide to designing highly available frameworks.
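The triggers described above (periodic reconciliation, an overdue launch, and failover) could be combined into a single decision function. This is only a sketch: ReconcilePolicy, Trigger, and shouldReconcile are hypothetical names, the fifteen-minute interval is the example value from the text, and the 30-second launch grace period is an arbitrary assumption.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical tuning knobs for the sketch.
struct ReconcilePolicy {
  seconds periodic_interval;  // routine reconciliation period
  seconds launch_grace;       // how long to wait for a first status update
  ReconcilePolicy() : periodic_interval(15 * 60), launch_grace(30) {}
};

enum class Trigger { NONE, PERIODIC, OVERDUE_LAUNCH, FAILOVER };

// Decides whether reconciliation should start now, and why.
// 'oldest_unconfirmed_launch' is the age of the oldest launched task that has
// not yet produced a status update; zero means no such task exists.
Trigger shouldReconcile(const ReconcilePolicy& policy,
                        seconds since_last_reconcile,
                        seconds oldest_unconfirmed_launch,
                        bool reregistered) {
  if (reregistered) {
    return Trigger::FAILOVER;  // always reconcile after (re-)registration
  }
  if (oldest_unconfirmed_launch >= policy.launch_grace) {
    return Trigger::OVERDUE_LAUNCH;  // expected update never arrived
  }
  if (since_last_reconcile >= policy.periodic_interval) {
    return Trigger::PERIODIC;
  }
  return Trigger::NONE;
}
```

Failover is checked first because reconciliation after re-registration should not be deferred by the other timers.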
This technique for explicit reconciliation reconciles all non-terminal tasks until an update is received for each task, using exponential backoff to retry tasks that remain unreconciled. Retries are needed because the master may temporarily be unable to reply for a particular task. For example, during master failover the master must reregister all of the agents to rebuild its set of known tasks (this process can take minutes for large clusters, and is bounded by the --agent_reregister_timeout flag on the master).
Steps:
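1. Let start = now().
2. Let remaining = { T in tasks | T is non-terminal }.
3. Perform reconciliation: reconcile(remaining).
4. Wait for status updates to arrive, backing off exponentially between retries, and note the arrival time of each update.
5. Let remaining = { T in remaining | T.last_update_arrival() < start }.
6. If remaining is non-empty, go to 3.
This reconciliation algorithm must be run after each (re-)registration.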
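The steps above can be sketched as a loop. This is an illustrative sketch, not the scheduler driver's implementation: Task, Hooks, and reconcileUntilDone are hypothetical names, the reconcile hook stands in for a call to reconcileTasks, and the wait hook is assumed to block while status updates arrive and are recorded in the task map. Capping the backoff at max_backoff is also an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical local record of a task.
struct Task {
  std::string id;
  bool terminal;
  Clock::time_point last_update_arrival;
};

// Hypothetical hooks supplied by the scheduler.
struct Hooks {
  // Sends a reconciliation request for the given task IDs.
  std::function<void(const std::vector<std::string>&)> reconcile;
  // Blocks for roughly the given duration while status updates are
  // recorded into the task map (including last_update_arrival).
  std::function<void(std::chrono::milliseconds)> wait;
};

// Runs explicit reconciliation until every non-terminal task has received an
// update since the start. Returns the number of requests sent.
int reconcileUntilDone(std::map<std::string, Task>& tasks, const Hooks& hooks,
                       std::chrono::milliseconds initial_backoff,
                       std::chrono::milliseconds max_backoff) {
  assert(initial_backoff.count() > 0 && max_backoff >= initial_backoff);

  const Clock::time_point start = Clock::now();  // step 1

  std::vector<std::string> remaining;            // step 2
  for (const auto& entry : tasks) {
    if (!entry.second.terminal) remaining.push_back(entry.first);
  }

  std::chrono::milliseconds backoff = initial_backoff;
  int rounds = 0;
  while (!remaining.empty()) {                   // step 6
    hooks.reconcile(remaining);                  // step 3
    hooks.wait(backoff);                         // step 4
    ++rounds;

    std::vector<std::string> still_waiting;      // step 5
    for (const std::string& id : remaining) {
      if (tasks.at(id).last_update_arrival < start) still_waiting.push_back(id);
    }
    remaining.swap(still_waiting);

    // Double the wait before the next retry, up to the assumed cap.
    backoff = std::min(backoff * 2, max_backoff);
  }
  return rounds;
}
```

A scheduler would typically call such a function from its registration and re-registration handlers, and make sure only one run is active at a time.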
Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it is possible for tasks to resurrect from a LOST state (without a strict registry, the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.
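One way to detect unknown tasks is to compare the task IDs reported by implicit reconciliation against the framework's own records. The sketch below assumes task IDs are plain strings, and unknownTasks is a hypothetical helper; what to do with each unknown task (kill or recover) is left to the scheduler.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the IDs reported by the master that the framework has no record of,
// sorted and without duplicates. These are the candidates to kill or recover.
std::vector<std::string> unknownTasks(const std::set<std::string>& known,
                                      const std::vector<std::string>& reported) {
  std::set<std::string> unknown;
  for (const std::string& id : reported) {
    if (known.count(id) == 0) unknown.insert(id);
  }
  return std::vector<std::string>(unknown.begin(), unknown.end());
}
```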
Notes:
Offers are reconciled automatically after a failure: