Apache Mesos - Reconciliation

Task Reconciliation

Messages between framework schedulers and the Mesos master may be dropped due to failures and network partitions. This may cause a framework scheduler and the master to have different views of the current state of the cluster. For example, consider a launch task request sent by a framework. There are many ways that failures can prevent the task launch operation from succeeding, such as:

Framework fails after persisting its intent to launch the task, but before the launch task message was sent.
Master fails before receiving the message.
Master fails after receiving the message but before sending it to the agent.

In these cases, the framework believes the task to be staging but the task is unknown to the master. To cope with such situations, Mesos frameworks should use reconciliation to ask the master for the current state of their tasks.

How To Reconcile

Frameworks can use the scheduler driver’s reconcileTasks method to send a reconciliation request to the master:

// Allows the framework to query the status for non-terminal tasks.
// This causes the master to send back the latest task status for
// each task in 'statuses', if possible. Tasks that are no longer
// known will result in a TASK_LOST update. If statuses is empty,
// then the master will send the latest status for each task
// currently known.
virtual Status reconcileTasks(const std::vector<TaskStatus>& statuses);

Currently, the master will only examine two fields in TaskStatus:

TaskID: This is required.
SlaveID: Optional but recommended. This leads to faster reconciliation in the presence of agents that are transitioning between states.

Mesos provides two forms of reconciliation:

“Explicit” reconciliation: the scheduler sends a list of non-terminal task IDs and the master responds with the latest state for each task, if possible.
“Implicit” reconciliation: the scheduler sends an empty list of tasks and the master responds with the latest state for all currently known non-terminal tasks.

Reconciliation results are returned as task status updates (e.g., via the scheduler driver’s statusUpdate callback). Status updates that result from reconciliation requests will their reason field set to REASON_RECONCILIATION. Note that most of the other fields in the returned TaskStatus message will not be set: for example, reconciliation cannot be used to retrieve the labels or data fields associated with a running task.

When To Reconcile

Framework schedulers should periodically reconcile all of their tasks (for example, every fifteen minutes). This serves two purposes:

It is necessary to account for dropped messages between the framework and the master; for example, see the task launch scenario described above.
It is a defensive programming technique to catch bugs in both the framework and the Mesos master.

As an optimization, framework schedulers should reconcile more frequently when they have reason to suspect that their local state differs from that of the master. For example, after a framework launches a task, it should expect to receive a TASK_RUNNING status update for the new task fairly promptly. If no such update is received, the framework should perform explicit reconciliation more quickly than usual.

Similarly, frameworks should initiate reconciliation after both framework failovers and master failovers. Note that the scheduler driver notifies frameworks when master failover has occurred (via the reregistered() callback). For more information, see the guide to designing highly available frameworks.

Algorithm

This technique for explicit reconciliation reconciles all non-terminal tasks until an update is received for each task, using exponential backoff to retry tasks that remain unreconciled. Retries are needed because the master temporarily may not be able to reply for a particular task. For example, during master failover the master must reregister all of the agents to rebuild its set of known tasks (this process can take minutes for large clusters, and is bounded by the --agent_reregister_timeout flag on the master).

Steps:

let start = now()
let remaining = { T in tasks | T is non-terminal }
Perform reconciliation: reconcile(remaining)
Wait for status updates to arrive (use truncated exponential backoff). For each update, note the time of arrival.
let remaining = { T in remaining | T.last_update_arrival() < start }
If remaining is non-empty, go to 3.

This reconciliation algorithm must be run after each (re-)registration.

Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, its possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.

Notes:

When waiting for updates to arrive, use a truncated exponential backoff. This will avoid a snowball effect in the case of the driver or master being backed up.
It is beneficial to ensure that only 1 reconciliation is in progress at a time, to avoid a snowball effect in the face of many re-registrations. If another reconciliation should be started while one is in-progress, then the previous reconciliation algorithm should stop running.

Offer Reconciliation

Offers are reconciled automatically after a failure:

Offers do not persist beyond the lifetime of a Master.
If a disconnection occurs, offers are no longer valid.
Offers are rescinded and regenerated each time the framework (re-)registers.