
Designing Highly Available Mesos Frameworks

A Mesos framework manages tasks. For a Mesos framework to be highly available, it must continue to manage tasks correctly in the presence of a variety of failure scenarios. The most common failure conditions that framework authors should consider include:

  * The machine running the framework scheduler might fail, for example by crashing or by losing network connectivity.

  * The machine running the leading Mesos master might fail.

  * The machine on which a task is running might fail. Alternatively, the machine might remain operational but the Mesos agent on it might be unable to communicate with the master, for example due to a network partition.

Note that more than one of these failures might occur simultaneously.

Mesos Architecture

Before discussing the specific failure scenarios outlined above, it is worth highlighting some aspects of how Mesos is designed that influence high availability:

  * Messaging between Mesos components is delivered at-most-once by default: messages can be dropped. Framework authors should expect that messages they send might not be received and be prepared to take corrective action, for example by retrying or by reconciling task state (see below).

  * The Mesos master keeps most of its state, such as the set of registered frameworks and the tasks running on each agent, only in memory. When the leading master fails, this state is rebuilt as agents and frameworks reregister with the newly elected master.

Recommendations for Highly Available Frameworks

Highly available framework designs typically follow a few common patterns:

  1. To tolerate scheduler failures, frameworks run multiple scheduler instances (three instances is typical). At any given time, only one of these scheduler instances is the leader: this instance is connected to the Mesos master, receives resource offers and task status updates, and launches new tasks. The other scheduler replicas are followers: they are used only when the leader fails, in which case one of the followers is chosen to become the new leader.

  2. Schedulers need a mechanism to decide when the current scheduler leader has failed and to elect a new leader. This is typically accomplished using a coordination service like Apache ZooKeeper or etcd. Consult the documentation of the coordination system you are using for more information on how to correctly implement leader election. (A sketch that combines ZooKeeper-based leader election with the reregistration step in the next point appears after this list.)

  3. After electing a new leading scheduler, the new leader should reconnect to the Mesos master. When registering with the master, the framework should set the id field in its FrameworkInfo to the ID that was assigned to the failed scheduler instance. This ensures that the master will recognize that the connection does not start a new session, but rather continues (and replaces) the session used by the failed scheduler instance.

    NOTE: When the old scheduler leader disconnects from the master, by default the master will immediately kill all the tasks and executors associated with the failed framework. For a typical production framework, this default behavior is very undesirable! To avoid this, highly available frameworks should set the failover_timeout field in their FrameworkInfo to a generous value. To avoid accidental destruction of tasks in production environments, many frameworks use a failover_timeout of 1 week or more.

  4. After connecting to the Mesos master, the new leading scheduler should ensure that its local state is consistent with the current state of the cluster. For example, suppose that the previous leading scheduler attempted to launch a new task and then immediately failed. The task might have launched successfully, in which case the newly elected leader will begin to receive status updates for a task it has no record of. To handle this situation, frameworks typically use a strongly consistent distributed data store to record information about active and pending tasks. In fact, the same coordination service that is used for leader election (such as ZooKeeper or etcd) can often be used for this purpose. Some Mesos frameworks (such as Apache Aurora) use the Mesos replicated log instead.
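
To make points 2 and 3 above concrete, here is a minimal sketch in Java, assuming the MesosSchedulerDriver bindings and Apache Curator for ZooKeeper-based leader election. The ZooKeeper connection strings and paths, the MyScheduler class, and the readStoredFrameworkId helper are illustrative assumptions, not part of Mesos.

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.FrameworkID;
import org.apache.mesos.Protos.FrameworkInfo;
import org.apache.mesos.Protos.Status;

public class HaSchedulerMain {
  public static void main(String[] args) throws Exception {
    // Connect to the same ZooKeeper ensemble that the Mesos masters use.
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 10));
    zk.start();

    // Leader election: block until this replica becomes the leading scheduler.
    LeaderLatch latch = new LeaderLatch(zk, "/my-framework/leader");
    latch.start();
    latch.await();

    // Build FrameworkInfo. Reusing the framework ID recorded by a previous
    // leader tells the master that this registration continues an existing
    // framework session rather than starting a new one.
    FrameworkInfo.Builder framework = FrameworkInfo.newBuilder()
        .setUser("")                        // Empty string: let Mesos fill in the current user.
        .setName("my-ha-framework")
        .setFailoverTimeout(7 * 24 * 3600); // One week, in seconds.
    String storedId = readStoredFrameworkId(zk);
    if (storedId != null) {
      framework.setId(FrameworkID.newBuilder().setValue(storedId));
    }

    // MyScheduler is this framework's Scheduler implementation (hypothetical,
    // not shown). Its registered() callback should persist the framework ID
    // assigned by the master so that future leaders can reuse it.
    MesosSchedulerDriver driver = new MesosSchedulerDriver(
        new MyScheduler(), framework.build(), "zk://zk1:2181,zk2:2181,zk3:2181/mesos");
    System.exit(driver.run() == Status.DRIVER_STOPPED ? 0 : 1);
  }

  // Hypothetical helper: read the framework ID persisted by a previous leader.
  private static String readStoredFrameworkId(CuratorFramework zk) throws Exception {
    if (zk.checkExists().forPath("/my-framework/id") == null) {
      return null;
    }
    return new String(zk.getData().forPath("/my-framework/id"), StandardCharsets.UTF_8);
  }
}
```

On the very first registration (when no stored ID exists), the master assigns a new framework ID; persisting that ID is what allows a future leader to fail over into the same framework session.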

The Life Cycle of a Task

A Mesos task transitions through a sequence of states. The authoritative “source of truth” for the current state of a task is the agent on which the task is running. A framework scheduler learns about the current state of a task by communicating with the Mesos master—specifically, by listening for task status updates and by performing task state reconciliation.
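
As an illustration of reconciliation, the sketch below (assuming the Java scheduler driver bindings) shows its two common forms; the Reconciliation class and its method names are illustrative, not part of the Mesos API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

import org.apache.mesos.SchedulerDriver;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

final class Reconciliation {
  // Explicit reconciliation: ask the master for the current status of the
  // tasks this scheduler believes it is running. The master responds with a
  // status update for each task.
  static void reconcileKnownTasks(SchedulerDriver driver, Collection<String> taskIds) {
    Collection<TaskStatus> statuses = new ArrayList<>();
    for (String id : taskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(TaskID.newBuilder().setValue(id))
          .setState(TaskState.TASK_RUNNING) // Placeholder; the master replies with the actual state.
          .build());
    }
    driver.reconcileTasks(statuses);
  }

  // Implicit reconciliation: an empty collection asks the master to send a
  // status update for every task it knows about for this framework.
  static void reconcileAll(SchedulerDriver driver) {
    driver.reconcileTasks(Collections.<TaskStatus>emptyList());
  }
}
```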

Frameworks can represent the state of a task using a state machine, with one initial state and several possible terminal states:

  * A task begins in the TASK_STAGING state, after the master has accepted the launch request but before the task has started running on the agent.

  * The TASK_STARTING state is optional and can be used to indicate that the task is being prepared; TASK_RUNNING indicates that the task is running.

  * A task eventually reaches a terminal state, after which its state does not change: TASK_FINISHED for successful completion, or TASK_FAILED, TASK_KILLED, TASK_ERROR, or TASK_LOST otherwise.

Note that the same task status can be used in several different (but usually related) situations. For example, TASK_ERROR is used when the framework’s principal is not authorized to launch tasks as a certain user, and also when the task description is syntactically malformed (e.g., the task ID contains an invalid character). The reason field of the TaskStatus message can be used to disambiguate between such situations.
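
For example, a status update handler might use the state to decide whether a task is terminal and the reason field to tell apart situations that share the same state. The sketch below, in Java, is a simplified helper that such a handler could delegate to; the class name and the logging are illustrative.

```java
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

final class StatusUpdates {
  // Intended to be called from Scheduler.statusUpdate(). The state decides
  // whether the task is terminal; the reason and message fields tell apart
  // situations that share the same state.
  static void handle(TaskStatus status) {
    TaskState state = status.getState();
    boolean terminal =
        state == TaskState.TASK_FINISHED
            || state == TaskState.TASK_FAILED
            || state == TaskState.TASK_KILLED
            || state == TaskState.TASK_ERROR
            || state == TaskState.TASK_LOST;
    if (!terminal) {
      return; // e.g., TASK_STAGING, TASK_STARTING, TASK_RUNNING.
    }
    String reason = status.hasReason() ? status.getReason().name() : "unspecified";
    System.out.printf("Task %s reached terminal state %s (reason: %s, message: %s)%n",
        status.getTaskId().getValue(), state, reason, status.getMessage());
    // A real framework would update its durable task store here and decide
    // whether to launch a replacement task.
  }
}
```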

Dealing with Partitioned or Failed Agents

The Mesos master tracks the availability and health of the registered agents using two different mechanisms:

  1. The state of a persistent TCP connection between the master and the agent.

  2. Health checks using periodic ping messages to the agent. The master sends “ping” messages to the agent and expects a “pong” response message within a configurable timeout. The agent is considered to have failed if it does not respond promptly to a certain number of ping messages in a row. This behavior is controlled by the --agent_ping_timeout and --max_agent_ping_timeouts master flags.

If the persistent TCP connection to the agent breaks or the agent fails health checks, the master decides that the agent has failed and takes steps to remove it from the cluster. Specifically:

  * The tasks that were running on the agent are marked TASK_LOST, and the frameworks that launched those tasks are notified via status updates.

  * The agent is removed from the master's pool of registered agents and the frameworks are notified of its removal (via the slaveLost scheduler callback), so no further resources on that agent will be offered.

Typically, frameworks respond to failed or partitioned agents by scheduling new copies of the tasks that were running on the lost agent. This should be done with caution, however: it is possible that the lost agent is still alive, but is partitioned from the master and is unable to communicate with it. Depending on the nature of the network partition, tasks on the agent might still be able to communicate with external clients or other hosts in the cluster. Frameworks can take steps to prevent this (e.g., by having tasks connect to ZooKeeper and cease operation if their ZooKeeper session expires), but Mesos leaves such details to framework authors.
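
A sketch of this rescheduling pattern in Java follows; the tasksByAgent map and relaunchQueue are illustrative stand-ins for the framework's own (ideally durable) state, and the fencing caveat from the paragraph above is noted in the comments.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;

import org.apache.mesos.Protos.SlaveID;

final class AgentLossHandler {
  // Intended to be called from Scheduler.slaveLost(). tasksByAgent maps an
  // agent ID to the task IDs this framework was running there; relaunchQueue
  // holds task IDs to be relaunched from future resource offers.
  static void handleLostAgent(SlaveID slaveId,
                              Map<String, List<String>> tasksByAgent,
                              Queue<String> relaunchQueue) {
    List<String> lostTasks = tasksByAgent.remove(slaveId.getValue());
    if (lostTasks == null) {
      return;
    }
    for (String taskId : lostTasks) {
      // Queue a replacement copy; it will be launched when a suitable
      // resource offer arrives.
      relaunchQueue.add(taskId);
    }
    // Caution: the "lost" agent may only be partitioned from the master, so
    // the old copies of these tasks might still be running. If duplicates are
    // unacceptable, the tasks themselves need a fencing mechanism, such as
    // exiting when their ZooKeeper session expires.
  }
}
```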

Dealing with Partitioned or Failed Masters

The behavior described above does not apply during the period immediately after a new Mesos master is elected. As noted above, most Mesos master state is only kept in memory; hence, when the leading master fails and a new master is elected, the new master will have little knowledge of the current state of the cluster. Instead, it rebuilds this information as the frameworks and agents notice that a new master has been elected and then reregister with it.

Framework Reregistration

When master failover occurs, frameworks that were connected to the previous leading master should reconnect to the new leading master. MesosSchedulerDriver handles most of the details of detecting when the previous leading master has failed and connecting to the new leader; when the framework has successfully reregistered with the new leading master, the reregistered scheduler driver callback will be invoked.
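
A common pattern is to trigger task state reconciliation from this callback, so the scheduler's view of its tasks catches up with the state the new master rebuilds from reregistering agents. The fragment below assumes the Java scheduler driver bindings; it is a single method of a larger Scheduler implementation.

```java
import java.util.Collections;

import org.apache.mesos.SchedulerDriver;
import org.apache.mesos.Protos.MasterInfo;
import org.apache.mesos.Protos.TaskStatus;

// Method of a class implementing org.apache.mesos.Scheduler;
// the other callbacks are omitted here.
@Override
public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {
  // The new master rebuilds its view of the cluster gradually as agents
  // reregister, so ask it to send status updates for every task it knows
  // about for this framework (implicit reconciliation). Some frameworks
  // repeat this periodically until the responses stabilize.
  driver.reconcileTasks(Collections.<TaskStatus>emptyList());
}
```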

Agent Reregistration

During the period after a new master has been elected but before a given agent has reregistered or the agent_reregister_timeout has fired, attempting to reconcile the state of a task running on that agent will not return any information (because the master cannot accurately determine the state of the task).

If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).
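
Frameworks therefore need a policy for tasks that resurface after being declared lost. The sketch below, assuming the Java bindings, shows one possible policy (killing the resurfaced copy when a replacement has already been launched); the lostTaskIds set is an illustrative stand-in for the framework's own records.

```java
import java.util.Set;

import org.apache.mesos.SchedulerDriver;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

final class ResurrectedTasks {
  // Intended to be called from Scheduler.statusUpdate(). lostTaskIds records
  // tasks this framework has already given up on (after a TASK_LOST update)
  // and possibly replaced.
  static void handle(SchedulerDriver driver, TaskStatus status, Set<String> lostTaskIds) {
    String taskId = status.getTaskId().getValue();
    if (status.getState() == TaskState.TASK_RUNNING && lostTaskIds.contains(taskId)) {
      // The agent reconnected after the reregistration timeout and the old
      // copy is still running. If a replacement has already been launched,
      // one reasonable policy is to kill the resurfaced copy so that
      // duplicates do not run side by side.
      driver.killTask(status.getTaskId());
    }
  }
}
```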