Apache Mesos - Persistent Volumes

Persistent Volumes

Mesos supports creating persistent volumes from disk resources. When launching a task, you can create a volume that exists outside the task’s sandbox and will persist on the node even after the task dies or completes. When the task exits, its resources – including the persistent volume – can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task’s output as its input.

Persistent volumes enable stateful services such as HDFS and Cassandra to store their data within Mesos rather than having to resort to workarounds (e.g., writing task state to a distributed filesystem that is mounted at a well-known location outside the task’s sandbox).

Usage

Persistent volumes can only be created from reserved disk resources, whether it be statically reserved or dynamically reserved. A dynamically reserved persistent volume also cannot be unreserved without first explicitly destroying the volume. These rules exist to limit accidental mistakes, such as a persistent volume containing sensitive data being offered to other frameworks in the cluster. Similarly, a persistent volume cannot be destroyed if there is an active task that is still using the volume.

Please refer to the Reservation documentation for details regarding reservation mechanisms available in Mesos.

Persistent volumes can also be created on isolated and auxiliary disks by reserving multiple disk resources.

By default, a persistent volume cannot be shared between tasks running under different executors: that is, once a task is launched using a persistent volume, that volume will not appear in any resource offers until the task has finished running. Shared volumes are a type of persistent volumes that can be accessed by multiple tasks at the same agent simultaneously; see the documentation on shared volumes for more information.

Persistent volumes can be created by operators and frameworks. By default, frameworks and operators can create volumes for any role and destroy any persistent volume. Authorization allows this behavior to be limited so that volumes can only be created for particular roles and only particular volumes can be destroyed. For these operations to be authorized, the framework or operator should provide a principal to identify itself. To use authorization with reserve, unreserve, create, and destroy operations, the Mesos master must be configured with the appropriate ACLs. For more information, see the authorization documentation.

The following messages are available for frameworks to send back via the acceptOffers API as a response to a resource offer:
- Offer::Operation::Create
- Offer::Operation::Destroy
- Offer::Operation::GrowVolume
- Offer::Operation::ShrinkVolume
For each message in above list, a corresponding call in HTTP Operator API is available for operators or administrative tools;
/create-volumes and /destroy-volumes HTTP endpoints allow operators to manage persistent volumes through the master.

When a persistent volume is destroyed, all the data on that volume is removed from the agent’s filesystem. Note that for persistent volumes created on Mount disks, the root directory is not removed, because it is typically the mount point used for a separate storage device.

In the following sections, we will walk through examples of each of the interfaces described above.

Framework API

### Offer::Operation::Create

A framework can create volumes through the resource offer cycle. Suppose we receive a resource offer with 2048 MB of dynamically reserved disk:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  ]
}

We can create a persistent volume from the 2048 MB of disk resources by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::Create has a volumes field which specifies the persistent volume information. We need to specify the following:

The ID for the persistent volume; this must be unique per role on each agent.
The non-nested relative path within the container to mount the volume.
The permissions for the volume. Currently, "RW" is the only possible value.

If the framework provided a principal when registering with the master, then the disk.persistence.principal field must be set to that principal. If the framework did not provide a principal when registering, then the disk.persistence.principal field can take any value, or can be left unset. Note that the principal field determines the “creator principal” when authorization is enabled, even if authentication is disabled.

 {
   "type" : Offer::Operation::CREATE,
   "create": {
     "volumes" : [
       {
         "name" : "disk",
         "type" : "SCALAR",
         "scalar" : { "value" : 2048 },
         "role" : <offer's allocation role>,
         "reservation" : {
           "principal" : <framework_principal>
         },
         "disk": {
           "persistence": {
             "id" : <persistent_volume_id>,
             "principal" : <framework_principal>
           },
           "volume" : {
             "container_path" : <container_path>,
             "mode" : <mode>
           }
         }
       }
     ]
   }
 }

If this succeeds, a subsequent resource offer will contain the following persistent volume:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

`Offer::Operation::Destroy`

A framework can destroy persistent volumes through the resource offer cycle. In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. The volume will continue to exist until it is explicitly destroyed. Suppose we would like to destroy the volume we created. First, we receive a resource offer (copy/pasted from above):

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can destroy the persistent volume by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::Destroy has a volumes field which specifies the persistent volumes to be destroyed.

{
  "type" : Offer::Operation::DESTROY,
  "destroy" : {
    "volumes" : [
      {
        "name" : "disk",
        "type" : "SCALAR",
        "scalar" : { "value" : 2048 },
        "role" : <offer's allocation role>,
        "reservation" : {
          "principal" : <framework_principal>
        },
        "disk": {
          "persistence": {
            "id" : <persistent_volume_id>
          },
          "volume" : {
            "container_path" : <container_path>,
            "mode" : <mode>
          }
        }
      }
    ]
  }
}

If this request succeeds, the persistent volume will be destroyed, and all files and directories associated with the volume will be deleted. However, the disk resources will still be reserved. As such, a subsequent resource offer will contain the following reserved disk resources:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  ]
}

Those reserved resources can then be used as normal: e.g., they can be used to create another persistent volume or can be unreserved.

### Offer::Operation::GrowVolume

Sometimes, a framework or an operator may find that the size of an existing persistent volume may be too small (possibly due to increased usage). In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. Suppose we want to grow the size of the volume to 4096 MB, we first need resource offer(s) with at least 2048 MB of disk resources with the same reservation and disk information:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can grow the persistent volume by sending an Offer::Operation message. Offer::Operation::GrowVolume has a volume field which specifies the persistent volume to grow, and an addition field which specifies the additional disk space resource.

{
  "type" : Offer::Operation::GROW_VOLUME,
  "grow_volume" : {
    "volume" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    },
   "addition" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  }
}

If this request succeeds, the persistent volume will be grown to the new size, and all files and directories associated with the volume will not be touched. A subsequent resource offer will contain the grown volume:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 4096 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

### Offer::Operation::ShrinkVolume

Similarly, a framework or an operator may find that the size of an existing persistent volume may be too large (possibly due to over provisioning), and want to free up unneeded disk space resources. In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. Suppose we want to shrink the size of the volume to 1024 MB, we first need a resource offer with the volume to shrink:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can shrink the persistent volume by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::ShrinkVolume has a volume field which specifies the persistent volume to grow, and a subtract field which specifies the scalar value of disk space to subtract from the volume:

{
  "type" : Offer::Operation::SHRINK_VOLUME,
  "shrink_volume" : {
    "volume" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    },
   "subtract" : {
      "value" : 1024
    }
  }
}

If this request succeeds, the persistent volume will be shrunk to the new size, and all files and directories associated with the volume will not be touched. A subsequent resource offer will contain the shrunk volume as well as freed up disk resources with the same reservation information:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 1024 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 1024 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

Some restrictions about resizing a volume (applicable to both Offer::Operation::GrowVolume and Offer::Operation::ShrinkVolume):

Only persistent volumes created on an agent’s local disk space with ROOT or PATH type can be resized;
A persistent volume cannot be actively used by a task when being resized;
A persistent volume cannot be shared when being resized;
Volume resize operations cannot be included in an ACCEPT call with other operations which make use of the resized volume.

Versioned HTTP Operator API

As described above, persistent volumes can be created by a framework scheduler as part of the resource offer cycle. Persistent volumes can also be managed using the HTTP Operator API.

This capability is intended for use by operators and administrative tools.

For each offer operation which interacts with persistent volume, there is an equivalent call in master’s HTTP Operator API.

Unversioned Operator HTTP Endpoints

Several HTTP endpoints like /create-volumes and /destroy-volumes can still be used to manage persisent volumes, but we generally encourage operators to use versioned HTTP Operator API instead, as new features like resize support may not be backported.

`/create-volumes`

To use this endpoint, the operator should first ensure that a reservation for the necessary resources has been made on the appropriate agent (e.g., by using the /reserve HTTP endpoint or by configuring a static reservation). The information that must be included in a request to this endpoint is similar to that of the CREATE offer operation. One difference is the required value of the disk.persistence.principal field: when HTTP authentication is enabled on the master, the field must be set to the same principal that is provided in the request’s HTTP headers. When HTTP authentication is disabled, the disk.persistence.principal field can take any value, or can be left unset. Note that the principal field determines the “creator principal” when authorization is enabled, even if HTTP authentication is disabled.

To create a 512MB persistent volume for the ads role on a dynamically reserved disk resource, we can send an HTTP POST request to the master’s /create-volumes endpoint like so:

curl -i \
     -u <operator_principal>:<password> \
     -d slaveId=<slave_id> \
     -d volumes='[
       {
         "name": "disk",
         "type": "SCALAR",
         "scalar": { "value": 512 },
         "role": "ads",
         "reservation": {
           "principal": <operator_principal>
         },
         "disk": {
           "persistence": {
             "id" : <persistence_id>,
             "principal" : <operator_principal>
           },
           "volume": {
             "mode": "RW",
             "container_path": <path>
           }
         }
       }
     ]' \
     -X POST http://<ip>:<port>/master/create-volumes

The user receives one of the following HTTP responses:

202 Accepted: Request accepted (see below).
400 BadRequest: Invalid arguments (e.g., missing parameters).
401 Unauthorized: Unauthenticated request.
403 Forbidden: Unauthorized request.
409 Conflict: Insufficient resources to create the volumes.

A single /create-volumes request can create multiple persistent volumes, but all of the volumes must be on the same agent.

This endpoint returns the 202 ACCEPTED HTTP status code, which indicates that the create operation has been validated successfully by the master. The request is then forwarded asynchronously to the Mesos agent where the reserved resources are located. That asynchronous message may not be delivered or creating the volumes at the agent might fail, in which case no volumes will be created. To determine if a create operation has succeeded, the user can examine the state of the appropriate Mesos agent (e.g., via the agent’s /state HTTP endpoint).

`/destroy-volumes`

To destroy the volume created above, we can send an HTTP POST to the master’s /destroy-volumes endpoint like so:

curl -i \
     -u <operator_principal>:<password> \
     -d slaveId=<slave_id> \
     -d volumes='[
       {
         "name": "disk",
         "type": "SCALAR",
         "scalar": { "value": 512 },
         "role": "ads",
         "reservation": {
           "principal": <operator_principal>
         },
         "disk": {
           "persistence": {
             "id" : <persistence_id>
           },
           "volume": {
             "mode": "RW",
             "container_path": <path>
           }
         }
       }
     ]' \
     -X POST http://<ip>:<port>/master/destroy-volumes

Note that the volume JSON in the /destroy-volumes request must exactly match the definition of the volume. The JSON definition of a volume can be found via the reserved_resources_full key in the master’s /slaves endpoint (see below).

The user receives one of the following HTTP responses:

202 Accepted: Request accepted (see below).
400 BadRequest: Invalid arguments (e.g., missing parameters).
401 Unauthorized: Unauthenticated request.
403 Forbidden: Unauthorized request.
409 Conflict: Insufficient resources to destroy the volumes.

A single /destroy-volumes request can destroy multiple persistent volumes, but all of the volumes must be on the same agent.

This endpoint returns the 202 ACCEPTED HTTP status code, which indicates that the destroy operation has been validated successfully by the master. The request is then forwarded asynchronously to the Mesos agent where the volumes are located. That asynchronous message may not be delivered or destroying the volumes at the agent might fail, in which case no volumes will be destroyed. To determine if a destroy operation has succeeded, the user can examine the state of the appropriate Mesos agent (e.g., via the agent’s /state HTTP endpoint).

Listing Persistent Volumes

Information about the persistent volumes at each agent in the cluster can be found by querying the /slaves master endpoint, under the reserved_resources_full key.

The same information can also be found in the /state agent endpoint (under the reserved_resources_full key). The agent endpoint is useful to confirm if changes to persistent volumes have been propagated to the agent (which can fail in the event of network partition or master/agent restarts).

Programming with Persistent Volumes

Some suggestions to keep in mind when building applications that use persistent volumes:

A single acceptOffers call make a dynamic reservation (via Offer::Operation::Reserve) and create a new persistent volume on the newly reserved resources (via Offer::Operation::Create). However, these operations are not executed atomically (i.e., either operation or both operations could fail).
Volume IDs must be unique per role on each agent. However, it is strongly recommended that frameworks use globally unique volume IDs, to avoid potential confusion between volumes on different agents with the same volume ID. Note also that the agent ID where a volume resides might change over time. For example, suppose a volume is created on an agent and then the agent’s host machine is rebooted. When the agent registers with Mesos after the reboot, it will be assigned a new AgentID—but it will retain the same volume it had previously. Hence, frameworks should not assume that using the pair <AgentID, VolumeID> is a stable way to identify a volume in a cluster.
Attempts to dynamically reserve resources or create persistent volumes might fail—for example, because the network message containing the operation did not reach the master or because the master rejected the operation. Applications should be prepared to detect failures and correct for them (e.g., by retrying the operation).
When using HTTP endpoints to reserve resources or create persistent volumes, some failures can be detected by examining the HTTP response code returned to the client. However, it is still possible for a 202 response code to be returned to the client but for the associated operation to fail—see discussion above.
When using the scheduler API, detecting that a dynamic reservation has failed is a little tricky: reservations do not have unique identifiers, and the Mesos master does not provide explicit feedback on whether a reservation request has succeeded or failed. Hence, framework schedulers typically use a combination of two techniques:
1. They use timeouts to detect that a reservation request may have failed (because they don’t receive a resource offer containing the expected resources after a given period of time).
2. To check whether a resource offer includes the effect of a dynamic reservation, applications cannot check for the presence of a “reservation ID” or similar value (because reservations do not have IDs). Instead, applications should examine the resource offer and check that it contains sufficient reserved resources for the application’s role. If it does not, the application should make additional reservation requests as necessary.
When a scheduler issues a dynamic reservation request, the reserved resources might not be present in the next resource offer the scheduler receives. There are two reasons for this: first, the reservation request might fail or be dropped by the network, as discussed above. Second, the reservation request might simply be delayed, so that the next resource offer from the master will be issued before the reservation request is received by the master. This is why the text above suggests that applications wait for a timeout before assuming that a reservation request should be retried.
A consequence of using timeouts to detect failures is that an application might submit more reservation requests than intended (e.g., a timeout fires and an application makes another reservation request; meanwhile, the original reservation request is also processed). Recall that two reservations for the same role at the same agent are “merged”: for example, role foo makes two requests to reserve 2 CPUs at a single agent and both reservation requests succeed, the result will be a single reservation of 4 CPUs. To handle this situation, applications should be prepared for resource offers that contain more resources than expected. Some applications may also want to detect this situation and unreserve any additional reserved resources that will not be required.
It often makes sense to structure application logic as a “state machine”, where the application moves from its initial state (no reserved resources and no persistent volumes) and eventually transitions toward a single terminal state (necessary resources reserved and persistent volume created). As new events (such as timeouts and resource offers) are received, the application compares the event with its current state and decides what action to take next.
Because persistent volumes are associated with roles, a volume might be offered to any of the frameworks that are subscribed to that role. For example, a persistent volume might be created by one framework and then offered to a different framework subscribed to the same role. This can be used to pass large volumes of data between frameworks in a convenient way. However, this behavior might also allow sensitive data created by one framework to be read or modified by another framework subscribed to the same role. It can also make it more difficult for frameworks to determine whether a dynamic reservation has succeeded: as discussed above, frameworks need to wait for an offer that contains the “expected” reserved resources to determine when a reservation request has succeeded. Determining what a framework should “expect” to find in an offer is more difficult when multiple frameworks can make reservations for the same role concurrently. In general, whenever multiple frameworks are allowed to subscribe to the same role, the operator should ensure that those frameworks are configured to collaborate with one another when using role-specific resources. For more information, see the discussion of multiple frameworks in the same role.

Version History

Persistent volumes were introduced in Mesos 0.23. Mesos 0.27 introduced HTTP endpoints for creating and destroying volumes. Mesos 0.28 introduced support for multiple disk resources, and also enhanced the /slaves master endpoint to include detailed information about persistent volumes and dynamic reservations. Mesos 1.0 changed the semantics of destroying a volume: in previous releases, destroying a volume would remove the Mesos-level metadata but would not remove the volume’s data from the agent’s filesystem. Mesos 1.1 introduced support for shared persistent volumes. Mesos 1.6 introduced experimental support for resizing persistent volumes.