Apache Mesos - Persistent Volumes

Persistent Volumes

Mesos supports creating persistent volumes from disk resources. When launching a task, you can create a volume that exists outside the task’s sandbox and will persist on the node even after the task dies or completes. When the task exits, its resources – including the persistent volume – can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task’s output as its input.

Persistent volumes enable stateful services such as HDFS and Cassandra to store their data within Mesos rather than having to resort to workarounds (e.g., writing task state to a distributed filesystem that is mounted at a well-known location outside the task’s sandbox).

Usage

Persistent volumes can only be created from reserved disk resources, whether statically or dynamically reserved. A dynamically reserved persistent volume also cannot be unreserved without first explicitly destroying the volume. These rules exist to limit accidental mistakes, such as a persistent volume containing sensitive data being offered to other frameworks in the cluster. Similarly, a persistent volume cannot be destroyed while an active task is still using it.

Please refer to the Reservation documentation for details regarding reservation mechanisms available in Mesos.

Persistent volumes can also be created on isolated and auxiliary disks by reserving multiple disk resources.

By default, a persistent volume cannot be shared between tasks running under different executors: that is, once a task is launched using a persistent volume, that volume will not appear in any resource offers until the task has finished running. Shared volumes are a type of persistent volume that can be accessed by multiple tasks on the same agent simultaneously; see the documentation on shared volumes for more information.

Persistent volumes can be created by operators and frameworks. By default, frameworks and operators can create volumes for any role and destroy any persistent volume. Authorization allows this behavior to be limited so that volumes can only be created for particular roles and only particular volumes can be destroyed. For these operations to be authorized, the framework or operator should provide a principal to identify itself. To use authorization with reserve, unreserve, create, and destroy operations, the Mesos master must be configured with the appropriate ACLs. For more information, see the authorization documentation.

When a persistent volume is destroyed, all the data on that volume is removed from the agent’s filesystem. Note that for persistent volumes created on Mount disks, the root directory is not removed, because it is typically the mount point used for a separate storage device.

In the following sections, we will walk through examples of each of the interfaces described above.

Framework API

### Offer::Operation::Create

A framework can create volumes through the resource offer cycle. Suppose we receive a resource offer with 2048 MB of dynamically reserved disk:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  ]
}

We can create a persistent volume from the 2048 MB of disk resources by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::Create has a volumes field which specifies the persistent volume information. We need to specify the following:

  1. The ID for the persistent volume; this must be unique per role on each agent.
  2. The non-nested relative path within the container to mount the volume.
  3. The permissions for the volume. Currently, "RW" is the only possible value.
  4. If the framework provided a principal when registering with the master, then the disk.persistence.principal field must be set to that principal. If the framework did not provide a principal when registering, then the disk.persistence.principal field can take any value, or can be left unset. Note that the principal field determines the “creator principal” when authorization is enabled, even if authentication is disabled.

     {
       "type" : Offer::Operation::CREATE,
       "create": {
         "volumes" : [
           {
             "name" : "disk",
             "type" : "SCALAR",
             "scalar" : { "value" : 2048 },
             "role" : <offer's allocation role>,
             "reservation" : {
               "principal" : <framework_principal>
             },
             "disk": {
               "persistence": {
                 "id" : <persistent_volume_id>,
                 "principal" : <framework_principal>
               },
               "volume" : {
                 "container_path" : <container_path>,
                 "mode" : <mode>
               }
             }
           }
         ]
       }
     }

If this succeeds, a subsequent resource offer will contain the following persistent volume:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}
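
As a rough sketch of the CREATE message described above, the following Python helper assembles the operation as a plain dictionary. The function name `make_create_operation`, the string `"CREATE"` used for the operation type, and the sample values in the usage note are illustrative assumptions, not part of the Mesos API. The path check is one conservative reading of "non-nested relative path" (a single path component).

```python
import copy


def make_create_operation(reserved_disk, volume_id, container_path,
                          principal=None, mode="RW"):
    """Build a CREATE operation dict from a reserved disk resource.

    Illustrative helper only; it is not provided by Mesos.
    """
    # Volumes can only be created from reserved disk resources.
    if reserved_disk.get("name") != "disk" or "reservation" not in reserved_disk:
        raise ValueError("a reserved disk resource is required")
    # Assumption: treat "non-nested relative path" as a single path component.
    if not container_path or "/" in container_path:
        raise ValueError("container_path must be a non-nested relative path")
    # "RW" is currently the only possible mode.
    if mode != "RW":
        raise ValueError('mode must be "RW"')

    volume = copy.deepcopy(reserved_disk)  # leave the offered resource untouched
    persistence = {"id": volume_id}
    if principal is not None:
        persistence["principal"] = principal
    disk = volume.setdefault("disk", {})
    disk["persistence"] = persistence
    disk["volume"] = {"container_path": container_path, "mode": mode}
    return {"type": "CREATE", "create": {"volumes": [volume]}}
```

Calling `make_create_operation(offered_disk, "vol-1", "data", principal="ads-framework")` on the 2048 MB resource from the offer yields a dictionary shaped like the message above. How it is serialized and sent depends on the scheduler library in use.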

### Offer::Operation::Destroy

A framework can destroy persistent volumes through the resource offer cycle. In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. The volume will continue to exist until it is explicitly destroyed. Suppose we would like to destroy the volume we created. First, we receive a resource offer that contains the volume (identical to the offer shown above):

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can destroy the persistent volume by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::Destroy has a volumes field which specifies the persistent volumes to be destroyed.

{
  "type" : Offer::Operation::DESTROY,
  "destroy" : {
    "volumes" : [
      {
        "name" : "disk",
        "type" : "SCALAR",
        "scalar" : { "value" : 2048 },
        "role" : <offer's allocation role>,
        "reservation" : {
          "principal" : <framework_principal>
        },
        "disk": {
          "persistence": {
            "id" : <persistent_volume_id>
          },
          "volume" : {
            "container_path" : <container_path>,
            "mode" : <mode>
          }
        }
      }
    ]
  }
}

If this request succeeds, the persistent volume will be destroyed, and all files and directories associated with the volume will be deleted. However, the disk resources will still be reserved. As such, a subsequent resource offer will contain the following reserved disk resources:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  ]
}

Those reserved resources can then be used as normal: e.g., they can be used to create another persistent volume or can be unreserved.
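
The destroy flow above can be sketched in Python as follows: look up the offered volume by its persistence ID and wrap an unmodified copy of it in a DESTROY operation. The helper names `find_volume` and `make_destroy_operation`, and the string `"DESTROY"` for the operation type, are assumptions made for illustration.

```python
import copy


def find_volume(offer, volume_id):
    """Return the offered resource whose persistence ID matches, or None."""
    for resource in offer.get("resources", []):
        persistence = resource.get("disk", {}).get("persistence")
        if persistence and persistence.get("id") == volume_id:
            return resource
    return None


def make_destroy_operation(offer, volume_id):
    """Build a DESTROY operation dict for a volume present in the offer.

    Illustrative helper only; it is not provided by Mesos.
    """
    volume = find_volume(offer, volume_id)
    if volume is None:
        raise LookupError("volume %r is not in this offer" % volume_id)
    # Reuse the resource exactly as offered rather than rebuilding it by hand.
    return {"type": "DESTROY", "destroy": {"volumes": [copy.deepcopy(volume)]}}
```

Copying the resource from the offer, instead of reconstructing it field by field, avoids accidental mismatches between the request and the volume as the master knows it.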

### Offer::Operation::GrowVolume

Sometimes a framework or an operator finds that an existing persistent volume is too small (possibly due to increased usage). In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. Suppose we want to grow the volume to 4096 MB. We first need one or more resource offers that contain the volume and at least 2048 MB of additional disk resources with the same reservation and disk information:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can grow the persistent volume by sending an Offer::Operation message. Offer::Operation::GrowVolume has a volume field which specifies the persistent volume to grow, and an addition field which specifies the additional disk space resource.

{
  "type" : Offer::Operation::GROW_VOLUME,
  "grow_volume" : {
    "volume" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    },
    "addition" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    }
  }
}

If this request succeeds, the persistent volume will be grown to the new size, and all files and directories associated with the volume will not be touched. A subsequent resource offer will contain the grown volume:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 4096 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}
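
A minimal Python sketch of the grow request above: it checks that the addition carries the same role and reservation as the volume, builds the operation, and computes the size the grown volume should have. The helper names and the string `"GROW_VOLUME"` for the operation type are illustrative assumptions; the checks are a simplification and do not replace the master's own validation.

```python
import copy


def make_grow_operation(volume, addition):
    """Build a GROW_VOLUME operation dict (illustrative helper, not a Mesos API)."""
    if "persistence" not in volume.get("disk", {}):
        raise ValueError("volume must be a persistent volume")
    if "persistence" in addition.get("disk", {}):
        raise ValueError("addition must not itself be a persistent volume")
    # The addition must carry the same role and reservation as the volume.
    for key in ("role", "reservation"):
        if volume.get(key) != addition.get(key):
            raise ValueError("addition must match the volume's " + key)
    return {
        "type": "GROW_VOLUME",
        "grow_volume": {
            "volume": copy.deepcopy(volume),
            "addition": copy.deepcopy(addition),
        },
    }


def expected_size_after_grow(operation):
    """Size in MB that the volume should have once the operation succeeds."""
    grow = operation["grow_volume"]
    return grow["volume"]["scalar"]["value"] + grow["addition"]["scalar"]["value"]
```

With the 2048 MB volume and the 2048 MB reserved disk from the offer, `expected_size_after_grow` gives 4096, matching the subsequent offer shown above.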

### Offer::Operation::ShrinkVolume

Similarly, a framework or an operator may find that an existing persistent volume is too large (possibly due to over-provisioning) and want to free up unneeded disk space. In Offer::Operation::Create, we created a persistent volume from 2048 MB of disk resources. Suppose we want to shrink the volume to 1024 MB. We first need a resource offer that contains the volume to shrink:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}

We can shrink the persistent volume by sending an Offer::Operation message via the acceptOffers API. Offer::Operation::ShrinkVolume has a volume field which specifies the persistent volume to shrink, and a subtract field which specifies the scalar value of disk space to subtract from the volume:

{
  "type" : Offer::Operation::SHRINK_VOLUME,
  "shrink_volume" : {
    "volume" : {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 2048 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    },
    "subtract" : {
      "value" : 1024
    }
  }
}

If this request succeeds, the persistent volume will be shrunk to the new size, and all files and directories associated with the volume will not be touched. A subsequent resource offer will contain the shrunk volume as well as freed up disk resources with the same reservation information:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 1024 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "disk",
      "type" : "SCALAR",
      "scalar" : { "value" : 1024 },
      "role" : <offer's allocation role>,
      "reservation" : {
        "principal" : <framework_principal>
      },
      "disk": {
        "persistence": {
          "id" : <persistent_volume_id>
        },
        "volume" : {
          "container_path" : <container_path>,
          "mode" : <mode>
        }
      }
    }
  ]
}
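
The shrink request and its expected outcome can be sketched in Python as below. The helper names and the string `"SHRINK_VOLUME"` for the operation type are illustrative assumptions, as is the rule that the subtracted amount must be positive and smaller than the current volume size.

```python
import copy


def make_shrink_operation(volume, subtract_mb):
    """Build a SHRINK_VOLUME operation dict (illustrative helper, not a Mesos API)."""
    if "persistence" not in volume.get("disk", {}):
        raise ValueError("volume must be a persistent volume")
    size = volume["scalar"]["value"]
    # Assumption: something must be removed and the volume must stay non-empty.
    if not 0 < subtract_mb < size:
        raise ValueError("subtract must be between 0 and the volume size")
    return {
        "type": "SHRINK_VOLUME",
        "shrink_volume": {
            "volume": copy.deepcopy(volume),
            "subtract": {"value": subtract_mb},
        },
    }


def expected_resources_after_shrink(operation):
    """Return [freed reserved disk, shrunk volume] as they should be offered."""
    shrink = operation["shrink_volume"]
    subtract = shrink["subtract"]["value"]

    shrunk = copy.deepcopy(shrink["volume"])
    shrunk["scalar"]["value"] -= subtract

    # The freed space keeps the reservation but loses the volume information.
    freed = copy.deepcopy(shrink["volume"])
    freed["scalar"]["value"] = subtract
    freed["disk"].pop("persistence", None)
    freed["disk"].pop("volume", None)
    if not freed["disk"]:
        del freed["disk"]
    return [freed, shrunk]
```

For the 2048 MB volume and a `subtract` value of 1024, this produces a 1024 MB reserved disk resource and a 1024 MB persistent volume, mirroring the offer shown above.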

Some restrictions about resizing a volume (applicable to both Offer::Operation::GrowVolume and Offer::Operation::ShrinkVolume):

Versioned HTTP Operator API

As described above, persistent volumes can be created by a framework scheduler as part of the resource offer cycle. Persistent volumes can also be managed using the HTTP Operator API.

This capability is intended for use by operators and administrative tools.

For each offer operation that interacts with persistent volumes, there is an equivalent call in the master’s HTTP Operator API.

Unversioned Operator HTTP Endpoints

Several HTTP endpoints, such as /create-volumes and /destroy-volumes, can still be used to manage persistent volumes, but we generally encourage operators to use the versioned HTTP Operator API instead, as new features like resize support may not be backported.

/create-volumes

To use this endpoint, the operator should first ensure that a reservation for the necessary resources has been made on the appropriate agent (e.g., by using the /reserve HTTP endpoint or by configuring a static reservation). The information that must be included in a request to this endpoint is similar to that of the CREATE offer operation. One difference is the required value of the disk.persistence.principal field: when HTTP authentication is enabled on the master, the field must be set to the same principal that is provided in the request’s HTTP headers. When HTTP authentication is disabled, the disk.persistence.principal field can take any value, or can be left unset. Note that the principal field determines the “creator principal” when authorization is enabled, even if HTTP authentication is disabled.

To create a 512MB persistent volume for the ads role on a dynamically reserved disk resource, we can send an HTTP POST request to the master’s /create-volumes endpoint like so:

curl -i \
     -u <operator_principal>:<password> \
     -d slaveId=<slave_id> \
     -d volumes='[
       {
         "name": "disk",
         "type": "SCALAR",
         "scalar": { "value": 512 },
         "role": "ads",
         "reservation": {
           "principal": <operator_principal>
         },
         "disk": {
           "persistence": {
             "id" : <persistence_id>,
             "principal" : <operator_principal>
           },
           "volume": {
             "mode": "RW",
             "container_path": <path>
           }
         }
       }
     ]' \
     -X POST http://<ip>:<port>/master/create-volumes

The user receives one of the following HTTP responses:

A single /create-volumes request can create multiple persistent volumes, but all of the volumes must be on the same agent.

This endpoint returns the 202 ACCEPTED HTTP status code, which indicates that the create operation has been validated successfully by the master. The request is then forwarded asynchronously to the Mesos agent where the reserved resources are located. That asynchronous message may not be delivered, or creating the volumes at the agent might fail, in which case no volumes will be created. To determine whether a create operation has succeeded, the user can examine the state of the appropriate Mesos agent (e.g., via the agent’s /state HTTP endpoint).
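
For tools that do not shell out to curl, the same request can be prepared with the Python standard library. This is a sketch under assumptions: the function name `build_create_volumes_request` is hypothetical, HTTP basic authentication is assumed (mirroring curl's `-u` option), and the request is only constructed here, not sent.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_create_volumes_request(master, slave_id, volumes, principal, password):
    """Prepare (but do not send) a POST to the master's /create-volumes endpoint.

    `master` is "<ip>:<port>"; `volumes` is a list of volume dicts as in the
    curl example. Illustrative helper only.
    """
    # Same two form fields that the curl example passes with -d.
    body = urlencode({
        "slaveId": slave_id,
        "volumes": json.dumps(volumes),
    }).encode("ascii")
    request = Request("http://" + master + "/master/create-volumes",
                      data=body, method="POST")
    # Basic authentication header, equivalent to curl's -u option.
    token = base64.b64encode((principal + ":" + password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request
```

The prepared request can be passed to `urllib.request.urlopen`. As described above, a 202 response only means that the master validated the operation, so the agent state still needs to be checked afterwards.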

/destroy-volumes

To destroy the volume created above, we can send an HTTP POST to the master’s /destroy-volumes endpoint like so:

curl -i \
     -u <operator_principal>:<password> \
     -d slaveId=<slave_id> \
     -d volumes='[
       {
         "name": "disk",
         "type": "SCALAR",
         "scalar": { "value": 512 },
         "role": "ads",
         "reservation": {
           "principal": <operator_principal>
         },
         "disk": {
           "persistence": {
             "id" : <persistence_id>
           },
           "volume": {
             "mode": "RW",
             "container_path": <path>
           }
         }
       }
     ]' \
     -X POST http://<ip>:<port>/master/destroy-volumes

Note that the volume JSON in the /destroy-volumes request must exactly match the definition of the volume. The JSON definition of a volume can be found via the reserved_resources_full key in the master’s /slaves endpoint (see below).

The user receives one of the following HTTP responses:

A single /destroy-volumes request can destroy multiple persistent volumes, but all of the volumes must be on the same agent.

This endpoint returns the 202 ACCEPTED HTTP status code, which indicates that the destroy operation has been validated successfully by the master. The request is then forwarded asynchronously to the Mesos agent where the volumes are located. That asynchronous message may not be delivered, or destroying the volumes at the agent might fail, in which case no volumes will be destroyed. To determine whether a destroy operation has succeeded, the user can examine the state of the appropriate Mesos agent (e.g., via the agent’s /state HTTP endpoint).

Listing Persistent Volumes

Information about the persistent volumes at each agent in the cluster can be found by querying the /slaves master endpoint, under the reserved_resources_full key.

The same information can also be found in the /state agent endpoint (under the reserved_resources_full key). The agent endpoint is useful for confirming whether changes to persistent volumes have been propagated to the agent (propagation can fail in the event of a network partition or master/agent restarts).
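
As an illustration of reading this data, the sketch below extracts persistent volumes from an already parsed /slaves response. It assumes the response has a top-level `slaves` list and that each agent's `reserved_resources_full` value maps a role name to a list of resource objects; verify that shape against the output of your own cluster before relying on it. The function name is hypothetical.

```python
def list_persistent_volumes(slaves_response):
    """Return (agent_id, role, persistence_id, size_mb) for each persistent volume.

    Assumes `slaves_response` is the parsed JSON of the master's /slaves
    endpoint, with reserved_resources_full keyed by role (an assumption).
    """
    found = []
    for agent in slaves_response.get("slaves", []):
        reserved = agent.get("reserved_resources_full", {})
        for role, resources in reserved.items():
            for resource in resources:
                persistence = resource.get("disk", {}).get("persistence")
                # Plain reservations have no persistence block; skip them.
                if persistence:
                    found.append((agent.get("id"), role, persistence["id"],
                                  resource["scalar"]["value"]))
    return found
```

Running the same kind of check against the agent's /state output after a create or destroy request is one way to confirm that the operation actually took effect on the agent.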

Programming with Persistent Volumes

Some suggestions to keep in mind when building applications that use persistent volumes:

Version History

Persistent volumes were introduced in Mesos 0.23. Mesos 0.27 introduced HTTP endpoints for creating and destroying volumes. Mesos 0.28 introduced support for multiple disk resources, and also enhanced the /slaves master endpoint to include detailed information about persistent volumes and dynamic reservations. Mesos 1.0 changed the semantics of destroying a volume: in previous releases, destroying a volume would remove the Mesos-level metadata but would not remove the volume’s data from the agent’s filesystem. Mesos 1.1 introduced support for shared persistent volumes. Mesos 1.6 introduced experimental support for resizing persistent volumes.