Apache Mesos - POSIX Resource Limits Support in Mesos Containerizer

This document describes the posix/rlimits isolator. The isolator adds support for setting POSIX resource limits (rlimits) for containers launched using the Mesos containerizer.

To enable the POSIX Resource Limits support, append posix/rlimits to the --isolation flag when starting the agent.

POSIX Resource Limits

POSIX rlimits can be used to control the resources a process can consume. Resource limits are typically set at boot time and inherited when a child process is forked from a parent process; resource limits can also be modified via setrlimit(2). In many interactive shells, resource limits can be inspected or modified with the ulimit shell built-in.
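As an illustration of inspecting limits programmatically, the following Python sketch reads the soft and hard limits of the calling process through the standard resource module, which wraps getrlimit(2). The helper names format_limit and current_limits are hypothetical, and the resource list assumes a Linux host where all seven POSIX resources are exposed.

```python
import resource

# POSIX-defined resources that Python's `resource` module exposes on Linux.
RESOURCES = [
    "RLIMIT_CORE", "RLIMIT_CPU", "RLIMIT_DATA", "RLIMIT_FSIZE",
    "RLIMIT_NOFILE", "RLIMIT_STACK", "RLIMIT_AS",
]


def format_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the calling process."""
    limits = {}
    for name in RESOURCES:
        # getrlimit returns the (soft, hard) pair for one resource.
        limits[name] = resource.getrlimit(getattr(resource, name))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={format_limit(soft)} hard={format_limit(hard)}")
```

The printed values depend on the host and on the limits inherited from the parent process, so no fixed output is shown here.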

A POSIX resource limit consists of a soft and a hard limit. The soft limit specifies the effective resource limit for the current process and any processes it forks, while the hard limit gives the value up to which processes may increase their effective limit; increasing the hard limit is a privileged action. The soft limit must be less than or equal to the hard limit. System administrators can use a hard resource limit to define the maximum amount of resources that can be consumed by a user; users can employ soft resource limits to ensure that one of their tasks only consumes a limited amount of the global hard resource limit.
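The interplay of soft and hard limits can be tried from an unprivileged process. The Python sketch below lowers the soft limit for core files to 0 and then restores it, leaving the hard limit untouched. lower_soft_limit is a hypothetical helper written for this illustration; it is not part of Mesos.

```python
import resource


def lower_soft_limit(res, new_soft):
    """Set the soft limit of `res` to `new_soft`, keeping the hard limit.

    Returns the previous (soft, hard) pair so the caller can restore it.
    """
    soft, hard = resource.getrlimit(res)
    inf = resource.RLIM_INFINITY
    # The soft limit must not exceed the hard limit; an unlimited hard
    # limit accepts any soft value.
    if hard != inf and (new_soft == inf or new_soft > hard):
        raise ValueError(f"soft limit {new_soft} exceeds hard limit {hard}")
    resource.setrlimit(res, (new_soft, hard))
    return soft, hard


if __name__ == "__main__":
    # A soft limit of 0 disables core files for this process and its children.
    previous = lower_soft_limit(resource.RLIMIT_CORE, 0)
    print("before:", previous, "after:", resource.getrlimit(resource.RLIMIT_CORE))
    # Restoring the old soft limit needs no privilege, because the old
    # value still lies within the unchanged hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, previous)
```

Only the soft value moves in this sketch; raising the hard value would be the privileged operation described above.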

Setting POSIX Resource Limits for Tasks

This isolator permits setting per-task resource limits. It interprets the rlimits specified as part of a task’s ContainerInfo for the Mesos containerizer, e.g.,

{
  "container": {
    "type": "MESOS",
    "rlimit_info": {
      "rlimits": [
        {
          "type": "RLMT_CORE"
        },
        {
          "type": "RLMT_STACK",
          "soft": 8192,
          "hard": 32768
        }
      ]
    }
  }
}
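A framework that assembles its task description in code might generate the fragment above rather than write it by hand. The following Python sketch does so with plain dictionaries; the helper names rlimit and container_info are hypothetical, and the sketch covers only the JSON shape shown above, not how a particular scheduler API submits it.

```python
import json


def rlimit(type_, soft=None, hard=None):
    """Build one `rlimits` entry; omit soft and hard for no explicit limit."""
    entry = {"type": type_}
    if soft is not None:
        entry["soft"] = soft
    if hard is not None:
        entry["hard"] = hard
    return entry


def container_info(rlimits):
    """Wrap rlimit entries in the ContainerInfo shape used in this document."""
    return {
        "container": {
            "type": "MESOS",
            "rlimit_info": {"rlimits": list(rlimits)},
        }
    }


if __name__ == "__main__":
    info = container_info([
        rlimit("RLMT_CORE"),
        rlimit("RLMT_STACK", soft=8192, hard=32768),
    ])
    print(json.dumps(info, indent=2))
```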

To enable interpretation of rlimits, agents need to be started with posix/rlimits in their --isolation flag, e.g.,

mesos-agent \
  --master=<master ip> \
  --ip=<agent ip> \
  --work_dir=/var/lib/mesos \
  --isolation=posix/rlimits[,other isolation flags]

To set a hard limit for a task that is larger than the current value of the hard limit, the agent process needs to run as a privileged user (with the CAP_SYS_RESOURCE capability), typically root.
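Whether a requested hard limit would need such privileges can be estimated by comparing it with the hard limit of the process that will apply it. The Python sketch below makes that comparison for the calling process; exceeds_current_hard is a hypothetical helper, and the result describes the agent only when the script runs under the same user and limits as the agent.

```python
import resource


def exceeds_current_hard(res, requested_hard):
    """True if `requested_hard` is larger than this process's hard limit."""
    _, current_hard = resource.getrlimit(res)
    if current_hard == resource.RLIM_INFINITY:
        return False  # nothing exceeds an unlimited hard limit
    if requested_hard == resource.RLIM_INFINITY:
        return True   # unlimited requested while the current limit is finite
    return requested_hard > current_hard


if __name__ == "__main__":
    needs_privilege = exceeds_current_hard(resource.RLIMIT_STACK, 32768)
    print("raising the hard stack limit to 32768 needs privilege:", needs_privilege)
```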

POSIX currently defines a base set of resources (see the POSIX documentation); Linux defines additional resource limits (see, e.g., the documentation of setrlimit(2)).

Resource           Comment
RLIMIT_CORE        POSIX: This is the maximum size of a core file, in bytes, that may be created by a process.
RLIMIT_CPU         POSIX: This is the maximum amount of CPU time, in seconds, used by a process.
RLIMIT_DATA        POSIX: This is the maximum size of a process’ data segment, in bytes.
RLIMIT_FSIZE       POSIX: This is the maximum size of a file, in bytes, that may be created by a process.
RLIMIT_NOFILE      POSIX: This is a number one greater than the maximum value that the system may assign to a newly-created descriptor.
RLIMIT_STACK       POSIX: This is the maximum size of the initial thread’s stack, in bytes.
RLIMIT_AS          POSIX: This is the maximum size of a process’ total available memory, in bytes.
RLIMIT_LOCKS       Linux: (Early Linux 2.4 only) A limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK     Linux: The maximum number of bytes of memory that may be locked into RAM.
RLIMIT_MSGQUEUE    Linux: Specifies the limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process.
RLIMIT_NICE        Linux: (Since Linux 2.6.12) Specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2).
RLIMIT_NPROC       Linux: The maximum number of processes (or, more precisely on Linux, threads) that can be created for the real user ID of the calling process.
RLIMIT_RSS         Linux: Specifies the limit (in pages) of the process’s resident set (the number of virtual pages resident in RAM).
RLIMIT_RTPRIO      Linux: (Since Linux 2.6.12) Specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
RLIMIT_RTTIME      Linux: (Since Linux 2.6.25) Specifies a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call.
RLIMIT_SIGPENDING  Linux: (Since Linux 2.6.8) Specifies the limit on the number of signals that may be queued for the real user ID of the calling process.

Mesos maps these resource types onto RLimit types, where by convention the prefix RLMT_ is used in place of RLIMIT_ (for example, RLIMIT_STACK becomes RLMT_STACK). Not all limit types are supported on all platforms.
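The naming convention amounts to a prefix substitution, as the following Python sketch shows. to_mesos_rlimit_type is a hypothetical helper for illustration; it only rewrites the name and does not check whether the resulting type is supported on a given platform.

```python
def to_mesos_rlimit_type(posix_name):
    """Map a POSIX/Linux name such as 'RLIMIT_STACK' to 'RLMT_STACK'."""
    prefix = "RLIMIT_"
    if not posix_name.startswith(prefix):
        raise ValueError(f"not an rlimit resource name: {posix_name!r}")
    # Keep the suffix and swap only the prefix.
    return "RLMT_" + posix_name[len(prefix):]


if __name__ == "__main__":
    for name in ("RLIMIT_CORE", "RLIMIT_STACK", "RLIMIT_NOFILE"):
        print(name, "->", to_mesos_rlimit_type(name))
```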

For each RLimit, either both the soft and the hard value must be set, or neither; the latter case is interpreted as the absence of an explicit limit.
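A framework could check this rule before submitting a task. The Python sketch below validates a single rlimits entry represented as a dictionary, covering both the both-or-neither rule and the requirement that the soft limit not exceed the hard limit. validate_rlimit is a hypothetical client-side helper; it is not the check that Mesos itself performs.

```python
def validate_rlimit(entry):
    """Raise ValueError unless `entry` sets both soft and hard, or neither."""
    has_soft = "soft" in entry
    has_hard = "hard" in entry
    if has_soft != has_hard:
        raise ValueError(f"{entry.get('type')}: set both soft and hard, or neither")
    if has_soft and entry["soft"] > entry["hard"]:
        raise ValueError(f"{entry.get('type')}: soft limit exceeds hard limit")
    return entry


if __name__ == "__main__":
    validate_rlimit({"type": "RLMT_CORE"})  # neither value set: accepted
    validate_rlimit({"type": "RLMT_STACK", "soft": 8192, "hard": 32768})
    try:
        validate_rlimit({"type": "RLMT_STACK", "soft": 8192})
    except ValueError as err:
        print("rejected:", err)
```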