MXNet have several settings that can be changed via environment variable. Usually you do not need to change these settings, but they are listed here for reference.
- MXNET_GPU_WORKER_NTHREADS (default=2)
- Maximum number of threads that do the computation job on each GPU.
- MXNET_GPU_COPY_NTHREADS (default=1)
- Maximum number of threads that do memory copy job on each GPU.
- MXNET_CPU_WORKER_NTHREADS (default=1)
- Maximum number of threads that do the CPU computation job.
- MXNET_CPU_PRIORITY_NTHREADS (default=4)
- Number of threads given to prioritized CPU jobs.
- MXNET_EXEC_ENABLE_INPLACE (default=true)
- Whether to enable inplace optimization in symbolic execution.
- MXNET_EXEC_MATCH_RANGE (default=10)
- The rough matching scale in symbolic execution memory allocator.
- Set this to 0 if we do not want to enable memory sharing between graph nodes(for debug purpose).
- MXNET_EXEC_NUM_TEMP (default=1)
- Maximum number of temp workspace we can allocate to each device.
- Set this to small number can save GPU memory.
- It will also likely to decrease level of parallelism, which is usually OK.
- MXNET_ENGINE_TYPE (default=ThreadedEnginePerDevice)
- The type of underlying execution engine of MXNet.
- List of choices
- NaiveEngine: very simple engine that use master thread to do computation.
- ThreadedEngine: a threaded engine that uses global thread pool to schedule jobs.
- ThreadedEnginePerDevice: a threaded engine that allocates thread per GPU.
- MXNET_KVSTORE_REDUCTION_NTHREADS (default=4)
- Number of threads used for summing of big arrays.
- MXNET_KVSTORE_BIGARRAY_BOUND (default=1e6)
- The minimum size of “big array”.
- When the array size is bigger than this threshold, MXNET_KVSTORE_REDUCTION_NTHREADS threads will be used for reduction.
Settings for Minimum Memory Usage¶
- Make sure
min(MXNET_EXEC_NUM_TEMP, MXNET_GPU_WORKER_NTHREADS) = 1
- The default setting satisfies this.
Settings for More GPU Parallelism¶
MXNET_GPU_WORKER_NTHREADSto larger number (e.g. 2)
- You may want to set
MXNET_EXEC_NUM_TEMPto reduce memory usage.
- You may want to set
- This may not speed things up, especially for image applications, because GPU is usually fully utilized even with serialized jobs.