The following factors may significantly affect performance:
- Use a fast backend. A fast BLAS library, e.g. OpenBLAS, ATLAS, or MKL, is necessary when running on CPU only. For Nvidia GPUs, we strongly recommend using cuDNN.
- Three important things for the input data:
  - Data format. If you are using the `rec` format, then everything should be fine.
  - Decoding. By default, MXNet uses 4 CPU threads for decoding images, which can often decode over 1k images per second. You may increase the number of threads if you are using a low-end CPU or your GPUs are very powerful.
  - Storage location. Any local or distributed filesystem (HDFS, Amazon S3) should be fine. There may be a problem if multiple machines read the data from a network shared filesystem (NFS) at the same time.
- Use a large batch size. We often choose the largest one that fits into GPU memory. However, too large a value may slow down convergence. For example, a safe batch size for CIFAR-10 is around 200, while for ImageNet 1K the batch size can go beyond 1K.
- Choose the proper `kvstore` if using more than one GPU. (See doc/developer-guide/multi_node.md for more information.)
  - For a single machine, the default `local` is often good enough. But you may want to use `local_allreduce_device` for models much larger than 100MB, such as AlexNet and VGG. Note, however, that `local_allreduce_device` takes more GPU memory than the others.
  - For multiple machines, we recommend trying `dist_sync` first. But if the model size is quite large or you use a large number of machines, you may want to use