MXNet Python Data Loading API

Introduction

This page introduces the data input methods in MXNet. MXNet uses iterators to provide data to the neural network. Iterators do some preprocessing and generate batches for the neural network.

  • We provide basic iterators for MNIST images and RecordIO images.
  • To hide the IO cost, a prefetch strategy lets the learning process and data fetching run in parallel. Data is fetched automatically by an independent thread.
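The prefetch idea can be illustrated without MXNet. The following is a minimal sketch, not MXNet's implementation: the helper name `prefetch` and the `buffer_size` argument are assumptions made for this example. A worker thread loads batches into a bounded queue while the consumer works on the batches already loaded.

```python
import queue
import threading


def prefetch(batches, buffer_size=1):
    """Yield items from `batches`, loading them ahead of time in a worker thread."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the data

    def worker():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is done:
            return
        yield batch


# Stand-in for a slow data source: four batches of three samples each.
source = ([i * 3 + j for j in range(3)] for i in range(4))
loaded = list(prefetch(source, buffer_size=2))
```

The bounded queue is what limits memory use: the worker can run at most `buffer_size` batches ahead of the consumer.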

Parameters For Data Iterator

Generally to create a data iterator, you need to provide five kinds of parameters:

  • Dataset Param gives the basic information about the dataset, e.g. file path, input shape.
  • Batch Param gives the information needed to form a batch, e.g. batch size.
  • Augmentation Param tells which augmentation operations (e.g. crop, mirror) should be applied to an input image.
  • Backend Param controls the behavior of the backend threads to hide the data loading cost.
  • Auxiliary Param provides options to help with checking and debugging.

Usually, Dataset Param and Batch Param MUST be given; otherwise a data batch cannot be created. Other parameters can be given according to algorithm and performance needs. Examples and detailed explanations of the options are provided in the later sections.

Create A Data Iterator

The IO API provides a simple way to create a data iterator in Python. The following code gives an example of creating a Cifar data iterator.

import mxnet as mx

dataiter = mx.io.ImageRecordIter(
        # Utility Parameter
        # Optional
        # Name of the data, should match the name of the data input of the network
        # data_name='data',
        # Utility Parameter
        # Optional
        # Name of the label, should match the name of the label parameter of the network.
        # Usually, if the loss layer is named 'foo', then the label input has the name
        # 'foo_label', unless overridden
        # label_name='softmax_label',
        # Dataset Parameter
        # Required
        # path to the data file; make sure the file already exists
        path_imgrec="data/cifar/train.rec",
        # Dataset Parameter
        # Required
        # image size after preprocessing
        data_shape=(3,28,28),
        # Batch Parameter
        # Required
        # number of images in a batch
        batch_size=100,
        # Augmentation Parameter
        # Optional
        # when mean_img is given, the mean value is subtracted from each image at each pixel
        mean_img="data/cifar/cifar10_mean.bin",
        # Augmentation Parameter
        # Optional
        # randomly crop a patch of the data_shape from the original image
        rand_crop=True,
        # Augmentation Parameter
        # Optional
        # randomly mirror the image horizontally
        rand_mirror=True,
        # Augmentation Parameter
        # Optional
        # randomly shuffle the data
        shuffle=False,
        # Backend Parameter
        # Optional
        # Preprocessing thread number
        preprocess_threads=4,
        # Backend Parameter
        # Optional
        # Prefetch buffer size
        prefetch_buffer=1)

The code above shows how to create a data iterator. First, you explicitly specify what kind of data (MNIST, ImageRecord, etc.) is to be fetched. Then you provide the options for the dataset, batching, image augmentation, multi-threaded processing and prefetching. The code automatically checks the validity of the parameters; if a compulsory parameter is missing, an error is raised.
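The kind of check described here can be pictured with a small stand-alone sketch. It is not MXNet's validation code: the `check_params` helper, the `REQUIRED` tuple and the error message are hypothetical, and the two required names are taken from the parameter reference on this page.

```python
# Hypothetical list of compulsory parameters, for illustration only.
REQUIRED = ("data_shape", "batch_size")


def check_params(**params):
    """Raise ValueError if a required parameter is missing; return params otherwise."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))
    return params


ok = check_params(path_imgrec="data/cifar/train.rec", data_shape=(3, 28, 28), batch_size=100)

try:
    check_params(path_imgrec="data/cifar/train.rec", data_shape=(3, 28, 28))
    error = None
except ValueError as exc:
    error = str(exc)
```

Failing early at construction time, as this sketch does, surfaces a missing parameter before any data is read.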

How To Get Data

We provide scripts to download the MNIST data and the Cifar10 ImageRecord data. If you would like to create your own dataset, the Image RecordIO data format is recommended.

Create Dataset Using RecordIO

RecordIO implements a file format for a sequence of records. We recommend storing images as records and packing them together. The benefits are:

  • Images can be stored in a compressed format, e.g. JPEG, because records can have different sizes. A compressed format greatly reduces the dataset size on disk.
  • Packing data together allows continuous reading on the disk.
  • RecordIO has a simple way of partitioning, which makes it easier to use in a distributed setting. An example of this will be provided later.
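To make the first two benefits concrete, here is a minimal sketch of a length-prefixed record file. It is deliberately simplified and is not the actual RecordIO on-disk format; the helpers `write_records` and `read_records` are names invented for this example. Each record carries its own length, so payloads of different sizes can be packed back to back and read in a single sequential pass.

```python
import io
import struct


def write_records(stream, payloads):
    """Append each payload as a 4-byte little-endian length followed by its bytes."""
    for payload in payloads:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield the payloads back in order with one sequential pass over the stream."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)


# Records of different sizes, standing in for JPEG-encoded images.
images = [b"\xff\xd8short", b"\xff\xd8a much longer payload", b""]
packed = io.BytesIO()
write_records(packed, images)
packed.seek(0)
restored = list(read_records(packed))
```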

We provide the im2rec tool so that you can create an Image RecordIO dataset yourself. Here’s the walkthrough:

0. Before you start

Make sure you have downloaded the data. You don’t need to resize the images yourself; currently im2rec can resize them automatically. Check the usage message of im2rec for details.

1. Make the image list

After you get the data, you need to make an image list file first. The format is

integer_image_index \t label_index \t path_to_image

In general, the program takes a list of the names of all images, shuffles them, then separates them into a training file name list and a testing file name list. Write down each list in the format above.

A sample file is shown below:

895099  464     n04467665_17283.JPEG
10025081        412     ILSVRC2010_val_00025082.JPEG
74181   789     n01915811_2739.JPEG
10035553        859     ILSVRC2010_val_00035554.JPEG
10048727        929     ILSVRC2010_val_00048728.JPEG
94028   924     n01980166_4956.JPEG
1080682 650     n11807979_571.JPEG
972457  633     n07723039_1627.JPEG
7534    11      n01630670_4486.JPEG
1191261 249     n12407079_5106.JPEG
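The list-making step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a tool shipped with MXNet: the function name `make_image_lists`, the `test_ratio` and `seed` arguments, and the sample file names are all hypothetical. It shuffles the labelled paths, numbers them, and splits the resulting tab-separated lines into a training list and a testing list.

```python
import random


def make_image_lists(labelled_paths, test_ratio=0.2, seed=0):
    """Shuffle (path, label) pairs and split them into train and test list lines."""
    items = list(labelled_paths)
    random.Random(seed).shuffle(items)
    # One line per image: integer_image_index \t label_index \t path_to_image
    lines = [f"{index}\t{label}\t{path}" for index, (path, label) in enumerate(items)]
    n_test = int(len(lines) * test_ratio)
    return lines[n_test:], lines[:n_test]


# Hypothetical file names and labels, for illustration only.
samples = [(f"img_{i}.JPEG", i % 3) for i in range(10)]
train_lines, test_lines = make_image_lists(samples)
```

Each returned list can then be written to its own file, one line per image, and passed to im2rec.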

2. Make the binary file

To generate the binary image file, use im2rec in the tool folder. im2rec takes the path of the image list file you just generated, the root path of the images and the output file path as input. This process usually takes several hours, so be patient. :)

A sample command:

./bin/im2rec image.lst image_root_dir output.bin resize=256

More details can be found by running ./bin/im2rec.

Extension: Multiple Labels for a Single Image

The im2rec tool and mx.io.ImageRecordIter also support multiple labels for a single image. Assuming you have 4 labels for a single image, you can take the following steps to use the RecordIO tools.

  1. Write the image list file as follows:
integer_image_index \t label_1 \t label_2 \t label_3 \t label_4 \t path_to_image
  2. When using the im2rec tool, add ‘label_width=4’ to the command arguments, e.g.
./bin/im2rec image.lst image_root_dir output.bin resize=256 label_width=4
  3. In your iterator generation code, set label_width=4 and path_imglist=<<The PATH TO YOUR image.lst>>, e.g.
import mxnet as mx

dataiter = mx.io.ImageRecordIter(
  path_imgrec="data/cifar/train.rec",
  data_shape=(3,28,28),
  batch_size=100,  # Batch Parameter, required
  path_imglist="data/cifar/image.lst",
  label_width=4
)

Then you’re all set for a multi-label image iterator.
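To check a multi-label list file before running im2rec, a line can be parsed as in the sketch below. The helper `parse_list_line` is hypothetical, and parsing labels as floats is an assumption of this sketch; the sample line reuses an index and file name from the sample file above with made-up labels.

```python
def parse_list_line(line, label_width=1):
    """Split one image-list line into (index, labels, path)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != label_width + 2:
        raise ValueError(f"expected {label_width + 2} fields, got {len(fields)}")
    index = int(fields[0])
    labels = [float(value) for value in fields[1:1 + label_width]]
    return index, labels, fields[-1]


record = parse_list_line("7534\t11\t3\t0\t42\tn01630670_4486.JPEG\n", label_width=4)
```

A line whose field count does not match label_width raises an error, which catches a list file written with the wrong number of label columns.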

IO API Reference

NDArray interface of mxnet

class mxnet.io.DataBatch(data, label, pad=None, index=None, bucket_key=None, provide_data=None, provide_label=None)

Default object for holding a mini-batch of data and related information.

class mxnet.io.DataIter

DataIter object in mxnet.

reset()

Reset the iterator.

next()

Get next data batch from iterator. Equivalent to calling self.iter_next() and then constructing DataBatch(self.getdata(), self.getlabel(), self.getpad(), None).

Returns: data – The data of next batch.
Return type: DataBatch

iter_next()

Iterate to next batch.

Returns: has_next – Whether the move is successful.
Return type: boolean

getdata()

Get data of current batch.

Returns: data – The data of current batch.
Return type: NDArray

getlabel()

Get label of current batch.

Returns: label – The label of current batch.
Return type: NDArray

getindex()

Get index of the current batch.

Returns: index – The index of current batch.
Return type: numpy.array

getpad()

Get the number of padding examples in current batch.

Returns: pad – Number of padding examples in current batch.
Return type: int
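The way these methods fit together can be shown with a toy class that does not depend on MXNet. `ListIter` is a hypothetical stand-in that only mimics the method names listed above; it returns plain tuples instead of DataBatch objects and omits getindex() and getpad().

```python
class ListIter:
    """Toy iterator over a list of (data, label) batches, mimicking the DataIter methods."""

    def __init__(self, batches):
        self.batches = batches
        self.cursor = -1  # position before the first batch

    def reset(self):
        self.cursor = -1

    def iter_next(self):
        self.cursor += 1
        return self.cursor < len(self.batches)

    def getdata(self):
        return self.batches[self.cursor][0]

    def getlabel(self):
        return self.batches[self.cursor][1]

    def next(self):
        # Advance first, then package the current batch.
        if self.iter_next():
            return self.getdata(), self.getlabel()
        raise StopIteration


it = ListIter([([1, 2], [0, 1]), ([3, 4], [1, 0])])
first_pass = []
while it.iter_next():
    first_pass.append((it.getdata(), it.getlabel()))
it.reset()
after_reset = it.next()
```

The explicit loop with iter_next() and the getters visits the same batches as repeated calls to next(); reset() rewinds to the start for the next epoch.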

class mxnet.io.ResizeIter(data_iter, size, reset_internal=True)

Resize a DataIter to a given number of batches per epoch. May produce an incomplete batch in the middle of an epoch due to padding from the internal iterator.

Parameters:
  • data_iter (DataIter) – Internal data iterator.
  • size – Number of batches per epoch to resize to.
  • reset_internal – Whether to reset the internal iterator on ResizeIter.reset.
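The effect of size and reset_internal can be sketched with plain Python lists. `ResizedBatches` is a hypothetical class written for this illustration, not the MXNet implementation: it hands out exactly `size` batches per epoch and wraps around to the first batch when the wrapped data runs out.

```python
class ResizedBatches:
    """Serve a fixed number of batches per epoch from a wrapped list of batches."""

    def __init__(self, batches, size, reset_internal=True):
        self.batches = batches
        self.size = size
        self.reset_internal = reset_internal
        self.pos = 0  # position inside the wrapped batch list

    def reset(self):
        # Only rewind the wrapped data when asked to.
        if self.reset_internal:
            self.pos = 0

    def epoch(self):
        """Return `size` batches, wrapping around when the wrapped list is exhausted."""
        out = []
        for _ in range(self.size):
            out.append(self.batches[self.pos % len(self.batches)])
            self.pos += 1
        return out


keep_going = ResizedBatches(["a", "b", "c"], size=2, reset_internal=False)
e1 = keep_going.epoch()
keep_going.reset()
e2 = keep_going.epoch()

rewind = ResizedBatches(["a", "b", "c"], size=2, reset_internal=True)
f1 = rewind.epoch()
rewind.reset()
f2 = rewind.epoch()
```

With reset_internal=False the second epoch continues where the first one stopped, so every batch is eventually visited even though each epoch is shorter than the data.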

class mxnet.io.PrefetchingIter(iters, rename_data=None, rename_label=None)

Base class for prefetching iterators. Takes one or more DataIters (or any class with “reset” and “read” methods) and combines them with prefetching.

Parameters:
  • iters (DataIter or list of DataIter) – one or more DataIters (or any class with “reset” and “read” methods)
  • rename_data (None or list of dict) – i-th element is a renaming map for i-th iter, in the form of {‘original_name’ : ‘new_name’}. Should have one entry for each entry in iter[i].provide_data
  • rename_label (None or list of dict) – Similar to rename_data

Examples

iter = PrefetchingIter([NDArrayIter({'data': X1}), NDArrayIter({'data': X2})],
                       rename_data=[{'data': 'data1'}, {'data': 'data2'}])

class mxnet.io.NDArrayIter(data, label=None, batch_size=1, shuffle=False, last_batch_handle='pad')

NDArrayIter object in mxnet. Takes NDArray or numpy arrays and wraps them in a data iterator.

Parameters:
  • data (NDArray or numpy.ndarray, a list of them, or a dict of string to them) – NDArrayIter supports single or multiple data and label.
  • label (NDArray or numpy.ndarray, a list of them, or a dict of them) – Same as data, but is not fed to the model during testing.
  • batch_size (int) – Batch size.
  • shuffle (bool) – Whether to shuffle the data.
  • last_batch_handle (‘pad’, ‘discard’ or ‘roll_over’) – How to handle the last batch.

Note

This iterator will pad, discard or roll over the last batch if the size of data does not match batch_size. Roll over is intended for training and can cause problems if used for prediction.
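The three ways of handling the last batch can be compared on plain lists. The sketch below is one plausible reading of the note above, not the NDArrayIter source: the function `batch_epoch`, its `carried` argument and its return layout are assumptions made for illustration.

```python
def batch_epoch(samples, batch_size, last_batch_handle="pad", carried=()):
    """Split one epoch of samples into batches.

    Returns (batches, pad, leftover): `pad` counts the filler samples in the
    last batch, and `leftover` holds samples that 'roll_over' carries into the
    next epoch.
    """
    data = list(carried) + list(samples)
    full = len(data) - len(data) % batch_size
    batches = [data[i:i + batch_size] for i in range(0, full, batch_size)]
    rest = data[full:]
    if not rest or last_batch_handle == "discard":
        return batches, 0, []
    if last_batch_handle == "roll_over":
        return batches, 0, rest
    if last_batch_handle == "pad":
        pad = batch_size - len(rest)
        # Fill the short batch by wrapping around to the first samples.
        filler = [data[i % len(data)] for i in range(pad)]
        batches.append(rest + filler)
        return batches, pad, []
    raise ValueError("unknown last_batch_handle: " + last_batch_handle)


padded = batch_epoch(range(7), 3, "pad")
discarded = batch_epoch(range(7), 3, "discard")
rolled = batch_epoch(range(7), 3, "roll_over")
next_epoch = batch_epoch(range(7), 3, "roll_over", carried=rolled[2])
```

In this sketch, rolling over shifts the batch boundaries from one epoch to the next, which illustrates why it suits training better than prediction, where each sample is expected exactly once and in order.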

provide_data

The name and shape of data provided by this iterator

provide_label

The name and shape of label provided by this iterator

hard_reset()

Ignore roll over data and reset to the start.

class mxnet.io.MXDataIter(handle, data_name='data', label_name='softmax_label', **_)

DataIter built in MXNet. List all the needed functions here.

Parameters:
  • handle – The handle to the underlying C++ Data Iterator.

debug_skip_load()

Set the iterator to always return the first batch.

Note

This can be used to test the speed of the network without taking the loading delay into account.

mxnet.io.CSVIter(*args, **kwargs)

Create iterator for dataset in csv.

Parameters:
  • data_csv (string, required) – Dataset Param: Data csv path.
  • data_shape (Shape(tuple), required) – Dataset Param: Shape of the data.
  • label_csv (string, optional, default='NULL') – Dataset Param: Label csv path. If NULL, all labels will be returned as 0.
  • label_shape (Shape(tuple), optional, default=(1,)) – Dataset Param: Shape of the label.
  • name (string, required.) – Name of the resulting data iterator.
Returns:

iterator – The result iterator.

Return type:

DataIter
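The behavior described by these parameters can be approximated in pure Python. This is a rough sketch, not CSVIter itself: `csv_batches` is a hypothetical helper, it takes open text streams rather than file paths, it keeps each row flat instead of reshaping it, and it drops an incomplete final batch. It checks that every row matches data_shape and returns 0 labels when no label source is given.

```python
import csv
import io


def csv_batches(data_csv, data_shape, batch_size, label_csv=None):
    """Yield (data, label) batches from CSV text streams, one sample per row."""
    size = 1
    for dim in data_shape:
        size *= dim
    rows = [[float(v) for v in row] for row in csv.reader(data_csv) if row]
    for row in rows:
        if len(row) != size:
            raise ValueError(f"row has {len(row)} values, data_shape needs {size}")
    if label_csv is None:
        labels = [0.0] * len(rows)  # no label source: every label is 0
    else:
        labels = [float(row[0]) for row in csv.reader(label_csv) if row]
    for start in range(0, len(rows) - batch_size + 1, batch_size):
        yield rows[start:start + batch_size], labels[start:start + batch_size]


data_text = io.StringIO("1,2,3,4\n5,6,7,8\n9,10,11,12\n13,14,15,16\n")
batches = list(csv_batches(data_text, data_shape=(2, 2), batch_size=2))
```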

mxnet.io.ImageRecordIter(*args, **kwargs)

Create iterator for dataset packed in recordio.

Parameters:
  • path_imglist (string, optional, default='') – Dataset Param: Path to image list.
  • path_imgrec (string, optional, default='./data/imgrec.rec') – Dataset Param: Path to image record file.
  • aug_seq (string, optional, default='aug_default') – Augmentation Param: the augmenter names representing the sequence of augmenters to be applied, separated by commas. Additional keyword parameters will be seen by these augmenters.
  • label_width (int, optional, default='1') – Dataset Param: How many labels for an image.
  • data_shape (Shape(tuple), required) – Dataset Param: Shape of each instance generated by the DataIter.
  • preprocess_threads (int, optional, default='4') – Backend Param: Number of threads used for preprocessing.
  • verbose (boolean, optional, default=True) – Auxiliary Param: Whether to output parser information.
  • num_parts (int, optional, default='1') – partition the data into multiple parts
  • part_index (int, optional, default='0') – the index of the part to read
  • shuffle (boolean, optional, default=False) – Augmentation Param: Whether to shuffle data.
  • seed (int, optional, default='0') – Augmentation Param: Random Seed.
  • batch_size (int (non-negative), required) – Batch Param: Batch size.
  • round_batch (boolean, optional, default=True) – Batch Param: Use round robin to handle overflow batch.
  • prefetch_buffer (long (non-negative), optional, default=4) – Backend Param: Number of prefetched parameters
  • rand_crop (boolean, optional, default=False) – Augmentation Param: Whether to random crop on the image
  • crop_y_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on y.
  • crop_x_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on x.
  • max_rotate_angle (int, optional, default='0') – Augmentation Param: rotated randomly in [-max_rotate_angle, max_rotate_angle].
  • max_aspect_ratio (float, optional, default=0) – Augmentation Param: denotes the max ratio of random aspect ratio augmentation.
  • max_shear_ratio (float, optional, default=0) – Augmentation Param: denotes the max random shearing ratio.
  • max_crop_size (int, optional, default='-1') – Augmentation Param: Maximum crop size.
  • min_crop_size (int, optional, default='-1') – Augmentation Param: Minimum crop size.
  • max_random_scale (float, optional, default=1) – Augmentation Param: Maximum scale ratio.
  • min_random_scale (float, optional, default=1) – Augmentation Param: Minimum scale ratio.
  • max_img_size (float, optional, default=1e+10) – Augmentation Param: Maximum image size after resizing.
  • min_img_size (float, optional, default=0) – Augmentation Param: Minimum image size after resizing.
  • random_h (int, optional, default='0') – Augmentation Param: Maximum value of H channel in HSL color space.
  • random_s (int, optional, default='0') – Augmentation Param: Maximum value of S channel in HSL color space.
  • random_l (int, optional, default='0') – Augmentation Param: Maximum value of L channel in HSL color space.
  • rotate (int, optional, default='-1') – Augmentation Param: Rotate angle.
  • fill_value (int, optional, default='255') – Augmentation Param: Pixel value used to fill areas left empty by augmentation, such as padding.
  • inter_method (int, optional, default='1') – Augmentation Param: 0-NN 1-bilinear 2-cubic 3-area 4-lanczos4 9-auto 10-rand.
  • pad (int, optional, default='0') – Augmentation Param: Padding size.
  • mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image.
  • rand_mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image randomly.
  • mean_img (string, optional, default='') – Augmentation Param: Mean Image to be subtracted.
  • mean_r (float, optional, default=0) – Augmentation Param: Mean value on R channel.
  • mean_g (float, optional, default=0) – Augmentation Param: Mean value on G channel.
  • mean_b (float, optional, default=0) – Augmentation Param: Mean value on B channel.
  • mean_a (float, optional, default=0) – Augmentation Param: Mean value on Alpha channel.
  • scale (float, optional, default=1) – Augmentation Param: Scale in color space.
  • max_random_contrast (float, optional, default=0) – Augmentation Param: Maximum ratio of contrast variation.
  • max_random_illumination (float, optional, default=0) – Augmentation Param: Maximum value of illumination variation.
  • name (string, required.) – Name of the resulting data iterator.
Returns:

iterator – The result iterator.

Return type:

DataIter
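Two of the augmentation options, rand_crop and rand_mirror, are easy to picture on a small 2-D image stored as a list of rows. The sketch below only illustrates the idea and is not the MXNet augmenter: the helper names, the 50% mirror probability and the choice to leave the image uncropped when rand_crop is off are assumptions.

```python
import random


def random_crop(image, height, width, rng):
    """Cut a height x width patch at a random position from a 2-D image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def mirror(image):
    """Flip a 2-D image horizontally."""
    return [row[::-1] for row in image]


def augment(image, height, width, rng, rand_crop=True, rand_mirror=True):
    """Apply the enabled augmentations in order: crop first, then mirror."""
    if rand_crop:
        image = random_crop(image, height, width, rng)
    if rand_mirror and rng.random() < 0.5:
        image = mirror(image)
    return image


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 test image
patch = augment(image, 2, 2, random.Random(0))
flipped = mirror([[1, 2, 3], [4, 5, 6]])
```

Passing a seeded random.Random keeps the augmentation reproducible, which plays the same role as the seed parameter in the list above.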

mxnet.io.MNISTIter(*args, **kwargs)

Create iterator for MNIST hand-written digit number recognition dataset.

Parameters:
  • image (string, optional, default='./train-images-idx3-ubyte') – Dataset Param: MNIST image path.
  • label (string, optional, default='./train-labels-idx1-ubyte') – Dataset Param: MNIST label path.
  • batch_size (int, optional, default='128') – Batch Param: Batch Size.
  • shuffle (boolean, optional, default=True) – Augmentation Param: Whether to shuffle data.
  • flat (boolean, optional, default=False) – Augmentation Param: Whether to flatten the data into 1D.
  • seed (int, optional, default='0') – Augmentation Param: Random Seed.
  • silent (boolean, optional, default=False) – Auxiliary Param: Whether to print out data info.
  • num_parts (int, optional, default='1') – partition the data into multiple parts
  • part_index (int, optional, default='0') – the index of the part to read
  • prefetch_buffer (long (non-negative), optional, default=4) – Backend Param: Number of prefetched parameters
  • name (string, required.) – Name of the resulting data iterator.
Returns:

iterator – The result iterator.

Return type:

DataIter
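The num_parts and part_index parameters let several workers read disjoint parts of one dataset. How the backend actually splits the data is not specified on this page; the sketch below assumes a simple contiguous split by item count, and `part_range` is a hypothetical helper.

```python
def part_range(num_items, num_parts, part_index):
    """Return the [start, end) range of items read by one part."""
    if not 0 <= part_index < num_parts:
        raise ValueError("part_index must be in [0, num_parts)")
    step = (num_items + num_parts - 1) // num_parts  # ceiling division
    start = min(part_index * step, num_items)
    return start, min(start + step, num_items)


# Three workers sharing a dataset of ten items.
ranges = [part_range(10, 3, i) for i in range(3)]
```

Under this assumption each worker would create its iterator with the same num_parts and its own part_index, so that together the workers cover every item exactly once.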