MXNet Python Data Loading API¶
- Introduction introduces the main features of the data loader in MXNet.
- Parameters For Data Iterator clarifies the different usages of the data iterator parameters.
- Create A Data Iterator shows how to create a data iterator in MXNet Python.
- How To Get Data introduces the data resources and data preparation tools.
- IO API Reference lists the IO API with explanations.
Introduction¶
This page introduces the data input methods in MXNet. MXNet uses iterators to provide data to the neural network. Iterators do some preprocessing and generate batches for the neural network.
- We provide basic iterators for MNIST images and RecordIO images.
- To hide the IO cost, a prefetch strategy is used so that the learning process and data fetching can run in parallel. Data is fetched automatically by an independent thread.
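The prefetch strategy can be sketched in plain Python, without MXNet. This is only an illustration of the idea under stated assumptions: the function name `prefetch`, its `buffer_size` argument, and the toy data source are hypothetical and are not part of the MXNet API. A worker thread fills a bounded buffer while the consumer (the training loop) reads from it.

```python
import queue
import threading


def prefetch(batches, buffer_size=4):
    """Yield items from `batches` while a worker thread loads ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the data

    def worker():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buf.get()  # blocks until the worker has a batch ready
        if batch is done:
            return
        yield batch


# Hypothetical data source: ten batches of four sample indices each.
source = ([i * 4 + j for j in range(4)] for i in range(10))
loaded = list(prefetch(source, buffer_size=2))
```

The bounded queue is the design choice that matters here: it lets loading run ahead of training by at most `buffer_size` batches, so memory use stays fixed.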
Parameters For Data Iterator¶
Generally to create a data iterator, you need to provide five kinds of parameters:
- Dataset Param gives the basic information for the dataset, e.g. file path, input shape.
- Batch Param gives the information to form a batch, e.g. batch size.
- Augmentation Param tells which augmentation operations (e.g. crop, mirror) should be applied to an input image.
- Backend Param controls the behavior of the backend threads to hide data loading cost.
- Auxiliary Param provides options to help checking and debugging.
Usually, Dataset Param and Batch Param MUST be given; otherwise a data batch cannot be created. Other parameters can be given according to algorithm and performance needs. Examples and detailed explanations of the options are provided in the later sections.
Create A Data Iterator¶
The IO API provides a simple way to create a data iterator in Python. The following code gives an example of creating a Cifar data iterator.
import mxnet as mx

dataiter = mx.io.ImageRecordIter(
    # Utility Parameter
    # Optional
    # Name of the data, should match the name of the data input of the network
    # data_name='data',
    # Utility Parameter
    # Optional
    # Name of the label, should match the name of the label parameter of the network.
    # Usually, if the loss layer is named 'foo', then the label input has the name
    # 'foo_label', unless overwritten
    # label_name='softmax_label',
    # Dataset Parameter
    # Compulsory
    # path of the data file; please check that the data is already there
    path_imgrec="data/cifar/train.rec",
    # Dataset Parameter
    # Compulsory
    # image size after preprocessing
    data_shape=(3,28,28),
    # Batch Parameter
    # Compulsory
    # number of images in a batch
    batch_size=100,
    # Augmentation Parameter
    # Optional
    # when mean_img is given, the mean value is subtracted from each image at every pixel
    mean_img="data/cifar/cifar10_mean.bin",
    # Augmentation Parameter
    # Optional
    # randomly crop a patch of size data_shape from the original image
    rand_crop=True,
    # Augmentation Parameter
    # Optional
    # randomly mirror the image horizontally
    rand_mirror=True,
    # Augmentation Parameter
    # Optional
    # randomly shuffle the data
    shuffle=False,
    # Backend Parameter
    # Optional
    # number of preprocessing threads
    preprocess_threads=4,
    # Backend Parameter
    # Optional
    # prefetch buffer size
    prefetch_buffer=1)
The code above shows how to create a data iterator. First, explicitly state what kind of data (MNIST, ImageRecord, etc.) is to be fetched. Then provide the options for the dataset, batching, image augmentation, multi-thread processing and prefetching. The code automatically checks the validity of the parameters; if a compulsory parameter is missing, an error occurs.
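The validity check can be pictured with a small stand-alone sketch. It does not reproduce MXNet's actual checking code: the names `REQUIRED` and `check_params` are hypothetical, and the set of required names is an assumption taken from the parameters that the example above treats as mandatory.

```python
# Assumption: the three parameters the example above treats as mandatory.
REQUIRED = ("path_imgrec", "data_shape", "batch_size")


def check_params(**params):
    """Raise ValueError if any required parameter is absent, else return params."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))
    return params


ok = check_params(path_imgrec="data/cifar/train.rec",
                  data_shape=(3, 28, 28),
                  batch_size=100,
                  rand_crop=True)

try:
    check_params(path_imgrec="data/cifar/train.rec")
    error_message = None
except ValueError as err:
    error_message = str(err)
```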
How To Get Data¶
We provide scripts to download the MNIST data and the Cifar10 ImageRecord data. If you would like to create your own dataset, the Image RecordIO data format is recommended.
Create Dataset Using RecordIO¶
RecordIO implements a file format for a sequence of records. We recommend storing images as records and packing them together. The benefits are:
- Images can be stored in a compressed format, e.g. JPEG, because records can have different sizes. A compressed format greatly reduces the dataset size on disk.
- Packing data together allows continuous reading from the disk.
- RecordIO has a simple way of partitioning, which makes it easier to use in a distributed setting. An example of this will be provided later.
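To make the idea of variable-size records and partitioning concrete, here is a toy sketch. It is not the real RecordIO on-disk format and not MXNet's partitioning code: the length-prefixed layout and the helper names `write_records`, `read_records` and `partition` are assumptions chosen for illustration only.

```python
import io
import struct


def write_records(stream, records):
    """Write each record as a 4-byte little-endian length followed by its bytes."""
    for payload in records:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Read back the records written by write_records, one after another."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)


def partition(records, num_parts, part_index):
    """Return the contiguous slice of records assigned to one worker."""
    start = len(records) * part_index // num_parts
    stop = len(records) * (part_index + 1) // num_parts
    return records[start:stop]


# Stand-ins for compressed images of different sizes.
images = [b"a" * 3, b"bb" * 5, b"c", b"dddd", b"ee"]
buffer = io.BytesIO()
write_records(buffer, images)
buffer.seek(0)
restored = list(read_records(buffer))
parts = [partition(restored, num_parts=2, part_index=i) for i in range(2)]
```

Because the records sit back to back, reading is sequential, and each worker in a distributed setting can be handed its own slice.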
We provide the im2rec tool so that you can create an Image RecordIO dataset yourself. Here’s the walkthrough:
0. Before you start¶
Make sure you have downloaded the data. You don’t need to resize the images yourself; im2rec can currently resize them automatically. Check the usage message of im2rec for details.
1. Make the image list¶
After you get the data, you first need to make an image list file. The format is
integer_image_index \t label_index \t path_to_image
In general, the program takes a list of the names of all images, shuffles them, then separates them into a training file name list and a testing file name list. Write down each list in this format.
A sample file is provided here:
895099 464 n04467665_17283.JPEG
10025081 412 ILSVRC2010_val_00025082.JPEG
74181 789 n01915811_2739.JPEG
10035553 859 ILSVRC2010_val_00035554.JPEG
10048727 929 ILSVRC2010_val_00048728.JPEG
94028 924 n01980166_4956.JPEG
1080682 650 n11807979_571.JPEG
972457 633 n07723039_1627.JPEG
7534 11 n01630670_4486.JPEG
1191261 249 n12407079_5106.JPEG
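The list-making step can be sketched as follows. This is a minimal illustration, not the script shipped with MXNet: the function names, the 80/20 split, the fixed seed and the sample file names are all assumptions made for the example.

```python
import os
import random
import tempfile


def make_image_lists(labelled_paths, train_ratio=0.8, seed=0):
    """Shuffle (label, path) pairs and split them into train and test lines."""
    items = list(labelled_paths)
    random.Random(seed).shuffle(items)
    # One line per image: integer_image_index \t label_index \t path_to_image
    lines = ["%d\t%d\t%s" % (index, label, path)
             for index, (label, path) in enumerate(items)]
    split = int(len(lines) * train_ratio)
    return lines[:split], lines[split:]


def write_list(lines, filename):
    """Write the list lines to a text file, one per line."""
    with open(filename, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Hypothetical file names and labels, for illustration only.
samples = [(i % 3, "class%d/img_%03d.jpg" % (i % 3, i)) for i in range(10)]
train_lines, test_lines = make_image_lists(samples)

out_dir = tempfile.mkdtemp()
write_list(train_lines, os.path.join(out_dir, "train.lst"))
write_list(test_lines, os.path.join(out_dir, "test.lst"))
```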
2. Make the binary file¶
To generate the binary image file, use im2rec in the tool folder. im2rec takes the path of the image list file you just generated, the root path of the images and the output file path as input. This process usually takes several hours, so be patient. :)
A sample command:
./bin/im2rec image.lst image_root_dir output.bin resize=256
More details can be found by running ./bin/im2rec.
Extension: Multiple Labels for a Single Image¶
The im2rec tool and mx.io.ImageRecordIter also support multiple labels for a single image.
Assume you have 4 labels for a single image; you can take the following steps to use the RecordIO tools.
- Write the image list file as follows:
integer_image_index \t label_1 \t label_2 \t label_3 \t label_4 \t path_to_image
- When using the im2rec tool, add ‘label_width=4’ to the command arguments, e.g.
./bin/im2rec image.lst image_root_dir output.bin resize=256 label_width=4
- In your iterator generation code, set label_width=4 and path_imglist=<<The PATH TO YOUR image.lst>>, e.g.
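dataiter = mx.io.ImageRecordIter(
    path_imgrec="data/cifar/train.rec",
    data_shape=(3,28,28),
    # batch_size is listed as required in the reference below; 100 matches the earlier example
    batch_size=100,
    path_imglist="data/cifar/image.lst",
    label_width=4
)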
Then you’re all set for a multi-label image iterator.
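The multi-label list format can be checked with a short stand-alone sketch. The helper names are hypothetical, and parsing the labels as floats is an assumption made for illustration, not a statement about how im2rec reads them.

```python
def format_multilabel_line(index, labels, path):
    """Build one list line: index, then every label, then the image path."""
    fields = [str(index)] + [str(label) for label in labels] + [path]
    return "\t".join(fields)


def parse_multilabel_line(line, label_width):
    """Split a list line back into (index, labels, path)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != label_width + 2:
        raise ValueError("expected %d tab-separated fields, got %d"
                         % (label_width + 2, len(fields)))
    index = int(fields[0])
    labels = [float(value) for value in fields[1:1 + label_width]]
    return index, labels, fields[-1]


line = format_multilabel_line(7, [1, 0, 3, 2], "cats/img_007.jpg")
parsed = parse_multilabel_line(line, label_width=4)
```

Checking the field count against label_width + 2 catches a list file whose label count does not match the label_width passed to the tool and the iterator.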
IO API Reference¶
NDArray interface of mxnet
- class mxnet.io.DataBatch(data, label, pad=None, index=None, bucket_key=None, provide_data=None, provide_label=None)¶ Default object for holding a mini-batch of data and related information.
- class mxnet.io.DataIter¶ DataIter object in mxnet.
  - reset()¶ Reset the iterator.
  - next()¶ Get next data batch from iterator. Equivalent to calling self.iter_next() and then returning DataBatch(self.getdata(), self.getlabel(), self.getpad(), None).
    Returns: data – The data of next batch. Return type: DataBatch
  - iter_next()¶ Iterate to next batch.
    Returns: has_next – Whether the move is successful. Return type: boolean
  - getlabel()¶ Get label of current batch.
    Returns: label – The label of current batch. Return type: NDArray
  - getindex()¶ Get index of the current batch.
    Returns: index – The index of current batch. Return type: numpy.array
  - getpad()¶ Get the number of padding examples in current batch.
    Returns: pad – Number of padding examples in current batch. Return type: int
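How these methods fit together can be illustrated with a toy class in plain Python. It only mimics the method names listed above and does not subclass or import anything from mxnet. The class name `ListIter`, the tuple returned by next() in place of a DataBatch, and the wrap-around padding are assumptions made for the sketch.

```python
class ListIter:
    """Toy iterator that mimics the DataIter method names without using mxnet."""

    def __init__(self, data, labels, batch_size):
        self.data = data
        self.labels = labels
        self.batch_size = batch_size
        self.cursor = -batch_size  # positioned before the first batch

    def reset(self):
        self.cursor = -self.batch_size

    def iter_next(self):
        # Move to the next batch and report whether it exists.
        self.cursor += self.batch_size
        return self.cursor < len(self.data)

    def getpad(self):
        # Number of samples missing from the current (last) batch.
        return max(0, self.cursor + self.batch_size - len(self.data))

    def _window(self, values):
        chunk = values[self.cursor:self.cursor + self.batch_size]
        # Fill a short final batch by wrapping around to the first samples.
        return chunk + values[:self.getpad()]

    def getdata(self):
        return self._window(self.data)

    def getlabel(self):
        return self._window(self.labels)

    def next(self):
        if self.iter_next():
            # A plain tuple stands in for DataBatch in this sketch.
            return self.getdata(), self.getlabel(), self.getpad()
        raise StopIteration


it = ListIter(list("abcde"), [0, 1, 2, 3, 4], batch_size=2)
batches = []
while True:
    try:
        batches.append(it.next())
    except StopIteration:
        break
```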
- class mxnet.io.ResizeIter(data_iter, size, reset_internal=True)¶ Resize a DataIter to a given number of batches per epoch. May produce an incomplete batch in the middle of an epoch due to padding from the internal iterator.
Parameters: - data_iter (DataIter) – Internal data iterator.
- size (int) – Number of batches per epoch to resize to.
- reset_internal (bool) – Whether to reset the internal iterator on ResizeIter.reset.
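The resizing behaviour can be pictured with a small generator. This is a sketch of the idea rather than the ResizeIter implementation: the name `resized_epoch` and the `make_iter` factory (which plays the role of resetting the internal iterator) are hypothetical, the reset_internal option is not modelled, and the source is assumed to yield at least one batch.

```python
def resized_epoch(make_iter, size):
    """Yield exactly `size` batches, restarting the source when it runs out.

    Assumes make_iter() returns an iterator with at least one batch.
    """
    source = make_iter()
    produced = 0
    while produced < size:
        try:
            batch = next(source)
        except StopIteration:
            source = make_iter()  # stands in for reset() on the internal iterator
            batch = next(source)
        yield batch
        produced += 1


def three_batches():
    # Hypothetical internal iterator with three batches per pass.
    return iter(["b0", "b1", "b2"])


longer = list(resized_epoch(three_batches, 5))
shorter = list(resized_epoch(three_batches, 2))
```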
- class mxnet.io.PrefetchingIter(iters, rename_data=None, rename_label=None)¶ Base class for prefetching iterators. Takes one or more DataIters (or any class with “reset” and “read” methods) and combines them with prefetching.
Parameters: - iters (DataIter or list of DataIter) – One or more DataIters (or any class with “reset” and “read” methods).
- rename_data (None or list of dict) – The i-th element is a renaming map for the i-th iter, in the form of {'original_name': 'new_name'}. Should have one entry for each entry in iter[i].provide_data.
- rename_label (None or list of dict) – Similar to rename_data.
Examples
iter = PrefetchingIter([NDArrayIter({'data': X1}), NDArrayIter({'data': X2})],
                       rename_data=[{'data': 'data1'}, {'data': 'data2'}])
- class mxnet.io.NDArrayIter(data, label=None, batch_size=1, shuffle=False, last_batch_handle='pad')¶ NDArrayIter object in mxnet. Takes NDArray or numpy arrays and serves them as a data iterator.
Parameters: - data (NDArray or numpy.ndarray, a list of them, or a dict of string to them) – NDArrayIter supports single or multiple data and label.
- label (NDArray or numpy.ndarray, a list of them, or a dict of them) – Same as data, but is not fed to the model during testing.
- batch_size (int) – Batch size.
- shuffle (bool) – Whether to shuffle the data.
- last_batch_handle ('pad', 'discard' or 'roll_over') – How to handle the last batch.
Note
This iterator will pad, discard or roll over the last batch if the size of data does not match batch_size. Roll over is intended for training and can cause problems if used for prediction.
  - provide_data¶ The name and shape of data provided by this iterator.
  - provide_label¶ The name and shape of label provided by this iterator.
  - hard_reset()¶ Ignore roll over data and set to start.
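The difference between the 'pad' and 'discard' options can be shown on a plain list. This sketch is an illustration only, not NDArrayIter's code: the function name `make_batches` is hypothetical, padding by reusing samples from the start is an assumption, and 'roll_over' is deliberately left out.

```python
def make_batches(samples, batch_size, last_batch_handle="pad"):
    """Return a list of (batch, pad) pairs for one pass over `samples`."""
    if last_batch_handle not in ("pad", "discard"):
        raise ValueError("this sketch only covers 'pad' and 'discard'")
    batches = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        pad = batch_size - len(batch)  # number of missing samples
        if pad > 0:
            if last_batch_handle == "discard":
                break  # drop the incomplete last batch
            batch = batch + samples[:pad]  # fill with samples from the start
        batches.append((batch, pad))
    return batches


padded = make_batches(list(range(7)), batch_size=3, last_batch_handle="pad")
discarded = make_batches(list(range(7)), batch_size=3, last_batch_handle="discard")
```

With 7 samples and a batch size of 3, 'pad' produces three full batches, the last one carrying two padding samples, while 'discard' produces two batches and drops the seventh sample.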
- class mxnet.io.MXDataIter(handle, data_name='data', label_name='softmax_label', **_)¶ DataIter built in MXNet. List all the needed functions here.
Parameters: - handle – The handle to the underlying C++ Data Iterator.
  - debug_skip_load()¶ Set the iterator to always return the first batch.
    Notes
    This can be used to test the speed of the network without taking the loading delay into account.
- mxnet.io.CSVIter(*args, **kwargs)¶ Create iterator for dataset in csv.
Parameters: - data_csv (string, required) – Dataset Param: Data csv path.
- data_shape (Shape(tuple), required) – Dataset Param: Shape of the data.
- label_csv (string, optional, default='NULL') – Dataset Param: Label csv path. If NULL, all labels will be returned as 0.
- label_shape (Shape(tuple), optional, default=(1,)) – Dataset Param: Shape of the label.
- name (string, required.) – Name of the resulting data iterator.
Returns: iterator – The result iterator.
Return type:
- mxnet.io.ImageRecordIter(*args, **kwargs)¶ Create iterator for dataset packed in recordio.
Parameters: - path_imglist (string, optional, default='') – Dataset Param: Path to image list.
- path_imgrec (string, optional, default='./data/imgrec.rec') – Dataset Param: Path to image record file.
- aug_seq (string, optional, default='aug_default') – Augmentation Param: the augmenter names representing the sequence of augmenters to be applied, separated by commas. Additional keyword parameters will be seen by these augmenters.
- label_width (int, optional, default='1') – Dataset Param: How many labels for an image.
- data_shape (Shape(tuple), required) – Dataset Param: Shape of each instance generated by the DataIter.
- preprocess_threads (int, optional, default='4') – Backend Param: Number of threads used for preprocessing.
- verbose (boolean, optional, default=True) – Auxiliary Param: Whether to output parser information.
- num_parts (int, optional, default='1') – Partition the data into multiple parts.
- part_index (int, optional, default='0') – The index of the part that will be read.
- shuffle (boolean, optional, default=False) – Augmentation Param: Whether to shuffle data.
- seed (int, optional, default='0') – Augmentation Param: Random Seed.
- batch_size (int (non-negative), required) – Batch Param: Batch size.
- round_batch (boolean, optional, default=True) – Batch Param: Use round robin to handle overflow batch.
- prefetch_buffer (long (non-negative), optional, default=4) – Backend Param: Number of prefetched parameters
- rand_crop (boolean, optional, default=False) – Augmentation Param: Whether to random crop on the image
- crop_y_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on y.
- crop_x_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on x.
- max_rotate_angle (int, optional, default='0') – Augmentation Param: rotated randomly in [-max_rotate_angle, max_rotate_angle].
- max_aspect_ratio (float, optional, default=0) – Augmentation Param: denotes the max ratio of random aspect ratio augmentation.
- max_shear_ratio (float, optional, default=0) – Augmentation Param: denotes the max random shearing ratio.
- max_crop_size (int, optional, default='-1') – Augmentation Param: Maximum crop size.
- min_crop_size (int, optional, default='-1') – Augmentation Param: Minimum crop size.
- max_random_scale (float, optional, default=1) – Augmentation Param: Maximum scale ratio.
- min_random_scale (float, optional, default=1) – Augmentation Param: Minimum scale ratio.
- max_img_size (float, optional, default=1e+10) – Augmentation Param: Maximum image size after resizing.
- min_img_size (float, optional, default=0) – Augmentation Param: Minimum image size after resizing.
- random_h (int, optional, default='0') – Augmentation Param: Maximum value of H channel in HSL color space.
- random_s (int, optional, default='0') – Augmentation Param: Maximum value of S channel in HSL color space.
- random_l (int, optional, default='0') – Augmentation Param: Maximum value of L channel in HSL color space.
- rotate (int, optional, default='-1') – Augmentation Param: Rotate angle.
- fill_value (int, optional, default='255') – Augmentation Param: Maximum value of illumination variation.
- inter_method (int, optional, default='1') – Augmentation Param: 0-NN 1-bilinear 2-cubic 3-area 4-lanczos4 9-auto 10-rand.
- pad (int, optional, default='0') – Augmentation Param: Padding size.
- mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image.
- rand_mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image randomly.
- mean_img (string, optional, default='') – Augmentation Param: Mean Image to be subtracted.
- mean_r (float, optional, default=0) – Augmentation Param: Mean value on R channel.
- mean_g (float, optional, default=0) – Augmentation Param: Mean value on G channel.
- mean_b (float, optional, default=0) – Augmentation Param: Mean value on B channel.
- mean_a (float, optional, default=0) – Augmentation Param: Mean value on Alpha channel.
- scale (float, optional, default=1) – Augmentation Param: Scale in color space.
- max_random_contrast (float, optional, default=0) – Augmentation Param: Maximum ratio of contrast variation.
- max_random_illumination (float, optional, default=0) – Augmentation Param: Maximum value of illumination variation.
- name (string, required.) – Name of the resulting data iterator.
Returns: iterator – The result iterator.
Return type:
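The effect of rand_crop and rand_mirror can be pictured on a tiny image stored as nested lists. This is a plain-Python sketch and not the augmenter used by ImageRecordIter: the function name `augment`, the centre crop used when rand_crop is off, and the 50% mirror probability are assumptions made for illustration.

```python
import random


def augment(image, crop_h, crop_w, rand_crop=False, rand_mirror=False, rng=random):
    """Crop a crop_h x crop_w patch from `image` (a list of rows) and maybe mirror it."""
    height, width = len(image), len(image[0])
    if rand_crop:
        top = rng.randint(0, height - crop_h)
        left = rng.randint(0, width - crop_w)
    else:
        # Assumption: fall back to a centre crop when random cropping is off.
        top = (height - crop_h) // 2
        left = (width - crop_w) // 2
    patch = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rand_mirror and rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


# A 4x4 single-channel "image" whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
centre = augment(image, 2, 2)
randomised = augment(image, 2, 2, rand_crop=True, rand_mirror=True,
                     rng=random.Random(0))
```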
- mxnet.io.MNISTIter(*args, **kwargs)¶ Create iterator for the MNIST hand-written digit recognition dataset.
Parameters: - image (string, optional, default='./train-images-idx3-ubyte') – Dataset Param: Mnist image path.
- label (string, optional, default='./train-labels-idx1-ubyte') – Dataset Param: Mnist label path.
- batch_size (int, optional, default='128') – Batch Param: Batch Size.
- shuffle (boolean, optional, default=True) – Augmentation Param: Whether to shuffle data.
- flat (boolean, optional, default=False) – Augmentation Param: Whether to flat the data into 1D.
- seed (int, optional, default='0') – Augmentation Param: Random Seed.
- silent (boolean, optional, default=False) – Auxiliary Param: Whether to print out data info.
- num_parts (int, optional, default='1') – Partition the data into multiple parts.
- part_index (int, optional, default='0') – The index of the part that will be read.
- prefetch_buffer (long (non-negative), optional, default=4) – Backend Param: Number of prefetched parameters
- name (string, required.) – Name of the resulting data iterator.
Returns: iterator – The result iterator.
Return type: