sklearn.datasets.fetch_openml

sklearn.datasets.fetch_openml(name=None, version='active', data_id=None, data_home=None, target_column='default-target', cache=True, return_X_y=False)

Fetch dataset from openml by name or dataset id.

Datasets are uniquely identified by either an integer ID or a combination of name and version (for example, there might be multiple versions of the ‘iris’ dataset). Give either name or data_id, not both. If a name is given, a version can also be provided.
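The identification rule above can be sketched as a small wrapper. The helper names `openml_kwargs` and `fetch` are hypothetical and not part of scikit-learn; only `fetch_openml` and its `name`, `version` and `data_id` parameters come from the signature above. The scikit-learn import is deferred so that the validation logic can be tried without network access.

```python
def openml_kwargs(name=None, version="active", data_id=None):
    """Hypothetical helper: check the identifying arguments before fetching."""
    # Exactly one of `name` and `data_id` must be supplied.
    if (name is None) == (data_id is None):
        raise ValueError("Give exactly one of `name` or `data_id`.")
    if data_id is not None:
        # A version only makes sense together with a name.
        if version != "active":
            raise ValueError("`version` can only be combined with `name`.")
        return {"data_id": data_id}
    return {"name": name, "version": version}


def fetch(**identifiers):
    """Hypothetical wrapper: validate, then delegate to fetch_openml."""
    # Imported lazily; this call needs scikit-learn and network access.
    from sklearn.datasets import fetch_openml

    return fetch_openml(**openml_kwargs(**identifiers))
```

Under these assumptions, `fetch(name="iris", version=1)` requests a dataset by name and version, while `fetch(data_id=61)` requests it by ID. The ID 61 is believed to be the OpenML ID of iris version 1 and should be checked against the OpenML site before being relied on.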

Read more in the User Guide.

Note

EXPERIMENTAL

The API is experimental in version 0.20 (particularly the return value structure), and might have small backward-incompatible changes in future releases.

Parameters:
name : str or None

String identifier of the dataset. Note that OpenML can have multiple datasets with the same name.

version : integer or ‘active’, default=‘active’

Version of the dataset. Can only be provided if name is also given. If ‘active’, the oldest version that is still active is used. Since a dataset may have more than one active version, and those versions may differ fundamentally from one another, setting an exact version is highly recommended.

data_id : int or None

OpenML ID of the dataset. This is the most specific way of retrieving a dataset. If data_id is not given, name (and optionally version) is used to obtain a dataset.

data_home : string or None, default None

Specify another download and cache folder for the data sets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

target_column : string, list or None, default ‘default-target’

Specify the column name in the data to use as target. If ‘default-target’, the standard target column as stored on the server is used. If None, all columns are returned as data and the target is None. If a list (of strings), all columns with these names are returned as a multi-target. (Note: not all scikit-learn classifiers can handle all types of multi-output combinations.)

cache : boolean, default=True

Whether to cache downloaded datasets using joblib.

return_X_y : boolean, default=False

If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target objects.
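The target_column rules described above can be illustrated on a small in-memory table. This is a sketch only: `split_target` is a hypothetical function, not scikit-learn's implementation, and it does not model ‘default-target’, which is resolved from metadata stored on the OpenML server.

```python
def split_target(rows, column_names, target_column):
    """Illustrative only: split a row-oriented table into (data, target)."""
    # Normalise the three documented forms (None, str, list) to a list.
    if target_column is None:
        targets = []
    elif isinstance(target_column, str):
        targets = [target_column]
    else:
        targets = list(target_column)

    unknown = [t for t in targets if t not in column_names]
    if unknown:
        raise KeyError(f"Unknown target column(s): {unknown}")

    target_idx = [column_names.index(t) for t in targets]
    data_idx = [i for i in range(len(column_names)) if i not in target_idx]

    data = [[row[i] for i in data_idx] for row in rows]
    if target_column is None:
        target = None  # every column is returned as data
    elif isinstance(target_column, str):
        target = [row[target_idx[0]] for row in rows]  # one value per row
    else:
        target = [[row[i] for i in target_idx] for row in rows]  # multi-target
    return data, target
```

With return_X_y=True, fetch_openml returns a (data, target) pair of this general shape directly instead of a Bunch, although the real objects are arrays rather than nested lists.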

Returns:
data : Bunch

Dictionary-like object, with attributes:

data : np.array or scipy.sparse.csr_matrix of floats

The feature matrix. Categorical features are encoded as ordinals.

target : np.array

The regression target or classification labels, if applicable. Dtype is float if numeric, and object if categorical.

DESCR : str

The full description of the dataset

feature_names : list

The names of the dataset columns

categories : dict

Maps each categorical feature name to a list of values, such that the value encoded as i is ith in the list.

details : dict

More metadata from OpenML

(data, target) : tuple if return_X_y is True
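Because categorical features are stored as ordinal codes, the `categories` mapping is needed to recover the original labels. The sketch below assumes hand-written stand-ins for the `data`, `feature_names` and `categories` attributes; the values are invented for illustration and are not real OpenML data. The function name `decode_categories` is likewise hypothetical.

```python
import math


def decode_categories(data, feature_names, categories):
    """Replace ordinal codes with category labels, per the documented mapping."""
    decoded = []
    for row in data:
        new_row = []
        for name, value in zip(feature_names, row):
            if name in categories and not math.isnan(value):
                # The value encoded as i is the i-th entry of categories[name].
                new_row.append(categories[name][int(value)])
            else:
                # Numeric feature, or a missing value (NaN): keep unchanged.
                new_row.append(value)
        decoded.append(new_row)
    return decoded


# Invented stand-ins for the Bunch attributes.
feature_names = ["length", "colour"]
categories = {"colour": ["red", "green", "blue"]}
data = [[5.1, 0.0], [4.9, 2.0], [6.0, float("nan")]]

decoded = decode_categories(data, feature_names, categories)
```

The same loop works on the rows of a dense NumPy array; a sparse matrix would first need to be converted or handled column by column.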

Note

EXPERIMENTAL

This interface is experimental as of version 0.20, and subsequent releases may change attributes without notice (although there should only be minor changes to data and target).

Missing values in ‘data’ are represented as NaNs. Missing values in ‘target’ are represented as NaNs (numerical target) or None (categorical target).
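Since a missing target value can be either NaN or None depending on the target type, a check that covers both cases can be useful. The following is a minimal sketch; `is_missing` and `count_missing` are hypothetical helpers, not part of scikit-learn.

```python
import math


def is_missing(value):
    """True for None (categorical target) or NaN (numerical target)."""
    if value is None:
        return True
    # numpy.float64 subclasses float, so it is covered by this check too.
    return isinstance(value, float) and math.isnan(value)


def count_missing(target):
    """Count the missing entries in a one-dimensional target."""
    return sum(1 for value in target if is_missing(value))
```

For example, `count_missing(bunch.target)` would report how many rows lack a label, where `bunch` stands for the object returned by fetch_openml.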