sklearn.metrics.pairwise_distances_chunked

sklearn.metrics.pairwise_distances_chunked(X, Y=None, reduce_func=None, metric='euclidean', n_jobs=None, working_memory=None, **kwds)[source]

Generate a distance matrix chunk by chunk with optional reduction

In cases where not all of a pairwise distance matrix needs to be stored at once, this is used to calculate pairwise distances in working_memory-sized chunks. If reduce_func is given, it is run on each chunk and its return values are concatenated into lists, arrays or sparse matrices.

Parameters:
X : array [n_samples_a, n_samples_a] if metric == “precomputed”, or,

[n_samples_a, n_features] otherwise Array of pairwise distances between samples, or a feature array.

Y : array [n_samples_b, n_features], optional

An optional second feature array. Only allowed if metric != “precomputed”.

reduce_func : callable, optional

The function which is applied on each chunk of the distance matrix, reducing it to needed values. reduce_func(D_chunk, start) is called repeatedly, where D_chunk is a contiguous vertical slice of the pairwise distance matrix, starting at row start. It should return an array, a list, or a sparse matrix of length D_chunk.shape[0], or a tuple of such objects.

If None, pairwise_distances_chunked returns a generator of vertical chunks of the distance matrix.

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobs : int or None, optional (default=None)

The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

working_memory : int, optional

The sought maximum memory for temporary distance matrix chunks. When None (default), the value of sklearn.get_config()['working_memory'] is used.

`**kwds` : optional keyword parameters

Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Yields:
D_chunk : array or sparse matrix

A contiguous slice of distance matrix, optionally processed by reduce_func.

Examples

Without reduce_func:

>>> X = np.random.RandomState(0).rand(5, 3)
>>> D_chunk = next(pairwise_distances_chunked(X))
>>> D_chunk  # doctest: +ELLIPSIS
array([[0.  ..., 0.29..., 0.41..., 0.19..., 0.57...],
       [0.29..., 0.  ..., 0.57..., 0.41..., 0.76...],
       [0.41..., 0.57..., 0.  ..., 0.44..., 0.90...],
       [0.19..., 0.41..., 0.44..., 0.  ..., 0.51...],
       [0.57..., 0.76..., 0.90..., 0.51..., 0.  ...]])

Retrieve all neighbors and average distance within radius r:

>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist  # doctest: +ELLIPSIS
array([0.039..., 0.        , 0.        , 0.039..., 0.        ])

Where r is defined per sample, we need to make use of start:

>>> r = [.2, .4, .4, .3, .1]
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r[i])
...              for i, d in enumerate(D_chunk, start)]
...     return neigh
>>> neigh = next(pairwise_distances_chunked(X, reduce_func=reduce_func))
>>> neigh
[array([0, 3]), array([0, 1]), array([2]), array([0, 3]), array([4])]

Force row-by-row generation by reducing working_memory:

>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func,
...                                  working_memory=0)
>>> next(gen)
[array([0, 3])]
>>> next(gen)
[array([0, 1])]