Split-Apply-Combine – Grouping¶
Grouping operations break a table into pieces and perform some reduction on
each piece. Consider the iris
dataset:
>>> from blaze import data, by
>>> from blaze.utils import example
>>> d = data('sqlite:///%s::iris' % example('iris.db'))
>>> d
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
We find the average petal length, grouped by species:
>>> by(d.species, avg=d.petal_length.mean())
species avg
0 Iris-setosa 1.462
1 Iris-versicolor 4.260
2 Iris-virginica 5.552
Split-apply-combine operations are a concise but powerful way to describe many useful transformations. They are well supported in all backends and are generally efficient.
Arguments¶
The by
function takes one positional argument, the expression on which we
group the table, in this case d.species
, and any number of keyword
arguments which define reductions to perform on each group. These must be
named and they must be reductions.
>>> by(grouper, name=reduction, name=reduction, ...)
>>> by(d.species, minimum=d.petal_length.min(),
... maximum=d.petal_length.max(),
... ratio=d.petal_length.max() - d.petal_length.min())
species maximum minimum ratio
0 Iris-setosa 1.9 1.0 0.9
1 Iris-versicolor 5.1 3.0 2.1
2 Iris-virginica 6.9 4.5 2.4
Limitations¶
This interface is restrictive in two ways when compared to in-memory dataframes
like pandas
or dplyr
.
- You must specify both the grouper and the reduction at the same time
- The “apply” step must be a reduction
These restrictions make it much easier to translate your intent to databases and to efficiently distribute and parallelize your computation.
Things that you can’t do¶
So, as an example, you can’t “just group” a table separately from a reduction
>>> groups = by(mytable.mycolumn) # Can't do this
You also can’t do non-reducing apply operations (although this could change for some backends with work)
>>> groups = by(d.A, result=d.B / d.B.max()) # Can't do this