Split-Apply-Combine – Grouping

Grouping operations break a table into pieces and perform some reduction on each piece. Consider the iris dataset:

>>> from blaze import data, by
>>> from blaze.utils import example
>>> d = data('sqlite:///%s::iris' % example('iris.db'))
>>> d  
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa

We find the average petal length, grouped by species:

>>> by(d.species, avg=d.petal_length.mean())
           species    avg
0      Iris-setosa  1.462
1  Iris-versicolor  4.260
2   Iris-virginica  5.552

Split-apply-combine operations are a concise but powerful way to describe many useful transformations. They are well supported in all backends and are generally efficient.

Arguments

The by function takes one positional argument, the expression on which we group the table, in this case d.species, and any number of keyword arguments which define reductions to perform on each group. These must be named and they must be reductions.

>>> by(grouper, name=reduction, name=reduction, ...)  
>>> by(d.species, minimum=d.petal_length.min(),
...               maximum=d.petal_length.max(),
...               ratio=d.petal_length.max() - d.petal_length.min())
           species  maximum  minimum  ratio
0      Iris-setosa      1.9      1.0    0.9
1  Iris-versicolor      5.1      3.0    2.1
2   Iris-virginica      6.9      4.5    2.4

Limitations

This interface is restrictive in two ways when compared to in-memory dataframes like pandas or dplyr.

  1. You must specify both the grouper and the reduction at the same time
  2. The “apply” step must be a reduction

These restrictions make it much easier to translate your intent to databases and to efficiently distribute and parallelize your computation.

Things that you can’t do

So, as an example, you can’t “just group” a table separately from a reduction

>>> groups = by(mytable.mycolumn)  # Can't do this  

You also can’t do non-reducing apply operations (although this could change for some backends with work)

>>> groups = by(d.A, result=d.B / d.B.max())  # Can't do this