Split-Apply-Combine -- Grouping
===============================

Grouping operations break a table into pieces and perform some reduction on
each piece. Consider the ``iris`` dataset:

.. code-block:: python

   >>> from blaze import data, by
   >>> from blaze.utils import example
   >>> d = data('sqlite:///%s::iris' % example('iris.db'))
   >>> d  # doctest: +SKIP
      sepal_length  sepal_width  petal_length  petal_width      species
   0           5.1          3.5           1.4          0.2  Iris-setosa
   1           4.9          3.0           1.4          0.2  Iris-setosa
   2           4.7          3.2           1.3          0.2  Iris-setosa
   3           4.6          3.1           1.5          0.2  Iris-setosa
   4           5.0          3.6           1.4          0.2  Iris-setosa

We find the average petal length, grouped by species:

.. code-block:: python

   >>> by(d.species, avg=d.petal_length.mean())
              species    avg
   0      Iris-setosa  1.462
   1  Iris-versicolor  4.260
   2   Iris-virginica  5.552

Split-apply-combine operations are a concise but powerful way to describe many
useful transformations. They are well supported in all backends and are
generally efficient.

Arguments
---------

The ``by`` function takes one positional argument, the expression on which we
group the table (in this case ``d.species``), and any number of keyword
arguments that define the reductions to perform on each group. Each keyword
argument must be named, and its value must be a reduction.

.. code-block:: python

   >>> by(grouper, name=reduction, name=reduction, ...)  # doctest: +SKIP

.. code-block:: python

   >>> by(d.species, minimum=d.petal_length.min(),
   ...               maximum=d.petal_length.max(),
   ...               ratio=d.petal_length.max() - d.petal_length.min())
              species  maximum  minimum  ratio
   0      Iris-setosa      1.9      1.0    0.9
   1  Iris-versicolor      5.1      3.0    2.1
   2   Iris-virginica      6.9      4.5    2.4

Limitations
-----------

This interface is restrictive in two ways when compared to in-memory dataframe
tools like ``pandas`` or ``dplyr``.

1. You must specify both the grouper and the reduction at the same time.
2. The "apply" step must be a reduction.

These restrictions make it *much* easier to translate your intent to databases
and to efficiently distribute and parallelize your computation.

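To illustrate why a grouper paired with named reductions maps cleanly onto a
database, the sketch below runs the equivalent ``GROUP BY`` query directly with
Python's standard ``sqlite3`` module. It is not Blaze's actual translation
layer; the in-memory table, its five sample rows, and the helper name
``mean_petal_length_by_species`` are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows (species, petal_length); not the full iris dataset.
ROWS = [
    ("Iris-setosa", 1.4),
    ("Iris-setosa", 1.6),
    ("Iris-versicolor", 4.0),
    ("Iris-versicolor", 4.5),
    ("Iris-virginica", 6.0),
]


def mean_petal_length_by_species(rows):
    """Group rows by species and average petal_length inside SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE iris (species TEXT, petal_length REAL)")
        conn.executemany("INSERT INTO iris VALUES (?, ?)", rows)
        # The grouper and the reduction appear together in one statement,
        # much like by(d.species, avg=d.petal_length.mean()).
        cursor = conn.execute(
            "SELECT species, AVG(petal_length) AS avg "
            "FROM iris GROUP BY species ORDER BY species"
        )
        return cursor.fetchall()
    finally:
        conn.close()


result = mean_petal_length_by_species(ROWS)
for species, avg in result:
    print(species, round(avg, 3))
```

Because the grouper and the reduction are both known up front, the whole
request fits in a single SQL statement and the database returns one row per
group.
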
Things that you can't do
------------------------

So, as an example, you can't "just group" a table separately from a reduction:

.. code-block:: python

   >>> groups = by(mytable.mycolumn)  # Can't do this  # doctest: +SKIP

You also can't do non-reducing apply operations (although this could change for
some backends with work):

.. code-block:: python

   >>> groups = by(d.A, result=d.B / d.B.max())  # Can't do this  # doctest: +SKIP
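
To make the second restriction concrete, here is a minimal plain-Python sketch
(not Blaze code) of what a per-group, non-reducing apply such as
``d.B / d.B.max()`` would have to compute: a reduction per group, followed by a
second pass that broadcasts each group's result back onto its rows. The
function name ``scale_by_group_max`` and the sample rows are hypothetical.

```python
from collections import defaultdict


def scale_by_group_max(rows):
    """Divide each value by the maximum of its group.

    rows is a list of (key, value) pairs. One output row is produced per
    input row, so this is a non-reducing apply. Assumes every group's
    maximum is nonzero.
    """
    # Pass 1: the reduction, one maximum per group.
    group_max = defaultdict(lambda: float("-inf"))
    for key, value in rows:
        group_max[key] = max(group_max[key], value)

    # Pass 2: broadcast each group's maximum back onto its rows.
    return [(key, value / group_max[key]) for key, value in rows]


rows = [("a", 1.0), ("a", 4.0), ("b", 2.0), ("b", 8.0)]
scaled = scale_by_group_max(rows)
print(scaled)
```

The output has as many rows as the input, which is why it does not fit the
grouper-plus-reduction form that ``by`` accepts.
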