Computation Pipeline¶
This is a developer-level document. It conveys some of the design decisions around the use of expressions and their lowering to computational backends. Understanding this document is not necessary in order to use Blaze.
Problem¶
Given an expression:
>>> from blaze import symbol, sum
>>> x = symbol('x', '5 * int')
>>> y = symbol('y', '5 * int')
>>> expr = sum(x ** 2 + y)
>>> expr
sum((x ** 2) + y)
And data arranged into a namespace
>>> import numpy as np
>>> xdata = np.array([1, 2, 3, 4, 5])
>>> ydata = np.array([10, 20, 30, 40, 50])
>>> ns = {x: xdata, y: ydata}
Our goal is to produce the result implied by the expression
>>> np.sum(xdata ** 2 + ydata)
205
We do this using many small functions, each defined for a particular backend, that perform small pieces of the computation:

@dispatch(blaze.expr.sum, numpy.ndarray)
def compute_up(expr, data):
    return numpy.sum(data)
Simple Solution¶
A simple solution to this problem is to walk from the leaves of the expression tree, applying compute_up functions to data resources until we reach the top. In cases like the example above this suffices. This is called a bottom-up traversal.
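The bottom-up traversal can be sketched with a toy expression tree. The `Node` class and the per-operation branches below are hypothetical stand-ins, not Blaze's real expression types or dispatch machinery:

```python
import numpy as np

class Node:
    """Toy expression node (hypothetical, not blaze.expr)."""
    def __init__(self, op, *children):
        self.op = op            # e.g. 'add', 'pow', 'sum', or 'leaf'
        self.children = children

def compute_bottom_up(expr, namespace):
    """Walk the tree from the leaves, applying one small
    compute_up-style rule per node on the way up."""
    if expr.op == 'leaf':
        return namespace[expr]                      # data resource for this symbol
    args = [compute_bottom_up(c, namespace) for c in expr.children]
    if expr.op == 'add':
        return args[0] + args[1]
    if expr.op == 'pow':
        return args[0] ** args[1]
    if expr.op == 'sum':
        return sum(args[0])
    raise NotImplementedError(expr.op)

# sum(x ** 2 + y) with the xdata/ydata from the example above
x, y, two = Node('leaf'), Node('leaf'), Node('leaf')
ns = {x: np.array([1, 2, 3, 4, 5]),
      y: np.array([10, 20, 30, 40, 50]),
      two: 2}
expr = Node('sum', Node('add', Node('pow', x, two), y))
compute_bottom_up(expr, ns)   # -> 205
```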
Complications¶
Some backends require more sophistication. In principle we may want to do the following:

- Modify/optimize the expression tree for a given backend:
  optimize(expr, data) -> expr
- Modify the data resources before we start execution:
  pre_compute(expr, data) -> data
- Modify the data resources as they change type throughout the computation:
  pre_compute(expr, data) -> data
- Clean up the data result after we complete execution:
  post_compute(expr, data) -> data
- Process a leaf of the tree in a bottom-up fashion as described above:
  compute_up(expr, data) -> data
- Process large chunks of the tree at once, rather than always starting from the bottom:
  compute_down(expr, data) -> data
Each of these steps is critical to one backend or another. We describe each in turn and then give the complete picture of the entire pipeline.
optimize :: expr, data -> expr¶
Optimize takes an expression and some data and changes the expression based on the data type. For example, in columnar stores (like bcolz.ctable) we insert projections into the expression to reduce the memory footprint. In numpy-based array backends we insert Broadcast operations to perform loop fusion.

This function is applied throughout the tree at the top-most point at which it is applicable. It is not applied at leaves, which have little to optimize.
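The projection-insertion idea can be sketched with tuples standing in for expression nodes. Everything here — the tuple representation, `fields_used`, and the `'project'` node — is hypothetical, not Blaze's real representation:

```python
import numpy as np

def fields_used(expr):
    """Collect every field name referenced anywhere in a tuple-expression."""
    if isinstance(expr, tuple) and expr[0] == 'field':
        return {expr[1]}
    if isinstance(expr, tuple):
        return set().union(*(fields_used(arg) for arg in expr[1:]))
    return set()

def optimize(expr, data):
    """Insert an explicit projection so later pipeline stages only
    ever see the columns the expression actually touches."""
    needed = sorted(fields_used(expr))
    if isinstance(data, dict) and set(needed) < set(data):
        return ('project', needed, expr)
    return expr

# a three-column "columnar store"; the expression only uses 'a' and 'b'
table = {'a': np.arange(5), 'b': np.ones(5), 'c': np.zeros(5)}
expr = ('sum', ('add', ('field', 'a'), ('field', 'b')))
optimize(expr, table)   # -> ('project', ['a', 'b'], ('sum', ...))
```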
pre_compute :: expr, data -> data¶
Pre-compute is applied to leaf data elements prior to computation (xdata and ydata in the example above). It might be used, for example, to load data into memory.

We apply pre_compute at two stages of the pipeline:

- At the beginning of the computation
- Any time that the data significantly changes type
So for example, for the dataset:

data = {'my_foo': Foo(...)}

If we apply the computation:

X -> X.my_foo.distinct()

Then after the X -> X.my_foo computation, as the type changes from dict to Foo, we will call pre_compute again on the Foo object with the remaining expression:

data = pre_compute(X.my_foo.distinct(), Foo(...))
A real use case is the streaming Python backend, which consumes either sequences of tuples or sequences of dicts. pre_compute(expr, Sequence) detects which case we are in and normalizes to sequences of tuples. This pre-computation allows the rest of the Python backend to make useful assumptions.

Another use case is computation on CSV files. If the CSV file is small we’d like to transform it into a pandas DataFrame. If it is large we’d like to transform it into a Python iterator. This logic can be encoded as a pre_compute function and so will be triggered whenever a CSV object is first found.
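The dict-to-tuple normalization described above might look roughly like the following. This is a sketch, not the streaming backend's actual implementation, and the `fields` parameter is a simplification (Blaze derives field order from the datashape):

```python
def pre_compute(expr, seq, fields):
    """Normalize records so the rest of the backend can assume
    sequences of tuples, whichever form the input arrives in."""
    seq = list(seq)
    if seq and isinstance(seq[0], dict):
        # sequence of dicts: pull fields out in a fixed order
        return [tuple(row[f] for f in fields) for row in seq]
    return seq      # already a sequence of tuples

rows = [{'name': 'Alice', 'amount': 100},
        {'name': 'Bob', 'amount': 200}]
pre_compute(None, rows, ['name', 'amount'])
# -> [('Alice', 100), ('Bob', 200)]
```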
post_compute :: expr, data -> data¶
Post-compute finishes a computation. It is handed the data after all computation has been done.
For example, in the case of SQLAlchemy queries the post_compute function actually sends the query to the SQL engine and collects results. This occurs only after Blaze finishes translating everything.
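The shape of that final step can be sketched with the standard library's sqlite3 in place of SQLAlchemy. The point is the division of labor: earlier stages only build a query, and post_compute is what finally executes it. The helpers and the query string here are illustrative, not Blaze's real code:

```python
import sqlite3

def post_compute(expr, query, conn):
    """Send the fully translated query to the engine and collect results."""
    return conn.execute(query).fetchall()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])

# pretend the earlier pipeline stages translated sum(t.x) into this:
query = 'SELECT SUM(x) FROM t'
post_compute(None, query, conn)   # -> [(6,)]
```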
compute_up :: expr, data -> data¶
Compute up walks the expression tree bottom up and processes data step by step.
Compute up is the most prolific function in the computation pipeline and encodes most of the logic. A brief example:

@dispatch(blaze.expr.Add, np.ndarray, np.ndarray)
def compute_up(expr, lhs, rhs):
    return lhs + rhs
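The double dispatch above — selecting a rule by both the expression type and the data type — can be mimicked with a plain lookup table. This sketch uses a toy `Add` class and a hand-rolled registry instead of the multipledispatch machinery Blaze actually uses:

```python
import numpy as np

class Add:
    """Toy expression node, standing in for blaze.expr.Add."""

RULES = {}

def register(expr_type, data_type):
    """Hand-rolled stand-in for @dispatch: key rules on type pairs."""
    def deco(fn):
        RULES[(expr_type, data_type)] = fn
        return fn
    return deco

@register(Add, np.ndarray)
def compute_up(expr, lhs, rhs):
    return lhs + rhs

def dispatch_compute_up(expr, lhs, rhs):
    # look up the rule by (expression type, data type)
    return RULES[(type(expr), type(lhs))](expr, lhs, rhs)

dispatch_compute_up(Add(), np.array([1, 2]), np.array([10, 20]))
# -> array([11, 22])
```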
compute_down :: expr, data -> data¶
In some cases we want to process large chunks of the expression tree at once. Compute-down operates on the tree top-down: it is given the root node (the full expression) first, and it proceeds toward the leaves as long as no match is found.
Compute-down is less common than compute-up. It is most often used when one backend wants to ship an entire expression over to another. This is done, for example, in the SparkSQL backend in which we take the entire expression and execute it against a SQL backend, and then finally apply that computation onto the SchemaRDD.
It is also used extensively in backends that leverage chunking. These backends want to process a large part of the expression tree at once.
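The try-the-whole-tree-first strategy can be sketched as follows. The tuple expressions, the SQL translation rule, and the fallback are all hypothetical stand-ins for what the SparkSQL-style backends do:

```python
def compute_down(expr):
    """Translate an entire sum-of-field expression into one SQL
    string in a single step, or refuse with NotImplementedError."""
    if expr[0] == 'sum' and expr[1][0] == 'field':
        return 'SELECT SUM({}) FROM t'.format(expr[1][1])
    raise NotImplementedError

def compute(expr):
    try:
        return compute_down(expr)          # ship the whole tree at once
    except NotImplementedError:
        # placeholder for the node-by-node compute_up walk
        return 'fell back to bottom-up'

compute(('sum', ('field', 'x')))     # -> 'SELECT SUM(x) FROM t'
compute(('mul', ('field', 'x'), 2))  # -> 'fell back to bottom-up'
```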
Full Pipeline¶
The full pipeline looks like the following:

1. Pre-compute all leaves of data
2. Optimize the expression
3. Try calling compute_down on the entire expression tree
4. Otherwise, traverse up the tree from the leaves, calling compute_up. Repeat this until the data significantly changes type (e.g. list to int after a sum operation)
5. Reevaluate optimize on the expression and pre_compute on all of the data elements
6. Go to step 3
7. Call post_compute on the result

This is outlined in blaze/compute/core.py in the functions compute(Expr, dict) and top_then_bottom_then_top_again_etc.
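The flow above can be sketched as a driver loop. Every helper here is a trivial stand-in wired to the running sum(x ** 2 + y) example; none of it is Blaze's real machinery, and the real loop in blaze/compute/core.py is considerably more involved:

```python
import numpy as np

DONE = object()   # sentinel: nothing left of the expression

def pre_compute_leaves(expr, data):
    # pre_compute every leaf: here, just coerce to ndarray
    return {k: np.asarray(v) for k, v in data.items()}

def optimize(expr, data):
    return expr                        # no-op in this sketch

def compute_down(expr, data):
    raise NotImplementedError          # no whole-tree rule available here

def compute_up_until_type_change(expr, data):
    # one bottom-up pass; the sum collapses arrays to a scalar,
    # which is exactly a "significant type change"
    x, y = data['x'], data['y']
    return DONE, (x ** 2 + y).sum()

def post_compute(expr, result):
    return int(result)                 # finalize the result

def compute(expr, data):
    data = pre_compute_leaves(expr, data)          # pre_compute the leaves
    expr = optimize(expr, data)                    # optimize the expression
    while True:
        try:
            result = compute_down(expr, data)      # try the whole tree at once
            break
        except NotImplementedError:
            expr, data = compute_up_until_type_change(expr, data)
            if expr is DONE:
                result = data
                break
            expr = optimize(expr, data)            # re-optimize, re-pre_compute,
            data = pre_compute_leaves(expr, data)  # then loop back to compute_down
    return post_compute(expr, result)              # finish the computation

compute('sum(x ** 2 + y)', {'x': [1, 2, 3, 4, 5],
                            'y': [10, 20, 30, 40, 50]})   # -> 205
```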
History¶
This design is ad hoc. Each of the stages listed above arose from need, not from principled forethought. Undoubtedly this system could be improved. In particular, much of the complexity comes from the fact that compute_up/compute_down functions may transform our data arbitrarily. This, along with the various particular needs of the different data types, forces the flip-flopping between top-down and bottom-up traversals. Please note that while this strategy works well most of the time, pathological cases do exist.