Computation Pipeline¶
This is a developer-level document. It conveys some of the design decisions around the use of expressions and their lowering to computational backends. Understanding this document is not necessary in order to use Blaze.
Problem¶
Given an expression:
>>> from blaze import symbol, sum
>>> x = symbol('x', '5 * int')
>>> y = symbol('y', '5 * int')
>>> expr = sum(x ** 2 + y)
>>> expr
sum((x ** 2) + y)
And data arranged into a namespace
>>> import numpy as np
>>> xdata = np.array([1, 2, 3, 4, 5])
>>> ydata = np.array([10, 20, 30, 40, 50])
>>> ns = {x: xdata, y: ydata}
Our goal is to produce the result implied by the expression
>>> np.sum(xdata ** 2 + ydata)
205
We do this using many small functions, each defined for a particular backend, that perform small pieces of the computation:

@dispatch(blaze.expr.sum, numpy.ndarray)
def compute_up(expr, data):
    return numpy.sum(data)
Simple Solution¶
A simple solution to this problem is to walk from the leaves of the expression tree, applying compute_up functions to data resources until we reach the top. In cases like the example above this suffices. This is called a bottom-up traversal.
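The bottom-up traversal can be sketched with a toy expression tree. The `Node` class and the per-operation branches below are hypothetical stand-ins, not Blaze's real expression types or dispatch machinery:

```python
import numpy as np

class Node:
    """Toy expression node (hypothetical, not blaze.expr)."""
    def __init__(self, op, *children):
        self.op = op            # e.g. 'add', 'pow', 'sum', or 'leaf'
        self.children = children

def compute_bottom_up(expr, namespace):
    """Walk the tree from the leaves, applying one small
    compute_up-style rule per node on the way up."""
    if expr.op == 'leaf':
        return namespace[expr]                      # data resource for this symbol
    args = [compute_bottom_up(c, namespace) for c in expr.children]
    if expr.op == 'add':
        return args[0] + args[1]
    if expr.op == 'pow':
        return args[0] ** args[1]
    if expr.op == 'sum':
        return sum(args[0])
    raise NotImplementedError(expr.op)

# sum(x ** 2 + y) with the xdata/ydata from the example above
x, y, two = Node('leaf'), Node('leaf'), Node('leaf')
ns = {x: np.array([1, 2, 3, 4, 5]),
      y: np.array([10, 20, 30, 40, 50]),
      two: 2}
expr = Node('sum', Node('add', Node('pow', x, two), y))
compute_bottom_up(expr, ns)   # -> 205
```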
Complications¶
Some backends require more sophistication. In principle we may want to do the following:

- Modify/optimize the expression tree for a given backend:
  optimize(expr, data) -> expr
- Modify the data resources before we start execution:
  pre_compute(expr, data) -> data
- Modify the data resources as they change type throughout the computation:
  pre_compute(expr, data) -> data
- Clean up the data result after we complete execution:
  post_compute(expr, data) -> data
- Process a leaf of the tree in a bottom-up fashion as described above:
  compute_up(expr, data) -> data
- Process large chunks of the tree at once, rather than always starting from the bottom:
  compute_down(expr, data) -> data
Each of these steps is critical to one backend or another. We describe each in turn and then give the complete picture of the entire pipeline.
optimize :: expr, data -> expr¶
Optimize takes an expression and some data and changes the expression based on the data type. For example, in columnar stores (like bcolz.ctable) we insert projections into the expression to reduce the memory footprint. In numpy-based array backends we insert Broadcast operations to perform loop fusion.

This function is applied throughout the tree at the top-most point at which it is applicable. It is not applied at leaves, which have little to optimize.
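The projection-insertion idea can be sketched with tuples standing in for expression nodes. Everything here — the tuple representation, `fields_used`, and the `'project'` node — is hypothetical, not Blaze's real representation:

```python
import numpy as np

def fields_used(expr):
    """Collect every field name referenced anywhere in a tuple-expression."""
    if isinstance(expr, tuple) and expr[0] == 'field':
        return {expr[1]}
    if isinstance(expr, tuple):
        return set().union(*(fields_used(arg) for arg in expr[1:]))
    return set()

def optimize(expr, data):
    """Insert an explicit projection so later pipeline stages only
    ever see the columns the expression actually touches."""
    needed = sorted(fields_used(expr))
    if isinstance(data, dict) and set(needed) < set(data):
        return ('project', needed, expr)
    return expr

# a three-column "columnar store"; the expression only uses 'a' and 'b'
table = {'a': np.arange(5), 'b': np.ones(5), 'c': np.zeros(5)}
expr = ('sum', ('add', ('field', 'a'), ('field', 'b')))
optimize(expr, table)   # -> ('project', ['a', 'b'], ('sum', ...))
```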
pre_compute :: expr, data -> data¶
Pre-compute is applied to leaf data elements prior to computation (xdata and ydata in the example above). It might be used, for example, to load data into memory.

We apply pre_compute at two stages of the pipeline:

- At the beginning of the computation
- Any time that the data significantly changes type
So for example, for the dataset:

data = {'my_foo': Foo(...)}

If we apply the computation:

X -> X.my_foo.distinct()

Then after the X -> X.my_foo computation, as the type changes from dict to Foo, we will call pre_compute again on the Foo object with the remaining expression:

data = pre_compute(X.my_foo.distinct(), Foo(...))
A real use case is the streaming Python backend, which consumes either sequences of tuples or sequences of dicts. pre_compute(expr, Sequence) detects which case we are in and normalizes to sequences of tuples. This pre-computation allows the rest of the Python backend to make useful assumptions.

Another use case is computation on CSV files. If the CSV file is small we’d like to transform it into a pandas DataFrame. If it is large we’d like to transform it into a Python iterator. This logic can be encoded as a pre_compute function and so will be triggered whenever a CSV object is first found.
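The dict-to-tuple normalization described above might look roughly like the following. This is a sketch, not the streaming backend's actual implementation, and the `fields` parameter is a simplification (Blaze derives field order from the datashape):

```python
def pre_compute(expr, seq, fields):
    """Normalize records so the rest of the backend can assume
    sequences of tuples, whichever form the input arrives in."""
    seq = list(seq)
    if seq and isinstance(seq[0], dict):
        # sequence of dicts: pull fields out in a fixed order
        return [tuple(row[f] for f in fields) for row in seq]
    return seq      # already a sequence of tuples

rows = [{'name': 'Alice', 'amount': 100},
        {'name': 'Bob', 'amount': 200}]
pre_compute(None, rows, ['name', 'amount'])
# -> [('Alice', 100), ('Bob', 200)]
```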
post_compute :: expr, data -> data¶
Post-compute finishes a computation. It is handed the data after all computation has been done.
For example, in the case of SQLAlchemy queries the post_compute function actually sends the query to the SQL engine and collects results. This occurs only after Blaze finishes translating everything.
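The shape of that final step can be sketched with the standard library's sqlite3 in place of SQLAlchemy. The point is the division of labor: earlier stages only build a query, and post_compute is what finally executes it. The helpers and the query string here are illustrative, not Blaze's real code:

```python
import sqlite3

def post_compute(expr, query, conn):
    """Send the fully translated query to the engine and collect results."""
    return conn.execute(query).fetchall()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])

# pretend the earlier pipeline stages translated sum(t.x) into this:
query = 'SELECT SUM(x) FROM t'
post_compute(None, query, conn)   # -> [(6,)]
```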
compute_up :: expr, data -> data¶
Compute up walks the expression tree bottom up and processes data step by step.
Compute up is the most prolific function in the computation pipeline and encodes most of the logic. A brief example:

@dispatch(blaze.expr.Add, np.ndarray, np.ndarray)
def compute_up(expr, lhs, rhs):
    return lhs + rhs
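The double dispatch above — selecting a rule by both the expression type and the data type — can be mimicked with a plain lookup table. This sketch uses a toy `Add` class and a hand-rolled registry instead of the multipledispatch machinery Blaze actually uses:

```python
import numpy as np

class Add:
    """Toy expression node, standing in for blaze.expr.Add."""

RULES = {}

def register(expr_type, data_type):
    """Hand-rolled stand-in for @dispatch: key rules on type pairs."""
    def deco(fn):
        RULES[(expr_type, data_type)] = fn
        return fn
    return deco

@register(Add, np.ndarray)
def compute_up(expr, lhs, rhs):
    return lhs + rhs

def dispatch_compute_up(expr, lhs, rhs):
    # look up the rule by (expression type, data type)
    return RULES[(type(expr), type(lhs))](expr, lhs, rhs)

dispatch_compute_up(Add(), np.array([1, 2]), np.array([10, 20]))
# -> array([11, 22])
```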
compute_down :: expr, data -> data¶
In some cases we want to process large chunks of the expression tree at once. Compute-down operates on the tree top-down: it is given the root node (the full expression) first, and it proceeds toward the leaves as long as no match is found.
Compute-down is less common than compute-up. It is most often used when one backend wants to ship an entire expression over to another. This is done, for example, in the SparkSQL backend in which we take the entire expression and execute it against a SQL backend, and then finally apply that computation onto the SchemaRDD.
It is also used extensively in backends that leverage chunking. These backends want to process a large part of the expression tree at once.
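The try-the-whole-tree-first strategy can be sketched as follows. The tuple expressions, the SQL translation rule, and the fallback are all hypothetical stand-ins for what the SparkSQL-style backends do:

```python
def compute_down(expr):
    """Translate an entire sum-of-field expression into one SQL
    string in a single step, or refuse with NotImplementedError."""
    if expr[0] == 'sum' and expr[1][0] == 'field':
        return 'SELECT SUM({}) FROM t'.format(expr[1][1])
    raise NotImplementedError

def compute(expr):
    try:
        return compute_down(expr)          # ship the whole tree at once
    except NotImplementedError:
        # placeholder for the node-by-node compute_up walk
        return 'fell back to bottom-up'

compute(('sum', ('field', 'x')))     # -> 'SELECT SUM(x) FROM t'
compute(('mul', ('field', 'x'), 2))  # -> 'fell back to bottom-up'
```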
Full Pipeline¶
The full pipeline looks like the following:

1. Pre-compute all leaves of data
2. Optimize the expression
3. Try calling compute_down on the entire expression tree
4. Otherwise, traverse up the tree from the leaves, calling compute_up. Repeat this until the data significantly changes type (e.g. list to int after a sum operation)
5. Reevaluate optimize on the expression and pre_compute on all of the data elements
6. Go to step 3
7. Call post_compute on the result

This is outlined in blaze/compute/core.py in the functions compute(Expr, dict) and top_then_bottom_then_top_again_etc.
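The flow above can be sketched as a driver loop. Every helper here is a trivial stand-in wired to the running sum(x ** 2 + y) example; none of it is Blaze's real machinery, and the real loop in blaze/compute/core.py is considerably more involved:

```python
import numpy as np

DONE = object()   # sentinel: nothing left of the expression

def pre_compute_leaves(expr, data):
    # pre_compute every leaf: here, just coerce to ndarray
    return {k: np.asarray(v) for k, v in data.items()}

def optimize(expr, data):
    return expr                        # no-op in this sketch

def compute_down(expr, data):
    raise NotImplementedError          # no whole-tree rule available here

def compute_up_until_type_change(expr, data):
    # one bottom-up pass; the sum collapses arrays to a scalar,
    # which is exactly a "significant type change"
    x, y = data['x'], data['y']
    return DONE, (x ** 2 + y).sum()

def post_compute(expr, result):
    return int(result)                 # finalize the result

def compute(expr, data):
    data = pre_compute_leaves(expr, data)          # pre_compute the leaves
    expr = optimize(expr, data)                    # optimize the expression
    while True:
        try:
            result = compute_down(expr, data)      # try the whole tree at once
            break
        except NotImplementedError:
            expr, data = compute_up_until_type_change(expr, data)
            if expr is DONE:
                result = data
                break
            expr = optimize(expr, data)            # re-optimize, re-pre_compute,
            data = pre_compute_leaves(expr, data)  # then loop back to compute_down
    return post_compute(expr, result)              # finish the computation

compute('sum(x ** 2 + y)', {'x': [1, 2, 3, 4, 5],
                            'y': [10, 20, 30, 40, 50]})   # -> 205
```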
History¶
This design is ad hoc. Each of the stages listed above arose from need, not from principled forethought. Undoubtedly this system could be improved. In particular, much of the complexity comes from the fact that compute_up/compute_down functions may transform our data arbitrarily. This, along with the various particular needs of the different data types, forces the flip-flopping between top-down and bottom-up traversals. Please note that while this strategy works well most of the time, pathological cases do exist.