Expressions and Computation

This is a developer-level document. It conveys some of the design decisions around the use of expressions and their lowering to computational backends. It is intended for new developers. It is not necessary to understand this document in order to use Blaze.

Expressions

Blaze represents expressions as Python objects. Classes include

  • Symbol: leaf expression, t
  • Projection: subset of columns, t[['name', 'amount']]
  • Selection: subset of rows t[t.amount < 0]
  • Field: single column of data or field of record dataset t.name
  • Broadcast: a scalar expression broadcast to a collection, t.amount + 1
  • Join: join two expressions on shared fields, join(t, s, 'id')
  • Reduction: perform a sum or min or max on a collection, t.amount.sum()
  • By: split-apply-combine operation, by(t.name, total=t.amount.sum())
  • Also: Sort, Distinct, Head, Label, Map, Merge, ...
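
Users rarely instantiate these classes directly; the interactive syntax shown beside each class builds the corresponding node under the hood. A minimal sketch:

from blaze.expr import symbol

t = symbol('t', 'var * {id: int, name: string, amount: int}')

selection = t[t.amount < 0]         # constructs a Selection node
projection = t[['name', 'amount']]  # constructs a Projection node
field = t.name                      # constructs a Field node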

In each case an operation (like Selection) is a Python class. Each expression defines a fixed set of fields in the _arguments attribute:

class Selection(Expr):
    _arguments = '_child', 'predicate'

class Field(ElemWise):
    _arguments = '_child', 'fieldname'

To create a node in the tree explicitly, we instantiate one of these classes directly:

>>> from blaze.expr import *
>>> t = symbol('t', 'var * {id: int, name: string, amount: int}')
>>> amounts = Field(t, 'amount')

This object contains its information in a ._args attribute:

>>> amounts._args == (t, 'amount')
True

And the set of input expressions in a ._inputs attribute:

>>> amounts._inputs == (t,)
True

By traversing ._args one can walk the tree of all identifying information, including annotating strings and values like 'amount'. By traversing ._inputs one can inspect the much sparser tree of just the major expressions, skipping parameters like the particular field name to be selected.

Most terms have only a single child input, so the ._inputs tree is often just a single chain of nodes. Notable exceptions include operations like Join and BinOp, which contain two inputs.
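
A small sketch of this branching, reusing the symbol t from above alongside a second symbol s:

s = symbol('s', 'var * {id: int, city: string}')

j = join(t, s, 'id')        # Join holds two major expression inputs
assert j._inputs == (t, s)  # so the ._inputs tree branches here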

Expression Invariants

Blaze expressions adhere to the following properties:

  1. They and all of their stored fields are immutable
  2. Their string representations evaluate to themselves. E.g. eval(str(expr)) == expr
  3. They have simple __init__ constructors that only copy fields onto the object. Intelligent argument handling lives in paired functions. E.g. the Join class has an analogous join function that users should call; likewise the internal By class pairs with the user-level by function.
  4. They can compute their datashape .dshape given the datashape of their children and their arguments.
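
A quick illustration of invariants 2 and 4, reusing the symbol t defined above:

expr = t.amount + 1

# Invariant 2: the string representation round-trips through eval.
assert eval(str(expr)) == expr

# Invariant 4: the datashape is derived from the child's datashape.
print(expr.dshape)   # e.g. a variable-length collection of integers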

Organization

All expr code lives in blaze/expr/. This directory should be self-contained and not depend on other parts of Blaze like compute or api.

  • blaze/expr/core.py contains code related to abstract tree traversal
  • blaze/expr/expr.py contains code related to datashape imbued expressions
  • blaze/expr/collections.py contains operations related to expressions with datashapes that contain a dimension. Operations like Selection and Join live here
  • blaze/expr/datetime.py, blaze/expr/string.py, … all contain specialized operations for particular domains.

Computation

Once we have a Blaze expression like the following:

>>> deadbeats = t[t.amount < 0].name

and some data like the following:

>>> data = [[1, 'Alice', 100],
...         [2, 'Bob', -200],
...         [3, 'Charlie', 300]]

and a mapping of Symbols to data like the following:

>>> namespace = {t: data}

then we need to evaluate the intent of the expression on the data. We do this in a step-by-step system outlined by various compute functions. The user experience is as follows:

>>> from blaze import compute
>>> list(compute(deadbeats, namespace))
['Bob']

But internally compute traverses our expression from the leaves (like t) on up, transforming data as it goes. At each step it looks at a node in the Blaze expression graph like the following:

>>> selection_t = t[t.amount < 0]

and transforms the data appropriately, like the following:

>>> predicate = lambda row: row[2] < 0    # amount is the third field
>>> data = list(filter(predicate, data))

This step-by-step approach is easy to define through dispatched compute_up functions. We create a small recipe for how to compute each expression type (e.g. Projection, Selection, By) against each data type (e.g. list, DataFrame, sqlalchemy.Table, ...). Here is the recipe mapping a Selection to a DataFrame:

>>> from blaze.dispatch import dispatch
>>> from pandas import DataFrame
>>> @dispatch(Selection, DataFrame)
... def compute_up(t, df, **kwargs):
...     predicate = compute(t.predicate, df)
...     return df[predicate]
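
Once such a recipe is registered, compute dispatches to it automatically whenever a Selection meets a DataFrame. A minimal usage sketch:

import pandas as pd

df = pd.DataFrame([[1, 'Alice',   100],
                   [2, 'Bob',    -200],
                   [3, 'Charlie', 300]],
                  columns=['id', 'name', 'amount'])

compute(deadbeats, df)   # a pandas Series containing 'Bob'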

This approach is modular and allows interpretation systems to be built up as a collection of small pieces. One can begin the construction of a new backend by showing Blaze how to perform each individual operation on a new data type. For example here is a start of a backend for PyTables:

>>> import tables as tb              # PyTables
>>> @dispatch(Selection, tb.Table)
... def compute_up(expr, data, **kwargs):
...     s = eval_str(expr.predicate)  # Produce string like 'amount < 0'
...     return data.read_where(s)     # Use PyTables read_where method

>>> @dispatch(Head, tb.Table)
... def compute_up(expr, data, **kwargs):
...     return data[:expr.n]          # PyTables supports standard indexing

>>> @dispatch(Field, tb.Table)
... def compute_up(expr, data, **kwargs):
...     return data.col(expr._name)   # Use the PyTables .col method

These small functions are isolated enough from Blaze to be easy for new developers to write, even without deep knowledge of Blaze internals.

Compute Traversal

The compute_up functions expect to be given:

  1. The expression containing information about the computation to be performed
  2. The data elements corresponding to the ._inputs of that expression

The compute function orchestrates compute_up functions and performs the actual traversal, accruing intermediate results from the use of compute_up. By default compute performs a bottom-up traversal. First it evaluates the leaves of the computation by swapping out keys for values in the input dictionary, {t: data}. It then calls compute_up functions on these leaves to find intermediate nodes in the tree. It repeats this process, walking up the tree, at each stage translating a Blaze expression into the matching data element given the data elements of the expression's children. It continues until it reaches the root node, at which point it returns the result to the user.
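
In pseudocode the traversal looks roughly like the following. This is a simplified sketch; the real implementation in blaze.compute.core also handles scope merging, optimization passes, and fallback strategies:

def bottom_up(expr, scope):
    # Leaves: a Symbol is swapped out for its concrete data.
    if expr in scope:
        return scope[expr]
    # Interior nodes: first compute every child expression...
    children = [bottom_up(child, scope) for child in expr._inputs]
    # ...then dispatch on (type(expr), type(data)) to combine the results.
    return compute_up(expr, *children)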

Sometimes we want to perform pre-processing or post-processing on the expression or the result. For example when calling compute on a blaze.data.SQL object we actually want to pre-process this input to extract out the sqlalchemy.Table object and call compute_up on that. When we’re finished and have successfully translated our Blaze expression to a SQLAlchemy expression we want to post-process this result by actually running the query in our SQL database and returning the concrete results.
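
These hooks exist as the dispatched pre_compute and post_compute functions in blaze.compute.core. A hedged sketch of the SQL case follows; SQL stands in for the blaze.data.SQL wrapper, engine for an available SQLAlchemy engine, and exact signatures vary between Blaze versions:

import sqlalchemy as sa

@dispatch(Expr, SQL)
def pre_compute(expr, data, **kwargs):
    # Hand the traversal the raw sqlalchemy.Table inside the wrapper.
    return data.table

@dispatch(Expr, sa.sql.Select)
def post_compute(expr, query, scope=None):
    # The traversal produced a SQLAlchemy query; run it for concrete rows.
    return list(engine.execute(query))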

Adding Expressions

Expressions can be added by creating a new subclass of blaze.expr.expressions.Expr. When adding a class, define all of the arguments the type will accept in the _arguments attribute. Blaze expressions are memoized based on these arguments. Memoization happens at object construction time, so any custom initialization logic should happen in __new__ instead of __init__. Construction should always forward to the superclass's __new__. By default, __new__ derives its signature from the _arguments attribute.
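
For example, a hypothetical expression with custom validation might look like this sketch (Clip is illustrative, and the exact __new__ forwarding convention may differ between Blaze versions):

from blaze.expr.expressions import Expr

class Clip(Expr):
    """Hypothetical: limit values of a column to the range [lo, hi]."""
    _arguments = '_child', 'lo', 'hi'

    def __new__(cls, child, lo, hi):
        # Validation goes in __new__, not __init__, because instances
        # are memoized at construction time.
        if lo > hi:
            raise ValueError('lo must not exceed hi')
        return super(Clip, cls).__new__(cls, child, lo, hi)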

To define the shape of our new expression, we implement the _dshape method. This method should use the shapes of the arguments passed to the constructor, plus knowledge of this type of transformation, to return the datashape of the expression. For example, in the case of sum, we would probably want a method like:

def _dshape(self):
    # Drop the dimension of the child, reducing to a scalar type.
    return self._child.schema.measure

Here we see the .schema attribute being used. This attribute dispatches to another optional method: _schema, which should return the datashape with the shape stripped off, i.e. just the data type. If _schema is not defined, it is derived from the _dshape method. Defining _schema separately is often convenient for subclasses where the rules for the schema change but the rules for the dimensions stay the same, as in blaze.expr.reductions.Reduction.
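
For instance, a hypothetical expression whose elements are always floats could define only _schema and inherit the dimension handling:

from datashape import dshape

def _schema(self):
    # Only the measure changes; dimension handling is shared.
    return dshape('float64')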

The constructor is not the public construction point for a Blaze expression. After the class is defined, a pairing function should be added to construct and type check the new node. For example, if our node is Concat, then the function should be called concat. We will want to decorate this function with odo.utils.copydoc() to pull the docstring from the class. This function's main job is type checking the operands. Any constructed node should be in a valid state; if the types do not check out, simply raise a TypeError with a helpful message for the user.
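
A sketch of such a pairing function, continuing the Concat example (the particular type check shown is illustrative):

from odo.utils import copydoc

@copydoc(Concat)
def concat(lhs, rhs):
    # Type check the operands; a constructed node must be in a valid state.
    if lhs.schema != rhs.schema:
        raise TypeError('cannot concat expressions with differing schemas:'
                        ' %s != %s' % (lhs.schema, rhs.schema))
    return Concat(lhs, rhs)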

Now that the new expression class is defined and the types work out, it must be dispatched to in the compute backends. For each backend that can implement this new feature, a corresponding compute_up dispatch should be defined. For example, assuming we just defined sum, we would need to implement something like:

from blaze.dispatch import dispatch
from blaze.expr.reductions import sum
import numpy as np
import pandas as pd

@dispatch(sum, np.ndarray)
def compute_up(expr, arr, **kwargs):
    ...

@dispatch(sum, pd.Series)
def compute_up(expr, arr, **kwargs):
    ...

@dispatch(sum, (list, tuple))
def compute_up(expr, arr, **kwargs):
    ...

...

Each of these function definitions should appear in the blaze.compute.* module for the given backend. For example, the ndarray definition should go in blaze.compute.numpy.

After implementing the various compute_up functions, tests should be written for this behavior. Tests for the expression itself, including its construction and dshape, should be added to blaze/expr/tests. Tests are also needed for each of the particular backend implementations to assert that the results of the computation are correct across our various backends.
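
For example, an expression-level test might look like the following sketch; per the _dshape logic shown earlier, sum drops the dimension and keeps the measure:

from blaze import symbol
from datashape import dshape

def test_sum_dshape():
    t = symbol('t', 'var * {amount: int32}')
    expr = t.amount.sum()
    assert expr.dshape == dshape('int32')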