Metagraph is a declarative visualization library built on top of a Numpy-like library capable of deferred array computation.

Numpy's core concept is the dtype, a unifying way to think about structured array data. This structured description of data allows high-level Python functions to be translated into vectorized, high-performance low-level loops. While this lets programmers express a large number of structured array transformations at a high level, some operations remain difficult to express, such as LINQ-style queries and conditionals. Furthermore, even some of the expressible transformations frequently require building boolean mask arrays (serializations of per-element conditional evaluation) that are used for later compacting or selection.
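As a small illustration of the boolean-mask pattern described above (a minimal Numpy sketch, not Metagraph code):

```python
import numpy as np

# A per-element conditional is serialized into a boolean mask array,
# which is then used to compact or select elements.
a = np.array([3, -1, 4, -1, 5, -9])
mask = a > 0            # full-size boolean intermediate
positives = a[mask]     # compaction via the mask
```

Note that `mask` is itself an array the same length as `a`: the conditional's result must be fully materialized before selection can happen.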

There have been many efforts to take Numpy one step further and generalize the concept of Universal Functions, or to provide easier ways for end users to write pseudo-Python code that is turned into ufuncs on the fly with little effort. Some of these approaches include numexpr, weave.inline and weave.blitz, CoreFunc (part of CorePy), and Cython.

My view is that, from an architectural standpoint, most of these tools are either subject to the tyranny of Numpy's broadcasting and iteration core, or they circumvent it altogether and implement direct C or C++ functions that access the raw data buffer. Both approaches have fundamental drawbacks. The former are generally bound to return an array or arrays whose size scales with their inputs, thereby creating storage requirements for unnecessary intermediates; the latter generally fail to implement the full Numpy slicing and broadcasting semantics, thereby requiring input data to already be copied into the appropriate shapes.

A fundamentally different view of array transformations is to look at all of the core array operations - slicing, striding, concatenation, transposition, etc. - along with the core set of simple ufuncs and consider them from a functional programming perspective. From this perspective, all of Numpy currently operates in what can be termed "immediate" mode. That is, every function performs a transformation on concrete data, and returns a concrete return array or scalar.
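To make "immediate" mode concrete, here is a plain Numpy sketch of how each step in an expression materializes a full-size result:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

# Immediate mode: every operation runs on concrete data and returns
# a concrete array, so the compound expression allocates a full-size
# intermediate for (a + b) before the multiply ever runs.
tmp = a + b          # concrete intermediate, same size as the inputs
result = tmp * 2.0   # another full-size allocation
```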

However, an alternative formulation is to consider every array transformation (including slicing) as a generator that consumes an input iterator and returns an output iterator. In the same way that the well-known and powerful itertools library allows the programmer to "chain together" transformations into a pipeline, Numpy's ufuncs can be seen as generators that operate on iterators. This generalizes to Numpy's "generalized ufuncs," which operate on arbitrary slices of input and output instead of single scalars at a time. If the input shape and output shape of every generator ufunc is specified, then when they are chained together in a Numpy expression, it is possible to statically reason about how the data will be transformed. More specifically, one can know how to slice a dataset for parallelization, or decimate it for display. Additionally, one can take an expression graph of these Numpy generator ufuncs and compile the entire thing into a single C function that efficiently iterates, computes, and returns values of the right dtype and shape, without creating unnecessarily large intermediate storage arrays.
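The iterator-pipeline idea can be sketched in pure Python (an illustrative toy, not the actual Metagraph machinery): each "generator ufunc" consumes an input iterator and yields output values one at a time, so chaining them never materializes an intermediate array.

```python
from itertools import islice

def g_add(xs, ys):
    # Elementwise add as a generator ufunc: consumes two iterators,
    # yields one output per input pair.
    for x, y in zip(xs, ys):
        yield x + y

def g_square(xs):
    for x in xs:
        yield x * x

a = range(1_000_000)
b = range(1_000_000)
pipeline = g_square(g_add(a, b))         # no computation has happened yet
first_three = list(islice(pipeline, 3))  # only 3 elements ever get computed
```

Because evaluation is deferred until values are pulled from the pipeline, slicing the output (here via `islice`) only computes the elements actually requested.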

From an architectural standpoint, another term for a "generator ufunc" is simply a "procedural array". That is, instead of storing striding and slicing information in the header of a traditional ndarray object, a procedural array stores the slice algebra needed to transform its upstream (procedural or traditional) arrays' shapes into an output shape. It also stores a pointer to a fast transformation function, if one exists, which the runtime can use to stitch together a fast evaluation of an array expression graph.
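A hypothetical sketch of this structure (the class and attribute names here are illustrative assumptions, not an actual Metagraph API): a procedural array records how upstream shapes map to its own output shape, plus an optional fast kernel.

```python
class ConcreteArray:
    # A stand-in for a traditional ndarray: data plus a shape header.
    def __init__(self, data):
        self.data = data
        self.shape = (len(data),)

class ProceduralArray:
    def __init__(self, upstreams, shape_fn, kernel):
        self.upstreams = upstreams  # inputs, procedural or concrete
        self.shape_fn = shape_fn    # slice algebra: input shapes -> output shape
        self.kernel = kernel        # fast transformation function, if any

    @property
    def shape(self):
        # The output shape is derived statically from upstream shapes,
        # without evaluating any data.
        return self.shape_fn(*(u.shape for u in self.upstreams))

# An elementwise "add" node: output shape equals the (common) input shape.
x = ConcreteArray([1, 2, 3])
y = ConcreteArray([10, 20, 30])
node = ProceduralArray([x, y], lambda s1, s2: s1, lambda a, b: a + b)
```

The key point is that `node.shape` is known before any kernel runs, which is what makes static reasoning about slicing and parallelization possible.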

As an aside, this entire approach is analogous to the introduction of programmable shaders in computer graphics. Initially there were only high-level array operations applied across all geometry, with the rendering equation/lighting model applied at every single pixel or ray-geometry intersection. Eventually this was turned around: the lighting model no longer lived primarily in the core renderer, but was dynamically modified by snippets of shader code. With the procedural array approach, the Numpy core no longer just walks a data array and applies simple functions at each point; it also needs to evaluate a dynamic expression graph at each point.

For more information, please email me at pwang [at] streamitive.com.

I gave a talk about Metagraph at the SciPy 2011 conference: video of the talk and PDF slides.

Books

- *Semiology of Graphics: Diagrams, Networks, Maps* by Jacques Bertin
- *Exploratory Data Analysis* by John W. Tukey
- *The Elements of Graphing Data* by William Cleveland
- *Creating More Effective Graphs* by Naomi Robbins
- *The Grammar of Graphics* by Leland Wilkinson
- *Visualization for Information Retrieval* by Jin Zhang
- *The Paralation Model: Architecture-Independent Parallel Programming* by Gary Sabot

Papers

- *An APL Machine* - Philip Abrams
- *Manticore - Heterogeneous Parallel Language*
- *Merrimac - High performance and highly efficient scientific computing with streams*
- *From Lists to Streams to Nothing at All*
- *The Stream Virtual Machine* - Labonte, Mattson, Buck, et al.
- *FIFTH - A Stack Based GP Language for Vector Processing*
- *The Next 700 Programming Languages*
- *OOPAL - Integrating Array Programming in Object-Oriented Programming*
- *Characterizing the Paralation Model Using Dynamic Assignment*
- *Protovis: A Graphical Toolkit for Visualization*
- *Copperhead - Compiling an Embedded Data Parallel Language*
- *Can Programming Be Liberated from the von Neumann Style?* - John Backus