Friday, December 21, 2012

JavaScript Dagre & D3 to visualize Big Data data flows

Big Data projects often involve a lot of stovepipe processing. Visualizing the data flows is a powerful way to convey data provenance to the end user, and can even let the user control the processes involved.

There are a number of tools available to visualize data flows, but most suffer from some limitation. Some allow labeling of nodes only, not edges. Some have no provision for placing the node label inside the node. Most use general-purpose graph layout algorithms such as "gravity", whereas dataflow diagrams have distinct start nodes and end nodes and are better laid out mostly orthogonally, either top-down or left-right. (The well-known dataflow language G from LabVIEW is left-right, but top-down is better suited to web browsers.) Graphviz can generate a nice top-down layout, but for the web it can currently produce only static images (albeit generated dynamically), with no mouseovers to facilitate drill-downs into processes or data stores.

Dagre, a JavaScript library built on top of the D3.js visualization toolkit, is very well suited to visualizing Big Data data flows. Below is an example (contrived, but it illustrates the idea):


Dagre generated the above from the succinct input below:

digraph {
    A [label="Apache"];
    B [label="fetchlog.sh"];
    C [label="parseaccesslogs.pig"];
    D [label="parseerrorlogs.pig"];
    E [label="PageViewsPer503"];
    A -> B [label="access_log"];
    A -> B [label="error_log"];
    B -> C [label="/logs/access_log_$DATE.txt"];
    B -> D [label="/logs/error_log_$DATE.txt"];
    C -> E [label="MySQL:VISITORS"];
    D -> E [label="MySQL:ERRORS"];
}

You can paste it into Dagre's live demo yourself. The demo highlights elements on mouseover, which shows that it should be possible to hook into those events to implement drill-downs.
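
In a real pipeline, the digraph text would probably be generated from job metadata rather than typed by hand. Here is a minimal sketch of that idea in plain JavaScript. The toDot function and the shape of its nodes and edges arguments are my own hypothetical names for illustration, not part of Dagre or D3; the sketch only builds the text in the format shown above and does no rendering.

```javascript
// Hypothetical helper (not part of Dagre or D3): builds digraph text
// in the same format as the hand-written example.
function toDot(nodes, edges) {
  // Escape double quotes so each label stays a valid quoted string.
  const quote = (s) => '"' + String(s).replace(/"/g, '\\"') + '"';
  const lines = ["digraph {"];
  // One line per node: id [label="..."];
  for (const [id, label] of Object.entries(nodes)) {
    lines.push(`    ${id} [label=${quote(label)}];`);
  }
  // One line per edge: from -> to [label="..."];
  for (const { from, to, label } of edges) {
    lines.push(`    ${from} -> ${to} [label=${quote(label)}];`);
  }
  lines.push("}");
  return lines.join("\n");
}

// The first two nodes and edges of the example, expressed as data.
const nodes = { A: "Apache", B: "fetchlog.sh" };
const edges = [
  { from: "A", to: "B", label: "access_log" },
  { from: "A", to: "B", label: "error_log" },
];
console.log(toDot(nodes, edges));
```

The resulting string can then be pasted into the demo, or passed to whatever DOT-parsing entry point your version of Dagre provides.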
