Big Data projects often involve a lot of stovepipe processing. Visualizing the data flows is a powerful way to convey data provenance to the end user, and the diagram can even serve as a way to control the processes involved.
There are a number of tools available to visualize data flows, but most suffer from some limitation. Some allow labeling of nodes only, not edges. Some have no provision for placing the node label inside the node. Most use general-purpose graph layout algorithms such as "gravity", whereas dataflow diagrams have distinct start nodes and end nodes and are better drawn mostly orthogonally, either top-down or left-right. (The well-known dataflow language G from LabVIEW is left-right, but top-down is better suited to web browsers.) Graphviz can generate a nice top-down layout, but for the web it can currently produce only static images (albeit generated dynamically), with no mouseovers to facilitate drill-downs into processes or data stores.
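To make the top-down idea concrete, here is a minimal sketch of longest-path layering, one simple way to assign each node of a dataflow graph to a row so that start nodes sit at the top and end nodes at the bottom. It is an illustration only, not a description of how Graphviz or any other tool computes its layout; the function name `assignLayers` and the edge-pair input format are assumptions made for this example.

```javascript
// Hypothetical helper (not part of any library): assign each node a layer
// equal to the length of the longest path reaching it from a start node.
// `edges` is an array of [from, to] pairs describing a directed acyclic graph.
function assignLayers(edges) {
  const successors = new Map();
  const indegree = new Map();
  for (const [from, to] of edges) {
    for (const node of [from, to]) {
      if (!successors.has(node)) successors.set(node, []);
      if (!indegree.has(node)) indegree.set(node, 0);
    }
    successors.get(from).push(to);
    indegree.set(to, indegree.get(to) + 1);
  }

  // Start nodes (no incoming edges) go on layer 0.
  const layer = new Map();
  const queue = [];
  for (const [node, count] of indegree) {
    if (count === 0) {
      layer.set(node, 0);
      queue.push(node);
    }
  }

  // Process nodes in topological order (Kahn's algorithm), pushing each
  // successor at least one layer below its deepest predecessor.
  let visited = 0;
  while (queue.length > 0) {
    const node = queue.shift();
    visited++;
    for (const next of successors.get(node)) {
      layer.set(next, Math.max(layer.get(next) || 0, layer.get(node) + 1));
      indegree.set(next, indegree.get(next) - 1);
      if (indegree.get(next) === 0) queue.push(next);
    }
  }
  if (visited !== successors.size) {
    throw new Error("graph contains a cycle; layering needs a DAG");
  }
  return layer;
}

// A small pipeline: one source, two parallel branches, one sink.
const layers = assignLayers([
  ["A", "B"], ["B", "C"], ["B", "D"], ["C", "E"], ["D", "E"],
]);
console.log([...layers]);
```

Nodes that share a layer (here `C` and `D`) would be drawn side by side, which is what gives a dataflow diagram its orderly top-down look compared with a force-directed layout.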
Dagre, a JavaScript library built on top of the D3.js visualization toolkit, is very well suited to visualizing Big Data data flows. Below is an example (contrived, but it illustrates the idea):
Dagre generated the above from the succinct input below:
digraph {
    A [label="Apache"];
    B [label="fetchlog.sh"];
    C [label="parseaccesslogs.pig"];
    D [label="parseerrorlogs.pig"];
    E [label="PageViewsPer503"];
    A -> B [label="access_log"];
    A -> B [label="error_log"];
    B -> C [label="/logs/access_log_$DATE.txt"];
    B -> D [label="/logs/error_log_$DATE.txt"];
    C -> E [label="MySQL:VISITORS"];
    D -> E [label="MySQL:ERRORS"];
}
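In a real project the pipeline description would probably live in configuration or a metadata store rather than in hand-written DOT. As a rough sketch of that idea, the function below builds the same kind of digraph text from plain JavaScript objects. The name `toDot` and the shape of its inputs are assumptions for illustration, and the quoting handles only embedded double quotes.

```javascript
// Hypothetical helper: quote a label for DOT, escaping embedded double quotes.
// Other DOT escape sequences are deliberately not handled in this sketch.
function quoteLabel(text) {
  return '"' + String(text).replace(/"/g, '\\"') + '"';
}

// Hypothetical helper: build DOT digraph text.
// `nodes` maps node ID -> label; `edges` is an array of {from, to, label}.
function toDot(nodes, edges) {
  const lines = ["digraph {"];
  for (const [id, label] of Object.entries(nodes)) {
    lines.push(`    ${id} [label=${quoteLabel(label)}];`);
  }
  for (const { from, to, label } of edges) {
    lines.push(`    ${from} -> ${to} [label=${quoteLabel(label)}];`);
  }
  lines.push("}");
  return lines.join("\n");
}

// Part of the example pipeline, expressed as data.
const dot = toDot(
  { A: "Apache", B: "fetchlog.sh", C: "parseaccesslogs.pig" },
  [
    { from: "A", to: "B", label: "access_log" },
    { from: "B", to: "C", label: "/logs/access_log_$DATE.txt" },
  ]
);
console.log(dot);
```

Keeping nodes and edges as data also means the same structure can drive other views, such as a table of data stores, without parsing DOT back out.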
You can paste it into Dagre's live demo yourself. There you can see that mouseover highlighting works, which means the same mouse events could be hooked to provide drill-downs.
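A drill-down hook could look something like the sketch below. It uses only the standard DOM `addEventListener` call. How to obtain the rendered element for each node depends on the renderer, so the `elementsById` map is an assumption here, as are the function name `attachDrillDown`, the detail records, and the `showDetail` callback.

```javascript
// Hypothetical helper: attach a mouseover handler to each rendered node that
// has drill-down details. `elementsById` is a Map of node ID -> DOM element
// (assumed to be supplied by whatever code renders the graph), `detailsById`
// maps node ID -> detail record, and `showDetail` displays a record.
// Returns the number of nodes that were hooked.
function attachDrillDown(elementsById, detailsById, showDetail) {
  let attached = 0;
  for (const [id, element] of elementsById) {
    const detail = detailsById[id];
    if (detail === undefined) continue; // nothing to drill into for this node
    element.addEventListener("mouseover", () => showDetail(id, detail));
    attached++;
  }
  return attached;
}

// Illustrative detail records for two nodes of the example graph.
const exampleDetails = {
  C: { kind: "process", script: "parseaccesslogs.pig" },
  E: { kind: "data store", name: "PageViewsPer503" },
};
```

In a browser, `showDetail` might fill a side panel or tooltip; a click handler could be attached the same way to start or rerun a process.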