Friday, December 21, 2012

JavaScript Dagre & D3 to visualize Big Data dataflows

Big Data projects often involve a lot of stovepipe processing, and visualizing data flows is a powerful way to convey data provenance to the end user, and even allow control of the involved processes.

There are a number of tools available to visualize data flows, but most suffer from some limitation. Some allow labeling of nodes only, not edges. Some have no provision for placing the node label inside the node. Most use general-purpose graph layout algorithms such as "gravity", whereas dataflow diagrams have distinct start and end nodes and are better laid out mostly orthogonally, either top-down or left-right. (The well-known dataflow language G from LabVIEW is left-right, but top-down is better suited to web browsers.) Graphviz can generate a nice top-down layout, but for the web it can at this time produce only static images (albeit generated dynamically), with no mouseovers to facilitate drill-downs into processes or data stores.

Dagre, a JavaScript library built on top of the D3.js visualization toolkit, is very well suited to visualizing Big Data data flows. Below is an example (contrived, but it illustrates the idea):


Dagre generated the above from the succinct input below:

digraph {
    A [label="Apache"];
    B [label="fetchlog.sh"];
    C [label="parseaccesslogs.pig"];
    D [label="parseerrorlogs.pig"];
    E [label="PageViewsPer503"];
    A -> B [label="access_log"];
    A -> B [label="error_log"];
    B -> C [label="/logs/access_log_$DATE.txt"];
    B -> D [label="/logs/error_log_$DATE.txt"];
    C -> E [label="MySQL:VISITORS"];
    D -> E [label="MySQL:ERRORS"];
}

You can paste it into Dagre's live demo yourself and see that mouseover highlighting works, which suggests that mouse events can be hooked to implement drill-downs.
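
In a real pipeline you would probably generate that input rather than type it by hand. Here is a minimal sketch of one way to do that in C++, assuming the pipeline description is already available as plain lists of nodes and edges. The names Node, Edge, dotQuote and toDot are made up for this illustration and are not part of Dagre or Graphviz.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory description of a dataflow (not a Dagre or Graphviz type).
struct Node { std::string id, label; };
struct Edge { std::string from, to, label; };

// Wrap a label in double quotes, escaping any embedded double quote.
// Other characters are passed through unchanged.
std::string dotQuote(const std::string& s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Emit a digraph in the same shape as the hand-written input above:
// one line per node, then one line per labeled edge.
// Node ids are assumed to be plain identifiers that need no quoting.
std::string toDot(const std::vector<Node>& nodes, const std::vector<Edge>& edges) {
    std::ostringstream os;
    os << "digraph {\n";
    for (const Node& n : nodes)
        os << "    " << n.id << " [label=" << dotQuote(n.label) << "];\n";
    for (const Edge& e : edges)
        os << "    " << e.from << " -> " << e.to
           << " [label=" << dotQuote(e.label) << "];\n";
    os << "}\n";
    return os.str();
}
```

Calling toDot with the nodes Apache and fetchlog.sh and a single access_log edge between them yields the first lines of the input shown above. Two edges between the same pair of nodes (as with access_log and error_log) are just two entries in the edge list.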

Saturday, December 1, 2012

Apache Thrift in Windows

Apache Thrift, originally developed by Facebook, is an immensely useful general-purpose inter-process communication (IPC) code generation tool and library. Although it supports a variety of IPC mechanisms, sockets are its primary conduit; as such, it is naturally language-agnostic, and its compiler can in fact generate code for a dozen different languages.

I use it to provide PHP web interfaces to monitor and control C++ scientific/industrial semi-embedded systems (desktop PCs loaded with data acquisition and control hardware).

Sometimes, those PCs are running Windows. With the recent 0.9.0 release, Apache Thrift support for Windows is leaps and bounds beyond what it used to be, but it's still "only" 98% of the way there. Here are the missing steps:

  1. First, the good news: using Thrift on Windows no longer requires Cygwin or MinGW, despite what the outdated documentation states.
  2. You can download a pre-built Thrift compiler directly from apache.org.
  3. You will, however, still need to compile the Thrift libraries yourself if you plan to use Thrift with a compiled language such as C++. Thankfully, the Thrift distribution comes with a Microsoft Visual C++ .sln solution file. The thing to know, however, is that it is a Visual C++ 2010 .sln file and will not work with Visual C++ 2008. You can use Visual Studio 2012, but recall that Visual Studio 2012 does not work with XP, which I still use for development because of both data acquisition hardware drivers and some legacy software development tools (for some legacy codebases). Thankfully, you can use the freely available Visual C++ 2010 Express, which is still available for download even though Visual Studio 2012 has been released. To download an ISO (to preserve your ability to reinstall in the future) instead of a stub/Internet download, select the option for the "All-in-One ISO".
  4. The \thrift-0.9.0\lib\cpp\thrift.sln contains two projects: libthrift and libthriftnb. The libthriftnb is for the non-blocking server, and if you want to use it from a server, you must link in both libthrift and libthriftnb, as well as utilize TNonblockingServer instead of TSimpleServer. Note that "non-blocking" means non-blocking from the client perspective. On the server side, the call server->serve() actually blocks. To make either TNonblockingServer or TSimpleServer non-blocking from the server code perspective, just wrap it inside a new boost::thread().
  5. Compiling libthriftnb is trickier. First, it requires libevent. To compile libevent, open Start->All Programs->Microsoft Visual Studio 2010 Express->Visual Studio Command Prompt (2010), navigate to the libevent directory, and run nmake -f Makefile.nmake. Second, libthriftnb pulls in Thrift library code that does #include <tr1/functional>. Visual C++ 2010 does not ship a header at that path (its TR1 support lives in the standard <functional> header), so you can just replace the include with <boost/functional.hpp>.
  6. To compile the libthrift project (and this applies to libthriftnb as well), from the Microsoft Visual C++ drop-down menu, choose Project->Properties and set Configuration Properties->C/C++->General->Additional Include Directories: C:\Program Files\boost\boost_1_51 (of course, download Boost first).
  7. Then, to compile your Visual C++ server code that links to libthrift, from the Microsoft Visual C++ drop-down menu, Project->Properties:
    • Configuration Properties->C/C++->General->Additional Include Directories: C:\Program Files\boost\boost_1_51;C:\thrift-0.9.0\lib\cpp\src (for libthriftnb, also include C:\libevent-2.0.21-stable\include;C:\libevent-2.0.21-stable\WIN32-Code;C:\libevent-2.0.21-stable)
    • Configuration Properties->Linker->General->Additional Library Directories: C:\thrift-0.9.0\lib\cpp\Release;C:\Program Files\boost\boost_1_51\lib
    • Configuration Properties->Linker->Input->Additional Dependencies: libboost_thread-vc100-mt-1_51.lib;libboost_chrono-vc100-mt-1_51.lib;libthrift.lib. (For the Debug version, substitute mt-gd for mt.)
  8. In your server code, include the following code prior to invocation of any of the Thrift code:
    // Initialize Winsock 2.2 before any Thrift call; WSAStartup returns 0 on success.
    WSADATA wsaData = {};
    WORD wVersionRequested = MAKEWORD(2, 2);
    int err = WSAStartup(wVersionRequested, &wsaData);
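
Steps 4 and 8 can be combined into one pattern: initialize Winsock first, then run the blocking serve() call on a background thread. Below is a minimal sketch of that idea under a few loudly stated assumptions. It uses C++11 std::thread rather than boost::thread (with Visual C++ 2010 you would need boost::thread, as mentioned in step 4). FakeBlockingServer is a made-up stand-in so the example has no Thrift dependency, and it assumes the real server offers a stop() call that makes serve() return. WinsockSession and BackgroundServer are likewise names invented for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

#ifdef _WIN32
#include <winsock2.h>  // WSAStartup / WSACleanup; link against Ws2_32.lib
#endif

// RAII wrapper around step 8. Off Windows it does nothing and reports success.
class WinsockSession {
public:
    WinsockSession() {
#ifdef _WIN32
        WSADATA wsaData = {};
        ok_ = (WSAStartup(MAKEWORD(2, 2), &wsaData) == 0);
#else
        ok_ = true;
#endif
    }
    ~WinsockSession() {
#ifdef _WIN32
        if (ok_) WSACleanup();
#endif
    }
    WinsockSession(const WinsockSession&) = delete;
    WinsockSession& operator=(const WinsockSession&) = delete;
    bool ok() const { return ok_; }

private:
    bool ok_ = false;
};

// Hypothetical stand-in for a Thrift server: serve() blocks until stop() is called.
class FakeBlockingServer {
public:
    void serve() {
        assert(!serving_ && "serve() must not be entered twice");
        serving_ = true;
        while (!stop_requested_)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        serving_ = false;
    }
    void stop() { stop_requested_ = true; }
    bool serving() const { return serving_; }

private:
    std::atomic<bool> stop_requested_{false};
    std::atomic<bool> serving_{false};
};

// Runs server.serve() on its own thread so the caller is not blocked.
// The destructor asks the server to stop and waits for the thread to finish.
template <typename Server>
class BackgroundServer {
public:
    explicit BackgroundServer(Server& server)
        : server_(server), worker_([&server] { server.serve(); }) {}

    ~BackgroundServer() {
        server_.stop();
        if (worker_.joinable()) worker_.join();
    }

    BackgroundServer(const BackgroundServer&) = delete;
    BackgroundServer& operator=(const BackgroundServer&) = delete;

private:
    Server& server_;
    std::thread worker_;
};
```

In use, you would create a WinsockSession at the top of main(), check ok(), construct the server, and then hand it to a BackgroundServer. The calling thread carries on with its own work, and the server is stopped and joined when the BackgroundServer goes out of scope.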