Saturday, January 11, 2014

Science Data Science

We commonly hear about "data science" in the context of mining marketing or business data, especially "Big Data", but of course scientists have been practicing statistics for centuries.

Timeline of Statistics.

But recall from Drew Conway's famous Venn diagram that data science is more than just statistics:

Data Science Venn Diagram

For scientists to move out of "traditional research" and into data science, they need to add computer-based skills such as machine learning and big data.

Data management alone has long been a problem in the scientific realm. We learned last month that 80% of the data from scientific research conducted over the past 20 years is already lost due to poor data retention policies. In recognition of that, U.S. government grants now require a data retention plan to be included in funding proposals. And just two days ago, IEEE Spectrum posted a blog article on Gordon Moore's new law, that "big data will lead to big science," and on his philanthropic efforts to support that.

But why is scientific data retention so poor? Having worked in the scientific software development field for half my career, I can speculate on several reasons:

  • Scientific data is not amenable to conventional (e.g. relational) databases. Scientific data sets are typically array-based (2D, 3D, 4D, and higher), where the array indices rather than relational metadata describe the data. This is a fancy way of saying a bunch of flat, unannotated binary files, but there are reasons scientists use such files: they are compact relative to, say, XML and JSON; it is easy to write software that reads and writes them; and they are not tied to a particular software vendor. With their ease of use, though, comes ease of deletion. Corporate and institutional cultures pay no heed when files get deleted, but try to delete a database and suddenly the resistance increases dramatically. The convenience that makes scientists prefer binary files also makes those files easy to treat as disposable.
  • Until the focus on data retention of the past 3-5 years, funding proposals never included funding for prolonged data retention. Data retention is expensive. Data formats change, from 8" floppies, to 5.25" floppies, to Bernoulli drives, to 3.5" floppies, to Zip drives, to Jaz drives, to QIC tapes, to CD-ROM, to DVD-R, to LTO tapes, to USB drives, not to mention data on raw hard drives: MFM, RLL, SCSI, Ultra SCSI, IDE, SATA, SAS. It takes both labor and capital investment to continually propagate data from one format to another. Not only must data be format-shifted; even within a single format it must be refreshed to guard against physical or magnetic decay. Properly maintaining data also involves keeping multiple backups, including at off-site locations.
  • Compounding the issue is that scientific data retains its value even longer than business data. Longitudinal studies of humans or civil engineering edifices can leverage data spanning a century or more. New scientific studies frequently make use of old data sets, applying new techniques or new insights, when such data is available (or would, if it were available).
  • Scientists are not experts in computers, let alone IT "best practices". Scientists typically know enough about computers to get by, but have not yet generally added that third bubble from the Venn diagram.
  • Scientists often have to rely on proprietary and commercial software and systems for data collection. These systems are specialized, and have no open source counterparts because the economic forces that propel open source software in the realm of business software are absent. Scientific software often even comes tethered to dongles, or works only on proprietary operating systems that are no longer available (such as MS-DOS). I have even blogged and presented on an alternative to all this, using XML/XSL/HTML5 for reports instead of PDF, in which I suggest that visualization and presentation programs be written as open-source JavaScript instead of closed-source proprietary and commercial binaries, but I know of no uptake outside of my own single implementation of the idea.
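
To make the first two points concrete, here is a minimal Python sketch of a flat binary array file whose meaning lives entirely in its array indices, paired with a small JSON "sidecar" file that records the shape, element type, byte order, and a SHA-256 checksum. The sidecar layout and the names write_annotated, read_annotated, and at are hypothetical, invented for this illustration; this is not an established scientific file format, just one cheap way a raw file could carry enough annotation to survive being copied from one medium to the next and to be checked for decay afterward.

```python
import hashlib
import json
import sys
import tempfile
from array import array
from pathlib import Path


def write_annotated(path, values, shape, description):
    """Write a flat binary file plus a JSON sidecar describing it (hypothetical layout)."""
    data = array("d", values)  # C doubles in native byte order
    count = 1
    for n in shape:
        count *= n
    if count != len(data):
        raise ValueError("shape does not match number of values")
    raw = data.tobytes()
    path.write_bytes(raw)
    sidecar = {
        "description": description,
        "shape": list(shape),
        "typecode": "d",
        "byteorder": sys.byteorder,
        "sha256": hashlib.sha256(raw).hexdigest(),
    }
    Path(str(path) + ".json").write_text(json.dumps(sidecar, indent=2))


def read_annotated(path):
    """Read the flat file back, verifying it against the sidecar checksum."""
    sidecar = json.loads(Path(str(path) + ".json").read_text())
    raw = path.read_bytes()
    if hashlib.sha256(raw).hexdigest() != sidecar["sha256"]:
        raise ValueError("checksum mismatch: file may have decayed or been altered")
    data = array(sidecar["typecode"])
    data.frombytes(raw)
    if sidecar["byteorder"] != sys.byteorder:
        data.byteswap()  # file was written on a machine with the other byte order
    return data, tuple(sidecar["shape"])


def at(data, shape, i, j, k):
    """Row-major lookup into a 3D array stored flat: the indices are the only 'metadata'."""
    _, ny, nz = shape
    return data[(i * ny + j) * nz + k]


# Small demonstration with a made-up 2x3x4 volume holding the values 0..23.
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "volume.bin"
    write_annotated(f, range(24), (2, 3, 4), "hypothetical 2x3x4 test volume")
    data, shape = read_annotated(f)
    print(shape, at(data, shape, 1, 2, 3))
```

Without the sidecar, the .bin file is just 192 anonymous bytes, and nothing in it says whether it is a 2x3x4 volume, a 4x6 image, or a 24-sample time series. The checksum is what lets a later copy, on whatever medium comes next, be verified rather than trusted.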

Data retention is the critical first step to expanding data science in the scientific realm. Without data retention, there can be no statistics and machine learning over data sets that include data from the past or data from other researchers. I can imagine scientists universally taking to tools like R, IPython Notebook, and Weka like fish to water, but without data, there is no water.
