Wednesday, June 24, 2015

My Spark Summit Presentation On Word2Vec and Semi-Supervised Learning

Abstract

MLLib Word2Vec is an unsupervised learning technique that generates feature vectors that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can't put the label of "fruit" on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from crowd-sourced Wikipedia metadata), along with the NLP string-similarity metrics Levenshtein and Jaccard, to properly label categories. Although YAGO2 is a graph, as an alternative to GraphX we make use of Ankur Dave's powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS), a cloud app that processes big data.
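
As a taste of the unsupervised half of the talk, here is a minimal sketch of MLLib Word2Vec (the corpus file name and parameter values are placeholders, not the ones used in ODECS):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.Word2Vec

    object Word2VecSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word2vec-sketch"))

        // Each record is a sequence of tokens from the raw text corpus.
        val corpus = sc.textFile("corpus.txt")
                       .map(_.toLowerCase.split("\\s+").toSeq)

        val model = new Word2Vec()
          .setVectorSize(100) // dimensionality of the feature vectors
          .setMinCount(5)     // ignore rare tokens
          .fit(corpus)

        // Nearest neighbors in vector space: Word2Vec can say "apple" is
        // close to "banana", but it takes YAGO2 to label the group "fruit".
        model.findSynonyms("apple", 5).foreach { case (word, cosine) =>
          println(s"$word\t$cosine")
        }
      }
    }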

Video



Slides


Wednesday, June 17, 2015

Spark 1.4 for Data Scientists; Spark 1.5 & 1.6 for core improvements

The theme at Spark Summit 2015 this week can be boiled down to "Spark 1.4 is for data scientists". The "first new supported language in over a year" is the highlight of Spark 1.4: SparkR, originally an AMPLab project, is now part of the Apache Spark distribution. Another data science improvement is that Spark ML (which provides the ML pipelines, and which may eventually replace Spark MLLib) is now out of alpha. On the commercial side, the Databricks Cloud offering is now GA, with its ability to spin up an arbitrarily large Spark cluster at the touch of a button -- and give you a notebook-style interface to Spark. Amazon also announced turn-key Spark spin-up on AWS (but no notebook).
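To give a flavor of the pipelines API, here is a minimal sketch (the toy DataFrame and column names are my own, not from the Summit talks) that chains a tokenizer, a feature hasher, and a classifier into a single fit() call:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy training data: (label, text).
        val training = sc.parallelize(Seq(
          (1.0, "spark spark streaming"),
          (0.0, "hadoop map reduce"))).toDF("label", "text")

        // Each stage declares its input and output columns.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
        val lr        = new LogisticRegression().setMaxIter(10)

        // The pipeline fits all three stages as one estimator.
        val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
                                  .fit(training)
      }
    }
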
On the one hand, Spark is moving really fast. Half of the 8000+ Jira tickets have been entered in 2015 alone. On the other hand, there is so much that people want from it that in some respects it seems it's not moving fast enough. With all that went into Spark 1.4 for data science, improvements to Spark Core are to come in Spark 1.5 and 1.6. Although some of Project Tungsten made it into Spark 1.4, most of it is targeted for Spark 1.5, and the most interesting part -- on-the-fly compilation to Intel SIMD -- is slated for Spark 1.6. That will lay the groundwork for on-the-fly compilation to GPU, which will presumably come even later.
Another contentious issue with Spark developers (i.e. not data scientists) is the inability to launch, track, and control Spark YARN jobs via a Java API. The spark-submit.sh Bash script is the only official way to submit Spark YARN jobs, making it impossible for Spark to, for example, serve as the back end to a web app that needs to tightly control, monitor, and launch Spark jobs. This issue was raised again at the Bay Area Spark Meetup held on-site during Spark Summit, and the Databricks panel members reluctantly predicted that the upcoming "Launcher" mechanism would arrive in Spark 1.5 or Spark 1.6.
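A sketch of the kind of programmatic submission developers are asking for, written against the org.apache.spark.launcher.SparkLauncher API (to my knowledge this class landed in Spark 1.4, but it hands back only a bare java.lang.Process -- no tracking or control -- which is presumably the gap the "Launcher" mechanism is meant to close; paths and class names below are placeholders):

    import org.apache.spark.launcher.SparkLauncher

    object LauncherSketch {
      def main(args: Array[String]): Unit = {
        // Builds and forks a spark-submit under the hood.
        val process = new SparkLauncher()
          .setAppResource("/path/to/app.jar") // placeholder
          .setMainClass("com.example.MyApp")  // placeholder
          .setMaster("yarn-cluster")
          .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
          .launch()

        // All the "tracking" available so far: wait for the exit code.
        println(s"Spark job exited with code ${process.waitFor()}")
      }
    }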

Monday, May 11, 2015

Yes, Neural Networks Have Grandmother Cells


The neural network portions of the above image come from Wikipedia.

The age-old debate about neural networks (both artificial and biological) is whether they have a grandmother cell, a neuron cell/node somewhere in the net that is activated when one's grandmother is viewed (assuming a biological vision scenario or computer vision application).

For biological neural networks, the jury is still out, though the answer is leaning toward the "no" side. For artificial neural networks, if you Google for the answer, you'll almost always come across the admonishment to avoid grandmother cells in your networks. But for those new to neural networks, this advice can easily be misunderstood.

More precisely, grandmother cells are to be avoided in the internal nodes. The reason is that internal nodes are supposed to represent latent variables, i.e., intermediate properties like "big eyes" or "big teeth". If an internal node is already recognizing grandma, that is an indication of overfitting, and a sign that the network was perhaps made too large for the amount of training data or the search space.

But any artificial neural network whose job is to classify images, where one of the classes is grandma, will have a grandmother cell, namely in the output layer. That's simply how you get output from a classifier neural network.
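
A toy illustration (the class list and scores below are made up) of what that output-layer grandmother cell looks like: whichever output node carries the "grandma" label is, by construction, a grandmother cell.

    // Sketch of a classifier's output layer: one node (score) per class.
    val classes = Vector("cat", "dog", "grandma")
    val logits  = Vector(0.3, 0.2, 2.5) // made-up scores from the last hidden layer

    // Softmax turns the scores into class probabilities.
    val exps    = logits.map(math.exp)
    val softmax = exps.map(_ / exps.sum)

    classes.zip(softmax).foreach { case (label, p) =>
      println(f"$label%-8s $p%.3f") // grandma gets ~0.83
    }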

Thursday, May 7, 2015

RDF vs. Property Graphs: Excerpt From "Spark GraphX In Action"


The above is an animated interpretation of one of the (static) images from chapter 3 of my upcoming book Spark GraphX In Action. Chapter 3 is now available for download for those who have purchased the MEAP. Today, May 7, 2015, the MEAP is available for a 50% discount using the discount code dotd050715au.
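
To make the RDF-versus-property-graph distinction concrete outside the book, here is a sketch (the people and properties are invented) of the same relationship modeled both ways in GraphX:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object RdfVsPropertyGraph {
      case class Person(name: String, age: Int)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdf-vs-pg"))

        // RDF style: everything is a (subject, predicate, object) triple,
        // so the edge attribute is just the predicate string.
        val rdfGraph = Graph(
          sc.parallelize(Seq((1L, "Ann"), (2L, "Bob"))),
          sc.parallelize(Seq(Edge(1L, 2L, "knows"))))

        // Property-graph style: vertices and edges carry typed properties,
        // not just a single label.
        val pgGraph = Graph(
          sc.parallelize(Seq((1L, Person("Ann", 62)), (2L, Person("Bob", 60)))),
          sc.parallelize(Seq(Edge(1L, 2L, Map("since" -> 1979)))))

        pgGraph.triplets.collect().foreach(println)
      }
    }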

Thursday, March 12, 2015

My new book: Spark GraphX In Action

My new book became available on MEAP on March 6, 2015, and the print version is expected later in 2015.

Spark and graphs are two tools rapidly being adopted by data scientists. Combine them and you get the GraphX component of Spark. Written in a way that assumes minimal Scala knowledge, the book quickly brings readers up to speed on all four: Scala, Spark, definitions from graph theory, and GraphX itself. An appendix provides three quick ways to get Spark running, and the chapters provide explicit examples from the most basic up through machine learning.

Monday, March 2, 2015

Spark Ecosystem

Presentation I gave to the Data Science Association on February 28, 2015:

  1. How fast Spark is moving
  2. Ways to keep up
  3. Current status of the ever-expanding Spark ecosystem

Original video

Re-recorded later as a screencast

Tuesday, February 24, 2015

State of the Stream 2015Q1

The big news this week -- and for the past couple of months, really -- is of course eBay open sourcing their Pulsar real-time analytics framework, which allows SQL queries into a real-time data stream. It's built on top of their Jetstream data streaming framework, also open sourced this month, which, from its list of capabilities, seems to be a layer akin to Storm and Spark Streaming, with sinks and sources, Kafka connectivity, and REST interfaces for cluster monitoring.
With all the breathless news items out there about Pulsar, they neglect to mention one important fact: it is GPL, which precludes its use at a lot of organizations.
This is in stark contrast to some other big news this past week: Druid just switched from the GPL to an Apache license.
Streaming is important because, as I've previously blogged, every Big Data project eventually becomes a data streaming project: once insights are found in Big Data, the consumers of those insights ask for them to be refreshed with ever more up-to-date data, until eventually they request real-time updates.
I've also previously blogged, over a year ago now, comparing streaming frameworks, especially Storm versus Spark Streaming. At the time, I relayed how, at the February 2013 Strata, several speakers said that Trident, the layer that gives Storm exactly-once semantics, makes Storm non-performant. I also wrote at that time that Spark Streaming, even though it had exactly-once semantics, lacked things like graceful shutdown and transactions.
Well, there is recent news about both Spark Streaming and Storm. First, Spark Streaming gained high availability in 1.2.0, released in December 2014. And although it has exactly-once semantics, achieving them with Kafka is a bit tricky, since Kafka itself does not guarantee exactly-once delivery when a node in a Kafka cluster goes down. However, Spark 1.3.0 will have exactly-once Kafka semantics. Spark 1.3.0 release candidates are being voted on now, and it will probably be released in the first half of March, 2015.
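For the curious, here is a minimal sketch of the new "direct" (receiver-less) Kafka stream coming in 1.3 (the broker address and topic name are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectKafkaSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("direct-kafka-sketch"), Seconds(10))

        // The direct approach tracks Kafka offsets in Spark's own checkpoints
        // instead of a receiver, which is what makes exactly-once possible.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))

        stream.map(_._2).count().print() // messages per 10-second batch

        ssc.start()
        ssc.awaitTermination()
      }
    }
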
As for performance, a University of Toronto grad student recently benchmarked Storm vs. Spark Streaming:

"Storm was around 40% faster than Spark, processing tuples of small size (around the size of a tweet). However, as the tuple's size increased, Spark had better performance maintaining the processing times."