MLLib Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. As an alternative to GraphX even though YAGO2 is a graph, we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.
The theme at Spark Summit 2015 this week can be boiled down to "Spark 1.4 is for data scientists". The "first new supported language in over a year" is the highlight of Spark 1.4: SparkR, originally an AMPLab project, is now part of the Apache Spark distribution. Another data science improvement is Spark ML (which has the ML pipelines, and also which may eventually replace Spark MLLib) is now out of beta alpha. On the commercial side, the Databricks Cloud offering is now GA, with its ability to spin up an arbitrarily large Spark cluster at the touch of a button -- and give you a notebook-style interface to Spark. Amazon also announced turn-key Spark spin-up on AWS (but no notebook).
On the one hand, Spark is moving really fast. Half of the 8000+ Jira tickets have been entered in 2015 alone. On the other hand, there is so much people want in it that in some respects it seems it's not moving fast enough. With all that went in to Spark 1.4 for data science, improvements to Spark Core are to come in Spark 1.5 and 1.6. Although some of Project Tungsten made it into Spark 1.4, most of it is targeted for Spark 1.5, and the most interesting part -- on-the-fly compilation to Intel SIMD -- is slated for Spark 1.6. That will lay the groundwork to on-the-fly compilation to GPU, which will presumably come even later.
Another contentious issue with Spark developers (i.e. not data scientists) is the inability to launch, track, and control Spark YARN jobs via a Java API. The spark-submit.sh Bash shell is the only official way to submit Spark YARN jobs, making it impossible for Spark to, for example, serve as the back-end to a web app that expects to tightly control, monitor, and launch Spark jobs. This issue was raised again at the Bay Area Spark Meetup that was held on-site during Spark Summit, and the Databricks panel members reluctantly predicted that the upcoming "Launcher" mechanism would be in Spark 1.5 or Spark 1.6.