tag:blogger.com,1999:blog-61043630476599658302024-03-13T11:54:48.737-06:00Technical Tidbit of the DayMichael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.comBlogger113125tag:blogger.com,1999:blog-6104363047659965830.post-88602513468535332692018-11-29T08:44:00.001-07:002018-11-29T08:44:44.710-07:00Sentiment Analysis pending patent published today"Techniques for Sentiment Analysis of Data Using a Convolutional Neural Network and a Co-Occurrence Network"
<br />
<br />
<pre wrap=""></pre>
<pre wrap=""><a class="moz-txt-link-freetext" href="http://pdfaiw.uspto.gov/.aiw?docid=20180341839&PageNum=1&IDKey=8234088170C0&HomeUrl=http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1%2526Sect2=HITOFF%2526d=PG01%2526p=1%2526u=%25252Fnetahtml%25252FPTO%25252Fsrchnum.html%2526r=1%2526f=G%2526l=50%2526s1=%25252220180341839%252522.PGNR.%2526OS=DN/20180341839%2526RS=DN/20180341839">http://pdfaiw.uspto.gov/.aiw?docid=20180341839&PageNum=1&IDKey=8234088170C0&HomeUrl=http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1%2526Sect2=HITOFF%2526d=PG01%2526p=1%2526u=%25252Fnetahtml%25252FPTO%25252Fsrchnum.html%2526r=1%2526f=G%2526l=50%2526s1=%25252220180341839%252522.PGNR.%2526OS=DN/20180341839%2526RS=DN/20180341839</a></pre>
<pre wrap=""></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZMQcZjOD13myjFLTp3nG-Geu48g7r0gMs19K6iWF-le_OnbyGPaxBLi3y2_jsXZxJJK7OrBP7G94QJHc3pdKV_j4RfBRyORg0wURrvjXMHiThpQkXsgcK849ycELPM7I9QCv5Kqn2ElI/s1600/SentimentAnalysisPatentDiagram.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="274" data-original-width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZMQcZjOD13myjFLTp3nG-Geu48g7r0gMs19K6iWF-le_OnbyGPaxBLi3y2_jsXZxJJK7OrBP7G94QJHc3pdKV_j4RfBRyORg0wURrvjXMHiThpQkXsgcK849ycELPM7I9QCv5Kqn2ElI/s1600/SentimentAnalysisPatentDiagram.jpg" /></a></div>
<pre wrap=""></pre>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-20531584280092593232018-07-03T10:45:00.003-06:002018-07-03T10:50:40.017-06:00Quickest inline d3.js in JupyterEveryone has their own way to use d3.js in Jupyter. Here is the shortest and most concise I've been able to put together.<br />
<h3>
Step 1</h3>
Download d3.js and put it into the directory <span style="font-family: "courier new" , "courier" , monospace;">~/.jupyter/custom</span><br />
<h3>
Step 2</h3>
Create a file <span style="font-family: "courier new" , "courier" , monospace;">~/.jupyter/custom/custom.js</span> and put the following into that file (note there is no ".js" in the quoted filepath):<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">require.config({paths:{d3:"/custom/d3.v5.min"}})</span><br />
<h3>
Step 3</h3>
In a Jupyter cell:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">from IPython.core.display import display, Javascript</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">display(Javascript('''</span><span style="font-family: "courier new", courier, monospace; font-size: x-small;">require(['d3'], function(d3) {</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">var svg = d3.select(element.get(0)).append('svg').attr('width',600).attr('height',200)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">svg.append('circle').attr('cx',30).attr('cy',30).attr('r',20)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">})'''))</span><br />
<br />Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com1tag:blogger.com,1999:blog-6104363047659965830.post-77393986977685922242018-02-09T11:17:00.000-07:002018-02-09T11:17:10.212-07:00Minimal Scala PlayIf you just need Scala Play for some quick testing/demo of Scala code, even the <a href="https://www.playframework.com/download" target="_blank">Scala Play Starter Example</a> is too heavy. It has a lot of example code that is not needed and too much security for something to be run and accessed only locally.<br />
<br />
Here is how to trim down the Scala Play Starter Example. First is the <span style="font-family: Courier New, Courier, monospace;"><b>conf/application.conf</b></span> file. All that is needed for the whole file is:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">play.filters {</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> hosts { allowed = ["."] }</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> headers { contentSecurityPolicy = "default-src * 'unsafe-inline'" }</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">}</span><br />
<div>
<br /></div>
<div>
The <span style="font-family: Courier New, Courier, monospace;">hosts.allowed</span> allows connections from external sources, and <span style="font-family: Courier New, Courier, monospace;">headers.contentSecurityPolicy</span> allows things like remotely hosted Javascript (e.g. http://code.jquery.com/jquery-3.3.1.min.js) and Javascript inline directly in HTML elements (i.e. disable CSP and go back to 2016).</div>
<div>
<br /></div>
<div>
Then the <span style="font-family: Courier New, Courier, monospace;"><b>conf/routes</b></span> file:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">GET / controllers.HomeController.index</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">GET /mywebservice controllers.MyWebServiceController.get(inputdata)</span></div>
</div>
<div>
<br /></div>
<div>
Specifically, you can delete the <span style="font-family: Courier New, Courier, monospace;">/count</span> and <span style="font-family: Courier New, Courier, monospace;">/message</span> routes, and then add whatever routes you need for web services (like <span style="font-family: Courier New, Courier, monospace;">/mywebservice</span> above).</div>
<div>
<br /></div>
<div>
In the <span style="font-family: Courier New, Courier, monospace;"><b>app</b></span> directory:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm -rf filters</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm -rf services</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm Module.scala</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm controllers/AyncController.scala</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm controllers/CountController.scala</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm views/main.scala.html</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">rm views/welcome.scala.html</span></div>
<div>
<br /></div>
<div>
And then in <span style="font-family: Courier New, Courier, monospace;">views/index.scala.html</span> you can just delete all the code therein and write your own regular HTML and not bother with the Twirl template language if you don't need it.</div>
<div>
<br /></div>
<div>
Finally, you'll need to create <span style="font-family: Courier New, Courier, monospace;">controllers/MyWebServiceController.scala</span>. You can use <span style="font-family: Courier New, Courier, monospace;">HomeController.scala</span> as a template and add in <span style="font-family: Courier New, Courier, monospace;">import play.api.libs.json._</span> to gain access to the Play JSON APIs for parsing and generating JSON.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b>controllers/MyWebServiceController.scala</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">package controllers</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">import javax.inject._</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">import play.api.libs.json._</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">import play.api.mvc._</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "Courier New", Courier, monospace; font-size: x-small;">@Singleton</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">class MyWebServiceController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "Courier New", Courier, monospace; font-size: x-small;"> def get(inputdata:String) = Action {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> val a = Json.parse(inputdata)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> val r = // do stuff with a</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> Ok(Json.toJson</span><span style="font-family: "Courier New", Courier, monospace; font-size: x-small;">(r))</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
<div>
<br /></div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-65221352888544200312017-11-21T13:49:00.001-07:002017-11-21T13:49:32.228-07:001PB in 1U17 months ago I blogged about <a href="https://technicaltidbit.blogspot.com/2016/07/900tb-4u-60k.html" target="_blank">900TB (nearly 1PB) in 1U of rack space</a> for only $60k. There I noted that a 1U server for the new high-density SSDs wasn't commonly available. Well, that has changed. A couple of months ago, Super Micro announced their <a href="https://www.supermicro.com/newsroom/pressreleases/2017/press170914_32_JBOF.cfm" target="_blank">32-bay 1U unit for 2.5" drives</a>. With the 32TB SSDs that Samsung <a href="http://www.tomshardware.com/news/seagate-samsung-huawei-worlds-largest-ssd,32452.html" target="_blank">announced</a> last year to be available this year, that yields 1PB.<span id="goog_790554157"></span><br />
<br />
It won't be cheap. Recall that the 900TB 4U for $60k was for spinning drives. Given that the 16TB SSDs go for nearly <a href="https://www.cdw.com/shop/products/Samsung-PM1633a-MZILS15THMLS-solid-state-drive-15.36-TB-SAS-12Gb-s/4079174.aspx" target="_blank">$12k</a> a pop, the 32TB drive that has been slated for later this year would be at least twice that and likely much more initially. Even at $24k for each 32TB SSD, this 1U of 1PB SSD would set you back $800k.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjchvFnzVnPBcq7p6tyKzPlMv8T_EdYTc1kNql9zgIKhKSbg6RMA1agu2UfzRjokHEEhyhVN7u4eB9o9z8brFvrDtkwg7yxlEeXqY7ey0jGkk2evU3CO7xzmC3kXM9q2_1yoQjqC5bdFLg/s1600/1u1pb.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="220" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjchvFnzVnPBcq7p6tyKzPlMv8T_EdYTc1kNql9zgIKhKSbg6RMA1agu2UfzRjokHEEhyhVN7u4eB9o9z8brFvrDtkwg7yxlEeXqY7ey0jGkk2evU3CO7xzmC3kXM9q2_1yoQjqC5bdFLg/s1600/1u1pb.jpg" /></a></div>
<br />Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-91686546366963445552017-10-27T12:36:00.001-06:002017-10-27T12:36:36.426-06:00The Spark GraphX of actual combatEarlier this year, my book was translated into a Chinese edition. It actually has sold extremely well. I just noticed that Amazon has a product page for it, and they've given it the title <a href="https://www.amazon.com/Spark-GraphX-actual-combat-Chinese/dp/B06XRKHW6K/">The Spark GraphX of actual combat (Chinese Edition)</a>.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj1Ie4yQ1gCPlAM4Gu-ymlJuJiuaTSWMqc5CLu7dkbZtAdUXjSGfBwpyTX4gn-Nj85gDUhSPyo-7UEfyvgepjzxRezYfI5d0oF3uy-ol8tmTS0q0g5npgsk9jfmnucpkTQwDeZUEbkDko/s1600/GraphXChineseAmazon20171027_500.jpg" imageanchor="1"><img border="0" data-original-height="467" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj1Ie4yQ1gCPlAM4Gu-ymlJuJiuaTSWMqc5CLu7dkbZtAdUXjSGfBwpyTX4gn-Nj85gDUhSPyo-7UEfyvgepjzxRezYfI5d0oF3uy-ol8tmTS0q0g5npgsk9jfmnucpkTQwDeZUEbkDko/s1600/GraphXChineseAmazon20171027_500.jpg" /></a>
<br />
<br />
My hunch is that's what one gets if one translates "Spark GraphX in Action" into Chinese and then back into English.Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-73856325507305853672017-10-20T09:50:00.000-06:002017-10-27T12:30:45.504-06:00Neo4j's query language Cypher coming to SparkIn my 2016 Spark Summit presentation <a href="https://spark-summit.org/2016/events/finding-graph-isomorphisms-in-graphx-and-graphframes/">Finding Graph Isomorphisms in GraphX and GraphFrames</a> I reviewed the history of graphs in Spark, and how to query a graph in Spark GraphX required many <a href="https://www.slideshare.net/SparkSummit/finding-graph-isomorphisms-in-graphx-and-graphframes/17">more lines of code</a> than an equivalent query in Neo4j using its Cypher language. Even <a href="https://github.com/graphframes">Spark GraphFrames</a>, which implements a tiny, tiny subset of Cypher requires more code than full Cypher.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBkbilNykgPMU_gzl3RCIX00NtGkqMTpPZdkCxAutk2zgbeG-wpI0AiSqCaowS3TZFRn9XUmpJqvE0Ddrkv1F5B5bf12HT8SzFf6j3cXYtz16Uolc-UwTNbYixz7HWrmcw4n5Gz_36dnk/s1600/CAPS.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="200" data-original-width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBkbilNykgPMU_gzl3RCIX00NtGkqMTpPZdkCxAutk2zgbeG-wpI0AiSqCaowS3TZFRn9XUmpJqvE0Ddrkv1F5B5bf12HT8SzFf6j3cXYtz16Uolc-UwTNbYixz7HWrmcw4n5Gz_36dnk/s1600/CAPS.png" /></a></div>
<br />
Two years ago at the 2015 GraphConnect (an event sponsored by Neo4j), Ion Stoica of Databricks <a href="http://www.marketwired.com/press-release/neo4j-opens-up-its-graph-query-language-cypher-with-support-from-leading-companies-2065909.htm">announced</a>:<br />
<blockquote>
We look forward to bringing Cypher's graph pattern matching capabilities into the Spark stack, making graph querying more accessible to the masses.</blockquote>
Well, two years later, Neo4j announced <a href="https://insidebigdata.com/2017/10/24/apache-spark-expands-cypher-neo4js-sql-graphs-adds-declarative-graph-querying/">yesterday</a>:<br />
<blockquote>
Neo4j, a leader in connected data, announced that it has released the preview version of Cypher for Apache Spark (CAPS) language toolkit. </blockquote>
<blockquote>
[...] Until now, data scientists have been using Spark and query tools like GraphX to define extensions to their graphs. Once identified, they would then re-implement and deploy that work within their applications. Now, with Cypher for Apache Spark, these scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly. </blockquote>
<blockquote>
[...] This announcement builds on Neo4j’s unveiling of openCypher in October 2015, as an effort to push the whole graph industry forward by tapping into the open source community and making Cypher’s evolution an open exercise while avoiding redundant research.</blockquote>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-57545739591747338652017-06-12T22:30:00.000-06:002017-06-12T22:32:26.486-06:00My Spark Summit presentation: Neuro-Symbolic AI for Sentiment Analysis <h3>Video</h3>
<iframe width="560" height="315" src="https://www.youtube.com/embed/SqfjEYfpa44" frameborder="0" allowfullscreen></iframe>
<h3>Slides</h3>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/KpHKJAf9xC2VUe" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe>Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-2615503244753643532017-06-07T16:28:00.000-06:002017-06-07T16:33:51.457-06:00Spark Summit 2017 Review<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg50JG16MLqxH_Xf4n3oCE1vdwBugr6P-UHn605PWatS2Bq514Jz_Tlx1HBUXz2ZddE_PLNkWis6B7985ePfbnU57fntQk9Jmb8N6ZQ3OwAhVDeLW7gDdXL-wlBb379YYa52xVVj9x_2IE/s1600/SparkSummit2017Logo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="117" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg50JG16MLqxH_Xf4n3oCE1vdwBugr6P-UHn605PWatS2Bq514Jz_Tlx1HBUXz2ZddE_PLNkWis6B7985ePfbnU57fntQk9Jmb8N6ZQ3OwAhVDeLW7gDdXL-wlBb379YYa52xVVj9x_2IE/s1600/SparkSummit2017Logo.jpg" /></a></div>
<br />
Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, <a href="https://databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html">announced</a> they are open sourcing a deep learning library <a href="https://github.com/databricks/spark-deep-learning">Deep Learning Pipelines</a> which seems to lack GPU support. Similarly, Intel open sourced their own deep learning library, <a href="https://github.com/intel-analytics/BigDL">BigDL</a>, also without GPU support, because Intel is <a href="https://spark-summit.org/2017/events/accelerating-sparkml-workloads-on-the-intel-xeon-fpga-platform/">pushing</a> their FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first <a href="http://datascienceassn.org/content/intels-fpgaxeon-data-streaming">blogged</a> about three years ago).<br />
<br />
For now, the leading contender for Spark GPU deep learning still seems to be <a href="https://deeplearning4j.org/">DeepLearning4j</a>, which is what I used in my Spark Summit 2017 presentation <a href="https://spark-summit.org/2017/events/neuro-symbolic-ai-for-sentiment-analysis/">Neuro-Symbolic AI for Sentiment Analysis</a>. (I will link video and slides once they are posted.)<br />
<br />
The big announcement the second day (non-training) of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling use over trying to run Zeppelin, Jupyter, or Spark Shell on-premises.<br />
<br />
A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's <a href="https://spark-summit.org/2017/events/machine-learning-innovation-fireside-chat/">Fireside Chat</a> touted <a href="https://rise.cs.berkeley.edu/projects/tegra/">Tegra</a> for (finally) mutable graphs, which is based on GraphX rather than GraphFrames. (I first blogged about Tegra in <a href="http://technicaltidbit.blogspot.com/2016/06/spark-summit-2016-review.html">my review of last year's Spark Summit</a>.)<br />
<br />
There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games on language-agnostic (English, Chinese, Japanese -- it didn't care) chat text messaging <a href="https://spark-summit.org/2017/events/combating-abusive-language-in-chat-with-apache-spark/">abusive language detection</a>. And, of course, my own presentation was on NLP.<br />
<br />
Finally, Structured Streaming finally got officially labeled as production-ready, meaning Spark Streaming will eventually destined for the deprecation graveyard. There was a <a href="http://www.youtube.com/watch?v=qAZ5XUz32yM&t=33m0s">demo</a> of 10ms latency, to compete with Storm and Flink. No more micro-batches!Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-37320784144820694472017-03-07T12:05:00.001-07:002017-03-07T12:08:34.257-07:00Zeppelin installation tipsIf you need to run Apache Zeppelin either a) on a headless server or b) behind a proxy, see below.<br />
<h3>
Headless server</h3>
From your expanded zeppelin directory:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">nano conf/zeppelin-site.xml</span><br />
<br />
And change zeppelin.server.addr to be either the IP address or the domain name of this server. This is to allow outside connections.<br />
<h3>
Proxy</h3>
Zeppelin seems to need npm from node.js, which in turn needs to know your proxy settings. To get around this, install node.js yourself (instead of relying on what is built in to Zeppelin) and execute npm config to set its proxy settings. Below includes the instructions for installing node.js onto RedHat-type Linux distributions (CentOS, Oracle Linux, etc.). See <a href="https://nodejs.org/en/download/package-manager/">nodejs.org</a> for other OS's.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">export http_proxy=<your http proxy></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">export https_proxy=<your https proxy></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">wget curl --silent --location https://rpm.nodesource.com/setup_6.x</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">chmod 777 setup_6.x</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo ./setup_6.x</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">sudo yum install -y nodejs</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">npm config set proxy <your http proxy></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">npm config set https-proxy <your https proxy></span><br />
<div>
<br /></div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-87624313803407791082017-01-04T07:41:00.001-07:002017-01-04T08:47:19.865-07:00Spark Structured Streaming Supports Kafka Since November 2016As I noted in my May 14, 2016 blog post, <a href="http://technicaltidbit.blogspot.com/2016/05/structured-streaming-for-lambda.html" target="_blank">Spark Structured Streaming</a>, which brings the ability to stream a data source into a DataFrame and query it with SQL in real-time, was announced with much fanfare (along with Spark 2.0) at Spark Summit 2016, but notably absent at the time was its support for Kafka.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr0ohq5Xz-FSl_Kvlzbwu6MtilvyKVWhEdw2SPqbVvRQQ3vGgMRiJPLBTdTLlUN5hTJ_h2d4na3EJxNe9R5d1I5mIwUYWEP3EKwDBmYi6LIwCKq9mcMbqyywQ5v8xk-cYDmCi2HkeN-ok/s1600/structured-model-1-600.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr0ohq5Xz-FSl_Kvlzbwu6MtilvyKVWhEdw2SPqbVvRQQ3vGgMRiJPLBTdTLlUN5hTJ_h2d4na3EJxNe9R5d1I5mIwUYWEP3EKwDBmYi6LIwCKq9mcMbqyywQ5v8xk-cYDmCi2HkeN-ok/s1600/structured-model-1-600.png" /></a></div>
<div style="text-align: center;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Diagram from <a href="https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html">databricks.com</a></b></span></div>
<br />
Yes, Spark 2.1, released last week, now supports Kafka in Spark Structured Streaming. But so does Spark 2.0.2, quietly released on November 14, 2016.<br />
<br />
So we no longer "have to wait for it" as I blogged last May.Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-49881098873035799812016-10-26T09:36:00.000-06:002016-10-26T09:39:28.712-06:00Drizzle Brings Low-Latency Streaming to Spark; but RISE Lab is Just a Change in Funding<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1d0gzL48HSGXFmvq5NbtsXdg8lrMzGS96YNP69MXAKhh1C7SRZMDALsDq3mDK34DVHtSiQx8Iav1a_xVlBC-XBnK7mrWMICDw1c9FB2tbM_mPhSEyCLtiXCr7rs9ymgamQbj1qrZmy94/s1600/drizzle600.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1d0gzL48HSGXFmvq5NbtsXdg8lrMzGS96YNP69MXAKhh1C7SRZMDALsDq3mDK34DVHtSiQx8Iav1a_xVlBC-XBnK7mrWMICDw1c9FB2tbM_mPhSEyCLtiXCr7rs9ymgamQbj1qrZmy94/s1600/drizzle600.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This morning at Spark Summit Europe 2016, Ion Stoica announced during his keynote the Drizzle project, which promises to reduce streaming data latency in Spark to be less than Flink and Storm. Ion announced this in the context of the new <a href="https://www2.eecs.berkeley.edu/Courses/Data/1162.html" target="_blank">RISE Lab</a> at UC Berkeley.</div>
<br />
Drizzle is an exciting and important new technology. RISE Lab is simply a change in funding at Berkeley. In fact, Drizzle was <a href="http://www.slideshare.net/JenAman/low-latency-execution-for-apache-spark/14" target="_blank">announced</a> at Spark Summit (West) this past summer in the context of <a href="http://www.slideshare.net/JenAman/low-latency-execution-for-apache-spark/1" target="_blank">amplab</a>, not RISE Lab.<br />
<br />
Stoica also repeated the common wisdom that Spark came out of amplab, but in fact Matei's <a href="http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf" target="_blank">first paper</a> on Spark and RDDs came out in 2010 under <a href="http://radlab.cs.berkeley.edu/" target="_blank">RAD Lab</a>, the funding model that preceded amplab.<br />
<br />
These changes, from RAD Lab to amplab to RISE lab are just changes in funding. The important things -- the people and the projects -- stay throughout. And Drizzle is an important project. By making the streaming tasks long-lived on Spark workers -- as opposed to launching all-new fresh Spark jobs for every micro-batch as in today's Spark Streaming -- latency and resiliency are vastly improved. They are reported to be better than Flink, but keep in mind that the comparison there is between a research project vs. something that is available today to put into production. Flink might improve further by the time Drizzle is released (I don't think the code is even available to download yet to try out).<br />
<br />
To watch Ion's keynote, go to about 1:15:00 at <a href="http://livestream.com/fourstream/sparksummiteu16-tracka/videos/140168779">http://livestream.com/fourstream/sparksummiteu16-tracka/videos/140168779</a><br />
<br />
For more meaty details on Drizzle, see the Spark Summit (West) 2016 presentation <a href="https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/" target="_blank">Low Latency Execution for Apache Spark</a>.Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com1tag:blogger.com,1999:blog-6104363047659965830.post-12693762182757074932016-08-27T08:22:00.000-06:002016-08-29T09:09:27.110-06:00Installation Quickstart: TensorFlow, Anaconda, Jupyter<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWC41DPtUgsZPzvy9HvCwfVL_XWyyNf0AZr4_5aZBujw7grb79-l_2ZeVPfRaf86JfTw_FzGg65L6AuGQPEGByxYc4IE7tmqgrV900wMgRUq0RrlGzxTNi2JcKElLZLoL4h9pzX7unqRs/s1600/TensorflowAnacondaJupyter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWC41DPtUgsZPzvy9HvCwfVL_XWyyNf0AZr4_5aZBujw7grb79-l_2ZeVPfRaf86JfTw_FzGg65L6AuGQPEGByxYc4IE7tmqgrV900wMgRUq0RrlGzxTNi2JcKElLZLoL4h9pzX7unqRs/s1600/TensorflowAnacondaJupyter.png" /></a></div>
<br />
What better way to start getting into TensorFlow than with a notebook technology like Jupyter, the successor to IPython Notebook? There are two little hurdles to achieve this:<br />
<ol>
<li>Choice of OS. Trying to use Windows with TensorFlow is as painful as trying to use Windows with Spark. But even within Linux, it turns out you need a recent version. CentOS 7 works a lot better than CentOS 6 because it has a more recent glibc.</li>
<li>A step is missing from the TensorFlow installation page. From <a href="http://stackoverflow.com/a/36613481/1624911" target="_blank">StackOverflow</a>:<br /><span style="font-family: "courier new" , "courier" , monospace;">conda install notebook ipykernel</span></li>
</ol>
Here then are the complete set of steps to achieve Hello World in Tensorflow on Jupyter via Anaconda:<br />
<ol>
<li>Use <a href="http://isoredirect.centos.org/centos/7.2.1511/isos/x86_64/" target="_blank">CentOS 7.2 (aka 1511)</a>, for example using <a href="https://www.virtualbox.org/" target="_blank">VirtualBox</a> if under Windows. This step may be unnecessary if you use OSX, but I just haven't tried it.</li>
<li>Download and install <a href="https://www.continuum.io/downloads" target="_blank">Anaconda</a> for Python 3.5.</li>
<li>From the <a href="https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html" target="_blank">TensorFlow</a> installation instructions:<br /><span style="font-family: "courier new" , "courier" , monospace;">conda create -n tensorflow python=3.5<br />source activate tensorflow<br />conda install -c conda-forge tensorflow</span></li>
<li>From <a href="http://stackoverflow.com/a/36613481/1624911" target="_blank">StackOverflow</a>:<br /><span style="font-family: "courier new" , "courier" , monospace;">conda install notebook ipykernel</span></li>
<li>Launch Jupyter:<br /><span style="font-family: "courier new" , "courier" , monospace;">jupyter notebook</span></li>
<li>Create a notebook and type in <a href="https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/1_Introduction/helloworld.ipynb" target="_blank">Hello World</a>:<br /><span style="font-family: "courier new" , "courier" , monospace;">import tensorflow as tf<br />hello = tf.constant('Hello, TensorFlow!')<br />sess = tf.Session()<br />print(sess.run(hello))</span></li>
</ol>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-35459277646480800192016-07-30T11:48:00.003-06:002016-07-30T11:48:42.415-06:00900TB, 4U, $60k<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8sVlpf0xZp5ZFhBu0uH6iXPXXq4YrfjgBV2hofUdXwUVbNUtExWCkrz6UKEQLvoE4UHjpwuPc0pZokYUDvseONNctSPe6ls8nPAtEqV07tnCOhzgZg7cQYGNa3BMOPpq2NGzZ6fip4Ps/s1600/Density20160730.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8sVlpf0xZp5ZFhBu0uH6iXPXXq4YrfjgBV2hofUdXwUVbNUtExWCkrz6UKEQLvoE4UHjpwuPc0pZokYUDvseONNctSPe6ls8nPAtEqV07tnCOhzgZg7cQYGNa3BMOPpq2NGzZ6fip4Ps/s640/Density20160730.png" width="640" /></a></div>
<br /></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The impasse of <a href="http://technicaltidbit.blogspot.com/2014/05/peak-hard-drive.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Peak Hard Drive</a> that I identified two years ago -- that era when 4TB hard drives had been the tops for three years straight -- has been breached, thanks to advances in both hard drives (HDDs) and Solid State Disks (SSDs). 10TB 3.5" HDDs can be ordered from Amazon now for less than $600, and the density of SSDs is skyrocketing as <a href="http://technicaltidbit.blogspot.com/2014/06/ssd-to-rescue-of-peak-hard-drive.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">manufacturers promised</a> two years ago. At $10,000, the 2.5" 15TB SSD is indeed dense but at $670/TB, is twice as expensive per terabyte as enterprise 1TB SSDs.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Going forward, Toshiba is <a href="http://www.theregister.co.uk/2015/09/03/128tb_ssd_toshiba_promises_by_2018/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">predicting</a> 128TB SSDs by 2018 and <a href="http://www.geek.com/chips/toshiba-hard-drives-will-be-40tb-by-2020-ssds-will-be-128tb-by-2018-1632425/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">40TB HDDs by 2020</a>. As shown in the chart above, SSDs would then firmly breach the long-term exponential growth line in density, while HDDs would merely continue their anemic density growth.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4Yg85IB_6YYdX9dM-u3lmJ0kVlDIVoOLhbTOqA_w12Vhkm5R_t4gkSMnXNc1B2FtY4Br7K3BLsu9GfffxVaeIS98oO7h6LaCdn0Qzxjh0N_78m1SJBRdvcMqnjr6-uDc8MamtI5Cz1bI/s1600/HighCapacityDrives20160730.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4Yg85IB_6YYdX9dM-u3lmJ0kVlDIVoOLhbTOqA_w12Vhkm5R_t4gkSMnXNc1B2FtY4Br7K3BLsu9GfffxVaeIS98oO7h6LaCdn0Qzxjh0N_78m1SJBRdvcMqnjr6-uDc8MamtI5Cz1bI/s1600/HighCapacityDrives20160730.jpg" /></a></div>
<br /></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Rack chassis are getting more dense as well. SuperMicro now has a <a href="https://www.supermicro.com/products/chassis/4u/946/sc946ed-r2kjbod.cfm" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">4U chassis</a> that can hold 90 3.5" HDDs.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPqW76LKeoUXtTg-jdkjkQTcQSVUxnnQ1sANK_Kam54dsTrBIxGzPpdztWtWeV8Qty9JA5iRx3YF7cFmkpN5Y0uK-RCJ2iVB3xlCxyy0rvh9pDae0aZO4KaosjkuyBZUtnKMRPgmLPLRk/s1600/90bay.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPqW76LKeoUXtTg-jdkjkQTcQSVUxnnQ1sANK_Kam54dsTrBIxGzPpdztWtWeV8Qty9JA5iRx3YF7cFmkpN5Y0uK-RCJ2iVB3xlCxyy0rvh9pDae0aZO4KaosjkuyBZUtnKMRPgmLPLRk/s1600/90bay.jpg" /></a></div>
<br /></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Filled with the 10TB HDDs, that would be about $60k for 900TB storage in 4U of rack space. There is no similar solution I've found for 2.5" drives, so using the same chassis for the 15TB SSDs would be $900k for 1.4PB.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
But of course that ignores the compute side of the equation. Such ultra-density in a rack would only be suitable for deep-freeze storage. For a Hadoop/Spark application to leverage <a href="http://technicaltidbit.blogspot.com/2014/11/data-locality-hpc-vs-hadoop-vs-spark.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">data locality</a>, a 1U chassis that accommodates <a href="https://www.supermicro.com/products/chassis/1U/113/SC113TQ-600C.cfm" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">8 2.5" drives and dual processors</a> from SuperMicro would be more appropriate. Populated with the 15TB SSDs, that would be 120TB/1U, or 960TB/8U. If each CPU socket were populated with <a href="http://ark.intel.com/products/91317/Intel-Xeon-Processor-E5-2699-v4-55M-Cache-2_20-GHz" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">22-core Xeons</a>, that would be a total of 352 cores in that 8U. Total cost would be about $750k for that 8U.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-58573024657970829342016-07-18T09:52:00.001-06:002016-07-18T09:52:44.535-06:00Spark Sometimes Forgets to Put Distantly Scoped Variables into Your ClosureThis has always been a gotcha with Spark and still is today, yet I don't see a caution of it mentioned much of anywhere.<br />
<br />
If your <span style="font-family: "courier new" , "courier" , monospace;">.map()</span> needs access to a variable (or value), and that variable is not defined in the same immediate local scope, "sometimes" Spark will not include it in the closure, leading to erroneous results.<br />
<br />
I've never been able to define "sometimes", and I've never been able to come up with a tiny example that demonstrates it. Nevertheless, below is a tiny bit of source code (which does work; that is it does <i>not</i> demonstrate the problem) just to make clear what I'm talking about.<br />
<br />
<pre>import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object closure {
val a = Array(1,2)
def main(args: Array[String]) {</pre>
<pre> // Sometimes the line of code below is necessary (and change the</pre>
<pre> // reference to a in the map() to a2 as well)</pre>
<pre> // val a2 = a
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("closure"))
println(sc.makeRDD(Array(3,4)).map(_ + a.sum).collect.mkString(";"))
sc.stop
}
}
</pre>
<div>
<br />
I've seen it where <span style="font-family: Courier New, Courier, monospace;">a</span> "sometimes" gets transmitted to the cluster as a zero-length array.<br />
<br />
As background, functional languages like Scala compute closures, which means that when you pass a function as a parameter, it's not just the function that gets passed but all the variables and values that it requires along with it. Scala does compute closures, but not serialize closures for distributed computing. Spark has to <a href="https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L52" target="_blank">compute</a> and serialize its own closures, and sometimes it makes mistakes. Sometimes, it's necessary to give it some help by moving the data you need into the same local scope so that it can pick it up.<br />
<br /></div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-18622690392528488152016-06-17T08:46:00.002-06:002016-06-17T09:38:46.168-06:00My Spark Summit 2016 Presentation: Finding Graph Isomorphisms In GraphX And GraphFrames<h2>
Video</h2>
<iframe allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/B6_dSfPKDXk" width="560"></iframe>
<br />
<h3>
</h3>
<h2>
<br /></h2>
<h2>
Slides</h2>
<iframe allowfullscreen="" frameborder="0" height="485" marginheight="0" marginwidth="0" scrolling="no" src="//www.slideshare.net/slideshow/embed_code/key/FQn86CpTTB3qh1" style="border-width: 1px; border: 1px solid #ccc; margin-bottom: 5px; max-width: 100%;" width="595"> </iframe> <br />
<div style="margin-bottom: 5px;">
<strong> <a href="https://www.slideshare.net/SparkSummit/finding-graph-isomorphisms-in-graphx-and-graphframes" target="_blank" title="Finding Graph Isomorphisms In GraphX And GraphFrames">Finding Graph Isomorphisms In GraphX And GraphFrames</a> </strong> from <strong><a href="https://www.slideshare.net/SparkSummit" target="_blank">Spark Summit</a></strong> </div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-29566295938841620572016-06-11T15:10:00.004-06:002017-06-07T16:28:35.429-06:00Spark Summit 2016 Review<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihiG89AdHW9o2VYtsE2AJZ9en_TlNSBRx7yV3zUElZ-otiSeB2bvt4ssKBFOBucSKBhygLdmXwLZ9wj_i09qOzuEf5GaHcvdJvtBP6PqdJcoI3ML2MSyH525c2l7Zu47XABGACLMaH9UU/s1600/SparkSummit2016Keynote.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihiG89AdHW9o2VYtsE2AJZ9en_TlNSBRx7yV3zUElZ-otiSeB2bvt4ssKBFOBucSKBhygLdmXwLZ9wj_i09qOzuEf5GaHcvdJvtBP6PqdJcoI3ML2MSyH525c2l7Zu47XABGACLMaH9UU/s1600/SparkSummit2016Keynote.png" /></a></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<br /></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Summit (West) 2016 took place this past week in San Francisco, with the big news of course being Spark 2.0 which among other things ushers in yet another <a href="https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">10x performance improvement through whole-stage code generation</a>.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
This is on top of the 2x performance improvement going from <a href="https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">RDDs to 1.4 Dataframes</a>, the <a href="https://databricks.com/blog/2015/08/18/apache-spark-1-5-preview-now-available-in-databricks.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">3.5x improvement going from 1.4 Dataframes to 1.5 Dataframes</a>, and the miscellaneous improvements in Spark 1.6 including<a href="http://www.datanami.com/2015/11/30/3-major-things-you-should-know-about-apache-spark-1-6/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">automatic cache vs. execution memory balancing</a>. Overall, this is perhaps a 100x improvement from Spark 0.9 RDDs (July 2014) to Spark 2.0 Dataframes (July 2016).</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
And that 100x is on top of the improvement over Hadoop's disk-based MapReduce, which itself was <a href="http://spark.apache.org/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">another 100x speedup</a>.So combined that's 10,000x speedup from disk-based Hadoop MapReduce to memory-based Spark 2.0 Dataframes.</div>
<h3 style="background-color: white; border: 0px; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 16px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Overhyped?</h3>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
During a panel at Spark Summit, a question was put to panelist <a href="https://spark-summit.org/east-2016/speakers/thomas-dinsmore/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Thomas Dinsmore</a> as to <a href="https://www.youtube.com/watch?v=YqbMF86SrSc&t=12m11s" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">whether Spark has been overhyped</a>. The day before, I had actually met up with Thomas in the Speaker's lounge and he wanted to get the input from others as to the answer to this question. My response was: Spark might have been overhyped during the 1.x days, but with Spark 2.0 it's caught up to the hype generated during the 1.0 days.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The mantra with Spark has always been: it's in-memory so it's fast -- with an unstated implication that it'snot possible to go any faster than that. Well, as we've seen, it was possible to go faster -- 100x as fast. Spark 2.0 delivers on the Spark 1.0 buzz. Now, Spark 1.x was certianly useful, and it's not like there was anything faster at the time for clusters of commodity hardware, but Spark 1.x carried with it a buzz that didn't get fully realized until Spark 2.0.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
In another sign of maturity, Spark 2.0 was not rushed out the door prior to the Summit. This is in contrast to Spark 1.0, which two years ago I <a href="http://datascienceassn.org/content/apache-spark-10-almost-here-it-ready-16-unresolved-blockers-jira-updated-x2" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">criticized as not being stable enough for a 1.0 moniker</a>. (I think Spark 1.2 was really the 1.0 of Spark.)</div>
<h3 style="background-color: white; border: 0px; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 16px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Graphs</h3>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The next-most obvious thing at the Summit was the proliferation of graphs, including a <a href="https://spark-summit.org/east-2016/events/keynotes-day-3/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">keynote by Capital One</a>(although they used an external graph database rather than GraphX or GraphFrames) and of course <a href="https://spark-summit.org/2016/events/finding-graph-isomorphisms-in-graphx-and-graphframes/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">my own</a>. But besides these, Ankur Dave gave two talks, one on <a href="https://spark-summit.org/2016/events/time-evolving-graph-processing-on-commodity-clusters/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Tegra</a> and one on <a href="https://spark-summit.org/2016/events/graphframes-graph-queries-in-spark-sql/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">GraphFrames</a>. Tegra is interesting because it introduces to Spark for the first time graphs that can grow. It's research work that was done built upon GraphX, but hopefully that technology will get added to GraphFrames.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Besides these four talks, there was yet another presentation, this one from Salesforce, on <a href="https://spark-summit.org/2016/events/a-graph-based-method-for-cross-entity-threat-detection/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">threat detection</a>.</div>
<h3 style="background-color: white; border: 0px; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 16px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Crowds</h3>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The third-most obvious thing to note about Spark Summit 2016 were the crowds. There were 2500 attendees this year compared to last year's 1500. I've attended some crowded events in my lifetime -- including the <a href="http://www.youtube.com/watch?v=uDBEjtJmMvM&t=9m2s" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">1985 Beach Boys concert</a> on the National Mall (lawn) attended by 800,000 and the <a href="http://files.ctctcdn.com/766c6672201/6eb7d5d4-b870-4090-bed3-5faa6e8e75a0.jpg" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">1993 papal Mass</a> in Denver's Cherry Creek State Park attended by 500,000 -- but neither was as crowded as Spark Summit 2016. I'm just glad I was able to hide out in the Speaker lounge during mealtimes. And I feel sorry for the vendors, as the vendor expo this time had to be shunted off to a separate tower of the hotel conference center.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
It seems they're going to need to Moscone this next time.</div>
<h3 style="background-color: white; border: 0px; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 16px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Research</h3>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
There were five parallel tracks this time compared to three tracks a year ago. I spent most of my time in the "research" track, which is one of the two new tracks. And there were a lot of interesting talks there:</div>
<ul style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin: 0px 0px 1.5em 2em; outline: 0px; padding: 0px; vertical-align: baseline;">
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="https://spark-summit.org/2016/events/yggdrasil-faster-decision-trees-using-column-partitioning-in-spark/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark</a></li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Low-latency Execution for Apache Spark</a> - a modification to Spark whereby the number of communication round trips to the driver is minimized.</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="https://spark-summit.org/2016/events/re-architecting-spark-for-performance-understandability/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Re-architecting Spark for Performance Understandability</a> - a rewrite of Spark such that each task consumes only one type of resource (CPU, disk, or network) so that performance and bottlenecks can be visualized as well as be reasoned about by a scheduler. Although Kay Ousterhout's previous scheduling work, <a href="https://github.com/radlab/sparrow" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Sparrow</a>, never made it into Spark, my hunch is that this eventually will. My hunch is that at the time three years ago, there were so many other ways to improve performance (such as Tungsten which as noted above has given a 100x performance improvement through Spark 2.0) that the cost/benefit at the time for incorporating Sparrow wouldn't have been worth it. But now that other avenues have been maxed out, the time is ripe for scheduler improvements.</li>
</ul>
<h3 style="background-color: white; border: 0px; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 16px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Corporate Support</h3>
<div style="background-color: white; border: 0px; color: #555555; font-family: "Droid Serif", "Helvetica Nue", Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The final message was just all the big name vendors jumping on board: Microsoft, IBM, Vertica. I remember back when even Cloudera refused to support Spark (in early 2013).</div>
<br />Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-31505180293480029222016-05-14T13:05:00.001-06:002016-05-14T13:05:38.422-06:00Structured Streaming for Lambda Architecture in Spark But Have To Wait For It<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQwtkVb3ZWyBXaoRBjQgVFWWH9S9IpGVNKhtzbdOLomZHx1rcoWHkZekixIOJPS7ApC-k2y3TfwdyBrF3qJTiifotWgdnPI3EyRLCUqMJopoChWuW8vFXRdWLGiVrf6CxMEp-_Der-VgM/s1600/LambdaStructuredStreaming.jpg" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQwtkVb3ZWyBXaoRBjQgVFWWH9S9IpGVNKhtzbdOLomZHx1rcoWHkZekixIOJPS7ApC-k2y3TfwdyBrF3qJTiifotWgdnPI3EyRLCUqMJopoChWuW8vFXRdWLGiVrf6CxMEp-_Der-VgM/s1600/LambdaStructuredStreaming.jpg" /></a><br />
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<span style="border: 0px; font-family: inherit; font-style: inherit; font-weight: 700; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><span style="border: 0px; font-family: inherit; font-size: 9.75px; font-style: inherit; font-weight: inherit; height: 0px; line-height: 0; margin: 0px; outline: 0px; padding: 0px; position: relative; top: 0.5ex; vertical-align: baseline;"><br /></span></span></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<span style="border: 0px; font-family: inherit; font-style: inherit; font-weight: 700; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><span style="border: 0px; font-family: inherit; font-size: 9.75px; font-style: inherit; font-weight: inherit; height: 0px; line-height: 0; margin: 0px; outline: 0px; padding: 0px; position: relative; top: 0.5ex; vertical-align: baseline;">Image credits: <a href="http://lambda-architecture.net/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">lambda-architecture.net</a> and <a href="http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust/28" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Michael Armbrust's Spark Summit East presentation</a>.</span></span></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Some have the misconception that Lambda Architecture just means you have separate paths for batch and realtime. They miss a key part of Lambda Architecture: the ability to query a unified view of both batch and realtime.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Structured Streaming, also known as Structured Dataframes, will provide a critical piece: the ability to stream directly into a Dataframe, which can then of course be queried with SQL.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
To provide the unified view, it will probably be possible to join such a Streaming Dataframe containing the realtime data with an ORC-backed Dataframe containing the historical data. However, as of today (May 14, 2016), the only two data sources available to populate a Streaming Dataframe are <a href="https://github.com/apache/spark/blob/a234cc61465bbefafd9e69c1cabe9aaaf968a91f/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">memory</a> and <a href="https://github.com/apache/spark/blob/a234cc61465bbefafd9e69c1cabe9aaaf968a91f/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">file</a>. Notably absent are streaming sources such as Apache Kafka, and last week Michael Armbrust indicated support for non-file data sources <a href="https://mail-archives.apache.org/mod_mbox/spark-user/201605.mbox/%3CCAAswR-6MrYRUZTz2rx3RSnc--_kMuM6X2e0Z7pNLLF0snQMYSw%40mail.gmail.com%3E" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">might come after Spark 2.0</a>. And then this week Reynold Xin <a href="https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">advised</a>:</div>
<blockquote class="tr_bq" style="background-color: white; border: 0px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<span style="color: #555555; font-family: Droid Serif, Helvetica Nue, Arial, Helvetica, sans-serif;"><span style="font-size: 13px; line-height: 22.1px;">stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap for the near future</span></span></blockquote>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
There are still key adds in Spark 2.0: full SQL support including subqueries, and yet <a href="http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">another 10x performance</a>improvement due to "Tungsten 2.0" (on top of the 2x-10x improvement Tungsten brought over Spark 1.4, 1.5, and 1.6). Currently, Druid is still the reigning champ when it comes to <a href="http://www.datascienceassn.org/content/druid-lambda-box" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Lambda in a Box</a>. But Spark will likely take that crown before the end of this year.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com9tag:blogger.com,1999:blog-6104363047659965830.post-70277752207585651392016-04-07T08:57:00.000-06:002016-04-07T08:57:13.620-06:00Declarative Machine Learning<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5JBISsvm15Fm127uhmo9-I5d9i8gd7iwA7sMF5yIcpACgNg_Rbikn7FXageWI0PVwXfDL-oQcPOcpSrUy1AiBzgPNhneQtLbtisNkfcq2gBTalbOrVKKpe8r1Td7zv-hj76yKiirEGH8/s1600/tupaq600.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5JBISsvm15Fm127uhmo9-I5d9i8gd7iwA7sMF5yIcpACgNg_Rbikn7FXageWI0PVwXfDL-oQcPOcpSrUy1AiBzgPNhneQtLbtisNkfcq2gBTalbOrVKKpe8r1Td7zv-hj76yKiirEGH8/s1600/tupaq600.png" /></a></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<small style="border: 0px; font-family: inherit; font-size: 9.75px; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><b>Image is from Evan Sparks et al <a href="http://arxiv.org/pdf/1502.00068v2.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries</a></b></small></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
SQL is commonly referred to as a 4GL, or fourth-generation programming language, as opposed to all of the 3GL's like Java, C++, Python, Scala, etc. SQL is referred to as a <i>declarative</i> language as opposed to an <i>imperative</i>language like the 3GL's. You tell SQL <i>what</i> to do, not <i>how</i> to do it.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Well, <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/07/163-sparks.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">TuPAQ</a> is the SQL for machine learning. You give it a high-level goal, and it figures out which machine learning algorithm to use, and tunes the hyperparameters for you. Example code for speech-to-text translation from Evan Sparks et al:</div>
<pre style="background: rgb(238, 238, 238); border: 1px solid rgb(221, 221, 221); color: #555555; font-family: inherit; font-size: 13px; line-height: 22.1px; margin-bottom: 20px; margin-top: 20px; outline: 0px; padding: 10px; vertical-align: baseline; white-space: pre-wrap; word-wrap: break-word;"><b>SELECT</b> vm.sender, vm.arrived,
<b>PREDICT</b>(vm.text, vm.audio)
<b>GIVEN</b> LabeledVoiceMails
<b>FROM</b> VoiceMails vm
<b>WHERE</b> vm.user = 'Bob' <b>AND</b> vm.listened is <b>NULL</b>
<b>ORDER BY</b> vm.arrived
<b>DESC LIMIT</b> 50
</pre>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
When will you be able to use this in production? Hopefully, it's not too far away -- maybe a year, as a wild guess. At Spark Summit in June, 2015, Evan Sparks indicated <a href="http://keystone-ml.org/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">KeystoneML</a> would <a href="http://www.slideshare.net/SparkSummit/evan-sparks/20" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">"soon" integrate with TuPAQ</a>as both KeystoneML and TuPAQ are AMPLab projects.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Although I gave KeystoneML a <a href="http://www.datascienceassn.org/content/spark-computer-vision-wish-list" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">tepid review</a> when it first came out, the new <a href="https://amplab.cs.berkeley.edu/whats-new-in-keystoneml/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">0.3 version announced last week</a>shows the impressive direction they're headed in. Although not quite as declarative as TuPAQ, it is <a href="http://keystone-ml.org/#what-is-keystoneml-for" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">still declarative</a>. An example of declaring a machine learning pipeline in KeystoneML:</div>
<pre style="background: rgb(238, 238, 238); border: 1px solid rgb(221, 221, 221); color: #555555; font-family: inherit; font-size: 13px; line-height: 22.1px; margin-bottom: 20px; margin-top: 20px; outline: 0px; padding: 10px; vertical-align: baseline; white-space: pre-wrap; word-wrap: break-word;">val trainData = <b>NewsGroupsDataLoader</b>(sc, trainingDir)
val predictor = <b>Trim</b> andThen
<b>LowerCase</b>() andThen
<b>Tokenizer</b>() andThen
<b>NGramsFeaturizer</b>(1 to conf.nGrams) andThen
<b>TermFrequency</b>(x => 1) andThen
(<<b>CommonSparseFeatures</b>(conf.commonFeatures), trainData.data) andThen
(<b>NaiveBayesEstimator</b>(numClasses), trainData.data, trainData.labels) andThen
<b>MaxClassifier</b>
</pre>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Sure, the <a href="http://spark.apache.org/docs/latest/ml-guide.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">spark.ml package from Spark MLlib is also pipeline-centric</a>, but whereas spark.ml simply relies on DataFrames/Catalyst/Tungsten to optimize each stage of the pipeline, KeystoneML analyzes and <a href="http://keystone-ml.org/release.html#whole-pipeline-optimization" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">optimizes the pipeline as a whole.</a> It "inspects the pipeline DAG and automatically decides where to cache intermediate output using a greedy algorithm."</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Are there other declarative machine learning systems out there? <a href="https://systemml.apache.org/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Apache SystemML</a> claims to be declarative, but it is only in that automatically plans deployment to a cluster based on data locality, available memory, etc.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
SystemML claims that the high-level languages it provides, DML and PyDML, are <a href="http://apache.github.io/incubator-systemml/#running-systemml" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">"declarative"</a>, but they are not. They are still imperative languages. Their purpose is to allow non-Spark developers to write machine learning programs in languages they are comfortable in (like Python), yet be able to compile down to Spark Scala when the time comes to deploy to production. Thus, these are high-level languages as SystemML claims, but they are still imperative and not declarative. The ability of SystemML to plan optimal deployment to a cluster, however, is declarative.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-89109645383885687682016-03-29T21:21:00.002-06:002016-03-29T21:30:26.212-06:00Table of XX2Vec Algorithms<table style="background: rgb(255, 255, 255); border-bottom-color: rgb(221, 221, 221); border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-width: 1px; border-spacing: 0px; border-style: solid none solid solid; border-top-color: rgb(221, 221, 221); border-top-width: 1px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin: 0px 0px 1.5em; outline: 0px; padding: 0px; vertical-align: baseline; width: 685px;"><tbody style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><th style="background: rgb(85, 85, 85); border: 0px; color: white; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 10px; vertical-align: baseline;">XX2Vec</th><th style="background: rgb(85, 85, 85); border: 0px; color: white; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 10px; vertical-align: baseline;">Embed</th><th style="background: rgb(85, 85, 85); border: 0px; color: white; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 10px; vertical-align: baseline;">In</th><th style="background: rgb(85, 85, 85); border: 0px; color: white; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 10px; vertical-align: baseline;">Sup/Unsup</th><th style="background: rgb(85, 85, 85); border: 0px; color: white; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 10px; vertical-align: baseline;">Algorithms used</th></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="http://arxiv.org/abs/1508.06615" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Char2Vec</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Character</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Sentence</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Unsupervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">CNN</a> -> <a href="https://en.wikipedia.org/wiki/Long_short-term_memory" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">LSTM</a></td></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Word2Vec</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Word</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Sentence</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Unsupervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Artificial_neural_network" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">ANN</a></td></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="http://www-nlp.stanford.edu/pubs/glove.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">GloVe</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Word</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Sentence</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Unsupervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">SGD</a></td></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://cs.stanford.edu/~quocle/paragraph_vector.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Doc2Vec</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Paragraph Vector</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Document</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Supervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Artificial_neural_network" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">ANN</a> -> <a href="https://en.wikipedia.org/wiki/Logistic_regression" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Logistic Regression</a></td></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="http://arxiv.org/abs/1507.08818" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Image2Vec</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Image Elements</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Image</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Unsupervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Deep_learning" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">DNN</a></td></tr>
<tr style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 5px; vertical-align: baseline;"><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://www.dropbox.com/s/m99k5md8461xi0s/ICIP_Paper_Revised.pdf" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Video2Vec</a></td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Video Elements</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Video</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;">Supervised</td><td style="border-right-color: rgb(221, 221, 221); border-right-style: solid; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 1px 1px 0px 0px; font-family: inherit; font-style: inherit; margin: 0px; outline: 0px; padding: 5px 10px; vertical-align: baseline;"><a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">CNN</a> -> <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">MLP</a></td></tr>
</tbody></table>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
The powerful word2vec algorithm has inspired a host of other algorithms listed in the table above. (For a description of word2vec, see my <a href="http://www.youtube.com/watch?v=A2XNlnXGeSY&t=12m41s" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Spark Summit 2015 presentation</a>.) word2vec is a convenient way to assign vectors to words, and of course vectors are the currency of machine learning. Once you've vectorized your data, you are then free to apply any number of machine learning algorithms.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
word2vec is able to come up with vectors by leveraging the concept of <em style="border: 0px; font-family: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">embedding</em>. In a corpus, a word appears in the context of surrounding words, and word2vec uses those co-occurrences to infer relationships between those words.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
All of the XX2Vec algorithms listed in the table above assign vectors to X's, where those X's are embedded in some larger context Y.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
But the similarities end there. Each XX2Vec algorithm not only goes about it through means suited for its domain, but their use cases aren't even analagous. Doc2Vec, for example, is supervised learning whereas most of the others are unsupervised learning. The goal of Doc2Vec is to be able to apply labels to documents, whereas the goal of word2Vec and most of the other XX2Vec algorithms is simply to spit out vectors that you can then go and do other machine learning and analyses on (such as analogy detection).</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Here is a brief description of each XX2Vec:</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Char2Vec</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Like word2vec but because it operates at the character level, it is much more tolerant of misspellings and thus better for analysis of tweets, user product reviews, etc.</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Word2Vec</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Described above. But one more note: it's one of those <a href="http://www.datascienceassn.org/content/do-we-deserve-unreasonable-effectiveness" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">unreasonably effective</a> algorithms -- a kind of getting lucky, if you will.</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
GloVe</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Instead of just getting lucky, there have been a number of efforts to ground the idea of word embeddings in something more mathematical than just pulling weights out of a neural network and hoping they work. GloVe is the current standard-bearer in this regard. Its model is designed from the ground up to support finding analogies, instead of just getting them by chance in word2vec.</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Doc2Vec</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Actually, Doc2Vec uses Word2Vec as a first pass. It then comes up with a composite vector for each sentence or paragraph from the contributing Word2Vec word vectors. This composite gives some kind of overall context to the sentence or paragraph, and then this composite vector is plopped down into the beginning of the sentence or pargraph as an "extra word". The paragraph vectors togeher with the word vectors are used to train a supervised-learning classifier using human labels of the documents.</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Image2Vec</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Whereas word2vec intentionally uses a shallow neural network, Image2Vec uses a deep neural network and composes the resultant vectors from the weights from multiple layers of the network. Image elements that might be represented by these weights include image fragments (grass, bird, fence, etc.) or overall image qualities like color.</div>
<h2 style="background-color: white; border: 0px; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 18px; font-weight: normal; line-height: 1.5em; margin: 30px 0px 20px; outline: 0px; padding: 0px; vertical-align: baseline; word-spacing: 2px;">
Video2Vec</h2>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
If machine learning on images involves high dimensions, videos involve even higher dimensions. Video2Vec does some initial dimension reduction by doing a first pass with convolutional neural networks.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com1tag:blogger.com,1999:blog-6104363047659965830.post-81800264477400716002016-03-21T20:00:00.000-06:002016-03-21T20:08:36.941-06:00DataFrame/DataSet swap places in Spark 2.0<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6vrLqjjkd2H_iE_QB4U_slJK6FrlN-yi6J2VuSY_LYP51NlrSToyHOhIeGITBzo2ybU70DMLutFcTw3O5fzDSeqkXi7goYOEWrehfr5HgWxtA2iXP0F3DxPnLtX91Wsc8OtHk4teTGr8/s1600/DataFrameDataSetUML.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6vrLqjjkd2H_iE_QB4U_slJK6FrlN-yi6J2VuSY_LYP51NlrSToyHOhIeGITBzo2ybU70DMLutFcTw3O5fzDSeqkXi7goYOEWrehfr5HgWxtA2iXP0F3DxPnLtX91Wsc8OtHk4teTGr8/s640/DataFrameDataSetUML.png" /></a><br />
<br />
In Spark 1.6, the developers behind Spark created DataSets by copying and pasting the code from DataFrames (and then added genericization and type safety). But in Spark 2.0, the tables are turned. Last week, Reynold Xin resolved <a href="https://issues.apache.org/jira/browse/SPARK-13880" target="_blank">SPARK-13880 "Rename DataFrame.scala as DataSet.scala</a>. So what happens to DataFrames in Spark 2.0? Reduced to a <a href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L45" target="_blank">single line of code</a>:
<br />
<br />
<span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;"> </span><span class="pl-k" style="background-color: white; box-sizing: border-box; color: #a71d5d; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">type</span><span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;"> </span><span class="pl-en" style="background-color: white; box-sizing: border-box; color: #795da3; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">DataFrame</span><span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;"> </span><span class="pl-k" style="background-color: white; box-sizing: border-box; color: #a71d5d; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">=</span><span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;"> </span><span class="pl-en" style="background-color: white; box-sizing: border-box; color: #795da3; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">Dataset</span><span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">[</span><span class="pl-en" style="background-color: white; box-sizing: border-box; color: #795da3; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">Row</span><span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.8px; white-space: pre;">]</span><br />
<br />
So whereas it could be said in Spark 1.6 that DataSets are a derivation of DataFrames, it is specifically the case in Spark 2.0 that DataFrames are a derivation of DataSets.<br />
<br />Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-5880022686064199332016-03-11T09:30:00.001-07:002016-03-11T09:47:19.092-07:00Symmetric Difference in GraphXA <a href="https://forums.manning.com/posts/list/37842.page" target="_blank">question</a> was posed over at the online forums for my <a href="http://www.manning.com/malak?a_aid=sparkgraphx&bid=28876901" target="_blank">book</a> about how to implement <a href="https://en.wikipedia.org/wiki/Symmetric_difference" target="_blank">symmetry difference</a> in GraphX. The answer is the code below.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import org.apache.spark.graphx._</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">val g = Graph(sc.makeRDD(Array((1L,""),(2L,""),(3L,""))),</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> sc.makeRDD(Array(Edge(1L,2L,0),Edge(1L,3L,0))))</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">val ids = g.vertices.map(_._1).cache</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Graph(g.vertices, ids.cartesian(ids).filter(x => x._1 < x._2)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .map(x => Edge(x._1,x._2,0)))</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .outerJoinVertices(g.collectNeighborIds(EdgeDirection.Either))(</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> (_,_,u) => u.get.toSet)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .mapTriplets(et => ((et.srcAttr | et.dstAttr) &~</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> (et.srcAttr & et.dstAttr)).size)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .triplets</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .map(et => (et.srcId, et.dstId, et.attr))</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> .collect</span><br />
<div>
<br /></div>
<div>
This short piece of code pulls a number of tricks. First is the overall strategy. The goal is to identify the symmetric difference size for every possible pair of vertices in the graph. This suggests that we need to do a Cartesian product to obtain all possible pairs of vertices. But rather than just getting the Cartesian product and doing an RDD map() directly off that, we instead create a whole new Graph where the edges are that Cartesian product. The reason is so that we can leverage <span style="font-family: "courier new" , "courier" , monospace;">outerJoinVertices()</span> and glom on the set of nearest neighbors using <span style="font-family: "courier new" , "courier" , monospace;">collectNeighborIds()</span>(which returns a VertexRDD, suitable for <span style="font-family: "courier new" , "courier" , monospace;">outerJoinVertices()</span>).</div>
<div>
<br /></div>
<div>
And <span style="font-family: "courier new" , "courier" , monospace;">collectNeighborIds()</span> itself is a powerful function that didn't get covered in my book. It's a convenient way to, for each vertex, gather the vertex Ids of all the neighbor vertices.</div>
<div>
<br /></div>
<div>
Finally, to compute the symmetry difference we use Scala Set operations, as the symmetry difference is defined as:</div>
<div>
<br /></div>
<div>
A Δ B = (A ∪ B) - (A ∩ B)</div>
<div>
<span style="background-color: white; color: #003366; font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div>
<span style="background-color: white; color: #003366; font-family: "times" , "times new roman" , serif;">Note in Scala the set difference operator is </span><span style="background-color: white; color: #003366;"><span style="font-family: "courier new" , "courier" , monospace;">&~</span></span><span style="background-color: white; color: #003366; font-family: "times" , "times new roman" , serif;"> rather than </span><span style="background-color: white; color: #003366;"><span style="font-family: "courier new" , "courier" , monospace;">-</span></span><span style="background-color: white; color: #003366; font-family: "times" , "times new roman" , serif;">.</span></div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-10387916802152681002016-03-05T12:42:00.002-07:002016-03-05T12:42:20.105-07:00Beyond GraphX in graphs for Spark<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
This week Databricks announced <a href="https://databricks.com/blog/2016/03/03/introducing-graphframes.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">GraphFrames</a>, a library posted to <a href="http://spark-packages.org/package/graphframes/graphframes" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">spark-packages.org</a> that is based on Spark SQL Dataframes rather than RDDs (as GraphX is). GraphFrames is still a work in progress -- it is currently at the 0.1 version -- so it provides interoperability with GraphX (graphs can be converted back and forth).</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
GraphFrames provides the graph <em style="border: 0px; font-family: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">querying</em> capability that GraphX always had trouble with. GraphFrames, because it uses DataFrames from Spark SQL, allows you to query graphs using SQL. Plus GraphFrames sports a subset of Cypher, the query language from Neo4j.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
I describe GraphFrames and provide some interesting examples in chapter 10 of my book. Chapter 10 was just released to the MEAP (Manning Early Access Program) for my book this week.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<a href="http://www.manning.com/malak?a_aid=sparkgraphx&bid=28876901" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank"><img src="http://datascienceassn.org/sites/default/files/users/user1/Cover20150613_150.png" style="border: 0px; max-width: 100%;" /></a></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
GraphFrames is also performant due to the two optimization layers built-in to Spark SQL: Catalyst and Tungsten. Catalyst is an RDBMS-style query plan optimizer, and Tungsten leverages the sun.misc.unsafe API to do direct OS memory access, bypassing the JVM (as well as avoiding garbage collection). Tungsten also performs code generation, generating JVM bytecode on the fly to access Tungsten-laid-out memory structures in a maximally efficient manner. One of the examples in my book shows an 8x speedup compared to the GraphX version.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
And, in a hat tip to <a href="https://twitter.com/noootsab/status/705882491043840000" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Andy Petrella</a>, author of <a href="https://github.com/andypetrella/spark-notebook" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Spark Notebook</a>, GraphFrames is not the only new library published on spark-projects.org. There are also:</div>
<ul style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin: 0px 0px 1.5em 2em; outline: 0px; padding: 0px; vertical-align: baseline;">
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="http://spark-packages.org/package/webgeist/spark-centrality" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Spark Centrality</a> - Library for computing centrality for graph nodes</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="http://spark-packages.org/package/dmarcous/spark-beetweenness" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">spark-beetweenness</a> - k Betweenness Centrality algorithm for Spark using GraphX</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><a href="http://spark-packages.org/package/sparkling-graph/sparkling-graph" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">sparkling-graph</a> - Large scale, distributed graph processing made easy! Load your graph from multiple formats and compute measures (but not only)</li>
</ul>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-4362466687616531162015-12-17T13:11:00.004-07:002015-12-17T13:11:57.537-07:00GPU off Apache Spark roadmap: Deeplearning4j best bet for Spark GPU<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxGg95Mv2RTzcSWn0TNfyM6gNwWgFMsNp6mtP3NE0XLDmoO2lGBweARANh9jLAIBDNdTEJ86PJf-1WuwUMmkH9lGgE4GiuuDQ7rnWttlq4jebd-kMwDu9Z6hyphenhyphenVNHUtw5fyX-agG6UEPZE/s1600/dl4jspark.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxGg95Mv2RTzcSWn0TNfyM6gNwWgFMsNp6mtP3NE0XLDmoO2lGBweARANh9jLAIBDNdTEJ86PJf-1WuwUMmkH9lGgE4GiuuDQ7rnWttlq4jebd-kMwDu9Z6hyphenhyphenVNHUtw5fyX-agG6UEPZE/s1600/dl4jspark.png" /></a></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<br /></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Last night, Reynold Xin took <a href="https://issues.apache.org/jira/browse/SPARK-3785?focusedCommentId=15061621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15061621" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">SPARK-3785</a> "Support off-loading computations to a GPU" off the Apache Spark road map, marking it "Closed" with a resolution of "Later". This is a little different than when GPU was mentioned at Spark Summit in June, 2015 as a <a href="http://www.slideshare.net/databricks/2015-0616-spark-summit/25" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">possibility</a> for <a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;">Project Tungsten</a> for <a href="http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen/25" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">1.6 and beyond</a>.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
So for now, the best bet for using GPUs on Spark is <a href="http://deeplearning4j.org/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Deeplearning4j</a>, from which their architecture diagram above came. As I've <a href="http://datascienceassn.org/content/deeplearning4j-adds-spark-gpu-support" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">blogged</a> previously, the DL4J folks are waiting until they have solid benchmarks before advertising them. Nevertheless, today, you can do deep learning on GPU-powered Spark.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-62144486691509646862015-12-01T10:01:00.001-07:002015-12-01T10:01:23.636-07:00Free book excerpt: Semi-Supervised Learning With GraphX<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZaW-5LaooUAzXT-ANJ39Jltf_Gd_nvPjIg1Zc8VHaSklzl6oFo3jjnj6NgVgED-lMlYIfyVyy23W2b659Bz98FLzdpzOc-oxbXGuOUwnqw2zsLeg9sTLovdrORoEUExXzKGAaCiM_dT8/s1600/SemiSupervised.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZaW-5LaooUAzXT-ANJ39Jltf_Gd_nvPjIg1Zc8VHaSklzl6oFo3jjnj6NgVgED-lMlYIfyVyy23W2b659Bz98FLzdpzOc-oxbXGuOUwnqw2zsLeg9sTLovdrORoEUExXzKGAaCiM_dT8/s1600/SemiSupervised.png" /></a></div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
Manning Publications has made available for free an excerpt from my book <a href="http://www.manning.com/malak?a_aid=sparkgraphx&bid=28876901" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Spark GraphX In Action</a>. The excerpt is entitled <a href="http://freecontent.manning.com/poor-mans-training-data-graph-based-semi-supervised-learning/" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">Poor Man’s Training Data: Graph-Based Semi-Supervised Learning</a> and shows how to:</div>
<ul style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin: 0px 0px 1.5em 2em; outline: 0px; padding: 0px; vertical-align: baseline;">
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Construct a graph from a collection of points using a K-Nearest Neighbors Graph Construction algorithm (not to be confused with KNN machine learning prediction, which actually gets used below)</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Do the above in a way optimized for distributed computing.</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Propagate labels to unlabeled nodes to achieve semi-supervised learning.</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Make predictions from the trained model (using conventional KNN machine learning prediction)</li>
</ul>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
And as part of Manning's site-wide MEAP sale for Cyber Monday week, the MEAP is 50% off today using the code dotd120115.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
My co-author, Robin East, and I just finished the second draft this past weekend, so the print version should be available in 2016Q1.</div>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0tag:blogger.com,1999:blog-6104363047659965830.post-62499584701063265552015-11-11T09:26:00.002-07:002015-11-11T09:26:51.492-07:00Spark Streaming 1.6: Stop Using updateStateByKey()<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEALC4WQRM8WgrT4Ug-G5oSNbCIWavq91BAhDPWvt8Zzv0KKqE_vw1Zpc410QiuDD7OA6Vqv7clDpIUl0GpxqK9-Bt-E1yQceemVbP4ExyE_VEOBbTxyvvagCbtVdPkVucVnKkg0Fiebg/s1600/trackStateByKey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEALC4WQRM8WgrT4Ug-G5oSNbCIWavq91BAhDPWvt8Zzv0KKqE_vw1Zpc410QiuDD7OA6Vqv7clDpIUl0GpxqK9-Bt-E1yQceemVbP4ExyE_VEOBbTxyvvagCbtVdPkVucVnKkg0Fiebg/s1600/trackStateByKey.png" /></a></div>
<br />
Last night, Tathagata Das resolved <a href="https://issues.apache.org/jira/browse/SPARK-11290" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">SPARK-11290</a>, "Implement trackStateByKey for improved state management", which will bring a 7x performance improvement to Spark Streaming when Spark 1.6 is released in December, 2015.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin-bottom: 1em; outline: 0px; padding: 0px; vertical-align: baseline;">
<span style="font-family: Courier New, Courier, monospace;">trackStateByKey()</span><span style="font-family: droid serif, helvetica nue, arial, helvetica, sans-serif;"> offers three benefits over </span><span style="font-family: Courier New, Courier, monospace;">updateStateByKey()</span><span style="font-family: droid serif, helvetica nue, arial, helvetica, sans-serif;">, which has served as the workhorse of Spark Streaming since its inception in 2012:</span></div>
<ol style="background-color: white; border: 0px; color: #555555; font-family: 'Droid Serif', 'Helvetica Nue', Arial, Helvetica, sans-serif; font-size: 13px; line-height: 22.1px; margin: 0px 0px 1.5em 2em; outline: 0px; padding: 0px; vertical-align: baseline;">
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><span style="font-family: inherit;">Internally, the performance improvement is achieved by looking at only the key/state pairs for which there is new data. The chart above, which comes from Tathagata's <a href="https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE" style="border: 0px; color: #47c0c0; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">design document</a>, illustrates a typical use case, where 4 million keys are being tracked (for example, 4 million concurrent users on a website or app, or streaming audio or video) but only 10,000 had some activity during the past micro-batch (of, say, two seconds duration). With </span><span style="font-family: Courier New, Courier, monospace;">updateStateByKey()</span><span style="font-family: inherit;">, all 4 million key/state pairs would have had to have been touched due to </span><span style="font-family: Courier New, Courier, monospace;">updateStateByKey()</span><span style="font-family: inherit;">'s internal use of cogroup.</span></li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Capability to time out states is built in as a first class option. You no longer have to cobble together your own timeout mechanism. The downside is that if you use this option, you lose the performance improvement mentioned above because the timeout mechanism requires examining every key/state pair.</li>
<li style="border: 0px; font-family: inherit; font-style: inherit; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;"><span style="font-family: inherit;">Ability to return/emit values other than just state as a result of having examined the state. Returning to the example of tracking a web or app user, </span><span style="font-family: Courier New, Courier, monospace;">trackStateByKey()</span><span style="font-family: inherit;"> could be maintaining a running logged-in time and emit that together with some metadata for the purposes of populating a dashboard. Not only does one avoid dual-purposing the state per key for two different purposes, but the performance benefit of touching only the modified keys is also realized.</span></li>
</ol>
Michael Malakhttp://www.blogger.com/profile/10007582156392845677noreply@blogger.com0