Tuesday, November 21, 2017

1PB in 1U

17 months ago I blogged about 900TB (nearly 1PB) in 1U of rack space for only $60k. There I noted that a 1U server for the new high-density SSDs wasn't commonly available. Well, that has changed. A couple of months ago, Super Micro announced their 32-bay 1U unit for 2.5" drives. With the 32TB SSDs that Samsung announced last year to be available this year, that yields 1PB.

It won't be cheap. Recall that the 900TB 4U for $60k was for spinning drives. Given that the 16TB SSDs go for nearly $12k a pop, the 32TB drive that has been slated for later this year would be at least twice that and likely much more initially. Even at $24k for each 32TB SSD, this 1U of 1PB SSD would set you back $800k.

Friday, October 27, 2017

The Spark GraphX of actual combat

Earlier this year, my book was translated into a Chinese edition. It actually has sold extremely well. I just noticed that Amazon has a product page for it, and they've given it the title The Spark GraphX of actual combat (Chinese Edition).

My hunch is that's what one gets if one translates "Spark GraphX in Action" into Chinese and then back into English.

Friday, October 20, 2017

Neo4j's query language Cypher coming to Spark

In my 2016 Spark Summit presentation Finding Graph Isomorphisms in GraphX and GraphFrames I reviewed the history of graphs in Spark, and how to query a graph in Spark GraphX required many more lines of code than an equivalent query in Neo4j using its Cypher language. Even Spark GraphFrames, which implements a tiny, tiny subset of Cypher requires more code than full Cypher.

Two years ago at the 2015 GraphConnect (an event sponsored by Neo4j), Ion Stoica of Databricks announced:
We look forward to bringing Cypher's graph pattern matching capabilities into the Spark stack, making graph querying more accessible to the masses.
Well, two years later, Neo4j announced yesterday:
Neo4j, a leader in connected data, announced that it has released the preview version of Cypher for Apache Spark (CAPS) language toolkit. 
[...] Until now, data scientists have been using Spark and query tools like GraphX to define extensions to their graphs. Once identified, they would then re-implement and deploy that work within their applications. Now, with Cypher for Apache Spark, these scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly. 
[...] This announcement builds on Neo4j’s unveiling of openCypher in October 2015, as an effort to push the whole graph industry forward by tapping into the open source community and making Cypher’s evolution an open exercise while avoiding redundant research.

Wednesday, June 7, 2017

Spark Summit 2017 Review

Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced they are open sourcing a deep learning library Deep Learning Pipelines which seems to lack GPU support. Similarly, Intel open sourced their own deep learning library, BigDL, also without GPU support, because Intel is pushing their FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).

For now, the leading contender for Spark GPU deep learning still seems to be DeepLearning4j, which is what I used in my Spark Summit 2017 presentation Neuro-Symbolic AI for Sentiment Analysis. (I will link video and slides once they are posted.)

The big announcement the second day (non-training) of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling use over trying to run Zeppelin, Jupyter, or Spark Shell on-premises.

A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's Fireside Chat touted Tegra for (finally) mutable graphs, which is based on GraphX rather than GraphFrames. (I first blogged about Tegra in my review of last year's Spark Summit.)

There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games on language-agnostic (English, Chinese, Japanese -- it didn't care) chat text messaging abusive language detection. And, of course, my own presentation was on NLP.

Finally, Structured Streaming finally got officially labeled as production-ready, meaning Spark Streaming will eventually destined for the deprecation graveyard. There was a demo of 10ms latency, to compete with Storm and Flink. No more micro-batches!

Tuesday, March 7, 2017

Zeppelin installation tips

If you need to run Apache Zeppelin either a) on a headless server or b) behind a proxy, see below.

Headless server

From your expanded zeppelin directory:

cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
nano conf/zeppelin-site.xml

And change zeppelin.server.addr to be either the IP address or the domain name of this server. This is to allow outside connections.


Zeppelin seems to need npm from node.js, which in turn needs to know your proxy settings. To get around this, install node.js yourself (instead of relying on what is built in to Zeppelin) and execute npm config to set its proxy settings. Below includes the instructions for installing node.js onto RedHat-type Linux distributions (CentOS, Oracle Linux, etc.). See nodejs.org for other OS's.

export http_proxy=<your http proxy>
export https_proxy=<your https proxy>

wget curl --silent --location https://rpm.nodesource.com/setup_6.x
chmod 777 setup_6.x
sudo ./setup_6.x

sudo yum install -y nodejs

npm config set proxy <your http proxy>
npm config set https-proxy <your https proxy>

Wednesday, January 4, 2017

Spark Structured Streaming Supports Kafka Since November 2016

As I noted in my May 14, 2016 blog post, Spark Structured Streaming, which brings the ability to stream a data source into a DataFrame and query it with SQL in real-time, was announced with much fanfare (along with Spark 2.0) at Spark Summit 2016, but notably absent at the time was its support for Kafka.

Diagram from databricks.com

Yes, Spark 2.1, released last week, now supports Kafka in Spark Structured Streaming. But so does Spark 2.0.2, quietly released on November 14, 2016.

So we no longer "have to wait for it" as I blogged last May.