Friday, February 9, 2018

Minimal Scala Play

If you just need Scala Play for some quick testing/demo of Scala code, even the Scala Play Starter Example is too heavy. It has a lot of example code that is not needed and too much security for something to be run and accessed only locally.

Here is how to trim down the Scala Play Starter Example. First is the conf/application.conf file. All that is needed for the whole file is:

play.filters {
  hosts { allowed = ["."] }
  headers { contentSecurityPolicy = "default-src * 'unsafe-inline'" }
}

The hosts.allowed setting allows connections from external sources, and headers.contentSecurityPolicy allows things like remotely hosted JavaScript and JavaScript inline directly in HTML elements (i.e., effectively disable CSP and go back to 2016).

Then the conf/routes file:

GET     /                controllers.HomeController.index
GET     /mywebservice    controllers.MyWebServiceController.get(inputdata)

Specifically, you can delete the /count and /message routes, and then add whatever routes you need for web services (like /mywebservice above).

In the app directory:

rm -rf filters
rm -rf services
rm Module.scala
rm controllers/AsyncController.scala
rm controllers/CountController.scala
rm views/main.scala.html
rm views/welcome.scala.html

And then in views/index.scala.html you can just delete all the code therein and write your own regular HTML and not bother with the Twirl template language if you don't need it.

Finally, you'll need to create controllers/MyWebServiceController.scala. You can use HomeController.scala as a template and add in import play.api.libs.json._ to gain access to the Play JSON APIs for parsing and generating JSON.


package controllers

import javax.inject._
import play.api.libs.json._
import play.api.mvc._

class MyWebServiceController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

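  def get(inputdata: String) = Action {
    val a = Json.parse(inputdata)
    val r = a // do stuff with a (placeholder: passes the parsed JSON through)
    Ok(r)     // assumed ending: respond with the result as JSON
  }
}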
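
As a rough sketch of what "do stuff with a" might look like, the following uses the Play JSON API to read one field from the parsed input and build a JSON reply. The object name JsonSketch, the helper doubleX, and the field names "x" and "result" are all made up for illustration, and the code assumes play-json is available on the classpath.

```scala
import play.api.libs.json._

// Hypothetical helper, not part of the Scala Play Starter Example.
object JsonSketch {
  // Parses input such as {"x": 21}, reads the integer field "x",
  // and returns a new JSON object holding twice that value.
  def doubleX(inputdata: String): JsValue = {
    val a = Json.parse(inputdata)
    val x = (a \ "x").as[Int]     // throws if "x" is missing or not an integer
    Json.obj("result" -> (x * 2)) // build the reply with the Json.obj DSL
  }
}
```

Inside the controller's Action, something like Ok(JsonSketch.doubleX(inputdata)) would then send that object back as the response body.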

Tuesday, November 21, 2017

1PB in 1U

17 months ago I blogged about 900TB (nearly 1PB) in 1U of rack space for only $60k. There I noted that a 1U server for the new high-density SSDs wasn't commonly available. Well, that has changed. A couple of months ago, Super Micro announced their 32-bay 1U unit for 2.5" drives. With the 32TB SSDs that Samsung announced last year to be available this year, that yields 1PB.

It won't be cheap. Recall that the 900TB 4U for $60k was for spinning drives. Given that the 16TB SSDs go for nearly $12k a pop, the 32TB drive that has been slated for later this year would be at least twice that and likely much more initially. Even at $24k for each 32TB SSD, this 1U of 1PB SSD would set you back $800k.
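
The arithmetic behind those figures can be sketched as follows. The object and value names are made up for illustration, and the $24k price per drive is the guess from the paragraph above, not a quoted price. The drives alone come to a bit under the rounded $800k figure.

```scala
// Back-of-the-envelope check of the 1PB-in-1U numbers.
object PetabyteCost {
  val bays = 32               // 2.5" drive bays in the 1U chassis
  val tbPerDrive = 32         // capacity of each SSD, in TB
  val dollarsPerDrive = 24000 // assumed price per 32TB SSD

  val totalTb = bays * tbPerDrive        // 1024 TB, i.e. 1PB
  val driveCost = bays * dollarsPerDrive // 768000, drives only

  def main(args: Array[String]): Unit =
    println(s"$totalTb TB for $$$driveCost in drives")
}
```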

Friday, October 27, 2017

The Spark GraphX of actual combat

Earlier this year, my book was translated into a Chinese edition. It actually has sold extremely well. I just noticed that Amazon has a product page for it, and they've given it the title The Spark GraphX of actual combat (Chinese Edition).

My hunch is that's what one gets if one translates "Spark GraphX in Action" into Chinese and then back into English.

Friday, October 20, 2017

Neo4j's query language Cypher coming to Spark

In my 2016 Spark Summit presentation Finding Graph Isomorphisms in GraphX and GraphFrames I reviewed the history of graphs in Spark, and how querying a graph in Spark GraphX required many more lines of code than an equivalent query in Neo4j using its Cypher language. Even Spark GraphFrames, which implements a tiny, tiny subset of Cypher, requires more code than full Cypher.

Two years ago at the 2015 GraphConnect (an event sponsored by Neo4j), Ion Stoica of Databricks announced:
We look forward to bringing Cypher's graph pattern matching capabilities into the Spark stack, making graph querying more accessible to the masses.
Well, two years later, Neo4j announced yesterday:
Neo4j, a leader in connected data, announced that it has released the preview version of Cypher for Apache Spark (CAPS) language toolkit. 
[...] Until now, data scientists have been using Spark and query tools like GraphX to define extensions to their graphs. Once identified, they would then re-implement and deploy that work within their applications. Now, with Cypher for Apache Spark, these scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly. 
[...] This announcement builds on Neo4j’s unveiling of openCypher in October 2015, as an effort to push the whole graph industry forward by tapping into the open source community and making Cypher’s evolution an open exercise while avoiding redundant research.

Wednesday, June 7, 2017

Spark Summit 2017 Review

Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced they are open sourcing a deep learning library, Deep Learning Pipelines, which seems to lack GPU support. Similarly, Intel open sourced their own deep learning library, BigDL, also without GPU support, because Intel is pushing their FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).

For now, the leading contender for Spark GPU deep learning still seems to be DeepLearning4j, which is what I used in my Spark Summit 2017 presentation Neuro-Symbolic AI for Sentiment Analysis. (I will link video and slides once they are posted.)

The big announcement on the second (non-training) day of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling alternative to trying to run Zeppelin, Jupyter, or Spark Shell on-premises.

A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's Fireside Chat touted Tegra for (finally) mutable graphs, which is based on GraphX rather than GraphFrames. (I first blogged about Tegra in my review of last year's Spark Summit.)

There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games on language-agnostic (English, Chinese, Japanese -- it didn't care) chat text messaging abusive language detection. And, of course, my own presentation was on NLP.

Finally, Structured Streaming got officially labeled as production-ready, meaning Spark Streaming is eventually destined for the deprecation graveyard. There was a demo of 10ms latency, to compete with Storm and Flink. No more micro-batches!

Tuesday, March 7, 2017

Zeppelin installation tips

If you need to run Apache Zeppelin either a) on a headless server or b) behind a proxy, see below.

Headless server

From your expanded zeppelin directory:

cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
nano conf/zeppelin-site.xml

And change zeppelin.server.addr to be either the IP address or the domain name of this server. This is to allow outside connections.


Behind a proxy

Zeppelin seems to need npm from node.js, which in turn needs to know your proxy settings. To get around this, install node.js yourself (instead of relying on what is built into Zeppelin) and execute npm config to set its proxy settings. The instructions below install node.js on RedHat-type Linux distributions (CentOS, Oracle Linux, etc.). For other OSes, see the node.js installation instructions.

export http_proxy=<your http proxy>
export https_proxy=<your https proxy>

wget <URL of the setup_6.x script>
chmod 777 setup_6.x
sudo ./setup_6.x

sudo yum install -y nodejs

npm config set proxy <your http proxy>
npm config set https-proxy <your https proxy>