Monday, May 11, 2015

Yes, Neural Networks Have Grandmother Cells


The neural network portions of the above image come from Wikipedia.

The age-old debate about neural networks (both artificial and biological) is whether they have a grandmother cell: a single neuron or node somewhere in the net that activates precisely when one's grandmother is viewed (assuming a biological vision scenario or a computer vision application).

For biological neural networks, the jury is still out, and the answer is leaning toward the "no" side. For artificial neural networks, if you Google for the answer, you'll almost always come across the admonishment to avoid grandmother cells in your neural networks. But for newcomers to neural networks, this advice is easily misunderstood.

More precisely, grandmother cells are to be avoided in the internal nodes. The reason is that internal nodes are supposed to represent latent variables, i.e., intermediate properties like "big eyes" or "big teeth". If an internal node is already recognizing grandma, that is an indication of overfitting: perhaps the neural network was made too large for the amount of training data or the size of the search space.

But every artificial neural network whose job is to classify images, where one of the classes is grandma, will have a grandmother cell -- namely, in the output layer. That's simply how you get output from a classifier artificial neural network.
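To make this concrete, below is a minimal sketch in plain Scala (toy layer sizes and random, untrained weights -- all hypothetical). The hidden layer plays the role of latent features such as "big eyes", while the output layer has one node per class, so the node for the grandma class is, quite literally, a grandmother cell.

    object GrandmaSketch {
      // Hypothetical toy sizes: 4 pixel inputs, 3 latent features, 2 classes.
      // Weights are random stand-ins; a real network would learn them.
      val classes = Array("grandma", "not-grandma")
      val rng     = new scala.util.Random(42)
      val wHidden = Array.fill(3, 4)(rng.nextGaussian()) // input -> latent features
      val wOut    = Array.fill(2, 3)(rng.nextGaussian()) // latent -> one node per class

      def layer(w: Array[Array[Double]], v: Array[Double]): Array[Double] =
        w.map(row => math.tanh(row.zip(v).map { case (a, b) => a * b }.sum))

      def classify(pixels: Array[Double]): String = {
        val latent = layer(wHidden, pixels) // internal nodes: "big eyes", "big teeth", ...
        val output = layer(wOut, latent)    // output node 0 is the grandmother cell
        classes(output.indexOf(output.max))
      }

      def main(args: Array[String]): Unit =
        println(classify(Array(0.9, 0.1, 0.4, 0.7)))
    }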

Thursday, May 7, 2015

RDF vs. Property Graphs: Excerpt From "Spark GraphX In Action"


The above is an animated interpretation of one of the (static) images from chapter 3 of my upcoming book Spark GraphX In Action. Chapter 3 is now available for download for those who have purchased the MEAP. Today, May 7, 2015, the MEAP is available for a 50% discount using the discount code dotd050715au.

Thursday, March 12, 2015

My new book: Spark GraphX In Action

My new book became available on MEAP on March 6, 2015, and the print version is expected later in 2015.

Spark and graphs are two tools rapidly being adopted by Data Scientists. Combine them and you get GraphX, the graph-processing component of Spark. Written in a way that assumes minimal Scala knowledge, the book quickly brings readers up to speed on all four topics: Scala, Spark, definitions from graph theory, and GraphX itself. An appendix provides three quick ways to get Spark running, and the chapters provide explicit examples, from the most basic up through machine learning.
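As a taste of the API (the toy graph below is made up for this post, not an excerpt from the book), a minimal GraphX program looks something like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object TinyGraph {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("tinygraph").setMaster("local[*]"))
        // Vertices are (id, property) pairs; edges carry a property too.
        val vertices = sc.parallelize(Seq((1L, "Ann"), (2L, "Bob"), (3L, "Cathy")))
        val edges    = sc.parallelize(Seq(Edge(1L, 2L, "knows"), Edge(2L, 3L, "knows")))
        val graph    = Graph(vertices, edges)
        // Print each edge with its endpoint properties resolved.
        graph.triplets
             .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
             .collect()
             .foreach(println)
        sc.stop()
      }
    }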

Monday, March 2, 2015

Spark Ecosystem

A presentation I gave to the Data Science Association on February 28, 2015, covering:

  1. How fast Spark is moving
  2. Ways to keep up
  3. Current status of the ever-expanding Spark ecosystem

Original video

Re-recorded later as a screencast

Tuesday, February 24, 2015

State of the Stream 2015Q1

The big news this week -- and for the past couple of months, really -- is of course eBay open sourcing their Pulsar real-time analytics framework, which allows SQL queries against a real-time data stream. It's built on top of their Jetstream data streaming framework, also open sourced this month, which, judging from its list of capabilities, is a layer akin to Storm and Spark Streaming, with sinks and sources, Kafka connectivity, and REST interfaces for cluster monitoring.
With all the breathless news items out there about Pulsar, they neglect to mention one important fact: it is licensed under the GPL, which precludes its use at a lot of organizations.
This is in stark contrast to some other big news this past week: Druid just switched from the GPL to an Apache license.
Streaming is important because, as I've previously blogged, every Big Data project eventually becomes a data streaming project: once insights are found in Big Data, the consumers of those insights ask for them to be recomputed with ever more up-to-date data, until eventually they request real-time updates.
I've also previously blogged, over a year ago now, a comparison of streaming frameworks, especially Storm versus Spark Streaming. At the time, I relayed how, at the February 2013 Strata, several speakers described how Trident, the layer that gives Storm exactly-once semantics, makes Storm non-performant. I also wrote at that time that Spark Streaming, even though it had exactly-once semantics, lacked things like graceful shutdown and transactions.
Well, there is recent news about both Spark Streaming and Storm. First, Spark Streaming gained high availability in 1.2.0, released in December 2014. And although it has exactly-once semantics, achieving them with Kafka is a bit tricky, since Kafka itself does not guarantee exactly-once delivery when a node in a Kafka cluster goes down. However, Spark 1.3.0 will have Kafka exactly-once semantics. Spark 1.3.0 release candidates are being voted on now, and it will probably be released in the first half of March 2015.
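For the curious, a sketch of the new receiver-less "direct" Kafka API slated for Spark 1.3.0 looks something like the following (the broker address and topic name are hypothetical, and the spark-streaming-kafka artifact is assumed to be on the classpath). The direct approach has the stream itself track Kafka offsets, rather than relying on a receiver, which is what makes end-to-end exactly-once semantics attainable.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectKafkaSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("direct-kafka").setMaster("local[2]"), Seconds(5))
        // Receiver-less stream: offsets are tracked by the stream itself.
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc,
          Map("metadata.broker.list" -> "localhost:9092"), // hypothetical broker
          Set("events"))                                   // hypothetical topic
        stream.map(_._2).count().print() // messages per 5-second batch
        ssc.start()
        ssc.awaitTermination()
      }
    }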
As for performance, a University of Toronto grad student recently benchmarked Storm vs. Spark Streaming:

"Storm was around 40% faster than Spark, processing tuples of small size (around the size of a tweet). However, as the tuple's size increased, Spark had better performance maintaining the processing times."

Wednesday, December 31, 2014

Using Data To Predict Data Science in 2015



This is the time of the year when pundits make their 2015 predictions. But to make predictions about Data Science, shouldn't one use data? Here are four charts from Google Trends that show the search trends for various data science technologies. Apache Spark really is overtaking Apache Hadoop.



In this R vs. IPython Notebook chart, we should consider just the trends rather than the absolute magnitudes. "R" is notoriously difficult to Google for, and "R Cran" is just one of the many tricks R users employ to Google for information about R. And, sadly, Google Trends has no way to additively combine search terms (e.g., "R Cran" OR "R Project"). But we can still see that IPython Notebook is skyrocketing upward while R is sagging.



This chart is a little hard to read and requires some explaining. The former name for "Apache Storm" was "Twitter Storm", from when Twitter first open sourced Storm onto GitHub in 2011. But "Twitter Storm" has another common usage: a "storm of tweets", such as about a celebrity. I'm guessing about half the searches for "Twitter Storm" are for this latter usage.

The takeaway is that Storm got a two-year head start on Spark Streaming and has been chugging away ever since. Part of the reason is that Spark Streaming, despite the surge in popularity of base Spark, had a lot of catching up to do with Storm in terms of graceful error handling and graceful shutdown/restart. A lot of that was addressed by the new HA Spark Streaming features introduced in Spark 1.2.0, released a week ago.

But the other interesting trend is that the academic term "complex event processing" is falling away in favor of the more industry-oriented terms "Storm" and "Spark Streaming".



People forget that "machine learning" was quite popular back in the dot-com era. And then it started to fade -- that is, until Geoffrey Hinton's 2006 deep learning breakthrough. That seems to have lifted the popularity of machine learning in general. Well, at least we can say there's a correlation.

The other interesting thing is the very recent (within the past month) uptick in interest in DeepMind. Of course, there was a barrage of interest in October, when the over-hyped headlines blared "mimics human". But I think people only this past month started getting past the hype and looking at the actual DeepMind paper, which is interesting because it shows how they added state to a neural network, and that that is how they achieved "short-term memory".

Saturday, December 13, 2014

Neuromorphic vs. Neural Net

The diagram of biological brain waves comes from med.utah.edu, and the diagram of an artificial neural network neuron comes from hemming.se.

Brain                            Artificial Neural Network
------------------------------   ----------------------------
Asynchronous                     Global synchronous clock
Stochastic                       Deterministic
Shaped waves                     Scalar values
Storage and compute synonymous   Storage and compute separate
Training is a mystery            Backpropagation
Adaptive network topology        Fixed network
Cycles in topology               Cycle-free topology

The table above lists the differences between a regular artificial neural network (feed-forward and non-spiking, to be specific) and a biological brain. An artificial neural network (ANN) is so far removed in architecture and function from a biological brain that attempts to simulate a brain in silicon go by a different term altogether: neuromorphic.

In the table above, if the last row is modified to allow a neural network to have cycles in its topology, then it becomes known as a recurrent neural network (RNN) -- still not quite neuromorphic. But by also modifying the first row of the table to remove the global synchronous clock, IBM's TrueNorth chip, announced in August 2014, claims the neuromorphic moniker. (Asynchronous neural networks are also called spiking neural networks (SNNs); TrueNorth combines the properties of both RNNs and SNNs.)
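To see what "cycles in topology" means in code, here is a minimal Elman-style recurrent step in plain Scala (toy sizes and placeholder weights, all hypothetical; a real RNN would learn its weights):

    object RnnStepSketch {
      val nIn = 3; val nHidden = 4
      val wIn  = Array.fill(nHidden, nIn)(0.1)     // input -> hidden weights
      val wRec = Array.fill(nHidden, nHidden)(0.1) // hidden -> hidden: the cycle

      def dot(w: Array[Array[Double]], v: Array[Double]): Array[Double] =
        w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

      // One recurrent step: the previous hidden state feeds back into the
      // network, i.e., a cycle in the topology (last row of the table above).
      def step(x: Array[Double], hPrev: Array[Double]): Array[Double] =
        dot(wIn, x).zip(dot(wRec, hPrev)).map { case (a, b) => math.tanh(a + b) }

      def main(args: Array[String]): Unit = {
        var h = Array.fill(nHidden)(0.0)
        for (x <- Seq(Array(1.0, 0.0, 0.5), Array(0.0, 1.0, 0.0)))
          h = step(x, h) // state persists across inputs: "short-term memory"
        println(h.mkString(", "))
      }
    }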

The TrueNorth chip sports one million neurons and 256 million synapses. But you can't buy one. Perhaps the closest you can come today is to use an FPAA, a field-programmable analog array -- the analog counterpart of an FPGA. But FPAAs haven't scaled nearly as far as FPGAs; the largest FPAA is the RASP 2.9. The image of its die below comes from the thesis Contributions to Neuromorphic and Reconfigurable Circuits and Systems.

It has only 78 CABs (Computational Analog Blocks), in contrast to the largest FPGAs, which have over one million logic elements. Researchers in 2013 were able to simulate 18 neuromorphic neurons with this RASP 2.9 analog FPAA chip.


The human brain has 100 billion neurons, so it would hypothetically take 100,000 TrueNorth chips to approach equivalence, based on neuron count alone. Of course, the other factors -- in particular the variable wave shape of biological neurons -- would likely put any TrueNorth simulation of a brain at a great disadvantage. A lot more information can be carried in a wave shape than in a single scalar value. In the diagram at the top of this blog post, the different wave shapes resulted from showing an animal spots of light of different diameters. An artificial neural network, in contrast, would require N output neurons to represent N distinct diameters.
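As a tiny illustration of that last point, here is the one-output-neuron-per-diameter (one-hot) encoding in plain Scala, with a hypothetical set of four diameters:

    object OneHotDiameters {
      // Hypothetical N = 4 distinct spot diameters.
      val diameters = Vector(0.5, 1.0, 2.0, 4.0)

      // A non-spiking ANN outputs scalars, so each diameter needs its own
      // output neuron, whereas one biological neuron can distinguish the
      // diameters by its wave shape alone.
      def encode(d: Double): Vector[Double] =
        diameters.map(x => if (x == d) 1.0 else 0.0)

      def main(args: Array[String]): Unit =
        println(encode(2.0)) // Vector(0.0, 0.0, 1.0, 0.0)
    }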

But with an analog FPAA, perhaps neurons that support wave shapes could be simulated, even if for now one may be limited to a dozen or so neurons. But then there is the real mystery: how a biological brain learns, and by extension how to train a neuromorphic system.