Showing posts with label MachineLearning. Show all posts
Showing posts with label MachineLearning. Show all posts

Thursday, April 7, 2016

Declarative Machine Learning

SQL is commonly referred to as a 4GL, or fourth-generation programming language, as opposed to all of the 3GL's like Java, C++, Python, Scala, etc. SQL is referred to as a declarative language as opposed to an imperativelanguage like the 3GL's. You tell SQL what to do, not how to do it.
Well, TuPAQ is the SQL for machine learning. You give it a high-level goal, and it figures out which machine learning algorithm to use, and tunes the hyperparameters for you. Example code for speech-to-text translation from Evan Sparks et al:
SELECT vm.sender, vm.arrived,
PREDICT(vm.text, vm.audio)
GIVEN LabeledVoiceMails
FROM VoiceMails vm
WHERE vm.user = 'Bob' AND vm.listened is NULL
ORDER BY vm.arrived
DESC LIMIT 50
When will you be able to use this in production? Hopefully, it's not too far away -- maybe a year, as a wild guess. At Spark Summit in June, 2015, Evan Sparks indicated KeystoneML would "soon" integrate with TuPAQas both KeystoneML and TuPAQ are AMPLab projects.
Although I gave KeystoneML a tepid review when it first came out, the new 0.3 version announced last weekshows the impressive direction they're headed in. Although not quite as declarative as TuPAQ, it is still declarative. An example of declaring a machine learning pipeline in KeystoneML:
val trainData = NewsGroupsDataLoader(sc, trainingDir)

val predictor = Trim andThen
    LowerCase() andThen
    Tokenizer() andThen
    NGramsFeaturizer(1 to conf.nGrams) andThen
    TermFrequency(x => 1) andThen
    (<CommonSparseFeatures(conf.commonFeatures), trainData.data) andThen
    (NaiveBayesEstimator(numClasses), trainData.data, trainData.labels) andThen
    MaxClassifier
Sure, the spark.ml package from Spark MLlib is also pipeline-centric, but whereas spark.ml simply relies on DataFrames/Catalyst/Tungsten to optimize each stage of the pipeline, KeystoneML analyzes and optimizes the pipeline as a whole. It "inspects the pipeline DAG and automatically decides where to cache intermediate output using a greedy algorithm."
Are there other declarative machine learning systems out there? Apache SystemML claims to be declarative, but it is only in that automatically plans deployment to a cluster based on data locality, available memory, etc.
SystemML claims that the high-level languages it provides, DML and PyDML, are "declarative", but they are not. They are still imperative languages. Their purpose is to allow non-Spark developers to write machine learning programs in languages they are comfortable in (like Python), yet be able to compile down to Spark Scala when the time comes to deploy to production. Thus, these are high-level languages as SystemML claims, but they are still imperative and not declarative. The ability of SystemML to plan optimal deployment to a cluster, however, is declarative.

Tuesday, March 29, 2016

Table of XX2Vec Algorithms

XX2VecEmbedInSup/UnsupAlgorithms used
Char2VecCharacterSentenceUnsupervisedCNN -> LSTM
Word2VecWordSentenceUnsupervisedANN
GloVeWordSentenceUnsupervisedSGD
Doc2VecParagraph VectorDocumentSupervisedANN -> Logistic Regression
Image2VecImage ElementsImageUnsupervisedDNN
Video2VecVideo ElementsVideoSupervisedCNN -> MLP
The powerful word2vec algorithm has inspired a host of other algorithms listed in the table above. (For a description of word2vec, see my Spark Summit 2015 presentation.) word2vec is a convenient way to assign vectors to words, and of course vectors are the currency of machine learning. Once you've vectorized your data, you are then free to apply any number of machine learning algorithms.
word2vec is able to come up with vectors by leveraging the concept of embedding. In a corpus, a word appears in the context of surrounding words, and word2vec uses those co-occurrences to infer relationships between those words.
All of the XX2Vec algorithms listed in the table above assign vectors to X's, where those X's are embedded in some larger context Y.
But the similarities end there. Each XX2Vec algorithm not only goes about it through means suited for its domain, but their use cases aren't even analagous. Doc2Vec, for example, is supervised learning whereas most of the others are unsupervised learning. The goal of Doc2Vec is to be able to apply labels to documents, whereas the goal of word2Vec and most of the other XX2Vec algorithms is simply to spit out vectors that you can then go and do other machine learning and analyses on (such as analogy detection).
Here is a brief description of each XX2Vec:

Char2Vec

Like word2vec but because it operates at the character level, it is much more tolerant of misspellings and thus better for analysis of tweets, user product reviews, etc.

Word2Vec

Described above. But one more note: it's one of those unreasonably effective algorithms -- a kind of getting lucky, if you will.

GloVe

Instead of just getting lucky, there have been a number of efforts to ground the idea of word embeddings in something more mathematical than just pulling weights out of a neural network and hoping they work. GloVe is the current standard-bearer in this regard. Its model is designed from the ground up to support finding analogies, instead of just getting them by chance in word2vec.

Doc2Vec

Actually, Doc2Vec uses Word2Vec as a first pass. It then comes up with a composite vector for each sentence or paragraph from the contributing Word2Vec word vectors. This composite gives some kind of overall context to the sentence or paragraph, and then this composite vector is plopped down into the beginning of the sentence or pargraph as an "extra word". The paragraph vectors togeher with the word vectors are used to train a supervised-learning classifier using human labels of the documents.

Image2Vec

Whereas word2vec intentionally uses a shallow neural network, Image2Vec uses a deep neural network and composes the resultant vectors from the weights from multiple layers of the network. Image elements that might be represented by these weights include image fragments (grass, bird, fence, etc.) or overall image qualities like color.

Video2Vec

If machine learning on images involves high dimensions, videos involve even higher dimensions. Video2Vec does some initial dimension reduction by doing a first pass with convolutional neural networks.

Thursday, December 17, 2015

GPU off Apache Spark roadmap: Deeplearning4j best bet for Spark GPU


Last night, Reynold Xin took SPARK-3785 "Support off-loading computations to a GPU" off the Apache Spark road map, marking it "Closed" with a resolution of "Later". This is a little different than when GPU was mentioned at Spark Summit in June, 2015 as a possibility for Project Tungsten for 1.6 and beyond.
So for now, the best bet for using GPUs on Spark is Deeplearning4j, from which their architecture diagram above came. As I've blogged previously, the DL4J folks are waiting until they have solid benchmarks before advertising them. Nevertheless, today, you can do deep learning on GPU-powered Spark.

Tuesday, December 1, 2015

Free book excerpt: Semi-Supervised Learning With GraphX

Manning Publications has made available for free an excerpt from my book Spark GraphX In Action. The excerpt is entitled Poor Man’s Training Data: Graph-Based Semi-Supervised Learning and shows how to:
  • Construct a graph from a collection of points using a K-Nearest Neighbors Graph Construction algorithm (not to be confused with KNN machine learning prediction, which actually gets used below)
  • Do the above in a way optimized for distributed computing.
  • Propagate labels to unlabeled nodes to achieve semi-supervised learning.
  • Make predictions from the trained model (using conventional KNN machine learning prediction)
And as part of Manning's site-wide MEAP sale for Cyber Monday week, the MEAP is 50% off today using the code dotd120115.
My co-author, Robin East, and I just finished the second draft this past weekend, so the print version should be available in 2016Q1.

Tuesday, September 29, 2015

39 Machine Learning Libraries for Spark, Categorized

Apache Spark itself


1. MLlib


AMPLab


Spark originally came out of Berkeley AMPLab and even today AMPLab projects, even though they are not in Apache Spark Foundation, enjoy a status a bit over your everyday github project.

ML Base


Spark's own MLLib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

2. MLI

3. ML Optimizer (aka Ghostface)
Ghostware was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and is included only due to its AMPLab and ML Base status.


Other than ML Base


4. Splash
A recent project from June, 2015, this set of stochastic learning algorithms claims 25x - 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.

5. Keystone ML
Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.

6. Velox
A server to manage a large collection of machine learning models.

7. CoCoA
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper Communication-Efficient Distributed Dual Coordinate Ascent


Frameworks


GPU-based


8. DeepLearning4j
I previously blogged DeepLearning4j Adds Spark GPU Support

9. Elephas
Brand new and frankly why I started this list for this blog post. Provides an interface to Keras.


Non-GPU-based


10. DistML
Parameter server for model-parallel rather than data-parallel (as Spark's MLlib is).

11. Aerosolve
From Airbnb, used in their automated pricing

12. Zen
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines

13. Distributed Data Frame
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.


Interfaces to other Machine Learning systems


14. spark-corenlp
Wraps Stanford CoreNLP.

15. Sparkit-learn
Interface to Python's Scikit-learn

16. Sparkling Water
Interface to H2O

17. hivemall-spark
Wraps Hivemall, machine learning in Hive

18. spark-pmml-exporter-validator
Export PMML, an industry standard XML format for transporting machine learning models.


Add-ons that enhance MLlib's existing algorithms


19. MLlib-dropout
Adds dropout capability to Spark MLLib, based on the paper Dropout: A simple way to prevent neural networks from overfitting.

20. generalized-kmeans-clustering
Adds arbitrary distance functions to K-Means

21. spark-ml-streaming
Visualize the Streaming Machine Learning algorithms built into Spark MLlib


Algorithms


Supervised learning


22. spark-libFM
Factorization Machines

23. ScalaNetwork
Recursive Neural Networks (RNNs)

24. dissolve-struct
SVM based on the performant Spark communication framework CoCoA listed above.

25. Sparkling Ferns
Based on Image Classification using Random Forests and Ferns

26. streaming-matrix-factorization
Matrix Factorization Recommendation System


Unsupervised learning


27. PatchWork
40x faster clustering than Spark MLlib K-Means

28. Bisecting K-Meams Clustering
K-Means that produces more uniformly-sized clusters, based on A Comparison of Document Clustering Techniques

29. spark-knn-graphs
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)

30. TopicModeling
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)


Algorithm building blocks


31. sparkboost
Adaboost and MP-Boost

32. spark-tfocs
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.

33. lazy-linalg
Linear algebra operators to work with Spark MLlib's linalg package


Feature extractors


34. spark-infotheoretic-feature-selection
Information-theoretic basis for feature selection, based on Conditional likelihood maximisation: a unifying framework for information theoretic feature selection

35. spark-MDLP-discretization
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogenous in terms of data classes. This is a foundational idea CART and ID3 algorithms to generate decision trees. Based on Multi-interval discretization of continuous-valued attributes for classification learning.

36. spark-tsne
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.

37. modelmatrix
Sparse feature vectors


Domain-specific


38. Spatial and time-series data
K-Means, Regression, and Statistics

39. Twitter data

UPDATE 2015-09-30: Although it was a reddit.com post regarding the Spark deep learning framework Elephas that kicked me off compiling this list, most of the rest comes from AMPLab and spark-packages.org, plus a couple came from memory. Check AMPLab and spark-packages.org for future updates (since this blog post is a static list). And for tips on how to keep up in general on the fast-moving Spark Ecosystem, see my 10-minute presentation from February, 2015 (scroll down to the second presentation of that mini-conference).

Wednesday, September 9, 2015

Breakdown of Spark 1.5 Improvements


Spark 1.5 was released today. Of the 1,516 Jira tickets that comprise the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.

Spark Core

  • Project Tungsten
    The first major phase of Project Tungsten (aside from a small portion that went into 1.4)
    SPARK-7075
  • Data locality for reduce
    Prior to the Spark 1.2 release, I jumped the gun and announced this made it into Spark 1.2. In reality, it didn't make it in until Spark 1.5.
    SPARK-2774

MLLib

The developers behind Spark are focusing their efforts on the spark.ml package, which supports pipelines, and are in the long process of transferring all the spark.mllib functionality over to spark.ml. For the Spark 1.5 release, the big exception to this rule is the large set of improvements to LDA in spark.mllib.

spark.ml

spark.mllib

LDA received a ton of upgrades:

GraphX

Other than bug fixes and minor improvements, GraphX did not get upgraded in Spark 1.5. However, one of the spark.mllib functions can now accept a Graph as input:

Spark Streaming

  • New scheduling mechanism
    E.g. a job no longer fails if a receiver fails 4 times, and it is now possible to schedule all receivers to run on a particular node (e.g. for rack locality to a Kafka cluster)
    SPARK-8882 and SPARK-7988
  • Dynamic Rate Controller
    While it was previously possible to set a rate limit in Spark Streaming, Spark 1.5 introduces a dynamic and automatic rate limiter. There is no API; it's just automatic. An API to provide configuration and flexiblity did not make it into 1.5.
    SPARK-8834

Spark SQL

Half the Spark 1.5 Jira tickets concerned Spark SQL, but almost all of them were miscellaneous bug fixes and performance improvements. Two notable exceptions are:
  • Project Tungsten
    Already described above for Spark Core, Project Tungsten is really aimed at Spark SQL primarily. The developers behind Spark are aiming their performance improvements primarily at DataFrame from Spark SQL and only secondarily to plain old RDDs. Ever since the Spark 1.3 release, they have been positioning DataFrame as an eventual kind of replacement for RDD, for several reasons:
    1. Because a DataFrame can be populated by a query, Spark can create a database-like query plan which it can optimize.
    2. Spark SQL allows queries to be written in SQL, which may be easier for many (especially from the REPL Spark Shell) than Scala.
    3. SQL is more compact than Scala.

  • Data Source API improvement
    The Data Source API allows Spark SQL to connect to external data sources. 19 jira tickets constitute the Spark 1.5 improvements SPARK-5180, with 5 more slated for Spark 1.6 SPARK-9932.

Wednesday, June 24, 2015

My Spark Summit Presentation On Word2Vec and Semi-Supervised Learning

Abstract

MLLib Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. As an alternative to GraphX even though YAGO2 is a graph, we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.

Video



Slides


Monday, May 11, 2015

Yes, Neural Networks Have Grandmother Cells


The neural network portions of the above image come from Wikipedia.

The age-old debate about neural networks (both artificial and biological) is whether they have a grandmother cell, a neuron cell/node somewhere in the net that is activated when one's grandmother is viewed (assuming a biological vision scenario or computer vision application).

For biological neural networks, the jury is out, and the answer is leaning toward the "no" side. For artificial neural networks, if you Google for the answer, you'll almost always come across the admonishment to avoid grandmother cells in your neural networks. But to beginners to neural networks, this advice can be easily misunderstood.

More precisely, grandmother cells are to be avoided in the internal nodes. The reason is that internal nodes are supposed to be for latent variables, i.e., intermediate properties like "big eyes" or "big teeth". If an internal node is already recognizing grandma, then that is an indication of overfitting and that perhaps the neural network was created too large for the amount of training data or search space.

But all artificial neural networks whose job it is to classify images where one of the classes is grandma will have a grandmother cell, namely in the output layer. That's simply how you get your output from a classifier artificial neural network.

Thursday, March 12, 2015

My new book: Spark GraphX In Action

My new book became available on MEAP on March 6, 2015, and the print version is expected later in 2015.

Spark and Graphs are two tools rapidly being adopted by Data Scientists. Combine them and you get the GraphX component of Spark. Written in a way that assumes minimal Scala, the book quickly brings readers up to speed on all four: Scala, Spark, definitions from graph theory, and GraphX itself. An appendix provides three quick ways to get Spark running, and the chapters provide explicit examples from the most basic up through machine learning.

Wednesday, December 31, 2014

Using Data To Predict Data Science in 2015



This is the time of the year when pundits make their 2015 predictions. But to make predictions about Data Science, shouldn't one use data? Here are four charts from Google Trends that show trending performance of various data science technologies. Apache Spark really is overtaking Apache Hadoop.



In this R vs. IPython Notebook chart, we should just gather the trends rather than the absolute magnitudes. "R" is notoriously difficult to Google for, and "R Cran" is just one of the many tricks R users employ to Google for information about R. And, sadly, Google Trends has no way to additively combine search trends together (e.g. "R Cran" OR "R Project"). But, we can still see that IPython Notebook is skyrocketing upward while R is sagging.



This is a little hard to read and requires some explaining. The former name for "Apache Storm" was "Twitter Storm" when Twitter first open-sourced Storm onto GitHub in 2011. But "Twitter Storm" has another common usage, which is a "storm of tweets" such as about a celebrity. I'm guessing about half the searches for "Twitter Storm" are for this latter usage.

The takeaway is that Storm got a two-year head start on Spark Streaming and has been chugging away ever since. Part of the reason is that Spark Streaming, despite the surge in popularity of base Spark, had a lot of catching up to do to Storm in terms of graceful handling of errors and graceful shutdown/restart. A lot of that is addressed in the new HA Spark Streaming features introduced in Spark 1.2.0, released a week ago.

But the other interesting trend is that the academic term "complex event processing" is falling away in favor of the more industry-oriented terms "Storm" and "Spark Streaming".



People forget that "Machine Learning" was quite popular back in the dot-com era. And then it started to fade. That is, until Geoffrey Hinton's invention of deep learning in 2006. That seems to have lifted the popularity of machine learning in general. Well, at least we can say there's a correlation.

The other interesting thing is the very recent (within the past month) uptick in interest in DeepMind. Of course there was a barrage of interest in October when the over-hyped headlines blared "mimics human". But I think people only this past month started getting past the hype and started looking at the actual DeepMind paper which is interesting because it shows how they added state to a neural network, and that that is how they achieved "short term memory".

Sunday, November 9, 2014

Semantic Similarity Metrics

Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in the fields of Big Data and Data Science is growing and can no longer be ignored if one is to maintain competitive advantage.

There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.

Given a document of text (e.g. a Facebook posting or an e-mail), we can turn it into a set of words or a bag of words. A bag of words is like a set of words, except it also includes the multiplicity. E.g. the miniature document "Now, come now" represented as a set of words would be {"now", "come"} whereas as a bag of words would be

Word Freq
now 2
come 1

Sets of words and bags of words can alternatively be considered as Boolean vectors and numeric vectors, respectively.

A common need when processing documents is to evaluate their similarity, e.g. to determine if they are duplicates, or to determine how close a sample document might be to a "reference" document (e.g. for automated essay scoring). There are various similarity metrics available, for both Boolean and numeric vectors.

Similarity Metrics for Boolean Vectors

Recall that what we mean by "Boolean Vectors" are really just sets, and it is easier to think about and discuss these as sets rather than as literal Boolean vectors, so we use set notation.

Jaccard Index

The Jaccard Index is the simplest metric:

\[\frac{\left|A \cap B\right|}{\left|A \cup B\right|}\]

Dice-Sørensen

The Dice-Sørensen (aka just Dice or just Sørensen) is similar to the Jaccard.

\[\frac{2\left|A \cap B\right|}{\left|A\right| + \left|B\right|}\]

They both give scores in the range [0,1]. But Dice emphasizes similarity, especially in the cases where one set is larger than the other. However, Dice does not satisfy the triangle inequality and thus is not a true metric in the mathematical sense of the word.

Tversky

Tversky is a generalization of Jaccard and Dice, in that Jaccard and Dice become just special cases of Tversky:

\[\frac{2\left|A \cap B\right|}{\left|A \cap B\right| + \alpha\left|A-B\right| + \beta\left|B-A\right|}\]

We arrive at Jaccard with \(\alpha=\beta=1\) and at Dice with \(\alpha=\beta=0.5\). But by varying \(\alpha\) and \(\beta\) to be different from each other, we can apply Tversky to situations where we wish to treat documents asymmetrically. For example, if instead of documents A and B that are treated equally, we have a reference set R (perhaps some sort of answer key) and a user set U, then by setting \(\alpha\) high we can "punish" the user for missing words that were expected in R. Alternatively, we could set \(\beta\) high to "punish" the scoring for not finding the best R that best matches the user input U.

Similarity Metrics for Numeric Vectors

Instead of having seta A and B, we now consider numeric vectors X and Y, which are frequency counts in our bag of words.

Tanimoto

The Tanimoto metric is the numeric vector generalization of the Jaccard index for Boolean vectors:

\[\frac{X \cdot Y}{\left|X\right|^2 + \left|Y\right|^2 - X \cdot Y}\]

Here, the dot represents the vector dot product.

Cosine

The cosine similarity metric is similar in appearance to Tanimoto:

\[\frac{X \cdot Y}{\left|X\right| \left|Y\right|}\]

The cosine has the appealing property that 0 means a 90 degree separation, or complete orthogonality.

Saturday, December 28, 2013

Real-time Data Science

Looking back on 2013, the world of Hadoop emerged from the era of batch processing and into streaming processing. In the contextg of "crisp and actionable," actionable often comes with an expiration date. If you don't take action soon enough, it's too late.

As I learned at the February 2013 Strata conference, Electronic Arts instrumented their online games to continuously stream data into their Big Data system. And because much of their revenue now comes from in-game purchases (in games that are often free to play), they need to know immediately if, e.g., in-game purchases suddenly drop off for whatever reason. They need to take action immediately, not next week.

Here I look at four streaming technologies, and then conclude with some pointers to general streaming computation techniques.

Part of the Berkeley Data Analytics Stack, Spark Streaming is a layer on top of Spark. Spark is a distributed, redundant RAM-based data storage and computation system that was accepted as an Apache Incubator project in 2013. Whereas Hadoop distributes data across the hard drives in a cluster of computers, Spark distributes data across the RAMs in a cluster of computers.

Spark Streaming breaks incoming stream into batches of, for example, 2-second batches or 10-second batches, or however long you set the batch window to be. One criticism of Spark Streaming is that it is not "true real time" since it is up to a batch window size behind. Another limitation is that all processing for that batch must complete before the next batch comes in; otherwise, the behavior is undefined. That can leave a lot of compute capacity unused if the system is tuned to handle the worst-case batch, but normal-size batches leave the processors idle for much of the batch time window.


Storm, another Apache project originally from Twitter, is the other big name in streaming. Its biggest and most well-known shortcoming is that it doesn't guarantee that it won't duplicate events. It guarantees each event will delivered at minimum once. Since this can wreak havoc with real-time metric computation, an optional layer called Trident can de-dupe, but at a high performance penalty.

Storm has a great advantage over Spark Streaming in that it has an "ack" system built-in. If processing an event throws an exception or otherwise fails to "ack" its parent in the directed acyclic graph (DAG) of event flow processing, the event can be saved by the parent for later processing when the problem has been fixed.
Storm also has a web-based admin GUI for starting/stopping/monitoring running DAGs.

Since I don't have any first-hand experience with S4, I have to rely on others' evaluations. At a presentation by Ted Dunning of MapR, he indicated that S4 was designed for topologies and situations that are more complex than usually encountered in the real world. A Yahoo! researcher blogged an excellent comparison between S4 and Storm, and boils it down to guaranteed delivery (Storm) vs. state recovery in case of fault (S4).

Druid is different than the rest in that it's more than just streaming; it also provides a data store and querying system. It's an all-in-one solution whereas the others are usually married to Hadoop.

Druid breaks the storage into two parts, one for real-time (typically 30 minutes worth of data), and the other for permanent "deep storage". A single querying facade can query both stores transparently.

The bad news is that the querying is currently limited. With no SQL, queries must be formed using a stilted JSON syntax, and the ability to create complex queries is limited.
But querying is fast. In fact, it's Druid's hallmark. In response to the common problem of cubing data with high dimensionality, the Druid folks decided instead to just be able to query really fast by storing the data in a distributed RAM-based cluster. With fast querying, there's no need to pre-compute and pre-aggregate.

LinkedIn contributed Samza to Apache in September, 2013. It's new and I haven't tried it yet, but it appears to integrate Kafka and YARN more deeply than the others, and it maintains state similar to how Spark Streaming allows with its RDDs (resilient distributed dataset) and its updateStateByKey() method.

While all the others are for self-hosted infrastructure, Amazon Kinesis is strictly for the could. Again, since I haven't used it myself, I rely on another blogger who informs that Kinesis is restricted to a single procedure, and so isn't suited for complex processing. There is also no rollback capability.

Stream Computation Techniques

It's evident fairly quickly that the mean can be computed in an incremental manner. If mn is the mean of x1 ... xn, then mn+1 = (n*mn + xn+1)/(n+1).

Much less obvious is that it's possible to calculate the standard deviation incrementally. Donald Knuth credits a 1962 paper by Welford with a clever technique.

Even more clever is calculating cardinality incrementally, i.e. the number of distinct items in a set, i.e. SELECT COUNT DISTINCT(*) -- and doing so without having to keep the entire dataset in memory. The trick is to use a hash function that uniformly distributes to the range [0.0,1.0) and then keep track of the most number of leading zeroes. E.g. if the set as 100 items, probabilistically we would expect the smallest hash value to end up being around 0.01, or one leading zero. For a set of 1000 items, some hash values by chance would probably end up being less than that, and the smallest overall would probably end up being around 0.001, or two leading zeroes.

Finally, classic machine learning algorithms such as K-means and decision tree construction have been adapted to incremental versions for streaming data. Joao Gama, a Portuguese professor, has assembled them into an excellent book:

Applications

Basic metrics permit real-time monitoring for:

  • Fraud detection
  • Fault detection
  • Capacity overload detection

Streaming machine learning adds:

  • Intelligent automated adjustment of thresholds to the above
  • Ability to detect opportunities, not just problems. E.g. product affinity of noticing products long-time product A and brand-new product B could alert management to craft a marketing campaign for product B to those who have bought A in the past.

Conclusion

While "real-time data science" doesn't mean taking conventional, manual, deep data exploration and analysis and applying it to real-time data, what it does mean for data scientists is that they are increasingly feeling the pressure to have some results automated and real-time, and data scientists need to be comfortable with these streaming technologies and algorithms.

Saturday, March 9, 2013

Automatically deskew before machine learning in R

A common data pre-processing step in R is to deskew data, which is where if a histogram shows a lopsided distribution, apply a function such as log() before fitting a model. If there are a large number of columns, it can be tedious to eyeball each histogram, and manually substitute offending columns with their log() counterparts.

Helpfully, the e1071 package (notably for its support vector machine algorithms) provides a handy function to measure the skewness of data, called skewness(). Below is a function to automatically deskew an entire range of columns of a data frame.
deskew <- function(df, mincol=1, maxcol=ncol(df), threshold=1.10) {
  for (i in mincol:maxcol) {
    t <- log(1+df[[i]]-min(df[[i]]))
    if (abs(skewness(df[[i]])) > threshold * abs(skewness(t)))
      df[[i]] <- t
  }
  df
}
Deskewing data improves the performance of linear models, both regular lm()/glm() and linear svm() support vector machines. Understandably, it doesn't help with decision trees such as randomTrees().