Thursday, November 29, 2018
Sentiment Analysis pending patent published today
"Techniques for Sentiment Analysis of Data Using a Convolutional Neural Network and a Co-Occurrence Network"
Labels:
Deep Learning,
Graphs,
NaturalLanguageProcessing
Wednesday, June 7, 2017
Spark Summit 2017 Review
Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced it is open sourcing Deep Learning Pipelines, a deep learning library that seems to lack GPU support. Similarly, Intel open sourced its own deep learning library, BigDL, also without GPU support, because Intel is pushing its FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).
For now, the leading contender for Spark GPU deep learning still seems to be DeepLearning4j, which is what I used in my Spark Summit 2017 presentation Neuro-Symbolic AI for Sentiment Analysis. (I will link video and slides once they are posted.)
The big announcement on the second (non-training) day of the Summit was that Databricks has created a serverless version of its commercial cloud service. This should, at least in theory, significantly reduce the cost for companies of making Spark available to their data scientists, thus (finally) offering a compelling alternative to running Zeppelin, Jupyter, or the Spark shell on-premises.
A year on from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only mention I personally heard of GraphFrames came from a Databricks presentation. GraphFrames still seems to be the future, but even that is not crystal clear, as Ion Stoica, in the second day's Fireside Chat, touted Tegra, which is based on GraphX rather than GraphFrames, for (finally) mutable graphs. (I first blogged about Tegra in my review of last year's Spark Summit.)
There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was Riot Games' talk on detecting abusive language in chat messages, language-agnostically (English, Chinese, Japanese -- it didn't care). And, of course, my own presentation was on NLP.
Finally, Structured Streaming got officially labeled production-ready, meaning Spark Streaming is eventually destined for the deprecation graveyard. There was a demo of 10 ms latency, to compete with Storm and Flink. No more micro-batches!
Labels:
Conference,
Graphs,
NaturalLanguageProcessing,
Spark,
Streaming
Tuesday, March 29, 2016
Table of XX2Vec Algorithms
| XX2Vec | Embed | In | Sup/Unsup | Algorithms used |
|---|---|---|---|---|
| Char2Vec | Character | Sentence | Unsupervised | CNN -> LSTM |
| Word2Vec | Word | Sentence | Unsupervised | ANN |
| GloVe | Word | Sentence | Unsupervised | SGD |
| Doc2Vec | Paragraph Vector | Document | Supervised | ANN -> Logistic Regression |
| Image2Vec | Image Elements | Image | Unsupervised | DNN |
| Video2Vec | Video Elements | Video | Supervised | CNN -> MLP |
The powerful word2vec algorithm has inspired a host of other algorithms listed in the table above. (For a description of word2vec, see my Spark Summit 2015 presentation.) word2vec is a convenient way to assign vectors to words, and of course vectors are the currency of machine learning. Once you've vectorized your data, you are then free to apply any number of machine learning algorithms.
word2vec is able to come up with vectors by leveraging the concept of embedding. In a corpus, a word appears in the context of surrounding words, and word2vec uses those co-occurrences to infer relationships between those words.
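To make that concrete, here is a minimal sketch using Spark MLlib's Word2Vec (the toy corpus, app name, and parameter values are invented for the example):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("word2vec-sketch").getOrCreate()
import spark.implicits._

// Toy corpus: each row is one pre-tokenized "sentence".
val docs = Seq(
  "apple banana fruit".split(" "),
  "banana orange fruit".split(" "),
  "cat dog pet".split(" ")
).map(Tuple1.apply).toDF("words")

val w2v = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(50) // dimensionality of the learned embedding
  .setMinCount(1)    // keep rare words; this corpus is tiny

val model = w2v.fit(docs)

// Nearest neighbors by cosine similarity in the learned vector space.
model.findSynonyms("apple", 2).show()
```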
All of the XX2Vec algorithms listed in the table above assign vectors to X's, where those X's are embedded in some larger context Y.
But the similarities end there. Not only does each XX2Vec algorithm compute its embeddings by means suited to its own domain, but their use cases aren't even analogous. Doc2Vec, for example, is supervised learning, whereas most of the others are unsupervised. The goal of Doc2Vec is to apply labels to documents, whereas the goal of word2vec and most of the other XX2Vec algorithms is simply to spit out vectors on which you can then perform other machine learning and analyses (such as analogy detection).
Here is a brief description of each XX2Vec:
Char2Vec
Like word2vec, but because it operates at the character level, it is much more tolerant of misspellings and thus better for analyzing tweets, user product reviews, etc.
Word2Vec
Described above. But one more note: it's one of those unreasonably effective algorithms -- a kind of getting lucky, if you will.
GloVe
Rather than just getting lucky, a number of efforts have tried to ground word embeddings in something more mathematical than pulling weights out of a neural network and hoping they work. GloVe is the current standard-bearer in this regard. Its model is designed from the ground up to support finding analogies, instead of stumbling on them by chance as word2vec does.
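For reference, this is the weighted least-squares objective from the GloVe paper (Pennington, Socher, and Manning, 2014), where $X_{ij}$ counts how often word $j$ appears in the context of word $i$:

\[
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\]

Here $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, and $f$ is a weighting function that caps the influence of very frequent co-occurrences. Because the model fits $\log X_{ij}$ directly, differences between word vectors track ratios of co-occurrence counts, which is what makes analogies fall out by design.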
Doc2Vec
Actually, Doc2Vec uses Word2Vec as a first pass. It then comes up with a composite vector for each sentence or paragraph from the contributing Word2Vec word vectors. This composite gives some kind of overall context to the sentence or paragraph, and it is then plopped down at the beginning of the sentence or paragraph as an "extra word". The paragraph vectors together with the word vectors are used to train a supervised-learning classifier on human labels of the documents.
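Spark MLlib has no Doc2Vec as such, but as a rough stand-in for the recipe above you can average a document's word vectors into a single document vector (which is what Word2VecModel.transform does) and train a supervised classifier on human labels. A minimal sketch, not real Doc2Vec (no paragraph vector is trained), with an invented toy corpus:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("doc-classify-sketch").getOrCreate()
import spark.implicits._

// Tiny labeled corpus, invented for the example.
val train = Seq(
  ("apples and bananas are tasty", 1.0), // 1.0 = about fruit
  ("my dog chased the cat",        0.0)  // 0.0 = not about fruit
).toDF("text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val w2v = new Word2Vec() // transform() averages word vectors per document
  .setInputCol("words")
  .setOutputCol("features")
  .setVectorSize(50)
  .setMinCount(1)
val lr = new LogisticRegression() // supervised pass over document vectors

val model = new Pipeline().setStages(Array(tokenizer, w2v, lr)).fit(train)
```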
Image2Vec
Whereas word2vec intentionally uses a shallow neural network, Image2Vec uses a deep neural network and composes the resultant vectors from the weights from multiple layers of the network. Image elements that might be represented by these weights include image fragments (grass, bird, fence, etc.) or overall image qualities like color.
Video2Vec
If machine learning on images involves high dimensions, videos involve even higher dimensions. Video2Vec does some initial dimension reduction by doing a first pass with convolutional neural networks.
Wednesday, June 24, 2015
My Spark Summit Presentation On Word2Vec and Semi-Supervised Learning
Abstract
MLLib Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. As an alternative to GraphX, even though YAGO2 is a graph, we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.
Video
Slides
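For flavor, here are textbook Scala sketches of the two string metrics the abstract mentions; these are illustrative implementations, not the ODECS code:

```scala
object StringMetrics {
  // Levenshtein edit distance via classic dynamic programming.
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val substCost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(
        math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), // deletion / insertion
        d(i - 1)(j - 1) + substCost)                // substitution
    }
    d(a.length)(b.length)
  }

  // Jaccard similarity of the two strings' character sets.
  def jaccard(a: String, b: String): Double = {
    val (sa, sb) = (a.toSet, b.toSet)
    if (sa.isEmpty && sb.isEmpty) 1.0
    else sa.intersect(sb).size.toDouble / sa.union(sb).size
  }
}

// e.g. StringMetrics.levenshtein("banana", "bananna") == 1
//      StringMetrics.jaccard("banana", "bananna")     == 1.0 (same character set)
```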
Labels:
MachineLearning,
NaturalLanguageProcessing,
Spark