Thursday, April 7, 2016

Declarative Machine Learning

SQL is commonly referred to as a 4GL, or fourth-generation programming language, as opposed to 3GLs like Java, C++, Python, and Scala. SQL is a declarative language, whereas the 3GLs are imperative languages: you tell SQL what to do, not how to do it.
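To make the distinction concrete, here is a minimal sketch of my own (the Employee class and the query are hypothetical, not from any of the systems discussed below). The imperative version spells out how to get the answer step by step; the declarative version states only what answer is wanted and leaves the how to the query planner.

// Imperative (3GL): we dictate *how* -- filter, sort, then slice, by hand.
case class Employee(name: String, salary: Double)

def topTenEarners(employees: Seq[Employee]): Seq[Employee] =
  employees
    .filter(_.salary > 0)   // the predicate, applied explicitly
    .sortBy(-_.salary)      // the ordering, chosen explicitly
    .take(10)               // the truncation, applied explicitly

// Declarative (4GL): we state *what* we want; the planner decides whether
// to scan, use an index, or sort:
//   SELECT name, salary FROM employees
//   WHERE salary > 0 ORDER BY salary DESC LIMIT 10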
Well, TuPAQ is the SQL for machine learning. You give it a high-level goal, and it figures out which machine learning algorithm to use and tunes the hyperparameters for you. Example code for speech-to-text translation from Evan Sparks et al.:
SELECT vm.sender, vm.arrived,
PREDICT(vm.text, vm.audio)
GIVEN LabeledVoiceMails
FROM VoiceMails vm
WHERE vm.user = 'Bob' AND vm.listened IS NULL
ORDER BY vm.arrived DESC
LIMIT 50
When will you be able to use this in production? Hopefully it's not too far away -- maybe a year, as a wild guess. At Spark Summit in June 2015, Evan Sparks indicated KeystoneML would "soon" integrate with TuPAQ, as both KeystoneML and TuPAQ are AMPLab projects.
Although I gave KeystoneML a tepid review when it first came out, the new 0.3 version announced last week shows the impressive direction they're headed in. It's not as declarative as TuPAQ -- you still spell out the pipeline stages yourself -- but it is still declarative: you describe what the pipeline is, and KeystoneML decides how to execute it. An example of declaring a machine learning pipeline in KeystoneML:
val trainData = NewsGroupsDataLoader(sc, trainingDir)

val predictor = Trim andThen                    // strip whitespace
    LowerCase() andThen                         // normalize case
    Tokenizer() andThen                         // split text into tokens
    NGramsFeaturizer(1 to conf.nGrams) andThen  // extract n-grams
    TermFrequency(x => 1) andThen               // binary term counts
    (CommonSparseFeatures(conf.commonFeatures), trainData.data) andThen
    (NaiveBayesEstimator(numClasses), trainData.data, trainData.labels) andThen
    MaxClassifier                               // emit the highest-scoring class
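Each andThen chains another stage onto the pipeline. The tuple-shaped stages pair an estimator with the training data (and, for NaiveBayesEstimator, the labels) it should be fit on, so the whole chain from raw text to predicted class is declared up front as a single object that KeystoneML can analyze.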
Sure, the spark.ml package from Spark MLlib is also pipeline-centric, but whereas spark.ml simply relies on DataFrames/Catalyst/Tungsten to optimize each stage of the pipeline, KeystoneML analyzes and optimizes the pipeline as a whole. It "inspects the pipeline DAG and automatically decides where to cache intermediate output using a greedy algorithm."
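For comparison, here is a rough spark.ml equivalent -- a hedged sketch against the Spark 1.6-era API, assuming a DataFrame trainingDF with "label" and "text" columns. The declaration style is similar, but Catalyst/Tungsten only sees one stage's work at a time:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Declare the stages, much like KeystoneML's andThen chain.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val nb = new NaiveBayes()  // reads "label" and "features" columns by default

// The pipeline is fit as a unit, but each stage is optimized separately.
val model = new Pipeline().setStages(Array(tokenizer, tf, nb)).fit(trainingDF)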
Are there other declarative machine learning systems out there? Apache SystemML claims to be declarative, but it is only in the sense that it automatically plans deployment to a cluster based on data locality, available memory, etc.
SystemML claims that the high-level languages it provides, DML and PyDML, are "declarative," but they are not; they are still imperative languages. Their purpose is to let non-Spark developers write machine learning programs in languages they are comfortable with (DML is R-like, PyDML is Python-like), yet still compile down to Spark when the time comes to deploy to production. So they are high-level, as SystemML claims, but imperative rather than declarative. The ability of SystemML to plan optimal deployment to a cluster, however, is declarative.