Monday, July 18, 2016

Spark Sometimes Forgets to Put Distantly Scoped Variables into Your Closure

This has always been a gotcha with Spark, and it still is today, yet I rarely see it cautioned about anywhere.

If your .map() needs access to a variable (or value) that is not defined in the same immediate local scope, Spark will "sometimes" fail to include it in the closure it ships to the executors, leading to erroneous results.

I've never been able to pin down "sometimes", and I've never been able to come up with a tiny example that demonstrates the failure. Nevertheless, below is a tiny bit of source code (which does work; that is, it does not demonstrate the problem) just to make clear what I'm talking about.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object closure {
  // a lives on the enclosing object, outside main()'s local scope
  val a = Array(1,2)
  def main(args: Array[String]) {
    // Sometimes the line of code below is necessary (and change the
    // reference to a in the map() to a2 as well)
    // val a2 = a
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("closure"))
    // prints "6;7" when the closure is captured correctly
    println(sc.makeRDD(Array(3,4)).map(_ + a.sum).collect.mkString(";"))
    sc.stop
  }
}

I've seen cases where a variable like the a above "sometimes" gets transmitted to the cluster as a zero-length array, in which case a.sum would be 0 and the program above would print 3;4 instead of the correct 6;7.

As background, functional languages like Scala compute closures, which means that when you pass a function as a parameter, it's not just the function that gets passed but also all of the variables and values it refers to. Scala does compute closures, but it does not serialize them for distributed computing. Spark has to compute and serialize its own closures, and sometimes it gets this wrong. Sometimes it's necessary to give it some help by moving the data you need into the same local scope so that it can pick it up.
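
To make the workaround concrete, here is a minimal sketch of the fixed pattern, mirroring the commented-out a2 line in the example above (the object and value names here are my own illustrations, not anything from Spark): copy the distantly scoped value into a local val inside the method and refer only to that copy inside the map(), so the data itself ends up in the serialized closure rather than a reference back to the enclosing object.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object closureWorkaround {
  // The distantly scoped value: it lives on the enclosing object,
  // not in main()'s local scope.
  val lookup = Array(10, 20, 30)

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("closureWorkaround"))

    // Copy the object-level value into a local val and reference only
    // the copy inside map(), so the data itself is captured in the
    // serialized closure rather than a reference back to the object.
    val localLookup = lookup

    // lookup.sum is 60, so this prints "61;62;63"
    println(sc.makeRDD(Array(1,2,3)).map(_ + localLookup.sum).collect.mkString(";"))
    sc.stop
  }
}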
