If you're interested in a quick way to start playing with Apache Spark without having to pay for cloud resources, and without having to go through the trouble of installing Hadoop at home, you can use the preconfigured Hadoop VM that Cloudera makes freely available for download. Below are the steps.
- Because the VM is 64-bit, your computer must be able to run 64-bit VMs, which requires hardware virtualization support. This is usually enabled by default on computers made since 2012, but on computers made between 2006 and 2011 you will probably have to enable virtualization (Intel VT-x or AMD-V) in the BIOS settings.
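If your host machine runs Linux, one quick way to check whether the CPU exposes the required virtualization extensions is to grep /proc/cpuinfo. A count of zero means the extensions are either absent or disabled in the BIOS:

```shell
# Count CPU flags advertising hardware virtualization:
# vmx = Intel VT-x, svm = AMD-V. A result of 0 means none are visible to the OS.
egrep -c '(vmx|svm)' /proc/cpuinfo
```

(On Windows or OS X hosts, use a tool like Microsoft's coreinfo or check the BIOS directly.)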
- Install VirtualBox from https://www.virtualbox.org/wiki/Downloads. (I use VirtualBox since it's more free than VMware Player.)
- Download and unzip the 2GB QuickStart VM for VirtualBox from Cloudera.
- Launch VirtualBox and from its drop-down menu select File->Import Appliance, then choose the unzipped VM image.
- Click the Start icon to launch the VM.
- From the VM Window's drop-down menu, select Devices->Shared Clipboard->Bidirectional
- From the CentOS drop-down menu, select System->Shutdown->Restart. I have found this to be necessary to get HDFS to start working the first time on this particular VM.
- The VM comes with OpenJDK 1.6, but Spark and Scala need Oracle JDK 1.7, which is also supported by Cloudera 4.4. From within CentOS, launch Firefox and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. Click the "Accept License Agreement" radio button and click to download jdk-7u51-linux-x64.rpm (the 64-bit RPM), opting to save it to ~/Downloads rather than open it.
- From the CentOS drop-down menu, select Applications->System Tools->Terminal and then:
sudo rpm -Uvh ~/Downloads/jdk-7u51-linux-x64.rpm
echo "export JAVA_HOME=/usr/java/latest" >>~/.bashrc
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >>~/.bashrc
source ~/.bashrc
wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating.tgz
tar xzvf spark-0.9.0-incubating.tgz
cd spark-0.9.0-incubating
SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly
bin/spark-shell
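A note on the `SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly` line above: the `VAR=value command` form is standard shell syntax that sets the variable for that one command only, so the build sees the Hadoop version without it leaking into the rest of your session. A minimal illustration (using `sh -c 'echo ...'` as a stand-in for the sbt invocation):

```shell
# The variable is visible inside the command being run...
SPARK_HADOOP_VERSION=2.2.0 sh -c 'echo "building against Hadoop $SPARK_HADOOP_VERSION"'
# ...but is not set in the surrounding shell afterwards.
echo "after: ${SPARK_HADOOP_VERSION:-unset}"
```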
That sbt assembly command also has the nice side effect of downloading Scala and sbt for you, so you can start writing Scala code that uses Spark instead of just using the Spark shell.
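Once `bin/spark-shell` is up, a classic first exercise is a word count. A sketch of a session (the file name and variable names here are just illustrative; point `textFile` at any text file you actually have, e.g. Spark's own README.md):

```
scala> val lines = sc.textFile("README.md")
scala> val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5).foreach(println)
```

`sc` is the SparkContext that the shell creates for you at startup.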