Spark 1.6.1, April 2016

So, building this version for Scala 2.11 is slightly different, and the way I got it built is below. Note that I am not interested in Hadoop or HDFS; it's Spark on Cassandra for me.

How to build Spark 1.6.1 for Scala 2.11

Download the source for Spark. I also downloaded a working binary release, which is built for Scala 2.10; I'm going to build the Scala 2.11 assembly and drop it in place of the Scala 2.10 one.

Make sure you are using a recent Maven, e.g. 3.3.9; earlier versions will fail with a warning.
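A quick sanity check, which also shows the JDK that Maven has picked up:

REM Confirm the Maven (and JDK) the build will use
mvn -version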

If you are building on Windows, get Cygwin installed, then go into the dev subdirectory and switch the build to Scala 2.11.

DO THIS IN CYGWIN: change to the source download directory and switch the build to Scala 2.11
cd C:/[your working dir]/spark-1.6.1
cd ./dev
./change-scala-version.sh 2.11

THEN RUN THE BUILD (the set line below is Windows cmd syntax; use export MAVEN_OPTS=-Xmx1024m if you stay in Cygwin)
cd C:\[your working dir]\spark-1.6.1
set MAVEN_OPTS=-Xmx1024m
mvn -Pyarn -Phadoop-2.6 -Pbigtop-dist -Dscala-2.11 -DskipTests clean package > buildlog.txt

Finally, move the resulting assembly into the lib dir of the pre-built Scala 2.10 binary you downloaded, i.e. copy your jar over the top of the Scala 2.10 one, and then you can carry on using the Scala 2.10 binary download (see the example copy after the path below).

You can find the newly built assembly at

C:\[your working dir]\spark-1.6.1\assembly\target\scala-2.11
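As a concrete sketch of that copy (the destination is the lib dir of the binary download used later in this post; adjust both paths to your own layout):

REM Overwrite the Scala 2.10 assembly in the binary download with the freshly built 2.11 one
copy "C:\[your working dir]\spark-1.6.1\assembly\target\scala-2.11\spark-assembly-1.6.1-hadoop2.6.0.jar" "C:\java\spark-1.6.1-bin-hadoop2.6\lib"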

WebUI.class error

At one point the build was failing with the WebUI complaint below.

[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'.

I tried commenting out the SQL modules in the pom.xml, and this worked, but it meant I couldn't then use Cassandra - so the correct fix is change-scala-version.sh 2.11 followed by the build command above. For reference, the stretch of the root pom.xml module list I had been commenting out:

    ...
    <module>sql/catalyst</module>
    ...
    <module>docker-integration-tests</module>
    ...

Quick note on Maven memory

Note that the wiki article that Maven points you at for fixing memory issues is out of date for JDK 1.8, i.e. -XX:MaxPermSize=128m no longer works in 1.8 as perm gen isn't part of the JVM anymore. See http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
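In practice, on JDK 1.8 the heap setting on its own (as used in the build above) is all you need; the perm gen flag can simply be dropped:

REM JDK 1.8: heap only - perm gen (and MaxPermSize) no longer exists
set MAVEN_OPTS=-Xmx1024m
REM On JDK 1.7 and earlier you could still use: set MAVEN_OPTS=-Xmx1024m -XX:MaxPermSize=128m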

Note on resuming a Maven build

Note: if you need to resume the build, you can do

mvn -Pyarn -Phadoop-2.6 -Pbigtop-dist -Dscala-2.11 -DskipTests clean package -rf :spark-docker-integration-tests_2.10 > buildlog.txt

Running with the new 2.11 assembly

I ended up with a distribution of sorts in assembly/target. Having said that, I think all I want is the jar file, spark-assembly-1.6.1-hadoop2.6.0.jar. I've already got a 1.6.1 Scala 2.10 binary build, so I slapped my new jar into its lib dir, then ran the start master and start worker batch files below, and everything worked.

Running Apache Spark on Windows (first published Feb 2016)

Why Windows? It's the dev laptop I carry around with me, rather than the home-based Linux Mint server. Anyway, it turns out the Spark guys are Linux heads, and after reading the source on GitHub I managed to work out what all the launcher code eventually ends up doing, which is below. Obviously change the paths to match your own.

startMaster.bat

SET JAVA_HOME=C:\java\jdk1.8.0_74\
SET SPARK_HOME=C:/java/spark-1.6.1-bin-hadoop2.6/

REM Spark uses hadoop winutils to access native filesystems, 
REM so it has to be in the correct place and HADOOP_HOME set
set HADOOP_HOME=C:\java\hadoopbinaries\winutils\hadoop-2.6.0

%JAVA_HOME%\bin\java -cp "%SPARK_HOME%/conf;%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar;%SPARK_HOME%lib/datanucleus-api-jdo-3.2.6.jar;%SPARK_HOME%lib/datanucleus-core-3.2.10.jar;%SPARK_HOME%lib/datanucleus-rdbms-3.2.9.jar" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip 127.0.0.1 --port 7077 --webui-port 8080

startWorker.bat (actually, because I'm me, it's called startSlave.bat)

SET JAVA_HOME=C:\java\jdk1.8.0_74\
SET SPARK_HOME=C:/java/spark-1.6.1-bin-hadoop2.6/

REM Spark uses hadoop winutils to access native filesystems, 
REM so it has to be in the correct place and HADOOP_HOME set
set HADOOP_HOME=C:\java\hadoopbinaries\winutils\hadoop-2.6.0

%JAVA_HOME%\bin\java -cp "%SPARK_HOME%/conf;%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar;%SPARK_HOME%lib/datanucleus-api-jdo-3.2.6.jar;%SPARK_HOME%lib/datanucleus-core-3.2.10.jar;%SPARK_HOME%lib/datanucleus-rdbms-3.2.9.jar" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 --webui-port 8081

A Scala Spark example

Reading the wiki makes it appear as if there are no problems with Spark, but it took a lot of experimenting and some time to get all the above information. It all started with a tiny Spark example program, from their docs, which is below.

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("JonathanExample").setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(sparkConf)

When I ran it, I got:

java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc

and that was when I started building Spark for myself … see the top of the page.
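For what it's worth, once the rebuilt assembly is in place the driver can be launched the same way as the master and worker above: plain java with the assembly on the classpath. This is only a sketch, and sparkexample.jar plus the main class JonathanExample are made-up names standing in for however you package the example; the SparkConf already names the master, so nothing else is needed on the command line.

REM A sketch of launching the example driver against the local standalone master
REM (sparkexample.jar and JonathanExample are placeholder names)
SET JAVA_HOME=C:\java\jdk1.8.0_74\
SET SPARK_HOME=C:/java/spark-1.6.1-bin-hadoop2.6/
set HADOOP_HOME=C:\java\hadoopbinaries\winutils\hadoop-2.6.0

%JAVA_HOME%\bin\java -Xmx512m -cp "%SPARK_HOME%/conf;%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar;C:\dev\sparkexample.jar" JonathanExample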

Spark memory error when you run the job

If you get a memory error when the job starts up, change the VM options to include a heap size greater than 417m, e.g.

-Xmx512m