Spark on Windows – updated April 2016
Spark 1.6.1, April 2016
So, building this version for Scala 2.11 is slightly different, and the way I got it built is below. Note that I am not interested in Hadoop or HDFS. It's Spark on Cassandra for me.
How to build Spark 1.6.1 for Scala 2.11
Download the source for Spark. I also downloaded a working pre-built binary, which is for Scala 2.10; I'm going to build the Scala 2.11 assembly and replace the Scala 2.10 one.
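If it helps, something like the below (run from Cygwin) pulls both down. The archive URLs are my assumption from the Apache release archive, so check them against the Spark downloads page; use wget instead of curl if that's what your Cygwin install has.
# URLs are an assumption - verify against the Spark downloads page
curl -O http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1.tgz
curl -O http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz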
Make sure you are using a recent Maven, e.g. 3.3.9. Earlier ones will fail with a warning.
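If you're not sure which Maven is actually on your path, a quick check:
mvn -version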
If you are building on Windows, get Cygwin installed, then go into the dev subdir and switch the build to Scala 2.11.
DO THIS IN CYGWIN, change to the source download directory:
cd C:/[your working dir]/spark-1.6.1
cd ./dev
./change-scala-version.sh 2.11
Then, back in a normal Windows command prompt, run the build:
cd C:\[your working dir]\spark-1.6.1
set MAVEN_OPTS=-Xmx1024m
mvn -Pyarn -Phadoop-2.6 -Pbigtop-dist -Dscala-2.11 -DskipTests clean package > buildlog.txt
Finally, move the resulting assembly into the lib dir of the pre-built Scala 2.10 download, i.e. copy your jar over the top of the Scala 2.10 one, and then you can use the Scala 2.10 binary download as-is.
You can find the newly built assembly at
C:\[your working dir]\spark-1.6.1\assembly\target\scala-2.11
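As a sketch, the copy looks something like the below from a Windows command prompt, assuming the pre-built 2.10 download lives where the batch files further down expect it:
REM assumes the pre-built download is at C:\java\spark-1.6.1-bin-hadoop2.6 as in the batch files below
copy C:\[your working dir]\spark-1.6.1\assembly\target\scala-2.11\spark-assembly-1.6.1-hadoop2.6.0.jar C:\java\spark-1.6.1-bin-hadoop2.6\lib\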
WebUI.class error
At one point I was getting a failure to build with the WebUI complaint below.
[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'.
I tried commenting out the SQL modules in the pom.xml, and that got it building, but it meant I couldn't then use Cassandra, so the correct fix is to run change-scala-version.sh 2.11 and then the build command above.
...sql/catalyst docker-integration-tests ...
Quick note on Maven memory
Note that the wiki article Maven points you at to fix memory issues (http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError) is out of date for JDK 1.8: -XX:MaxPermSize=128m no longer does anything in 1.8, as perm gen is no longer part of the JVM.
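So on JDK 1.8 just set the heap and leave the perm gen flag out, which is all the build step above does anyway:
set MAVEN_OPTS=-Xmx1024m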
Note on resuming a Maven build
Note, if you need to resume the build you can do
mvn -Pyarn -Phadoop-2.6 -Pbigtop-dist -Dscala-2.11 -DskipTests clean package -rf :spark-docker-integration-tests_2.10 > buildlog.txt
Running with the new 2.11 assembly
I ended up with a distribution of sorts in assembly/target. Having said that, I think all I want is the jar file, spark-assembly-1.6.1-hadoop2.6.0.jar. I've already got a 1.6.1 Scala 2.10 build, so I slapped my new jar into its lib dir, then ran the start master and start worker batch files below and everything worked.
Running Apache Spark on Windows (first published Feb 2016)
Why Windows? It's the dev laptop I carry around with me, rather than the home-based Linux Mint server. Anyway, it turns out the Spark guys are Linux heads, and after reading the source on GitHub I managed to work out what all the launcher code eventually ends up doing, which is as below. Obviously change the paths to match your own.
startMaster.bat
SET JAVA_HOME=C:\java\jdk1.8.0_74\
SET SPARK_HOME=C:/java/spark-1.6.1-bin-hadoop2.6/
REM Spark uses hadoop winutils to access native filesystems,
REM so it has to be in the correct place and HADOOP_HOME set
set HADOOP_HOME=C:\java\hadoopbinaries\winutils\hadoop-2.6.0
%JAVA_HOME%\bin\java -cp "%SPARK_HOME%/conf;%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar;%SPARK_HOME%lib/datanucleus-api-jdo-3.2.6.jar;%SPARK_HOME%lib/datanucleus-core-3.2.10.jar;%SPARK_HOME%lib/datanucleus-rdbms-3.2.9.jar" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip 127.0.0.1 --port 7077 --webui-port 8080
startWorker.bat (actually, because I'm me, it's called startSlave.bat)
SET JAVA_HOME=C:\java\jdk1.8.0_74\
SET SPARK_HOME=C:/java/spark-1.6.1-bin-hadoop2.6/
REM Spark uses hadoop winutils to access native filesystems,
REM so it has to be in the correct place and HADOOP_HOME set
set HADOOP_HOME=C:\java\hadoopbinaries\winutils\hadoop-2.6.0
%JAVA_HOME%\bin\java -cp "%SPARK_HOME%/conf;%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar;%SPARK_HOME%lib/datanucleus-api-jdo-3.2.6.jar;%SPARK_HOME%lib/datanucleus-core-3.2.10.jar;%SPARK_HOME%lib/datanucleus-rdbms-3.2.9.jar" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 --webui-port 8081
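Once both are up, the master web UI should be on http://localhost:8080. If you want a quick sanity check that the master actually accepts connections, pointing a shell at it from the pre-built download is one way; this is just a sketch of that check, not part of the original setup:
REM from the root of the pre-built download; a quick check that the master accepts connections
cd C:\java\spark-1.6.1-bin-hadoop2.6
bin\spark-shell.cmd --master spark://127.0.0.1:7077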
A Scala Spark example
Reading the wiki makes it appear there are no problems with Spark. It took a lot of experimenting and some time to get all the above information. It all started with a tiny Spark example program, from their docs, which is below.
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("JonathanExample").setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(sparkConf)
When I ran it I got:
java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc
That local class incompatible error is what you get when the client and cluster builds don't match, and that is when I started building Spark for myself … see the top of the page.
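Once the versions line up, a tiny smoke test along these lines (my own addition, not from the Spark docs) confirms the context really talks to the cluster:
// push a trivial job through the cluster, then shut the context down
val total = sc.parallelize(1 to 1000).map(_ * 2).sum()
println(s"sum = $total")
sc.stop()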
Spark memory error when you run the job
If you get a memory error when the job starts up, change the VM options to include a heap size greater than 417m, e.g.
-Xmx512m
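If you launch the driver by hand in the same style as the master and worker above, that just means putting the flag on the java command line; the class name and classpath here are placeholders:
REM placeholder classpath and class name, just showing where the heap flag goes
%JAVA_HOME%\bin\java -Xmx512m -cp "[your app jar];%SPARK_HOME%lib/spark-assembly-1.6.1-hadoop2.6.0.jar" JonathanExample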