Spark(6) Upgrade to Version 1.0.2 Again with YARN
Download the prebuilt version
>wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
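The download and unpack step can be sketched as two small shell functions. This is only a sketch: the function names and the destination directory are made up for illustration, and it assumes the archive unpacks into a directory named after the file without ".tgz".

```shell
# Name of the directory the archive is expected to unpack into
# (assumption: it matches the archive file name without ".tgz").
spark_dir_name() {
  basename "$1" .tgz
}

# Download the archive into dest_dir, unpack it there,
# and print the path of the unpacked directory.
fetch_spark() {
  url=$1
  dest_dir=$2
  archive="$dest_dir/$(basename "$url")"
  mkdir -p "$dest_dir" &&
    wget -O "$archive" "$url" &&
    tar -xzf "$archive" -C "$dest_dir" &&
    echo "$dest_dir/$(spark_dir_name "$url")"
}
```

Example call (the target directory is an assumption): fetch_spark http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz "$HOME/tool"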
Set up Hadoop 2 and get it running on my VMs
http://sillycat.iteye.com/blog/2090186
Prepare the file on HDFS
>hdfs dfs -mkdir /user/sillycat
>hdfs dfs -put /opt/spark/log.txt /user/sillycat/
Log in to the Spark shell.
>MASTER=spark://ubuntu-master1:7077 bin/spark-shell
>val file = sc.textFile("hdfs://ubuntu-master1:9000/user/sillycat/log.txt")
>file.first()
Error Message:
Server IPC version 9 cannot communicate with client version 4
Solution:
This is a version mismatch: I was using the Spark build for Hadoop 1 to connect to Hadoop 2.4.1. Use the hadoop2 prebuilt package instead.
After that, it works in the shell.
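To catch this kind of mismatch before starting the shell, the Hadoop major version in the Spark package name can be compared with the one the cluster reports. A minimal sketch, assuming the package follows the "-bin-hadoopN" naming and that the first line of `hadoop version` looks like "Hadoop 2.4.1"; the function names are hypothetical.

```shell
# Extract the Hadoop major version from a prebuilt Spark package name,
# e.g. "spark-1.0.2-bin-hadoop2" -> "2".
build_hadoop_major() {
  printf '%s\n' "$1" | sed -n 's/.*-bin-hadoop\([0-9][0-9]*\).*/\1/p'
}

# Extract the major version from the first line of `hadoop version`,
# e.g. "Hadoop 2.4.1" -> "2".
cluster_hadoop_major() {
  printf '%s\n' "$1" | sed -n 's/^Hadoop \([0-9][0-9]*\)\..*/\1/p'
}

# Print "ok" when both major versions are present and equal, else "mismatch".
check_build() {
  build=$(build_hadoop_major "$1")
  cluster=$(cluster_hadoop_major "$2")
  if [ -n "$build" ] && [ "$build" = "$cluster" ]; then
    echo "ok"
  else
    echo "mismatch"
  fi
}
```

Example call on a machine with Hadoop installed: check_build spark-1.0.2-bin-hadoop2 "$(hadoop version | head -n 1)"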
Next, configure YARN. Run these from the Hadoop installation directory.
>sbin/start-dfs.sh
>sbin/start-yarn.sh
>sbin/mr-jobhistory-daemon.sh start historyserver
YARN is running now. Cluster information is available at these URLs:
http://ubuntu-master1:50070/dfshealth.html#tab-overview
http://ubuntu-master1:8088/cluster/nodes
http://ubuntu-master1:19888/jobhistory
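A quick way to confirm that the three web UIs above respond is to request each one and print the HTTP status code. This is a sketch only; the function names are made up, and it assumes curl is installed and the default ports shown above are in use.

```shell
# Print the three web UI URLs for the given master host.
ui_urls() {
  host=$1
  echo "http://$host:50070/dfshealth.html"
  echo "http://$host:8088/cluster/nodes"
  echo "http://$host:19888/jobhistory"
}

# Request each URL and print "<status code> <url>".
# curl reports 000 when it cannot connect at all.
check_uis() {
  ui_urls "$1" | while read -r url; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    echo "$code $url"
  done
}
```

Example call: check_uis ubuntu-master1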
Running Spark Shell on YARN
Change the Spark configuration file so that it points to the Hadoop configuration directory
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
>MASTER=yarn-client bin/spark-shell
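The HADOOP_CONF_DIR setting above can be added to the Spark environment file without duplicating it on repeated runs. A minimal sketch: the function name is hypothetical, and using conf/spark-env.sh as the target file is an assumption.

```shell
# Append "export HADOOP_CONF_DIR=<dir>" to the given env file,
# unless that exact line is already there.
set_hadoop_conf_dir() {
  env_file=$1
  conf_dir=$2
  line="export HADOOP_CONF_DIR=$conf_dir"
  touch "$env_file"
  if ! grep -qxF "$line" "$env_file"; then
    printf '%s\n' "$line" >> "$env_file"
  fi
}
```

Example call from the Spark directory: set_hadoop_conf_dir conf/spark-env.sh /opt/hadoop/etc/hadoop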
Submit the job as follows
>bin/spark-submit --class com.cloudera.sparkwordcount.SparkWordCount --master yarn /Users/carl/work/current/simplesparkapp/target/sparkwordcount-0.0.1-SNAPSHOT.jar hdfs://ubuntu-master1:9000/user/sillycat/log.txt 2
Spark works great on YARN
>bin/spark-submit --class com.sillycat.spark.app.ClusterComplexJob --master yarn /Users/carl/work/sillycat/sillycat-spark/target/scala-2.10/sillycat-spark-assembly-1.0.jar book1
The sample project is in sillycat-spark
The standalone cluster works as well
>bin/spark-submit --class com.sillycat.spark.app.FindWordJob --master spark://ubuntu-master1:7077 /Users/carl/work/sillycat/sillycat-spark/target/scala-2.10/sillycat-spark-assembly-1.0.jar book1
The commands to start the master and the worker (slave)
>sbin/start-master.sh
>bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu-master1:7077
Configuration on master1
>cat conf/spark-env.sh
#!/usr/bin/env bash
export SPARK_LOCAL_IP=ubuntu-master1
#export SPARK_EXECUTOR_MEMORY=1G
export SPARK_MASTER_IP=ubuntu-master1
export SPARK_WORKER_MEMORY=1024M
Configuration on slave1
>cat conf/spark-env.sh
#!/usr/bin/env bash
export SPARK_LOCAL_IP=ubuntu-slave1
#export SPARK_EXECUTOR_MEMORY=1G
export SPARK_MASTER_IP=ubuntu-master1
export SPARK_WORKER_MEMORY=1024M
Tips
Running a Spark job in local mode
>bin/spark-submit --class com.cloudera.sparkwordcount.SparkWordCount --master local /Users/carl/work/current/simplesparkapp/target/sparkwordcount-0.0.1-SNAPSHOT.jar /opt/spark/README.md 2
Error Message
java.lang.OutOfMemoryError: Java heap space
Solution:
Change the memory configuration in this script
>vi bin/spark-class
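As an alternative to editing bin/spark-class, spark-submit also accepts a --driver-memory option. The wrapper below is only a sketch: the function name, the DRY_RUN switch, and the 1g value are assumptions, and whether this resolves the error depends on the job.

```shell
# Run a local-mode job with an explicit driver heap size.
# With DRY_RUN=1 the command is printed instead of executed.
run_local_job() {
  mem=$1
  shift
  set -- bin/spark-submit --master local --driver-memory "$mem" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Example call from the Spark directory: run_local_job 1g --class com.cloudera.sparkwordcount.SparkWordCount /Users/carl/work/current/simplesparkapp/target/sparkwordcount-0.0.1-SNAPSHOT.jar /opt/spark/README.md 2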
References:
http://spark.apache.org/docs/latest/running-on-yarn.html
https://github.com/snowplow/spark-example-project
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
http://parambirs.wordpress.com/2014/05/20/running-spark-1-0-0-snapshot-on-hadoopyarn-2-4-0/
http://parambirs.wordpress.com/2014/05/20/install-hadoopyarn-2-4-0-on-ubuntu-virtualbox/
http://parambirs.wordpress.com/2014/05/20/building-and-running-spark-1-0-0-snapshot-on-ubuntu/