Spark Runtime and Development Environment Setup
I. Setting up the Spark runtime environment on Linux
References:
http://wenku.baidu.com/link?url=V14fWw5C3vp2G7YhTApqknz_EKwowBGP8lL_TvSbXa8PN2vASVAHUSouK7p0Pu14h3IBf8zmdfPUNUT-2Hr-cnDUzivYJKupgWnEkbHTY8i
http://wenku.baidu.com/link?url=-b2L9j7w2OSic3F7rA3LGPfhQpU45jBHAzVdmYesDDw4G6qGRi35-C7cFi8Oc3E-b1aqjn3agCDSjR4IzwEF2elJouLPSjZtcKdxYEIZQQK
1. Install Scala
Install the 2.10 series; that is the Scala version that matches Spark 1.3.
tar -zxvf scala-2.10.5.tgz
vim /etc/profile
export SCALA_HOME=/opt/scala-2.10.5
export PATH=${SCALA_HOME}/bin:$PATH
source /etc/profile
2. Download Spark 1.3
http://apache.fayea.com/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
tar -zxvf spark-1.3.0-bin-hadoop2.4.tgz
mv spark-1.3.0-bin-hadoop2.4 spark1.3
tty:[0] jobs:[0] cwd:[/opt/spark1.3/conf]
17:50 [[email protected]]$ cp spark-env.sh.template spark-env.sh
export SCALA_HOME=/opt/scala-2.10.5
# maximum memory per worker
export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=10.10.72.182
export MASTER=spark://10.10.72.182:7077
# Hadoop configuration path; use the actual directory from your Hadoop setup
export HADOOP_CONF_DIR=/opt/hadoop-2.4.0/etc/hadoop
export JAVA_HOME=/opt/jdk
Configure the slaves file
cp slaves.template slaves
18:08 [[email protected]]$ vim slaves
# A Spark Worker will be started on each of the machines listed below.
cloud01
cloud02
cloud03
scp -r spark1.3 root@cloud02:/opt/
scp -r spark1.3 root@cloud03:/opt/
Edit /etc/profile on each of the three servers and add:
export SPARK_HOME=/opt/spark1.3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
Spark only needs to be started on 182; the workers on 183 and 184 are started automatically.
In /opt/spark1.3:
Start:
./sbin/start-all.sh
Stop:
./sbin/stop-all.sh
Check the processes with jps:
15:03 [[email protected]]$ jps
12972 NodeManager
4000 Application
12587 QuorumPeerMain
25042 Jps
12879 ResourceManager
12739 DataNode
12648 JournalNode
24790 Master
24944 Worker
15:09 [[email protected]]$ jps
11802 DFSZKFailoverController
11505 JournalNode
11906 NodeManager
17757 Jps
11417 QuorumPeerMain
11600 NameNode
11692 DataNode
17104 Worker
Visit http://10.10.72.182:8080/ to check how Spark is running.
--------------------------------------------------------------------
II. Setting up a development environment with IntelliJ IDEA on Windows
References:
http://blog.csdn.net/ichsonx/article/details/44594567
http://ju.outofmemory.cn/entry/94851
1. Install Scala
http://www.scala-lang.org/download/
Download and install Scala. Because the Spark version here is 1.3, use Scala 2.10.5 rather than the latest release.
C:\Program Files (x86)\scala\bin
2. Install the Scala plugin in IntelliJ
Go to Plugins --> Browse repositories and search for the Scala plugin.
3. Download the Spark package from the official Spark site
Download the prebuilt distribution spark-1.3.0-bin-hadoop2.4.
Create a Scala project in IntelliJ IDEA, then choose "File" --> "Project Structure" --> "Libraries", click "+", and import the Spark assembly jar,
e.g. spark-assembly-1.3.0-hadoop2.4.0.jar (only this jar needs to be imported). If the IDE does not recognize the Scala library, import it the same way.
After that you can develop Scala programs.
E:\spark\spark-1.3.0-bin-hadoop2.4\spark-1.3.0-bin-hadoop2.4\lib\spark-assembly-1.3.0-hadoop2.4.0.jar
4. Create a Scala project
Reference: http://www.aboutyun.com/thread-12496-1-4.html
New Project
Choose SBT to create an SBT project.
Then create a new module.
Select the Scala SDK.
Add an Application entry in Run/Debug Configurations.
When creating a new class, select object.
val conf = new SparkConf().setAppName("SparkPi").setMaster("local")
Running locally
http://www.beanmoon.com/2014/10/11/%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8intellij%E6%90%AD%E5%BB%BAspark%E5%BC%80%E5%8F%91%E7%8E%AF%E5%A2%83%EF%BC%88%E4%B8%8B%EF%BC%89/
So far I have not found a way to have IntelliJ run Spark programs directly on the cluster; the usual approach is to package the finished Spark program into a jar in IntelliJ, then upload it to the cluster and run it with the spark-submit command.
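For reference, here is a minimal, self-contained sketch of a driver that runs locally, built around the SparkConf line above; the object name and the sample count are illustrative, not taken from the original project:

import org.apache.spark.{SparkConf, SparkContext}

import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkPi").setMaster("local")
    val sc = new SparkContext(conf)
    val n = 100000 // number of random samples; an illustrative value
    // Estimate pi by counting random points that fall inside the unit circle.
    val count = sc.parallelize(1 to n).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}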
5. Package and upload to Linux
References:
http://www.open-open.com/doc/view/ebf1c03582804927877b08597dc14c66
http://blog.csdn.net/javastart/article/details/43372977
Choose "File" --> "Project Structure" --> "Artifacts", then "+" --> "Jar" --> "From Modules with dependencies", select the main function, choose the output jar location in the dialog, and click "OK".
Check "Build on make" so that the jar is rebuilt automatically whenever the project is made.
D:\IDEA\idea_project_new\sparkTest\out\artifacts\sparkTest_jar is the output directory specified in the configuration.
Finally, choose "Build" --> "Build Artifacts" from the IDEA menu to compile and generate the jar.
Remove the Scala and Hadoop dependency jars from the artifact.
Upload the jar to /home/sparkTest/ on 10.10.72.182.
For Spark 1.3, the project has to be switched to Scala 2.10.
For a related error, see:
http://blog.csdn.net/u012432611/article/details/47274249
6. Verify the installation
Reference:
http://blog.csdn.net/jediael_lu/article/details/45310321
(1) Run a bundled example
$ bin/run-example org.apache.spark.examples.SparkPi
(2) Check the cluster environment
http://master:8080/
(3) Start spark-shell
$ spark-shell
(4) View jobs and other information
http://master:4040/jobs/
Jobs are submitted to the cluster with spark-submit:
tty:[0] jobs:[0] cwd:[/opt/spark1.3/bin]
16:48 [[email protected]]$ spark-submit --class main.java.com.laifeng.SparkPi --master spark://10.10.72.182:7077 /home/sparkTest/sparkTest.jar
Remove the scala-sdk-2.11 and spark-assembly-1.1.0-hadoop dependency jars first.
The --class parameter specifies the main class of the jar we just built, and the --master parameter specifies the master instance of our Spark cluster. For more spark-submit options, run spark-submit --help.
It runs successfully:
spark-submit --class main.java.com.laifeng.SparkPi --master spark://10.10.72.182:7077 /home/sparkTest/sparkTest.jar
7. The missing winutils.exe problem
http://www.tuicool.com/articles/iABZJj
After configuring this and restarting IDEA, the program runs successfully.
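A common code-level workaround is to set hadoop.home.dir before creating the SparkContext; a minimal sketch, where C:\hadoop is an assumed location (Hadoop expects winutils.exe in a bin subdirectory under it):

import org.apache.spark.{SparkConf, SparkContext}

object LocalTest {
  def main(args: Array[String]): Unit = {
    // On Windows, Hadoop looks for %HADOOP_HOME%\bin\winutils.exe;
    // C:\hadoop is an assumed location with winutils.exe in its bin folder.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val sc = new SparkContext(new SparkConf().setAppName("LocalTest").setMaster("local"))
    // Trivial job just to confirm the context works.
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}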
8. Problems encountered when running in local mode
https://archive.apache.org/dist/hadoop/common/hadoop-2.4.0/
--------------------------------------
Additional notes:
ZooKeeper commands
http://blog.csdn.net/xiaolang85/article/details/13021339
Check whether a node is the leader:
14:12 [[email protected]]$ echo stat | nc 127.0.0.1 2181
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
 /127.0.0.1:5979[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/8
Received: 16
Sent: 15
Connections: 1
Outstanding: 0
Zxid: 0xb00000011
Mode: follower
Node count: 10
tty:[0] jobs:[0] cwd:[/opt/zookeeper-3.4.6/bin]
14:18 [[email protected]]$ ./zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
spark-shell
tty:[0] jobs:[0] cwd:[/opt/spark1.3/bin]
17:16 [[email protected]]$ spark-shell
scala> sc.version
res0: String = 1.3.0
scala> sc.appName
res1: String = Spark shell
scala> :quit
---------------------------------------------------
Remote debugging
/opt/spark1.3/bin/spark-submit --class main.scala.com.laifeng.SparkWorldCount --master spark://10.10.72.182:7077 /home/sparkTest/laifeng-spark.jar
Reference:
http://blog.csdn.net/happyanger6/article/details/47065423
Method 1: via command-line options
Reference example:
bin/spark-submit --class sparksql.HiveOnSQL scalastudy.jar --driver-java-options -agentlib:jdwp=transport=dt_socket,address=9904,server=y,suspend=y
hadoop fs -rm -r /wuzhanwei/test/output1/
17:09 [[email protected]]$ /opt/spark1.3/bin/spark-submit --class main.scala.com.laifeng.SparkWorldCount --master spark://10.10.72.182:7077 /home/sparkTest/laifeng-spark.jar --driver-java-options -agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Listening for transport dt_socket at address: 8888
Now debug the project with our own modifications:
/opt/spark1.3/bin/spark-submit --class java.com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master spark://10.10.72.182:7077 /home/sparkTest/laifeng-spark.jar hdfs://ns1/input/clientupload20151027.csv hdfs://ns1/output2/output.csv --driver-java-options -agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y
Input: hdfs://ns1/input/clientupload20151027.csv
Output: hdfs://ns1/output2/output.csv
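For context, a minimal sketch of what a word-count driver such as SparkWorldCount might look like; the class name comes from the commands above, but the body is an assumption, with the input path read from args(0) and the output path from args(1) to match the two HDFS arguments passed after the jar:

import org.apache.spark.{SparkConf, SparkContext}

object SparkWorldCount {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit --master, so it is not hardcoded here.
    val conf = new SparkConf().setAppName("SparkWorldCount")
    val sc = new SparkContext(conf)
    // args(0): input, e.g. hdfs://ns1/input/clientupload20151027.csv
    // args(1): output, e.g. hdfs://ns1/output2/output.csv
    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}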
Method 2
Key references:
http://blog.csdn.net/javastart/article/details/43372977
http://blog.csdn.net/happyanger6/article/details/47065423
This requires stopping Spark, so it has not been adopted for now; the suggestion is to experiment with it after the current debugging is done.
tty:[0] jobs:[0] cwd:[/opt/spark1.3/bin]
16:56 [[email protected]]$ vim spark-clas
------------------------------------------------------------------
/opt/spark1.3/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master spark://10.10.72.182:7077 /home/sparkTest/laifeng-spark.jar hdfs://ns1/input/clientupload20151106.csv hdfs://ns1/output6
/opt/spark1.3/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master spark://10.10.72.182:7077 /home/sparkTest/laifeng-spark.jar hdfs://ns1/input/clientupload20151106.csv hdfs://ns1/output6 --driver-java-options -agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y
Tested and working:
/opt/spark1.3/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master yarn-cluster /home/sparkTest/laifeng-spark.jar hdfs://ns1/input/clientupload20151106.csv hdfs://ns1/output11 yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 2g
Data was produced successfully.
Demo program:
/opt/spark-onyarn/spark/default/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 4g /opt/spark-onyarn/spark/default/lib/spark-examples*.jar
Running jobs on YARN
Reference:
/opt/spark-onyarn/spark/default/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 2g /opt/spark-onyarn/spark/default/lib/spark-examples*.jar
Production:
/opt/spark-onyarn/spark/default/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master yarn-cluster /work/yule/linshi/spark/laifeng-spark.jar /workspace/yule/test/spark/clientupload20151105.csv /workspace/yule/test/spark/output4 yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 4g
Directory-based input:
/opt/spark-onyarn/spark/default/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master yarn-cluster /work/yule/linshi/spark/laifeng-spark.jar /workspace/yule/test/spark/ /workspace/yule/test/sparkoutput yarn-cluster --num-executors 3 --driver-memory 3g --executor-memory 6g
yarn-client behaves like an interactive command line: the code you submit is sent to YARN for execution while the driver stays local. yarn-cluster takes the program you have written, packaged as a jar, and submits it to YARN, which distributes the jar to the nodes and takes care of resource allocation and task management. See the sketch below.
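A sketch of the driver-side implication (the object name is illustrative): leave the master out of the code and let spark-submit choose the mode. Note that in yarn-cluster mode the driver itself runs inside a YARN container, so its console output ends up in the YARN logs rather than on the submitting machine:

import org.apache.spark.{SparkConf, SparkContext}

object YarnModeDemo {
  def main(args: Array[String]): Unit = {
    // No setMaster here: pass --master yarn-client or --master yarn-cluster
    // to spark-submit instead.
    val sc = new SparkContext(new SparkConf().setAppName("YarnModeDemo"))
    val total = sc.parallelize(1 to 1000).sum()
    // In yarn-cluster mode this println appears in the YARN container logs.
    println("total = " + total)
    sc.stop()
  }
}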
hdfs://youkuDfs/workspace/yule/test/spark/clientupload20151105.csv
Running-task UI: http://a01.master.spark.hadoop.qingdao.youku:8088
History server UI: http://a01.master.spark.hadoop.qingdao.youku:18080/
Actual parsing jobs
----------------------------------------------
laifeng-spark-clientup.jar
/opt/spark-onyarn/spark/default/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master yarn-cluster /work/yule/online/spark/laifeng-spark-clientup.jar /workspace/yule/test/spark/ /workspace/yule/test/sparkoutput yarn-cluster --num-executors 3 --driver-memory 3g --executor-memory 6g
/opt/spark-onyarn/spark/default/bin/spark-submit --class com.laifeng.ddshow.clientup.LaifengClientUpInfoAccessStat --master yarn-cluster /work/yule/online/spark/laifeng-spark-clientup.jar /source/ent/laifeng/clientupload/20151108/ /workspace/yule/test/sparkclientInfo/20151108/ yarn-cluster --num-executors 3 --driver-memory 3g --executor-memory 6g