Recompiling and installing the Spark assembly so that CDH 5.5.1 supports Spark SQL
The Spark version bundled with CDH does not support spark-sql or SparkR. To use them, the Hive-related dependencies must be built into the Spark assembly jar. Below are the build and installation steps for spark-sql.
I. Prepare the build environment on any Linux machine
spark-1.5.0.tgz, download from: https://spark.apache.org/downloads.html
jdk1.7.0_79
scala-2.10.4
apache-maven-3.3.9
These versions follow the requirements on the Spark website; for details see: https://spark.apache.org/docs/
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.5.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary;
Configure the environment variables as follows, then apply them with: source /etc/profile
export JAVA_HOME=/data/jdk1.7.0_79
export M2_HOME=/data/apache-maven-3.3.9
export SCALA_HOME=/data/scala-2.10.4
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
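Before building, it's worth a quick sanity check that the toolchain resolves to the versions above (standard version flags, nothing CDH-specific):
java -version    # expect 1.7.0_79
mvn -version     # expect Apache Maven 3.3.9
scala -version   # expect Scala 2.10.4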
II. Build steps
For more details, see the official build guide: https://spark.apache.org/docs/1.5.0/building-spark.html
1. Increase the memory available to Maven, since the build is complex and time-consuming:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
2. Unpack spark-1.5.0.tgz (e.g. into /data), then start the build in the background with nohup mvn, sending the output to a log file:
nohup mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 -Dscala-2.10 -Phive -Phive-thriftserver -DskipTests clean package > ./spark-mvn-`date +%Y%m%d%H`.log 2>&1 &
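Since the build runs in the background, its progress can be followed in the log file created by the redirect above:
tail -f spark-mvn-*.log    # Ctrl-C stops tailing, not the build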
The first build takes 2-3 hours depending on network conditions (it took me several attempts before it succeeded). The tail of a successful build log looks like this:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.200 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  8.887 s]
[INFO] Spark Project Networking ........................... SUCCESS [  8.270 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  4.832 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  6.082 s]
[INFO] Spark Project Core ................................. SUCCESS [01:52 min]
[INFO] Spark Project Bagel ................................ SUCCESS [  5.129 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 13.442 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 30.683 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [ 43.622 s]
[INFO] Spark Project SQL .................................. SUCCESS [ 53.463 s]
[INFO] Spark Project ML Library ........................... SUCCESS [01:06 min]
[INFO] Spark Project Tools ................................ SUCCESS [  2.225 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 42.020 s]
[INFO] Spark Project REPL ................................. SUCCESS [  8.500 s]
[INFO] Spark Project YARN ................................. SUCCESS [  9.665 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [  7.255 s]
[INFO] Spark Project Assembly ............................. SUCCESS [02:15 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [  7.330 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [  5.103 s]
[INFO] Spark Project External Flume ....................... SUCCESS [  8.405 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [  2.928 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 15.932 s]
[INFO] Spark Project External MQTT Assembly ............... SUCCESS [  7.792 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [  6.057 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [ 10.135 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:49 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [  8.111 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [  5.814 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:28 min
[INFO] Finished at: 2016-07-26T16:05:11+08:00
[INFO] Final Memory: 90M/1589M
[INFO] ------------------------------------------------------------------------
The generated Spark assembly jar will then be found at:
/data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
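As a sanity check that the Hive dependencies really were baked into the assembly (the whole point of this rebuild), list the jar's contents; org/apache/hadoop/hive is the standard Hive class package prefix:
cd /data/spark-1.5.0/assembly/target/scala-2.10
# a successful -Phive build prints a large non-zero count here
jar tf spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar | grep -c 'org/apache/hadoop/hive'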
III. Install the Spark assembly
1. Copy the assembly jar
Copy the jar over to the CDH machine 180.153.*.*, e.g. into /home/hadoop (the root user in the scp below is an assumption; use whatever account you normally copy files with):
scp -P 50201 /data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar root@180.153.*.*:/home/hadoop
Then copy it into CDH's jars directory; if a jar of the same name already exists there, back it up and remove it first (a sketch follows the cp command):
cp -p /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
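A minimal sketch of the backup step mentioned above, to run before the cp if the stock assembly jar is already in place (the .bak suffix is just a convention):
cd /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
# keep the original jar around in case a rollback is needed
mv spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar.bak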
2. Replace the assembly jar under Spark in CDH
This really just means re-pointing the spark-assembly.jar symlink at the new spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar in CDH's jars directory. The symlinks live in /opt/cloudera/parcels/CDH/lib/spark/lib; remove the old ones and create the new links:
cd /opt/cloudera/parcels/CDH/lib/spark/lib
rm spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly.jar
ln -s ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
ln -s spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly.jar
Check the symlinks:
[root@db1 lib]# ll
total 209204
-rw-r--r-- 1 root root 21645 Dec  3  2015 python.tar.gz
lrwxrwxrwx 1 root root 68 Jan 14  2016 spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 54 Jan 14  2016 spark-assembly.jar -> spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 68 Jan 14  2016 spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 54 Jan 14  2016 spark-examples.jar -> spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
[root@db1 lib]#
3. Copy the spark-sql launch script
Copy it from bin in the Spark source tree to CDH's spark bin directory (the root user below is an assumption, as above):
scp -P 50201 /data/spark-1.5.0/bin/spark-sql root@180.153.*.*:/opt/cloudera/parcels/CDH/lib/spark/bin
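In line with the tips at the end, make sure the copied script stays readable and executable (755 here is an assumption; match the other scripts in that bin directory):
chmod 755 /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql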
4. Configure environment variables (e.g. in /etc/profile, then source it):
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CMD=/opt/cloudera/parcels/CDH/bin/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
5. Copy the assembly jar to HDFS
Copy the assembly jar to the /user/spark/share/lib directory on HDFS and change its permissions to 755:
hadoop fs -put /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /user/spark/share/lib
hadoop fs -chmod 755 /user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
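If /user/spark/share/lib does not yet exist on HDFS, create it before the put; a listing afterwards confirms the jar and its permissions (a small sketch, paths as above):
hadoop fs -mkdir -p /user/spark/share/lib    # only needed if the directory is missing
hadoop fs -ls /user/spark/share/lib          # verify the jar landed with 755 permissions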
6. Configure in Cloudera Manager
- Log in to CM and set the Spark service-wide assembly jar setting to the jar's path in HDFS:
/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
- Modify Spark's advanced configuration (safety valve) snippets:
spark.yarn.jar=hdfs://bestCluster/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
export HIVE_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hive/conf
- Click Save Changes, then deploy the client configuration.
7. Run spark-sql
Since the environment variables are set, spark-sql can be run from any directory:
[hadoop@db1 ~]$ spark-sql
... ...
16/07/27 16:04:52 INFO metastore: Trying to connect to metastore with URI thrift://nn1.hadoop:9083
16/07/27 16:04:52 INFO metastore: Connected to metastore.
16/07/27 16:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/462a9698-5bb6-4d17-bce3-9e162cfd40f8_resources
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8/_tmp_space.db
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql>
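As a quick smoke test, any simple statement at the prompt will do; show databases exercises the Hive metastore connection shown in the log above:
spark-sql> show databases;
spark-sql> select 1;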
Tips:
1. Grant read/write permissions to any newly created or copied files.
2. Before replacing an existing file, check its owner, any symlinks pointing to it, and other details.
That's all, done!