Recompiling and installing the Spark assembly so that CDH 5.5.1 supports Spark SQL

The Spark bundled with CDH does not support spark-sql or SparkR. To use them, the Hive-related dependencies must be built into the Spark assembly jar. The steps below cover compiling and installing an assembly with spark-sql support.

I. Prepare the build environment (any Linux machine)

spark-1.5.0.tgz, downloadable from https://spark.apache.org/downloads.html

JDK 1.7.0_79

Scala 2.10.4

Maven 3.3.9

These versions follow the requirements on the Spark site, quoted below; see https://spark.apache.org/docs/ for details.

Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.5.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).

Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary;

Configure the environment variables as follows, then apply them with source /etc/profile:

export JAVA_HOME=/data/jdk1.7.0_79
export M2_HOME=/data/apache-maven-3.3.9
export SCALA_HOME=/data/scala-2.10.4
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
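Before kicking off a multi-hour build, it is worth sanity-checking that each tool directory actually exists. A minimal sketch, assuming the example paths above (check_home is a throwaway helper, not part of Spark or Maven):

```shell
#!/bin/sh
# Report whether each build prerequisite is present at its expected home.
# Paths default to the example /etc/profile values; override via env vars.
check_home() {
  # $1 = label, $2 = expected installation directory
  if [ -d "$2/bin" ]; then
    echo "$1 OK: $2"
  else
    echo "$1 missing: $2"
  fi
}

check_home JAVA_HOME  "${JAVA_HOME:-/data/jdk1.7.0_79}"
check_home M2_HOME    "${M2_HOME:-/data/apache-maven-3.3.9}"
check_home SCALA_HOME "${SCALA_HOME:-/data/scala-2.10.4}"
```

If any line prints "missing", fix the path in /etc/profile before starting the build.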

II. Build steps

See the official build guide for details: https://spark.apache.org/docs/1.5.0/building-spark.html

1. Raise Maven's memory limits first; the build is large and long-running, and the defaults are too small:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

2. Unpack spark-1.5.0.tgz (e.g. under /data), then launch the build in the background with nohup, redirecting output to a log file. Note that the profile is -Phadoop-2.6 (lowercase) and hadoop.version takes the bare version string; Scala 2.10 is Spark 1.5's default, so no Scala flag is needed:

nohup mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests clean package > ./spark-mvn-`date +%Y%m%d%H`.log 2>&1 &

A first build takes 2-3 hours depending on network conditions (mine took several attempts before succeeding). The tail of a successful build log looks like this:

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.200 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  8.887 s]
[INFO] Spark Project Networking ........................... SUCCESS [  8.270 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  4.832 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  6.082 s]
[INFO] Spark Project Core ................................. SUCCESS [01:52 min]
[INFO] Spark Project Bagel ................................ SUCCESS [  5.129 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 13.442 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 30.683 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [ 43.622 s]
[INFO] Spark Project SQL .................................. SUCCESS [ 53.463 s]
[INFO] Spark Project ML Library ........................... SUCCESS [01:06 min]
[INFO] Spark Project Tools ................................ SUCCESS [  2.225 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 42.020 s]
[INFO] Spark Project REPL ................................. SUCCESS [  8.500 s]
[INFO] Spark Project YARN ................................. SUCCESS [  9.665 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [  7.255 s]
[INFO] Spark Project Assembly ............................. SUCCESS [02:15 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [  7.330 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [  5.103 s]
[INFO] Spark Project External Flume ....................... SUCCESS [  8.405 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [  2.928 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 15.932 s]
[INFO] Spark Project External MQTT Assembly ............... SUCCESS [  7.792 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [  6.057 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [ 10.135 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:49 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [  8.111 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [  5.814 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:28 min
[INFO] Finished at: 2016-07-26T16:05:11+08:00
[INFO] Final Memory: 90M/1589M
[INFO] ------------------------------------------------------------------------

The generated Spark assembly jar will be at:

/data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

III. Install the Spark assembly

1. Copy the assembly jar

Copy the jar to the CDH machine (180.153..), e.g. into /home/hadoop:

scp -P 50201 /data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar [email protected].*.*:/home/hadoop

Then copy it into CDH's jars directory; if a jar of the same name already exists there, back it up and remove it first:

cp -p /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
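Because the assembly jar is large, a truncated scp can fail in confusing ways much later; comparing checksums on both ends is cheap insurance. A sketch (same_checksum is an illustrative helper; in practice, run md5sum on each host and compare the digests):

```shell
#!/bin/sh
# Compare the md5 digests of two local files; prints "match" or "differ".
same_checksum() {
  a=$(md5sum "$1" | awk '{print $1}')
  b=$(md5sum "$2" | awk '{print $1}')
  if [ "$a" = "$b" ]; then echo match; else echo differ; fi
}

# Demo with two scratch files standing in for the source and copied jar:
f1=$(mktemp); f2=$(mktemp)
echo spark > "$f1"
cp "$f1" "$f2"
same_checksum "$f1" "$f2"   # -> match
```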

2. Replace the assembly jar in CDH's Spark

This boils down to repointing the spark-assembly.jar symlink at the new spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar in the CDH jars directory. The symlinks live in /opt/cloudera/parcels/CDH/lib/spark/lib; delete the old links and create new ones:

ln -s ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
ln -s spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly.jar

Verify the symlinks:

[root@db1 lib]# ll
total 209204
-rw-r--r-- 1 root root     21645 Dec  3  2015 python.tar.gz
lrwxrwxrwx 1 root root        68 Jan 14  2016 spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        54 Jan 14  2016 spark-assembly.jar -> spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        68 Jan 14  2016 spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root        54 Jan 14  2016 spark-examples.jar -> spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
[root@db1 lib]#
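The link layout can be rehearsed in a scratch directory before touching the parcel, which makes the relative ../../../jars/... path easier to verify. A sketch with placeholder file names:

```shell
#!/bin/sh
# Rehearse the symlink swap in a throwaway tree that mirrors
# /opt/cloudera/parcels/CDH/{jars,lib/spark/lib}. Names are placeholders.
set -e
work=$(mktemp -d)
mkdir -p "$work/jars" "$work/lib/spark/lib"
touch "$work/jars/spark-assembly-new.jar"

cd "$work/lib/spark/lib"
# Versioned link into the parcel-level jars directory...
ln -s ../../../jars/spark-assembly-new.jar spark-assembly-new.jar
# ...and the stable alias that the launch scripts reference.
ln -sf spark-assembly-new.jar spark-assembly.jar

readlink spark-assembly.jar   # -> spark-assembly-new.jar
```

From lib/spark/lib, three ../ steps reach the parcel root, so ../../../jars resolves correctly; the same relative path is what keeps the real links valid when the parcel directory is accessed through the /opt/cloudera/parcels/CDH symlink.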

3. Copy the spark-sql launch script

Copy it from the Spark source tree's bin directory into CDH's Spark bin directory:

scp -P 50201 /data/spark-1.5.0/bin/spark-sql [email protected].*.*:/opt/cloudera/parcels/CDH/lib/spark/bin

4. Configure environment variables (on the CDH host)

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CMD=/opt/cloudera/parcels/CDH/bin/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin

5. Upload the assembly jar to HDFS

Upload the assembly jar to /user/spark/share/lib on HDFS and set its permissions to 755:

hadoop fs -put /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /user/spark/share/lib
hadoop fs -chmod 755 /user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

6. Configure in Cloudera Manager

  • Log in to CM and set the Spark service-wide configuration for the assembly jar to its HDFS path (this must match the jar uploaded in step 5):
/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar


  • In Spark's advanced configuration, point spark.yarn.jar at the same HDFS jar and export HIVE_CONF_DIR so spark-sql can find the Hive configuration:

spark.yarn.jar=hdfs://bestCluster/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar

export HIVE_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hive/conf


  • Click Save Changes, then deploy the client configuration.

7. Run spark-sql

With the environment variables configured, spark-sql can be launched from any directory:

[hadoop@db1 ~]$ spark-sql
...
...
16/07/27 16:04:52 INFO metastore: Trying to connect to metastore with URI thrift://nn1.hadoop:9083
16/07/27 16:04:52 INFO metastore: Connected to metastore.
16/07/27 16:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/462a9698-5bb6-4d17-bce3-9e162cfd40f8_resources
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8/_tmp_space.db
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql>

Tips:

1. Make sure newly created or copied files are granted read/write permissions.

2. Before replacing an existing file, check its owner, permissions, and any symlinks that point to it.

That's all!
