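Building a package with IDEA + Maven + Scala and running it on Spark on YARN

Configuring the Maven project

In pom.xml, declare the packages needed for Spark development. Look up the artifacts that match your Spark version in the Maven Central repository. The `_2.11` suffix of the artifactId is the Scala binary version, so it must match the Scala version the project is compiled with.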

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
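To pick the right artifact suffix, it helps to know which Scala version the project actually runs on. The following is a minimal sketch, not part of the original project: `ScalaBinaryVersion` and `binaryVersion` are hypothetical names, and the two-component rule it applies holds for Scala 2.x (the line Spark 2.3.1 is built against).

```scala
object ScalaBinaryVersion {
  // Keeps the first two components of a full version string, e.g. "2.11.12" -> "2.11".
  // Assumption: Scala 2.x versioning, where the binary version is major.minor.
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library on the current classpath
    val full = scala.util.Properties.versionNumberString
    println(s"Scala $full -> artifact suffix _${binaryVersion(full)}")
  }
}
```

If the printed suffix differs from the one in the artifactId, change either the dependency or the project's Scala SDK so the two agree.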

Build methods

Building the package with IDEA Artifacts

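Building the package with Maven

  • To build the package with Maven, just add the following plugin (maven-shade-plugin) to pom.xml, and set mainClass to your own entry class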
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <transformers>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.handlers</resource>
                    </transformer>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.schemas</resource>
                    </transformer>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>cn.mucang.sensor.SensorMain</mainClass>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

Example Scala code

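import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object InfoOutput {
  def main(args: Array[String]): Unit = {
    // Do not hard-code the master: a setMaster call in code overrides --master from spark-submit
    val sparkConf = new SparkConf().setAppName("NginxLog")
    val sc = new SparkContext(sparkConf)
    val fd = sc.textFile("hdfs:///xxx/logs/access.log")
    // Keep only lines that mention .baidu.com and split each one on spaces
    val logRDD = fd.filter(_.contains(".baidu.com")).map(_.split(" "))
    logRDD.persist(StorageLevel.DISK_ONLY)
    // countByValue returns an unordered local Map on the driver, so sort by count before taking the top 10
    val ipTop = logRDD.map(v => v(2)).countByValue().toSeq.sortBy(-_._2).take(10)
    ipTop.foreach(println)
    sc.stop()
  }
}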
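Before packaging, the filter, split, and count logic can be checked on plain Scala collections without a cluster. The sketch below is only an illustration under stated assumptions: `TopIpSketch` and `topIps` are hypothetical names, the sample log lines are made up, and the IP is assumed to be the third space-separated field because the job above reads `v(2)`.

```scala
object TopIpSketch {
  // Counts the third space-separated field of every line containing `domain`
  // and returns the n most frequent values (ties broken by the value itself).
  def topIps(lines: Seq[String], domain: String, n: Int): Seq[(String, Int)] =
    lines
      .filter(_.contains(domain))
      .map(_.split(" "))
      .filter(_.length > 2) // skip malformed lines that have no third field
      .map(fields => fields(2))
      .groupBy(identity)
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy { case (ip, count) => (-count, ip) }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; a real access log will have a different layout
    val sample = Seq(
      "- - 10.0.0.1 GET http://www.baidu.com/a",
      "- - 10.0.0.2 GET http://www.baidu.com/b",
      "- - 10.0.0.1 GET http://tieba.baidu.com/c",
      "- - 10.0.0.3 GET http://example.com/d"
    )
    topIps(sample, ".baidu.com", 2).foreach(println)
  }
}
```

Unlike the Spark job, this version drops lines with fewer than three fields instead of failing on them; whether the real job needs the same guard depends on how clean the log is.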

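Uploading the Jar

  • Use scp to upload the Jar to the server where spark-submit runs. When built with IDEA Artifacts, the Jar is under the project's out directory; a Maven build writes it to target instead
  • Because the job has no third-party dependencies, the resulting jar is small. Submit the job with spark-submit: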
spark-submit --class InfoOutput --verbose --master yarn --deploy-mode cluster nginxlogs.jar
