hadoop(5)Eclipse and Example

liujason

2014-07-11

Find the sample project hadoop-mapreduce-examples

Download the STS (Spring Tool Suite) IDE to work on the Java project.

The sample project is easyhadoop. It is built with Maven.

Here is the pom.xml with its dependencies.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sillycat</groupId>
  <artifactId>easyhadoop</artifactId>
  <version>1.0</version>
  <description>Hadoop MapReduce Example</description>
  <name>Hadoop MapReduce Examples</name>
  <packaging>jar</packaging>

  <properties>
    <hadoop.version>2.4.1</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.sillycat.easyhadoop.ExecutorDriver</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Here is the mapper class, which reads each line of the input files and emits a (word, 1) pair for every token.
package com.sillycat.easyhadoop.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

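public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

     // Shared constant value 1, emitted for every token.
     private final static IntWritable one = new IntWritable(1);

     // Reused for every token to avoid allocating a new object per word.
     private Text word = new Text();

     // Called once per input record. With the default TextInputFormat the
     // value is one line of text; the key is not used here.
     @Override
     public void map(Object key, Text value, Context context)
               throws IOException, InterruptedException {
          // Split the line on whitespace and emit (word, 1) for each token.
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
               word.set(itr.nextToken());
               context.write(word, one);
          }
     }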
}
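
To see what the mapper emits for one line without starting Hadoop, the tokenizing loop can be tried in plain Java. This is only a minimal sketch using the JDK; the class name TokenizeSketch and the sample sentence are made up for illustration and are not part of easyhadoop. It shows that StringTokenizer with its default delimiters splits on whitespace only, so punctuation stays attached to the word and upper and lower case are counted separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of easyhadoop: it mirrors the loop inside
// WordCountMapper.map() but collects the tokens in a list instead of
// writing them to the Hadoop context.
public class TokenizeSketch {

     static List<String> tokens(String line) {
          List<String> out = new ArrayList<>();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
               out.add(itr.nextToken());
          }
          return out;
     }

     public static void main(String[] args) {
          // Three tokens: "Hello," keeps its comma, and "Hello,"
          // and "hello" are different words for the counter.
          System.out.println(tokens("Hello, hello  world"));
     }
}
```

If cleaner counts are needed, the line could be lower-cased and stripped of punctuation before tokenizing, but the example above leaves it as it is.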

Here is the reducer class. For each word it receives all the counts emitted by the mappers, sums them, and writes the total.
package com.sillycat.easyhadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
          Reducer<Text, IntWritable, Text, IntWritable> {
     // Reused output value holding the total for the current word.
     private IntWritable result = new IntWritable();

     // Called once per distinct word with all counts collected for it.
     @Override
     public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
               sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
     }
}
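
Between the mapper and the reducer, Hadoop groups all emitted values by key and hands each group to reduce(). The following is a rough JDK-only model of that flow, just to make the data shapes visible. The class ShuffleReduceSketch and its method names are hypothetical, and a TreeMap stands in for the sorting and grouping that the framework really does across machines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical JDK-only model of the map -> shuffle -> reduce flow.
public class ShuffleReduceSketch {

     // Map + shuffle: emit a 1 per token and group the 1s by word.
     static Map<String, List<Integer>> mapAndGroup(List<String> lines) {
          Map<String, List<Integer>> grouped = new TreeMap<>();
          for (String line : lines) {
               StringTokenizer itr = new StringTokenizer(line);
               while (itr.hasMoreTokens()) {
                    String word = itr.nextToken();
                    List<Integer> ones = grouped.get(word);
                    if (ones == null) {
                         ones = new ArrayList<>();
                         grouped.put(word, ones);
                    }
                    ones.add(1);
               }
          }
          return grouped;
     }

     // Reduce: the same summing loop as in WordCountReducer.reduce().
     static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
          Map<String, Integer> result = new TreeMap<>();
          for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
               int sum = 0;
               for (int val : e.getValue()) {
                    sum += val;
               }
               result.put(e.getKey(), sum);
          }
          return result;
     }

     public static void main(String[] args) {
          List<String> lines = Arrays.asList("a b a", "b c");
          System.out.println(mapAndGroup(lines)); // {a=[1, 1], b=[1, 1], c=[1]}
          System.out.println(reduce(mapAndGroup(lines))); // {a=2, b=2, c=1}
     }
}
```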

Here is the main class, which configures and runs the word count job.
package com.sillycat.easyhadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

     public static void main(String[] args) throws IOException,
               ClassNotFoundException, InterruptedException {

          Configuration conf = new Configuration();
          // Consume the generic Hadoop options (such as -D or -libjars)
          // and keep only the application arguments.
          String[] otherArgs = new GenericOptionsParser(conf, args)
                    .getRemainingArgs();
          if (otherArgs.length != 2) {
               System.err.println("Usage: wordcount <in> <out>");
               System.exit(2);
          }
          Job job = Job.getInstance(conf, "word count");

          job.setJarByClass(WordCount.class);
          job.setMapperClass(WordCountMapper.class);
          // The reducer doubles as a combiner to pre-sum counts on the map side.
          job.setCombinerClass(WordCountReducer.class);
          job.setReducerClass(WordCountReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
          // The output path must not exist yet, otherwise the job fails.
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
          // Block until the job finishes: exit code 0 on success, 1 on failure.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
     }

}
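
The job registers WordCountReducer as the combiner as well as the reducer. That is safe here because summing is associative and commutative: adding up partial sums gives the same total as adding up the raw 1s. A small JDK-only sketch of that reasoning follows; the class CombinerSketch and the two pretend mapper outputs are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a summing reducer can also be a combiner.
public class CombinerSketch {

     // Same logic as the loop in WordCountReducer.reduce().
     static int sum(List<Integer> values) {
          int total = 0;
          for (int v : values) {
               total += v;
          }
          return total;
     }

     public static void main(String[] args) {
          // Counts for one word as emitted by two separate map tasks.
          List<Integer> mapper1 = Arrays.asList(1, 1, 1);
          List<Integer> mapper2 = Arrays.asList(1, 1);

          // Without a combiner the reducer receives all five 1s.
          int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1));

          // With a combiner each map task pre-sums locally, so the
          // reducer only receives two partial sums (3 and 2).
          int withCombiner = sum(Arrays.asList(sum(mapper1), sum(mapper2)));

          System.out.println(withoutCombiner + " " + withCombiner); // 5 5
     }
}
```

Hadoop treats the combiner as an optimization and may run it zero, one, or several times per map task, so a job should never depend on it for correctness. A reducer that computes something like an average could not be reused this way without changes.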

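Once we create the input/wordcount directory (with some text files in it) and the output directory, we can run the job directly in Eclipse. The output/wordcount subdirectory itself must not exist yet, because Hadoop refuses to write to an existing output path.
Just create a run configuration of type Java Application and add these program arguments:
input/wordcount output/wordcount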

Add the environment variable
HADOOP_HOME=/opt/hadoop
or set the JVM system property in the VM arguments
-Dhadoop.home.dir=/opt/hadoop
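
The two settings are not the same kind of thing: HADOOP_HOME is an environment variable, while hadoop.home.dir is a Java system property. As far as I understand, Hadoop checks the system property first and falls back to the environment variable. The sketch below only imitates that lookup order with the JDK; the class HadoopHomeSketch is hypothetical and is not Hadoop's own code.

```java
// Hypothetical sketch of the lookup order, assuming the system property
// hadoop.home.dir wins over the environment variable HADOOP_HOME.
public class HadoopHomeSketch {

     // Pure helper so the precedence rule can be checked in isolation.
     static String resolve(String systemProperty, String envValue) {
          return systemProperty != null ? systemProperty : envValue;
     }

     // Reads the real JVM property and environment; may return null
     // when neither is set.
     static String hadoopHome() {
          return resolve(System.getProperty("hadoop.home.dir"),
                    System.getenv("HADOOP_HOME"));
     }

     public static void main(String[] args) {
          System.out.println("Hadoop home: " + hadoopHome());
     }
}
```

Running this class from the same Eclipse run configuration is a quick way to check that the setting actually reaches the JVM.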

To run it on the multi-machine cluster, I need to build the jar with Maven.
>mvn clean install

First put the jar under the directory /opt/hadoop/share/custom. Here is how it runs on the local machine:
>hadoop jar /opt/hadoop/share/custom/easyhadoop-1.0.jar wordcount input output
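
The first argument, wordcount, is not consumed by the WordCount class shown earlier, which expects exactly two arguments. The jar manifest names com.sillycat.easyhadoop.ExecutorDriver as the main class, and its source is not listed in this post. Presumably it maps a program name to a main class and passes on the remaining arguments, which is what Hadoop's own org.apache.hadoop.util.ProgramDriver does for the bundled examples jar. Below is a JDK-only sketch of that idea. DriverSketch, its Program interface, and its methods are all hypothetical names, not the real ExecutorDriver.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical name-to-program dispatcher, not the real ExecutorDriver.
public class DriverSketch {

     // A program is anything with a main-style entry point.
     interface Program {
          void run(String[] args) throws Exception;
     }

     private final Map<String, Program> programs = new LinkedHashMap<>();

     void register(String name, Program program) {
          programs.put(name, program);
     }

     // Returns 0 on success, -1 for a missing or unknown program name,
     // and 1 if the program itself throws.
     int dispatch(String[] args) {
          if (args.length == 0 || !programs.containsKey(args[0])) {
               System.err.println("Valid program names: " + programs.keySet());
               return -1;
          }
          // Strip the program name; the rest goes to the program itself.
          String[] rest = Arrays.copyOfRange(args, 1, args.length);
          try {
               programs.get(args[0]).run(rest);
               return 0;
          } catch (Exception e) {
               e.printStackTrace();
               return 1;
          }
     }

     public static void main(String[] args) {
          DriverSketch driver = new DriverSketch();
          // A real driver would delegate to WordCount.main(rest) here.
          driver.register("wordcount", rest ->
                    System.out.println("wordcount got " + Arrays.toString(rest)));
          String[] demo = {"wordcount", "input", "output"};
          System.out.println("exit code " + driver.dispatch(demo));
     }
}
```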

On ubuntu-master, place the jar under the /opt/hadoop/share/custom directory.
Then start all the services.
>sbin/start-dfs.sh
>sbin/start-yarn.sh
>sbin/mr-jobhistory-daemon.sh start historyserver

I had already put my input files into HDFS with these commands:
>hadoop fs -mkdir -p /data/worldcount
>hadoop fs -put /opt/hadoop/etc/hadoop/*.xml /data/worldcount/

I can directly run my jar
>hadoop jar /opt/hadoop/share/custom/easyhadoop-1.0.jar wordcount /data/worldcount /output/worldcount2

And this will show me the result
>hadoop fs -cat /output/worldcount2/*
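
With the default text output format, each result line is a word, a tab character, and its count. If the result needs further processing, the lines can be parsed with a few lines of plain Java. This is a minimal sketch under that assumption; the class OutputSketch and the sample lines are invented for illustration and are not real job output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for "word<TAB>count" lines.
public class OutputSketch {

     static Map<String, Integer> parse(List<String> lines) {
          Map<String, Integer> counts = new LinkedHashMap<>();
          for (String line : lines) {
               int tab = line.lastIndexOf('\t');
               if (tab < 0) {
                    continue; // skip lines that are not in key<TAB>value form
               }
               counts.put(line.substring(0, tab),
                         Integer.parseInt(line.substring(tab + 1).trim()));
          }
          return counts;
     }

     // Returns the word with the highest count, or null for empty input.
     static String mostFrequent(Map<String, Integer> counts) {
          String best = null;
          int bestCount = -1;
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
               if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
               }
          }
          return best;
     }

     public static void main(String[] args) {
          // Made-up sample lines, not real job output.
          List<String> lines = Arrays.asList("name\t7", "the\t12", "value\t3");
          System.out.println(mostFrequent(parse(lines))); // the
     }
}
```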

My goal here was just to get to know Hadoop and the MapReduce framework. In the end I expect to use HBase and Spark, so I did not try mappers and reducers that read from or write to a database.

