Hadoop: Processing Different Input Files and Joining Them

2014-04-12

Type 1: one-to-one correspondence

file1:

a 1

b 2

c 3

file2:

1 !

2 @

3 #

Join file1 and file2. The desired result:

a !

b @

c #

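Approach:

1. Tag each record according to the input file it comes from.

2. Swap the key and value of each file1 record, so that file1 and file2 records share the same key. In the reducer, emit the file1 value as the output key and the file2 value as the output value.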
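The two steps above can be sketched in plain Java, without any Hadoop classes, to show what the mapper emits and how the shuffle groups it. This is only an illustration under assumptions: the class name MapTagSketch and the helpers tag and shuffle are hypothetical, and the shuffle is simulated with an in-memory TreeMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map-side tagging; no Hadoop classes are used.
final class MapTagSketch {

    // Mimics the map step: returns {joinKey, taggedValue} for one input line.
    static String[] tag(String fileName, String line) {
        String[] parts = line.trim().split("\\s+");
        if ("file1".equals(fileName)) {
            // file1 line "a 1": swap, so the number becomes the join key.
            return new String[] { parts[1], "file1_" + parts[0] };
        }
        // file2 line "1 !": the number is already the join key.
        return new String[] { parts[0], "file2_" + parts[1] };
    }

    // Mimics the shuffle: groups tagged values by join key, sorted by key.
    static Map<String, List<String>> shuffle(List<String[]> mapped) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : mapped) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        for (String line : Arrays.asList("a 1", "b 2", "c 3")) {
            mapped.add(tag("file1", line));
        }
        for (String line : Arrays.asList("1 !", "2 @", "3 #")) {
            mapped.add(tag("file2", line));
        }
        // Prints {1=[file1_a, file2_!], 2=[file1_b, file2_@], 3=[file1_c, file2_#]}
        System.out.println(shuffle(mapped));
    }
}
```

In a real job the order of values within one key group is not guaranteed, which is why each value carries a tag naming its source file.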

Program:

package smiple;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class FileJoin {

    public static class MyMap extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Decode the line as GBK; Text.toString() would assume UTF-8.
            // String line = value.toString();
            String line = new String(value.getBytes(), 0, value.getLength(), "GBK");
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Skip blank or malformed lines that do not hold two tokens.
            if (tokenizer.countTokens() < 2) {
                return;
            }
            String keystr = tokenizer.nextToken();
            String valuestr = tokenizer.nextToken();

            // Get the name of the file this record comes from.
            InputSplit inputSplit = context.getInputSplit();
            String fileName = ((FileSplit) inputSplit).getPath().getName();

            if ("file1".equals(fileName)) {
                // file1: swap key and value, and tag the value with its source.
                context.write(new Text(valuestr), new Text("file1_" + keystr));
            } else if ("file2".equals(fileName)) {
                // file2: keep the key, and tag the value with its source.
                context.write(new Text(keystr), new Text("file2_" + valuestr));
            }
        }
    }

    public static class MyReduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Placeholders that remain if one side has no record for this key.
            Text resultKey = new Text("key0");
            Text resultValue = new Text("value0");
            for (Text val : values) {
                String tagged = val.toString();
                if (tagged.startsWith("file1_")) {
                    resultKey = new Text(tagged.substring(6));
                } else if (tagged.startsWith("file2_")) {
                    resultValue = new Text(tagged.substring(6));
                }
            }
            System.out.println(resultKey.toString() + " " + resultValue.toString());
            context.write(resultKey, resultValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input and output paths are hard-coded; ip:port is a placeholder.
        String[] ioArgs = new String[] { "hdfs://ip:port/mr/join/in", "hdfs://ip:port/mr/join/out" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();

        if (otherArgs.length != 2) {
            System.err.println("Usage: FileJoin <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "file join");

        job.setJarByClass(FileJoin.class);

        // Set the Map and Reduce classes.
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);

        // Set the output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output directories.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
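The reduce step of the program can also be tried in isolation. The following is a minimal sketch in plain Java, assuming the tagged values for one key have already been collected into a list; the class name ReduceJoinSketch and the method join are hypothetical and not part of the Hadoop API. It keeps the program's "key0" and "value0" placeholders for a side that has no record.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the reduce step; no Hadoop classes are used.
final class ReduceJoinSketch {

    // Takes the tagged values collected for one join key and returns
    // {outputKey, outputValue}. A side with no record keeps its placeholder,
    // mirroring the "key0" / "value0" defaults in the program.
    static String[] join(List<String> taggedValues) {
        String outKey = "key0";
        String outValue = "value0";
        for (String v : taggedValues) {
            if (v.startsWith("file1_")) {
                outKey = v.substring("file1_".length());
            } else if (v.startsWith("file2_")) {
                outValue = v.substring("file2_".length());
            }
        }
        return new String[] { outKey, outValue };
    }

    public static void main(String[] args) {
        // Both sides present, in arbitrary order: prints "a !"
        String[] matched = join(Arrays.asList("file2_!", "file1_a"));
        System.out.println(matched[0] + " " + matched[1]);

        // Only a file1 record for this key: prints "d value0"
        String[] unmatched = join(Arrays.asList("file1_d"));
        System.out.println(unmatched[0] + " " + unmatched[1]);
    }
}
```

Because the tags identify the source file, the result does not depend on the order in which the values arrive. With a one-to-one join each key has at most one value per side; if a side had several records, only the last one seen would be kept.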

Result:

[Image: job output]
