HDFS: DistributedCache
When analyzing large datasets with MapReduce, a job often needs "auxiliary data" stored on HDFS. The usual approach is to load this data into a cache before the job starts to improve efficiency, but many map/reduce tasks concurrently reading the same HDFS file can become a performance bottleneck. DistributedCache helps solve this.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
Adding a cache file:
DistributedCache.addFileToClassPath(new Path(args[2]), conf);
The cache file path is supplied as a hadoop command-line argument; here args[2] is /group/tlog/resources/ipAppMapping.txt.
Using it in a mapper or reducer:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Paths registered via DistributedCache.addFileToClassPath in the job driver
    Path[] localArchives = DistributedCache.getFileClassPaths(context.getConfiguration());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream in;
    if (localArchives == null) {
        // Fall back to the classpath, e.g. when running or testing locally
        System.out.println("Loading resource file from the system class loader.");
        in = ClassLoader.getSystemResourceAsStream("ipAppMapping.txt");
    } else {
        in = fs.open(localArchives[0]);
    }
    if (in == null) {
        throw new RuntimeException("Resource file does not exist.");
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    // load the auxiliary data here
    reader.close();
}
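Once the BufferedReader is open, the "load the auxiliary data" step typically parses each line into an in-memory lookup map that the map/reduce methods consult. A minimal sketch of such a parser, assuming (hypothetically; the real layout of ipAppMapping.txt is not shown in this post) that each line is a tab-separated "ip<TAB>app" pair:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class IpAppMappingLoader {

    // Parse lines of the form "ip<TAB>app" into a lookup map.
    // The tab-separated two-column format is an assumption for
    // illustration; adapt it to the actual layout of ipAppMapping.txt.
    public static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> mapping = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    mapping.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
        return mapping;
    }

    public static void main(String[] args) throws IOException {
        String sample = "1.2.3.4\tappA\n5.6.7.8\tappB\n";
        Map<String, String> m = load(new StringReader(sample));
        System.out.println(m.get("1.2.3.4"));
        System.out.println(m.size());
    }
}
```

In setup() you would call this with new InputStreamReader(in) and keep the returned map in a field, so every call to map() or reduce() does a cheap in-memory lookup instead of touching HDFS.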