Hadoop (2): Hadoop's Hello World (Installation and Use in Standalone Mode)
This article is also published on my personal blog, liaosi's blog: Hadoop (2): Hadoop's Hello World (Installation and Use in Standalone Mode). The examples in this article use a VMWare virtual machine running 64-bit CentOS 7, Hadoop 2.8.2, and JDK 1.8, logged in as the previously created hadoop account (see Hadoop (1): Introduction to Hadoop and Pre-Installation Preparation).
Before installing Hadoop, make sure the Java JDK is installed on the system and the Java environment variables are configured.
Hadoop can be started in three modes:
- Standalone mode: by default, a freshly downloaded Hadoop is configured to run in non-distributed mode as a single Java process. This mode is well suited to debugging.
- Pseudo-distributed mode: Hadoop can run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon (HDFS, YARN, MapReduce, and so on) runs as a separate Java process. This mode is suitable for development.
- Fully distributed mode: a true distributed deployment, which requires a cluster of several independent servers.
This article walks through standalone mode.
I. Downloading and Configuring Hadoop
1. Download Hadoop from the official site at http://hadoop.apache.org/rele... and unpack it into a directory on the server (here I am logged in as the hadoop user and unpack it into ${HOME}/app).
2. Configure the Java installation directory in Hadoop's runtime environment file. Edit ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh and set JAVA_HOME to the root of your Java installation.
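For example, the relevant line in hadoop-env.sh might end up looking like this (the JDK path below is only an illustration; substitute the root of your own installation):

```shell
# In ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh, replace the default
# "export JAVA_HOME=${JAVA_HOME}" line with an explicit path.
# /usr/java/jdk1.8.0_144 is a hypothetical location; use your own JDK root.
export JAVA_HOME=/usr/java/jdk1.8.0_144
```

An explicit path is preferred here because Hadoop's scripts may not inherit JAVA_HOME from the login shell when started over ssh.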
3. Configure the Hadoop environment variables. Add the following to /etc/profile (HADOOP_HOME must match where you unpacked Hadoop; with the layout above it is /home/hadoop/app/hadoop-2.8.2), then run `source /etc/profile` so the change takes effect in the current shell:

```
export HADOOP_HOME=/home/hadoop/app/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
```
4. Run the `hadoop version` command to verify that the environment variables are configured correctly. If everything is in order, you will see output similar to the following:

```
[hadoop@server01 hadoop]$ hadoop version
Hadoop 2.8.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
Compiled by jdu on 2017-10-19T20:39Z
Compiled with protoc 2.5.0
From source with checksum dce55e5afe30c210816b39b631a53b1d
This command was run using /home/hadoop/app/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar
[hadoop@server01 hadoop]$
```
II. Usage Example
Hadoop ships with a MapReduce example program, $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar, which demonstrates the basic functionality of MapReduce and includes jobs such as wordcount, terasort, join, and grep. You can run the following command to see which MapReduce jobs this .jar file supports.
```
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
```

```
[hadoop@server01 mapreduce]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
An example program must be given as the first argument. Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@server01 mapreduce]$
```
Next, let's use this built-in MapReduce program to count the words in a set of files.
1. Create a directory to hold the data to be processed. It can be anywhere (here I create an input directory under /home/hadoop/hadoopdata), then copy the files you want to analyze into it (here I copy Hadoop's configuration files into the input directory).

```
cd /home/hadoop/hadoopdata
mkdir input
cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
ls -l input
```
```
[hadoop@server01 hadoopdata]$ cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
[hadoop@server01 hadoopdata]$ ll input
total 52
-rw-r--r--. 1 hadoop hadoop 4942 Apr 30 11:43 capacity-scheduler.xml
-rw-r--r--. 1 hadoop hadoop 1144 Apr 30 11:43 core-site.xml
-rw-r--r--. 1 hadoop hadoop 9683 Apr 30 11:43 hadoop-policy.xml
-rw-r--r--. 1 hadoop hadoop  854 Apr 30 11:43 hdfs-site.xml
-rw-r--r--. 1 hadoop hadoop  620 Apr 30 11:43 httpfs-site.xml
-rw-r--r--. 1 hadoop hadoop 3518 Apr 30 11:43 kms-acls.xml
-rw-r--r--. 1 hadoop hadoop 5546 Apr 30 11:43 kms-site.xml
-rw-r--r--. 1 hadoop hadoop  871 Apr 30 11:43 mapred-site.xml
-rw-r--r--. 1 hadoop hadoop 1067 Apr 30 11:43 yarn-site.xml
[hadoop@server01 hadoopdata]$
```
2. In this example we take all the files in the input folder as input, filter out the words matching the regular expression dfs[a-z.]+, and count how often each occurs. Run the following command from the /home/hadoop/hadoopdata directory to start the Hadoop job:

```
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'
```
If the job succeeds, it prints a series of progress messages and writes the results to the output folder. View them with `cat output/*`; the matching word dfsadmin appears once:
```
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
File Input Format Counters
        Bytes Read=123
File Output Format Counters
        Bytes Written=23
[hadoop@server01 hadoopdata]$ cat output/*
1       dfsadmin
[hadoop@server01 hadoopdata]$ ll output/
total 4
-rw-r--r--. 1 hadoop hadoop 11 Apr 30 12:51 part-r-00000
-rw-r--r--. 1 hadoop hadoop  0 Apr 30 12:51 _SUCCESS
[hadoop@server04 hadoopdata]$
```
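What the grep job computes can be approximated on a small scale with ordinary Unix tools. This is only a local analogy of the map phase (extract every match of the regex) and the reduce phase (count per distinct match), not how Hadoop actually executes it, and the sample text below is invented for illustration:

```shell
# Invented sample input standing in for the XML files in input/
printf 'dfsadmin manages HDFS\nthe dfs.replication setting\nno match here\n' > /tmp/grep-demo.txt

# "map": emit each token matching the regex; "reduce": count per token
grep -oE 'dfs[a-z.]+' /tmp/grep-demo.txt | sort | uniq -c
```

Here `uniq -c` reports one occurrence each of dfs.replication and dfsadmin, the same kind of count-per-match result that the MapReduce job wrote to part-r-00000.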
Note that Hadoop does not overwrite result files by default, so running another job that writes to the same output directory will fail with an error; you must delete the output directory first.
3. After deleting the output directory, let's count the words again, this time with the wordcount job:

```
[hadoop@server04 hadoopdata]$ rm -rf output/
[hadoop@server04 hadoopdata]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount input output
```
The results look like this:

```
File Input Format Counters
        Bytes Read=26548
File Output Format Counters
        Bytes Written=10400
[hadoop@server04 hadoopdata]$ cat output/*
"*"             18
"AS             8
"License");     8
"alice,bob      18
"clumping"      1
"kerberos".     1
"simple"        1
'HTTP/'         1
'none'          1
'random'        1
```
With that, we have successfully used Hadoop's bundled MapReduce program to count the words in a set of files.
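The essence of what wordcount does can likewise be sketched with standard tools: split the input into words (map), bring identical words together (shuffle/sort), and sum each group (reduce). A minimal local sketch, using made-up input rather than the configuration files above:

```shell
# Invented two-line sample standing in for the files in input/
printf 'hello hadoop\nhello world\n' > /tmp/wc-demo.txt

# map: one word per line; shuffle: sort; reduce: count each group
tr -s ' \t' '\n' < /tmp/wc-demo.txt | sort | uniq -c | sort -rn
```

The pipeline reports hello twice and the other words once; the real wordcount job distributes exactly this grouping and summation across a cluster.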