Learning Hadoop: Single-Node Cluster Setup
1. Download Hadoop from the official site
http://hadoop.apache.org/releases.html — download hadoop-2.7.3.tar.gz into a Hadoop working directory, e.g. ~/SoftWare/BigData/Hadoop
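For example, using the Apache archive (any mirror listed on the releases page works just as well):
$ wget -P ~/SoftWare/BigData/Hadoop https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz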
2. Unpack it in the working directory:
$ cd ~/SoftWare/BigData/Hadoop/
$ tar -zxvf hadoop-2.7.3.tar.gz
3. Check whether a JDK is installed; if not, install JDK 7 or later and set $JAVA_HOME, $PATH and $CLASSPATH.
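For example, check the installed version and, if needed, add the variables to ~/.bashrc (the JDK path below is just an example):
$ java -version
# in ~/.bashrc:
export JAVA_HOME=/home/username/SoftWare/Java/jdk1.8.0_65
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib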
4. Check whether ssh and rsync are installed; if not, install them:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
5. Edit the JAVA_HOME setting in ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh to use the actual absolute path, e.g.:
export JAVA_HOME=/home/username/SoftWare/Java/jdk1.8.0_65
6. Run a MapReduce job in Standalone Operation:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*
You should see
1	dfsadmin
which shows that Hadoop runs correctly in Standalone Operation.
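Note that the job refuses to start if the output directory already exists, so remove it before re-running the example:
$ rm -rf output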
7. Pseudo-Distributed Operation
1) Edit ${HADOOP_HOME}/etc/hadoop/core-site.xml:
<configuration>
    <property>
        <!-- Address the NameNode (the HDFS master) listens on -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <!-- Base directory for files Hadoop creates at runtime -->
        <name>hadoop.tmp.dir</name>
        <value>/home/username/SoftWare/BigData/Hadoop/tmp</value>
    </property>
</configuration>
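The directory configured as hadoop.tmp.dir should exist; it is safest to create it up front (path as in the example above):
$ mkdir -p /home/username/SoftWare/BigData/Hadoop/tmp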
2) Edit ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <!-- Number of HDFS replicas; 1 is enough on a single node -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
3) Check whether you can SSH to localhost without a password:
$ ssh localhost
4) If passwordless SSH to localhost works, skip this step; otherwise run:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
5) Formatting and starting HDFS
Before its first use, HDFS must be formatted: ${HADOOP_HOME}/bin/hdfs namenode -format
Then start HDFS: ${HADOOP_HOME}/sbin/start-dfs.sh
Use jps to check whether HDFS started successfully:
$ jps
63842 SecondaryNameNode
63381 NameNode
8470 Jps
63565 DataNode
Seeing the NameNode, DataNode and SecondaryNameNode processes means HDFS started successfully.
The NameNode web UI is now available at http://localhost:50070/ and the SecondaryNameNode web UI at http://localhost:50090
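Besides the web UIs, HDFS health can be checked from the command line:
$ ${HADOOP_HOME}/bin/hdfs dfsadmin -report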
6) Upload and test a file
Create a local file words.txt with the following content:
Hello World!
Hello China!
Hello Jim
Hello Tom
The People's Republic Of China!
Upload words.txt to the HDFS root directory: ${HADOOP_HOME}/bin/hadoop fs -put words.txt /
The uploaded file is now visible at http://localhost:50070/explorer.html#/
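The upload can also be verified from the command line:
$ ${HADOOP_HOME}/bin/hadoop fs -ls /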
7) Run an example, e.g. wordcount over the uploaded words.txt:
$ ${HADOOP_HOME}/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /words.txt /output
View the result:
$ ${HADOOP_HOME}/bin/hdfs dfs -cat /output/*
China! 2
Hello 4
Jim 1
Of 1
People's 1
Republic 1
The 1
Tom 1
World! 1
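The result can also be copied out of HDFS and inspected locally (the local directory name output_local is arbitrary):
$ ${HADOOP_HOME}/bin/hdfs dfs -get /output output_local
$ cat output_local/*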
8) In Pseudo-Distributed Operation, MapReduce jobs can also run on YARN. To do so,
edit ${HADOOP_HOME}/etc/hadoop/mapred-site.xml:
<configuration>
    <property>
        <!-- Tell the MapReduce framework to run on YARN -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Then edit ${HADOOP_HOME}/etc/hadoop/yarn-site.xml:
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <!-- Auxiliary shuffle service the reducers use to fetch map output -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <!-- Total physical memory (MB) YARN may use on this node; the default is
             8192. Lower it if the node has less than 8 GB of RAM, because YARN
             does not detect the node's physical memory on its own. -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3072</value>
    </property>
    <property>
        <!-- Minimum memory (MB) allocated per container request -->
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <!-- Maximum physical memory (MB) a single task may request; the default is 8192 -->
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <!-- Disk health check: once disk utilization exceeds this percentage the
             node is marked unhealthy and MapReduce jobs can no longer run on it -->
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>99</value>
    </property>
</configuration>
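Before settling on the memory values above, it helps to check the node's actual physical memory with a plain Linux command:
$ free -m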
9) Start YARN: ${HADOOP_HOME}/sbin/start-yarn.sh
10) Use jps to check whether YARN started successfully:
$ jps
13761 SecondaryNameNode
13410 NameNode
13923 ResourceManager
16744 Jps
14057 NodeManager
13567 DataNode
The ResourceManager and NodeManager processes show that YARN started successfully.
Cluster information is available at http://localhost:8088/cluster
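Node status can also be checked from the command line:
$ ${HADOOP_HOME}/bin/yarn node -list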
11) Run a MapReduce job (this assumes the input text, e.g. words.txt, has been uploaded to /input):
${HADOOP_HOME}/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output_wordcount
17/05/13 10:38:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/13 10:38:06 INFO input.FileInputFormat: Total input paths to process : 1
17/05/13 10:38:06 INFO mapreduce.JobSubmitter: number of splits:1
17/05/13 10:38:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494642975142_0001
17/05/13 10:38:07 INFO impl.YarnClientImpl: Submitted application application_1494642975142_0001
17/05/13 10:38:07 INFO mapreduce.Job: The url to track the job: http://tizen-HP-Compaq-Pro-6380-MT:8088/proxy/application_1494642975142_0001/
17/05/13 10:38:07 INFO mapreduce.Job: Running job: job_1494642975142_0001
17/05/13 10:38:13 INFO mapreduce.Job: Job job_1494642975142_0001 running in uber mode : false
17/05/13 10:38:13 INFO mapreduce.Job: map 0% reduce 0%
17/05/13 10:38:18 INFO mapreduce.Job: map 100% reduce 0%
17/05/13 10:38:23 INFO mapreduce.Job: map 100% reduce 100%
17/05/13 10:38:24 INFO mapreduce.Job: Job job_1494642975142_0001 completed successfully
17/05/13 10:38:24 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=113
FILE: Number of bytes written=237983
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=180
HDFS: Number of bytes written=71
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2214
Total time spent by all reduces in occupied slots (ms)=2302
Total time spent by all map tasks (ms)=2214
Total time spent by all reduce tasks (ms)=2302
Total vcore-milliseconds taken by all map tasks=2214
Total vcore-milliseconds taken by all reduce tasks=2302
Total megabyte-milliseconds taken by all map tasks=2267136
Total megabyte-milliseconds taken by all reduce tasks=2357248
Map-Reduce Framework
Map input records=5
Map output records=13
Map output bytes=130
Map output materialized bytes=113
Input split bytes=102
Combine input records=13
Combine output records=9
Reduce input groups=9
Reduce shuffle bytes=113
Reduce input records=9
Reduce output records=9
Spilled Records=18
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=88
CPU time spent (ms)=1410
Physical memory (bytes) snapshot=445538304
Virtual memory (bytes) snapshot=3855974400
Total committed heap usage (bytes)=290979840
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=78
File Output Format Counters
Bytes Written=71
This shows the job ran successfully; the result can be viewed with ${HADOOP_HOME}/bin/hadoop fs -cat /output_wordcount/*:
China! 2
Hello 4
Jim 1
Of 1
People's 1
Republic 1
The 1
Tom 1
World! 1
Note: when running a MapReduce job on YARN, the following warnings appeared:
2017-05-13 10:38:07,465 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start
2017-05-13 10:38:07,465 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue for user, it is likely set too low. skipping enforcement to allow at least one application to start
and the following had to be added to ${HADOOP_HOME}/etc/hadoop/yarn-site.xml (it is already included in the configuration shown above):
<property>
    <!-- Disk health check: once disk utilization exceeds this percentage the
         node is marked unhealthy and MapReduce jobs can no longer run on it -->
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>99</value>
</property>
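When finished, the daemons can be stopped with the matching stop scripts:
$ ${HADOOP_HOME}/sbin/stop-yarn.sh
$ ${HADOOP_HOME}/sbin/stop-dfs.sh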