Hadoop Study Notes - Basic Concepts

Hadoop consists of four core modules:

Hadoop Common: the Java libraries and utilities used by the other Hadoop modules
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop Distributed File System (HDFS): a distributed file system
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
I. Hadoop Concepts
1. References
Distributed Systems: Principles and Paradigms, 2nd Edition
Hadoop: The Definitive Guide, 4th Edition
Hadoop Real-World Solutions Cookbook
2. Hadoop Architecture
2.1 Overall Architecture
2.1.1 MapReduce
The Map Task: This is the first task, which takes the input data and converts it into a set of intermediate data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
JobTracker: The master, responsible for resource management, tracking resource consumption/availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing failed tasks.
TaskTracker: The slave TaskTrackers execute the tasks as directed by the master and periodically report task status back to the master.
2.1.2 HDFS
NameNode: the single master node, which manages the file system metadata.
DataNode: one or more slave nodes, which store the actual data.
2.2 HDFS
2.2.1 HDFS Features
It is suitable for distributed storage and processing.
Hadoop provides a command-line interface to interact with HDFS.
The built-in web servers of the namenode and datanode make it easy to check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
2.2.2 HDFS Architecture
Namenode tasks
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
Datanodes perform read-write operations on the file system, as per client request (see the sketch after this list).
They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
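To make the client/datanode interaction concrete, here is a minimal sketch using the HDFS Java API; the file path is a made-up example and not part of this setup. Metadata calls (create, open) go to the namenode, while the file bytes themselves are streamed to and from datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // metadata operations go to the namenode
        Path file = new Path("/tmp/hello.txt");        // hypothetical path

        // Write: the namenode allocates blocks, the bytes are streamed to datanodes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the client fetches block locations from the namenode, then reads from datanodes
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}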
Block
Generally, user data is stored in HDFS files. A file is divided into one or more segments, which are stored on individual data nodes; these segments are called blocks. In other words, a block is the minimum amount of data that HDFS reads or writes. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, and it can be changed as needed in the HDFS configuration.
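As an illustration of blocks, the HDFS Java API can report a file's block size, replication factor and block locations. This is only a sketch; it assumes the /data/input/file01 file created later in these notes already exists in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input/file01"));

        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Each BlockLocation names the datanodes that hold one block of the file
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc);
        }
        fs.close();
    }
}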
2.3 MapReduce
MapReduce is a processing technique and a Java-based distributed programming model. The MapReduce algorithm contains two important tasks: Map and Reduce.
2.3.1 Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
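To make the shuffle stage concrete, here is a small plain-Java sketch (illustrative only, not Hadoop code) of what conceptually happens between map and reduce for the single input line "Hello World Bye World":

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Intermediate (key, value) pairs as a word-count mapper would emit them
        // for the line "Hello World Bye World"
        String[][] mapOutput = { {"Hello", "1"}, {"World", "1"}, {"Bye", "1"}, {"World", "1"} };

        // Shuffle: group the values of identical keys together, sorted by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(Integer.parseInt(pair[1]));
        }

        // Reduce: one call per key, here summing the grouped values
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + sum);   // Bye 1, Hello 1, World 2
        }
    }
}

In a real job the framework performs this grouping across the cluster, partitioning the intermediate keys among the reducers.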
2.3.2 Inputs and Outputs
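From the Java API's point of view, a MapReduce job consumes and produces key/value pairs: value classes must implement Writable, and key classes must implement WritableComparable so the framework can serialize and sort them during the shuffle. The sketch below (class names are illustrative; the concrete types match the later WordCount example) shows the (input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output) type flow:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TypeFlowSketch {
    // k1 = byte offset of the line, v1 = the line itself (TextInputFormat's defaults);
    // the mapper emits intermediate pairs <k2, v2>
    static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

    // the reducer receives <k2, list of v2> grouped by the shuffle and emits <k3, v3>
    static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
}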
2.3.3 Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data resides before any processing takes place.
MasterNode - Node where JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTrackers.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
II. Hadoop Environment Setup
1. Setting Up the Virtual Machines
Linux OS: Puppy Slacko 6.3.2, 64-bit
Install three virtual machines in VirtualBox, with the following host names:
hadoop-master
hadoop-slave-1
hadoop-slave-2
2. Software Installation and Environment Configuration
Download the JDK
Install Eclipse
Download Hadoop 2.7.2
Install SSH
Configure the environment variables in ~/.bashrc:
JAVA_HOME=/opt/jdk1.8.0_92
CLASSPATH=.:${JAVA_HOME}/lib/
export JAVA_HOME=${JAVA_HOME}
export CLASSPATH=${CLASSPATH}:${CLASSPATH}tools.jar:${CLASSPATH}dt.jar
export PATH=.:${PATH}:${JAVA_HOME}/bin:${CLASSPATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Add "source ~/.bashrc" as the last line of ~/.profile; otherwise .bashrc is not executed after an SSH login and the environment variables do not take effect.
3. Configuring SSH
Install OpenSSH using the Package Manager.
Generate the host keys:
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
Configure /etc/ssh/sshd_config:
PermitRootLogin yes
TCPKeepAlive yes
ClientAliveInterval 60
ClientAliveCountMax 3
MaxStartups 10:30:100
Change the root password:
passwd root
Start the SSH service:
/usr/sbin/sshd
Start the SSH service automatically at boot
Add launchSSH.sh to the ~/Startup directory:
#!/usr/bin/bash
/usr/sbin/sshd
SSH connections can be debugged with the following command:
ssh -v <ip>
Configure PuTTY
4. Configuring Hadoop
Configure the SSH keys on every node:
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
Edit the hosts file on every node:
# vi /etc/hosts
# Enter the following lines in the /etc/hosts file:
192.168.67.113 hadoop-master
192.168.67.27 hadoop-slave-1
192.168.67.98 hadoop-slave-2
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop-2.7.2/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop-2.7.2/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>
hadoop-env.sh
Replace the following environment variables with absolute paths:
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk1.8.0_92
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
export HADOOP_CONF_DIR="/opt/hadoop-2.7.2/etc/hadoop"
5. Installing Hadoop on the Slave Servers
$ cd /opt
$ scp -r hadoop-2.7.2 hadoop-slave-1:/opt/hadoop-2.7.2
$ scp -r hadoop-2.7.2 hadoop-slave-2:/opt/hadoop-2.7.2
6. Configuring the Master Server
$ cd /opt/hadoop-2.7.2
# Configure the master node
$ vi etc/hadoop/masters
hadoop-master
# Configure the slave nodes
$ vi etc/hadoop/slaves
hadoop-master
hadoop-slave-1
hadoop-slave-2
7. Formatting the NameNode on the Master Server
$ cd /opt/hadoop-2.7.2
$ bin/hadoop namenode -format
8. Starting the Hadoop Services
$ cd /opt/hadoop-2.7.2
$ sbin/start-all.sh
$ jps
If anything goes wrong, check logs/hadoop-root-namenode-puppypc3200.log for exceptions.
9. Checking the Ports Hadoop Listens On
$ cd /opt/hadoop-2.7.2
$ netstat -apnt | grep 'java'
10. Accessing Hadoop from a Browser
Access the HDFS NameNode web UI:
http://<master-ip>:50070
Check all applications running on the cluster (YARN ResourceManager web UI):
http://<master-ip>:8088
11. Working with HDFS
List of HDFS file system commands (running bin/hadoop fs with no arguments prints it).
Use the help parameter to get detailed information about a command, for example bin/hadoop fs -help.
III. Hadoop MapReduce Development Example
1. WordCount: this example counts the number of occurrences of each word in the input files.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token of the input line
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sums all counts received for the same word
    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
2. Compiling WordCount and Creating the JAR
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
3. Creating the Input Files file01 and file02
Contents of file01: Hello World Bye World
Contents of file02: Hello Hadoop Goodbye Hadoop
Copy the input files into the /data/input directory in HDFS (creating it first if it does not already exist):
$ bin/hadoop fs -mkdir -p /data/input
$ bin/hadoop fs -copyFromLocal file0* /data/input/.
4. Running the Application
$ bin/hadoop jar wc.jar WordCount /data/input /data/output
With the two input files above, the output in /data/output should contain: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.