Common Hadoop Problems and Solutions

SunWuKongHadoop

2014-12-23

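1: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

Answer:

The program needs to open many files for analysis. The system default limit on open files is usually 1024 (visible with ulimit -a). That is enough for ordinary use, but far too low for this workload.

Fix:

Modify two files.

/etc/security/limits.conf

vi /etc/security/limits.conf

Add:

* soft nofile 102400

* hard nofile 409600

$ cd /etc/pam.d/

$ sudo vi login

Add: session required /lib/security/pam_limits.so

A correction to the answer for the first problem:

This error occurs when, during the shuffle in the reduce preparation phase, the number of failed attempts to fetch the output of completed maps exceeds the limit, which defaults to 5. Many things can cause it, such as an abnormal network connection, connection timeouts, poor bandwidth, or blocked ports. It usually does not appear when the network inside the cluster is healthy.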

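2: Too many fetch-failures

Answer:

This problem mainly occurs when connectivity between the nodes is incomplete.

1) Check /etc/hosts

The local IP must map to the server's hostname.

The file must contain the IP and hostname of every server.

2) Check .ssh/authorized_keys

The file must contain the public key of every server (including the server itself).

3: Processing is extremely slow: map finishes quickly, but reduce is very slow and repeatedly shows reduce=0%

Answer:

Apply the checks from item 2, then

change export HADOOP_HEAPSIZE=4000 in conf/hadoop-env.sh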

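4: The datanode starts, but it cannot be accessed and cannot be stopped

When reformatting a new distributed file system, delete the local file system path configured as dfs.name.dir on the NameNode (where the NameNode persistently stores the namespace and transaction logs), and also delete the directories configured as dfs.data.dir on each DataNode (the local file system paths where the DataNode stores block data). In this configuration, that means deleting /home/hadoop/NameData on the NameNode and /home/hadoop/DataNode1 and /home/hadoop/DataNode2 on the DataNodes. The reason is that when Hadoop formats a new distributed file system, each stored namespace is tied to the version from its creation time (see the VERSION file under /home/hadoop/NameData/current, which records the version information). When reformatting, it is best to delete the NameData directory first, and the dfs.data.dir of every DataNode must be deleted, so that the version information recorded by the namenode and the datanodes matches.

Note: deletion is dangerous. Do not delete anything you are unsure about, and back up every file before deleting it!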

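5: java.io.IOException: Could not obtain block: blk_194219614024901469_1100 file=/user/hive/warehouse/src_20090724_log/src_20090724_log

This usually means a node went down and is no longer connected.

6: java.lang.OutOfMemoryError: Java heap space

This exception clearly means the JVM does not have enough memory; change the JVM memory size on all datanodes.

java -Xms1024m -Xmx4096m

As a rule of thumb, the JVM's maximum heap should be about half of total memory. Our machines have 8 GB, so we set it to 4096m; this value may still not be optimal.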

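This thread was pinned by admin on 2009-11-20 10:50

Bump. Posts like this are very useful and should be pinned. The attachment is related material provided by Ruobing (若冰) from the Hadoop technical discussion group:

(12.58 KB)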

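How to add nodes to Hadoop

The steps I actually followed to add a node:

1. Set up the environment on the new slave first, including ssh, the JDK, and copies of the relevant config, lib, bin, and so on;

2. Add the new datanode's host to the namenode and the other datanodes of the cluster;

3. Add the new datanode's IP to conf/slaves on the master;

4. Restart the cluster and check that the new datanode appears in it;

5. Run bin/start-balancer.sh; this takes a long time.

Notes:

1. Without balancing, the cluster places new data on the new node, which lowers MapReduce efficiency;

2. The balancer is started with bin/start-balancer.sh and optionally takes the parameter -threshold 5

threshold is the balancing threshold, 10% by default; the lower the value, the more evenly balanced the nodes, but the longer balancing takes.

3. The balancer can also run on a cluster that has MapReduce jobs running. The default dfs.balance.bandwidthPerSec is low, at 1 MB/s. When no MapReduce jobs are running, raising this setting shortens the balancing time.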
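To make the threshold note concrete, here is a minimal sketch in plain Java of the rule the balancer documentation describes: a node counts as balanced when its utilization differs from the cluster-wide utilization by no more than the threshold. The class and method names are hypothetical and this is a simplified model, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the balancer's threshold test; hypothetical names, not Hadoop code. */
final class BalancerThresholdSketch {

    /**
     * Returns the indexes of nodes whose utilization (used / capacity, in percent)
     * is more than thresholdPercent away from the utilization of the whole cluster.
     */
    static List<Integer> unbalancedNodes(long[] usedBytes, long[] capacityBytes,
                                         double thresholdPercent) {
        long totalUsed = 0;
        long totalCapacity = 0;
        for (int i = 0; i < usedBytes.length; i++) {
            totalUsed += usedBytes[i];
            totalCapacity += capacityBytes[i];
        }
        double clusterPercent = 100.0 * totalUsed / totalCapacity;

        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < usedBytes.length; i++) {
            double nodePercent = 100.0 * usedBytes[i] / capacityBytes[i];
            if (Math.abs(nodePercent - clusterPercent) > thresholdPercent) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two fairly full old nodes and one nearly empty new node.
        long[] used = {80, 75, 5};
        long[] capacity = {100, 100, 100};
        System.out.println(unbalancedNodes(used, capacity, 10.0)); // prints [0, 1, 2]
        System.out.println(unbalancedNodes(used, capacity, 50.0)); // prints []
    }
}
```

With a lower threshold more nodes fall outside the allowed band, which is why a smaller -threshold value means more data has to move and balancing takes longer.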

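Other notes:

1. Make sure the firewall on the slave is turned off;

2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely that the IPs of the master and the other slaves have been added to /etc/hosts on the new slave.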

Number of mappers and reducers

URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

HowManyMapsAndReduces

Partitioning your job into maps and reduces

Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps/1,000,000 reduces where the framework runs out of resources for the overhead.

Number of Maps

The number of maps is usually driven by the number of DFS blocks in the input files, although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.

The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.

Number of Reduces

The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.

Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.

The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.

The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
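The arithmetic in the wiki text above can be checked with a small sketch. It assumes one map per DFS block and treats the 0.95 / 1.75 factors as plain multipliers; the class and method names are hypothetical, and the 10-node, 2-slot cluster in main is an invented example, not a figure from the wiki.

```java
/** Back-of-the-envelope task counts following the wiki text; hypothetical names, not Hadoop code. */
final class TaskCountSketch {

    /** One map per DFS block, rounding up for a partial last block. */
    static long estimateMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    /** factor * (nodes * reduce slots per node), rounded to the nearest integer. */
    static int suggestReduces(double factor, int nodes, int reduceSlotsPerNode) {
        return (int) Math.round(factor * (nodes * reduceSlotsPerNode));
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024;   // 10 TB of input
        long block = 128L * 1024 * 1024;                // 128 MB DFS blocks
        System.out.println(estimateMaps(tenTb, block)); // prints 81920, the "82k maps" in the text

        // Assumed example cluster: 10 nodes with 2 reduce slots each.
        System.out.println(suggestReduces(0.95, 10, 2)); // prints 19
        System.out.println(suggestReduces(1.75, 10, 2)); // prints 35
    }
}
```

With the 0.95 factor the 19 reduces fit into the 20 available slots in a single wave; with 1.75 the 35 reduces need a second wave, which is the load-balancing effect the text describes.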

My own understanding:

Setting the number of mappers: it depends on the input files and on the file splits. The upper bound of a file split is dfs.block.size, the lower bound can be set via mapred.min.split.size, and ultimately the InputFormat decides.

A good recommendation:

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

<property>

<name>mapred.tasktracker.reduce.tasks.maximum</name>

<description>The maximum number of reduce tasks that will be run

simultaneously by a task tracker.

</description>

</property>

Adding a new disk to a single node

1. On the node that gets the new disk, edit dfs.data.dir, separating the new and old directories with commas

2. Restart dfs

Syncing Hadoop code

hadoop-env.sh

# host:path where hadoop code should be rsync'd from. Unset by default.

# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files from the command line

hadoop fs -getmerge <src> <dest>

Restarting reduce jobs

Introduced recovery of jobs when JobTracker restarts. This facility is off by default.

Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".

Not yet verified.

Problems with IO write operations

0-1246359584298, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.100.165:50010 remote=/172.16.100.165:50930]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
    at java.lang.Thread.run(Thread.java:619)

It seems there are many reasons that it can timeout, the example given in HADOOP-3831 is a slow reading client.

Fix: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml;

My understanding is that this issue should be fixed in Hadoop 0.19.1 so that we should leave the standard timeout. However until then this can help resolve issues like the one you're seeing.

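How to decommission HDFS nodes

The dfsadmin help text in the current version does not explain this clearly, and a bug has already been filed. The correct procedure is:

1. Set dfs.hosts to a file listing the current slaves, using the full path of the file. Note that the hostnames in the list must be the full names, that is, what uname -n returns.

2. Put the full names of the nodes to be decommissioned in another file, for example slaves.ex, and point the dfs.hosts.exclude parameter at the full path of that file

3. Run bin/hadoop dfsadmin -refreshNodes

4. The web UI or bin/hadoop dfsadmin -report shows the status of the nodes being decommissioned as "Decommission in progress" until the data that needs to be replicated has been copied

5. When that is done, remove the decommissioned nodes from the slaves file (the file that dfs.hosts points to)

Incidentally, the -refreshNodes command has three other uses:

2. Adding allowed nodes to the list (add the hostnames to dfs.hosts)

3. Removing nodes directly, without replicating their data (remove the hostnames from dfs.hosts)

4. The reverse of decommissioning: stop decommissioning nodes that are listed in both the exclude file and dfs.hosts and are currently being decommissioned, that is, turn "Decommission in progress" nodes back to Normal (called "in service" in the web UI)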

Hadoop learning notes

1. Fixing the Hadoop OutOfMemoryError:

<property>

<name>mapred.child.java.opts</name>

<value>-Xmx800M -server</value>

</property>

With the right JVM size in your hadoop-site.xml, you will have to copy this to all mapred nodes and restart the cluster.

Or: hadoop jar jarfile [main class] -Dmapred.child.java.opts=-Xmx800M

2. Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.

When I use nutch 1.0, I get this error:

Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.

This one is also easy to solve:

Delete conf/log4j.properties, and you will then see a detailed error report.

In my case it was out of memory.

The fix was to pass these options when running the main class org.apache.nutch.crawl.Crawl: -Xms64m -Xmx512m

Your problem may be a different one, but once you can see the detailed error report it becomes easy to solve.

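Using the distributed cache

It works like a global variable; because the data is too large to put in the config file, the distributed cache is used instead.

Usage (see "The Definitive Guide", p. 240):

1. On the command line: pass -files to bring in the files to be looked up (these can be local files or HDFS files (using hdfs://xxx?)), or -archives (JAR, ZIP, tar, and so on)

% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
  -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output

2. In the program:

public void configure(JobConf conf) {
    metadata = new NcdcStationMetadata();
    try {
        // The file passed with -files is available in the task's working directory
        metadata.initialize(new File("stations-fixed-width.txt"));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Another, indirect way to use it (it does not seem to exist in hadoop-0.19.0):

call addCacheFile() or addCacheArchive() to add files,

and use getLocalCacheFiles() or getLocalCacheArchives() to retrieve them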

Hadoop job web display

There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.

Hadoop monitoring

OnlyXP (52388483) 13:17:02

Use nagios for alerting and ganglia for monitoring charts.

status of 255 error

Error type:

java.io.IOException: Task process exit with nonzero status of 255.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

Cause:

Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher value. By default, their values are 24 hours. These might be the reason for failure, though I'm not sure

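split size

FileInputFormat input splits (see "The Definitive Guide", p. 190):

mapred.min.split.size: default=1, the smallest valid size in bytes for a file split.

mapred.max.split.size: default=Long.MAX_VALUE, the largest valid size.

dfs.block.size: default=64M; set to 128M on our system.

If the minimum split size is set larger than the block size, each split becomes larger than one block and covers several blocks. (My guess is that the blocks are fetched from other nodes and merged into the split.)

If the maximum split size is set smaller than the block size, each block is split further.

splitSize = max(minimumSize, min(maximumSize, blockSize));

where, by default, minimumSize < blockSize < maximumSize.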
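The formula can be tried out directly. The sketch below is plain Java with a hypothetical class name; it only evaluates the max/min expression quoted above for the three cases discussed (defaults, minimum above the block size, maximum below the block size).

```java
/** Evaluates the split-size formula quoted above; hypothetical class, not Hadoop code. */
final class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    /** splitSize = max(minimumSize, min(maximumSize, blockSize)) */
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        // Defaults: the split size equals the block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block) / MB);        // prints 64
        // Minimum above the block size: splits grow to the minimum.
        System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, block) / MB); // prints 128
        // Maximum below the block size: blocks are cut into smaller splits.
        System.out.println(computeSplitSize(1, 32 * MB, block) / MB);               // prints 32
    }
}
```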

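sort by value

Hadoop does not provide a direct sort-by-value method, because it would reduce MapReduce performance.

It can be achieved by combining several pieces; see "The Definitive Guide", p. 250

Basic idea:

1. Combine key/value into a new key;

2. Override the partitioner so that it partitions by the old key;

conf.setPartitionerClass(FirstPartitioner.class);

3. Define a custom key comparator: sort by the old key first, then by the old value;

conf.setOutputKeyComparatorClass(KeyComparator.class);

4. Override the grouping comparator so that it also groups by the old key;

conf.setOutputValueGroupingComparator(GroupComparator.class);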
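The four steps can be simulated outside Hadoop to see what each one contributes. The following is a plain-Java sketch, not the book's code and not the Hadoop API: the Pair class, the method names, and the sample records are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the secondary-sort idea; hypothetical names, not Hadoop code. */
final class SecondarySortSketch {

    /** Step 1: composite key made of the original key and the original value. */
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Step 2: partition on the original key only, so all values of a key go to one reducer. */
    static int partition(Pair p, int numPartitions) {
        return (p.key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Step 3: order by the original key first, then by the original value. */
    static final Comparator<Pair> KEY_COMPARATOR =
            Comparator.comparing((Pair p) -> p.key).thenComparingInt(p -> p.value);

    /** Step 4: group the sorted records by the original key; values arrive already sorted. */
    static Map<String, List<Integer>> sortAndGroup(List<Pair> records) {
        List<Pair> sorted = new ArrayList<>(records);
        sorted.sort(KEY_COMPARATOR);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Pair p : sorted) {
            groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> records = new ArrayList<>();
        records.add(new Pair("1950", 22));
        records.add(new Pair("1949", 78));
        records.add(new Pair("1950", 0));
        records.add(new Pair("1949", 111));
        records.add(new Pair("1950", -11));
        System.out.println(sortAndGroup(records)); // prints {1949=[78, 111], 1950=[-11, 0, 22]}
    }
}
```

In a real job the framework does the sorting and grouping; the three conf.set... calls only tell it which comparator and partitioner to use at each stage.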

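Handling small input files

Using a large number of small files as input lowers Hadoop's efficiency.

There are 3 ways to combine small files for processing:

1. Merge the small files into one SequenceFile to speed up MapReduce.

See WholeFileInputFormat and SmallFilesToSequenceFileConverter, "The Definitive Guide", p. 194

2. Use CombineFileInputFormat, which extends FileInputFormat; I have not tried this;

3. Use hadoop archives (similar to packing) to reduce the namenode memory consumed by the metadata of small files. (This method may not work, so it is not recommended.)

Method:

Archive the /my/files directory and its subdirectories into files.har and place it under /my

bin/hadoop archive -archiveName files.har /my/files /my

List the files in the archive:

bin/hadoop fs -lsr har:///my/files.har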

skip bad records

JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
// Accept any number of skipped records around a bad record in the mapper
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
// Number of task attempts after which skipping mode starts
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
// Directory where the skipped records are written
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);

For skipping failed tasks try: mapred.max.map.failures.percent

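Restarting a single datanode

If a datanode had a problem and, once it is fixed, needs to rejoin the cluster without restarting the whole cluster, do the following on that node:

bin/hadoop-daemon.sh start datanode

bin/hadoop-daemon.sh start tasktracker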

reduce exceeds 100%

"Reduce Task Progress shows > 100% when the total size of map outputs (for a single reducer) is high"

Cause:

During the merge phase of reduce, the progress check is inaccurate, so the status goes above 100%, and the following error then appears when the statistics are computed: java.lang.ArrayIndexOutOfBoundsException: 3
    at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
    at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)

JIRA address:

counters

3 kinds of counters:

1. built-in counters: Map input bytes, Map output records...

2. enum counters

Usage:

enum Temperature {
    MISSING,
    MALFORMED
}

reporter.incrCounter(Temperature.MISSING, 1);

Output:

09/04/20 06:33:36 INFO mapred.JobClient: Air Temperature Recor
09/04/20 06:33:36 INFO mapred.JobClient: Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient: Missing=66136856

3. dynamic counters:

Usage:

reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);

Output:

09/04/20 06:33:36 INFO mapred.JobClient: TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient: 2=1246032
09/04/20 06:33:36 INFO mapred.JobClient: 1=973422173
09/04/20 06:33:36 INFO mapred.JobClient: 0=1

7: Namenode in safe mode

Fix:

bin/hadoop dfsadmin -safemode leave

8: java.net.NoRouteToHostException: No route to host

Fix:

sudo /etc/init.d/iptables stop

9: After changing the namenode, select queries in hive still point to the previous namenode address

This is because: When you create a table, hive actually stores the location of the table (e.g. hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but hive's metastore is still pointing to the locations within the old cluster. I could modify the metastore to update with the new IP every time I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master

So every occurrence of the old namenode address in the metastore has to be replaced with the current namenode address.

10: Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put).

Fix:

Go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer then this URL will be http://localhost:50070). Once at that page click on the number where it tells you how many DataNodes you have to look at a list of the DataNodes in your cluster.

If it says you have used 100% of your space, then you need to free up room on local disk(s) of the DataNode(s).

If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried the disk had 1GB free but Hadoop reported that it was 100% full. Then we freed up another 1GB and then it said that the disk was 99.15% full and started writing data into the HDFS again. We encountered this bug on Windows XP SP2.

11: Your DataNodes won't start, and you see something like this in logs/*datanode*:

Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data

Cause:

Your Hadoop namespaceID became corrupted. Unfortunately the easiest thing to do is reformat the HDFS.

Fix:

You need to do something like this:

bin/stop-all.sh

rm -Rf /tmp/hadoop-your-username/*

bin/hadoop namenode -format

12: You can run Hadoop jobs written in Java (like the grep example), but your Hadoop Streaming jobs (such as the Python example that fetches web page titles) won't work.

Cause:

You might have given only a relative path to the mapper and reducer programs. The tutorial originally just specified relative paths, but absolute paths are required if you are running in a real cluster.

Fix:

Use absolute paths like this from the tutorial:

bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper $HOME/proj/hadoop/multifetch.py \
  -reducer $HOME/proj/hadoop/reducer.py \
  -input urls/* \
  -output titles

13: 2009-01-08 10:02:40,709 ERROR metadata.Hive (Hive.java:getPartitions(499)) - javax.jdo.JDODataStoreException: Required table missing: ""PARTITIONS"" in Catalog "" Schema "". JPOX requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "org.jpox.autoCreateTables"

Cause: org.jpox.fixedDatastore was set to true in hive-default.xml

starting namenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop.out
localhost: starting datanode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
localhost: Exception in thread "main" java.lang.NullPointerException
localhost: at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:116)
localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:120)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.initialize(SecondaryNameNode.java:124)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:108)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:460)

14: 09/08/31 18:25:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.11:50010
09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.16:50010
09/08/31 18:25:51 INFO hdfs.DFSClient: Abandoning block blk_-2932256218448902464_1001
09/08/31 18:25:57 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.11:50010
09/08/31 18:25:57 INFO hdfs.DFSClient: Abandoning block blk_-1014449966480421244_1001
09/08/31 18:26:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.16:50010
09/08/31 18:26:03 INFO hdfs.DFSClient: Abandoning block blk_7193173823538206978_1001
09/08/31 18:26:09 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)
09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001 bad datanode[2] nodes == null
09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input" - Aborting...
put: Bad connect ack with firstBadLink 192.168.1.16:50010

Fix:

I have resolved the issue:

What I did:

1) '/etc/init.d/iptables stop' --> stopped firewall

2) SELINUX=disabled in '/etc/selinux/config' file. --> disabled selinux

It worked for me after these two changes

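Fixing jline.ConsoleReader.readLine not working on Windows

In the main() function of CliDriver.java there is a statement, reader.readLine, that reads from standard input. On Windows this statement always returns null. The reader is an instance of jline.ConsoleReader, and this makes debugging in Eclipse on Windows inconvenient.

We can use java.util.Scanner instead, replacing the original

while ((line = reader.readLine(curPrompt + ">")) != null)

with:

Scanner sc = new Scanner(System.in);

while (sc.hasNextLine() && (line = sc.nextLine()) != null)

After recompiling and redeploying, SQL statements can be read from standard input normally.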
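One detail worth knowing when swapping in Scanner: nextLine() throws NoSuchElementException at end of input instead of returning null, so the loop is safer when guarded by hasNextLine(). The sketch below shows that loop shape on its own. The class and method names are hypothetical, and the input is an in-memory stream standing in for System.in so the example runs without a console.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Line-reading loop guarded by hasNextLine(); hypothetical helper, not Hive code. */
final class ScannerReadLoopSketch {

    /** Reads every line until end of input without relying on a null return value. */
    static List<String> readAllLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        Scanner sc = new Scanner(in, "UTF-8");
        while (sc.hasNextLine()) {
            lines.add(sc.nextLine());
        }
        return lines;
    }

    public static void main(String[] args) {
        // In the CLI this would be System.in; a byte array keeps the sketch self-contained.
        InputStream in = new ByteArrayInputStream(
                "show tables;\nselect 1;\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAllLines(in)); // prints [show tables;, select 1;]
    }
}
```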

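Possible causes of the "does not have a scheme" error when debugging hive in Eclipse on Windows

1. The "hive.metastore.local" setting in the Hive configuration file is false; it needs to be changed to true, because this is a standalone setup

2. The HIVE_HOME environment variable is not set, or is set incorrectly

3. "does not have a scheme" is most likely caused by "hive-default.xml" not being found. For how to fix hive-default.xml not being found when debugging Hive in Eclipse, see: http://bbs.hadoopor.com/thread-292-1-1.html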

1. The Chinese text problem

Chinese text was parsed out of a URL, but it still printed as garbage in Hadoop. We used to think Hadoop did not support Chinese; after reading the source code we found that Hadoop merely does not support writing Chinese in GBK.

The following is the code in TextOutputFormat.class. Hadoop's default outputs all inherit from FileOutputFormat, which has two subclasses: one writes a binary stream, the other, TextOutputFormat, writes text.

public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8"; // hard-coded to UTF-8 here
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength()); // this also needs to change
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
    …
  }
}

As you can see, Hadoop's default output is hard-coded to UTF-8. So if the Chinese text was decoded correctly, setting the character set of the Linux client to UTF-8 will display it, because Hadoop wrote the Chinese text as UTF-8.

Most databases define their fields in GBK, so what if we want Hadoop to write Chinese in GBK for compatibility with the database?

We can define a new class:

public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    // just change the charset to gbk
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      // Text.getBytes() returns UTF-8 bytes, so the Text branch is dropped:
      // every object goes through toString() and is encoded as GBK
      out.write(o.toString().getBytes(gbk));
    }
    …
  }
}

Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce code

and Chinese text will be written in GBK.
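The difference between the two encodings is easy to see on the byte level. The sketch below is standalone Java, not part of Hadoop; the class and method names are hypothetical, and it assumes the JDK in use ships the GBK charset (standard JDK distributions normally do). It encodes the same two-character string both ways.

```java
import java.nio.charset.Charset;

/** Compares UTF-8 and GBK byte output for the same string; hypothetical class, not Hadoop code. */
final class GbkEncodingSketch {

    /** Encodes text with the named charset, as getBytes(charsetName) does in the writer above. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Unicode escapes for the two characters of the word "Chinese (language)",
        // so the source file encoding does not matter.
        String text = "\u4e2d\u6587";
        System.out.println(encode(text, "UTF-8").length); // prints 6 (3 bytes per character)
        System.out.println(encode(text, "GBK").length);   // prints 4 (2 bytes per character)
    }
}
```

A consumer that expects GBK and receives the six UTF-8 bytes will decode them as different characters, which is the garbled output described above.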

2. A MapReduce job that normally runs fine threw this error

java.io.IOException: All datanodes xxx.xxx.xxx.xxx:xxx are bad. Aborting…
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
java.io.IOException: Could not get block locations. Aborting…
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

The cause turned out to be too many open files on the Linux machine. The command ulimit -n shows that the default open-file limit on Linux is 1024. Edit /etc/security/limits.conf and add hadoop soft nofile 65535

Then rerun the program (preferably after changing the setting on all datanodes); the problem is solved.

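3. After running for a while, Hadoop cannot be stopped with stop-all.sh, which reports

no tasktracker to stop, no datanode to stop

The cause is that when stopping, Hadoop relies on the process IDs of the mapred and dfs processes on the datanodes. By default the pid files are stored under /tmp, and Linux periodically (typically every month or every 7 days or so) deletes files in that directory. Once the hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid files have been deleted, the namenode naturally can no longer find these processes on the datanodes.

Setting export HADOOP_PID_DIR in the configuration file (conf/hadoop-env.sh) solves this problem.

Problem:

Incompatible namespaceIDs in /usr/local/hadoop/dfs/data: namenode namespaceID = 405233244966; datanode namespaceID = 33333244

Cause:

Every run of hadoop namenode -format generates a new namespaceID for the NameNode, but the DataNode data under hadoop.tmp.dir still carries the previous namespaceID. The mismatch prevents the DataNode from starting. Deleting the hadoop.tmp.dir directory before each run of hadoop namenode -format lets it start successfully. Note that this means deleting the local directory that hadoop.tmp.dir points to, not an HDFS directory.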

Problem: Storage directory not exist

2010-02-09 21:37:53,203 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory D:/hadoop/run/dfs_name_dir does not exist.
2010-02-09 21:37:53,203 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory D:/hadoop/run/dfs_name_dir is in an inconsistent state: storage directory does not exist or is not accessible.

Solution: the storage directory D:/hadoop/run/dfs_name_dir does not exist, so simply create it manually.

Problem: NameNode is not formatted

Solution: HDFS has not been formatted yet. Run hadoop namenode -format, then start it again.

Running bin/hadoop jps reports the following exception:

Exception in thread "main" java.lang.NullPointerException
    at sun.jvmstat.perfdata.monitor.protocol.local.LocalVmManager.activeVms(LocalVmManager.java:127)
    at sun.jvmstat.perfdata.monitor.protocol.local.MonitoredHostProvider.activeVms(MonitoredHostProvider.java:133)
    at sun.tools.jps.Jps.main(Jps.java:45)

Cause:

The /tmp folder under the system root was deleted. Recreate the /tmp folder to fix it.

An "unable to create log directory /tmp/..." error in bin/hive may have the same cause.