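Notes on a failed Hadoop cluster upgrade
I. Error overview
We needed to run Hadoop together with HBase, which requires HDFS to support append writes, so we upgraded the existing Hadoop 0.20.1 to 0.20.205.0. The upgrade itself was simple: running hadoop namenode -upgrade took the layout version from -18 to -32 (the layoutVersion recorded in dfs/name/current/VERSION). We then found that 0.20.205 is incompatible with Hive, so we installed Facebook's Hadoop distribution instead (layout version -30).
In short, the version in dfs/name/current/VERSION went -18 => -32 => -30: an upgrade followed by a rollback.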
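Before starting a daemon at any of these steps, the layout version on disk can be checked by reading the VERSION file directly. A minimal sketch, assuming the file is a plain key=value properties file containing a layoutVersion entry; the function name read_layout_version is made up for illustration:

```shell
#!/bin/sh
# Print the layoutVersion recorded in a storage directory's VERSION file.
# Assumes a Java-properties style file with a line such as "layoutVersion=-18".
read_layout_version() {
    version_file="$1/current/VERSION"
    if [ ! -f "$version_file" ]; then
        echo "no VERSION file at $version_file" >&2
        return 1
    fi
    # Skip commented lines; print the value of the first layoutVersion entry.
    awk -F= '/^[[:space:]]*#/ { next } $1 == "layoutVersion" { print $2; exit }' "$version_file"
}

# Example (path taken from the logs in this post):
# read_layout_version /data/hadoop-tmp/hadoop-hadoop/dfs/name
```

Comparing this value with the "Expecting" value in the startup error shows which direction (rollback or upgrade) the storage directory has to move.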
II. NameNode error handling
1. NameNode fails to start (first error):
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop-tmp/hadoop-hadoop/dfs/name. Reported: -32. Expecting = -30.
    at org.apache.hadoop.hdfs.server.common.Storage.getFields(Storage.java:662)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:741)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:238)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:227)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:453)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:158)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:386)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:361)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:274)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:385)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1419)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1428)
Solution:
Run the following on the NameNode host:
hadoop namenode -rollback
This rolls the storage back from version -32 to version -18.
2. Start the NameNode again (second error):
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/hadoop-tmp/hadoop-hadoop/dfs/name is in an inconsistent state: file VERSION has image MD5 digest when version is -18
Solution:
Comment out the imageMD5Digest entry in dfs/name/current/VERSION so that the MD5 integrity check is skipped.
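The edit to VERSION can be scripted so that the original file is kept. A minimal sketch, assuming the entry appears at the start of a line as imageMD5Digest=...; the function name and the VERSION.bak backup name are choices made here, not something from the original procedure:

```shell
#!/bin/sh
# Comment out the imageMD5Digest entry in a NameNode VERSION file,
# keeping a copy of the original next to it (VERSION.bak, a name chosen here).
comment_out_md5_digest() {
    version_file="$1"
    [ -f "$version_file" ] || { echo "not found: $version_file" >&2; return 1; }
    cp -p "$version_file" "$version_file.bak" || return 1
    # Prefix the imageMD5Digest line with '#'; all other lines pass through unchanged.
    sed 's/^imageMD5Digest=/#imageMD5Digest=/' "$version_file.bak" > "$version_file"
}

# Example, with the NameNode stopped:
# comment_out_md5_digest /data/hadoop-tmp/hadoop-hadoop/dfs/name/current/VERSION
```

Keeping the backup makes it possible to restore the digest line if the later upgrade step has to be retried from scratch.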
3. Then upgrade from version -18 to version -30:
hadoop namenode -upgrade
With that, the NameNode starts successfully.
III. DataNode error handling
1. Start the DataNode
2011-12-12 18:06:18,544 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to start datanode org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/hadoop-tmp/hadoop-hadoop/dfs/data. Reported: -32. Expecting = -30.
Solution:
hadoop datanode -rollback
This rolls the DataNode back to version -18; it is then upgraded to version -30.
hadoop datanode -rollback
2. Start the DataNode again
11/12/12 19:34:26 INFO datanode.DataNode: Failed to start datanode org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/hadoop-tmp/hadoop-hadoop/dfs/data is in an inconsistent state: previous and previous.tmp cannot exist together.
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:427)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:113)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:332)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:249)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1528)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1477)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1485)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1626)
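Solution:
Reading the source directly shows that Storage$StorageDirectory.analyzeStorage() does nothing more than check the state of the storage directories. Comparing against our production Hadoop cluster, we found that production has neither a previous nor a previous.tmp directory (both are backups created by the upgrade), so we renamed these two directories. After that the DataNode started successfully.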
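The rename can be done with a small helper that refuses to overwrite anything. This is only a sketch: the function name and the ".bak" suffix are arbitrary choices made here, since the new directory names actually used are not recorded in this post. It assumes previous and previous.tmp sit directly under the DataNode storage directory, as the error message indicates.

```shell
#!/bin/sh
# Move previous and previous.tmp out of the way inside a DataNode storage
# directory by renaming them with a suffix. The default "bak" suffix is an
# arbitrary choice; the post does not say which names were used.
set_aside_upgrade_dirs() {
    storage_dir="$1"
    suffix="${2:-bak}"
    for name in previous previous.tmp; do
        if [ -d "$storage_dir/$name" ]; then
            target="$storage_dir/$name.$suffix"
            if [ -e "$target" ]; then
                echo "refusing to overwrite $target" >&2
                return 1
            fi
            mv "$storage_dir/$name" "$target" || return 1
            echo "renamed $name -> $name.$suffix"
        fi
    done
}

# Example, with the DataNode stopped (path taken from the logs in this post):
# set_aside_upgrade_dirs /data/hadoop-tmp/hadoop-hadoop/dfs/data
```

Renaming rather than deleting keeps the upgrade backups available until the cluster has been confirmed healthy.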
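Finally, run hadoop namenode -finalize to finish the upgrade, which deletes the backup files created by the upgrade.
IV. Lessons learned:
1. Back up your Hadoop NameNode data in good time, whether or not the cluster is only a test cluster.
2. Read the Hadoop source. Sometimes reading the source is the only way to solve a problem, because a Google search turns up very little material.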
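For the first point, a backup taken before any upgrade or rollback can be as simple as archiving the NameNode storage directory. A minimal sketch, assuming the NameNode is stopped while it runs; the function name, the archive naming scheme, and the backup location are all placeholders:

```shell
#!/bin/sh
# Archive a NameNode storage directory into a timestamped tarball.
# Both arguments are placeholders; run this while the NameNode is stopped
# so that the image and edit log are not changing underneath tar.
backup_name_dir() {
    name_dir="$1"
    backup_root="$2"
    [ -d "$name_dir" ] || { echo "not a directory: $name_dir" >&2; return 1; }
    mkdir -p "$backup_root" || return 1
    archive="$backup_root/namenode-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$name_dir")" "$(basename "$name_dir")" || return 1
    echo "$archive"
}

# Example (source path from the logs in this post, destination hypothetical):
# backup_name_dir /data/hadoop-tmp/hadoop-hadoop/dfs/name /backup/hadoop
```

The function prints the path of the archive it wrote, so a wrapper script can log it or copy it to another host.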