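Deploying Nutch 2.1 on distributed HBase
Official documentation: http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase
Deployment guides for Nutch 2.0 and later are hard to find online and mostly incomplete. After two days of struggle I finally got Nutch 2.1 running on HBase, so I am sharing the steps here.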
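Prepare two machines:
cr5 (master): 192.168.8.185, cr8 (slave): 192.168.8.188
The two machines must be able to reach each other over ssh (search the web for the details).
Edit the /etc/hostname file on each machine:
cr5 on the master, cr8 on the slave
Edit the /etc/hosts file on both machines:
192.168.8.185 cr5
192.168.8.188 cr8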
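As a quick sanity check of the name mapping above, here is a minimal sketch. It writes the two entries to a scratch file instead of the real /etc/hosts, and lookup_host is a helper name I made up for illustration. On a real node, getent hosts cr5 does the same lookup against the live system.

```shell
# Scratch file standing in for /etc/hosts (assumption: we do not touch the real file here).
HOSTS_FILE="$(mktemp)"
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1 localhost
192.168.8.185 cr5
192.168.8.188 cr8
EOF

# lookup_host NAME: print the IP address mapped to NAME in $HOSTS_FILE (hypothetical helper).
lookup_host() {
  awk -v name="$1" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
  ' "$HOSTS_FILE"
}

for host in cr5 cr8; do
  echo "$host -> $(lookup_host "$host")"
done
```

Both names must resolve to the same addresses on both machines, because the Hadoop and HBase configuration files below refer to the nodes by name only.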
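I plan to run these processes on cr5:
Hadoop: NameNode, SecondaryNameNode, JobTracker
HBase: HMaster
And these processes on cr8:
Hadoop: DataNode, TaskTracker
HBase: HQuorumPeer, HRegionServer
Now we can start deploying Hadoop and HBase.
The official sites offer many Hadoop and HBase versions, but Nutch 2.1 does not work with all of them.
The official documentation says: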
• Install and configure HBase. You can get it here (N.B. Gora 0.2 uses HBase 0.90.4, however the setup is known to work with more recent versions of the HBase 0.90.x branch)
To be safe, stick with the recommended HBase 0.90.x branch.
I chose hadoop-1.0.4 and hbase-0.90.6.
With other versions, running Nutch fails with the following exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V
I believe Gora is the cause: it has not been updated for a long time, so it presumably still expects the HBase 0.90 API.
I. Configuring Hadoop
1. Download the matching hadoop-<version>.tar.gz with wget
2. Extract it with tar zxvf hadoop-<version>.tar.gz
3. cd into conf and edit the configuration files
a.hadoop-env.sh
export JAVA_HOME=/opt/jdk1.6.0_21
b.core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cr5:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/kfs/ww/data/hadoop_tmp</value>
    <description>Sets the Hadoop base directory</description>
  </property>
</configuration>
c.hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas</description>
  </property>
</configuration>
d.mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>cr5:9001</value>
    <description>JobTracker host:port</description>
  </property>
</configuration>
e.masters
cr5
f.slaves
cr8
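Once the configuration is done, copy the hadoop directory from cr5 to cr8.
On cr5, run the following in hadoop/bin to format the NameNode:
./hadoop namenode -format
The DataNode needs no format step of its own; it initializes its storage directory the first time it starts.
Then start Hadoop:
./start-all.sh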
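To check whether startup succeeded,
look through the *.log files under hadoop/logs and make sure there are no exceptions.
Then open the web interfaces on cr5:
http://localhost:50030 (JobTracker)
http://localhost:50070 (NameNode)
to view the cluster status.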
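Besides the logs and web pages, the JDK's jps tool lists the running Java processes, which makes it easy to compare each node against the process layout planned earlier. The sketch below is only an illustration: check_daemons is a helper name I made up, and the sample output (including the PIDs) is invented so the snippet runs without a cluster.

```shell
# check_daemons NAME...: read jps-style output ("<pid> <name>") on stdin and
# report whether each expected daemon name is present (hypothetical helper).
check_daemons() {
  jps_out="$(cat)"
  missing=0
  for daemon in "$@"; do
    if printf '%s\n' "$jps_out" | awk '{ print $2 }' | grep -qx -- "$daemon"; then
      echo "running: $daemon"
    else
      echo "MISSING: $daemon"
      missing=1
    fi
  done
  return "$missing"
}

# Invented sample of what jps might print on cr5; the PIDs are placeholders.
sample_jps='2101 NameNode
2250 SecondaryNameNode
2333 JobTracker
2950 Jps'

printf '%s\n' "$sample_jps" | check_daemons NameNode SecondaryNameNode JobTracker

# On the real nodes the input would come from jps itself, for example:
#   jps | check_daemons NameNode SecondaryNameNode JobTracker   # on cr5
#   jps | check_daemons DataNode TaskTracker                    # on cr8
```

The same helper can be reused after HBase is started, with HMaster on cr5 and HQuorumPeer and HRegionServer on cr8.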
II. Configuring HBase
1. Download the matching hbase-<version>.tar.gz with wget
2. Extract it with tar zxvf hbase-<version>.tar.gz
3. cd into conf and edit the configuration files
a.hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://cr5:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>cr8</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/kfs/ww/data/zookeeper_data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/home/kfs/ww/data/hbase_tmp</value>
  </property>
</configuration>
Note: the hdfs://cr5:9000/hbase here must agree with the fs.default.name set in Hadoop's core-site.xml.
b.hbase-env.sh
export JAVA_HOME=/opt/jdk1.6.0_21
export HBASE_CLASSPATH=~/ww/hbase-0.90.6/conf
export HBASE_MANAGES_ZK=true
c.regionservers
cr8
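That completes the HBase configuration.
There is still some follow-up work, though:
1. Delete the hadoop-core-<version>.jar bundled in HBase's lib directory, then copy hadoop-core-<version>.jar and commons-collections-3.2.1.jar from Hadoop into HBase's lib directory.
Otherwise HBase's HMaster will not start!
2. Turn off the firewall.
Go to hbase/bin and start HBase with ./start-hbase.sh
To verify that it started, check the logs for exceptions,
or open http://localhost:60010 for details.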
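Since hbase.rootdir has to line up with Hadoop's fs.default.name, a small check can catch a mismatched host or port before HBase is started. The following is a rough sketch under a few assumptions: get_prop and check_rootdir are names I made up, the sed-based extraction only copes with simple files like the ones shown in this post, and the two files are recreated in a scratch directory rather than read from the real conf directories.

```shell
# Scratch copies of the two relevant properties (values taken from the configs above).
CONF_DIR="$(mktemp -d)"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cr5:9000/</value>
  </property>
</configuration>
EOF
cat > "$CONF_DIR/hbase-site.xml" <<'EOF'
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://cr5:9000/hbase</value>
  </property>
</configuration>
EOF

# get_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Naive text matching, not a real XML parser (hypothetical helper).
get_prop() {
  tr -d '\n' < "$1" | sed -n "s:.*<name>$2</name>[^<]*<value>\([^<]*\)</value>.*:\1:p"
}

# check_rootdir CORE_SITE HBASE_SITE: succeed if hbase.rootdir lives under fs.default.name.
check_rootdir() {
  fs="$(get_prop "$1" fs.default.name)"
  root="$(get_prop "$2" hbase.rootdir)"
  case "$root" in
    "${fs%/}"/*) echo "OK: $root is under $fs" ;;
    *) echo "MISMATCH: hbase.rootdir=$root, fs.default.name=$fs"; return 1 ;;
  esac
}

check_rootdir "$CONF_DIR/core-site.xml" "$CONF_DIR/hbase-site.xml"
```

On a real installation the two arguments would be the core-site.xml in Hadoop's conf directory and the hbase-site.xml in HBase's conf directory.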
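The jar replacement in step 1 above can be sketched as a small function. This is an illustration only: swap_hadoop_core is a name I made up, the directory layout (hadoop-core jar at the top of the Hadoop directory, commons-collections under its lib) is my assumption about the release archive, and the demo runs on empty placeholder files in a scratch directory, where hadoop-core-bundled.jar stands in for whatever jar HBase actually ships.

```shell
# swap_hadoop_core HADOOP_HOME HBASE_HOME: replace HBase's bundled hadoop-core jar
# with the one from the deployed Hadoop, and copy commons-collections as well.
swap_hadoop_core() {
  hadoop_home="$1"
  hbase_home="$2"
  rm -f "$hbase_home"/lib/hadoop-core-*.jar
  cp "$hadoop_home"/hadoop-core-*.jar "$hbase_home/lib/"
  cp "$hadoop_home/lib/commons-collections-3.2.1.jar" "$hbase_home/lib/"
}

# Demo on a scratch layout with empty placeholder jars (assumed paths, not real installs).
SCRATCH="$(mktemp -d)"
mkdir -p "$SCRATCH/hadoop-1.0.4/lib" "$SCRATCH/hbase-0.90.6/lib"
touch "$SCRATCH/hadoop-1.0.4/hadoop-core-1.0.4.jar" \
      "$SCRATCH/hadoop-1.0.4/lib/commons-collections-3.2.1.jar" \
      "$SCRATCH/hbase-0.90.6/lib/hadoop-core-bundled.jar"

swap_hadoop_core "$SCRATCH/hadoop-1.0.4" "$SCRATCH/hbase-0.90.6"
ls "$SCRATCH/hbase-0.90.6/lib"
```

If the HBase directory is copied to cr8 separately, the same replacement presumably has to be done there too, so that every node loads the same hadoop-core jar.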
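III. Configuring Nutch
I won't go into importing the project into Eclipse here; the focus is the configuration.
1. Download the matching Nutch release .tar.gz with wget
2. Extract it with tar zxvf
3. cd into conf and edit the configuration files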
a.gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
b.nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>test-nutch</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>test-nutch,*</value>
  </property>
  <property>
    <name>http.agent.name.check</name>
    <value>true</value>
  </property>
  <!-- property>
    <name>plugin.includes</name>
    <value>.*</value>
    <description>Enable all plugins during unit testing.</description>
  </property -->
  <property>
    <name>distributed.search.test.port</name>
    <value>60000</value>
    <description>TCP port used during junit testing.</description>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
    <description>Value of the "Accept-Language" request header field.
    This allows selecting non-English language as default one to retrieve.
    It is a useful setting for search engines build for certain national group.
    </description>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other
    information is available</description>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>The Gora DataStore class for storing and retrieving data.
    Currently the following stores are available: ….
    </description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>C:/data/hadoop_tmp</value>
    <description>Sets the Hadoop base directory</description>
  </property>
</configuration>
c.nutch-site.xml
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
d.hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.master</name>
    <value>cr5:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>cr8</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
e.ivy.xml
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />
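f.Create a urls folder, create seed.txt inside it, and write the links to crawl into seed.txt, one per line
g.Add the regular expressions that decide which URLs are crawled to regex-urlfilter.txt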
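Steps f and g are easier to get right with a concrete example. The sketch below builds a seed folder and a small filter file in a scratch directory, then tries a couple of URLs against the rules. The seed URL and the rules are made-up examples, and url_allowed is a hypothetical helper that only approximates the filter: it treats each line as +regex or -regex, lets the first matching rule decide, rejects a URL when nothing matches, and uses grep -E instead of Java's regex engine.

```shell
WORK="$(mktemp -d)"

# Step f: seed folder with one URL per line (example URL, not from the original post).
mkdir -p "$WORK/urls"
echo "http://nutch.apache.org/" > "$WORK/urls/seed.txt"

# Step g: example rules; accept pages on the seed host, reject everything else.
cat > "$WORK/regex-urlfilter.txt" <<'EOF'
# lines starting with # are comments
+^http://([a-z0-9]*\.)*nutch.apache.org/
-.
EOF

# url_allowed FILTER_FILE URL: succeed if the first matching rule starts with "+".
url_allowed() {
  while IFS= read -r rule; do
    case "$rule" in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign="$(printf '%s' "$rule" | cut -c1)"
    pattern="${rule#?}"             # rule without its leading + or -
    if printf '%s\n' "$2" | grep -Eq -- "$pattern"; then
      if [ "$sign" = "+" ]; then return 0; else return 1; fi
    fi
  done < "$1"
  return 1                          # no rule matched: treat as rejected
}

for url in "http://nutch.apache.org/about.html" "http://example.com/"; do
  if url_allowed "$WORK/regex-urlfilter.txt" "$url"; then
    echo "crawl: $url"
  else
    echo "skip:  $url"
  fi
done
```

Trying a few URLs this way before a crawl is a cheap way to spot a rule that is too strict or too loose.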
The configuration is done. One follow-up task:
the hbase-<version>.jar used by Nutch must be the same version as the deployed HBase.
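A quick way to spot a mismatch is to compare the version embedded in the jar file names. The sketch below is illustrative only: hbase_jar_version and versions_match are names I made up, and the two directories are scratch stand-ins for wherever your Nutch build keeps its jars and for the HBase install directory. The 0.90.4 file in the demo reflects the version Gora 0.2 uses according to the official note quoted earlier.

```shell
# hbase_jar_version DIR: print the version of the hbase-<version>.jar found in DIR,
# ignoring any -tests jar (hypothetical helper).
hbase_jar_version() {
  for jar in "$1"/hbase-[0-9]*.jar; do
    case "$jar" in *-tests.jar) continue ;; esac
    [ -e "$jar" ] || continue
    basename "$jar" .jar | sed 's/^hbase-//'
    return 0
  done
  return 1
}

# versions_match NUTCH_LIB HBASE_HOME: succeed only if both directories hold the same version.
versions_match() {
  nutch_v="$(hbase_jar_version "$1")" || { echo "no hbase jar in $1"; return 1; }
  hbase_v="$(hbase_jar_version "$2")" || { echo "no hbase jar in $2"; return 1; }
  if [ "$nutch_v" = "$hbase_v" ]; then
    echo "OK: both use HBase $nutch_v"
  else
    echo "MISMATCH: Nutch has $nutch_v, cluster has $hbase_v"
    return 1
  fi
}

# Demo with empty placeholder jars in scratch directories (assumed layout).
DEMO="$(mktemp -d)"
mkdir -p "$DEMO/nutch-lib" "$DEMO/hbase-home"
touch "$DEMO/nutch-lib/hbase-0.90.4.jar" \
      "$DEMO/hbase-home/hbase-0.90.6.jar" "$DEMO/hbase-home/hbase-0.90.6-tests.jar"

versions_match "$DEMO/nutch-lib" "$DEMO/hbase-home" \
  || echo "replace the hbase jar on the Nutch side with the one from the cluster"
```

When the check reports a mismatch, swapping the jar on the Nutch side for the one from the HBase install is the fix the step above describes.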
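Running Nutch
Set the Arguments in the Eclipse run configuration:
1. Program arguments
urls -depth 3 -topN 5
Here urls is the seed folder created during the Nutch configuration above.
2. VM arguments
-Xms256m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
And that's it, all done!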
IV. Handling exceptions at run time
1. point org.apache.nutch.net.URLNormalizer not found. See http://youkimra.iteye.com/blog/1039903
2. ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-admin\mapred\local\ttprivate to 0700
See: http://download.csdn.net/detail/java2000_wl/4326323
3. Some plugin classes in Nutch are missing their jars; when you hit such an error, add the missing jars.
If you repost this, please credit the source: http://wangwei3.iteye.com/blog/1818599