Setting Up Nutch 2.1 on Windows: Debugging in Eclipse with Storage in MySQL

hjgreg

2014-01-25

Step 1: Have Eclipse, the Eclipse SVN plugin, and MySQL ready. MySQL should use UTF-8 encoding.
Step 2: Create the database and the table in MySQL:
CREATE DATABASE nutch;
USE nutch;
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

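On my machine `id` varchar(767) NOT NULL did not work; the largest length that succeeded was 100, so I changed it to: `id` varchar(100) NOT NULL. The likely cause is that InnoDB limits index key length in bytes (767 bytes per column by default in older MySQL versions), and one utf8mb4 character can take up to 4 bytes.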
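A rough way to estimate a safe key length is to divide the index byte limit by the worst-case bytes per character. The sketch below assumes a 767-byte limit; the real limit depends on the MySQL version, row format, and server settings, so treat the result as an upper bound to try, not a guarantee. The class and method names are made up for illustration.

```java
/** Illustrative helper, not part of Nutch or MySQL. */
class IndexKeyLimit {

    /** Largest character count whose worst-case byte size still fits in limitBytes. */
    static int maxIndexableChars(int limitBytes, int maxBytesPerChar) {
        if (limitBytes <= 0 || maxBytesPerChar <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return limitBytes / maxBytesPerChar; // integer division rounds down
    }

    public static void main(String[] args) {
        int limit = 767; // assumption: index key limit in bytes on this server
        System.out.println("latin1  (1 byte/char):  " + maxIndexableChars(limit, 1));
        System.out.println("utf8    (3 bytes/char): " + maxIndexableChars(limit, 3));
        System.out.println("utf8mb4 (4 bytes/char): " + maxIndexableChars(limit, 4));
    }
}
```

Under these assumptions utf8mb4 allows at most 191 characters, which is why varchar(767) is rejected as a primary key. A smaller value such as the 100 used here stays well inside the limit.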
Step 3: Check out the code from https://svn.apache.org/repos/asf/nutch/tags/release-2.1 and create a Java project from it locally. I have tried this many times, so here I simply named the project test.
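Step 4: Add the source folders
In Project Explorer, right-click the project and choose Properties. Go to Java Build Path and, on the Source tab, remove the src folder. Then click "Add Folder" and add conf, src/bin, src/java, src/test, src/testresources, and the src/java and src/test folders of every plugin under src/plugin. When you are done, all of these folders are listed as source folders of the project (test is the project name).

Every Eclipse project folder contains a .classpath file. If you open it, the content should look roughly like this: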
<classpathentry kind="src" path="conf"/>
<classpathentry kind="src" path="src/java"/>
<classpathentry kind="src" path="src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
<classpathentry kind="src" path="src/bin"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
<classpathentry kind="src" path="src/plugin/tld/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
<classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
<classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/test"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
<classpathentry kind="src" path="src/testresources"/>
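Adding dozens of plugin folders by hand is tedious, so the entries can also be generated. The following is a minimal sketch, assuming the layout shown above (src/plugin/<name>/src/java and src/plugin/<name>/src/test under the project root). The class name ClasspathEntries is made up for this example, and the output still has to be pasted into .classpath manually.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper, not part of Nutch or Eclipse. */
class ClasspathEntries {

    /** Builds one classpathentry line per existing plugin src/java or src/test folder. */
    static List<String> pluginSourceEntries(Path projectRoot) throws IOException {
        List<String> entries = new ArrayList<>();
        Path pluginRoot = projectRoot.resolve("src").resolve("plugin");
        if (!Files.isDirectory(pluginRoot)) {
            return entries; // nothing to list
        }
        try (DirectoryStream<Path> plugins = Files.newDirectoryStream(pluginRoot)) {
            for (Path plugin : plugins) {
                for (String sub : new String[] {"java", "test"}) {
                    if (Files.isDirectory(plugin.resolve("src").resolve(sub))) {
                        // .classpath uses forward slashes, also on Windows
                        String rel = "src/plugin/" + plugin.getFileName() + "/src/" + sub;
                        entries.add("<classpathentry kind=\"src\" path=\"" + rel + "\"/>");
                    }
                }
            }
        }
        Collections.sort(entries); // stable output, independent of directory order
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String entry : pluginSourceEntries(root)) {
            System.out.println(entry);
        }
    }
}
```

Run it with the project root as the only argument, then copy the printed lines into .classpath next to the conf, src/java, src/test, src/bin, and src/testresources entries.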

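Step 5: Add the libraries
Switch to the Libraries tab and choose "Add Library" -> "IvyDE Managed Dependencies" -> "Next". Select "Project" and pick the ivy\ivy.xml file, then click OK. Eclipse downloads the dependency jars automatically.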

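This step may fail with an error saying that the org.restlet.jse packages cannot be downloaded. To fix it, find the following section in ivy\ivy.xml and comment it out:
<dependency org="org.restlet.jse" name="org.restlet" rev="2.0.5" conf="*->default" />
<dependency org="org.restlet.jse" name="org.restlet.ext.jackson" rev="2.0.5"
conf="*->default" />
Then find these two jars manually online, put them in the lib folder, and add them on the Libraries tab.

Next, add the ivy.xml file of each plugin under the plugin folder in the same way, one by one, by hand.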

Step 6: On the "Order and Export" tab, move conf to the top.
Step 7: Database settings and other configuration
Open /conf/gora.properties, delete everything in it, and write the MySQL configuration:
###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
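A typo in this file usually shows up only later as a confusing runtime error, so a quick check before running the crawl can help. The sketch below only verifies that the four keys used in this article are present and non-empty; it does not open a database connection. The class name GoraConfigCheck is made up, and the list of keys is taken from the configuration above, not from the Gora documentation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative helper, not part of Nutch or Gora. */
class GoraConfigCheck {

    // Assumption: these are the keys this article writes into gora.properties.
    static final String[] EXPECTED_KEYS = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    /** Returns the expected keys that are absent or blank in the given properties text. */
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "conf/gora.properties");
        // .properties files are traditionally ISO-8859-1 encoded
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        List<String> missing = missingKeys(text);
        System.out.println(missing.isEmpty()
                ? "All expected keys are set."
                : "Missing or empty: " + missing);
    }
}
```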

In /conf/gora-sql-mapping.xml, change the primary key element to <primarykey column="id" length="240"/>
Then add the following to /conf/nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
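The plugin.includes value is easy to get wrong, because a missing alternative silently disables a plugin. The sketch below checks which plugin directory names the expression selects. It assumes that a name must match the whole expression, which is how the property description reads; the class name PluginIncludes and the sample names in main are only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative helper, not part of Nutch. */
class PluginIncludes {

    /** Returns the plugin names that fully match the plugin.includes expression. */
    static List<String> included(String includesRegex, List<String> pluginNames) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (String name : pluginNames) {
            if (pattern.matcher(name).matches()) { // assumption: full match, not substring
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String includes = "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
                + "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic";
        List<String> names = Arrays.asList(
                "protocol-http", "parse-tika", "index-more", "urlfilter-domain", "scoring-opic");
        System.out.println(included(includes, names));
    }
}
```

With the value used in this article, index-more and urlfilter-domain are not selected, so add them to the expression if you need them.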

In build.xml in the project root, find the following code:
<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
<ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
<ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
<antcall target="copy-libs" />
</target>
Replace pattern="${build.lib.dir}/[artifact]-[revision].[ext]" with pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]".
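The article does not say why this change is needed. One plausible effect is that two artifacts with the same name, revision, and extension but different types no longer overwrite each other. The sketch below imitates the token substitution with plain string replacement to show the resulting file names. It is not Ivy's implementation, and the artifact name example-lib is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a simplified stand-in for Ivy's pattern expansion. */
class RetrievePatternDemo {

    /** Replaces each [token] in the pattern with its value. */
    static String expand(String pattern, Map<String, String> tokens) {
        String out = pattern;
        for (Map.Entry<String, String> e : tokens.entrySet()) {
            out = out.replace("[" + e.getKey() + "]", e.getValue());
        }
        return out;
    }

    /** Builds the token map for one (hypothetical) artifact. */
    static Map<String, String> artifact(String name, String type, String revision, String ext) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("artifact", name);
        tokens.put("type", type);
        tokens.put("revision", revision);
        tokens.put("ext", ext);
        return tokens;
    }

    public static void main(String[] args) {
        String oldPattern = "[artifact]-[revision].[ext]";
        String newPattern = "[artifact]-[type]-[revision].[ext]";
        Map<String, String> binary = artifact("example-lib", "jar", "1.0", "jar");
        Map<String, String> sources = artifact("example-lib", "source", "1.0", "jar");
        // Old pattern: both artifacts map to the same file name.
        System.out.println(expand(oldPattern, binary) + " / " + expand(oldPattern, sources));
        // New pattern: the type keeps the two file names apart.
        System.out.println(expand(newPattern, binary) + " / " + expand(newPattern, sources));
    }
}
```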
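Step 8: Configure the URLs to crawl
In the test project, create a folder named urls, and inside it a file named seeds.txt that lists the sites you want to crawl, one per line. I used http://www.163.com.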
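If you prefer to create the seed list from code, for example to regenerate it between test runs, a few lines are enough. This is a minimal sketch; the class name SeedFile is made up, and the folder and file names follow this step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper, not part of Nutch. */
class SeedFile {

    /** Creates <projectRoot>/urls/seeds.txt with one URL per line and returns its path. */
    static Path writeSeeds(Path projectRoot, List<String> urls) throws IOException {
        Path dir = Files.createDirectories(projectRoot.resolve("urls"));
        Path file = dir.resolve("seeds.txt");
        Files.write(file, urls, StandardCharsets.UTF_8); // overwrites an existing file
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path file = writeSeeds(root, Arrays.asList("http://www.163.com"));
        System.out.println("Wrote " + file.toAbsolutePath());
    }
}
```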
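Step 9: Run org.apache.nutch.crawl.Crawler
Open the Crawler class and choose "Run As" -> "Run Configurations". On the "Arguments" tab, enter "urls -depth 3 -topN 5" under "Program Arguments" and click "Run". Ha, it failed, didn't it? The error looks something like "Failed to set permissions of path: \tmp\Hadoop-Administrator\mapred\staging\Administrator1712398257\.". This is a Hadoop problem on Windows. The fix is to edit checkReturnValue in /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out its body, then rebuild. The easiest way, of course, is to find an already patched jar online and swap in its FileUtil.class.
Run it again and, ha, this time it succeeds. That is the whole setup.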

Good luck, everyone.

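Problems encountered:
1. Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
Searching online shows this can have many causes. First, make sure plugin.folders is configured in nutch-default.xml: <name>plugin.folders</name><value>./src/plugin</value>
Second, look in the hadoop.log file for the underlying error.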
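The "job failed" message in the console rarely names the real cause, while hadoop.log usually does. The sketch below prints only the log lines that look like errors. The default path logs/hadoop.log and the markers it searches for are assumptions; adjust both to your setup. The class name LogScan is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper, not part of Nutch or Hadoop. */
class LogScan {

    /** Keeps the lines that contain an error level marker or mention an exception. */
    static List<String> suspiciousLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(" ERROR ") || line.contains(" FATAL ") || line.contains("Exception")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the log is written to logs/hadoop.log under the project folder.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        // ISO-8859-1 decodes any byte sequence, so odd characters cannot abort the scan.
        List<String> lines = Files.readAllLines(log, StandardCharsets.ISO_8859_1);
        for (String hit : suspiciousLines(lines)) {
            System.out.println(hit);
        }
    }
}
```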
