Common errors when integrating Nutch 1.13 with Solr 5.5.4

1. Error complaining about missing Elasticsearch configuration

After configuring Nutch 1.13 and Solr 5.5.4, running the Solr indexing step on the crawl results fails with an error complaining that Elasticsearch is not configured:

Zhuos-MacBook-Pro:apache-nutch-1.13 jo$ nutch solrindex http://localhost:8983/solr/ ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments

...

Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. At least one of them should be set in nutch-site.xml 

ElasticIndexWriter

elastic.cluster : elastic prefix cluster

elastic.host : hostname

elastic.port : port

elastic.index : elastic index command 

elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)

elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)

elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)

elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)

elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

Solution: this happens because the indexer plugin is set to Elasticsearch. Edit nutch-site.xml to fix it:

In the value of the <name>plugin.includes</name> property, change indexer-elastic to indexer-solr. After the change the property looks like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
  </description>
</property>
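As a side note, you can also point Nutch at Solr directly in nutch-site.xml rather than only on the command line. A minimal sketch, assuming your Nutch 1.13 build reads the solr.server.url property (check nutch-default.xml for the exact property name in your version; "mycore" is a placeholder core name):

```xml
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/mycore</value>
  <description>Base URL of the Solr core that Nutch should index into.
  "mycore" is a placeholder; use the core you created.</description>
</property>
```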

2. nutch solrindex fails with "Job failed!" and HTTP 404/405 errors

Indexer: java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)

at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)

at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)


Solution: check the errors in Nutch's logs directory. Underneath, the job fails because Solr responded with 404, 405, or similar HTTP status codes.

This happens because the Solr address passed to nutch solrindex is incorrect.

Append the core name directly after http://localhost:8983/solr; do not include a # fragment or a ~cores/ path.

For example, if you created a core named mycore, the nutch solrindex command is:

nutch solrindex http://localhost:8983/solr/mycore ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments
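Before launching the full indexing job, it can save time to sanity-check the core URL first. A minimal sketch, assuming the hypothetical core name mycore and a local Solr listening on port 8983:

```shell
# Hypothetical core name; substitute the core you actually created.
CORE=mycore
SOLR_URL="http://localhost:8983/solr/$CORE"
echo "Indexing into: $SOLR_URL"

# With Solr running, the core's ping handler should answer HTTP 200:
#   curl -s -o /dev/null -w '%{http_code}\n' "$SOLR_URL/admin/ping"
# A 404 here indicates the same wrong-URL problem that makes solrindex fail.
```

If the ping returns 404, fix the URL before re-running nutch solrindex.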