Common errors when integrating Nutch 1.13 with Solr 5.5.4

1. Error complaining about missing Elasticsearch configuration

After configuring Nutch 1.13 and Solr 5.5.4, running the Solr indexing step on the crawl results fails with an error complaining that Elasticsearch is not configured:

Zhuos-MacBook-Pro:apache-nutch-1.13 jo$ nutch solrindex http://localhost:8983/solr/ ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments

...

Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. At least one of them should be set in nutch-site.xml 

ElasticIndexWriter

elastic.cluster : elastic prefix cluster

elastic.host : hostname

elastic.port : port

elastic.index : elastic index command 

elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)

elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)

elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)

elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)

elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

Solution: this happens because the indexer plugin is set to Elasticsearch. Edit nutch-site.xml to fix it:

In the value of the <name>plugin.includes</name> property, change indexer-elastic to indexer-solr. After the change the property looks like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
  </description>
</property>
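As a side note, you can also point Nutch at Solr directly in nutch-site.xml rather than only on the command line. A minimal sketch, assuming your Nutch 1.13 build reads the solr.server.url property (check nutch-default.xml for the exact property name in your version; "mycore" is a placeholder core name):

```xml
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/mycore</value>
  <description>Base URL of the Solr core that Nutch should index into.
  "mycore" is a placeholder; use the core you created.</description>
</property>
```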

2. nutch solrindex fails with "Job failed!" and HTTP 404/405 errors

Indexer: java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)

at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)

at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)


Solution: check the errors in Nutch's logs directory. Underneath, the job fails because Solr responded with 404, 405, or similar HTTP status codes.

This happens because the Solr address passed to nutch solrindex is incorrect.

Append the core name directly after http://localhost:8983/solr; do not include a # fragment or a ~cores/ path.

For example, if you created a core named mycore, the nutch solrindex command is:

nutch solrindex http://localhost:8983/solr/mycore ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments
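Before launching the full indexing job, it can save time to sanity-check the core URL first. A minimal sketch, assuming the hypothetical core name mycore and a local Solr listening on port 8983:

```shell
# Hypothetical core name; substitute the core you actually created.
CORE=mycore
SOLR_URL="http://localhost:8983/solr/$CORE"
echo "Indexing into: $SOLR_URL"

# With Solr running, the core's ping handler should answer HTTP 200:
#   curl -s -o /dev/null -w '%{http_code}\n' "$SOLR_URL/admin/ping"
# A 404 here indicates the same wrong-URL problem that makes solrindex fail.
```

If the ping returns 404, fix the URL before re-running nutch solrindex.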