Nutch 1.13 + Solr 5.5.4 integration: common errors
1. Elasticsearch configuration not found
After setting up Nutch 1.13 and Solr 5.5.4, running the Solr indexing step on the Nutch crawl results fails with an error complaining that Elasticsearch is not configured:
Zhuos-MacBook-Pro:apache-nutch-1.13 jo$ nutch solrindex http://localhost:8983/solr/ ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments
...
Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. At least one of them should be set in nutch-site.xml
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
Solution: this happens because the indexer plugin is set to Elasticsearch. Fix it by editing nutch-site.xml: in the property whose <name> is plugin.includes, change indexer-elastic to indexer-solr in the <value>. After the change the property looks like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.</description>
</property>
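If you prefer to script the swap, a sed one-liner does the job. The sketch below runs against a temporary file so nothing real is overwritten; on an actual install the target would be conf/nutch-site.xml under your Nutch directory (that path is an assumption about your layout):

```shell
# Sketch: swap indexer-elastic for indexer-solr in plugin.includes.
# Demonstrated on a temporary copy; on a real install, point sed at
# conf/nutch-site.xml instead (path assumed, adjust for your setup).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
EOF
sed -i.bak 's/indexer-elastic/indexer-solr/' "$tmp"   # .bak keeps a backup copy
plugin=$(grep -o 'indexer-[a-z]*' "$tmp")
echo "$plugin"   # prints: indexer-solr
rm -f "$tmp" "$tmp.bak"
```

The -i.bak form works with both GNU and BSD/macOS sed, which matters since the session above is on a MacBook.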
2. nutch solrindex fails with "Job failed!" and Solr returns 404, 405, etc.
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Solution: check the errors in Nutch's logs directory. They show that the underlying cause is the Solr server responding with 404, 405, or similar HTTP errors.
These happen because the Solr URL passed to nutch solrindex is incorrect.
Append the core name directly after http://localhost:8983/solr: do not add a '#', and do not add the '~cores/' path that appears in the Solr admin UI address bar.
For example, if you created a core named mycore, the nutch solrindex command is:
nutch solrindex http://localhost:8983/solr/mycore ~/nutch/crawl/crawldb -linkdb ~/nutch/crawl/linkdb -dir ~/nutch/crawl/segments
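Before re-running the job, it can save a round-trip to sanity-check that the URL has the expected shape: /solr/ followed directly by the core name, with no '#' fragment. The helper below is purely hypothetical (not part of Nutch or Solr), a minimal sketch of that check:

```shell
# Hypothetical helper: flag Solr URLs that will make nutch solrindex
# fail with 404/405. Not part of Nutch; just a sanity-check sketch.
check_solr_url() {
  case "$1" in
    *'#'*)           echo "bad: remove the '#' fragment"; return 1 ;;
    */solr|*/solr/)  echo "bad: append the core name after /solr/"; return 1 ;;
    */solr/?*)       echo "ok"; return 0 ;;
    *)               echo "bad: expected .../solr/<core>"; return 1 ;;
  esac
}

check_solr_url "http://localhost:8983/solr/mycore"     # prints: ok
check_solr_url "http://localhost:8983/solr/#/mycore"   # prints: bad: remove the '#' fragment
check_solr_url "http://localhost:8983/solr/"           # prints: bad: append the core name after /solr/
```

The second rejected form is exactly what you get by copying the URL from the Solr admin UI, which is the usual way this mistake happens.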