Heritrix: Learning Notes and Problems Encountered (Part 4)
1.
message: Value of illegal type: 'org.archive.crawler.settings.ModuleType', 'org.archive.crawler.framework.Frontier' was expected.: Value of illegal type: 'org.archive.crawler.settings.ModuleType', 'org.archive.crawler.framework.Frontier' was expected.
Exception: No associated exception.
2.
message: On crawl: question Unable to setup crawl modules
exception: java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
Stacktrace: java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
    at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:675)
    at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:381)
    at org.archive.crawler.admin.CrawlJob.setupForCrawlStart(CrawlJob.java:853)
    at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1144)
    at org.archive.crawler.admin.CrawlJobHandler$3.run(CrawlJobHandler.java:1127)
    at java.lang.Thread.run(Thread.java:619)
3.
message: Wrong document type 'crawl-order' in 'file:/c:/heritrix/jobs/question-20141005032127804/order.xml', line: 1, column: 160
exception: No associated exception.
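To make the second error easier to read, here is a minimal sketch of how a ClassCastException of this shape arises. The types below are stand-ins written for illustration, not the real Heritrix classes, and the idea that the controller looks up a configured object and casts it to Frontier is an assumption inferred from the stack trace.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types for illustration only; NOT the real Heritrix classes.
interface Frontier {}

class ModuleType {
    final String name;
    ModuleType(String name) { this.name = name; }
}

// A module that really is a frontier: it is a ModuleType AND a Frontier.
class RealFrontier extends ModuleType implements Frontier {
    RealFrontier() { super("frontier"); }
}

class CastDemo {
    // Mimics a controller that reads the configured "frontier" entry and
    // casts it to the interface it needs (hypothetical lookup, assumed).
    static String loadFrontier(Map<String, Object> settings) {
        try {
            Frontier f = (Frontier) settings.get("frontier");
            return "ok: " + f.getClass().getSimpleName();
        } catch (ClassCastException e) {
            // The configured object is a module, but not a Frontier.
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        Map<String, Object> settings = new HashMap<>();

        // Wrong kind of module in the frontier slot: the cast fails.
        settings.put("frontier", new ModuleType("not-a-frontier"));
        System.out.println(loadFrontier(settings));

        // A module that implements Frontier: the cast succeeds.
        settings.put("frontier", new RealFrontier());
        System.out.println(loadFrontier(settings));
    }
}
```

The point of the sketch is only that the object found in the frontier slot is a module of the wrong kind, which matches the "was expected" wording of the first error.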
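Solution: these errors are usually caused by a processor chain that is not configured correctly.
For example, selecting a Writer where a Prefetcher belongs will trigger them.
Configure the modules strictly as follows: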
1.frontier
org.archive.crawler.frontier.BdbFrontier
2.scope
org.archive.crawler.scope.BroadScope
3.Prefetcher
org.archive.crawler.prefetch.Preselector
org.archive.crawler.prefetch.PreconditionEnforcer
4.Fetcher
org.archive.crawler.fetcher.FetchDNS
org.archive.crawler.fetcher.FetchHTTP
5.Extractor
org.archive.crawler.extractor.ExtractorHTTP
org.archive.crawler.extractor.ExtractorHTML
(Add more extractors here as needed, such as ExtractorSWF or ExtractorJS, but the first two are required.)
6.Writer
Either MirrorWriter or ARCWriter works; MirrorWriter is generally recommended.
7.PostProcessor
org.archive.crawler.postprocessor.CrawlStateUpdater
org.archive.crawler.postprocessor.LinksScoper
org.archive.crawler.postprocessor.FrontierScheduler
(FrontierScheduler can be replaced by your own subclass, following the method described in the book.)
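The list above can be turned into a quick sanity check before starting a job. The following is a minimal sketch, not part of Heritrix: the class `ChainCheck`, the slot names, and the idea of checking by package prefix are all assumptions made for illustration. The Writer slot is left out because the list gives no fully qualified name for it, and the writer class name used in `main` is likewise assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Heritrix API): flags classes that sit in a
// slot whose expected package prefix they do not match.
class ChainCheck {
    // Expected package prefix per slot, taken from the list above.
    static final Map<String, String> EXPECTED = new LinkedHashMap<>();
    static {
        EXPECTED.put("frontier", "org.archive.crawler.frontier.");
        EXPECTED.put("scope", "org.archive.crawler.scope.");
        EXPECTED.put("prefetcher", "org.archive.crawler.prefetch.");
        EXPECTED.put("fetcher", "org.archive.crawler.fetcher.");
        EXPECTED.put("extractor", "org.archive.crawler.extractor.");
        EXPECTED.put("postprocessor", "org.archive.crawler.postprocessor.");
    }

    // Returns one "slot: className" entry for every misplaced class.
    static List<String> misplaced(Map<String, List<String>> config) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, List<String>> slot : config.entrySet()) {
            String prefix = EXPECTED.get(slot.getKey());
            if (prefix == null) {
                continue; // slot not covered by this sketch (e.g. writer)
            }
            for (String className : slot.getValue()) {
                if (!className.startsWith(prefix)) {
                    problems.add(slot.getKey() + ": " + className);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, List<String>> config = new LinkedHashMap<>();
        config.put("frontier",
                Arrays.asList("org.archive.crawler.frontier.BdbFrontier"));
        // The mistake described above: a writer placed in the prefetcher slot.
        // The writer class name here is an assumed example.
        config.put("prefetcher", Arrays.asList(
                "org.archive.crawler.prefetch.Preselector",
                "org.archive.crawler.writer.MirrorWriterProcessor"));
        System.out.println(misplaced(config));
    }
}
```

A prefix check is only a heuristic: a custom subclass, such as your own FrontierScheduler extension, lives in a different package and would be flagged even though it is valid, so treat the output as a list of entries to review by hand.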