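Scrapy: why use yield item instead of yield dict to pass data
In practice, yield dict works just as well as yield item, so why does the official documentation use yield item? Here is the official explanation: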
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
To define common output data format Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
Various Scrapy components use extra information provided by Items: exporters look at declared fields to figure out columns to export, serialization can be customized using Item fields metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), etc.
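Simply put, when a project has many spiders, it is easy to mistype a key when using a dict, and the wrong data is then passed along without any warning. With an Item, assigning to a field that was not declared raises a KeyError, which alerts the programmer and prevents this kind of problem.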