Scrapy crawler: scraping bank wealth-management product information (120,000+ records) and storing it in MongoDB
The goal of this Scrapy crawler is to scrape the information of all bank wealth-management products listed on the 融360 (Rong360) website and store it in MongoDB. A screenshot of the page is shown below; the full data set contains more than 120,000 records.
We will not go over how to create and run a Scrapy project again and only give the relevant code here. Readers interested in setting up and running a Scrapy project can refer to: Scrapy爬虫(4)爬取豆瓣电影Top250图片.
Modify items.py as follows. It defines the fields used to store the information of each wealth-management product, such as the product name and the issuing bank.
```python
import scrapy


class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()
```
Create the spider file bankSpider.py with the following code, which scrapes the details of each wealth-management product from the page.
```python
import scrapy
from bank.items import BankItem


class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):
        item = BankItem()
        trs = response.css('tr')[1:]

        for tr in trs:
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()
            yield item

        next_pages = response.css('a.next-page')
        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        else:
            next_page_link = next_pages[1].xpath('@href').extract_first()

        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)
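Before launching a full crawl, it can be worth sanity-checking the selectors interactively. The snippet below is a minimal sketch of such a check in the Scrapy shell; it assumes the same markup the spider above relies on (a header row followed by one `<tr>` per product, and an `a.next-page` pagination link).

```python
# Started with:  scrapy shell "https://www.rong360.com/licai-bank/list/p1"
trs = response.css('tr')[1:]                                        # skip the header row, as in the spider
print(trs[0].xpath('td[1]/a/text()').extract_first())               # product name of the first data row
print(trs[0].xpath('td[2]/p/text()').extract_first())               # issuing bank of the first data row
print(response.css('a.next-page').xpath('@href').extract_first())   # relative link to the next page
```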
To store the scraped data in MongoDB, we need to modify pipelines.py as follows:
```python
# pipeline to insert the data into mongodb
import pymongo
from scrapy.conf import settings


class BankPipeline(object):
    def __init__(self):
        # connect to the database
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        # use name and password to log in to mongodb
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
        # handles of the mongodb database and collection
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    def process_item(self, item, spider):
        postItem = dict(item)
        self.coll.insert(postItem)
        return item
```
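Note that `from scrapy.conf import settings` and PyMongo's `insert()` are legacy APIs, and `scrapy.conf` has been removed in recent Scrapy releases. If the code above fails to import on your installation, the following is a minimal sketch of an equivalent pipeline using the `from_crawler` hook and `insert_one()`; it reads the same setting names defined in settings.py below.

```python
import pymongo


class BankPipeline(object):
    def __init__(self, host, port, db, coll):
        # connect to the database and keep a handle to the target collection
        self.client = pymongo.MongoClient(host=host, port=port)
        self.coll = self.client[db][coll]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB parameters from settings.py via the crawler object
        s = crawler.settings
        return cls(s.get('MONGO_HOST'), s.getint('MONGO_PORT'),
                   s.get('MONGO_DB'), s.get('MONGO_COLL'))

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))
        return item
```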
The MongoDB parameters such as MONGO_HOST and MONGO_PORT are defined in settings.py. Modify settings.py as follows:
- ROBOTSTXT_OBEY = False
- ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
- Add the MongoDB connection parameters:
```python
MONGO_HOST = "localhost"   # host IP
MONGO_PORT = 27017         # port number
MONGO_DB = "Spider"        # database name
MONGO_COLL = "bank"        # collection name
# MONGO_USER = ""
# MONGO_PSW = ""
```
The username and password can be enabled as needed: set MONGO_USER and MONGO_PSW in settings.py and uncomment the authenticate() line in the pipeline.
Next, we can run the crawler with `scrapy crawl bank` from the project directory. The run results are shown below:
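Equivalently, the spider can be launched from a short script. This is only a sketch; the module path `bank.spiders.bankSpider` assumes the spider file sits in the project's default spiders directory.

```python
# run.py — launch the spider from a script instead of `scrapy crawl bank`
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bank.spiders.bankSpider import bankSpider  # assumed module path inside the project

process = CrawlerProcess(get_project_settings())  # picks up settings.py, including the pipeline
process.crawl(bankSpider)
process.start()
```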
The run took about 3 hours in total and collected more than 120,000 records; an impressive rate!
Finally, let's take a look at the data in MongoDB:
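A quick way to inspect what landed in the database is a few lines of PyMongo. This sketch uses the host, database, and collection names configured above; `count_documents()` requires PyMongo 3.7 or newer.

```python
import pymongo

client = pymongo.MongoClient("localhost", 27017)
coll = client["Spider"]["bank"]

print(coll.count_documents({}))    # total number of scraped products
for doc in coll.find().limit(3):   # peek at a few records
    print(doc)
```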
Perfect! That wraps up this post. Questions and discussion are welcome!