Scrapy爬虫之爬取银行理财产品信息（共12多万条）并存入MongoDB

moguibeijing

2019-06-26

本次Scrapy爬虫的目标是爬取“融360”网站上所有银行理财产品的信息，并存入MongoDB中。网页的截图如下，全部数据共12多万条。

我们不再过多介绍Scrapy的创建和运行，只给出相关的代码。关于Scrapy的创建和运行，有兴趣的读者可以参考：Scrapy爬虫（4）爬取豆瓣电影Top250图片。
修改items.py，代码如下，用来储存每个理财产品的相关信息，如产品名称，发行银行等。

import scrapy

class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()

创建爬虫文件bankSpider.py，代码如下，用来爬取网页中理财产品的具体信息。

import scrapy
from bank.items import BankItem

class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):

        item = BankItem()
        trs = response.css('tr')[1:]
        
        for tr in trs:
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()

            yield item

        next_pages = response.css('a.next-page')

        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first() 
        else:
            next_page_link = next_pages[1].xpath('@href').extract_first()
       
        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)

为了将爬取的数据储存到MongoDB中，我们需要修改pipelines.py文件，代码如下：

# pipelines to insert the data into mongodb
import pymongo
from scrapy.conf import settings

class BankPipeline(object):
    def __init__(self):
        # connect database
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])

        # using name and password to login mongodb
        # self.client.admin.authenticate(settings['MINGO_USER'], settings['MONGO_PSW'])
        
        # handle of the database and collection of mongodb
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']] 

    def process_item(self, item, spider):
        postItem = dict(item)
        self.coll.insert(postItem)
        return item

其中的MongoDB的相关参数，如MONGO_HOST, MONGO_PORT在settings.py中设置。修改settings.py如下：

ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
添加MongoDB连接参数

MONGO_HOST = "localhost"  # 主机IP
MONGO_PORT = 27017  # 端口号
MONGO_DB = "Spider"  # 库名 
MONGO_COLL = "bank"  # collection名
# MONGO_USER = ""
# MONGO_PSW = ""

其中用户名和密码可以根据需要添加。

接下来，我们就可以运行爬虫了。运行结果如下：

Scrapy爬虫之爬取银行理财产品信息（共12多万条）并存入MongoDB