[Scrapy] Basic Usage of the Scrapy Crawler Framework

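The Scrapy crawler framework is a great tool for scraping websites quickly and with little code. It is especially well suited to sites that do not separate front end from back end, where the data is rendered directly into the HTML. This post records the basic usage, taking as its example scraping the problem IDs and titles from HDU OJ (http://acm.hdu.edu.cn).

See also: https://www.jianshu.com/p/7dee0837b3d2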

Installing Scrapy

  • Install with pip

    pip install scrapy

Writing the code

  • Create the project myspider

    scrapy startproject myspider
  • Create the spider hdu for the site acm.hdu.edu.cn

    scrapy genspider hdu acm.hdu.edu.cn
  • Running the command above creates hdu.py in the spiders folder; change its code to:

    import scrapy
    
    
    class HduSpider(scrapy.Spider):
        # Spider name
        name = 'hdu'
        # Domains the spider is allowed to crawl
        allowed_domains = ['acm.hdu.edu.cn']
        # Pages the crawl starts from
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']
    
        # Parsing logic
        def parse(self, response):
            # The problem list lives in the page's second <script>; first collect the text of every script into problem_list
            problem_list = response.xpath('//script/text()').extract()
            # Take the second script (index 1) and split it on semicolons
            problems = problem_list[1].split(";")
            # Print each entry to the console; nothing is handed to a pipeline yet
            for item in problems:
                print(item)
  • In items.py, define the item class that represents a problem

    import scrapy
    
    
    class ProblemItem(scrapy.Item):
        id = scrapy.Field()
        title = scrapy.Field()
  • In pipelines.py, create an item pipeline that saves the data to hdu.json

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    
    
    class ItcastPipeline(object):
        # Writes one JSON object per line to teacher.json.
        # It is not listed in ITEM_PIPELINES, so this project does not use it.
        def __init__(self):
            self.filename = open("teacher.json", "wb+")
    
        def process_item(self, item, spider):
            jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(jsontext.encode("utf-8"))
            return item
    
        def close_spider(self, spider):
            self.filename.close()
    
    
    class HduPipeline(object):
        # Buffers every item and writes them to hdu.json as one JSON array
        def __init__(self):
            self.filename = open("hdu.json", "wb+")
            # Per-instance buffer of serialized items
            self.items = []
    
        def process_item(self, item, spider):
            # Serialize now and write at the end, so no comma trails the last item
            self.items.append(json.dumps(dict(item), ensure_ascii=False))
            return item
    
        def close_spider(self, spider):
            # Joining with ",\n" keeps the file a valid JSON array
            body = "[" + ",\n".join(self.items) + "\n]"
            self.filename.write(body.encode("utf-8"))
            self.filename.close()
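
The buffer-then-write idea behind the pipeline can be tried without running a crawl. Below is a minimal sketch under stated assumptions: the class name `JsonArrayWriter`, the temporary file path, and the two sample items are hypothetical and only mirror the three pipeline hooks (`__init__`, `process_item`, `close_spider`). It joins the serialized items with commas so that the result parses as a JSON array.

```python
import json
import os
import tempfile


class JsonArrayWriter:
    """Hypothetical stand-in that mirrors a pipeline's three hooks."""

    def __init__(self, path):
        self.path = path
        self.rows = []

    def process_item(self, item, spider=None):
        # Serialize each item as it arrives; keep non-ASCII text readable
        self.rows.append(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider=None):
        # Joining avoids a dangling comma before the closing bracket
        with open(self.path, "w", encoding="utf-8") as f:
            f.write("[" + ",\n".join(self.rows) + "\n]")


# Write two made-up items to a temporary file, then read them back
path = os.path.join(tempfile.mkdtemp(), "hdu_demo.json")
writer = JsonArrayWriter(path)
writer.process_item({"id": "1000", "title": "A + B Problem"})
writer.process_item({"id": "1001", "title": "Sum Problem"})
writer.close_spider()

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)
```

Because `json.load` succeeds on the file, the output can be consumed by any strict JSON parser. The cost of buffering is that nothing reaches disk until the crawl finishes.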
  • Register the pipeline in settings.py

    ITEM_PIPELINES = {
       'myspider.pipelines.HduPipeline': 300
    }
    # Do not obey the site's robots.txt rules
    ROBOTSTXT_OBEY = False
  • Modify hdu.py so that the items are handed to the pipeline

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from myspider.items import ProblemItem
    
    
    class HduSpider(scrapy.Spider):
        name = 'hdu'
        allowed_domains = ['acm.hdu.edu.cn']
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']
        # Captures everything between the outermost parentheses of one entry
        entry_pattern = re.compile(r'[(](.*)[)]', re.S)
    
        def parse(self, response):
            problem_list = response.xpath('//script/text()').extract()
            problems = problem_list[1].split(";")
            for item in problems:
                matches = self.entry_pattern.findall(item)
                # Skip fragments with no parenthesized entry (for example, an empty trailing fragment)
                if not matches:
                    continue
                detail = matches[0].split(",")
                # Create a fresh item per problem instead of reusing one mutable object
                hdu = ProblemItem()
                hdu['id'] = detail[1]
                hdu['title'] = detail[3]
                yield hdu
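
The split-and-match steps in `parse` can be checked offline on a plain string. The sketch below makes one loud assumption: the `sample` text is hypothetical. Only the positions of the ID (index 1) and the title (index 3) are taken from the spider; the function name `p` and the other fields are made up for illustration.

```python
import re

# Hypothetical script text; only the positions of id (index 1) and
# title (index 3) come from the spider, everything else is made up.
sample = 'p(0,1000,-1,"A + B Problem",10,20);p(0,1001,-1,"Sum Problem",30,40);'

entry_pattern = re.compile(r'[(](.*)[)]', re.S)

parsed = []
for fragment in sample.split(";"):
    matches = entry_pattern.findall(fragment)
    if not matches:
        # The text after the last ";" is empty and holds no entry
        continue
    detail = matches[0].split(",")
    parsed.append({"id": detail[1], "title": detail[3]})

print(parsed)
```

In this sketch the title keeps its surrounding double quotes, which would explain why the titles in hdu.json appear with escaped quotes; `detail[3].strip('"')` would drop them. Splitting on commas also means a title that itself contains a comma would be cut short.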
  • Run the command; here the log is written to all.log

    scrapy crawl hdu -s LOG_FILE=all.log
  • hdu.json now contains the problem titles scraped from the first page
    ```
    {"id": "1000", "title": "\"A + B Problem\""}
    {"id": "1001", "title": "\"Sum Problem\""}
    {"id": "1002", "title": "\"A + B Problem II\""}
    {"id": "1003", "title": "\"Max Sum\""}
    {"id": "1004", "title": "\"Let the Balloon Rise\""}
    {"id": "1005", "title": "\"Number Sequence\""}

    ...

    {"id": "1099", "title": "\"Lottery \""}

    ```
  • Modify hdu.py once more so that it crawls every valid page

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from myspider.items import ProblemItem
    
    
    class HduSpider(scrapy.Spider):
        name = 'hdu'
        allowed_domains = ['acm.hdu.edu.cn']
        # download_delay = 1
        base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php']
        # Captures everything between the outermost parentheses of one entry
        entry_pattern = re.compile(r'[(](.*)[)]', re.S)
    
        # Spider entry point
        def parse(self, response):
            # First collect every valid page number
            real_pages = response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
            for page in real_pages:
                url = self.base_url % page
                yield scrapy.Request(url, callback=self.parse_problem)
    
        def parse_problem(self, response):
            # Extract the useful parts from the script text
            problem_list = response.xpath('//script/text()').extract()
            problems = problem_list[1].split(";")
            for item in problems:
                matches = self.entry_pattern.findall(item)
                # HDU has invalid empty entries; skip them
                if not matches:
                    continue
                detail = matches[0].split(",")
                # Create a fresh item per problem instead of reusing one mutable object
                hdu = ProblemItem()
                hdu['id'] = detail[1]
                hdu['title'] = detail[3]
                yield hdu
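
How `parse` turns page numbers into request URLs can be sketched on its own. The values in `real_pages` below are hypothetical stand-ins for the link texts the XPath returns. The digit check is an optional safeguard added for this sketch and is not part of the spider.

```python
base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'

# Hypothetical link texts, standing in for what the XPath returns
real_pages = ['1', '2', '3']

# Keep only numeric labels and substitute each one into the URL template
urls = [base_url % page.strip() for page in real_pages if page.strip().isdigit()]

for url in urls:
    print(url)
```

In the spider, each of these URLs is wrapped in a `scrapy.Request` whose callback is `parse_problem`, so the listing page only discovers pages and the callback does the extraction.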
  • Run the command again; the log is again written to all.log

    scrapy crawl hdu -s LOG_FILE=all.log
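  • The titles of every problem on every page are now scraped. Note in particular that the scraped items are not in order: Scrapy issues the page requests concurrently and handles each response as it arrives, so the final order depends on factors such as network latency and request scheduling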

    [{"id": "4400", "title": "\"Mines\""},
    {"id": "4401", "title": "\"Battery\""},
    {"id": "4402", "title": "\"Magic Board\""},
    {"id": "4403", "title": "\"A very hard Aoshu problem\""},
    {"id": "4404", "title": "\"Worms\""},
    {"id": "4405", "title": "\"Aeroplane chess\""},
    {"id": "4406", "title": "\"GPA\""},
    {"id": "4407", "title": "\"Sum\""},
    
    ...
    
    {"id": "1099", "title": "\"Lottery \""}
    ]
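
If ordered output is needed, one option is to sort the items by numeric ID after the crawl. A minimal sketch, assuming the rows have already been loaded into Python dictionaries; the three rows are copied from the sample output above, and the variable names are hypothetical.

```python
import json

# A few rows in crawl order, copied from the sample output above
rows = [
    {"id": "4400", "title": '"Mines"'},
    {"id": "4401", "title": '"Battery"'},
    {"id": "1099", "title": '"Lottery "'},
]

# IDs are stored as strings, so convert to int for a numeric sort
ordered = sorted(rows, key=lambda row: int(row["id"]))

print(json.dumps(ordered, ensure_ascii=False, indent=2))
```

With a real crawl, `rows` would come from `json.load` on hdu.json, provided the file is valid JSON. The same `sorted` call could also run inside the pipeline's `close_spider` before the items are written.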
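  • Everything above only saves the data to a text file. Storing it in a database will come later and is skipped in this tutorial for now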