Scraping the Maoyan Movie Ranking with Scrapy

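Anyone who writes crawlers will sooner or later meet the Scrapy framework. For a small project the requests module is enough to get results, but once the volume of data to crawl grows, a framework becomes the practical choice.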

As a warm-up, let's use Scrapy to write a program that scrapes Maoyan movies. Environment setup and Scrapy installation are skipped here.

The first step is to create the crawler project and the spider file from the terminal:

# Create the crawler project
scrapy startproject Maoyan
cd Maoyan
# Create the spider file
scrapy genspider maoyan maoyan.com

Next, define the data structure to scrape in the generated items.py file:

import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()  # movie title
    star = scrapy.Field()  # leading actors
    time = scrapy.Field()  # release date
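Scrapy 2.2 and later also accept dataclass objects as items, which can be easier to inspect outside a crawl. The following is only a minimal sketch of the same three fields using the standard library; the class name MovieRecord and the sample values are hypothetical and not part of the original project.

```python
from dataclasses import dataclass, asdict


@dataclass
class MovieRecord:  # hypothetical name, not part of the original project
    name: str = ""  # movie title
    star: str = ""  # leading actors
    time: str = ""  # release date


# Sample values are made up purely for illustration.
record = MovieRecord(name="Example Movie", star="Actor A", time="2000-01-01")
print(asdict(record))
```

Whether this is preferable to scrapy.Item depends on the project; the rest of this article keeps using MaoyanItem.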

Then open maoyan.py and write the spider. Remember to import the MaoyanItem class from items.py and instantiate it.

import scrapy
from ..items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    # Must match the name passed to "scrapy crawl"
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    # The start_urls variable is removed

    # Override start_requests() to generate the ten list pages
    def start_requests(self):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(offset)
            yield scrapy.Request(url=url, callback=self.parse)

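    def parse(self, response):
        # Base XPath: one <dd> node per movie
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # Iterate over the movies
        for dd in dd_list:
            # Instantiate MaoyanItem (defined in items.py) once per movie,
            # so each yielded item is a separate object
            item = MaoyanItem()

            # Fill the fields declared in items.py; default='' avoids
            # calling strip() on None when a node is missing
            item['name'] = dd.xpath('./a/@title').get(default='').strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get(default='').strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get(default='').strip()

            # Hand the item over to the pipelines
            yield item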
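The pagination and field extraction above can be tried without running a crawl. The sketch below uses only the standard library: xml.etree.ElementTree stands in for Scrapy's selectors (it supports only a subset of XPath), and SAMPLE_HTML is a hypothetical, simplified fragment shaped to match the XPath expressions in the spider, not a copy of the real page. The helper names page_urls, clean, and extract_movies are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

BASE_URL = "https://maoyan.com/board/4?offset={}"

# Hypothetical, simplified markup modelled on the spider's XPath
# expressions; the real page is more complex.
SAMPLE_HTML = """
<div>
  <dl class="board-wrapper">
    <dd>
      <a title="Movie One" href="/films/1"></a>
      <p class="star">
        Starring: Actor A, Actor B
      </p>
      <p class="releasetime">Release: 1993-01-01</p>
    </dd>
    <dd>
      <a title="Movie Two" href="/films/2"></a>
      <p class="star">Starring: Actor C</p>
    </dd>
  </dl>
</div>
"""


def page_urls():
    """Return the ten list-page URLs, offsets 0, 10, ..., 90."""
    return [BASE_URL.format(offset) for offset in range(0, 91, 10)]


def clean(element):
    """Return stripped text, or an empty string when the node is missing."""
    if element is None or element.text is None:
        return ""
    return element.text.strip()


def extract_movies(markup):
    """Build one fresh dict per <dd> entry, mirroring parse()."""
    root = ET.fromstring(markup.strip())
    movies = []
    for dd in root.findall(".//dl[@class='board-wrapper']/dd"):
        link = dd.find("./a")
        movies.append({
            "name": link.get("title", "").strip() if link is not None else "",
            "star": clean(dd.find(".//p[@class='star']")),
            "time": clean(dd.find(".//p[@class='releasetime']")),
        })
    return movies


urls = page_urls()
movies = extract_movies(SAMPLE_HTML)
print(len(urls), urls[0], urls[-1])
for movie in movies:
    print(movie)
```

The second sample entry deliberately lacks a release-time node, to show why a missing node should produce an empty string instead of raising an error.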

Define the pipeline file pipelines.py to persist the data:

import pymysql
from .settings import *

class MaoyanPipeline(object):
    # item: the item data yielded by the spider file maoyan.py
    def process_item(self, item, spider):
        print(item['name'], item['time'], item['star'])

        return item

# Custom pipeline: MySQL database
class MaoyanMysqlPipeline(object):
    # Called once when the spider starts
    def open_spider(self, spider):
        print('open_spider called')
        # Typically used to open the database connection
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHAR
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into filmtab values(%s,%s,%s)'
        # execute() takes the query parameters as a sequence (list or tuple)
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()

        return item

    # Called once when the spider finishes
    def close_spider(self, spider):
        print('close_spider called')
        # Typically used to close the database connection
        self.cursor.close()
        self.db.close()
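The open_spider / process_item / close_spider lifecycle can be exercised without a MySQL server. The sketch below swaps pymysql for the standard-library sqlite3 module with an in-memory database, so it is a stand-in and not the pipeline used in this project. The class name SqlitePipelineSketch, the three-text-column schema for filmtab, and the sample item are all assumptions; the article itself never shows the table definition.

```python
import sqlite3


class SqlitePipelineSketch:
    """Hypothetical stand-in for MaoyanMysqlPipeline, backed by sqlite3."""

    def open_spider(self, spider):
        # Open the connection once, when the crawl starts.
        self.db = sqlite3.connect(":memory:")
        self.cursor = self.db.cursor()
        # Assumed schema: three text columns in the order name, star, time.
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS filmtab (name TEXT, star TEXT, time TEXT)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses ? placeholders where pymysql uses %s.
        self.cursor.execute(
            "INSERT INTO filmtab VALUES (?, ?, ?)",
            (item["name"], item["star"], item["time"]),
        )
        self.db.commit()
        return item

    def close_spider(self, spider):
        # Release the cursor and connection when the crawl ends.
        self.cursor.close()
        self.db.close()


# Drive the pipeline by hand with one made-up item.
pipeline = SqlitePipelineSketch()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"name": "Movie One", "star": "Actor A", "time": "1993-01-01"}, spider=None
)
rows = pipeline.cursor.execute("SELECT name, star, time FROM filmtab").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

In both drivers the values are passed separately from the SQL string, so the driver handles quoting and the item text is never concatenated into the query.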

Next, modify the configuration file settings.py:

USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
ITEM_PIPELINES = {
   'Maoyan.pipelines.MaoyanPipeline': 300,
   'Maoyan.pipelines.MaoyanMysqlPipeline': 200,
}
# MySQL connection settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHAR = 'utf8'
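The integers in ITEM_PIPELINES set the order in which an item passes through the pipelines: lower values run first. A small sketch, using plain Python and the same dictionary as above, shows the resulting order; the variable name run_order is introduced here only for illustration.

```python
ITEM_PIPELINES = {
    "Maoyan.pipelines.MaoyanPipeline": 300,
    "Maoyan.pipelines.MaoyanMysqlPipeline": 200,
}

# Sort the pipeline paths by their priority value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these settings each item is written to MySQL (200) before it is printed by MaoyanPipeline (300).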

Finally, create a run.py file in the project root (next to scrapy.cfg), and the spider is ready to run:

from scrapy import cmdline
cmdline.execute('scrapy crawl maoyan'.split())
