Scraping Douban Movies with Python's Scrapy Framework and Visualizing the Results

heyboz

2019-04-25


1. Introduction to the Scrapy framework

[Image: scrapy]

The main components to know are the spiders, the engine, the scheduler, the downloader, and the item pipeline.

Common Scrapy commands are as follows:

[Image: common Scrapy commands]

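In a Scrapy project you add the spider file yourself; the framework generates the items, pipelines, and settings files. That is the whole layout.

items.py declares the names of the fields to scrape. pipelines.py handles the data flow, for example writing to a file or loading into a database. The spider file itself defines the allowed domain, the XPath for each field, and how the values found on each page are filled into those fields.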

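import scrapy


class DoubanItem(scrapy.Item):
    # One Field per attribute scraped from the Top 250 list.
    movieRank = scrapy.Field()      # position in the ranking
    movieName = scrapy.Field()      # title
    Director = scrapy.Field()       # director line
    movieDesc = scrapy.Field()      # one-line quote, may be empty
    movieRate = scrapy.Field()      # rating
    peopleCount = scrapy.Field()    # number of ratings
    movieDate = scrapy.Field()      # release year
    movieCountry = scrapy.Field()   # country or region
    movieCategory = scrapy.Field()  # genre(s)
    moviePost = scrapy.Field()      # poster image URL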

import json


class DoubanPipeline(object):

    def __init__(self):
        # Open the output file once for the whole crawl.
        self.f = open("douban.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line, so pd.read_json(..., lines=True) can load it later.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
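The analysis later loads the file with `lines=True`, which expects JSON Lines: one object per line, each terminated by a newline. A minimal standard-library sketch of that format follows; the sample records, the helper names, and the temporary file are assumptions for illustration, not part of the original project.

```python
import json
import os
import tempfile

# Hypothetical sample records; real items come from the spider.
sample_items = [
    {"movieRank": "1", "movieName": "肖申克的救赎"},
    {"movieRank": "2", "movieName": "霸王别姬"},
]


def write_json_lines(path, items):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse each non-empty line back into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Use a temporary file so the sketch never touches the real douban.json.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "sample.json")
write_json_lines(tmp_path, sample_items)
round_tripped = read_json_lines(tmp_path)
print(round_tripped == sample_items)
```

Without the trailing newline all records end up on a single line, and a line-by-line reader can no longer tell where one object stops and the next begins.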

For working out the XPath expressions, I recommend the Chrome extension XPath Helper.
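To get a feel for relative XPath expressions without running a crawl, here is a small sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()`, for example), not Scrapy's selectors, and the HTML fragment is a simplified, hypothetical stand-in for one movie entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one movie entry; the real page has more markup.
SAMPLE = """
<div class="item">
  <div class="pic"><em>1</em></div>
  <div class="info">
    <div class="hd"><a href="#"><span class="title">Sample Title</span></a></div>
    <div class="bd"><div class="star"><span class="rating_num">9.7</span></div></div>
  </div>
</div>
"""

node = ET.fromstring(SAMPLE)

# Relative paths start from the current node, like the "./..." paths in a spider.
rank = node.find("./div[@class='pic']/em").text
# div[1] selects the first <div> child of the "info" block.
title = node.find("./div[@class='info']/div[1]/a/span").text
rate = node.find(
    "./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']"
).text

print(rank, title, rate)
```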

import scrapy

from douban.items import DoubanItem  # assumes the project package is named "douban"


class DoubanSpider(scrapy.Spider):
    name = "douban"  # assumed spider name; the original does not show it
    allowed_domains = ['douban.com']
    baseURL = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Each movie on the page is one <div class="item">.
        node_list = response.xpath("//div[@class='item']")
        for node in node_list:
            item = DoubanItem()
            item['movieName'] = node.xpath("./div[@class='info']/div[1]/a/span/text()").extract()[0]
            item['movieRank'] = node.xpath("./div[@class='pic']/em/text()").extract()[0]
            item['Director'] = node.xpath("./div[@class='info']/div[@class='bd']/p[1]/text()[1]").extract()[0]

            # Not every movie has a one-line quote.
            desc = node.xpath("./div[@class='info']/div[@class='bd']/p[@class='quote']/span[@class='inq']/text()").extract()
            item['movieDesc'] = desc[0] if desc else ""

            item['movieRate'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item['peopleCount'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[4]/text()").extract()[0]

            # The second text node reads "year / country / category",
            # separated by non-breaking spaces around the slashes.
            info = node.xpath("./div[2]/div[2]/p[1]/text()[2]").extract()[0].lstrip().split('\xa0/\xa0')
            item['movieDate'] = info[0]
            item['movieCountry'] = info[1]
            item['movieCategory'] = info[2]

            item['moviePost'] = node.xpath("./div[@class='pic']/a/img/@src").extract()[0]
            yield item

        # 250 movies at 25 per page: the last page starts at offset 225.
        if self.offset < 225:
            self.offset += 25
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
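Two details of the spider are easy to check in isolation: splitting the "year / country / category" text on its non-breaking-space separators, and the page offsets. A Top 250 list at 25 films per page has ten pages, so the last offset is 225. The sketch below is a standalone illustration; `split_info`, `page_urls`, and the sample string are hypothetical names and data, not part of the original spider.

```python
BASE_URL = "https://movie.douban.com/top250?start="


def split_info(raw):
    """Split 'year / country / category' on the non-breaking-space separators."""
    parts = raw.lstrip().split("\xa0/\xa0")
    # Pad so a short or malformed line yields empty strings instead of IndexError.
    parts += [""] * (3 - len(parts))
    date, country, category = (p.strip() for p in parts[:3])
    return date, country, category


def page_urls(total=250, page_size=25):
    """Yield the URL of every listing page: start=0, 25, ..., 225."""
    for offset in range(0, total, page_size):
        yield BASE_URL + str(offset)


# Hypothetical sample of the text node the spider reads.
sample = "\n                1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n            "
print(split_info(sample))
print(list(page_urls())[-1])
```

Padding the split result is a design choice for robustness: one entry with an unusual info line then produces empty fields rather than aborting the whole page.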

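At this point the spider is essentially working and produces the JSON file we need.

Next comes the visualization.

First, let's take stock of the data we have.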

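import pandas as pd

# The file holds one JSON object per line, hence lines=True.
douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# Column names, non-null counts, and dtypes.
douban.info()

[Image: output of douban.info()]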

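From this we can analyze the country of origin, the release year, the genre, and the influence of individual directors within the Top 250.

For a first look, the value_counts() function is enough.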

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# Group by the first character of the country field, a rough proxy for the country.
df_Country = douban['movieCountry'].str.strip().str[0]

print(df_Country.value_counts())

[Image: value counts by country]

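American films make up about half the list (122 of 250), which reflects the strength of the Hollywood film industry. Japanese and Hong Kong films also hold an important place with Chinese audiences. Surprisingly, mainland Chinese films are few; Douban's film fans are evidently quite picky about domestic films.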
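Grouping by the first character, as the code above does, is a quick approximation. A slightly more careful variant counts the first country listed in each entry. The sketch below uses `collections.Counter` on made-up sample values so it runs without the scraped file, and it assumes that co-productions list their countries separated by spaces.

```python
from collections import Counter

# Hypothetical sample of movieCountry values; co-productions list several countries.
sample_countries = ["美国", "美国 英国", "中国大陆 香港", "日本", "香港", "美国"]


def primary_country(value):
    """Return the first listed country, or '' for an empty value."""
    tokens = value.split()
    return tokens[0] if tokens else ""


country_counts = Counter(primary_country(v) for v in sample_countries)
for country, n in country_counts.most_common():
    print(country, n)
```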

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# The third character of a year such as "1994" is its decade digit ("9" for the 1990s).
df_Date = douban['movieDate'].astype(str).str.strip().str[2]

print(df_Date.value_counts())

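Films from 2000 onward make up more than 70% of the list. Considering that the 2010s are only nine years in and that ratings lag behind releases, newer films on the whole seem more likely to win over the audience. This may be related to how the Douban Top 250 is selected: a film must have at least a certain number of ratings.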
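The decade digit alone cannot tell the 1910s from the 2010s, since both have "1" as the third character. A sketch that builds full decade labels and computes the share of films from 2000 onward is shown below; the sample years are made up, and `decade_label` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical sample of movieDate values.
sample_years = ["1994", "1993", "2001", "2010", "2008", "1972"]


def decade_label(year_text):
    """Map '1994' to '1990s'; return '' if the text has no leading 4-digit year."""
    digits = year_text.strip()[:4]
    if len(digits) < 4 or not digits.isdigit():
        return ""
    return f"{int(digits) // 10 * 10}s"


decade_counts = Counter(decade_label(y) for y in sample_years)

# Labels such as '1990s' and '2000s' sort correctly as plain strings.
since_2000 = sum(n for d, n in decade_counts.items() if d >= "2000s")
share_since_2000 = since_2000 / len(sample_years)

print(decade_counts)
print(round(share_since_2000, 2))
```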

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# Group by the first character of the category field, a rough proxy for the leading genre.
df_Cate = douban['movieCategory'].str.strip().str[0]

print(df_Cate.value_counts())

[Image: value counts by category]

Drama films, with the ups and downs of their plots, win audience approval more easily.
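Since one film can carry several genres, another way to look at the categories is to count every genre tag rather than only the leading one. The following is a sketch under the assumption that genres are separated by spaces; the sample values are made up.

```python
from collections import Counter

# Hypothetical sample of movieCategory values; one film can carry several genres.
sample_categories = ["犯罪 剧情", "剧情 爱情 同性", "剧情 喜剧", "动画 奇幻", "剧情"]

# Count each genre tag once per film it appears on.
genre_counts = Counter(
    genre for value in sample_categories for genre in value.split()
)

# Share of films tagged with each genre; shares can sum to more than 1.
genre_share = {g: n / len(sample_categories) for g, n in genre_counts.items()}

for genre, n in genre_counts.most_common(3):
    print(genre, n, f"{genre_share[genre]:.0%}")
```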


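Below are a few of the resulting charts.

[Image: visualization charts]

I am not very practiced at plotting in Python, so they look a bit rough. In practice I would recommend a charting library such as ECharts, or Excel or BI software, to produce the charts; it is more convenient and the results look better.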
