Scraping Douban Movie Reviews with Python and Generating a Word Cloud from Keywords

sunnyhappy0

2020-04-20


Background:

Python version: 3.7.4

IDE: PyCharm

Operating system: Windows (64-bit)

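Step 1: Obtain the login state

Scraping Douban comments requires a logged-in user, so the login cookies have to be obtained first. Log in to Douban in a browser (IE gathers all cookies into one place, which makes the value easy to copy; with other browsers you have to assemble the cookies yourself), press F12, and copy the cookie and user-agent values from the request headers. Stay logged in; do not log out.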
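If the browser shows the cookies as a single header string, it can help to split that string into name/value pairs before use. The following is a minimal sketch using only the standard library; the helper name parse_cookie_header and the sample values are hypothetical. The resulting dict could be passed to requests through its cookies= argument instead of placing the raw string in the headers.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        # Skip empty fragments and anything that is not a name=value pair.
        if not pair or "=" not in pair:
            continue
        # Split on the first '=' only, since values may contain '=' too.
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical sample values, not a real session.
sample = "bid=abc123; ck=XYZ; token=a=b"
print(parse_cookie_header(sample))
```

Note that a dict keeps only one value per cookie name, so if the browser lists the same name twice the later one wins.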

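Step 2: Analyze the HTML

Take one page of reviews for House M.D. as a simple case. Inspecting the HTML in which the reviews are rendered shows that the content we need is in the second p tag under a td under a tr: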


Step 3: Write the code

HTML parsing is done with BeautifulSoup. A minimal Python version follows (the output is UTF-8, so specifying the encoding explicitly is recommended):

#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

import requests
from bs4 import BeautifulSoup

# Force UTF-8 console output so Chinese text prints correctly on Windows.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url = 'https://movie.douban.com/subject/1442129/collections?start=20'
# Cookie and User-Agent copied from the logged-in browser session (Step 1).
headers = {
    'cookie': 'll=118172; bid=nO_yhRGdS8c; __utma=30149280.744941980.1587025849.1587025849.1587025849.1; __utmb=30149280.7.10.1587025849; __utmz=30149280.1587025849.1.1.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmt=1; push_noty_num=0; push_doumail_num=0; __utmv=30149280.18122; douban-profile-remind=1; __utmc=30149280; dbcl2=181229630:peNlRIftZSU; ck=0DBS; _vwo_uuid_v2=D6F0A378B72943607FFB8D0DE9AA9E4F2|e4b22c328b795c724132d4d5a5551615; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1587025959%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E9%2587%258D%25E7%2594%259F%22%5D; _pk_id.100001.4cf6=55b0d18436426829.1587025959.1.1587025959.1587025959.; _pk_ses.100001.4cf6=*; __utma=223695111.917770948.1587025959.1587025959.1587025959.1; __utmb=223695111.0.10.1587025959; __utmc=223695111; __utmz=223695111.1587025959.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; __yadk_uid=wBD152Qkg8CojaIRAPIB7nXOYiwGgYAj',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

# Fetch the page and parse it with the built-in HTML parser.
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
# Select the second <p> inside each <td> directly under a <tr>.
print(soup.select("tr > td > p:nth-of-type(2)"))

The scraped reviews look like this (a rule can be added to strip the p tags):

[<p>看之前：不就是个医疗剧能拍出什么花？？
看之后：为什么一个医疗剧可以拍出这么多花？？</p>, <p>高中时期的下饭剧</p>]
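One possible rule for stripping the p tags, followed by saving the plain text for the word cloud step, is sketched below. It uses only the standard library's html.parser so that it runs without extra packages; with BeautifulSoup, calling get_text() on each selected tag would be the more direct route. The names ParagraphText, extract_paragraphs, and append_reviews are hypothetical, and writing one review per line is an assumption about the file layout.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p>...</p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self._parts = []

    def handle_data(self, data):
        # Text outside <p> elements (such as the ", " separators) is ignored.
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.paragraphs.append("".join(self._parts).strip())
            self._inside = False


def extract_paragraphs(html):
    """Return the text of every <p> element in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def append_reviews(path, reviews):
    """Append one review per line to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as handle:
        for review in reviews:
            handle.write(review + "\n")


print(extract_paragraphs("<p>first review</p>, <p>second review</p>"))
```

The list returned by extract_paragraphs can then be handed to append_reviews together with the path of the text file that the word cloud script reads.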

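Step 4: Turn the scraped reviews into a word cloud

The main modules used are jieba, wordcloud, and image (the code imports Image from PIL), all of which can be installed with pip. The word cloud code is as follows:

Location of the scraped review data: F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt;

Location of a still from House M.D. found online: F:\\python\\install_3_7_4\\txt\\haosiyisheng.png

Location of the font used by the word cloud: C:/Windows/Fonts/msyh.ttc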

#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Force UTF-8 console output so Chinese text prints correctly on Windows.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def get_word_cloud():
    path_txt = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt"
    path_img = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.png"

    # Read the scraped reviews.
    with open(path_txt, 'r', encoding='UTF-8') as f:
        text = f.read()

    # The still image serves as the mask that shapes the cloud.
    background_image = np.array(Image.open(path_img))

    # jieba segments the Chinese text; WordCloud expects space-separated words.
    cut_text = " ".join(jieba.cut(text))

    wordcloud = WordCloud(
        font_path="C:/Windows/Fonts/msyh.ttc",  # font with Chinese glyphs
        background_color="white",
        mask=background_image
    ).generate(cut_text)

    # Show the result, then save it to disk.
    fig, ax = plt.subplots()
    ax.imshow(wordcloud)
    ax.axis("off")
    plt.show()
    wordcloud.to_file(r"haosiyisheng_result.png")


if __name__ == '__main__':
    get_word_cloud()
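Since the goal is a cloud built from keywords, one possible refinement (not part of the original script) is to count the segmented words and drop uninformative ones before drawing. The sketch below uses only the standard library and takes tokens such as those produced by jieba.cut; the helper name keyword_frequencies, the tiny stop-word set, and the length threshold are all hypothetical choices.

```python
from collections import Counter

# Hypothetical stop words; a real list would be much longer.
STOP_WORDS = {"的", "了", "是", "就是", "什么"}


def keyword_frequencies(tokens, min_length=2, top_n=100):
    """Count tokens, skipping stop words and tokens shorter than min_length."""
    cleaned = (token.strip() for token in tokens)
    counts = Counter(
        token
        for token in cleaned
        if len(token) >= min_length and token not in STOP_WORDS
    )
    # Keep only the top_n most frequent keywords.
    return dict(counts.most_common(top_n))


# Tokens as they might come out of jieba.cut for a couple of short reviews.
tokens = ["医疗剧", "的", "医疗剧", "下饭", "剧", "高中"]
print(keyword_frequencies(tokens))
```

The resulting dict could be passed to WordCloud's generate_from_frequencies method instead of calling generate on the joined text.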

[Figure: the final word cloud]

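Step 5: Errors encountered along the way and how to fix them

1. Fixing the error: ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

This error often appears when running pip install <module>:

ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

You can try switching the pip index. Mirrors in China:

http://pypi.douban.com/ Douban
http://pypi.hustunique.com/ Huazhong University of Science and Technology
http://pypi.sdutlinux.org/ Shandong University of Technology
http://pypi.mirrors.ustc.edu.cn/ University of Science and Technology of China

The simplest approach is to specify the index directly on the command line, for example the Douban mirror:

pip install -i https://pypi.douban.com/simple <package to install>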
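When several packages have to be installed from a script, the same command can be assembled and run from Python. This is a minimal sketch, assuming the Douban mirror shown above; the helper names pip_install_command and install are hypothetical. Running pip through sys.executable keeps the installation tied to the interpreter that runs the script.

```python
import subprocess
import sys

DOUBAN_INDEX = "https://pypi.douban.com/simple"


def pip_install_command(package, index_url=DOUBAN_INDEX):
    """Build the pip command line that installs a package from a given index."""
    return [sys.executable, "-m", "pip", "install", "-i", index_url, package]


def install(package, index_url=DOUBAN_INDEX):
    """Run pip and raise CalledProcessError if the installation fails."""
    subprocess.run(pip_install_command(package, index_url), check=True)


if __name__ == "__main__":
    # Only print the command here; call install("jieba") to actually run it.
    print(" ".join(pip_install_command("jieba")))
```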

2. Installing wordcloud

Installing wordcloud hit a snag. The approach that worked is as follows:

First, open this link: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

Download the wordcloud build that matches your Python version. My local Python is 3.7 (tag cp37), so I downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl.
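To decide which file to pick, it can help to check both the interpreter version and whether the interpreter is a 32-bit or 64-bit build, since a 64-bit Windows can host a 32-bit Python. The sketch below uses only the standard library; the helper names python_tag and pointer_bits are hypothetical. In wheel file names, a 32-bit Windows build carries the platform tag win32 and a 64-bit build carries win_amd64.

```python
import struct
import sys


def python_tag(version_info=sys.version_info):
    """Return the CPython tag used in wheel names, e.g. 'cp37' for 3.7."""
    return "cp{}{}".format(version_info[0], version_info[1])


def pointer_bits():
    """Return 32 or 64 depending on how the running interpreter was built."""
    return struct.calcsize("P") * 8


print(python_tag(), pointer_bits())
```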

Install the wheel module, because .whl files are installed through it:

pip install wheel

Copy the downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl file into the /Scripts directory under the Python installation directory, then run the following from that location:

$ pip install wordcloud-1.6.0-cp37-cp37m-win32.whl
Processing f:\python\install_3_7_4\scripts\wordcloud-1.6.0-cp37-cp37m-win32.whl
Requirement already satisfied: pillow in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (7.1.1)
Requirement already satisfied: numpy>=1.6.1 in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (1.18.2)
Requirement already satisfied: matplotlib in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (3.2.1)
Requirement already satisfied: kiwisolver>=1.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.4.7)
Requirement already satisfied: cycler>=0.10 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.8.1)
Requirement already satisfied: six in f:\python\install_3_7_4\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud==1.6.0) (1.14.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.6.0

3. Use pip list to view the installed modules

$ pip list
Package         Version
--------------- ----------
asgiref         3.2.7
beautifulsoup4  4.9.0
bs4             0.0.1
certifi         2020.4.5.1
chardet         3.0.4
cycler          0.10.0
Django          3.0.5
idna            2.9
image           1.5.30
jieba           0.39
kiwisolver      1.2.0
matplotlib      3.2.1
numpy           1.18.2
Pillow          7.1.1
pip             19.2.3
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2019.3
requests        2.23.0
setuptools      40.8.0
six             1.14.0
soupsieve       2.0
sqlparse        0.3.1
urllib3         1.25.8
wheel           0.34.2
wordcloud       1.
