Scraping Douban movie reviews with Python and generating a word cloud from the keywords
Background:
Python version: 3.7.4
IDE: PyCharm
Operating system: Windows, 64-bit
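Step 1: Obtain the login state
Scraping Douban comments requires a logged-in user, so the login cookies have to be obtained first. Open a browser and log in to Douban (IE shows all cookies combined into a single value, which makes them easy to copy; in other browsers you have to assemble the cookies yourself). Then press F12 and copy the cookie and user-agent values from the request headers. Stay logged in; do not log out.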
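If the cookies have to be assembled or inspected by hand, it can help to turn the copied header string into a dictionary. The following is a minimal sketch, not part of the original script: the helper name parse_cookie_header and the cookie values are placeholders I made up for illustration.

```python
def parse_cookie_header(raw_cookie):
    """Turn 'name=value; name2=value2' into a dict of cookie names to values."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because a value may contain '=' itself.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Placeholder values, not real credentials.
raw = "bid=EXAMPLE_BID; dbcl2=EXAMPLE_USER:EXAMPLE_TOKEN; ck=ABCD"
cookies = parse_cookie_header(raw)
print(sorted(cookies))
```

The resulting dictionary could be passed to requests.get through its cookies argument instead of a raw cookie header. If a name appears twice in the copied string, this sketch keeps only the last value.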
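Step 2: Analyze the HTML
As a simple example, fetch one page of reviews for House M.D. (《豪斯医生》). Inspecting how the review HTML is laid out shows that the text we need sits in the second p tag inside a td inside a tr.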
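To see what that selector picks out, here is a small sketch run against made-up markup. The SAMPLE_HTML string only mimics the tr / td / p layout described above; it is an assumption for illustration and is not copied from the Douban page.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the layout described above; not real Douban HTML.
SAMPLE_HTML = """
<table>
  <tr><td><p>user_a 2020-04-16</p><p>First sample review.</p></td></tr>
  <tr><td><p>user_b 2020-04-15</p><p>Second sample review.</p></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# p:nth-of-type(2) keeps only the second <p> inside each <td>.
review_tags = soup.select("tr > td > p:nth-of-type(2)")
print(review_tags)
```

The first p in each cell is skipped, so only the review paragraphs are returned.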
Step 3: Write the code
BeautifulSoup is used to parse the HTML. A minimal Python version follows (the output is UTF-8, so it is advisable to set the character encoding explicitly):
#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

import requests
from bs4 import BeautifulSoup

# Force UTF-8 output so the Chinese text prints correctly on the console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url = 'https://movie.douban.com/subject/1442129/collections?start=20'
headers = {
    # Cookie and User-Agent copied from the logged-in browser session (step 1).
    'cookie': 'll=118172; bid=nO_yhRGdS8c; __utma=30149280.744941980.1587025849.1587025849.1587025849.1; __utmb=30149280.7.10.1587025849; __utmz=30149280.1587025849.1.1.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmt=1; push_noty_num=0; push_doumail_num=0; __utmv=30149280.18122; douban-profile-remind=1; __utmc=30149280; dbcl2=181229630:peNlRIftZSU; ck=0DBS; _vwo_uuid_v2=D6F0A378B72943607FFB8D0DE9AA9E4F2|e4b22c328b795c724132d4d5a5551615; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1587025959%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E9%2587%258D%25E7%2594%259F%22%5D; _pk_id.100001.4cf6=55b0d18436426829.1587025959.1.1587025959.1587025959.; _pk_ses.100001.4cf6=*; __utma=223695111.917770948.1587025959.1587025959.1587025959.1; __utmb=223695111.0.10.1587025959; __utmc=223695111; __utmz=223695111.1587025959.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; __yadk_uid=wBD152Qkg8CojaIRAPIB7nXOYiwGgYAj',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
# The second <p> inside each <td> holds the review text.
print(soup.select("tr > td > p:nth-of-type(2)"))
The scraped reviews look like this (a rule can be added to strip the p tags):
[<p>看之前:不就是个医疗剧能拍出什么花?? 看之后:为什么一个医疗剧可以拍出这么多花??</p>, <p>高中时期的下饭剧</p>]
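One possible rule for dropping the p tags is to call get_text() on each element and keep only the text. The sketch below reuses the two elements from the output above as sample input; the output file name reviews_sample.txt and the temp-directory location are hypothetical, chosen only so the example does not touch real data.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# The two <p> elements shown in the output above, used here as sample input.
html = (
    "<p>看之前:不就是个医疗剧能拍出什么花?? 看之后:为什么一个医疗剧可以拍出这么多花??</p>"
    "<p>高中时期的下饭剧</p>"
)

soup = BeautifulSoup(html, "html.parser")
# get_text() returns the text content without the surrounding tag.
reviews = [p.get_text(strip=True) for p in soup.select("p")]
text = "\n".join(reviews)
print(text)

# Hypothetical output location: one review per line, saved as UTF-8.
out_path = os.path.join(tempfile.gettempdir(), "reviews_sample.txt")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(text)
```

A plain-text file written this way is the kind of input the word cloud step reads.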
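Step 4: Turn the scraped reviews into a word cloud
The main modules used are jieba, wordcloud, and image, all of which can be installed with pip. The code below also imports numpy and matplotlib, and takes Image from PIL, which the Pillow package provides. It relies on these files:
Scraped review data: F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt
A still from House M.D. found online: F:\\python\\install_3_7_4\\txt\\haosiyisheng.png
Font used for the word cloud: C:/Windows/Fonts/msyh.ttc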
#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import numpy as np
import jieba
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def GetWordCloud():
    path_txt = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt"
    path_img = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.png"
    # Read the scraped reviews; the with-block closes the file afterwards.
    with open(path_txt, 'r', encoding='UTF-8') as review_file:
        f = review_file.read()
    # The still image serves as the mask that shapes the cloud.
    background_image = np.array(Image.open(path_img))
    # jieba segments the Chinese text; WordCloud expects space-separated words.
    cut_text = " ".join(jieba.cut(f))
    wordcloud = WordCloud(
        font_path="C:/Windows/Fonts/msyh.ttc",  # a font that contains Chinese glyphs
        background_color="white",
        mask=background_image
    ).generate(cut_text)
    ax.imshow(wordcloud)
    ax.axis("off")
    plt.show()
    wordcloud.to_file(r"haosiyisheng_result.png")


if __name__ == '__main__':
    GetWordCloud()
The final word cloud:
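To check which keywords end up dominating the cloud, the segmented text can be counted before it is drawn. This is a rough sketch under stated assumptions: the cut_text string is a hypothetical example of space-separated tokens (the same form that " ".join(jieba.cut(...)) yields), and the stop-word set and the helper keyword_frequencies are made up for illustration.

```python
from collections import Counter

# Hypothetical segmented text, space-separated like the cut_text variable above.
cut_text = "医疗 剧 拍 出 花 医疗 剧 下饭 剧 高中 时期 的 下饭 剧"
STOPWORDS = {"的", "出"}  # assumed stop-word list, for illustration only


def keyword_frequencies(segmented, stopwords, min_length=2):
    """Count tokens, skipping stop words and tokens shorter than min_length."""
    tokens = segmented.split()
    kept = [t for t in tokens if t not in stopwords and len(t) >= min_length]
    return Counter(kept)


frequencies = keyword_frequencies(cut_text, STOPWORDS)
print(frequencies.most_common(3))
```

A dictionary of counts like this could be handed to WordCloud's generate_from_frequencies method in place of generate, which gives direct control over which words are drawn.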
Step 5: Errors encountered while coding, and how to fix them
1. Fixing ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
This error shows up frequently when running pip install <module>:
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
One option is to switch pip to a mirror hosted in China:
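http://pypi.douban.com/ (Douban)
http://pypi.hustunique.com/ (Huazhong University of Science and Technology)
http://pypi.sdutlinux.org/ (Shandong University of Technology)
http://pypi.mirrors.ustc.edu.cn/ (University of Science and Technology of China)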
The simplest way is to pass the index directly on the command line. The example below uses the Douban mirror:
pip install -i https://pypi.douban.com/simple <package to install>
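When several packages need installing, the same command can be assembled from a script. The sketch below is only an illustration: the helper names build_pip_install_command and install are my own, and the 60-second timeout is an assumed value. The -i and --timeout options are standard pip options.

```python
import subprocess
import sys

DOUBAN_INDEX = "https://pypi.douban.com/simple"  # mirror URL from the command above


def build_pip_install_command(package, index_url=DOUBAN_INDEX, timeout_seconds=60):
    """Return the argument list for 'pip install' against a given index."""
    return [
        sys.executable, "-m", "pip", "install",
        "-i", index_url,
        "--timeout", str(timeout_seconds),  # assumed value; raises the socket timeout
        package,
    ]


def install(package):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.run(build_pip_install_command(package), check=True)


command = build_pip_install_command("jieba")
print(" ".join(command))
```

Calling pip through sys.executable -m pip keeps the install tied to the interpreter that runs the script, which matters when more than one Python is installed.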
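2. Installing wordcloud
Installing wordcloud hit a snag. The approach that worked is as follows.
First open this link: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
Download the wordcloud build that matches your Python version tag. My local Python is tagged 37 (Python 3.7), so I downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl.
Install the wheel module before installing the .whl file: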
pip install wheel
Copy the downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl file into the /Scripts directory under the Python installation directory, and run the following there:
$ pip install wordcloud-1.6.0-cp37-cp37m-win32.whl
Processing f:\python\install_3_7_4\scripts\wordcloud-1.6.0-cp37-cp37m-win32.whl
Requirement already satisfied: pillow in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (7.1.1)
Requirement already satisfied: numpy>=1.6.1 in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (1.18.2)
Requirement already satisfied: matplotlib in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (3.2.1)
Requirement already satisfied: kiwisolver>=1.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.4.7)
Requirement already satisfied: cycler>=0.10 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.8.1)
Requirement already satisfied: six in f:\python\install_3_7_4\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud==1.6.0) (1.14.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.6.0
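Choosing the right .whl file comes down to the tags in its file name, which have to match the interpreter that pip runs under. The sketch below splits the file name used above into its parts and prints the matching facts about the running interpreter. The helper names are my own, and the cp prefix assumes a CPython interpreter.

```python
import struct
import sys


def parse_wheel_tags(filename):
    """Split a wheel file name into its name, version, and compatibility tags."""
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    # The last three fields are the python tag, ABI tag, and platform tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def interpreter_summary():
    """Describe the running interpreter in terms comparable to the wheel tags."""
    return {
        # Assumes CPython, hence the 'cp' prefix.
        "python_tag": "cp{}{}".format(sys.version_info.major, sys.version_info.minor),
        "bits": struct.calcsize("P") * 8,  # pointer size: 32 or 64
    }


tags = parse_wheel_tags("wordcloud-1.6.0-cp37-cp37m-win32.whl")
print(tags)
print(interpreter_summary())
```

A win32 wheel fits a 32-bit Python build, even on 64-bit Windows, while a 64-bit Python build needs the win_amd64 variant, so it is the interpreter's bit width that decides, not the operating system's.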
3. Use pip list to view the installed modules
$ pip list
Package         Version
--------------- ----------
asgiref         3.2.7
beautifulsoup4  4.9.0
bs4             0.0.1
certifi         2020.4.5.1
chardet         3.0.4
cycler          0.10.0
Django          3.0.5
idna            2.9
image           1.5.30
jieba           0.39
kiwisolver      1.2.0
matplotlib      3.2.1
numpy           1.18.2
Pillow          7.1.1
pip             19.2.3
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2019.3
requests        2.23.0
setuptools      40.8.0
six             1.14.0
soupsieve       2.0
sqlparse        0.3.1
urllib3         1.25.8
wheel           0.34.2
wordcloud       1.