How to Mine and Analyze User Comments with Python

疾风先生

2019-06-24


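1. Use the nextpage function to collect the ids you need

As the name suggests, this is a "go to the next page" function. It opens the given url, finds the pager element by its id, clicks through the pages automatically, and collects the stock codes from each page.

Taking the Shanghai and Shenzhen markets as an example, press F12 (Fn+F12) on the page, open the Elements panel, and find the id of the "next page" button. The stock codes can then be extracted from the page source with a regular expression.

Note: this function requires the selenium module to be installed and the Chrome driver to be available through your environment variables (PATH).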

url = http://quote.eastmoney.com/center/gridlist.html#hs_a_board

import random
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def nextpage(url, pages_num, rule):
    page = 0
    codes = []
    driver = webdriver.Chrome()
    driver.get(url)
    while page < pages_num:
        # wait until the quote table has been rendered
        WebDriverWait(driver, 15, 0.5).until(EC.presence_of_element_located((By.ID, "table_wrapper")))
        code = re.findall(rule, driver.page_source, flags=re.S)
        codes = code + codes
        # find the element whose id marks the "next page" button and click it;
        # find_element_by_id is gone from recent Selenium 4 releases, so use find_element(By.ID, ...)
        driver.find_element(By.ID, 'main-table_next').click()
        page = page + 1
        time.sleep(random.uniform(0.3, 1))  # sleep for a random fraction of a second
    driver.quit()
    return codes

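The function takes three inputs:

url = the page that contains the id; pages_num = the number of pages to turn; rule = the regular expression rule

It is usually a good idea to pause with time.sleep between page turns, and then to deduplicate the collected stock codes with set so that nothing is crawled twice. The stock codes can then be collected with the following code: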

# crawl the relevant pages of stocks
url = 'http://quote.eastmoney.com/center/gridlist.html#us_stocks'
pages_num = 277
items = nextpage(url, pages_num, '/us/(.*?).html')  # extract the stock codes from the pages
items = list(set(items))  # deduplicate to avoid crawling the same stock twice
print('There are %s items relevant to the us.' % len(items))

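With nextpage and the regular expression, all stock codes listed under the original url have now been collected, ready for fetching the user comments in the next step.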
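To make the extraction step concrete without launching a browser, here is a minimal sketch that applies the same kind of rule to a small, made-up page fragment. SAMPLE_PAGE and extract_codes are hypothetical names introduced only for this illustration, and the real page markup may differ. The sketch also escapes the dot in the rule and deduplicates with dict.fromkeys, which keeps the first-seen order that set() does not guarantee.

```python
import re

# Hypothetical fragment of a quote-list page; the real markup may differ.
SAMPLE_PAGE = '''
<a href="/us/AAPL.html">AAPL</a>
<a href="/us/MSFT.html">MSFT</a>
<a href="/us/AAPL.html">Apple</a>
'''


def extract_codes(page_source, rule):
    """Return the codes matched by `rule`, deduplicated in first-seen order."""
    matches = re.findall(rule, page_source, flags=re.S)
    # dict.fromkeys preserves insertion order, unlike set()
    return list(dict.fromkeys(matches))


codes = extract_codes(SAMPLE_PAGE, r'/us/(.*?)\.html')
print(codes)  # ['AAPL', 'MSFT']
```

Keeping the order stable is optional, but it makes repeated crawls easier to compare and resume.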


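2. Use the crawl function to fetch the HTML files of the stock comments

The modules needed here are urllib and ssl.

The example below shows the crawl function fetching the required HTML. When processing in batches this way, wrap each request in try/except, print the urls that failed, and work out why they failed.

Note: do not send requests too frequently; use sleep() to pause appropriately. Test a single url before crawling in bulk, and record the success or failure of each request as you go.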

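import os
import random
import ssl
import time
from urllib import request

# Assumption: the original snippet used `context` without defining it;
# a default SSL context is created here.
context = ssl.create_default_context()


def crawl(url):
    page = request.urlopen(url, context=context, timeout=5)
    html = page.read().decode('utf-8')
    return html


os.makedirs('./us_pages', exist_ok=True)  # make sure the output directory exists
for item in items:
    item_url = 'http://guba.eastmoney.com/list,us' + item + '.html'
    print(item_url)
    try:
        item_html = crawl(item_url)
        file_name = './us_pages/' + item + '.html'
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(item_html)
        print(item, 'success')
    except Exception:
        print(item, 'failed', item_url)
    time.sleep(random.uniform(0.3, 1))  # sleep for a random fraction of a second, also after a failure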
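The advice above about recording failures and pausing between requests can be sketched as a small helper. This is only one possible arrangement, not the author's code: crawl_all, its retries and pause parameters, and fake_fetch are hypothetical names, and the fetch function is passed in so the logic can be tried without touching the network.

```python
import random
import time


def crawl_all(urls, fetch, retries=2, pause=(0.3, 1.0), sleep=time.sleep):
    """Fetch every url with `fetch`, retrying failures; return (pages, failed)."""
    pages, failed = {}, {}
    for url in urls:
        for _attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                failed.pop(url, None)  # an earlier attempt may have failed
                break
            except Exception as exc:
                failed[url] = repr(exc)  # keep the reason for later inspection
            finally:
                sleep(random.uniform(*pause))  # pause after every attempt
    return pages, failed


# Stand-in for a real fetch function, used only to exercise the helper.
def fake_fetch(url):
    if 'bad' in url:
        raise ValueError('simulated failure')
    return '<html>' + url + '</html>'


pages, failed = crawl_all(['ok-1', 'bad-1', 'ok-2'], fake_fetch, sleep=lambda s: None)
print(sorted(pages))  # ['ok-1', 'ok-2']
print(list(failed))   # ['bad-1']
```

In a real run, fetch would be a function like crawl, and the failed dictionary gives a list of urls to investigate or retry later.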

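3. Use the cleantext function to extract and clean the comments

The third hand-written function, cleantext, extracts and cleans the comment text in preparation for word segmentation and analysis. It uses regular expressions much like nextpage does: it pulls the comment text out of the HTML file with a regex rule and then runs re.sub() over contents to clean it: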

import glob
import re


def cleantext(html):
    result = ''
    contents = re.findall('.html" title="(.*?)">', html)
    for content in contents:
        content = re.sub('<|>', '', content, flags=re.S)  # strip stray angle brackets
        result += content + '\n'
    return result


# clean for titles of comments
files = glob.glob('./us_pages/*.html')  # the crawled html files, returned as a list
# all cleaned comment pages are written to this file for later analysis
with open('all_us_stocks.txt', 'w', encoding='utf-8') as all_stocks:
    for file in files:
        print('process', file)
        with open(file, 'r', encoding='utf-8') as f:
            content = f.read()
        stock = cleantext(content)
        all_stocks.write(stock)
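To see what the title rule captures, the following sketch runs a similar pattern over a short, invented snippet. SAMPLE and extract_titles are hypothetical, and the real comment-list markup may look different. As a variation on cleantext, it also decodes HTML entities with html.unescape and trims whitespace, which is an assumption about what cleaning is useful rather than something the original function does.

```python
import html
import re


def extract_titles(page):
    """Return the title attribute of every link ending in .html, unescaped."""
    titles = re.findall(r'\.html" title="(.*?)">', page)
    return [html.unescape(t).strip() for t in titles]


# Invented markup for illustration only.
SAMPLE = (
    '<a href="/news,usAAPL,1.html" title="Earnings beat &amp; guidance raised">...</a>\n'
    '<a href="/news,usAAPL,2.html" title=" Is it time to sell? ">...</a>\n'
)

titles = extract_titles(SAMPLE)
print(titles)  # ['Earnings beat & guidance raised', 'Is it time to sell?']
```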

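Taking Shanghai and Shenzhen stocks as an example, part of the resulting comment text looks like this:

[Figure: sample of cleaned comment text]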

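4. Build a word cloud from word segmentation and keyword extraction

First segment the text with the jieba module (main function: jieba.cut()), then use jieba.analyse.extract_tags() to extract keywords together with their TF-IDF weights, and finally draw the word cloud with the WordCloud library. Its generate_from_frequencies() function accepts custom word weights, here the TF-IDF values computed by jieba, and the background shape of the cloud can be customized.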

import jieba.analyse as ale
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

filename = './all_xsb_stocks.txt'
font_path = './siyuan.ttf'  # path to a font with Chinese glyphs; without it Chinese text is not rendered
pic_path = './cup.jpg'
# step 1. extract keywords
with open(filename, 'r', encoding='utf-8') as f:
    content = f.read()
keywords = ale.extract_tags(content, topK=100, withWeight=True, allowPOS=())
# keywords = ale.textrank(content, topK=100, withWeight=True)
d = {}
for kw in keywords:
    # print(kw)
    d[kw[0]] = kw[1]
# step 2. draw the word cloud
# scipy.misc.imread has been removed from SciPy, so the image is read with Pillow instead
pic = np.array(Image.open(pic_path))
pic_color = ImageColorGenerator(pic)
wc = WordCloud(scale=4, font_path=font_path, mask=pic, color_func=pic_color, background_color='white')
wc.fit_words(d)  # alias of generate_from_frequencies()
wc.to_file('./uk_tags.png')
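For readers unfamiliar with the weights passed to the word cloud, here is a toy TF-IDF computation over three tiny, made-up documents. It is a sketch of the general idea only: jieba's extract_tags, as far as I understand, takes its IDF values from a prebuilt table rather than from your own corpus, so its numbers will not match this one. The tf_idf function and the sample docs are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # number of documents each word appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    ['stock', 'up', 'buy', 'stock'],
    ['stock', 'down', 'sell'],
    ['stock', 'hold'],
]
w = tf_idf(docs)
print(w[0]['stock'])          # 0.0, because "stock" appears in every document
print(round(w[0]['buy'], 4))  # 0.2747
```

A word that occurs in every document gets zero weight here no matter how often it appears, which is the property that lets TF-IDF rank distinctive words above merely frequent ones.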

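Again taking the Shanghai and Shenzhen markets as an example, the resulting image looks like this:

[Figure: word cloud for Shanghai and Shenzhen stocks]

The next one is for Hong Kong stocks; as you can see, it is mostly in Traditional Chinese:

[Figure: word cloud for Hong Kong stocks]

The NEEQ (New Third Board), where attention goes to "科技" (technology), "智能" (intelligent), and "生物" (biology):

[Figure: word cloud for NEEQ stocks]

US stocks are basically English-language announcements and comments:

[Figure: word cloud for US stocks]

Comments on UK stocks are mostly in Chinese, and it is unclear why so many people consider UK stocks "垃圾" (garbage):

[Figure: word cloud for UK stocks]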

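However, this keyword extraction method has some shortcomings: for example, words that are frequent but not very important often take the top few places.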
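One common way to soften this problem is to filter the keyword dictionary through a stop-word list before drawing the cloud. The sketch below is an assumption about how that could be done, not part of the original workflow; drop_stop_words, the weights, and the stop-word set are all made up for illustration. jieba also offers jieba.analyse.set_stop_words(), which reads a stop-word file before extraction, and that may be the simpler route in practice.

```python
def drop_stop_words(weights, stop_words, top_k=None):
    """Remove stop words from a {word: weight} dict and keep the top_k heaviest."""
    kept = {word: score for word, score in weights.items() if word not in stop_words}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])  # top_k=None keeps everything


# Hypothetical weights and stop words, not real extraction results.
weights = {'股票': 0.9, '公司': 0.8, '科技': 0.5, '智能': 0.4}
stop_words = {'股票', '公司'}

filtered = drop_stop_words(weights, stop_words)
print(filtered)  # {'科技': 0.5, '智能': 0.4}
```

The filtered dictionary can be passed to fit_words() in place of the unfiltered one.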
