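[Python Text Mining] 7 Fun, Practical Examples of English Text-Mining Tools
How can a computer help you:
Count the words in a TOEFL passage?
Mark all the synonyms and antonyms?
Mark all the adjectives? (Adverbs, conjunctions, and so on work too.)
Count the sentences in The Little Prince?
Mark the subject and predicate of every sentence?
Estimate how hard an Economist article is to read?
Tell whether the sentiment of a Trump tweet is positive or negative?
Mark which lyrics in a pop song rhyme?
... ...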
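As an online learning designer, I process, count, and analyze large amounts of unfamiliar English text every day.
Ever since I started using Python, I have been exploring tools to help with this work.
This post summarizes the tools I have used:
1. nltk (word_tokenize, sent_tokenize, corpus.cmudict, pos_tag)
2. spaCy
3. textstat
4. textblob
Each introduction starts from a problem to solve rather than listing every feature.
If you want to dig deeper, follow the documentation links.
* The visualization code is not shown in this post; see GitHub:
https://github.com/thecrossed/7_text_mining_tools_python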
1. nltk (word_tokenize, sent_tokenize)
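NLTK stands for Natural Language Toolkit, a suite of Python libraries and programs for English natural language processing.
Documentation: www.nltk.org
NLTK Book: www.nltk.org
Its word_tokenize and sent_tokenize functions split a text into words and into sentences, respectively.
Problem: compare the length of two texts (the number of sentences in each, and the length of those sentences)
We often run into large amounts of unfamiliar text without knowing how long it is. nltk can count the sentences in each of two texts, and the words in each sentence.
Running the code prints "Text_a has 40 sentences."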
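The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the helper names `sentence_lengths` and `summarize` are hypothetical, and the tokenizers are passed in as parameters so the logic can be tried without NLTK's tokenizer data. When no tokenizer is supplied, it falls back to NLTK's `sent_tokenize` and `word_tokenize`.

```python
import re
from statistics import mean


def sentence_lengths(text, sent_split=None, word_split=None):
    """Return the number of words in each sentence of `text`.

    `sent_split` and `word_split` are hypothetical hooks; when omitted,
    NLTK's sent_tokenize / word_tokenize are imported and used instead
    (this needs nltk and its tokenizer data to be installed).
    """
    if sent_split is None or word_split is None:
        from nltk.tokenize import sent_tokenize, word_tokenize
        sent_split = sent_split or sent_tokenize
        word_split = word_split or word_tokenize

    lengths = []
    for sentence in sent_split(text):
        # Keep only tokens that contain a letter or digit, so that
        # stand-alone punctuation tokens are not counted as words.
        words = [tok for tok in word_split(sentence) if re.search(r"\w", tok)]
        lengths.append(len(words))
    return lengths


def summarize(name, text, sent_split=None, word_split=None):
    """Build a one-line report of sentence count and average sentence length."""
    lengths = sentence_lengths(text, sent_split, word_split)
    if not lengths:
        return f"{name} has 0 sentences."
    return (f"{name} has {len(lengths)} sentences, "
            f"averaging {mean(lengths):.1f} words each.")
```

With NLTK and its tokenizer data installed, `print(summarize("Text_a", text_a))` would report the sentence count and average length for a string `text_a`; the per-sentence list from `sentence_lengths` is the kind of data a length plot could be drawn from.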
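Next, we visualize the lengths of the two texts.
The plot shows that:
Text a has about twice as many sentences as Text b.
Text a's average sentence length (in words) is higher than Text b's: about 17 versus 11.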
2. nltk(pos_tag)
pos_tag takes a sequence of words and returns each word's part of speech.
Problem: how to mark all the adjectives in a text
The text is an IELTS writing sample essay:
Crime is an issue of increasing concern around the world, and more money than ever before is being spent on the detection and punishment of criminal activity. The reasons why people commit crime are countless, but drugs and alcohol, social problems and poverty play a major role. To solve these problems, governments can either focus on draconian punishments, or improve employment opportunities, invest in good housing projects and tackle drug and alcohol abuse.
One of the main causes of criminality is the use, sale and trafficking of narcotics. For example, the sale of drugs is organised by armed criminal gangs who illegally traffic drugs and control their business with extreme violence. Drug-related crime does not end there; drug users often steal to fund theft habit, resulting in further acts of petty crime. The social problems connected with crime are said to be the result of single-parent families, absent role models and bad living conditions. The children from these broken families often become criminals because they feel alienated from society. Poverty is also a reason behind crime. When unskilled jobs pay so little and prices are so high, it's easy to see why some turn to crime for an income.
Crime can, of course, be dealt with by toughening criminal laws and introducing longer custodial sentences for persistent criminals, but some of the best ways to deal with crime may be to deal with the social causes. Increasing employment opportunities in poorer areas would improve living standards, which would mean access to affordable housing and education. Government funding for drug and alcohol rehabilitation programmes would help reduce dependency on stimulants and the need for the criminal activity that surrounds them.
In conclusion, crime is a major issue, but cracking down on offenders with a harsh penal system is not the only way. These problems can be solved through the government providing jobs and funding which should raise living standards and dramatically reduce crime levels.
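List every adjective in the essay in one go (the tagger's output is shown unedited; note that it mislabels the verb 'tackle' as an adjective):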
['criminal', 'countless', 'social', 'major', 'draconian', 'good', 'tackle', 'main', 'armed', 'criminal', 'extreme', 'Drug-related', 'further', 'petty', 'social', 'single-parent', 'absent', 'bad', 'broken', 'unskilled', 'little', 'high', 'easy', 'criminal', 'custodial', 'persistent', 'social', 'criminal', 'major', 'harsh', 'penal', 'only']
Then mark them in the essay in uppercase:
Crime is an issue of increasing concern around the world, and more money than ever before is being spent on the detection and punishment of CRIMINAL activity. The reasons why people commit crime are COUNTLESS, but drugs and alcohol, SOCIAL problems and poverty play a MAJOR role. To solve these problems, governments can either focus on DRACONIAN punishments, or improve employment opportunities, invest in GOOD housing projects and TACKLE drug and alcohol abuse. One of the MAIN causes of criminality is the use, sale and trafficking of narcotics. For example, the sale of drugs is organised by ARMED CRIMINAL gangs who illegally traffic drugs and control their business with EXTREME violence. DRUG-RELATED crime does not end there; drug users often steal to fund theft habit, resulting in FURTHER acts of PETTY crime. The SOCIAL problems connected with crime are said to be the result of SINGLE-PARENT families, ABSENT role models and BAD living conditions. The children from these BROKEN families often become criminals because they feel alienated from society. Poverty is also a reason behind crime. When UNSKILLED jobs pay so LITTLE and prices are so HIGH, it's EASY to see why some turn to crime for an income. Crime can, of course, be dealt with by toughening CRIMINAL laws and introducing longer CUSTODIAL sentences for PERSISTENT criminals, but some of the best ways to deal with crime may be to deal with the SOCIAL causes. Increasing employment opportunities in poorer areas would improve living standards, which would mean access to affordable housing and education. Government funding for drug and alcohol rehabilitation programmes would help reduce dependency on stimulants and the need for the CRIMINAL activity that surrounds them. In conclusion, crime is a MAJOR issue, but cracking down on offenders with a HARSH PENAL system is not the ONLY way. 
These problems can be solved through the government providing jobs and funding which should raise living standards and dramatically reduce crime levels.
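The two steps above (listing the adjectives, then uppercasing them in the text) might look something like this. It is a sketch under assumptions: the function names are hypothetical, and the default filter keeps only the plain `JJ` tag because the list above appears to contain no comparatives or superlatives such as "longer" or "best" (pass `("JJ", "JJR", "JJS")` to include them). `tag_words` relies on `nltk.word_tokenize` and `nltk.pos_tag`, which need NLTK's tokenizer and tagger data.

```python
import re


def tag_words(text):
    """Tokenize `text` and tag each token with NLTK (imported lazily)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(text))


def adjectives(tagged, tags=("JJ",)):
    """Return the words whose Penn Treebank tag is in `tags`."""
    return [word for word, tag in tagged if tag in tags]


def uppercase_words(text, words):
    """Uppercase every whole-word occurrence of the given words.

    Caveat: this matches by spelling, so a word is uppercased wherever
    it appears, even in places where the tagger did not call it an adjective.
    """
    if not words:
        return text
    # Longest first, so "Drug-related" is tried before any shorter overlap.
    ordered = sorted(set(words), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return re.sub(pattern, lambda m: m.group(0).upper(), text)
```

For an essay held in a string `essay`, `uppercase_words(essay, adjectives(tag_words(essay)))` would produce output in the style shown above, though the exact tags depend on the tagger version.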
3. nltk.corpus (wordnet)
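wordnet is a huge database of English words. Nouns, verbs, adjectives, adverbs, and so on are all linked to one another in it by meaning.
Problem: mark the synonymous adjectives in the essay
Continuing with the essay above, we look among its adjectives for any that are synonyms of each other.
The result is one synonymous pair: little and petty.
Mark them in the essay in uppercase: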
Crime is an issue of increasing concern around the world, and more money than ever before is being spent on the detection and punishment of criminal activity. The reasons why people commit crime are countless, but drugs and alcohol, social problems and poverty play a major role. To solve these problems, governments can either focus on draconian punishments, or improve employment opportunities, invest in good housing projects and tackle drug and alcohol abuse. One of the main causes of criminality is the use, sale and trafficking of narcotics. For example, the sale of drugs is organised by armed criminal gangs who illegally traffic drugs and control their business with extreme violence. Drug-related crime does not end there; drug users often steal to fund theft habit, resulting in further acts of PETTY crime. The social problems connected with crime are said to be the result of single-parent families, absent role models and bad living conditions. The children from these broken families often become criminals because they feel alienated from society. Poverty is also a reason behind crime. When unskilled jobs pay so LITTLE and prices are so high, it's easy to see why some turn to crime for an income. Crime can, of course, be dealt with by toughening criminal laws and introducing longer custodial sentences for persistent criminals, but some of the best ways to deal with crime may be to deal with the social causes. Increasing employment opportunities in poorer areas would improve living standards, which would mean access to affordable housing and education. Government funding for drug and alcohol rehabilitation programmes would help reduce dependency on stimulants and the need for the criminal activity that surrounds them. In conclusion, crime is a major issue, but cracking down on offenders with a harsh penal system is not the only way. 
These problems can be solved through the government providing jobs and funding which should raise living standards and dramatically reduce crime levels.
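One way to search for synonymous pairs like this is to collect, for each adjective, the lemma names of its WordNet adjective synsets and check whether another adjective from the list appears among them. The sketch below assumes that approach; it is not necessarily how the original result was computed, and the function names are hypothetical. The lookup is injectable so the pairing logic can be tried without the WordNet corpus.

```python
from itertools import combinations


def wordnet_adjective_lemmas(word):
    """Return every lemma name in the adjective synsets of `word`.

    Uses NLTK's WordNet reader (imported lazily; needs the wordnet corpus).
    """
    from nltk.corpus import wordnet as wn
    return {
        name.lower()
        for synset in wn.synsets(word)
        if synset.pos() in ("a", "s")  # adjectives and satellite adjectives
        for name in synset.lemma_names()
    }


def synonym_pairs(words, lemmas_of=wordnet_adjective_lemmas):
    """Return sorted pairs of distinct words that share a synset lemma."""
    unique = sorted({w.lower() for w in words})
    lemma_sets = {w: lemmas_of(w) for w in unique}
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if b in lemma_sets[a] or a in lemma_sets[b]
    ]
```

Calling `synonym_pairs(adjective_list)` on the adjectives found earlier would return the pairs to uppercase; which pairs come back depends on the installed WordNet version.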
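4. spaCy
spaCy is an open-source library written in Python and Cython for NLP tasks. Compared with NLTK, spaCy is used more widely in industry.
In the previous task, nltk's pos_tag labeled each word's part of speech. Now we use spaCy to label the role each word plays within its sentence.
Problem: mark the root verb (the main predicate verb) of every sentence in the essay
The result is shown below: each sentence is followed by its root verb in uppercase.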
Crime is an issue of increasing concern around the world, and more money than ever before is being spent on the detection and punishment of criminal activity. IS
The reasons why people commit crime are countless, but drugs and alcohol, social problems and poverty play a major role. ARE
To solve these problems, governments can either focus on draconian punishments, or improve employment opportunities, invest in good housing projects and tackle drug and alcohol abuse. FOCUS
One of the main causes of criminality is the use, sale and trafficking of narcotics. IS
For example, the sale of drugs is organised by armed criminal gangs who illegally traffic drugs and control their business with extreme violence. ORGANISED
Drug-related crime does not end there; drug users often steal to fund theft habit, resulting in further acts of petty crime. STEAL
The social problems connected with crime are said to be the result of single-parent families, absent role models and bad living conditions. SAID
The children from these broken families often become criminals because they feel alienated from society. BECOME
Poverty is also a reason behind crime. IS
When unskilled jobs pay so little and prices are so high, it's easy to see why some turn to crime for an income. 'S
Crime can, of course, be dealt with by toughening criminal laws and introducing longer custodial sentences for persistent criminals, but some of the best ways to deal with crime may be to deal with the social causes. DEALT
Increasing employment opportunities in poorer areas would improve living standards, which would mean access to affordable housing and education. IMPROVE
Government funding for drug and alcohol rehabilitation programmes would help reduce dependency on stimulants and the need for the criminal activity that surrounds them. HELP
In conclusion, crime is a major issue, but cracking down on offenders with a harsh penal system is not the only way. IS
These problems can be solved through the government providing jobs and funding which should raise living standards and dramatically reduce crime levels. SOLVED
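Output in this style can be produced from spaCy's dependency parse, where each sentence span exposes its syntactic head as `root`. The following is a minimal sketch: the model name `en_core_web_sm` is an assumption (the post does not say which model was used), and `parse` and `root_words` are hypothetical helper names.

```python
def parse(text, model="en_core_web_sm"):
    """Parse `text` with spaCy (imported lazily).

    The default model name is an assumption; it must be installed separately.
    """
    import spacy
    nlp = spacy.load(model)
    return nlp(text)


def root_words(doc):
    """Return (sentence text, uppercased root word) for each sentence.

    Works on any object whose `.sents` yields spans with `.text` and
    `.root.text`, such as a parsed spaCy Doc. The root is the head of the
    dependency tree; it is usually the main verb but is not guaranteed to be.
    """
    return [(sent.text.strip(), sent.root.text.upper()) for sent in doc.sents]
```

For example, `for sentence, root in root_words(parse(essay)): print(sentence, root)` would print each sentence followed by its root, as in the listing above.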
5. textstat
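textstat is a simple, easy-to-use text analysis tool.
Problem: whose tweets use simpler language, Trump's or Obama's?
textstat's flesch_reading_ease function estimates how easy an English text is to read; the higher the score, the easier the text.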
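With textstat installed, the score is obtained by calling `textstat.flesch_reading_ease(text)`. To show what the number means, here is a rough pure-Python sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The syllable counter is a deliberately naive assumption (it counts vowel groups), so its scores will differ somewhat from textstat's; both function names are hypothetical.

```python
import re


def count_syllables(word):
    """Naive estimate: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease_sketch(text):
    """Approximate the Flesch Reading Ease score of `text`.

    Higher scores mean easier text. This is only an illustration of the
    formula, not textstat's implementation.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

Short sentences built from one-syllable words score high, while long sentences full of polysyllabic words push the score down, which is why a lower score signals harder language.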
6. textblob
textblob is built on top of NLTK and Pattern, and is very friendly to newcomers to NLP.
It supports sentiment analysis, POS tagging, and noun phrase extraction.
Problem: run sentiment analysis on Trump's and Obama's tweets
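A sentiment score per tweet can be read from TextBlob's `sentiment.polarity`, a float between -1.0 (most negative) and 1.0 (most positive). The sketch below is illustrative only: `polarity`, `label`, and `score_tweets` are hypothetical helpers, the zero threshold is an assumption, and the scorer is injectable so the labeling can be tried without textblob installed.

```python
def polarity(text):
    """Return TextBlob's polarity for `text` (textblob imported lazily)."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def label(score, threshold=0.0):
    """Map a polarity score to a coarse label (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def score_tweets(tweets, scorer=polarity):
    """Return (tweet, score, label) for each tweet in order."""
    results = []
    for tweet in tweets:
        score = scorer(tweet)
        results.append((tweet, score, label(score)))
    return results
```

The scores returned by `score_tweets` for each author's tweets are the kind of values that can be plotted against a readability score to compare the two.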
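The visualization shows the following.
From this sample:
Trump's tweets use more extreme language than Obama's (the points at 1.0, -0.6, and -0.4 are all blue).
Most tweets from both fall in the 0 to 0.6 range (positive).
The two are fairly close in reading difficulty; Obama has two red points below 40 (harder language).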
7. nltk (corpus.cmudict.dict())
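corpus.cmudict.dict() loads the CMU Pronouncing Dictionary that ships as an NLTK corpus; it records the pronunciation, as a sequence of phonemes, of more than 130,000 words.
Problem: in a pop song (Close to You), which lines end in words that rhyme?
The rhyme column holds the last syllable of each line; lines with the same value rhyme.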
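The rhyme check described above could be sketched as follows. Assumptions are flagged in the comments: the "last syllable" is approximated as the final vowel phoneme plus everything after it, stress digits are ignored when comparing, and only the first listed pronunciation of each word is used. The function names are hypothetical, and the pronunciation lookup is injectable so the grouping can be tried without the cmudict corpus.

```python
import re


def cmudict_pronounce():
    """Return a lookup word -> first CMUdict pronunciation, or None.

    Uses nltk.corpus.cmudict (imported lazily; needs the cmudict corpus).
    """
    from nltk.corpus import cmudict
    table = cmudict.dict()
    return lambda word: table[word][0] if word in table else None


def last_syllable(phones):
    """Approximate the last syllable of an ARPAbet phoneme list.

    In CMUdict, vowel phonemes end in a stress digit (e.g. 'UW1'), so we
    take the last such phoneme and everything after it, dropping the digits.
    """
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1].isdigit():
            return tuple(p.rstrip("012") for p in phones[i:])
    return tuple(phones)


def rhyme_groups(lines, pronounce):
    """Group lines whose final words share the same last syllable."""
    groups = {}
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())
        phones = pronounce(words[-1]) if words else None
        if phones is None:
            continue  # skip empty lines and words missing from the dictionary
        groups.setdefault(last_syllable(phones), []).append(line)
    # Only keys shared by at least two lines count as a rhyme.
    return {key: found for key, found in groups.items() if len(found) > 1}
```

With the corpus installed, `rhyme_groups(lyrics, cmudict_pronounce())` takes a list of lyric lines and returns the groups of lines that rhyme, keyed by their shared ending.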