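Topic Modeling in Python with NLTK and Gensim
In this post, we will learn how to identify which topics are discussed in a document, a task known as topic modeling.
Topic modeling is an unsupervised machine learning method that helps us discover the hidden semantic structure in a collection of papers, allowing us to learn topic representations of the papers in a corpus. The model can be applied to any kind of label on documents, such as the tags on posts on a website.
Text Cleaning
We use the following function to clean our text and return a list of tokens: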
# spaCy's blank English pipeline is enough for tokenization; no trained model is needed.
from spacy.lang.en import English

parser = English()

def tokenize(text):
    """Split text into lowercase tokens, replacing URLs and @handles with placeholders."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
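To make the three token rules easier to try without installing spaCy, here is a minimal sketch that applies the same placeholder logic to whitespace-separated words. The helper name `simple_tokenize` and the `URL_RE` pattern are our own assumptions for illustration; spaCy's tokenizer also splits punctuation and detects URLs more carefully, so the results will not always match.

```python
import re

# Rough URL check (an assumption; spaCy's like_url is more thorough).
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def simple_tokenize(text):
    """Whitespace-based stand-in for the spaCy tokenizer (hypothetical helper)."""
    tokens = []
    for raw in text.split():
        if URL_RE.match(raw):
            tokens.append("URL")          # collapse every URL into one placeholder
        elif raw.startswith("@"):
            tokens.append("SCREEN_NAME")  # collapse every @handle into one placeholder
        else:
            tokens.append(raw.lower())    # everything else is lowercased
    return tokens

print(simple_tokenize("Read http://example.com via @alice Today"))
# ['read', 'URL', 'via', 'SCREEN_NAME', 'today']
```

Replacing URLs and handles with a shared placeholder keeps thousands of one-off strings from each becoming their own vocabulary entry.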
We use NLTK's WordNet to look up the meanings of words, synonyms, antonyms, and more. In addition, we use WordNetLemmatizer to get the root word.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_lemma(word):
    # wn.morphy returns None when WordNet has no base form for the word
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
    # Alternative lemmatizer; treats the word as a noun by default
    return WordNetLemmatizer().lemmatize(word)
Filter out stop words:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
Now we can define a function to prepare the text for topic modeling:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Keep only tokens longer than four characters
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
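The two filtering steps can be tried in isolation with plain Python. The sketch below assumes a tiny hand-made stop list (`SAMPLE_STOP_WORDS`) and a hypothetical helper `filter_tokens`; the article itself uses NLTK's full English stop list.

```python
# Tiny illustrative stop list; the article uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {"about", "their", "which", "there"}

def filter_tokens(tokens, stop_words, min_length=5):
    """Keep tokens with at least min_length characters that are not stop words."""
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]

tokens = ["a", "social", "network", "based", "framework", "about", "crowd"]
print(filter_tokens(tokens, SAMPLE_STOP_WORDS))
# ['social', 'network', 'based', 'framework', 'crowd']
```

The length filter alone would keep "about", which is why the stop-word check is still needed after it.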
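We open our data, read it line by line, prepare the text of each line for LDA, and add the result to a list.
Now we can see how our text data is converted: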
import random

text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        # Print roughly 1% of the documents as a spot check
        if random.random() > .99:
            print(tokens)
        text_data.append(tokens)
['sociocrowd', 'social', 'network', 'base', 'framework', 'crowd', 'simulation']
['detection', 'technique', 'clock', 'recovery', 'application']
['voltage', 'syllabic', 'companding', 'domain', 'filter']
['perceptual', 'base', 'coding', 'decision']
['cognitive', 'mobile', 'virtual', 'network', 'operator', 'investment', 'pricing', 'supply', 'uncertainty']
['clustering', 'query', 'search', 'engine']
['psychological', 'engagement', 'enterprise', 'starting', 'london']
['10-bit', '200-ms', 'digitally', 'calibrate', 'pipelined', 'using', 'switching', 'opamps']
['optimal', 'allocation', 'resource', 'distribute', 'information', 'network']
['modeling', 'synaptic', 'plasticity', 'within', 'network', 'highly', 'accelerate', 'i&f', 'neuron']
['tile', 'interleave', 'multi', 'level', 'discrete', 'wavelet', 'transform']
['security', 'cross', 'layer', 'protocol', 'wireless', 'sensor', 'network']
['objectivity', 'industrial', 'exhibit']
['balance', 'packet', 'discard', 'improve', 'performance', 'network']
['bodyqos', 'adaptive', 'radio', 'agnostic', 'sensor', 'network']
['design', 'reliability', 'methodology']
['context', 'aware', 'image', 'semantic', 'extraction', 'social']
['computation', 'unstable', 'limit', 'cycle', 'large', 'scale', 'power', 'system', 'model']
['photon', 'density', 'estimation', 'using', 'multiple', 'importance', 'sampling']
['approach', 'joint', 'blind', 'space', 'equalization', 'estimation']
['unify', 'quadratic', 'programming', 'approach', 'mix', 'placement']
LDA with Gensim
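First, we create a dictionary from the data, then convert the data to a bag-of-words corpus, and save both the dictionary and the corpus for future use.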
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
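To see what a bag-of-words corpus looks like, the following sketch mimics the idea with the standard library only. The helpers `build_token2id` and `to_bow` are hypothetical, and the order in which ids are assigned here (first appearance) is an assumption that may differ from Gensim's `Dictionary`; the shape of the result, a list of `(token_id, count)` pairs, is the part that carries over.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens that have no id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["sensor", "network", "sensor"], ["network", "design"]]
token2id = build_token2id(docs)
print(to_bow(docs[0], token2id))                          # [(0, 2), (1, 1)]
print(to_bow(["design", "unknown", "design"], token2id))  # [(2, 2)]
```

Note that word order is discarded and words missing from the vocabulary are silently dropped, which matters later when we score a new document.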
We ask LDA to find 5 topics in the data:
import gensim

NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.034*"processor" + 0.019*"database" + 0.019*"issue" + 0.019*"overview"')
(1, '0.051*"computer" + 0.028*"design" + 0.028*"graphics" + 0.028*"gallery"')
(2, '0.050*"management" + 0.027*"object" + 0.027*"circuit" + 0.027*"efficient"')
(3, '0.019*"cognitive" + 0.019*"radio" + 0.019*"network" + 0.019*"distribute"')
(4, '0.029*"circuit" + 0.029*"system" + 0.029*"rigorous" + 0.029*"integration"')
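Topic 0 includes words such as "processor", "database", "issue", and "overview", which sounds like a database-related topic. Topic 1 includes words such as "computer", "design", "graphics", and "gallery", which points to a graphic design topic. Topic 2 includes words such as "management", "object", "circuit", and "efficient", which sounds like a corporate management topic. And so on.
With LDA, we can see that different documents have different topics, and the distinctions between them are clear.
Let's try a new document: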
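When reading topics by eye, it can help to turn each printed topic string into word and weight pairs. The sketch below parses the string format shown in the output with a regular expression; `parse_topic` and `TERM_RE` are hypothetical names, and the pattern assumes the straight-quote `weight*"word"` format that Gensim prints. Gensim's `show_topic` method should return similar pairs directly, so treat this as a fallback for output that is already text.

```python
import re

# Matches one term of the form 0.034*"processor"
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.034*"processor" + ...' into [('processor', 0.034), ...]."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_string)]

topic0 = '0.034*"processor" + 0.019*"database" + 0.019*"issue" + 0.019*"overview"'
print(parse_topic(topic0))
# [('processor', 0.034), ('database', 0.019), ('issue', 0.019), ('overview', 0.019)]
```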
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))
[(38, 1), (117, 1)]
[(0, 0.06669136), (1, 0.40170625), (2, 0.06670282), (3, 0.39819494), (4, 0.066704586)]
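The LDA output shows that topic 1 has the highest assigned probability and topic 3 the second highest, both at roughly 0.40.
Now we ask LDA to find 3 topics in the data: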
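Ranking the topic probabilities can be done in a couple of lines. This sketch copies the list printed above into a variable named `doc_topics` (our own name) and sorts it by probability.

```python
# (topic_id, probability) pairs copied from the output above
doc_topics = [(0, 0.06669136), (1, 0.40170625), (2, 0.06670282),
              (3, 0.39819494), (4, 0.066704586)]

# Sort by probability, highest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
best_topic, best_prob = ranked[0]
second_topic, second_prob = ranked[1]
margin = best_prob - second_prob

print(best_topic, second_topic)  # 1 3
print(round(margin, 4))          # 0.0035
```

The margin between the top two topics is tiny here, so this document is better described as split between topics 1 and 3 than as belonging to topic 1.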
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
ldamodel.save('model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.029*"processor" + 0.016*"management" + 0.016*"aid" + 0.016*"algorithm"')
(1, '0.026*"radio" + 0.026*"network" + 0.026*"cognitive" + 0.026*"efficient"')
(2, '0.029*"circuit" + 0.029*"distribute" + 0.016*"database" + 0.016*"management"')
We can also find 10 topics:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.055*"database" + 0.055*"system" + 0.029*"technical" + 0.029*"recursive"')
(1, '0.038*"distribute" + 0.038*"graphics" + 0.038*"regenerate" + 0.038*"exact"')
(2, '0.055*"management" + 0.029*"multiversion" + 0.029*"reference" + 0.029*"document"')
(3, '0.046*"circuit" + 0.046*"object" + 0.046*"generation" + 0.046*"transformation"')
(4, '0.008*"programming" + 0.008*"circuit" + 0.008*"network" + 0.008*"surface"')
(5, '0.061*"radio" + 0.061*"cognitive" + 0.061*"network" + 0.061*"connectivity"')
(6, '0.085*"programming" + 0.008*"circuit" + 0.008*"subdivision" + 0.008*"management"')
(7, '0.041*"circuit" + 0.041*"design" + 0.041*"processor" + 0.041*"instruction"')
(8, '0.055*"computer" + 0.029*"efficient" + 0.029*"channel" + 0.029*"cooperation"')
(9, '0.061*"stimulation" + 0.061*"sensor" + 0.061*"retinal" + 0.061*"pixel"')
pyLDAvis
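pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
Visualizing 5 topics: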
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
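Saliency: a measure of how much the term tells you about the topic.
Relevance: a weighted average of the log probability of the word given the topic and the log of that probability normalized by the word's overall probability in the corpus.
The size of a bubble measures the importance of the topic relative to the data.
First, we get the most salient terms, meaning the terms that tell us the most about what is going on across the topics. We can also look at individual topics.
Visualizing 3 topics: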
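The relevance score is easier to follow as a formula. As we understand the definition used by LDAvis, it blends the log probability of a word within a topic with the log of its lift (the same probability divided by the word's overall probability), weighted by a parameter lambda between 0 and 1. The sketch below is our own illustration of that formula, not pyLDAvis code; the function name, the default of 0.6, and the example probabilities are assumptions.

```python
import math

def relevance(p_word_given_topic, p_word, lambda_=0.6):
    """lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lambda_ * log_prob + (1 - lambda_) * log_lift

# Hypothetical terms: one frequent everywhere, one rare but specific to the topic
common = {"p_word_given_topic": 0.05, "p_word": 0.05}
specific = {"p_word_given_topic": 0.02, "p_word": 0.002}

for lam in (1.0, 0.0):
    print(lam, relevance(lambda_=lam, **common) > relevance(lambda_=lam, **specific))
# 1.0 True   (pure probability favors the common term)
# 0.0 False  (pure lift favors the topic-specific term)
```

Moving the lambda slider in the visualization therefore trades frequent terms against terms that are distinctive for the selected topic.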
lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)
Visualizing 10 topics:
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10,corpus,dictionary,sort_topics = False)
pyLDAvis.display(lda_display10)
With 5 or 10 topics, we can see that some topics are clustered together, which indicates similarity between those topics.
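A rough way to check this kind of similarity without the plot is to compare the top words of two topics. The sketch below computes the Jaccard overlap between the top-four word lists printed for the 10-topic model. This is only a crude proxy that we chose for illustration: pyLDAvis computes its own inter-topic distances from the full topic-word distributions, and the helper name `jaccard` is ours.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words of topics 4, 6, and 9 from the 10-topic output above
topic4 = ["programming", "circuit", "network", "surface"]
topic6 = ["programming", "circuit", "subdivision", "management"]
topic9 = ["stimulation", "sensor", "retinal", "pixel"]

print(round(jaccard(topic4, topic6), 3))  # 0.333
print(jaccard(topic4, topic9))            # 0.0
```

Topics 4 and 6 share half of their top words, which suggests that 10 topics may be more than this small corpus supports.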