使用NLTK和Gensim在Python中进行主题建模

使用NLTK和Gensim在Python中进行主题建模

在这篇文章中,我们将学习如何识别文档中讨论的主题,称为主题建模。

话题建模是一种无监督的机器学习方法,可帮助我们发现论文中隐藏的语义结构,使我们能够在语料库中学习论文的主题表示。该模型可以应用于文档上的任何类型的标签,例如网站上帖子上的标签。

文本清理

我们使用以下函数来清理我们的文本并返回一个令牌列表:

import spacy

spacy.load('en')

from spacy.lang.en import English

parser = English()

def tokenize(text):

lda_tokens = []

tokens = parser(text)

for token in tokens:

if token.orth_.isspace():

continue

elif token.like_url:

lda_tokens.append('URL')

elif token.orth_.startswith('@'):

lda_tokens.append('SCREEN_NAME')

else:

lda_tokens.append(token.lower_)

return lda_tokens

我们使用NLTK的Wordnet来查找单词,同义词,反义词等的含义。另外,我们使用WordNetLemmatizer来获取根词。

import nltk

nltk.download('wordnet')

from nltk.corpus import wordnet as wn

def get_lemma(word):

lemma = wn.morphy(word)

if lemma is None:

return word

else:

return lemma

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):

return WordNetLemmatizer().lemmatize(word)

过滤停止词:

nltk.download('stopwords')

en_stop = set(nltk.corpus.stopwords.words('english'))

现在我们可以定义一个函数来为主题建模准备文本:

def prepare_text_for_lda(text):

tokens = tokenize(text)

tokens = [token for token in tokens if len(token) > 4]

tokens = [token for token in tokens if token not in en_stop]

tokens = [get_lemma(token) for token in tokens]

return tokens

打开我们的数据,逐行读取每行,为LDA准备文本,然后添加到列表中。

现在我们可以看到我们的文本数据如何转换:

import random

text_data = []

with open('dataset.csv') as f:

for line in f:

tokens = prepare_text_for_lda(line)

if random.random() > .99:

print(tokens)

text_data.append(tokens)

[‘sociocrowd’, ‘social’, ‘network’, ‘base’, ‘framework’, ‘crowd’, ‘simulation’]

[‘detection’, ‘technique’, ‘clock’, ‘recovery’, ‘application’]

[‘voltage’, ‘syllabic’, ‘companding’, ‘domain’, ‘filter’]

[‘perceptual’, ‘base’, ‘coding’, ‘decision’]

[‘cognitive’, ‘mobile’, ‘virtual’, ‘network’, ‘operator’, ‘investment’, ‘pricing’, ‘supply’, ‘uncertainty’]

[‘clustering’, ‘query’, ‘search’, ‘engine’]

[‘psychological’, ‘engagement’, ‘enterprise’, ‘starting’, ‘london’]

[‘10-bit’, ‘200-ms’, ‘digitally’, ‘calibrate’, ‘pipelined’, ‘using’, ‘switching’, ‘opamps’]

[‘optimal’, ‘allocation’, ‘resource’, ‘distribute’, ‘information’, ‘network’]

[‘modeling’, ‘synaptic’, ‘plasticity’, ‘within’, ‘network’, ‘highly’, ‘accelerate’, ‘i&f’, ‘neuron’]

[‘tile’, ‘interleave’, ‘multi’, ‘level’, ‘discrete’, ‘wavelet’, ‘transform’]

[‘security’, ‘cross’, ‘layer’, ‘protocol’, ‘wireless’, ‘sensor’, ‘network’]

[‘objectivity’, ‘industrial’, ‘exhibit’]

[‘balance’, ‘packet’, ‘discard’, ‘improve’, ‘performance’, ‘network’]

[‘bodyqos’, ‘adaptive’, ‘radio’, ‘agnostic’, ‘sensor’, ‘network’]

[‘design’, ‘reliability’, ‘methodology’]

[‘context’, ‘aware’, ‘image’, ‘semantic’, ‘extraction’, ‘social’]

[‘computation’, ‘unstable’, ‘limit’, ‘cycle’, ‘large’, ‘scale’, ‘power’, ‘system’, ‘model’]

[‘photon’, ‘density’, ‘estimation’, ‘using’, ‘multiple’, ‘importance’, ‘sampling’]

[‘approach’, ‘joint’, ‘blind’, ‘space’, ‘equalization’, ‘estimation’]

[‘unify’, ‘quadratic’, ‘programming’, ‘approach’, ‘mix’, ‘placement’]

LDA with Gensim

首先,我们从数据中创建一个字典,然后转换为bag-of-words语料库并保存字典和语料库以供将来使用。

from gensim import corpora

dictionary = corpora.Dictionary(text_data)corpus = [dictionary.doc2bow(text) for text in text_data]

import pickle

pickle.dump(corpus, open('corpus.pkl', 'wb'))

dictionary.save('dictionary.gensim')

我们要求LDA在数据中找到5个主题:

import gensim

NUM_TOPICS = 5

ldamodel = gensim.models.ldamodel.LdaModel(corpus,num_topics = NUM_TOPICS,id2word = dictionary,passes = 15)

ldamodel.save('model5.gensim')

topics = ldamodel.print_topics(num_words = 4)

for topic in topic:

print(topic)

(0, ‘0.034*”processor” + 0.019*”database” + 0.019*”issue” + 0.019*”overview”’)

(1, ‘0.051*”computer” + 0.028*”design” + 0.028*”graphics” + 0.028*”gallery”’)

(2, ‘0.050*”management” + 0.027*”object” + 0.027*”circuit” + 0.027*”efficient”’)

(3, ‘0.019*”cognitive” + 0.019*”radio” + 0.019*”network” + 0.019*”distribute”’)

(4, ‘0.029*”circuit” + 0.029*”system” + 0.029*”rigorous” + 0.029*”integration”’)

主题0包括诸如“处理器”,“数据库”,“问题”和“概述”之类的词,听起来像是与数据库相关的主题。主题1包括“计算机”,“设计”,“图形”和“画廊”等词语,它确定为平面设计相关主题。主题2包括“管理”,“对象”,“电路”和“高效”等词语,听起来像公司管理相关的话题。等等。

通过LDA,我们可以看到具有不同主题的不同文档,并且歧视是显而易见的。

让我们尝试一个新的文件:

new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'

new_doc = prepare_text_for_lda(new_doc)

new_doc_bow = dictionary.doc2bow(new_doc)

print(new_doc_bow)

print(ldamodel.get_document_topics(new_doc_bow))

[(38, 1), (117, 1)]

[(0, 0.06669136), (1, 0.40170625), (2, 0.06670282), (3, 0.39819494), (4, 0.066704586)]

LDA输出表明,主题1分配的概率最高,主题3分配的概率第二高。

现在我们要求LDA在数据中找到3个主题:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)

ldamodel.save('model3.gensim')

topics = ldamodel.print_topics(num_words=4)

for topic in topics:

print(topic)

(0, ‘0.029*”processor” + 0.016*”management” + 0.016*”aid” + 0.016*”algorithm”’)

(1, ‘0.026*”radio” + 0.026*”network” + 0.026*”cognitive” + 0.026*”efficient”’)

(2, ‘0.029*”circuit” + 0.029*”distribute” + 0.016*”database” + 0.016*”management”’)

我们还可以找到10个主题:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)

ldamodel.save('model10.gensim')

topics = ldamodel.print_topics(num_words=4)

for topic in topics:

print(topic)

(0, ‘0.055*”database” + 0.055*”system” + 0.029*”technical” + 0.029*”recursive”’)

(1, ‘0.038*”distribute” + 0.038*”graphics” + 0.038*”regenerate” + 0.038*”exact”’)

(2, ‘0.055*”management” + 0.029*”multiversion” + 0.029*”reference” + 0.029*”document”’)

(3, ‘0.046*”circuit” + 0.046*”object” + 0.046*”generation” + 0.046*”transformation”’)

(4, ‘0.008*”programming” + 0.008*”circuit” + 0.008*”network” + 0.008*”surface”’)

(5, ‘0.061*”radio” + 0.061*”cognitive” + 0.061*”network” + 0.061*”connectivity”’)

(6, ‘0.085*”programming” + 0.008*”circuit” + 0.008*”subdivision” + 0.008*”management”’)

(7, ‘0.041*”circuit” + 0.041*”design” + 0.041*”processor” + 0.041*”instruction”’)

(8, ‘0.055*”computer” + 0.029*”efficient” + 0.029*”channel” + 0.029*”cooperation”’)

(9, ‘0.061*”stimulation” + 0.061*”sensor” + 0.061*”retinal” + 0.061*”pixel”’)

pyLDAvis

pyLDAvis旨在帮助用户解释符合文本数据语料库的主题模型中的主题。该软件包从拟合的LDA主题模型中提取信息,以通知基于交互式的基于Web的可视化。

可视化5个主题:

dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')

corpus = pickle.load(open('corpus.pkl', 'rb'))

lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

import pyLDAvis.gensim

lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)

pyLDAvis.display(lda_display)

使用NLTK和Gensim在Python中进行主题建模

显着性:衡量该术语告诉你这个主题的程度。

相关性:给出该主题的单词的概率的加权平均值,以及由该主题的概率标准化的主题的单词的加权平均值。

泡沫的大小衡量了相对于数据的主题的重要性。

首先,我们得到了最显着的术语,意思是术语大多告诉我们与主题相关的事情。我们也可以看个人主题。

可视化3个主题:

lda3 = gensim.models.ldamodel.LdaModel.load('model3.gensim')

lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)

pyLDAvis.display(lda_display3)

使用NLTK和Gensim在Python中进行主题建模

可视化10个主题:

lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')

lda_display10 = pyLDAvis.gensim.prepare(lda10,corpus,dictionary,sort_topics = False)

pyLDAvis.display(lda_display10)

使用NLTK和Gensim在Python中进行主题建模

当我们有5或10个主题时,我们可以看到某些主题聚集在一起,这表明主题之间的相似性。

相关推荐