Multinomial Naive Bayes Classifier for Text Analysis (Python)
One of the most popular applications of machine learning is the analysis of categorical data, especially text data. In this article, I will walk through how to implement a multinomial naive Bayes classifier for the 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/) dataset. The 20 Newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on messages posted before and after a specific date.
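In the matlab version of the dataset used below, documents are stored as bag-of-words triples: each line of train.data (and test.data) has the form docIdx wordIdx count, train.label holds one class ID (1 through 20) per document, and vocabulary.txt maps word indices to words. For instance, a line like the following (an illustrative example, not taken from the files) says that document 1 contains word 23 twice:
1 23 2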
Libraries
Let's import the libraries needed for the implementation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import operator
Class Distribution
We compute the fraction of documents in each class:
#Training labels
train_label = open('20news-bydate/matlab/train.label')
#pi is the fraction of documents in each class
pi = {}
#Initialize a counter for each of the 20 classes
for i in range(1, 21):
    pi[i] = 0
#Extract values from training labels
lines = train_label.readlines()
#Get total number of documents
total = len(lines)
#Count the occurrence of each class
for line in lines:
    val = int(line.split()[0])
    pi[val] += 1
#Divide the count of each class by the total number of documents
for key in pi:
    pi[key] /= total
print("Probability of each class:")
print("\n".join("{}: {}".format(k, v) for k, v in pi.items()))
Word Probability Distribution
Let's start by creating a pandas DataFrame:
#Training data
train_data = open('20news-bydate/matlab/train.data')
df = pd.read_csv(train_data, delimiter=' ', names=['docIdx', 'wordIdx', 'count'])
#Training labels
label = []
train_label = open('20news-bydate/matlab/train.label')
lines = train_label.readlines()
for line in lines:
    label.append(int(line.split()[0]))
#Expand the labels so there is one label per (docIdx, wordIdx) row
docIdx = df['docIdx'].values
i = 0
new_label = []
for index in range(len(docIdx) - 1):
    new_label.append(label[i])
    if docIdx[index] != docIdx[index + 1]:
        i += 1
new_label.append(label[i])  #the loop above skips the last row
#Add the label column
df['classIdx'] = new_label
df.head()
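Since docIdx is a 1-based index into the label list, the same column can also be built with a single vectorized map; this is a sketch equivalent to the loop above, not the original code:
#Map each docIdx directly to its label (label[d - 1] because docIdx starts at 1)
df['classIdx'] = df['docIdx'].apply(lambda d: label[d - 1])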
Probability of Each Word Given Each Class
To compute our probabilities, we find the mean of each word for a given class.
For class $j$ and word $i$, the mean is given by:
$$P(i \mid j) = \frac{\text{count}(i, j)}{\text{count}(j)}$$
where $\text{count}(i, j)$ is the number of occurrences of word $i$ in documents of class $j$, and $\text{count}(j)$ is the total word count of class $j$. However, since some words will have a count of 0, we perform Laplace smoothing with a small $\alpha$:
$$P(i \mid j) = \frac{\text{count}(i, j) + \alpha}{\text{count}(j) + |V| + 1}$$
where $V$ is the array of all the words in the vocabulary (the code below hard-codes $|V| + 1$ as 16689).
#Alpha value for smoothing
a = 0.001
#Calculate the probability of each word given each class
pb_ij = df.groupby(['classIdx', 'wordIdx'])
pb_j = df.groupby(['classIdx'])
#16689 = |V| + 1; .div broadcasts the class totals over the classIdx level
Pr = (pb_ij['count'].sum() + a).div(pb_j['count'].sum() + 16689, level='classIdx')
#Unstack series
Pr = Pr.unstack()
#Replace NaN (words with a 0 count in a class) with a/(count + |V| + 1)
for c in range(1, 21):
    Pr.loc[c, :] = Pr.loc[c, :].fillna(a / (pb_j['count'].sum()[c] + 16689))
#Convert to dictionary for greater speed
Pr_dict = Pr.to_dict()
Pr
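Note that Pr has classIdx as its row index and wordIdx as its columns, so DataFrame.to_dict() nests the keys word-first. Keeping this orientation straight matters for the lookups that follow:
#Pr_dict is keyed as {wordIdx: {classIdx: P(word | class)}}
#e.g. the smoothed probability of word 1 given class 1:
p = Pr_dict[1][1]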
Stop Words
Stop words are common words that show up in nearly every document, so they carry little information about a document's class.
#Common stop words from online
stop_words = [
"a", "about", "above", "across", "after", "afterwards",
"again", "all", "almost", "alone", "along", "already", "also",
"although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "as", "at", "be", "became", "because", "become","becomes", "becoming", "been", "before", "behind", "being", "beside", "besides", "between", "beyond", "both", "but", "by","can", "cannot", "cant", "could", "couldnt", "de", "describe", "do", "done", "each", "eg", "either", "else", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "find","for","found", "four", "from", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "i", "ie", "if", "in", "indeed", "is", "it", "its", "itself", "keep", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mine", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next","no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part","perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "she", "should","since", "sincere","so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "take","than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they",
"this", "those", "though", "through", "throughout",
"thru", "thus", "to", "together", "too", "toward", "towards",
"under", "until", "up", "upon", "us",
"very", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while",
"who", "whoever", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
]
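If you would rather not maintain a hand-copied list, scikit-learn ships a comparable English stop-word set; this is an optional substitution, not what the original uses:
#Optional: use scikit-learn's built-in English stop-word list instead
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS)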
Now, let's create the vocabulary DataFrame:
vocab = open('vocabulary.txt')
vocab_df = pd.read_csv(vocab, names=['word'])
vocab_df = vocab_df.reset_index()
vocab_df['index'] = vocab_df['index'].apply(lambda x: x + 1)
vocab_df.head()
Get the index of every word in the vocabulary, and set the stop words to the baseline (smoothed-zero) probability:
#Index of all words
tot_list = set(vocab_df['index'])
#Index of good words
for word in stop_words:
    vocab_df = vocab_df[vocab_df['word'] != word]
good_list = vocab_df['index'].tolist()
good_list = set(good_list)
#Index of stop words
bad_list = tot_list - good_list
#Set all stop words to the baseline probability
#(Pr_dict is keyed word-first, so the word index comes before the class index)
for bad in bad_list:
    for j in range(1, 21):
        Pr_dict[bad][j] = 1 / (pb_j['count'].sum()[j] + 16689)
Multinomial Naive Bayes Classifier
We combine the word probability distribution $P$ with the prior $\pi$, the fraction of documents belonging to each class.
For class $j$ and word $i$ with term frequency $f_i$, the score of a document $d$ is:
$$P(j \mid d) \propto \pi_j \prod_{i \in d} P(i \mid j)^{f_i}$$
To avoid underflow, we use a sum of logs instead:
$$\log P(j \mid d) \propto \log \pi_j + \sum_{i \in d} f_i \log P(i \mid j)$$
One issue is that, every time a word appears, the probability of it appearing again goes up. To dampen this, we take the log of the frequency:
$$\log P(j \mid d) \propto \log \pi_j + \sum_{i \in d} \log(1 + f_i) \log P(i \mid j)$$
Also, to account for stop words, we add an inverse document frequency (IDF) weight for each word:
$$\text{IDF}_i = \log \frac{N}{|\{d : i \in d\}|}$$
where $N$ is the total number of documents. Even though the stop words for this particular use case have already been set to the baseline, the IDF implementation is included to generalize the function.
#Calculate IDF
tot = len(df['docIdx'].unique())
pb_ij = df.groupby(['wordIdx'])
IDF = np.log(tot / pb_ij['docIdx'].count())
IDF_dict = IDF.to_dict()
def MNB(df, smooth = False, IDF = False):
    '''
    Multinomial Naive Bayes classifier
    :param df [Pandas Dataframe]: Dataframe of data
    :param smooth [bool]: Apply smoothing if True
    :param IDF [bool]: Apply Inverse Document Frequency if True
    :return prediction [list]: Predicted class ID for each document
    '''
    #Using dictionaries for greater speed
    df_dict = df.to_dict()
    new_dict = {}
    prediction = []
    #new_dict = {docIdx : {wordIdx: count}, ...}
    for idx in range(len(df_dict['docIdx'])):
        docIdx = df_dict['docIdx'][idx]
        wordIdx = df_dict['wordIdx'][idx]
        count = df_dict['count'][idx]
        try:
            new_dict[docIdx][wordIdx] = count
        except KeyError:
            new_dict[docIdx] = {}
            new_dict[docIdx][wordIdx] = count
    #Calculate the scores for each doc
    for docIdx in range(1, len(new_dict) + 1):
        score_dict = {}
        #Create a log-score entry for each class
        for classIdx in range(1, 21):
            score_dict[classIdx] = 0
            #For each word in the document:
            for wordIdx in new_dict[docIdx]:
                #Frequency smoothing: log(1+f)*log(Pr(i|j))
                if smooth:
                    try:
                        probability = Pr_dict[wordIdx][classIdx]
                        power = np.log(1 + new_dict[docIdx][wordIdx])
                        #Check for IDF
                        if IDF:
                            score_dict[classIdx] += power * np.log(
                                probability * IDF_dict[wordIdx])
                        else:
                            score_dict[classIdx] += power * np.log(
                                probability)
                    except KeyError:
                        #Words missing from V contribute log(1+0)*log(a/16689) = 0
                        score_dict[classIdx] += 0
                #No frequency smoothing: f*log(Pr(i|j))
                else:
                    try:
                        probability = Pr_dict[wordIdx][classIdx]
                        power = new_dict[docIdx][wordIdx]
                        #Check for IDF
                        if IDF:
                            score_dict[classIdx] += power * np.log(
                                probability * IDF_dict[wordIdx])
                        else:
                            score_dict[classIdx] += power * np.log(
                                probability)
                    except KeyError:
                        #Words missing from V contribute 0*log(a/16689) = 0
                        score_dict[classIdx] += 0
            #Add the log prior pi
            score_dict[classIdx] += np.log(pi[classIdx])
        #Get the class with the max score for the given docIdx
        max_score = max(score_dict, key=score_dict.get)
        prediction.append(max_score)
    return prediction
Let's compare the effect of smoothing and IDF:
regular_predict = MNB(df, smooth=False, IDF=False)
smooth_predict = MNB(df, smooth=True, IDF=False)
tfidf_predict = MNB(df, smooth=False, IDF=True)
all_predict = MNB(df, smooth=True, IDF=True)
#Get list of labels
train_label = pd.read_csv('20news-bydate/matlab/train.label',
                          names=['t'])
train_label = train_label['t'].tolist()
total = len(train_label)
models = [regular_predict, smooth_predict,
          tfidf_predict, all_predict]
strings = ['Regular', 'Smooth', 'IDF', 'Both']
for m, s in zip(models, strings):
    val = 0
    for i, j in zip(m, train_label):
        if i != j:
            val += 1
    print(s, "Error: ", val / total * 100, "%")
As we can see, IDF has virtually no impact once the stop words have been removed. Smoothing, however, makes the model noticeably more accurate.
Hence, our optimal model is the multinomial naive Bayes with frequency smoothing and without IDF.
Test Data
Now that we have our model, let's use it to predict the test data.
#Get test data
test_data = open('20news-bydate/matlab/test.data')
df = pd.read_csv(test_data, delimiter=' ', names=['docIdx', 'wordIdx', 'count'])
#Get list of labels
test_label = pd.read_csv('20news-bydate/matlab/test.label', names=['t'])
test_label = test_label['t'].tolist()
#MNB calculation
predict = MNB(df, smooth=True, IDF=False)
total = len(test_label)
val = 0
for i, j in zip(predict, test_label):
    if i == j:
        val += 1
print("Error: ", (1 - (val / total)) * 100, "%")