Multiclass Text Classification with Probability Prediction Using LinearSVC in scikit-learn

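Multiclass text classification is one of the most common applications of NLP and machine learning. There are several ways to approach the problem, and the performance of any machine learning algorithm depends on the quality of the data. LinearSVC is one of the algorithms that performs well across a range of NLP-based text classification tasks. However, if you need a probability distribution over all classes, LinearSVC in scikit-learn does not provide a method such as predict_proba.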

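LinearSVC does provide a decision_function method, which predicts confidence scores for samples. The confidence score of a sample is proportional to the signed distance of that sample to the hyperplane.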
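To make the difference between confidence scores and probabilities concrete, here is a minimal sketch on made-up toy data (the points and labels below are assumptions for illustration, not part of the complaints dataset). It shows that decision_function returns one unnormalized score per class and that predict simply picks the class with the largest score.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data (hypothetical): three well-separated clusters in 2-D, one per class.
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [0.0, 5.0], [0.1, 5.2]])
y = np.array([0, 0, 1, 1, 2, 2])

svc = LinearSVC(random_state=0).fit(X, y)

# One score per class for each sample (one-vs-rest); these are not probabilities.
scores = svc.decision_function(X[:2])
print(scores.shape)  # (2, 3)
print(scores)

# predict() returns the class whose score is largest.
print(svc.predict(X[:2]), svc.classes_[scores.argmax(axis=1)])

# LinearSVC has no predict_proba method.
print(hasattr(svc, "predict_proba"))  # False
```

The scores can be negative and the rows do not sum to 1, which is why a calibration step is needed before they can be read as probabilities.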

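In this post, I will demonstrate how to use scikit-learn's CalibratedClassifierCV class so that the prediction output contains a probability distribution over all classes. The Jupyter notebook used here is available on GitHub. I will use the consumer financial complaints dataset from data.gov.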

Problem statement

Output a probability distribution over all classes when predicting with the LinearSVC classifier in scikit-learn.

Exploring the dataset

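The first step is to explore the dataset (https://catalog.data.gov/dataset/consumer-complaint-database). We will look at the number of classes available in the dataset and the total number of rows. We will use Pandas, a popular Python library, to load the data and get an overview of what it looks like.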

import pandas as pd

consumer_complaints_df = pd.read_csv("Consumer_Complaints.csv")

consumer_complaints_df.head()


Data overview

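We will use only two columns for this task. The "Product" column serves as the class label and the "Consumer complaint narrative" column as the feature. We will train the classifier on the feature to predict the class. For prediction, the input is therefore the consumer complaint narrative and the output is a probability distribution over products. The probability of the most likely product indicates the confidence of the prediction.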

Print the list of unique classes in the dataset:

consumer_complaints_df['Product'].unique()

array(['Mortgage', 'Credit reporting', 'Consumer Loan', 'Credit card',

'Debt collection', 'Student loan', 'Bank account or service',

'Other financial service', 'Prepaid card', 'Money transfers',

'Checking or savings account',

'Credit reporting, credit repair services, or other personal consumer reports',

'Payday loan', 'Money transfer, virtual currency, or money service',

'Credit card or prepaid card', 'Vehicle loan or lease',

'Payday loan, title loan, or personal loan', 'Virtual currency'], dtype=object)

The next step is to drop every row whose complaint narrative is null:

consumer_complaints_filtered_df = consumer_complaints_df[pd.notnull(consumer_complaints_df['Consumer complaint narrative'])]

consumer_complaints_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 299516 entries, 1 to 1072068

Data columns (total 18 columns):

Date received 299516 non-null object

Product 299516 non-null object

Sub-product 247333 non-null object

Issue 299516 non-null object

Sub-issue 197380 non-null object

Consumer complaint narrative 299516 non-null object

Company public response 145114 non-null object

Company 299516 non-null object

State 298406 non-null object

ZIP code 296979 non-null object

Tags 51351 non-null object

Consumer consent provided? 299516 non-null object

Submitted via 299516 non-null object

Date sent to company 299516 non-null object

Company response to consumer 299514 non-null object

Timely response? 299516 non-null object

Consumer disputed? 164125 non-null object

Complaint ID 299516 non-null int64

dtypes: int64(1), object(17)

memory usage: 43.4+ MB

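A very important step in any ML task is to visualize the class distribution. The class distribution has a large effect on the performance of the trained model. Classes with noticeably fewer records may not reach a good mean accuracy.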

import matplotlib.pyplot as plt

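df = consumer_complaints_filtered_df[['Product','Consumer complaint narrative']]

# Pass figsize to plot.bar directly: pandas creates its own figure, so a separate plt.figure() call would stay empty
df.groupby('Product').count().plot.bar(ylim=0, figsize=(10,6))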

plt.show()
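Alongside the bar chart, the imbalance can also be read off as numbers. The following is a small sketch on a made-up Series standing in for the 'Product' column (the values are assumptions for illustration only); value_counts gives absolute counts and, with normalize=True, the share of rows per class.

```python
import pandas as pd

# Hypothetical labels standing in for the 'Product' column.
products = pd.Series(["Mortgage", "Mortgage", "Mortgage",
                      "Credit card", "Student loan", "Mortgage"])

counts = products.value_counts()                # absolute counts, largest first
shares = products.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```

On the real data the same two calls would be applied to df['Product'].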


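Now we start with preprocessing. Classification for NLP tasks always includes this important step. The mathematics behind a classifier works only on numbers, so we have to convert the input into numeric form. scikit-learn offers several classes for text preprocessing.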

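We will use CountVectorizer and TfidfTransformer. CountVectorizer converts a collection of text documents into a matrix of token counts. TfidfTransformer transforms that count matrix into a normalized term-frequency (tf) or term-frequency times inverse-document-frequency (tf-idf) representation. You can read more about both on the scikit-learn website.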
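As a quick illustration of what these two transformers produce, here is a minimal sketch on three made-up sentences (the sentences and the variable names toy_vect, toy_counts and toy_tfidf are assumptions for this example). It prints the learned vocabulary, the raw count matrix, and the tf-idf weighted matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three tiny made-up documents.
docs = ["credit report error",
        "mortgage payment late",
        "credit card payment"]

# Step 1: tokenize and count. Each column corresponds to one vocabulary term.
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(docs)
print(sorted(toy_vect.vocabulary_))  # terms in column order
print(toy_counts.toarray())

# Step 2: reweight the counts by tf-idf. With default settings each row is L2-normalized.
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
print(toy_tfidf.toarray().round(2))

# Row norms are 1 because of the default L2 normalization.
print(np.linalg.norm(toy_tfidf.toarray(), axis=1))
```

Terms that occur in several documents, such as "credit" and "payment" here, receive a lower weight than terms that occur in only one.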

Similarly, to convert the text labels (classes) into numeric form, we will use LabelEncoder. It encodes labels with values between 0 and n_classes-1.
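A short sketch of the encoding on a handful of made-up labels (the label list is an assumption for illustration): classes are sorted alphabetically, each one gets an integer code, and inverse_transform maps codes back to the original names.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Mortgage", "Credit card", "Mortgage", "Student loan"])

print(encoder.classes_)                   # sorted: Credit card, Mortgage, Student loan
print(codes)                              # [1 0 1 2]
print(encoder.inverse_transform([2, 0]))  # back to the label names
```

The position of a class in classes_ is its integer code, which is what later lets a probability column be matched to a product name.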

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.preprocessing import LabelEncoder

labels = df['Product']

text = df['Consumer complaint narrative']

X_train, X_test, y_train, y_test = train_test_split(text, labels, random_state=0, test_size=0.3)

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)

tf_transformer = TfidfTransformer().fit(X_train_counts)

X_train_transformed = tf_transformer.transform(X_train_counts)

X_test_counts = count_vect.transform(X_test)

X_test_transformed = tf_transformer.transform(X_test_counts)

labels = LabelEncoder()

y_train_labels_fit = labels.fit(y_train)

y_train_lables_trf = labels.transform(y_train)

print(labels.classes_)

['Bank account or service' 'Checking or savings account' 'Consumer Loan'

'Credit card' 'Credit card or prepaid card' 'Credit reporting'

'Credit reporting, credit repair services, or other personal consumer reports'

'Debt collection' 'Money transfer, virtual currency, or money service'

'Money transfers' 'Mortgage' 'Other financial service' 'Payday loan'

'Payday loan, title loan, or personal loan' 'Prepaid card' 'Student loan'

'Vehicle loan or lease' 'Virtual currency']

Finally, we train a LinearSVC classifier and use CalibratedClassifierCV to obtain probabilities for all classes.

import numpy as np

from sklearn.svm import LinearSVC

from sklearn.calibration import CalibratedClassifierCV

linear_svc = LinearSVC()

clf = linear_svc.fit(X_train_transformed,y_train_lables_trf)

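# Pass the fitted estimator positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn releases
calibrated_svc = CalibratedClassifierCV(linear_svc,

cv="prefit")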

calibrated_svc.fit(X_train_transformed,y_train_lables_trf)

predicted = calibrated_svc.predict(X_test_transformed)

to_predict = ["I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements"]

p_count = count_vect.transform(to_predict)

p_tfidf = tf_transformer.transform(p_count)

print('Average accuracy on test set={}'.format(np.mean(predicted == labels.transform(y_test))))

print('Predicted probabilities of demo input string are')

print(calibrated_svc.predict_proba(p_tfidf))

Average accuracy on test set=0.73637527127

Predicted probabilities of demo input string are

[[ 4.66096051e-04 7.61305759e-06 2.42386129e-03 8.39870195e-04

9.63384564e-04 7.67200317e-01 2.07382738e-01 1.73294053e-02

3.91417748e-07 3.76878086e-06 2.40907318e-03 3.80234243e-10

1.16823419e-05 1.43313864e-05 6.93519524e-06 8.95787556e-04

3.78217257e-05 6.92201968e-06]]


Demo prediction
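The probability array is easier to read once each column is paired with its class name. The sketch below copies the class list and the probabilities printed above into plain arrays (so it runs on its own) and lists the three most probable products; in the actual notebook the same pairing would use labels.classes_ and the output of predict_proba.

```python
import numpy as np

# Class names in the order printed by labels.classes_ above.
class_names = np.array([
    "Bank account or service", "Checking or savings account", "Consumer Loan",
    "Credit card", "Credit card or prepaid card", "Credit reporting",
    "Credit reporting, credit repair services, or other personal consumer reports",
    "Debt collection", "Money transfer, virtual currency, or money service",
    "Money transfers", "Mortgage", "Other financial service", "Payday loan",
    "Payday loan, title loan, or personal loan", "Prepaid card", "Student loan",
    "Vehicle loan or lease", "Virtual currency",
])

# Probabilities copied from the predict_proba output above (one row, 18 classes).
proba = np.array([
    4.66096051e-04, 7.61305759e-06, 2.42386129e-03, 8.39870195e-04,
    9.63384564e-04, 7.67200317e-01, 2.07382738e-01, 1.73294053e-02,
    3.91417748e-07, 3.76878086e-06, 2.40907318e-03, 3.80234243e-10,
    1.16823419e-05, 1.43313864e-05, 6.93519524e-06, 8.95787556e-04,
    3.78217257e-05, 6.92201968e-06,
])

# Indices of the three largest probabilities, most probable first.
top = np.argsort(proba)[::-1][:3]
for i in top:
    print(f"{class_names[i]}: {proba[i]:.4f}")
```

For the demo complaint, "Credit reporting" comes first at about 0.77, followed by the newer combined credit reporting category at about 0.21.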

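I believe the average accuracy could be improved through further fine-tuning or preprocessing. The purpose of this post, however, was to demonstrate the use of CalibratedClassifierCV to obtain a probability for each class in the prediction output.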
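One variation worth knowing about: the code above calibrates a prefit model on the same data it was trained on, whereas the scikit-learn documentation advises keeping the data used for fitting and for calibration disjoint. A minimal sketch of the cross-validated alternative follows, using synthetic data from make_classification as a stand-in for the tf-idf features (the dataset and its parameters are assumptions, not the complaints data). With an integer cv, CalibratedClassifierCV fits the classifier and the calibrator on separate folds internally.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the vectorized complaints.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# An unfitted LinearSVC is passed in; with cv=3 the wrapper trains it on
# two folds and fits the calibrator on the held-out fold, three times over.
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba.shape)        # (5, 3)
print(proba.sum(axis=1))  # each row sums to 1
```

This is a sketch of one option rather than a replacement for the approach shown in the post; whether it changes the reported accuracy on the complaints data has not been checked here.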
