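The K-Nearest Neighbors algorithm, K-NN for short, is a classic machine learning workhorse that is often overlooked in the current age of deep learning. In this tutorial we will build a K-NN classifier in Scikit-Learn and run it on the MNIST dataset. From there we will build our own K-NN algorithm, in the hope of achieving better classification accuracy and speed than the Scikit-Learn K-NN.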
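We will start with KNeighborsClassifier() from the Scikit-Learn Python library. It takes many parameters, but in this example we only need to worry about a few. Specifically, we will only pass a value for the n_neighbors parameter (this is the k value). The weights parameter sets the type of voting system the model uses; its default, uniform, means that each of the k neighbors is weighted equally when classifying a point p. Because we want Scikit-Learn to find the best algorithm for classifying the MNIST data by itself, the algorithm parameter also keeps its default value, auto.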
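As a minimal sketch of the call described above, the snippet below fits KNeighborsClassifier with only n_neighbors set and then reads back the defaults for weights and algorithm. It assumes Scikit-Learn's small built-in digits set (load_digits) as a stand-in for MNIST so that it runs in a few seconds; the split ratio and random_state are illustrative choices, not values from this tutorial.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the 8x8 digits set stands in for MNIST to keep the sketch fast.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Only n_neighbors (the k value) is passed; weights and algorithm keep their defaults.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Read the defaults back from the fitted estimator.
params = model.get_params()
print(params["weights"], params["algorithm"])

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
print(accuracy)
```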
import numpy as np
import heapq
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import datasets, model_selection
from sklearn.metrics import classification_report
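# fetch_mldata has been removed from scikit-learn; fetch_openml provides MNIST instead
mnist = datasets.fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML returns the labels as strings, so convert them to floats
data, target = mnist.data, mnist.target.astype(float)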
# make sure everything was correctly imported
data.shape, target.shape
((70000, 784), (70000,))
# make an array of indices the size of MNIST to use for making the data sets.
# This array is in random order, so we can use it to scramble up the MNIST data
indx = np.random.choice(len(target), 70000, replace=False)
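To see why this index array can be used to scramble the data, here is a small sketch with a hypothetical size of 10 instead of 70,000. Sampling without replacement as many items as there are indices yields every index exactly once, in random order.

```python
import numpy as np

n = 10  # hypothetical stand-in for the 70,000 MNIST rows
small_indx = np.random.choice(n, n, replace=False)

# Every index from 0 to n-1 appears exactly once; only the order is random.
is_permutation = sorted(small_indx.tolist()) == list(range(n))
print(small_indx, is_permutation)
```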
# method for building datasets to test with
def mk_dataset(size):
    """makes a dataset of size "size", and returns that dataset's images and targets
    This is used to make the dataset that will be stored by a model and used in
    experimenting with different stored dataset sizes
    """
    # take the first "size" shuffled indices and gather the matching images
    train_img = [data[i] for i in indx[:size]]
    train_img = np.array(train_img)
    # gather the labels for the same indices
    train_target = [target[i] for i in indx[:size]]
    train_target = np.array(train_target)
    return train_img, train_target
fifty_x, fifty_y = mk_dataset(50000)
fifty_x.shape, fifty_y.shape
((50000, 784), (50000,))
twenty_x, twenty_y = mk_dataset(20000)
twenty_x.shape, twenty_y.shape
((20000, 784), (20000,))
# build model testing dataset
test_img = [data[i] for i in indx[60000:70000]]
test_img1 = np.array(test_img)
test_target = [target[i] for i in indx[60000:70000]]
test_target1 = np.array(test_target)
test_img1.shape, test_target1.shape
((10000, 784), (10000,))
def cos_knn(k, test_data, test_target, stored_data, stored_target):
    """k: number of neighbors to use for voting
    test_data: a set of unobserved images to classify
    test_target: the labels for the test_data (for calculating accuracy)
    stored_data: the images already observed and available to the model
    stored_target: labels for stored_data
    """
    # find cosine similarity between every point in test_data and every point in stored_data
    cosim = cosine_similarity(test_data, stored_data)
    # get top k indices of images in stored_data that are most similar to any given test_data point
    top = [(heapq.nlargest((k), range(len(i)), i.take)) for i in cosim]
    # convert indices to numbers using stored target values
    top = [[stored_target[j] for j in i[:k]] for i in top]
    # vote, and return prediction for every image in test_data
    pred = [max(set(i), key=i.count) for i in top]
    pred = np.array(pred)
    # print table giving classifier accuracy using test_target
    print(classification_report(test_target, pred))
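The three steps inside cos_knn (similarity, top-k lookup, vote) can be traced on a tiny example. The data below is hypothetical: three stored 2-D points with labels 0, 0, 1 and a single query point. The calls are the same ones used in the function above.

```python
import heapq
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: three stored points and one query point.
stored = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
stored_labels = np.array([0, 0, 1])
query = np.array([[1.0, 0.2]])

k = 3
# Step 1: one row of similarities per query point, one column per stored point.
sims = cosine_similarity(query, stored)
# Step 2: indices of the k most similar stored points, most similar first.
top_idx = [heapq.nlargest(k, range(len(row)), row.take) for row in sims]
# Step 3: turn indices into labels and take the most common label as the prediction.
votes = [[stored_labels[j] for j in idx] for idx in top_idx]
pred = [max(set(v), key=v.count) for v in votes]
print(top_idx, votes, pred)
```

The query points almost in the same direction as the second stored point, so index 1 comes first, and the two 0 labels outvote the single 1.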
# stored data set size of 50,000
cos_knn(5, test_img1, test_target1, fifty_x, fifty_y)
             precision    recall  f1-score   support
        0.0       0.97      0.99      0.98       992
        1.0       0.98      0.99      0.98      1123
        2.0       0.98      0.98      0.98       984
        3.0       0.98      0.97      0.97      1089
        4.0       0.99      0.97      0.98      1016
        5.0       0.99      0.96      0.97       857
        6.0       0.98      0.99      0.98       979
        7.0       0.97      0.96      0.97      1001
        8.0       0.96      0.96      0.96       993
        9.0       0.95      0.97      0.96       966
avg / total       0.97      0.97      0.97     10000
CPU times: user 5min 17s, sys: 1.21 s, total: 5min 18s
Wall time: 4min 59s
Our homemade K-NN model beats the Scikit-Learn K-NN in both classification speed (by a sizable margin) and accuracy (a 1% improvement on one dataset)!
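One way to check a comparison like this on your own machine is a small timing harness. The sketch below is not the tutorial's benchmark: it assumes the built-in digits set (load_digits) instead of MNIST, and it swaps the heapq lookup for np.argpartition and the vote for np.bincount, so ties may be broken differently. Timings depend on hardware and dataset size, so no expected numbers are given.

```python
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: small digits set as a quick stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
k = 5

# Scikit-Learn K-NN with default weights and algorithm.
start = time.perf_counter()
sk_pred = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)
sk_time = time.perf_counter() - start

# Cosine-similarity K-NN.
start = time.perf_counter()
sims = cosine_similarity(X_test, X_train)
# argpartition places the k largest similarities of each row in the first k columns.
top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
neighbor_labels = y_train[top_k]
# Majority vote per row; bincount works because the labels are small non-negative ints.
cos_pred = np.array([np.bincount(row).argmax() for row in neighbor_labels])
cos_time = time.perf_counter() - start

sk_acc = np.mean(sk_pred == y_test)
cos_acc = np.mean(cos_pred == y_test)
print(f"sklearn K-NN: {sk_time:.3f}s, accuracy {sk_acc:.3f}")
print(f"cosine K-NN:  {cos_time:.3f}s, accuracy {cos_acc:.3f}")
```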