K均值聚类-机器学习算法简介和Python实现

stevenkwong

2018-06-10

在机器学习中，我们并不总是被提供一个目标来优化，我们也不总是被提供一个目标标签来分类输入数据点。在人工智能领域，没有目标或标签来分类的问题被称为无监督学习问题。在无监督学习问题中，我们试图对数据中潜在的结构化信息进行建模。聚类是一种非监督学习问题，我们试图根据相似数据的底层结构将其分组到群组中。K-means算法是一种常见的聚类算法。K表示我们要将数据点分类成的簇的数量。

k - means伪代码

# # k - means聚类

1、选择集群的数量(K)并获取数据点

2、将质心 c_1, c_2，…c_k随机

3、重复步骤4和5，直到收敛，或者直到固定的迭代次数结束

4、对于每个数据点x_i:

-找到最近的质心(c_1, c_2 ..)c_k)

-将点分配给该群集

5、对于每个群集j = 1..k

-新质心=分配给该群集的所有点的平均值

6、结束

下面的模拟将提供对K-means算法的更好理解

K均值聚类-机器学习算法简介和Python实现

如何选择K？

在某些情况下，我们不知道clusters的数量。那么，我们如何选择K的值呢?有一种方法叫肘部法则(Elbow Method)。在这种方法中，您选择不同数量的聚类，并开始绘制群集距离质心的距离。该图看起来如下

K均值聚类-机器学习算法简介和Python实现

Elbow method

从上面的图我们可以推断，在k=4时，图形达到了一个最优的最小值。即使within-cluster的距离在4之后减少，我们还需要做更多的计算。这与收益递减律类似。因此，我们选择一个4作为最优。

k均值聚类算法的应用

行为细分
异常检测
社会网络分析
市场细分

让我们编写一些代码

我们将使用Iris数据集来构建我们的算法。即使Iris数据集具有标签，我们也会放弃它们并仅使用特征点对数据进行聚类。我们知道有3个聚类（'Iris-virginica'，'Iris-setosa'，'Iris-versicolor'）。因此，在我们的例子中k = 3。

import pandas as pd

import numpy as np

from sklearn.utils import shuffle

## Load Iris dataset

df = pd.read_csv('/Users/rohith/Documents/Datasets/Iris_dataset/iris.csv')

## Store the target vaue

classes = df['Species']

## Drop the Id and Class values from dat

df = df.drop(['Id','Species'],axis=1)

## Convert dataframe into list and then into a numpy array

data = df.values.tolist()

data = np.array(data)

## Shuffle classes and data

data,classes = shuffle(data,classes)

## First 135 points are used for training and the rest is used for testing

train_data = data[:135]

test_data = data[135:]

我们加载数据集并删除目标值。我们将特征点转换为一个numpy数组，并将其分解为训练和测试数据

## K-Means Algorithm

import random

import numpy as np

## Randomly place the centroids of the three clusters

c1 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),

float(np.random.randint(1,7)),float(np.random.randint(0,3))]

c2 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),

float(np.random.randint(1,7)),float(np.random.randint(0,3))]

c3 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),

float(np.random.randint(1,7)),float(np.random.randint(0,3))]

## Intialize the number of iterations you want to run

epochs = 1

while(epochs <= 100):

cluster_1 = []

cluster_2 = []

cluster_3 = []

for point in train_data:

## Find the eucledian distance between all points the centroid

dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +

(c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5

dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +

(c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5

dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +

(c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5

distances = [dis_point_c1,dis_point_c2,dis_point_c3]

## Find the closest centroid to the point and assign the point to that cluster

pos = distances.index(min(distances))

if(pos == 0):

cluster_1.append(point)

elif(pos == 1):

cluster_2.append(point)

else:

cluster_3.append(point)

## Store the centroid values to calculate new centroid values

prev_c1 = c1

prev_c2 = c2

prev_c3 = c3

cluster_1 = np.array(cluster_1)

cluster_2 = np.array(cluster_2)

cluster_3 = np.array(cluster_3)

## Find mean of all points within a cluster and make it as the centroid

if(len(cluster_1) != 0):

c1 = [sum(cluster_1[:,0])/float(len(cluster_1)),

sum(cluster_1[:,1])/float(len(cluster_1)),

sum(cluster_1[:,2])/float(len(cluster_1)),

sum(cluster_1[:,3])/float(len(cluster_1))]

if(len(cluster_2) != 0):

c2 = [sum(cluster_2[:,0])/float(len(cluster_2)),

sum(cluster_2[:,1])/float(len(cluster_2)),

sum(cluster_2[:,2])/float(len(cluster_2)),

sum(cluster_2[:,3])/float(len(cluster_2))]

if(len(cluster_3) != 0):

c3 = [sum(cluster_3[:,0])/float(len(cluster_3)),

sum(cluster_3[:,1])/float(len(cluster_3)),

sum(cluster_3[:,2])/float(len(cluster_3)),

sum(cluster_3[:,3])/float(len(cluster_3))]

## If centroid values hasn't changed, algorithm has convereged

if(prev_c1 == c1 and prev_c2 == c2 and prev_c3 == c3):

print("Converged")

break

print(epochs)

epochs += 1

我们实现上面显示的伪代码，我们可以发现我们的算法在6次迭代后收敛。我们现在可以输入一个测试数据点并找到它最接近的质心并将该点分配给相应的群集

pred = []

for point in test_data:

## Find distance between test data point and centroids

dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +

(c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5

dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +

(c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5

dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +

(c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5

## Find the cluster to which the point is closest to and append

## it to pred

distances = [dis_point_c1,dis_point_c2,dis_point_c3]

pos = distances.index(min(distances))

pred.append(pos)

## Print the predictions

print(pred)

通过提供一个抽象级别的对象，我们可以用它来实现算法，Scikit学习库再次使我们避免了编写如此多的代码行

from sklearn.cluster import KMeans

clf = KMeans(n_clusters = 3)

clf.fit(train_data)

pred = clf.predict(test_data)

结论

K-means是聚类技术的入门算法，它是最简单的。正如你会注意到的那样，没有objective/loss 函数。因此，不需要偏导数并且复杂的数学被消除。K-means是一个易于实现和方便的算法。

机器学习聚类无监督学习 python机器学习 python算法算法 python 质心

安科网

K均值聚类-机器学习算法简介和Python实现

stevenkwong

k - means伪代码

如何选择K？

k均值聚类算法的应用

让我们编写一些代码

结论

stevenkwong

相关推荐

无监督机器学习的重要指南

关于机器学习管道需要了解什么?

Python 用5行代码学机器学习—线性回归

如果不能用Python执行机器学习，那该用什么呢？

这里有你需要的5个机器学习必备技能

在云端你需要的5个机器学习技能

用Python构建和可视化决策树

关于机器学习算法的16个技巧

选择困难终结者：不同问题之下的机器学习算法

TensorFlow为新旧Mac特供新版本，速度最高提升7倍

如何通过7个步骤构建机器学习模型

机器学习新风暴：如何用ML模型预测房价？

为什么所有的机器学习模型有90％从没有投入生产

LinkedIn开源Dagli，发布Java机器学习函数库

全面解读谷歌云人工智能如何为机器学习提供帮助

关于感知器的故事：机器学习是如何发展到如今这一程度的呢？

需要知识的后深度学习时代，如何高效自动构建知识图谱

10个丰富自我的机器学习项目

机器学习如何颠覆金融行业

24个提高知识和技能极限的机器学习项目

stevenkwong