K均值聚类-机器学习算法简介和Python实现
在机器学习中,我们并不总是被提供一个目标来优化,我们也不总是被提供一个目标标签来分类输入数据点。在人工智能领域,没有目标或标签来分类的问题被称为无监督学习问题。在无监督学习问题中,我们试图对数据中潜在的结构化信息进行建模。聚类是一种非监督学习问题,我们试图根据相似数据的底层结构将其分组到群组中。K-means算法是一种常见的聚类算法。K表示我们要将数据点分类成的簇的数量。
k - means伪代码
# # k - means聚类
1、选择集群的数量(K)并获取数据点
2、将质心 c_1, c_2,…c_k随机
3、重复步骤4和5,直到收敛,或者直到固定的迭代次数结束
4、对于每个数据点x_i:
-找到最近的质心(c_1, c_2 ..)c_k)
-将点分配给该群集
5、对于每个群集j = 1..k
-新质心=分配给该群集的所有点的平均值
6、结束
下面的模拟将提供对K-means算法的更好理解
如何选择K?
在某些情况下,我们不知道clusters的数量。那么,我们如何选择K的值呢?有一种方法叫肘部法则(Elbow Method)。在这种方法中,您选择不同数量的聚类,并开始绘制群集距离质心的距离。该图看起来如下
Elbow method
从上面的图我们可以推断,在k=4时,图形达到了一个最优的最小值。即使within-cluster的距离在4之后减少,我们还需要做更多的计算。这与收益递减律类似。因此,我们选择一个4作为最优。
k均值聚类算法的应用
- 行为细分
- 异常检测
- 社会网络分析
- 市场细分
让我们编写一些代码
我们将使用Iris数据集来构建我们的算法。即使Iris数据集具有标签,我们也会放弃它们并仅使用特征点对数据进行聚类。我们知道有3个聚类('Iris-virginica','Iris-setosa','Iris-versicolor')。因此,在我们的例子中k = 3。
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
## Load Iris dataset
df = pd.read_csv('/Users/rohith/Documents/Datasets/Iris_dataset/iris.csv')
## Store the target vaue
classes = df['Species']
## Drop the Id and Class values from dat
df = df.drop(['Id','Species'],axis=1)
## Convert dataframe into list and then into a numpy array
data = df.values.tolist()
data = np.array(data)
## Shuffle classes and data
data,classes = shuffle(data,classes)
## First 135 points are used for training and the rest is used for testing
train_data = data[:135]
test_data = data[135:]
我们加载数据集并删除目标值。我们将特征点转换为一个numpy数组,并将其分解为训练和测试数据
## K-Means Algorithm
import random
import numpy as np
## Randomly place the centroids of the three clusters
c1 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
float(np.random.randint(1,7)),float(np.random.randint(0,3))]
c2 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
float(np.random.randint(1,7)),float(np.random.randint(0,3))]
c3 = [float(np.random.randint(4,8)),float(np.random.randint(1,5)),
float(np.random.randint(1,7)),float(np.random.randint(0,3))]
## Intialize the number of iterations you want to run
epochs = 1
while(epochs <= 100):
cluster_1 = []
cluster_2 = []
cluster_3 = []
for point in train_data:
## Find the eucledian distance between all points the centroid
dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +
(c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5
dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +
(c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5
dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +
(c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5
distances = [dis_point_c1,dis_point_c2,dis_point_c3]
## Find the closest centroid to the point and assign the point to that cluster
pos = distances.index(min(distances))
if(pos == 0):
cluster_1.append(point)
elif(pos == 1):
cluster_2.append(point)
else:
cluster_3.append(point)
## Store the centroid values to calculate new centroid values
prev_c1 = c1
prev_c2 = c2
prev_c3 = c3
cluster_1 = np.array(cluster_1)
cluster_2 = np.array(cluster_2)
cluster_3 = np.array(cluster_3)
## Find mean of all points within a cluster and make it as the centroid
if(len(cluster_1) != 0):
c1 = [sum(cluster_1[:,0])/float(len(cluster_1)),
sum(cluster_1[:,1])/float(len(cluster_1)),
sum(cluster_1[:,2])/float(len(cluster_1)),
sum(cluster_1[:,3])/float(len(cluster_1))]
if(len(cluster_2) != 0):
c2 = [sum(cluster_2[:,0])/float(len(cluster_2)),
sum(cluster_2[:,1])/float(len(cluster_2)),
sum(cluster_2[:,2])/float(len(cluster_2)),
sum(cluster_2[:,3])/float(len(cluster_2))]
if(len(cluster_3) != 0):
c3 = [sum(cluster_3[:,0])/float(len(cluster_3)),
sum(cluster_3[:,1])/float(len(cluster_3)),
sum(cluster_3[:,2])/float(len(cluster_3)),
sum(cluster_3[:,3])/float(len(cluster_3))]
## If centroid values hasn't changed, algorithm has convereged
if(prev_c1 == c1 and prev_c2 == c2 and prev_c3 == c3):
print("Converged")
break
print(epochs)
epochs += 1
我们实现上面显示的伪代码,我们可以发现我们的算法在6次迭代后收敛。我们现在可以输入一个测试数据点并找到它最接近的质心并将该点分配给相应的群集
pred = []
for point in test_data:
## Find distance between test data point and centroids
dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +
(c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5
dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +
(c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5
dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +
(c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5
## Find the cluster to which the point is closest to and append
## it to pred
distances = [dis_point_c1,dis_point_c2,dis_point_c3]
pos = distances.index(min(distances))
pred.append(pos)
## Print the predictions
print(pred)
通过提供一个抽象级别的对象,我们可以用它来实现算法,Scikit学习库再次使我们避免了编写如此多的代码行
from sklearn.cluster import KMeans
clf = KMeans(n_clusters = 3)
clf.fit(train_data)
pred = clf.predict(test_data)
结论
K-means是聚类技术的入门算法,它是最简单的。正如你会注意到的那样,没有objective/loss 函数。因此,不需要偏导数并且复杂的数学被消除。K-means是一个易于实现和方便的算法。