子空间聚类概述及Python实现

这篇文章解决了以下问题:

  1. 处理高维数据的挑战是什么?
  2. 什么是子空间聚类?
  3. 如何在python中实现子空间聚类算法

子空间聚类概述及Python实现

高维数据包括具有几十到几千个特征(或维度)的输入。这是一个典型的上下文问题,例如在生物信息学(各种排序数据)或NLP中,如果词汇量非常大,就会遇到这种情况。高维数据是具有挑战性的,因为:

  • 它使得可视化和理解输入变得困难,通常需要预先应用降维技术。它导致了“维度诅咒”,即随着维数的增加,所有子空间的完整枚举变得难以处理
  • 大多数底层聚类技术依赖于结果和降维技术的选择
  • 许多维度可能是不相关的,并且可以在有噪声的数据中屏蔽现有的聚类
  • 一种常见的技术是执行特征选择(删除不相关的维度),但是在某些情况下,识别冗余维度并不容易

什么是子空间聚类?

子空间聚类是一种在不同子空间(一个或多个维度的选择)中发现聚类的技术。基本的假设是,我们可以找到只由维度子集定义的有效聚类(不需要具有所有N个特征的一致性)。子空间聚类是传统N维聚类分析的扩展,它允许通过创建行和列聚类同时对特征和观测进行分组。

在特征和观察的空间中,产生的聚类可能重叠。一个例子如下图所示。我们可以注意到,两个聚类中的点可能非常接近,这可能会混淆许多分析整个特征空间的传统聚类算法。

子空间聚类概述及Python实现

此外,我们可以看到子空间聚类成功地找到了一个子空间(维a和维c),其中预期的集群很容易识别。

子空间聚类概述及Python实现

子空间聚类的类型

基于搜索策略,我们可以区分两种类型的子空间聚类,如下图所示:自下而上的方法首先在低维(1 D)空间中找到聚类并迭代合并它们以处理更高维空间(直到ND )。自上而下的方法在整个维度集中查找聚类并评估每个聚类的子空间。下图取自同一篇论文,概述了最常见的子空间聚类算法。

在搜索策略上,我们可以区分两种子空间聚类,如下图所示:自下而上的方法首先在低维(1d)空间中寻找聚类,然后迭代合并它们以处理高维空间(直到N D)。下图取自篇论文(https://www.kdd.org/exploration_files/parsons.pdf),概述了最常见的子空间聚类算法。

子空间聚类概述及Python实现

Clique算法

简而言之,该算法的功能如下:对于每个维度(特征),我们用nBins(输入参数)分割空间;对于每个bin,我们计算直方图(计数数量)。我们只考虑dense单元,即计数高于作为第二个输入参数给定的阈值的bins。dense单元的特点如下:

  • 它所属的维度(例如特征1)
  • bin的索引(或位置)(从0到nBins)
  • 在bin里的观察

导入Python库

import os
import sys
import numpy as np
import scipy.sparse.csgraph
from sklearn import preprocessing
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
from functools import reduce
import seaborn as sns
from collections import Counter
import itertools

子空间聚类概述及Python实现

在2D中生成4个聚类

from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
n_components = 4
data, truth = make_blobs(n_samples=100, centers=n_components, random_state=42, n_features=2)
data = preprocessing.MinMaxScaler().fit_transform(data)
plt.scatter(data[:, 0], data[:, 1], s=50, c = truth)
plt.title(f"Example of a mixture of {n_components} distributions")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2");

子空间聚类概述及Python实现

子空间聚类概述及Python实现

class DenseUnit1D:
 """
 This class 
 """
 def __init__(self, dimension, bin, minBin, maxBin, points):
 self.dimension = dimension # dimension index
 self.bin = bin # bin number
 self.minBin = minBin # inferior bin limit
 self.maxBin = maxBin # superior bin limit
 self.points = points # observation indexes in input data
 def distance(self, du):
 # Not in the same dimension, can't be neighbors
 if self.dimension != du.dimension:
 return -1
 return abs(self.bin -du.bin)
 
 def __eq__(self, other):
 """Overrides the default implementation"""
 if isinstance(other, DenseUnit):
 return (Counter(self.dimension) == Counter(other.dimension) and
 Counter(self.points) == Counter(other.points) )
 return False
 
 def __hash__(self):
 return hash(str(self))
 
 def __str__(self):
 return (f'Dimension {self.dimension}, bin {self.bin}, points {len(self.points)},' +
 f'[{round(self.minBin, 2)}, {round(self.maxBin, 2)}]')
def neighbour(denseUnits1, denseUnits2):
 """
 Determines if 2 dense units are neighbouring
 """
 # We allow only 1 bin deviation in one subspace
 distance = 0
 for subspace in range(len(denseUnits1)):
 subspaceDistance = denseUnits1[subspace].distance(denseUnits2[subspace])
 if subspaceDistance == -1:
 return False
 distance += subspaceDistance
 if distance > 1:
 return 0
 return True

子空间聚类概述及Python实现

设置参数

thresholdPoints = 2dense
nbBins = 8

子空间聚类概述及Python实现

创建一dimensionals dense单元

def createDenseUnitsAndGrid(data, thresholdPoints = thresholdPoints, nbBins = nbBins):
 """
 This method will return an array of lists, each list containing 1D dense units
 In 1 D subspace, each list will contain only one element
 """
 denseUnits1D = []
 grid=[] # this is used for rendering purposes
 for curDim in range(data.shape[1]):
 minDim = min(data[:,curDim])
 maxDim = max(data[:,curDim])
 binSize = (maxDim - minDim)/nbBins
 points = data[:, curDim]
 g = [] # grid lines for current dimension
 g.append(minDim)
 for i in range(nbBins):
 endBin = minDim + binSize
 g.append(endBin)
 # Retrieve bin points per dimension
 if i == nbBins -1 : # last bin, make sure all points are included
 binPoints = np.where((points>=minDim) & (points<=maxDim))[0]
 endBin = maxDim
 else:
 binPoints = np.where((points>=minDim) & (points<endBin))[0]
 # Store only dense bins
 if len(binPoints)>thresholdPoints:
 denseUnits1D.append([DenseUnit1D(curDim, i,minDim,endBin,binPoints)])
 minDim = endBin
 grid.append(g)
 return denseUnits1D, grid
denseUnits1D, grid = createDenseUnitsAndGrid(data)

子空间聚类概述及Python实现

在网格上绘制原始数据集

plt.scatter(data[:, 0], data[:, 1])
for g in grid[0]:
 plt.axvline(x=g, c = 'red', linestyle ='--')
 plt.xlabel('Feature 0')
 
for g in grid[1]:
 plt.axhline(y=g, c = 'red', linestyle ='--')
 plt.ylabel('Feature 1')

子空间聚类概述及Python实现

子空间聚类概述及Python实现

clique算法背后的直觉是存在于k维空间中的聚类也可以在k-1中找到。我们从1D开始,每个维度我们都试图找到dense bins。如果2个或更多dense bins相邻,我们将它们合并成一个更大的bin。通过将所有现有dense bins转换为图可以容易地实现该操作,其中如果2个dense单元属于相同维度并且它们的bin索引之间的差异不大于1(例如,特征3和bin 4对应的dense单元与特征3和bin 5对应的dense单元相邻)则绘制边缘。可以通过计算上述图上的连通分量来识别要合并的dense单元。

def denseBinsToClusters(candidates, plot = False, debug = False):
 """
 This method takes as input a collection of subspace candidates.
 A subspace candidate is a list of 1D dense units.
 This method will merge neighbouring units by projecting them onto a 
 graph, where we can easily compute connected components
 """
 graph = np.identity(len(candidates))
 for i in range(len(candidates)):
 for j in range(len(candidates)):
 graph[i, j] = int(neighbour(candidates[i], candidates[j]))
 # Find connected components in order to merge neighbouring bins
 nbConnectedComponents, components = scipy.sparse.csgraph.connected_components(
 graph, directed=False)
 if debug:
 print(graph)
 print(nbConnectedComponents, components)
 candidates = np.array(candidates)
 clusterAssignment = -1 * np.ones(data.shape[0])
 # For every cluster
 for i in range(nbConnectedComponents):
 # Get dense units of the cluster
 cluster_dense_units = candidates[np.where(components == i)[0]]
 if debug:
 for v in cluster_dense_units:
 for z in v:
 print(z)
 clusterDimensions = {}
 for j in range(len(cluster_dense_units)):
 for k in range(len(cluster_dense_units[j])):
 if cluster_dense_units[j][k].dimension not in clusterDimensions:
 clusterDimensions[cluster_dense_units[j][k].dimension] = []
 clusterDimensions[cluster_dense_units[j][k].dimension].extend(cluster_dense_units[j][k].points)
 points =reduce(np.intersect1d, list(clusterDimensions.values()))
 clusterAssignment[points] = i
 if plot:
 pred = -1 * np.ones(data.shape[0])
 pred[points] = i
 plt.figure()
 plt.title(f'In yellow, clusters in {list(clusterDimensions.keys())} dimensions ')
 plt.scatter(data[:, 0], data[:, 1], c = pred)
 for g in grid[0]:
 plt.axvline(x=g, c = 'red', linestyle ='--')
 for g in grid[1]:
 plt.axhline(y=g, c = 'red', linestyle ='--')
 plt.show()
 if debug:
 print(clusterDimensions.keys(), points)
 return clusterAssignment
denseBinsToClusters(denseUnits1D, plot = True, debug = False);

子空间聚类概述及Python实现

子空间聚类概述及Python实现

接下来,我们要计算每个子空间中从2到输入维数的所有有效聚类。该操作归结为计算k维度中的dense单元的组合,并且只保留大于初始最小密度阈值的dense continuous bins的重叠结果。一旦我们计算了k-1维度的dense单元,我们就可以通过计算最后k-1个候选者的所有组合来扩展到k维度。

def getSubspaceCandidates(previousUnits, subspaceDimension = 2):
 import itertools
 candidates = []
 for ix1, ix2 in itertools.combinations(range(len(previousUnits)), 2):
 dims =[]
 candidate = []
 for i in range(len(previousUnits[ix1])):
 dims.append(previousUnits[ix1][i].dimension)
 candidate.append(previousUnits[ix1][i])
 points1= previousUnits[ix1][i].points
 for i in range(len(previousUnits[ix2])):
 dims.append(previousUnits[ix2][i].dimension)
 candidate.append(previousUnits[ix2][i])
 points2= previousUnits[ix2][i].points 
 points = np.intersect1d(points1, points2) # check points in common
 if np.unique(dims).shape[0] == subspaceDimension and points.shape[0]>thresholdPoints:
# print(f'

adding candidate: {len(points)}')
# for v in candidate:
# print(v)
 candidates.append(candidate)
 return candidates
for subspaceDim in range(2, data.shape[1] +1):
 subspaceCandidates = getSubspaceCandidates(denseUnits1D, subspaceDimension = subspaceDim)
 pred = denseBinsToClusters(subspaceCandidates, plot = True, debug = False);

子空间聚类概述及Python实现

子空间聚类概述及Python实现

在2d中,我们能够检索如下图所示的聚类。紫色点被排除在聚类之外,因为它们位于稀疏的网格bins中。

plt.scatter(data[:, 0], data[:, 1], c = pred)
for g in grid[0]:
 plt.axvline(x=g, c = 'red', linestyle ='--')
 plt.xlabel('Feature 0')
 
for g in grid[1]:
 plt.axhline(y=g, c = 'red', linestyle ='--')
 plt.ylabel('Feature 1')

子空间聚类概述及Python实现

子空间聚类概述及Python实现

验证结果

from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(truth, pred)

子空间聚类概述及Python实现

0.9090183974614624

Clique聚类因其对输入参数(bins数和最小密度)的高灵敏度而受到批评,这可能导致非常不同的结果。但是,它是自下而上子空间聚类族中的一种基本算法。有多种方法可以优化clique算法。

相关推荐