子空间聚类概述及Python实现
这篇文章解决了以下问题:
- 处理高维数据的挑战是什么?
- 什么是子空间聚类?
- 如何在python中实现子空间聚类算法
高维数据包括具有几十到几千个特征(或维度)的输入。这是一个典型的上下文问题,例如在生物信息学(各种排序数据)或NLP中,如果词汇量非常大,就会遇到这种情况。高维数据是具有挑战性的,因为:
- 它使得可视化和理解输入变得困难,通常需要预先应用降维技术。它导致了“维度诅咒”,即随着维数的增加,所有子空间的完整枚举变得难以处理
- 大多数底层聚类技术依赖于结果和降维技术的选择
- 许多维度可能是不相关的,并且可以在有噪声的数据中屏蔽现有的聚类
- 一种常见的技术是执行特征选择(删除不相关的维度),但是在某些情况下,识别冗余维度并不容易
什么是子空间聚类?
子空间聚类是一种在不同子空间(一个或多个维度的选择)中发现聚类的技术。基本的假设是,我们可以找到只由维度子集定义的有效聚类(不需要具有所有N个特征的一致性)。子空间聚类是传统N维聚类分析的扩展,它允许通过创建行和列聚类同时对特征和观测进行分组。
在特征和观察的空间中,产生的聚类可能重叠。一个例子如下图所示。我们可以注意到,两个聚类中的点可能非常接近,这可能会混淆许多分析整个特征空间的传统聚类算法。
此外,我们可以看到子空间聚类成功地找到了一个子空间(维a和维c),其中预期的集群很容易识别。
子空间聚类的类型
基于搜索策略,我们可以区分两种类型的子空间聚类,如下图所示:自下而上的方法首先在低维(1 D)空间中找到聚类并迭代合并它们以处理更高维空间(直到ND )。自上而下的方法在整个维度集中查找聚类并评估每个聚类的子空间。下图取自同一篇论文,概述了最常见的子空间聚类算法。
在搜索策略上,我们可以区分两种子空间聚类,如下图所示:自下而上的方法首先在低维(1d)空间中寻找聚类,然后迭代合并它们以处理高维空间(直到N D)。下图取自篇论文(https://www.kdd.org/exploration_files/parsons.pdf),概述了最常见的子空间聚类算法。
Clique算法
简而言之,该算法的功能如下:对于每个维度(特征),我们用nBins(输入参数)分割空间;对于每个bin,我们计算直方图(计数数量)。我们只考虑dense单元,即计数高于作为第二个输入参数给定的阈值的bins。dense单元的特点如下:
- 它所属的维度(例如特征1)
- bin的索引(或位置)(从0到nBins)
- 在bin里的观察
导入Python库
import os import sys import numpy as np import scipy.sparse.csgraph from sklearn import preprocessing from sklearn import metrics import matplotlib.pyplot as plt import pandas as pd from functools import reduce import seaborn as sns from collections import Counter import itertools
在2D中生成4个聚类
from sklearn.datasets.samples_generator import make_blobs from sklearn.preprocessing import StandardScaler n_components = 4 data, truth = make_blobs(n_samples=100, centers=n_components, random_state=42, n_features=2) data = preprocessing.MinMaxScaler().fit_transform(data) plt.scatter(data[:, 0], data[:, 1], s=50, c = truth) plt.title(f"Example of a mixture of {n_components} distributions") plt.xlabel("Feature 1") plt.ylabel("Feature 2");
class DenseUnit1D: """ This class """ def __init__(self, dimension, bin, minBin, maxBin, points): self.dimension = dimension # dimension index self.bin = bin # bin number self.minBin = minBin # inferior bin limit self.maxBin = maxBin # superior bin limit self.points = points # observation indexes in input data def distance(self, du): # Not in the same dimension, can't be neighbors if self.dimension != du.dimension: return -1 return abs(self.bin -du.bin) def __eq__(self, other): """Overrides the default implementation""" if isinstance(other, DenseUnit): return (Counter(self.dimension) == Counter(other.dimension) and Counter(self.points) == Counter(other.points) ) return False def __hash__(self): return hash(str(self)) def __str__(self): return (f'Dimension {self.dimension}, bin {self.bin}, points {len(self.points)},' + f'[{round(self.minBin, 2)}, {round(self.maxBin, 2)}]') def neighbour(denseUnits1, denseUnits2): """ Determines if 2 dense units are neighbouring """ # We allow only 1 bin deviation in one subspace distance = 0 for subspace in range(len(denseUnits1)): subspaceDistance = denseUnits1[subspace].distance(denseUnits2[subspace]) if subspaceDistance == -1: return False distance += subspaceDistance if distance > 1: return 0 return True
设置参数
thresholdPoints = 2dense nbBins = 8
创建一dimensionals dense单元
def createDenseUnitsAndGrid(data, thresholdPoints = thresholdPoints, nbBins = nbBins): """ This method will return an array of lists, each list containing 1D dense units In 1 D subspace, each list will contain only one element """ denseUnits1D = [] grid=[] # this is used for rendering purposes for curDim in range(data.shape[1]): minDim = min(data[:,curDim]) maxDim = max(data[:,curDim]) binSize = (maxDim - minDim)/nbBins points = data[:, curDim] g = [] # grid lines for current dimension g.append(minDim) for i in range(nbBins): endBin = minDim + binSize g.append(endBin) # Retrieve bin points per dimension if i == nbBins -1 : # last bin, make sure all points are included binPoints = np.where((points>=minDim) & (points<=maxDim))[0] endBin = maxDim else: binPoints = np.where((points>=minDim) & (points<endBin))[0] # Store only dense bins if len(binPoints)>thresholdPoints: denseUnits1D.append([DenseUnit1D(curDim, i,minDim,endBin,binPoints)]) minDim = endBin grid.append(g) return denseUnits1D, grid denseUnits1D, grid = createDenseUnitsAndGrid(data)
在网格上绘制原始数据集
plt.scatter(data[:, 0], data[:, 1]) for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') plt.xlabel('Feature 0') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.ylabel('Feature 1')
clique算法背后的直觉是存在于k维空间中的聚类也可以在k-1中找到。我们从1D开始,每个维度我们都试图找到dense bins。如果2个或更多dense bins相邻,我们将它们合并成一个更大的bin。通过将所有现有dense bins转换为图可以容易地实现该操作,其中如果2个dense单元属于相同维度并且它们的bin索引之间的差异不大于1(例如,特征3和bin 4对应的dense单元与特征3和bin 5对应的dense单元相邻)则绘制边缘。可以通过计算上述图上的连通分量来识别要合并的dense单元。
def denseBinsToClusters(candidates, plot = False, debug = False): """ This method takes as input a collection of subspace candidates. A subspace candidate is a list of 1D dense units. This method will merge neighbouring units by projecting them onto a graph, where we can easily compute connected components """ graph = np.identity(len(candidates)) for i in range(len(candidates)): for j in range(len(candidates)): graph[i, j] = int(neighbour(candidates[i], candidates[j])) # Find connected components in order to merge neighbouring bins nbConnectedComponents, components = scipy.sparse.csgraph.connected_components( graph, directed=False) if debug: print(graph) print(nbConnectedComponents, components) candidates = np.array(candidates) clusterAssignment = -1 * np.ones(data.shape[0]) # For every cluster for i in range(nbConnectedComponents): # Get dense units of the cluster cluster_dense_units = candidates[np.where(components == i)[0]] if debug: for v in cluster_dense_units: for z in v: print(z) clusterDimensions = {} for j in range(len(cluster_dense_units)): for k in range(len(cluster_dense_units[j])): if cluster_dense_units[j][k].dimension not in clusterDimensions: clusterDimensions[cluster_dense_units[j][k].dimension] = [] clusterDimensions[cluster_dense_units[j][k].dimension].extend(cluster_dense_units[j][k].points) points =reduce(np.intersect1d, list(clusterDimensions.values())) clusterAssignment[points] = i if plot: pred = -1 * np.ones(data.shape[0]) pred[points] = i plt.figure() plt.title(f'In yellow, clusters in {list(clusterDimensions.keys())} dimensions ') plt.scatter(data[:, 0], data[:, 1], c = pred) for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.show() if debug: print(clusterDimensions.keys(), points) return clusterAssignment denseBinsToClusters(denseUnits1D, plot = True, debug = False);
接下来,我们要计算每个子空间中从2到输入维数的所有有效聚类。该操作归结为计算k维度中的dense单元的组合,并且只保留大于初始最小密度阈值的dense continuous bins的重叠结果。一旦我们计算了k-1维度的dense单元,我们就可以通过计算最后k-1个候选者的所有组合来扩展到k维度。
def getSubspaceCandidates(previousUnits, subspaceDimension = 2): import itertools candidates = [] for ix1, ix2 in itertools.combinations(range(len(previousUnits)), 2): dims =[] candidate = [] for i in range(len(previousUnits[ix1])): dims.append(previousUnits[ix1][i].dimension) candidate.append(previousUnits[ix1][i]) points1= previousUnits[ix1][i].points for i in range(len(previousUnits[ix2])): dims.append(previousUnits[ix2][i].dimension) candidate.append(previousUnits[ix2][i]) points2= previousUnits[ix2][i].points points = np.intersect1d(points1, points2) # check points in common if np.unique(dims).shape[0] == subspaceDimension and points.shape[0]>thresholdPoints: # print(f' adding candidate: {len(points)}') # for v in candidate: # print(v) candidates.append(candidate) return candidates for subspaceDim in range(2, data.shape[1] +1): subspaceCandidates = getSubspaceCandidates(denseUnits1D, subspaceDimension = subspaceDim) pred = denseBinsToClusters(subspaceCandidates, plot = True, debug = False);
在2d中,我们能够检索如下图所示的聚类。紫色点被排除在聚类之外,因为它们位于稀疏的网格bins中。
plt.scatter(data[:, 0], data[:, 1], c = pred) for g in grid[0]: plt.axvline(x=g, c = 'red', linestyle ='--') plt.xlabel('Feature 0') for g in grid[1]: plt.axhline(y=g, c = 'red', linestyle ='--') plt.ylabel('Feature 1')
验证结果
from sklearn.metrics.cluster import adjusted_rand_score adjusted_rand_score(truth, pred)
0.9090183974614624
Clique聚类因其对输入参数(bins数和最小密度)的高灵敏度而受到批评,这可能导致非常不同的结果。但是,它是自下而上子空间聚类族中的一种基本算法。有多种方法可以优化clique算法。