Advanced single-cell data analysis: preliminary dimensionality reduction and clustering | Dimensionality reduction | Clustering
Dimensionality reduction.
Throughout the manuscript we use diffusion maps, a non-linear dimensionality reduction technique (ref. 37). We calculate a cell-to-cell distance matrix using 1 - Pearson correlation and use the diffuse function of the diffusionMap R package with default parameters to obtain the first 50 diffusion map coordinates (DMCs). To determine the significant DMCs, we look at the reduction of eigenvalues associated with DMCs. We consider every dimension whose eigenvalue is at least 4% of the sum of the first 50 eigenvalues to be significant, and scale all dimensions to have mean 0 and standard deviation of 1.
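The distance and selection steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the diffusionMap R package itself: the diffusion map computation is not reimplemented, and the eigenvalues and coordinates below are made-up placeholders standing in for the output of diffuse. The function names are hypothetical, and the sample standard deviation is used for scaling on the assumption that the authors used R's scale(), which does the same.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(cells):
    """Cell-to-cell distance matrix defined as 1 - Pearson correlation."""
    n = len(cells)
    return [[1.0 - pearson(cells[i], cells[j]) for j in range(n)]
            for i in range(n)]

def significant_dims(eigenvalues, rel_cutoff=0.04):
    """Indices of dimensions whose eigenvalue is at least rel_cutoff
    of the sum of all supplied eigenvalues."""
    total = sum(eigenvalues)
    return [i for i, ev in enumerate(eigenvalues) if ev / total >= rel_cutoff]

def zscore_columns(coords, dims):
    """Scale each selected column to mean 0 and standard deviation 1
    (sample standard deviation: an assumption, see the text above)."""
    cols = []
    for d in dims:
        col = [row[d] for row in coords]
        mu, sd = statistics.fmean(col), statistics.stdev(col)
        cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*cols)]

# Toy data: 6 cells x 50 genes of random values (placeholder, not real data).
random.seed(0)
cells = [[random.gauss(0, 1) for _ in range(50)] for _ in range(6)]
dist = correlation_distance(cells)

# Placeholder eigenvalues and coordinates standing in for diffuse() output.
eigenvalues = [0.5, 0.3, 0.15, 0.03, 0.02]
coords = [[random.gauss(0, 1) for _ in eigenvalues] for _ in cells]

dims = significant_dims(eigenvalues, rel_cutoff=0.04)
scaled = zscore_columns(coords, dims)
```

With these placeholder eigenvalues, only the first three dimensions pass the 4% cutoff. Z-scoring is done per column, so scaling only the retained dimensions gives the same values for them as scaling all dimensions first.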
Initial clustering of all cells.
To identify contaminating cell populations and assess overall heterogeneity in the data, we clustered all single cells. We first combined all Drop-seq samples and normalized the data (21,566 cells; 10,791 protein-coding genes detected in at least 3 cells and with a mean UMI count of at least 0.005) using regularized negative binomial regression as outlined above (correcting for sequencing-depth-related factors and cell cycle). We identified 731 highly variable genes; that is, genes for which the z-scored standard deviation was at least 1. We used the variable genes to perform dimensionality reduction using diffusion maps as outlined above (with a relative eigenvalue cutoff of 2%), which returned 10 significant dimensions.
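The gene filter and the variable-gene criterion above can be illustrated as follows. This is a rough sketch under stated assumptions, not the authors' procedure: the text does not say whether standard deviations are z-scored across all genes or within groups of genes, so the sketch z-scores across all genes. The gene names and values are invented, the function names are hypothetical, and the toy counts stand in for the normalized expression values the authors would have used in the second step.

```python
import statistics

def filter_genes(counts, min_cells=3, min_mean_umi=0.005):
    """Keep genes detected (UMI > 0) in at least min_cells cells
    and with a mean UMI count of at least min_mean_umi."""
    kept = {}
    for gene, umis in counts.items():
        detected = sum(1 for u in umis if u > 0)
        if detected >= min_cells and statistics.fmean(umis) >= min_mean_umi:
            kept[gene] = umis
    return kept

def variable_genes(expr, z_cutoff=1.0):
    """Genes whose standard deviation, z-scored across all genes
    (an assumption), is at least z_cutoff."""
    genes = list(expr)
    sds = [statistics.stdev(expr[g]) for g in genes]
    mu, sigma = statistics.fmean(sds), statistics.stdev(sds)
    return [g for g, sd in zip(genes, sds) if (sd - mu) / sigma >= z_cutoff]

# Toy matrix: gene -> UMI counts in 6 cells (hypothetical names and values).
counts = {
    "geneA": [0, 0, 0, 0, 0, 1],    # detected in only 1 cell
    "geneB": [1, 1, 1, 1, 1, 2],
    "geneC": [2, 2, 3, 2, 2, 3],
    "geneD": [0, 9, 0, 8, 1, 10],   # much more variable than the rest
    "geneE": [1, 2, 1, 2, 1, 2],
}

filtered = filter_genes(counts)
hvg = variable_genes(filtered)
```

In this toy example geneA fails the detection filter, and geneD is the only gene whose z-scored standard deviation reaches the cutoff of 1.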
For clustering we used a modularity optimization algorithm that finds community structure in the data with Jaccard similarities (neighbourhood size 9, Euclidean distance in diffusion map coordinates) as edge weights between cells (ref. 38). We chose a small neighbourhood size to overcluster the data and identify rare populations; this resulted in 15 clusters, of which two were clearly separated from the rest and expressed marker genes expected from contaminating cells (Neurod6 from excitatory neurons, Igfbp7 from epithelial cells). These cells represent rare cellular contaminants in the original sample (2.6% and 1%), and were excluded from further analysis, leaving 20,788 cells.
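The graph that feeds the clustering step can be sketched as below: each cell's nearest neighbours are found by Euclidean distance, and edges are weighted by the Jaccard similarity of the two cells' neighbourhoods. This is only an illustration with several assumptions that the text does not settle: each cell is counted as a member of its own neighbourhood, and an edge is drawn whenever one cell is among the other's nearest neighbours. The modularity optimization itself (ref. 38) is not implemented here, and the function names are hypothetical.

```python
import math

def knn(coords, k):
    """For each cell, the set of indices of its k nearest neighbours
    by Euclidean distance (the cell itself is excluded here)."""
    neighbours = []
    for i, p in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(p, coords[j]))
        neighbours.append(set(others[:k]))
    return neighbours

def jaccard_edges(coords, k=9):
    """Weighted edges {(i, j): weight} with i < j, where the weight is the
    Jaccard similarity of the two neighbourhoods. Each neighbourhood
    includes the cell itself (an assumption)."""
    nn = knn(coords, k)
    edges = {}
    for i, ni in enumerate(nn):
        for j in ni:
            a, b = min(i, j), max(i, j)
            if (a, b) not in edges:
                sa, sb = nn[a] | {a}, nn[b] | {b}
                edges[(a, b)] = len(sa & sb) / len(sa | sb)
    return edges

# Toy coordinates: two well-separated groups of three cells each.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = jaccard_edges(coords, k=2)
```

In the toy example no edge connects the two groups, which is the kind of structure a community detection algorithm would then separate into clusters. A smaller k makes neighbourhoods more local, which tends to produce more and smaller communities, in line with the overclustering goal stated above.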