Extracting cluster-specific features from k-means

Time: 2016-07-20 18:18:37

Tags: r heatmap k-means

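I am clustering some gene expression data with k-means in a reduced PCA space, and now I would like to extract the distinct features that best describe each cluster. These are the features (genes) that are highly expressed within each cluster.

I have posted a reproducible example that shows my logic and where I got stuck.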

# Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Cell", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")

# plot the initial heatmap
library(pheatmap)
pheatmap(t(test))

# perform PCA
pca = prcomp(t(test), center=TRUE, scale.=TRUE)
rotation = data.frame(pca$x)
plot(rotation[1:3], pch=16, cex=0.6, cex.main=0.9)

# perform k-means in PCA space
wss = (nrow(rotation)-1)*sum(apply(rotation,2,var))
for (i in 2:9) wss[i] <- sum(kmeans(rotation, centers=i)$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
km = kmeans(rotation, 2)
cluster_assignment = as.factor(km$cluster)

# plot k-means cluster assignment in PCA space
library(ggplot2)
ggplot(rotation, aes(PC1, PC2, color=cluster_assignment, label=rownames(rotation))) + geom_point() + geom_text()

# create a cluster annotation data.frame
n_clusters = length(unique(km$cluster))   
temp_cluster_design_list = list()
for(n in 1:n_clusters){
  temp_cluster_design = data.frame(row.names=row.names(t(test)[cluster_assignment %in% n,]))
  temp_cluster_design$cluster = n 
  temp_cluster_design_list[[n]] <- temp_cluster_design
}
cluster_design = do.call(rbind, temp_cluster_design_list)

# cluster_design looks something like this: 
#        cluster
# Cell1        1
# Cell3        1
# Cell5        1
# Cell7        1
# Cell9        1
# Cell2        2
# Cell4        2
# Cell6        2
# Cell8        2
# Cell10       2

# heatmap with cell cluster annotation
PCA_heatmap_data = t(test)[row.names(cluster_design),]
cluster_design$cluster = as.factor(cluster_design$cluster)
pheatmap(PCA_heatmap_data, annotation_row=cluster_design)


# extract out expressed genes from each cluster
for(n in 1:n_clusters) {
    temp_cluster = t(test)[cluster_assignment %in% n,]
#     ??? somehow extract out the expressed gene names specific to each cluster 
}

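The code above leaves me with a heatmap similar to this one. What I would like to do now is extract, for each cluster in the heatmap, the names of the highly expressed genes. Ultimately I want to write out a table that looks like this: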

GENE      CLUSTER
Gene20    cluster2
Gene19    cluster2
Gene15    cluster2
Gene18    cluster2
Gene16    cluster2
Gene17    cluster2
Gene9     cluster1
Gene8     cluster1
Gene4     cluster1
Gene3     cluster1
...       ...

I am not sure what the best and most efficient way to do this is. I would appreciate any help. Thanks!

Edit

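How do we define "highly expressed"? I am not sure about this and would welcome some insight. Perhaps take each gene's mean across all cells and compare it with its mean within the cluster? That might work, but I think it would be sensitive to outliers. Another idea is to scale each cluster and take the highly expressed genes?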
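A minimal sketch of the first idea, under some assumptions: the data are re-simulated with a seed (not in the original), a hard-coded `cluster` vector stands in for `km$cluster`, and each gene is compared against the remaining cells rather than against all cells. The helper `cluster_vs_rest` is a hypothetical name. Passing `median` instead of `mean` is one way to make the comparison less sensitive to outliers.

```r
# Simulated data mirroring the example above: 20 genes x 10 cells
set.seed(1)  # assumption: seed added only so the sketch is reproducible
test <- matrix(rnorm(200), 20, 10,
               dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:10)))
test[1:10, seq(1, 10, 2)] <- test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] <- test[11:20, seq(2, 10, 2)] + 2

# Stand-in for km$cluster: odd cells in cluster 1, even cells in cluster 2
cluster <- rep(c(1, 2), times = 5)
names(cluster) <- colnames(test)

# For cluster k, compare each gene's centre inside the cluster with its
# centre in all remaining cells; `center` can be mean or median
cluster_vs_rest <- function(mat, cluster, k, center = mean) {
  inside  <- apply(mat[, cluster == k, drop = FALSE], 1, center)
  outside <- apply(mat[, cluster != k, drop = FALSE], 1, center)
  inside - outside
}

diff_mean   <- cluster_vs_rest(test, cluster, 1, mean)
diff_median <- cluster_vs_rest(test, cluster, 1, median)

# Genes ranked by how much higher they are in cluster 1 than elsewhere
head(sort(diff_median, decreasing = TRUE), 10)
```

A cut-off on the difference would still have to be chosen by hand; the sketch only produces the ranking.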

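How would I identify highly expressed genes in clusters that are very similar? For example, this heatmap shows three clusters, but the red and green clusters are very similar. It looks like Gene9 is specific to cluster2?

1 Answer:

Answer 0: (score: 1)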
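One possible way to frame this, sketched on hypothetical data (the matrix `expr`, the three equal-sized clusters, and the seed are all assumptions, not the data behind the heatmap): compute each gene's mean per cluster and look at the gap between its highest and second-highest cluster mean. A gene shared by two similar clusters gets a small gap, while a gene specific to one of them gets a large gap.

```r
set.seed(2)  # assumption: seed only for reproducibility
# Hypothetical data: 6 genes x 9 cells, three clusters of three cells each
expr <- matrix(rnorm(54, sd = 0.5), 6, 9,
               dimnames = list(paste0("Gene", 1:6), paste0("Cell", 1:9)))
cluster <- rep(1:3, each = 3)
# Gene1 and Gene2 are high in both cluster 1 and cluster 2 (similar clusters)
expr[1:2, cluster %in% c(1, 2)] <- expr[1:2, cluster %in% c(1, 2)] + 4
# Gene3 is high in cluster 2 only
expr[3, cluster == 2] <- expr[3, cluster == 2] + 4

# Mean expression of every gene in every cluster (genes x clusters)
ids <- sort(unique(cluster))
cluster_means <- sapply(ids, function(k)
  rowMeans(expr[, cluster == k, drop = FALSE]))
colnames(cluster_means) <- paste0("cluster", ids)

# Gap between the highest and the second-highest cluster mean per gene
top_gap <- apply(cluster_means, 1, function(m) {
  s <- sort(m, decreasing = TRUE)
  s[1] - s[2]
})

specificity <- data.frame(
  GENE    = rownames(cluster_means),
  CLUSTER = colnames(cluster_means)[max.col(cluster_means, ties.method = "first")],
  GAP     = unname(top_gap)
)
# Most cluster-specific genes first
specificity[order(-specificity$GAP), ]
```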

threshold <- 1
expr      <- t(test)   # cells in rows, genes in columns
result    <- data.frame(GENE = character(), CLUSTER = integer())
for(n in seq_len(n_clusters)) {
  # cells assigned to cluster n
  cells <- names(cluster_assignment)[cluster_assignment == n]
  # mean expression of each gene within the cluster
  gene_means <- colMeans(expr[cells, , drop = FALSE])
  # keep the genes whose cluster mean exceeds the threshold
  nms <- names(gene_means)[gene_means > threshold]
  result <- rbind(result, data.frame(GENE = nms, CLUSTER = rep(n, length(nms))))
}

result
     GENE CLUSTER
1   Gene7       1
2  Gene11       1
3  Gene12       1
4  Gene13       1
5  Gene14       1
6  Gene15       1
7  Gene16       1
8  Gene17       1
9  Gene18       1
10 Gene19       1
11 Gene20       1
12  Gene1       2
13  Gene2       2
14  Gene3       2
15  Gene4       2
16  Gene5       2
17  Gene6       2
18  Gene7       2
19  Gene8       2
20  Gene9       2
21 Gene10       2

I left threshold as a parameter so that you can define it however you like. In this example I used 1 as the threshold.
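The same thresholding can also be written without explicit loops. The following is a sketch under stated assumptions: the data are re-simulated with a seed, a hard-coded `cluster` vector stands in for `km$cluster`, and the cluster labels are formatted as `cluster1`, `cluster2` to match the table requested in the question.

```r
set.seed(1)  # assumption: seed only for reproducibility
test <- matrix(rnorm(200), 20, 10,
               dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:10)))
test[1:10, seq(1, 10, 2)] <- test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] <- test[11:20, seq(2, 10, 2)] + 2

# Stand-in for km$cluster: odd cells in cluster 1, even cells in cluster 2
cluster <- setNames(rep(c(1, 2), times = 5), colnames(test))

threshold <- 1

# rowsum() adds up the cells of each cluster; dividing by the cluster
# sizes turns the sums into per-cluster means (clusters x genes)
cluster_sizes <- as.vector(table(cluster))
cluster_means <- rowsum(t(test), group = cluster) / cluster_sizes

# (cluster, gene) pairs whose mean exceeds the threshold
hits <- which(cluster_means > threshold, arr.ind = TRUE)
result <- data.frame(
  GENE    = colnames(cluster_means)[hits[, "col"]],
  CLUSTER = paste0("cluster", rownames(cluster_means)[hits[, "row"]])
)
result <- result[order(result$CLUSTER), ]
result
```

As with the loop version, a gene can appear under more than one cluster if its mean exceeds the threshold in each of them.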