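I clustered some gene expression data with k-means in a reduced PCA space, and now I want to extract the distinguishing features that best describe each cluster. These are the genes that are highly expressed in each cluster.
I've posted a reproducible example to show my logic and where I got stuck.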
# Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Cell", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
# plot the initial heatmap
library(pheatmap)
pheatmap(t(test))
# perform PCA
pca = prcomp(t(test), center=TRUE, scale=TRUE)
rotation = data.frame(pca$x)
plot(rotation[1:3], pch=16, cex=0.6, cex.main=0.9)
# perform k-means in PCA space
wss = (nrow(rotation)-1)*sum(apply(rotation,2,var))
for (i in 2:9) wss[i] <- sum(kmeans(rotation, centers=i)$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
km = kmeans(rotation, 2)
cluster_assignment = as.factor(km$cluster)
# plot k-means cluster assignment in PCA space
library(ggplot2)
ggplot(rotation, aes(PC1, PC2, color=cluster_assignment, label=rownames(rotation))) + geom_point() + geom_text()
# create a cluster annotation data.frame
n_clusters = length(unique(km$cluster))
temp_cluster_design_list = list()
for(n in 1:n_clusters){
temp_cluster_design = data.frame(row.names=row.names(t(test)[cluster_assignment %in% n,]))
temp_cluster_design$cluster = n
temp_cluster_design_list[[n]] <- temp_cluster_design
}
cluster_design = do.call(rbind, temp_cluster_design_list)
# cluster_design looks something like this:
# cluster
# Cell1 1
# Cell3 1
# Cell5 1
# Cell7 1
# Cell9 1
# Cell2 2
# Cell4 2
# Cell6 2
# Cell8 2
# Cell10 2
# heatmap with cell cluster annotation
PCA_heatmap_data = t(test)[row.names(cluster_design),]
cluster_design$cluster = as.factor(cluster_design$cluster)
pheatmap(PCA_heatmap_data, annotation_row=cluster_design)
# extract out expressed genes from each cluster
for(n in 1:n_clusters) {
temp_cluster = t(test)[cluster_assignment %in% n,]
# ??? somehow extract out the expressed gene names specific to each cluster
}
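The code above ends up giving me a heatmap similar to this one. What I now want to do for each cluster in the heatmap is extract the names of the highly expressed genes. Ultimately I'd like to write a table that looks like this: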
GENE CLUSTER
Gene20 cluster2
Gene19 cluster2
Gene15 cluster2
Gene18 cluster2
Gene16 cluster2
Gene17 cluster2
Gene9 cluster1
Gene8 cluster1
Gene4 cluster1
Gene3 cluster1
... ...
I'm not sure what the best and most efficient approach is. I'd appreciate any help! Thanks!
Edit
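How do we define highly expressed? I'm not quite sure about this and would welcome some insight. Perhaps compare the mean of each gene across all cells with its mean within the cluster? That might work, but I think it would be easily affected by outliers. Another idea is to scale each cluster and take the highly expressed genes?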
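One way to make the first idea concrete is sketched below: for each gene, subtract its mean over all cells from its mean within each cluster, and compute the same difference with medians, which are less sensitive to outlying cells. This is only an illustration on simulated data. The names `expr`, `clusters`, `cluster_summary`, `diff_mean`, and `diff_median` are made up for this example, and the cutoff of 1 is an arbitrary assumption.

```r
# Simulated data: 20 genes x 10 cells. Genes 1-10 are raised in odd-numbered
# cells, genes 11-20 in even-numbered cells.
set.seed(1)
expr <- matrix(rnorm(200), 20, 10,
               dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:10)))
expr[1:10, seq(1, 10, 2)] <- expr[1:10, seq(1, 10, 2)] + 3
expr[11:20, seq(2, 10, 2)] <- expr[11:20, seq(2, 10, 2)] + 3

# Hypothetical cluster labels (in practice these would come from km$cluster)
clusters <- rep(c(1, 2), times = 5)
names(clusters) <- colnames(expr)
cluster_ids <- sort(unique(clusters))

# Per-gene summary within each cluster, minus the same summary over all cells.
# Returns a genes x clusters matrix.
cluster_summary <- function(f) {
  out <- sapply(cluster_ids, function(k) {
    apply(expr[, clusters == k, drop = FALSE], 1, f) - apply(expr, 1, f)
  })
  colnames(out) <- paste0("cluster", cluster_ids)
  out
}
diff_mean   <- cluster_summary(mean)
diff_median <- cluster_summary(median)  # more robust to outlying cells

# Genes whose median in a cluster exceeds the overall median by more than 1
# (the cutoff of 1 is an arbitrary assumption)
hits <- which(diff_median > 1, arr.ind = TRUE)
candidates <- data.frame(GENE = rownames(diff_median)[hits[, "row"]],
                         CLUSTER = colnames(diff_median)[hits[, "col"]])
candidates
```

Comparing `diff_mean` with `diff_median` for the same gene gives a quick check on whether a few extreme cells are driving the result.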
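How can I identify highly expressed genes in clusters that are very similar? For example, this heatmap shows three clusters, but the red and green clusters are very similar. It looks like Gene9 is specific to cluster2?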
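For clusters that resemble each other, comparing one cluster against all remaining cells can hide the genes that separate it from its nearest neighbour. A rough sketch of one alternative, assuming three hypothetical clusters on simulated data: compute the mean of each gene per cluster, then score each gene by how far its mean in one cluster exceeds its highest mean in any other cluster. The names `cluster_means` and `specificity`, the simulated layout, and the cutoff of 2 are assumptions of this example, not part of the original code.

```r
set.seed(2)
expr <- matrix(rnorm(20 * 12), 20, 12,
               dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:12)))
clusters <- rep(1:3, each = 4)   # hypothetical cluster labels
names(clusters) <- colnames(expr)

# Clusters 1 and 2 share high expression of genes 1-8, so they look alike;
# only Gene9 is raised in cluster 2 alone.
expr[1:8, clusters %in% c(1, 2)] <- expr[1:8, clusters %in% c(1, 2)] + 5
expr["Gene9", clusters == 2] <- expr["Gene9", clusters == 2] + 5

# Mean of each gene within each cluster (genes x clusters)
cluster_means <- sapply(1:3, function(k) rowMeans(expr[, clusters == k]))
colnames(cluster_means) <- paste0("cluster", 1:3)

# For each gene and cluster: mean in that cluster minus the highest mean
# among the other clusters. Large positive values suggest cluster-specific genes.
specificity <- sapply(1:3, function(k) {
  cluster_means[, k] - apply(cluster_means[, -k, drop = FALSE], 1, max)
})
colnames(specificity) <- colnames(cluster_means)

# Genes exceeding an arbitrary cutoff of 2 in cluster 2
rownames(specificity)[specificity[, "cluster2"] > 2]
```

Scoring against the strongest competing cluster, instead of against the pooled remainder, means genes shared by two similar clusters score near zero in both.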
Answer 0 (score: 1)
threshold <- 1
expr <- t(test)  # cells in rows, genes in columns
result <- data.frame(GENE = character(), CLUSTER = integer())
for(n in 1:n_clusters) {
# cells assigned to cluster n
cells <- names(cluster_assignment)[cluster_assignment == n]
# mean expression of each gene within this cluster
gene_means <- colMeans(expr[rownames(expr) %in% cells, , drop = FALSE])
# keep the genes whose cluster mean exceeds the threshold
nms <- names(gene_means)[gene_means > threshold]
result <- rbind(result, data.frame(GENE = nms, CLUSTER = rep(n, length(nms))))
}
result
     GENE CLUSTER
1   Gene7       1
2  Gene11       1
3  Gene12       1
4  Gene13       1
5  Gene14       1
6  Gene15       1
7  Gene16       1
8  Gene17       1
9  Gene18       1
10 Gene19       1
11 Gene20       1
12  Gene1       2
13  Gene2       2
14  Gene3       2
15  Gene4       2
16  Gene5       2
17  Gene6       2
18  Gene7       2
19  Gene8       2
20  Gene9       2
21 Gene10       2
I left threshold as a parameter so that you can define it however you like. In this example I used 1 as the threshold.
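A threshold on raw means depends on each gene's overall expression level, so one variation, along the lines of the scaling idea raised in the question, is to standardize each gene across cells first so that the threshold is expressed in standard deviations. The sketch below uses simulated data. The names `z`, `z_means`, and `scaled_result` and the cutoff of 0.5 are assumptions made for illustration.

```r
set.seed(3)
expr <- matrix(rnorm(200), 20, 10,
               dimnames = list(paste0("Gene", 1:20), paste0("Cell", 1:10)))
expr[1:10, 1:5] <- expr[1:10, 1:5] + 3       # genes 1-10 high in cells 1-5
expr[11:20, 6:10] <- expr[11:20, 6:10] + 3   # genes 11-20 high in cells 6-10
clusters <- rep(1:2, each = 5)               # hypothetical cluster labels

# scale() standardizes columns, so transpose to put genes in columns.
# After this step each gene has mean 0 and standard deviation 1 across cells.
z <- scale(t(expr))

# Mean z-score of each gene within each cluster (clusters x genes):
# sum the rows per cluster, then divide by the cluster sizes.
z_means <- rowsum(z, group = clusters) / as.vector(table(clusters))

threshold <- 0.5   # in standard deviations; an arbitrary assumption
hits <- which(z_means > threshold, arr.ind = TRUE)
scaled_result <- data.frame(
  GENE = colnames(z_means)[hits[, "col"]],
  CLUSTER = paste0("cluster", rownames(z_means)[hits[, "row"]])
)
scaled_result[order(scaled_result$CLUSTER), ]
```

The output uses the GENE / CLUSTER layout requested in the question, with cluster labels written as `cluster1`, `cluster2`.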