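I'm not familiar with many of the R packages out there, so I apologize for my Google skills if this has already been solved elsewhere.
I am trying to group character sequences by Hamming distance and return the group sizes. The Hamming distance is the number of positions at which seqA and seqB differ, i.e. the number of character substitutions needed to turn seqA into seqB. For example, I have the following sequences: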
[1] "24 sequences with ID 64"
[1] " AAAAAACAAAGAACC 64" " AAAAAAAAAAACTAT 64"
[3] " AAAAATGCGTGTATA 64" " AAAAAACAAAGAACC 64"
[5] " AAAAAAAAAAACTAT 64" " AAAAATGCGTGTATA 64"
[7] " AAAAAACAAAGAACC 64" " AAAAAAAAAAACTAT 64"
[9] " AAAAATGCGTGTATA 64" " AAAAAACAAAGAACC 64"
[11] " AAAAAAAAAAACTAT 64" " AAAAATGCGTGTATA 64"
[13] " AAAAAACAAAGAACC 64" " AAAAAAAAAAACTAT 64"
[15] " AAAAATGCGTGTATA 64" " AAAAAACAAAGAACC 64"
[17] " AAAAAAAAAAACTAT 64" " AAAAATGCGTGTATA 64"
[19] " AAAAAACAAAGAACC 64" " AAAAAAAAAAACTAT 64"
[21] " AAAAATGCGTGTATA 64" " AAAAAACAAAGAACC 64"
[23] " AAAAAAAAAAACTAT 64" " AAAAATGCGTGTATA 64"
I know there are three unique sequences here, and their Hamming distances are:
     [,1] [,2] [,3]
[1,]    0    6    8
[2,]    6    0   10
[3,]    8   10    0
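Since every pair of sequences differs by more than 2 (the number of positions that must change to make sequence A look like sequence B), I would keep them as three unique sets of sequences.
If instead I had a set of sequences whose Hamming distances looked like this: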
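For reference, a matrix like the one above can be computed in base R. This is a minimal sketch that assumes all sequences have the same length; `hamming` and `hd` are illustrative names, not functions from a package.

```r
# The three unique sequences from the example above
seqs <- c("AAAAAACAAAGAACC", "AAAAAAAAAAACTAT", "AAAAATGCGTGTATA")

# Hypothetical helper: number of positions at which two
# equal-length strings differ
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Pairwise distances; Vectorize() lets outer() call the
# scalar helper on every pair of sequences
hd <- outer(seqs, seqs, Vectorize(hamming))
hd
```

The helper splits each string into single characters and counts mismatches, so it only makes sense for sequences of equal length.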
     [,1] [,2] [,3]
[1,]    0    2   13
[2,]    2    0   13
[3,]   13   13    0
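I would say that groups 1 and 2 are effectively the same, since they meet the <= 2 distance threshold, while group 3 is a unique group on its own. So I would like to see output along the lines of: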
sum(group1,group2)
sum(group3)
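I can work out how to do this with pen and paper, but given my lack of experience with R I don't know where to start. Any help is much appreciated.
Answer 0 (score: 1)
I'm not sure I've covered everything you need, but here is a script that may help.
I wrote a script that builds the groups and outputs them as a list. It isn't very pretty, and it may be hard to follow for an R beginner, but it is the simplest way I found: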
library(reshape2)  # provides melt()

make.groupe <- function(the_mat, min_dist = 2) {
  # 1-member groups: rows whose distance to every other item exceeds min_dist
  res <- as.list(rownames(the_mat)[apply(the_mat, 1, function(xx) all(xx > min_dist | xx == 0, na.rm = TRUE))])
  # 2-member groups: keep only the lower triangle so each pair appears once
  the_mat[upper.tri(the_mat, diag = FALSE)] <- NA
  group <- subset(melt(the_mat), value != 0)
  group <- group[group$value <= min_dist, 1:2]
  # append each close pair to the result as a character vector
  res <- unname(append(res, lapply(apply(unname(as.matrix(group)), 1, as.list), unlist)))
  res
}
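Pass the function the matrix and the distance threshold (called min_dist in the code, although it acts as the largest distance at which two items are still grouped):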
mat1 <- matrix(c(0,2,13,2,0,13,13,13,0),3,3, dimnames = list(c("g1","g2","g3"),c("g1","g2","g3")))
make.groupe(mat1, 2)
[[1]]
[1] "g3"
[[2]]
[1] "g2" "g1"
It also works with your first matrix:
mat2 <- matrix(c(0,6,8,6,0,10,8,10,0),3,3, dimnames = list(c("g1","g2","g3"),c("g1","g2","g3")))
make.groupe(mat2, 2)
[[1]]
[1] "g1"
[[2]]
[1] "g2"
[[3]]
[1] "g3"
It still works if you change the distance threshold:
mat2 <- matrix(c(0,6,8,6,0,10,8,10,0),3,3, dimnames = list(c("g1","g2","g3"),c("g1","g2","g3")))
make.groupe(mat2, 6)
[[1]]
[1] "g3"
[[2]]
[1] "g2" "g1"
Larger matrices work as well:
mat3 <- matrix(c(0,2,8,9,2,0,7,8,8,7,0,1,9,8,1,0),4,4, dimnames = list(c("g1","g2","g3","g4"),c("g1","g2","g3","g4")))
make.groupe(mat3, 2)
[[1]]
[1] "g2" "g1"
[[2]]
[1] "g4" "g3"
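It does not work for groups with 3 or more members, though: the function only ever reports singletons and pairs.
Another option is to use the clustering functions, although this does not produce a list: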
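One way to handle groups of any size is to treat every pair within the threshold as linked and then take the connected components. The following is a minimal base-R sketch of that idea, not part of the script above; `group_by_threshold` and `max_dist` are illustrative names. Be aware that it chains: if g1 is close to g2 and g2 is close to g3, all three end up together even when g1 and g3 are further apart than the threshold. Whether that is what you want is an assumption on my part.

```r
# Hypothetical helper: groups of any size, built as connected components
# of the graph that links items whose distance is <= max_dist
group_by_threshold <- function(the_mat, max_dist = 2) {
  n <- nrow(the_mat)
  labels <- seq_len(n)           # every item starts in its own group
  close <- the_mat <= max_dist   # TRUE where two items are linked (diagonal included)
  repeat {
    # each item adopts the smallest label among the items it is linked to
    new_labels <- vapply(seq_len(n),
                         function(i) min(labels[close[i, ]]),
                         integer(1))
    if (all(new_labels == labels)) break   # stop once nothing changes
    labels <- new_labels
  }
  unname(split(rownames(the_mat), labels))
}

# Example: g1-g2 and g2-g3 are within 2, g1-g3 is not, g4 is far from all
m <- matrix(c(0, 2, 4, 9,
              2, 0, 2, 9,
              4, 2, 0, 9,
              9, 9, 9, 0), 4, 4,
            dimnames = list(paste0("g", 1:4), paste0("g", 1:4)))
res <- group_by_threshold(m, 2)
res
```

Here g1, g2 and g3 come back as one group and g4 as a group on its own.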
cutree(hclust(as.dist(mat1)), h=2)
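where h is the height at which the tree is cut, used here as the distance threshold. This produces a vector in which equal indices denote the same group: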
cutree(hclust(as.dist(mat1)), h=2)
g1 g2 g3
1 1 2
cutree(hclust(as.dist(mat3)), h=2)
g1 g2 g3 g4
1 1 2 2
cutree(hclust(as.dist(mat2)), h=2)
g1 g2 g3
1 2 3
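If you do need a list, or the group sizes asked for in the question, the vector returned by cutree can be reshaped with split and tapply. A minimal sketch under one assumption: `counts` is a made-up named vector saying how many raw sequences each unique sequence stands for (8 each, as in the 24-sequence example from the question).

```r
# mat1 as defined earlier in this answer
mat1 <- matrix(c(0, 2, 13, 2, 0, 13, 13, 13, 0), 3, 3,
               dimnames = list(c("g1", "g2", "g3"), c("g1", "g2", "g3")))

# Assumed for illustration: number of sequences behind each unique sequence
counts <- c(g1 = 8, g2 = 8, g3 = 8)

# Cluster membership, as above
cl <- cutree(hclust(as.dist(mat1)), h = 2)

# Same grouping as a list of member names
groups <- unname(split(names(cl), cl))

# Total number of sequences per group
sizes <- tapply(counts[names(cl)], cl, sum)

groups
sizes
```

hclust uses complete linkage by default, so cutting at h = 2 only puts items in the same group when every pair inside that group is within distance 2. This is stricter than chaining neighbours together.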