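I have a set of DNA sequence words, each 8 letters long. There are about 50,000 words, for example "AAAAAAAA", "TTTTTTTT", "AAAAACGC", "AAAACCTG" and so on. I want to group the words so that all words sharing 6 letters (that is, differing by at most 2 substitutions) end up together. Could someone please help me?
I need to cluster all words within 2 substitutions of each other into one cluster, and words with more than 2 substitutions into another cluster. For example, "AAAAACCA" could belong to either "AAAAAAAA" or "AAAACCCA". However, "AAAAACCA" should go into the "AAAACCCA" cluster, because it is only 1 substitution away from that word, compared with 2 substitutions from "AAAAAAAA". Similarly, "AAAAAAAG" can belong to "AAAAAAAA" or to "AAAAAAAC", but not to both. I hope that makes my point clear; please comment if you need any further clarification. Thanks.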
> words <- sample[1:25]
> group <- lapply(words, function(x) list(x, words[agrep(x, words, max.distance = list(all = 2, insertions = 0, deletions = 0, substitutions = 2))]))
> group
[[1]]
[[1]][[1]]
[1] "AAAAAAAA"
[[1]][[2]]
[1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
[9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCA" "AAAAACGA"
[[2]]
[[2]][[1]]
[1] "AAAAAAAC"
[[2]][[2]]
[1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
[9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCC"
[[3]]
[[3]][[1]]
[1] "AAAAAAAG"
[[3]][[2]]
[1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
[9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCG"
How can I reduce the redundancy in the output?
Answer 0 (score: 4)
With a call to adist, you can do:
words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
lapply(words, function(x) words[adist(x, words) < 3])
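One caveat worth illustrating: adist computes a generalized Levenshtein (edit) distance, so insertions and deletions count as well as substitutions, and a shifted word can look closer than it is. The following is a minimal sketch of a substitution-only alternative. The helper `hamming` and the two extra example words are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): number of positions
# at which two equal-length words differ, i.e. substitutions only.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# The last two words are illustrative additions: one is the other shifted by one letter.
words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA", "ACGTACGT", "CGTACGTA")

# adist() allows one deletion plus one insertion, so the shifted pair looks close:
adist("ACGTACGT", "CGTACGTA")    # edit distance 2
hamming("ACGTACGT", "CGTACGTA")  # 8 substitutions

# Group on substitutions only: keep words at most 2 substitutions from x.
groups <- lapply(words, function(x) {
  words[vapply(words, hamming, integer(1), b = x) <= 2]
})
names(groups) <- words
```

For real data this only matters if some words are near-shifts of each other; if they are not, the adist call and this sketch should select the same words.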
You can also try this with agrep, but it may be much slower:
words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
d <- lapply(words,
  function(x) list(match.word = x, six.letter.grp = words[agrep(x, words,
    max.distance = list(all = 2, insertions = 0, deletions = 0, substitutions = 2))]))
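This outputs the following list, which shows each word you are matching together with all the words it matches, including the word itself; you can adjust the output to suit what you want: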
[[1]]
[[1]]$match.word
[1] "AAAAAAAA"
[[1]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"
[[2]]
[[2]]$match.word
[1] "TTTTTTTT"
[[2]]$six.letter.grp
[1] "TTTTTTTT"
[[3]]
[[3]]$match.word
[1] "AAAAAAGC"
[[3]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC"
[[4]]
[[4]]$match.word
[1] "AAAACCAA"
[[4]]$six.letter.grp
[1] "AAAAAAAA" "AAAACCAA"
For a more compact list structure, you can try:
d <- lapply(words, function(x) words[agrep(x, words,
max.distance=list(all=2, insertions=0, deletions=0, substitutions=2))])
names(d) <- words
d
#$AAAAAAAA
#[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"
#
#$TTTTTTTT
#[1] "TTTTTTTT"
#
#$AAAAAAGC
#[1] "AAAAAAAA" "AAAAAAGC"
#
#$AAAACCAA
#[1] "AAAAAAAA" "AAAACCAA"