Fuzzy matching in DNA seqs

Time: 2018-02-15 18:17:52

Tags: r tidyverse fuzzy fuzzyjoin

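For illustration, I generated a tibble called random_DNA_tbl containing 10 randomly generated DNA sequences (100 bases each). I have a separate tibble called subseq_tbl with 3 shorter sequences, each of which matches one of 3 sequences in random_DNA_tbl exactly (100%), but I would also like to use the sequences in subseq_tbl to fuzzy match against the other sequences in random_DNA_tbl. I was hoping to use the stringdist_XX_join functions from the fuzzyjoin package, but these do not seem to work, even though the subseq sequences are in fact perfect matches and work with other matching functions such as regex_XX_join.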

library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2", 
                               "random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6", 
                               "random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC", 
                  "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA", 
                  "GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT", 
                  "GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG", 
                  "ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC", 
                  "ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC", 
                  "GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA", 
                  "CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT", 
                  "GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT", 
                  "CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA, 
                                                           -10L), class = c("tbl_df", "tbl", "data.frame"))

subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))

Does not work:

stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
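
A likely reason the call above finds nothing is that stringdist_left_join compares whole strings, and its default max_dist is 2. The edit distance between a 17-base subsequence and a 100-base sequence cannot be smaller than the difference in their lengths, so no pair gets under that threshold. The sketch below illustrates this with base R's adist (generalized Levenshtein distance, standing in for the metric fuzzyjoin uses) on random_seq_2 and subseq1 from the data above. The variable names are illustrative only.

```r
# Sequence random_seq_2 and subsequence subseq1, copied from the question.
long_seq <- "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA"
sub_seq  <- "TCAACCATCAAGATAAC"

# Whole-string edit distance: every unmatched base in the long sequence
# costs one edit, so this is at least nchar(long_seq) - nchar(sub_seq).
full_dist <- drop(adist(sub_seq, long_seq))

# partial = TRUE measures the distance to the best-matching substring instead,
# which is 0 here because sub_seq occurs verbatim in long_seq.
partial_dist <- drop(adist(sub_seq, long_seq, partial = TRUE))

# A copy of sub_seq with one substituted base (C -> G at position 9) still
# lies within one edit of some substring of long_seq.
mutated_sub  <- "TCAACCATGAAGATAAC"
mutated_dist <- drop(adist(mutated_sub, long_seq, partial = TRUE))

print(c(full = full_dist, partial = partial_dist, mutated = mutated_dist))
```

Under this reading, raising max_dist enough to cover the length gap would not help either, since at that threshold almost any pair of sequences would count as a match.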

Works:

regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))

I tried adjusting the max_dist parameter in stringdist, but to no avail. Can anyone solve this?
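
One possible workaround, offered as a sketch and not as a confirmed answer, is to keep fuzzyjoin but swap the whole-string distance for an approximate substring test. fuzzy_left_join accepts a match_fun that receives two equal-length vectors and returns a logical vector. The helper within_edits and its max_edits argument below are hypothetical names introduced for illustration, and only two of the sequences from the question are used to keep the example short.

```r
library(tibble)
library(fuzzyjoin)

# Two sequences and two subsequences copied from the question.
seqs <- tibble(
  random_name = c("random_seq_2", "random_seq_3"),
  random_seq = c(
    "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
    "GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT"
  )
)
subs <- tibble(
  subseq_name = c("subseq1", "subseq3"),
  subseq = c("TCAACCATCAAGATAAC", "AAGGATCC")
)

# Hypothetical helper: TRUE where `pat` occurs somewhere in `target`
# with at most `max_edits` insertions, deletions, or substitutions.
within_edits <- function(target, pat, max_edits = 1) {
  mapply(
    function(s, p) drop(adist(p, s, partial = TRUE)) <= max_edits,
    target, pat,
    USE.NAMES = FALSE
  )
}

# Left join that keeps every sequence and attaches any subsequence
# found within the allowed number of edits.
joined <- fuzzy_left_join(
  seqs, subs,
  by = c(random_seq = "subseq"),
  match_fun = within_edits
)
print(joined[, c("random_name", "subseq_name")])
```

Short patterns such as the 8-base subseq3 can match unrelated sequences by chance once any edits are allowed, so the threshold probably needs to scale with pattern length. For larger data sets a dedicated tool such as Biostrings may be a better fit than a pairwise adist loop.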

0 answers:

No answers