Finding each pattern in a set of sequences

Date: 2018-05-16 09:43:37

Tags: r loops pattern-matching

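I am trying to find patterns in a set of sequences. What I want to do is take the first pattern sequence, match it against every subject one by one, and report the output (p1 vs s1, p1 vs s2, p1 vs s3, p1 vs s4), then take the second pattern and match it against all subjects (p2 vs s1, p2 vs s2, p2 vs s3, p2 vs s4), and so on, i.e. iteratively. The inputs (pattern and subject) are DNAStringSet instances (Biostrings).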

I have used this function:

mat=nucleotideSubstitutionMatrix(match=2,mismatch = -3,baseOnly = TRUE)
localAlign=pairwiseAlignment(pattern,subject,type="local",
                             substitutionMatrix=mat,
                             gapOpening=-5, gapExtension=-2)

But used this way, it actually matches element by element: p1 vs s1, p2 vs s2, p3 vs s3 and p4 vs s4.
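To make the desired pairing concrete, here is a minimal base R sketch that enumerates every pattern/subject combination in the order described above. The short sequences and the `toy_score` function are hypothetical stand-ins, not part of the original post. In real use, `toy_score` would presumably be replaced by a `pairwiseAlignment` call on one pattern and one subject, with the same arguments as in the snippet above.

```r
# Hypothetical toy inputs standing in for the DNAStringSet objects.
pattern <- c(P1 = "GGTAAG", P2 = "TGTAAG")
subject <- c(S1 = "GTAGGT", S2 = "GTGAGC", S3 = "GTAAGT")

# Placeholder score (assumption): number of positions at which two
# equal-length strings agree. This is NOT a local alignment score.
toy_score <- function(p, s) {
  sum(strsplit(p, "")[[1]] == strsplit(s, "")[[1]])
}

# All pattern/subject combinations. expand.grid varies its first
# argument fastest, so rows run P1-S1, P1-S2, P1-S3, P2-S1, ...
pairs <- expand.grid(subject = names(subject),
                     pattern = names(pattern),
                     stringsAsFactors = FALSE)[, c("pattern", "subject")]

# Score each pair by looking up the sequences by name.
pairs$score <- mapply(toy_score,
                      unname(pattern[pairs$pattern]),
                      unname(subject[pairs$subject]),
                      USE.NAMES = FALSE)
print(pairs)
```

The result is a long table with one row per pair, which tends to be easier to filter or sort than a nested list of results.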

Example:

Input: pattern

A DNAStringSet instance of length 734
      width seq                                                                         names               
  [1]  1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC P1 
  [2]  1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC P2
  [3]  1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC P3
  [4]  1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA P4

Input: subject

A DNAStringSet instance of length 1000
       width seq                                                                        names               
   [1]  1000 GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA S1
   [2]  1000 GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA S2
   [3]  1000 GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT S3
   [4]  1000 GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT S4

1 Answer:

Answer 0: (score: 0)

If I understand your question correctly, you can try this.

    
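library(dplyr)

# Join each pattern to the subjects whose sequence string is identical,
# then collapse the matching subject names into one comma-separated field.
df <- pattern_df %>%
  left_join(subject_df, by = "seq") %>%
  group_by(width.x, seq, names.x) %>%
  summarise(subject_names = paste(names.y, collapse = ",")) %>%
  `colnames<-`(c("width", "seq", "names", "subject_names"))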

The output is:

> df
  width seq                                                                         names subject_names     
1  1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC P1    NA           
2  1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA P4    S5,S6        
3  1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC P3    NA           
4  1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC P2    NA 

Sample data (note that I added two rows at the end of subject_df):

pattern_df <- structure(list(width = c(1000L, 1000L, 1000L, 1000L), seq = c("GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC", 
"TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC", 
"GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC", 
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA"
), names = c("P1", "P2", "P3", "P4")), .Names = c("width", "seq", 
"names"), class = "data.frame", row.names = c(NA, -4L))
#  width                                                                         seq names
#1  1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC    P1
#2  1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC    P2
#3  1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC    P3
#4  1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA    P4


subject_df <- structure(list(width = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L
), seq = c("GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA", 
"GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA", 
"GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT", 
"GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT", 
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA", 
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA"
), names = c("S1", "S2", "S3", "S4", "S5", "S6")), .Names = c("width", 
"seq", "names"), class = "data.frame", row.names = c(NA, -6L))
#  width                                                                         seq names
#1  1000 GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA    S1
#2  1000 GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA    S2
#3  1000 GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT    S3
#4  1000 GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT    S4
#5  1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA    S5
#6  1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA    S6
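
The join above pairs a pattern with a subject only when the two `seq` strings are identical. As a rough illustration of that lookup without dplyr, here is a minimal base R sketch on hypothetical toy data frames (`toy_patterns` and `toy_subjects` are illustrative names and values, not taken from the answer):

```r
# Hypothetical toy data with the same column layout as the sample data.
toy_patterns <- data.frame(seq = c("AAA", "CCC"),
                           names = c("P1", "P2"),
                           stringsAsFactors = FALSE)
toy_subjects <- data.frame(seq = c("CCC", "GGG", "CCC"),
                           names = c("S1", "S2", "S3"),
                           stringsAsFactors = FALSE)

# For each pattern, collect the names of all subjects whose sequence
# string is exactly equal; patterns without a hit get NA.
toy_patterns$subject_names <- vapply(toy_patterns$seq, function(s) {
  hits <- toy_subjects$names[toy_subjects$seq == s]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
}, character(1), USE.NAMES = FALSE)

print(toy_patterns)
```

Because the comparison is plain string equality, it reports identical sequences only.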