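Finding partial overlaps between sequences in two matrices

Asked: 2018-05-08 16:56:56

Tags: r

I'll start with a reproducible example, which is a subset of my real data:

Data file 1: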

> dput(exp_data)
structure(c("ACLVDGSYHDVDSSVLAFQLAAR", "AELNQVVR", "AFEPGLLAK", 
"AFSVFLFNSK", "AFYEFQQR", "AGEPLYVLLCCWVAAVGAGLLK", "AIKDFPHR", 
"AIRIPVVR", "AIVWSGEELGAK", "ALAALQGR", "ALEGIYACCFR", "ANLSSVQIDR", 
"ANLSSVQIDRELK", "ASYTMQLAK", "ATRVEEGGEEENVMAK", "AVELVILPR", 
"AVPLKDYR", "CLAAIEGR", "DIVSEHPER", "DLVDFAEFR", "DLVDFAEFRK", 
"DMIVTNLGAKPLVLQIPIGAEDVFK", "DQSDREVDVTQNR", "DQVSIIPFR", "DQVSIIPFRGDAAEVLLPPSR", 
"DQVTAEDVGIVIPNCLR", "DRVTPDDVATVIPNCLR", "DSILQSIHEPELISAFDTGGAELLYEIR", 
"DSLVQSGAKPELIAAFDTNGAELLYEIR", "DTITGETLSDPENPVVLER", "EDGVMTAELLQR", 
"EGISISHPAR", "EIGGIAISGR", "EILVQHLLVK", "ELHGESEEERVKEEEIK", 
"23", " 8", " 9", "10", " 8", "22", " 8", " 8", "12", " 8", "11", 
"10", "13", " 9", "16", " 9", " 8", " 8", " 9", " 9", "10", "25", 
"13", " 9", "21", "17", "17", "28", "28", "19", "12", "10", "10", 
"10", "17"), .Dim = c(35L, 2L), .Dimnames = list(c("14037", "24071", 
"27989", "31522", "32851", "35458", "49646", "52332", "54727", 
"57052", "61034", "82744", "82797", "104573", "110271", "115602", 
"121061", "133577", "163666", "175488", "175522", "177867", "183262", 
"183690", "183703", "183742", "184949", "186146", "186828", "193019", 
"213233", "222624", "232405", "233822", "244244"), c("Sequence", 
"Length")))

Data file 2:

> dput(exp_sel)
structure(c(" 49", " 80", " 45", " 61", " 40", " 45", "107", 
" 75", " 40", " 60", " 43", " 57", " 80", " 51", " 55", " 39", 
"MAMTPVASSSPVSTCRLFRCNLLPDLLPKPLFLSLPKRNRIASCRFTVR", "MAADALRISSSSSGSLVCNLNGSQRRPVLLPLSHRATFLGLPPRASSSSISSSIPQFLGTSRIGLGSSKLSQKKKQFSVF", 
"MSASSLFNLPLIRLRSLALSSSFSSFRFAHRPLSSISPRKLPNFR", "MFSLKSLISSPFTQSTTHGLFTNPITRPVNPLPRTVSFTVTASMIPKRSSANMIPKNPPAR", 
"MQICQTKLNFTFPNPTNPNFCKPKALQWSPPRRISLLPCR", "MVVVTHISTSFHQISPSFFHLRLRNPSTTSSSRPKLDGGFALSIR", 
"MASSSSMQMVHTSRSIAQIGFGVKSQLVSANRTTQSVCFGARSSGIALSSRLHYASPIKQFSGVYATTKHQRTACVKSM", 
"MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC", "MASSSLPLSLPFPLRSLTSTTRSLPFQCSPLFFSIPSSIV", 
"MASLLGTSSSAIWASPSLSSPSSKPSSSPICFRPGKLFGSKLNAGIQIRPKKNRSRYHVS", 
"MALQAADLVDFAEFRRKDAKLNASSSSFKDSSLFGASITDQIKSEHGSSSLRFKREQSLRNLAIRA", 
"MELSLSTSSASPAVLRRQASPLLHKQQVLGVSFASALKPASYTMQLAKSRRPLPRPITC", 
"MFRVTGTLSAASSPAVAAASFSAALRLSITPTLAIASPPHLRWFSKFSRQFLGGRISSLRPRIPSPCPIRLSGFPALKMRA", 
"MLSLTATTLSSSIFTQSKTHGFFNTRPVYRKPFTTITSALIPASNRQAPPK", "MASLLGRSPSSILTCPRISSPSSTSSMSHLCFGPEKLSGRIQFNPKKNRSRYHVS", 
"MAVSPHISPTLSRYKFFSTSVVENPNFSPYRIYSRRRVT"), .Dim = c(16L, 2L), .Dimnames = list(
    c("2", "6", "10", "11", "14", "15", "16", "17", "20", "21", 
    "22", "23", "24", "25", "26", "27"), c("Length", "Sequence"
    )))

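I want to take the Sequence from each row of data file 1 (exp_data) and check whether that particular string can be found in any row of the Sequence column of data file 2 (exp_sel). The catch is that the sequences are not identical: each sequence from data file 1 is expected to appear as a partial overlap, that is, as a substring, within a sequence in the Sequence column of data file 2.

Example output:

Sequence from data file 1:

AFYEFQQR

Sequence in data file 2:

MAMTPVASSSPV AFYEFQQR NLLPDLLPKPLFLSLPKRNRIASCRFTVR

There is a match, so keep this row in exp_data. If a sequence has no match, remove its row.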

2 Answers:

Answer 0: (score: 5)

You could do this...

exp_data[sapply(exp_data[,1], function(x) any(grepl(x, exp_sel[,2]))), ]

       Sequence    Length
24071  "AELNQVVR"  " 8"  
104573 "ASYTMQLAK" " 9"  
175488 "DLVDFAEFR" " 9"  

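The sapply call produces a logical vector that is TRUE wherever any of the exp_sel sequences contains the corresponding element of exp_data. That vector is then used to subset the rows of exp_data.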
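
Note that grepl treats its first argument as a regular expression by default. That is harmless for sequences made only of letters, but literal matching with fixed = TRUE may be a safer choice if the strings could ever contain regex metacharacters. A minimal sketch of that variant, assuming two small toy matrices (toy_data and toy_sel are hypothetical stand-ins shaped like exp_data and exp_sel, not the real data):

```r
# Hypothetical toy matrices with the same column layout as exp_data / exp_sel
toy_data <- matrix(c("AELNQVVR", "AFEPGLLAK", " 8", " 9"), ncol = 2,
                   dimnames = list(c("24071", "27989"), c("Sequence", "Length")))
toy_sel <- matrix(c(" 35", " 40",
                    "MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC",
                    "MASSSLPLSLPFPLRSLTSTTRSLPFQCSPLFFSIPSSIV"), ncol = 2,
                  dimnames = list(c("17", "20"), c("Length", "Sequence")))

# TRUE for each peptide that occurs as a literal substring of any toy_sel sequence
keep <- vapply(toy_data[, "Sequence"],
               function(x) any(grepl(x, toy_sel[, "Sequence"], fixed = TRUE)),
               logical(1))

# drop = FALSE keeps the result a matrix even when only one row survives
kept <- toy_data[keep, , drop = FALSE]
print(kept)
```

Using vapply instead of sapply guarantees a logical vector even for empty input, and drop = FALSE avoids the result collapsing to a plain character vector when a single row matches.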

Answer 1: (score: 1)

I took the request to be for a tabulation of the items in exp_sel that contain any of the items in exp_data:

 exp_sel[ unlist( sapply(exp_data, grep, exp_sel[, "Sequence"])), ]
   Length Sequence                                                            
17 " 75"  "MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC"                               
23 " 57"  "MELSLSTSSASPAVLRRQASPLLHKQQVLGVSFASALKPASYTMQLAKSRRPLPRPITC"       
22 " 43"  "MALQAADLVDFAEFRRKDAKLNASSSSFKDSSLFGASITDQIKSEHGSSSLRFKREQSLRNLAIRA"

But after re-reading the question, it looks like I misread it. Perhaps this will be useful anyway.
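
The call above iterates over every cell of exp_data, including the Length column, which works here only because no sequence contains digits or spaces. One possible refinement searches with the Sequence column alone and matches literally. A sketch on hypothetical toy matrices (toy_data and toy_sel are illustrative stand-ins, not the question's data):

```r
# Hypothetical toy matrices shaped like exp_data / exp_sel
toy_data <- matrix(c("AELNQVVR", "ASYTMQLAK", "AFEPGLLAK", " 8", " 9", " 9"),
                   ncol = 2,
                   dimnames = list(c("24071", "104573", "27989"),
                                   c("Sequence", "Length")))
toy_sel <- matrix(c(" 35", " 40", " 59",
                    "MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC",
                    "MASSSLPLSLPFPLRSLTSTTRSLPFQCSPLFFSIPSSIV",
                    "MELSLSTSSASPAVLRRQASPLLHKQQVLGVSFASALKPASYTMQLAKSRRPLPRPITC"),
                  ncol = 2,
                  dimnames = list(c("17", "20", "23"), c("Length", "Sequence")))

# For each peptide, the row indices of toy_sel whose sequence contains it
hits <- lapply(toy_data[, "Sequence"], grep,
               x = toy_sel[, "Sequence"], fixed = TRUE)

# Collapse to the distinct matching rows of toy_sel
matched <- toy_sel[unique(unlist(hits, use.names = FALSE)), , drop = FALSE]
print(matched)
```

Keeping hits as a list preserves which peptide matched which row, while unique avoids repeating a row of toy_sel that contains more than one peptide.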