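Merging partially matching strings

Time: 2014-06-15 11:27:03

Tags: r perl pattern-matching

I am struggling to combine partially matching strings from two files.

File 1 contains a list of unique strings. Each of these strings partially matches many strings in file 2. For every match, how can I merge the line from file 1 with the matching line from file 2?

File 1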

mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660

File 2

mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

Desired output

mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

I tried using pmatch() in R but could not get it to work. It looks like something Perl would handle well?

Maybe something like this:

perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1

3 answers:

Answer 0 (score: 4)

Here is a short Perl solution that stores all the data from file1 in a hash and then looks it up while scanning file2.

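use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: the text before the first underscore is the key,
# the text after it is the value
my %file1 = do {
  open my $fh, '<', $files[0];
  map /([^_]+)_(\S+)/, <$fh>;
};

# For each line of file2, look up its prefix and print both strings side by side
open my $fh, '<', $files[1];
while (<$fh>) {
  my ($key) = /([^_]+)/;
  printf "%-32s%s", "${key}_$file1{$key}", $_;
}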

Output

mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

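Answer 1 (score: 3)

Of course you can do this in R. Indeed, calling pmatch on the whole strings will not give you the desired result; you have to match on a suitable substring.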
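To make the point about pmatch concrete, here is a minimal base-R sketch (the names `whole`, `prefix`, `ids1`, `ids2` and `idx` are illustrative, and the identifier is assumed to be everything before the first underscore). pmatch only succeeds when the search string is a unique prefix of a table entry, so the full file 1 strings match nothing, and even the shared prefix is rejected as ambiguous because several file 2 lines start with it. Trimming both sides to the identifier and using exact match avoids both problems.

```r
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC")

# The full file1 strings are not prefixes of any file2 string
whole <- pmatch(file1, file2)               # NA NA

# The shared prefix is a prefix of three entries, so pmatch gives up
prefix <- pmatch("mmu-miR-677-5p_", file2)  # NA

# Assumption: the identifier is the text before the first underscore
ids1 <- sub("_.*$", "", file1)
ids2 <- sub("_.*$", "", file2)

# Exact matching on the identifiers: one file1 index per file2 line
idx <- match(ids2, ids1)                    # 1 1 1 2 2
print(idx)
```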

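I assume that the first identifier in file 1 is 677 rather than 667; otherwise it is hard to guess the matching scheme (I assume your example is just a small part of a larger database).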

file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))

file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))

library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")

cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
##      file1                            file2                                     
## [1,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"  
## [2,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT" 
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC" 
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
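The matched pairs can also be written out in the two-column text layout the question asks for. The following is only a sketch in base R, again assuming the identifier is everything before the first underscore. The extra file 2 entry `mmu-miR-999_ACGT` is made up purely to show what happens to a line with no partner in file 1, and `key`, `keep` and `out` are illustrative names.

```r
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC",
           "mmu-miR-999_ACGT")  # hypothetical line without a partner in file1

# Assumption: the identifier is the text before the first underscore
key <- function(x) sub("_.*$", "", x)

# Index into file1 for every file2 line; NA where there is no partner
idx <- match(key(file2), key(file1))

# Drop unmatched lines, then left-align the file1 string in a 32-character column
keep <- !is.na(idx)
out <- sprintf("%-32s%s", file1[idx[keep]], file2[keep])
writeLines(out)
```

Lines of file 2 without a partner are dropped silently here; inspecting `file2[!keep]` would list them if they need to be reported instead.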

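Answer 2 (score: 2)

You can use agrep for fuzzy searching. You should experiment with the distance; here I set it manually to 11.

Basically, I do this to extract the line numbers in file2 that match each entry in file1: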

sapply(file1, agrep, file2, max.distance = 11)
$`mmu-miR-677-5p_MIMAT0017239`
[1] 1 2 3

$`mmu-miR-181a-1-3p_MIMAT0000660`
[1] 4 5

To get the result as a data.frame:

do.call(rbind,
        lapply(file1,
               function(x)
                 data.frame(file1 = x,
                            file2 = agrep(x, file2, max.distance = 11, value = TRUE))))


                         file1                                    file2
1    mmu-miR-677-5p_MIMAT0017239   mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
2    mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
3    mmu-miR-677-5p_MIMAT0017239  mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
4 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
5 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
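Since the distance of 11 was picked by hand, one way to experiment with it is to count how many file2 lines each file1 entry matches over a range of candidate values. This is only a sketch: the candidate values in `distances` are arbitrary and `counts` is an illustrative name. A threshold that is too small will miss true partners, while one that is too large may start pulling in lines that belong to a different identifier.

```r
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC")

# Candidate edit distances to try (arbitrary values around the 11 used above)
distances <- c(5L, 8L, 11L, 14L)

# One row per file1 entry, one column per candidate distance;
# each cell is the number of file2 lines that agrep reports as a match
counts <- sapply(distances, function(d) {
  sapply(file1, function(p) length(agrep(p, file2, max.distance = d)))
})
dimnames(counts) <- list(file1, paste0("max.distance=", distances))
print(counts)
```

For this data the expected partners are 3 and 2 lines respectively, so a column showing exactly those counts is a candidate threshold; it is still worth checking the matched lines themselves with `value = TRUE`.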