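Output the part to the left or right of a matched string

Date: 2014-09-27 23:01:42

Tags: r perl

I have two files; file1 contains substrings of the lines in file2. I want to match file1 against file2 and output the part to the left of the match, not the match itself. I would also like to know how to output what is to the right of the match instead of the match itself. Here is part of my data (these particular strings may not match each other; it is only sample data):

file1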

 ACUGUACAGGCCACUGCCUUGC
 CUGCGCAAGCUACUGCCUUGCU
 UGGAAUGUAAAGAAGUAUGUAU
 CGAAUCAUUAUUUGCUGCUCUA
 AUCACAUUGCCAGGGAUUACC
 UUCACAGUGGCUAAGUUCUGC

file2

 CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
 CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
 GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
 CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
 GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC

Example:

file1:

                                                  GCUGUGGAGAUAACUGCGC

file2

  CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC

Output

  CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC

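3 answers:

Answer 0 (score: 1)

Open filehandles on the strings to be tested, then split each line of file2 on the corresponding line of file1: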

use strict;
use warnings;
use autodie;

open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n";
open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n";

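# Read both inputs in parallel, pairing line N of file1 with line N of file2
while ( !eof($fh1) && !eof($fh2) ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );

    # \Q...\E treats the pattern literally; the limit of 2 yields the left and right parts
    print join( ' ', split /\Q$line1\E/, $line2, 2 ), "\n";
}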

Output:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC
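
The loop above pairs line N of file1 with line N of file2. If the pairing is not known in advance, one possible variation is to test every pattern against every sequence. The following is only a sketch under that assumption: it uses `index` and `substr` for a purely literal search, `split_around` is a hypothetical helper name, and the sample data is limited to the first two lines of each file from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text to the left and to the right of the
# first literal occurrence of $pattern in $text, or an empty list if absent.
sub split_around {
    my ( $text, $pattern ) = @_;
    my $pos = index( $text, $pattern );
    return if $pos < 0;
    return ( substr( $text, 0, $pos ), substr( $text, $pos + length($pattern) ) );
}

# Sample data: the first two lines of file1 and file2 from the question.
my @patterns  = qw(ACUGUACAGGCCACUGCCUUGC CUGCGCAAGCUACUGCCUUGCU);
my @sequences = qw(
    CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
    CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
);

for my $pattern (@patterns) {
    for my $sequence (@sequences) {
        # Skip sequences that do not contain the pattern at all.
        my ( $left, $right ) = split_around( $sequence, $pattern ) or next;
        print "left: $left\tright: $right\n";
    }
}
```

Printing only `$left` or only `$right` gives the left-hand or right-hand part on its own. Comparing every pattern with every sequence costs more than the line-by-line pairing, so it is probably only worth it when the two files are not aligned.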

Answer 1 (score: 1)

Either of the following keeps only the text before the pattern (if the pattern is present):

a <- "GCUGUGGAGAUAACUGCGC"
b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"

strsplit(b, a)[[1]][1]
sub(paste0(a, ".*$"), "", b)

Now you just need to read the files into R and loop over each pattern. I'm not sure exactly what you are looking for, but here is one idea:

# read the data into 2 variables, a and b
# you could use readLines() to read from disk
a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"))

b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))

Now loop over each value from the first file:

lapply(a, function(x) sapply(strsplit(b, x), "[", 1))

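Answer 2 (score: 1)

You can also try the Perl code below, which uses $PREMATCH ($`), $POSTMATCH ($') and $MATCH ($&) to get the text before the match, the text after the match, and the match itself:

Input files:

file1.txt: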

ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC

file2.txt:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC

Code:

use strict;
use warnings;

open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt: $!";
open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt: $!";

# Read the two files in parallel, one line from each per iteration
while ( !eof($fh1) && !eof($fh2) ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );

    # \Q...\E matches the line from file1 as a literal string
    if ( $line2 =~ /\Q$line1\E/i ) {
        print "Prematch: $`\t";
        print "Postmatch: $'\n";
    }
}
close($fh1);
close($fh2);

Output:

Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA    Postmatch: CAGG
Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG
Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA  Postmatch: UUCAGGC
Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G
Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC
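
As a side note, on Perl versions before 5.20 the mere presence of $`, $& or $' in a program slows down every regex match. A possible alternative, assuming Perl 5.10 or later, is the /p modifier together with ${^PREMATCH} and ${^POSTMATCH}. The sketch below applies it to the single example pair from the question; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Example pair taken from the question.
my $pattern  = 'GCUGUGGAGAUAACUGCGC';
my $sequence = 'CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC';

my ( $left, $right );

# \Q...\E matches the pattern literally; /p makes ${^PREMATCH} and
# ${^POSTMATCH} available for this particular match.
if ( $sequence =~ /\Q$pattern\E/p ) {
    ( $left, $right ) = ( ${^PREMATCH}, ${^POSTMATCH} );
    print "Prematch: $left\n";
    print "Postmatch: $right\n";
}
```

For this pair, the prematch is the string the question lists as its desired output, and the postmatch is the remaining AAGC.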