perl regex: multiple matches as variables

Date: 2016-08-31 18:14:05

Tags: regex perl variables

I am not interested in how to use a variable in a regex search. Instead, I am curious how I can turn multiple regex matches into variables.

I have a file that looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893 
Length=10658

 Score = 33.7 bits (18),  Expect = 0.19
 Identities = 18/18 (100%), Gaps = 0/18 (0%)
 Strand=Plus/Minus

Query  3     CTATTTAAACCTAATCGG  20
             ||||||||||||||||||
Sbjct  10604  CTATTTAAACCTAATCGG  10587


>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727 
Length=4184

 Score = 33.7 bits (18),  Expect = 0.19
 Identities = 18/18 (100%), Gaps = 0/18 (0%)
 Strand=Plus/Plus

Query  3    CTATTTAAACCTAATCGG  20
            ||||||||||||||||||
Sbjct  85   CTATTTAAACCTAATCGG  102

My ultimate goal is to search this (very large) file and print only the lines that look like ">m160505_...", depending on the end position of the subject match (10587 and 102 in the example above). If the subject end position is within 500 of the value on the Length= line, or is itself less than 500, the >m... line gets printed. I realize this seems complicated, so looking at my code might help clarify things. This is what my code looks like so far:

use strict;
use warnings;

my $file = '/path/to/file.txt';
my $data;
{
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}
my @matches = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct  [0-9]+  [TAGC]+  ([0-9]+)/g;
foreach (@matches) {
    print "$_\n";
} 

This prints out something like the following:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
 10658 
 10587 
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727 
  4184 
  102 

From here I need to turn the regex matches into variables (flexible variables). I would like to be able to use them in something like the following:

 my $mVariable = "m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727"; 
 my $firstnumber = 10685; 
 my $secondnumber = 10587; 
 if ($firstnumber - $secondnumber < 500 || $secondnumber < 500) { 
      print $mVariable, "\n"; 
 } 

Thanks for your help! If I can clarify something please let me know.

3 answers:

Answer 0 (score: 2)

It's wasteful and unnecessary to read an entire file into memory, all the more so if it is a very large file.

My solution below sets the record separator to > so that the file can be read one chunk at a time. The variables that you describe are extracted from each chunk, and the remainder of the loop is skipped if any of them aren't found.

This program expects the path to the input file as a parameter on the command line.

use strict;
use warnings 'all';
use feature 'say';

local $/ = ">";

while ( <> ) {

    next unless my ($m_variable) = / ^ ( m \d+ .+ ) /x;
    next unless my ($length)     = / ^ Length=(\d+) /xm;
    next unless my ($end_pos)    = / ^ Sbjct \b .*  \b (\d+) /xm;

    if ( abs($length - $end_pos) < 500 or $end_pos < 500 ) {
        say $m_variable;
    }
}

output

m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893 
m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727 
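To experiment with this approach without a file on disk, the same loop can be wrapped in a sub and fed an in-memory filehandle. This is only a sketch: `select_headers`, the `$window` parameter and the shortened sample records are made up for illustration and are not part of the answer above. It also shows that `chomp` strips the trailing ">" from each chunk, because `chomp` removes whatever `$/` currently holds.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the header of every record whose subject end
# position is within $window of either end of the sequence.
sub select_headers {
    my ($fh, $window) = @_;
    local $/ = ">";                 # read one ">"-terminated chunk at a time
    my @selected;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;               # removes the trailing ">" (the current $/)
        next unless my ($header)  = $chunk =~ /^(m\d+\S*)/;
        next unless my ($length)  = $chunk =~ /^Length=(\d+)/m;
        next unless my ($end_pos) = $chunk =~ /^Sbjct\b.*\b(\d+)/m;
        if ( abs($length - $end_pos) < $window || $end_pos < $window ) {
            push @selected, $header;
        }
    }
    return @selected;
}

# Two abbreviated, made-up records in the same layout as the question's file.
my $sample = <<'END';
>m1_a|1|1_2 
Length=10658

Sbjct  10604  CTATTTAAACCTAATCGG  10587

>m2_b|2|3_4 
Length=4184

Sbjct  85   CTATTTAAACCTAATCGG  102
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for select_headers($fh, 500);
close $fh;
```

Both sample records pass the test: the first ends 71 positions from the end of the sequence, and the second ends at position 102.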

Answer 1 (score: 0)

Perl stores the results of captures in special numbered variables. The first capture group is $1, the second is $2, and so on. Their values are set every time a regex match succeeds (whether in a plain match or a substitution), so only rely on them after checking that the match succeeded.

So, in your case, you could do something like this:

my $string = "m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727";
if ($string =~ /^m(\d+)_(\d+)/) {
    if ($1 < 500 || $2 < 500) {
        do_something()
    }
}
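When several captures are involved, named captures (available since Perl 5.10) can be easier to follow than $1, $2 and $3; the values are read from the `%+` hash after a successful match. The sketch below applies that idea to a record laid out like the one in the question. The sub name `parse_record` and the group names `header`, `length` and `end` are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference with the three fields of one
# record, or nothing if the text does not look like a record.
sub parse_record {
    my ($text) = @_;
    return unless $text =~ /
        ^ > (?<header> m\S+ )          # header line, without the leading ">"
        .*? ^ Length= (?<length> \d+ ) # value of the Length= line
        .*? ^ Sbjct \s+ \d+ \s+ [ACGT]+ \s+ (?<end> \d+ )  # subject end position
    /xms;
    return { header => $+{header}, length => $+{length}, end => $+{end} };
}

my $record = <<'END';
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893 
Length=10658

 Score = 33.7 bits (18),  Expect = 0.19
 Identities = 18/18 (100%), Gaps = 0/18 (0%)
 Strand=Plus/Minus

Query  3     CTATTTAAACCTAATCGG  20
             ||||||||||||||||||
Sbjct  10604  CTATTTAAACCTAATCGG  10587
END

if ( my $r = parse_record($record) ) {
    if ( abs($r->{length} - $r->{end}) < 500 || $r->{end} < 500 ) {
        print "$r->{header}\n";
    }
}
```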

Answer 2 (score: 0)

In list context a successful match returns its captures ($1, $2, etc.), so you can assign them straight to a list of variables. So instead of:

my @matches = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct  [0-9]+  [TAGC]+  ([0-9]+)/g;

you could use this:

my ($m_var, $first, $second) = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct  [0-9]+  [TAGC]+  ([0-9]+)/g;
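One caveat with the list assignment: with /g in list context the match returns the captures of every record, and assigning to three variables keeps only the first three, that is, the first record. To visit every record, one option is to run the /g match in scalar context inside a while loop, where $1, $2 and $3 are refreshed on each pass. The following is a sketch under some assumptions: `headers_near_ends` and the sample headers are invented for illustration, and the fixed double spaces on the Sbjct line are relaxed to `\s+` because the spacing varies in the question's sample.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers): collects the header of every
# record whose subject end position is within $window of either end.
sub headers_near_ends {
    my ($data, $window) = @_;
    my @selected;

    # Scalar-context /g: each pass through the loop finds the next record
    # and refreshes $1, $2 and $3.
    while ( $data =~ />(m\S+) *\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct\s+[0-9]+\s+[TAGC]+\s+([0-9]+)/g ) {
        my ($header, $length, $end_pos) = ($1, $2, $3);
        if ( abs($length - $end_pos) < $window || $end_pos < $window ) {
            push @selected, $header;
        }
    }
    return @selected;
}

# Made-up sample: the first record ends near the end of the sequence,
# the second ends in the middle.
my $data = <<'END';
>m1_near_end 
Length=10658

 Score = 33.7 bits (18),  Expect = 0.19
 Identities = 18/18 (100%), Gaps = 0/18 (0%)
 Strand=Plus/Minus

Query  3     CTATTTAAACCTAATCGG  20
             ||||||||||||||||||
Sbjct  10604  CTATTTAAACCTAATCGG  10587


>m2_middle 
Length=4184

 Score = 33.7 bits (18),  Expect = 0.19
 Identities = 18/18 (100%), Gaps = 0/18 (0%)
 Strand=Plus/Plus

Query  3    CTATTTAAACCTAATCGG  20
            ||||||||||||||||||
Sbjct  2000  CTATTTAAACCTAATCGG  2017
END

print "$_\n" for headers_near_ends($data, 500);
```

With this sample only the first header is printed, since the second record's end position (2017) is neither below 500 nor within 500 of its length (4184).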