I am not interested in how to use a variable in a regex search. Instead, I am curious how I can turn multiple regex matches into variables.
I have a file that looks like this:
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658
Score = 33.7 bits (18), Expect = 0.19
Identities = 18/18 (100%), Gaps = 0/18 (0%)
Strand=Plus/Minus
Query 3 CTATTTAAACCTAATCGG 20
||||||||||||||||||
Sbjct 10604 CTATTTAAACCTAATCGG 10587
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
Length=4184
Score = 33.7 bits (18), Expect = 0.19
Identities = 18/18 (100%), Gaps = 0/18 (0%)
Strand=Plus/Plus
Query 3 CTATTTAAACCTAATCGG 20
||||||||||||||||||
Sbjct 85 CTATTTAAACCTAATCGG 102
My ultimate goal is to search this (very large) file and extract only the header lines that look like ">m160505_...", based on the end position of the subject match (10587 and 102 in the example above). If the end position of the subject is within 500 of the Length value, or if the end position is itself less than 500, the >m... line gets printed. I realize this seems complicated, so looking at my code might help clarify things. This is what my code looks like so far:
use strict;
use warnings;
my $file = '/path/to/file.txt';
my $data;
{
open my $fh, '<', $file or die;
local $/ = undef;
$data = <$fh>;
close $fh;
}
my @matches = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct [0-9]+ [TAGC]+ ([0-9]+)/g;
foreach (@matches) {
print "$_\n";
}
This prints out something like the following:
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
10658
10587
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
4184
102
From here I need to change things so that the regex matches are stored in variables (flexible variables). I would like to be able to use them in something like the following:
my $mVariable = "m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727";
my $firstnumber = 10658;
my $secondnumber = 10587;
if ($firstnumber - $secondnumber < 500 || $secondnumber < 500) {
print $mVariable, "\n";
}
Thanks for your help! If I can clarify something please let me know.
Answer 0 (score: 2)
It's wasteful and unnecessary to read an entire file into memory, more so if it is a very large file.
My solution below sets the record separator to > so that the file can be read one chunk at a time. The variables that you describe are extracted from each chunk, and the remainder of the loop is skipped if any of them isn't found.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
use feature 'say';
local $/ = ">";
while ( <> ) {
next unless my ($m_variable) = / ^ ( m \d+ .+ ) /x;
next unless my ($length) = / ^ Length=(\d+) /xm;
next unless my ($end_pos) = / ^ Sbjct \b .* \b (\d+) /xm;
if ( abs($length - $end_pos) < 500 or $end_pos < 500 ) {
say $m_variable;
}
}
m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
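To try the record-separator approach without a file on disk, the same loop can be run against an in-memory filehandle. This is only a sketch: the helper name select_headers is hypothetical, the sample is trimmed to the lines the patterns need, and the third record (m999_hypothetical) is made up so that one record fails the test.

```perl
use strict;
use warnings;

# Two records copied from the question (trimmed), plus one made-up record
# ("m999_hypothetical") whose end position is far from both ends.
my $sample = <<'END';
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658
Sbjct 10604 CTATTTAAACCTAATCGG 10587
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
Length=4184
Sbjct 85 CTATTTAAACCTAATCGG 102
>m999_hypothetical
Length=9000
Sbjct 4000 CTATTTAAACCTAATCGG 4017
END

# Hypothetical helper: returns the header of every record that passes the test.
sub select_headers {
    my ($fh) = @_;
    local $/ = '>';    # each read returns one ">"-terminated chunk
    my @kept;
    while ( my $chunk = <$fh> ) {
        next unless my ($header)  = $chunk =~ / ^ ( m \d+ .+ ) /x;
        next unless my ($length)  = $chunk =~ / ^ Length=(\d+) /xm;
        next unless my ($end_pos) = $chunk =~ / ^ Sbjct \b .* \b (\d+) /xm;
        if ( abs($length - $end_pos) < 500 or $end_pos < 500 ) {
            push @kept, $header;
        }
    }
    return @kept;
}

# Open a read handle on the string instead of on a file.
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @kept = select_headers($fh);
close $fh;

print "$_\n" for @kept;
```

With this sample, the two records from the question are kept (71 and 102 are both below 500) and the made-up record is dropped. Localizing $/ inside the sub keeps the changed record separator from leaking into the rest of a larger program.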
Answer 1 (score: 0)
Perl stores the results of captures in special numbered variables. The first capture group is $1, the second is $2, and so on. Their values are set every time a regex succeeds, whether in a plain match or in a substitution.
So, in your case, you could do something like this:
my $string = "m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727";
if ($string =~ /^m(\d+)_(\d+)/) {
if ($1 < 500 || $2 < 500) {
do_something()
}
}
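Applied to the data in the question, the numbered variables can be copied into named variables as soon as the match succeeds, because the next successful match will overwrite them. The following is a minimal sketch: the record is trimmed to three lines and the pattern is shortened to fit it, so both are assumptions rather than the asker's exact layout.

```perl
use strict;
use warnings;

# One record from the question, trimmed to the lines the pattern needs.
my $record = <<'END';
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658
Sbjct 10604 CTATTTAAACCTAATCGG 10587
END

my ($id, $length, $end_pos);
if ( $record =~ /^>(m\S+)\nLength=(\d+)\nSbjct \d+ [TAGC]+ (\d+)/ ) {
    # Copy the captures immediately: $1, $2 and $3 are reset by the
    # next successful match anywhere in this scope.
    ($id, $length, $end_pos) = ($1, $2, $3);
}
else {
    die "Record did not match the expected layout\n";
}

# The test described in the question, now on named variables.
if ( $length - $end_pos < 500 || $end_pos < 500 ) {
    print "$id\n";
}
```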
Answer 2 (score: 0)
When matching against a pattern, you can assign the captures ($1, $2, and so on) directly to a list of variables. So instead of:
my @matches = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct [0-9]+ [TAGC]+ ([0-9]+)/g;
you could use this:
my ($m_var, $first, $second) = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct [0-9]+ [TAGC]+ ([0-9]+)/g;
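Note that a list assignment to three variables keeps only the captures of the first match, even with /g. To visit every record, one option is to use the /g match in scalar context as a while condition, where each iteration resumes after the previous match. The sketch below assumes a trimmed sample and a shortened pattern; the variable names follow the line above.

```perl
use strict;
use warnings;

# Two records from the question, trimmed to the lines this shortened pattern needs.
my $data = <<'END';
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658
Sbjct 10604 CTATTTAAACCTAATCGG 10587
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
Length=4184
Sbjct 85 CTATTTAAACCTAATCGG 102
END

my @kept;

# In scalar context, /g finds one match per iteration and remembers
# where it stopped, so the loop walks through all records in turn.
while ( $data =~ />(m.+)\nLength=([0-9]+)\nSbjct [0-9]+ [TAGC]+ ([0-9]+)/g ) {
    my ($m_var, $first, $second) = ($1, $2, $3);
    if ( $first - $second < 500 || $second < 500 ) {
        push @kept, $m_var;
    }
}

print "$_\n" for @kept;
```

Writing the list assignment itself as the while condition would not work here, because a /g match in list context returns every match at once and starts over on each evaluation.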