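I have a 300 GB file and need only some of its lines. Of the lines shown below, I need only those that start with >miR.
I wrote a Perl program that prints the output I want on a small input, but when I run the same code on a larger file (lines similar to those shown below, up to 300 GB of data), the process gets killed while running. How should I proceed? Is there an alternative way to implement this?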
#!/usr/bin/perl -w
$len=@ARGV;
if($len eq 0){
print "Give file \n";
exit;
}
$file=$ARGV[0];
open(FH,$file) || die "cant open file\n";
@lines=<FH>;
close FH;
while ($line=<FH>){
chomp $line;
if ($line =~ /^>miR/){
$_=$line;
s/>//g && s/,//g;
print "$_\n";
if($_=~ /(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)/){
print $1,"\t",$2,"\t",$7,"\t",$3,"\n";
}
}
}
Forward: Score: 124.000000 Q:2 to 18 R:1 to 20 Align Len (17) (64.71%) (82.35%)
Query: 3' gaauAUUCGUUAG-AAUGGUAa 5'
|:: :|||| || ||||
Ref: 5' --ctTGGTTAATCATTCCCATt 3'
Energy: -10.480000 kCal/Mol
Scores for this hit:
>miR844a AT2G33810, 124.00 -10.48 2 18 1 20 17 64.71% 82.35%
Forward: Score: 120.000000 Q:2 to 19 R:289 to 308 Align Len (17) (64.71%) (76.47%)
Query: 3' gaaUAUUCGUUAGAAUGGUAa 5'
||::| || || ||||
Ref: 5' ttgATGGG-AAAATTTCCATt 3'
Energy: -9.850000 kCal/Mol
Scores for this hit:
>miR844a AT2G33810, 120.00 -9.85 2 19 289 308 17 64.71% 76.47%
Forward: Score: 118.000000 Q:2 to 19 R:483 to 503 Align Len (17) (64.71%) (82.35%)
Query: 3' gaaUAUUCGUUAGAAUGGUAa 5'
:||: |||| ||:|||
Ref: 5' gggGTAGAAAATCATATCATa 3'
Answer 0 (score: 2)
We can set local $/ = '>' (as the input record separator) and then use it as follows:
use Modern::Perl;
{
local $/ = '>';
while (<DATA>){
next if !/^miR/;
s/,//g;
my($var0, $var1, $var2, $var6) = (split ' ', $_, 8)[0..2, 6];
say"$var0,\t$var1,\t$var6,\t$var2";
}
}
__DATA__
>miR844a AT2G33810, 124.00 -10.48 2 18 1 20 17 64.71% 82.35%
Forward: Score: 120.000000 Q:2 to 19 R:289 to 308 Align Len (17) (64.71%) (76.47%)
Query: 3' gaaUAUUCGUUAGAAUGGUAa 5'
||::| || || ||||
Ref: 5' ttgATGGG-AAAATTTCCATt 3'
Energy: -9.850000 kCal/Mol
Scores for this hit:
>moR844a AT2G33810, 120.00 -9.85 2 19 289 308 17 64.71% 76.47%
Forward: Score: 118.000000 Q:2 to 19 R:483 to 503 Align Len (17) (64.71%) (82.35%)
Query: 3' gaaUAUUCGUUAGAAUGGUAa 5'
:||: |||| ||:|||
Ref: 5' gggGTAGAAAATCATATCATa 3'
>miR844a AT2G33810, 120.00 -9.85 2 19 289 308 17 64.71% 76.47%
Forward: Score: 118.000000 Q:2 to 19 R:483 to 503 Align Len (17) (64.71%) (82.35%)
Query: 3' gaaUAUUCGUUAGAAUGGUAa 5'
:||: |||| ||:|||
Ref: 5' gggGTAGAAAATCATATCATa 3'
Output:
miR844a, AT2G33810, 1, 124.00
miR844a, AT2G33810, 289, 120.00
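If the current record does not start with "miR", skip to the next record (records are delimited by ">"). Otherwise, remove any commas, then split the record to get the values you are after (taken from your regex).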
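The script in the question is most likely killed because `@lines=<FH>` loads the entire file into memory before anything is processed. If you would rather keep the default line-based reading, here is a minimal sketch that handles one line at a time, so memory use stays small whatever the file size. The helper names `format_hit` and `filter_hits` are made up for this illustration. It also assumes you only want the four tab-separated fields, so it does not also print the cleaned-up line the way the original script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one ">miR..." score line into a tab-separated record.
# Returns undef for any other line. (Hypothetical helper name.)
sub format_hit {
    my ($line) = @_;
    return unless $line =~ /^>miR/;
    $line =~ tr/>,//d;                  # drop the '>' and the commas
    my @f = split ' ', $line;           # split on runs of whitespace
    return unless @f >= 11;             # the original regex also required 11 fields
    return join "\t", @f[0, 1, 6, 2];   # same order as $1, $2, $7, $3
}

# Read the input one line at a time instead of slurping it into an array.
# (Hypothetical helper name.)
sub filter_hits {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my $record = format_hit($line);
        print {$out} "$record\n" if defined $record;
    }
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    filter_hits($fh, \*STDOUT);
    close $fh;
}
```

Assuming the script is saved as filter_hits.pl (a placeholder name), it could be run as `perl filter_hits.pl input.txt > hits.tsv`. Keeping the per-line logic in its own subroutine makes it easy to check against a few sample lines before starting on the full file.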
Hope this helps!