How to match multiple options and extract them in Perl?

Time: 2016-08-05 19:50:33

Tags: perl

I have a txt file with taxonomic assignments, such as:

#name_file

Bacteria;WS3;PRR-12;SSS58A 0.0 0.12 0.6

Bacteria;WS3;PRR-12;Sediment-1 0.5 0.1 0.3

Bacteria;Terrabacteria_group;Firmicutes;Bacilli; unclassified_Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X 0.1 0.3 0.5

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X._Incertae_Sedis;Thermicanus 0.4 0.13 0.9

Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae 0.1 0.2 0.6

Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae;BD2-6 0.0 0.0 0.6

Bacteria;PVC_group;Lentisphaerae;Lentisphaeria;Lentisphaerales 0.7 0.2 0.1

So from each line I want to extract the first word matching "ales", plus the second one only when it ends in ales_incertae_sedis. The printed output would be:

Bacillales
Bacillales;Bacillales_incertae_sedis 
Bacillales;Bacillales_incertae_sedis
Nitrospirales
Nitrospirales
Lentisphaerales

But not the third:

Bacillales;Bacillales_incertae_sedis;Bacillales_Family

I tried:

use strict;
use warnings;
use Getopt::Long;

my $infile;
GetOptions (
    'i=s'       =>\$infile,
);

open INFILE, '<', $infile or die "cant open file $infile";
open OUTFILE, '>', "results.txt" or die "cant open";

while ( <INFILE>) {
    my $line = $_;
    chomp($line);
    if ($line=~ m/^#/) {
        next;
    }
    elsif ($line=~ m/^$/){
        next;
    }

    elsif($line){
        # first column is the lineage, the remaining three are values
        my ($taxon, $val1, $val2, $val3) = split(/\t/,$line);
    #here is the problem 
        my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);
        foreach (@orden){
           if ($_=~m/^$/g){
               next;
           }
           elsif ($_=~ m/^unclassified/g){
               next;
           }
           else {
               print OUTFILE "$_\n";
           }
       }
   }
}
close INFILE;            
close OUTFILE;
exit;

My problem is this line:

my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);

I tried to match the multiple options with:

my (@orden) = ($taxon=~ m/(\w*ales)[\;]?(;\w*ales_incertae_sedis)/g);
my (@orden) = ($taxon=~ m/(\w*ales[;\w*ales_incertae_sedis]?)[\;]?/g);

But it doesn't work.

Thank you very much

1 Answer:

Answer 0: (score: 0)

Try this

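use warnings;
use strict;
my $m;

# INFILE is the filehandle opened in the question's script
while ( <INFILE> )
{
    # First alternative: "<order>ales;<next rank>;" when "family" appears later in the line.
    # Second alternative: a lone "<order>ales;".
    # Both captures keep their trailing ";".
    if($_=~/(?:([a-z]+ales;[^;]+;).+?family|(\w+ales;))/i )
    {
            $m = $1 || $2;
            # Skip names starting with "unc" (e.g. unclassified_...)
            print "$m\n" if($m!~/^unc/);
    }
}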

In the code above, I used a non-capturing group, (?:).
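
To make the difference concrete, here is a small sketch (the sample string and variable names are only for illustration) comparing a capturing group with a non-capturing one. Both patterns match the same text, but only the parenthesised groups without ?: end up in the returned list.

```perl
use strict;
use warnings;

# Illustrative input, taken from one of the lineages in the question.
my $s = "Bacillales;Bacillales_incertae_sedis";

# Two capturing groups: both pieces are returned.
my @cap = $s =~ /(\w+ales)(;\w+_sedis)/;

# The second group is non-capturing: it must still match,
# but only the first piece is returned.
my @non = $s =~ /(\w+ales)(?:;\w+_sedis)/;

print scalar(@cap), "\n";   # prints 2
print scalar(@non), "\n";   # prints 1
```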

For more information about non-capturing groups, see this answer.
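
If a single regex becomes hard to get right, one alternative is to split the lineage on ";" and inspect the ranks one at a time. The sketch below uses a different technique from the regex above, and the helper name extract_order is hypothetical. It assumes the columns are tab-separated (as the split(/\t/) in the question suggests) and that only the first rank ending in "ales" matters.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original answer.
# Returns the first rank ending in "ales" (skipping unclassified_* ranks),
# with the next rank appended when it ends in "ales_incertae_sedis".
# Returns undef when the lineage has no such rank.
sub extract_order {
    my ($taxon) = @_;
    my @ranks = split /\s*;\s*/, $taxon;   # tolerate stray spaces around ';'
    for my $i (0 .. $#ranks) {
        next unless $ranks[$i] =~ /ales$/;
        next if $ranks[$i] =~ /^unclassified/;
        my $out = $ranks[$i];
        if ($i < $#ranks && $ranks[$i + 1] =~ /ales_incertae_sedis$/) {
            $out .= ";$ranks[$i + 1]";
        }
        return $out;
    }
    return undef;
}

# A few lines from the question, assumed to be tab-separated.
my @lines = (
    "#name_file",
    "Bacteria;WS3;PRR-12;SSS58A\t0.0\t0.12\t0.6",
    "Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Vulcanibacillu\t0.2\t0.2\t0.6",
    "Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X\t0.1\t0.3\t0.5",
    "Bacteria;PVC_group;Lentisphaerae;Lentisphaeria;Lentisphaerales\t0.7\t0.2\t0.1",
);

for my $line (@lines) {
    next if $line =~ /^#/ or $line =~ /^\s*$/;   # skip header and blank lines
    my ($taxon) = split /\t/, $line;             # first column is the lineage
    my $order = extract_order($taxon);
    print "$order\n" if defined $order;
}
```

For these sample lines it prints Bacillales, Bacillales;Bacillales_incertae_sedis and Lentisphaerales, one per line and without a trailing ";". To run it on the real file, replace the @lines loop with a loop over the input filehandle.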