Tricky regex grouping

Date: 2016-10-06 12:14:40

Tags: regex perl

Consider this regular expression:

gene_id\t"(\w+.\d+)"|transcript_id\t"(\w+.\d+)"|gene_name\t"(\w+.\d+)"|transcript_name\t("\S+)

and this text:

chr1    HAVANA  exon    183647567       183647797       .       -       .       gene_id "ENSG00000173627.7"     transcript_id   "ENST00000481562.1"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "APOBEC4"       transcript_type "processed_transcript"  transcript_status       "KNOWN" transcript_name "APOBEC4-002"   exon_number     2       exon_id "ENSE00001907807.1"     level   2       transcript_support_level        "3"     havana_gene     "OTTHUMG00000035459.2"  havana_transcript       "OTTHUMT00000086127.1"
chr1    HAVANA  gene    183646404       183653316       .       -       .       gene_id "ENSG00000173627.7"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "APOBEC4"               level   2       havana_gene     "OTTHUMG00000035459.2"
chr12   HAVANA  gene    28133249        28581511        .       +       .       gene_id "ENSG00000123106.9"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "CCDC91"                level   2       tag     "ncRNA_host"    havana_gene     "OTTHUMG00000169141.2"

When I use the capture groups in Perl, I can get $1 but not $2 or $3. Any ideas?

3 answers:

Answer 0: (score: 3)

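You are using the | operator, so each match goes through exactly one alternative and only that alternative's group is captured. Why would $2 and $3 be captured as well?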
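
To make this concrete, here is a minimal sketch that applies an alternation to a shortened version of the first sample line in a /g loop. The pattern is a simplified stand-in for the one in the question (it uses \s+ and [^"]+, which are assumptions of this sketch, not the asker's exact expression), and the names $line, $re and @seen are made up for the illustration.

```perl
use strict;
use warnings;

# Shortened version of the first sample line, fields separated by tabs.
my $line = join "\t",
    'gene_id',       '"ENSG00000173627.7"',
    'transcript_id', '"ENST00000481562.1"',
    'gene_name',     '"APOBEC4"';

# Simplified stand-in for the question's pattern (assumption):
# one capture group per alternative.
my $re = qr/gene_id\s+"([^"]+)"|transcript_id\s+"([^"]+)"|gene_name\s+"([^"]+)"/;

my @seen;
while ( $line =~ /$re/g ) {
    # Only the group of the alternative that matched is defined;
    # the groups of the other alternatives are undef for this match.
    my @groups = map { $_ // 'undef' } ( $1, $2, $3 );
    push @seen, join ',', @groups;
    print "match: @groups\n";
}
```

Each pass through the loop fills a different one of $1, $2 and $3, which is why a single match never delivers all of them at once.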

Answer 1: (score: 3)

I would probably tackle this differently. May I suggest something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

#field names
my @desired_fields = qw ( gene_id transcript_id gene_name transcript_name );

while (<DATA>) {
   #match 'word' and 'quoted word' and select into a hash. 
   my %key_values = m/(\w+)\s+\"([^\"]+)\"/g;
   #print what we captured for debugging reasons:
   print Dumper \%key_values;

   #print line number
   print "Line: $.\n";
   #iterate @desired fields, print a line if it's defined. 
   for (@desired_fields) {
      print "$_ => $key_values{$_}\n" if defined $key_values{$_};
   }
}


__DATA__
chr1    HAVANA  exon    183647567       183647797       .       -       .       gene_id "ENSG00000173627.7"     transcript_id   "ENST00000481562.1"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "APOBEC4"       transcript_type "processed_transcript"  transcript_status       "KNOWN" transcript_name "APOBEC4-002"   exon_number     2       exon_id "ENSE00001907807.1"     level   2       transcript_support_level        "3"     havana_gene     "OTTHUMG00000035459.2"  havana_transcript       "OTTHUMT00000086127.1"
chr1    HAVANA  gene    183646404       183653316       .       -       .       gene_id "ENSG00000173627.7"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "APOBEC4"               level   2       havana_gene     "OTTHUMG00000035459.2"
chr12   HAVANA  gene    28133249        28581511        .       +       .       gene_id "ENSG00000123106.9"     gene_type       "protein_coding"        gene_status     "KNOWN" gene_name       "CCDC91"                level   2       tag     "ncRNA_host"    havana_gene     "OTTHUMG00000169141.2"
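
The script above relies on a /g match in list context returning every captured key and value as one flat list, which can be assigned straight to a hash. A minimal sketch of just that step follows, assuming a shortened, tab-separated version of the third sample line; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Shortened version of the third sample line (assumption: tab-separated).
my $line = join "\t",
    'gene_id',   '"ENSG00000123106.9"',
    'gene_type', '"protein_coding"',
    'level',     2,
    'gene_name', '"CCDC91"';

# In list context, m//g returns all captures of all matches as a flat list:
# key, value, key, value, ...
my @pairs = $line =~ /(\w+)\s+"([^"]+)"/g;

# A flat key/value list assigns directly to a hash.
my %kv = @pairs;

# A hash slice pulls out just the fields of interest.
my ( $id, $name ) = @kv{qw(gene_id gene_name)};

print "gene_id => $id\n";
print "gene_name => $name\n";
```

Because the pattern requires a quoted value, unquoted fields such as level are skipped and never reach the hash.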

Answer 2: (score: 0)

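If you want to pick up all the groups in a single match, you have to wrap them in a non-capturing group and add a quantifier. That means you also have to account for the fields you don't care about, as well as the intervening whitespace. This regex (written in free-spacing form, so it needs the /x modifier) works on your sample: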

(?:
  \h+
  (?:
    gene_id\h+"([^"]+)"         |
    transcript_id\h+"([^"]+)"   |
    gene_name\h+"([^"]+)"       |
    transcript_name\h+"([^"]+)" |
    \w+\h+\S+
  )
)+
$


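Note that this will match even if none of the fields you are interested in is present. If the gene_id field is required and always comes first, as in your sample, you can make the regex more precise and more efficient: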

gene_id\h+"([^"]+)"
(?:
  \h+
  (?:
    transcript_id\h+"([^"]+)"   |
    gene_name\h+"([^"]+)"       |
    transcript_name\h+"([^"]+)" |
    \w+\h+\S+
  )
)+
$
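
The difference between the two patterns can be checked with a short sketch. It compiles both with the /x modifier (\h needs Perl 5.10 or later) and tests them against two shortened, tab-separated lines that are assumptions of this example: one with a gene_id field and one without any of the fields of interest. The names $loose, $strict, $gene_line and $other_line are made up for the illustration.

```perl
use strict;
use warnings;

# First pattern: every field of interest is optional.
my $loose = qr{
    (?:
      \h+
      (?:
        gene_id\h+"([^"]+)"         |
        transcript_id\h+"([^"]+)"   |
        gene_name\h+"([^"]+)"       |
        transcript_name\h+"([^"]+)" |
        \w+\h+\S+
      )
    )+
    $
}x;

# Second pattern: gene_id is required and must come first.
my $strict = qr{
    gene_id\h+"([^"]+)"
    (?:
      \h+
      (?:
        transcript_id\h+"([^"]+)"   |
        gene_name\h+"([^"]+)"       |
        transcript_name\h+"([^"]+)" |
        \w+\h+\S+
      )
    )+
    $
}x;

# Shortened sample lines (assumption: tab-separated, fewer attributes).
my @prefix     = ( 'chr1', 'HAVANA', 'gene', '183646404', '183653316', '.', '-', '.' );
my $gene_line  = join "\t", @prefix, 'gene_id', '"ENSG00000173627.7"',
                            'gene_type', '"protein_coding"',
                            'gene_name', '"APOBEC4"', 'level', '2';
my $other_line = join "\t", @prefix, 'gene_type', '"protein_coding"', 'level', '2';

# The loose pattern accepts a line with none of the wanted fields.
my $loose_other  = $other_line =~ $loose  ? 1 : 0;
# The strict pattern rejects it because gene_id is missing.
my $strict_other = $other_line =~ $strict ? 1 : 0;
# On a line with gene_id, the strict pattern matches and $1 holds the id.
my $gene_id = $gene_line =~ $strict ? $1 : undef;

print "loose matches line without fields:  $loose_other\n";
print "strict matches line without fields: $strict_other\n";
print "gene_id from strict pattern: $gene_id\n";
```

The sketch only reads $1 from the strict pattern, where the gene_id group sits outside the repeated group, so its value does not depend on how captures inside a repetition are handled.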
