Consider this regular expression:
gene_id\t"(\w+.\d+)"|transcript_id\t"(\w+.\d+)"|gene_name\t"(\w+.\d+)"|transcript_name\t("\S+)
and this text:
chr1 HAVANA exon 183647567 183647797 . - . gene_id "ENSG00000173627.7" transcript_id "ENST00000481562.1" gene_type "protein_coding" gene_status "KNOWN" gene_name "APOBEC4" transcript_type "processed_transcript" transcript_status "KNOWN" transcript_name "APOBEC4-002" exon_number 2 exon_id "ENSE00001907807.1" level 2 transcript_support_level "3" havana_gene "OTTHUMG00000035459.2" havana_transcript "OTTHUMT00000086127.1"
chr1 HAVANA gene 183646404 183653316 . - . gene_id "ENSG00000173627.7" gene_type "protein_coding" gene_status "KNOWN" gene_name "APOBEC4" level 2 havana_gene "OTTHUMG00000035459.2"
chr12 HAVANA gene 28133249 28581511 . + . gene_id "ENSG00000123106.9" gene_type "protein_coding" gene_status "KNOWN" gene_name "CCDC91" level 2 tag "ncRNA_host" havana_gene "OTTHUMG00000169141.2"
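When I use capture groups in Perl, I can get $1 but not $2 and $3. Any ideas?
Answer 0 (score: 3)
You are using the | operator, so only one group is captured per match: the one inside whichever alternative matched. So why would you expect $2 and $3 to be captured?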
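A minimal sketch of that behaviour, under some assumptions: the helper name captured_groups and the shortened sample line are made up for illustration, \s+ stands in for \t because the sample as pasted is space-separated, and the dot is escaped. Matching in a loop with /g shows that every successful match fills exactly one group and leaves the others undefined.

```perl
use strict;
use warnings;

# Simplified form of the question's alternation (assumption: \s+ instead of
# \t, because the sample text as pasted uses spaces, and an escaped dot).
my $re = qr/gene_id\s+"(\w+\.\d+)"|transcript_id\s+"(\w+\.\d+)"|gene_name\s+"(\w+)"|transcript_name\s+"(\S+)"/;

# Hypothetical helper: for every match on the line, record which group
# number was set and what it captured.
sub captured_groups {
    my ($line) = @_;
    my @hits;
    while ( $line =~ /$re/g ) {
        my @groups = ( $1, $2, $3, $4 );
        for my $i ( 0 .. $#groups ) {
            # Only the group of the alternative that matched is defined.
            push @hits, [ $i + 1, $groups[$i] ] if defined $groups[$i];
        }
    }
    return @hits;
}

my $line = 'gene_id "ENSG00000173627.7" transcript_id "ENST00000481562.1" gene_name "APOBEC4"';
printf "\$%d = %s\n", @$_ for captured_groups($line);
```

Without /g the loop body would only ever see the first match, which is the gene_id alternative, and that is why only $1 is set in the question.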
Answer 1 (score: 3)
I would probably tackle this a different way. May I suggest something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

#field names
my @desired_fields = qw ( gene_id transcript_id gene_name transcript_name );

while (<DATA>) {
    #match 'word' and 'quoted word' and select into a hash.
    my %key_values = m/(\w+)\s+\"([^\"]+)\"/g;

    #print what we captured for debugging reasons:
    print Dumper \%key_values;

    #print line number
    print "Line: $.\n";

    #iterate @desired fields, print a line if it's defined.
    for (@desired_fields) {
        print "$_ => $key_values{$_}\n" if defined $key_values{$_};
    }
}
__DATA__
chr1 HAVANA exon 183647567 183647797 . - . gene_id "ENSG00000173627.7" transcript_id "ENST00000481562.1" gene_type "protein_coding" gene_status "KNOWN" gene_name "APOBEC4" transcript_type "processed_transcript" transcript_status "KNOWN" transcript_name "APOBEC4-002" exon_number 2 exon_id "ENSE00001907807.1" level 2 transcript_support_level "3" havana_gene "OTTHUMG00000035459.2" havana_transcript "OTTHUMT00000086127.1"
chr1 HAVANA gene 183646404 183653316 . - . gene_id "ENSG00000173627.7" gene_type "protein_coding" gene_status "KNOWN" gene_name "APOBEC4" level 2 havana_gene "OTTHUMG00000035459.2"
chr12 HAVANA gene 28133249 28581511 . + . gene_id "ENSG00000123106.9" gene_type "protein_coding" gene_status "KNOWN" gene_name "CCDC91" level 2 tag "ncRNA_host" havana_gene "OTTHUMG00000169141.2"
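Answer 2 (score: 0)
If you want to pick up all the groups in one match, you have to wrap them in a non-capturing group and add a quantifier. That means you also have to account for the fields you don't care about, as well as the intervening whitespace. This regex works for your sample (it is written in free-spacing form, so it needs the /x modifier):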
(?:
  \h+
  (?:
    gene_id\h+"([^"]+)" |
    transcript_id\h+"([^"]+)" |
    gene_name\h+"([^"]+)" |
    transcript_name\h+"([^"]+)" |
    \w+\h+\S+
  )
)+
$
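Note that this will match even if none of the fields you're interested in are present. If the gene_id field is required and always comes first, as it does in the sample, you can make the regex more precise and more efficient: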
gene_id\h+"([^"]+)"
(?:
  \h+
  (?:
    transcript_id\h+"([^"]+)" |
    gene_name\h+"([^"]+)" |
    transcript_name\h+"([^"]+)" |
    \w+\h+\S+
  )
)+
$
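As a rough sketch of how the second pattern could be used from Perl: the sub name extract_fields and the variable $fields_re are made up for illustration, the pattern is compiled with /x so the layout whitespace is ignored, and \h needs Perl 5.10 or later. It also assumes the attributes are separated by plain whitespace as in the pasted sample. If the real file puts a ; after each value, the pattern would need adjusting.

```perl
use strict;
use warnings;

# The gene_id-first pattern, compiled with /x (free-spacing).
# Groups: $1 gene_id, $2 transcript_id, $3 gene_name, $4 transcript_name.
my $fields_re = qr/
    gene_id\h+"([^"]+)"
    (?:
        \h+
        (?:
            transcript_id\h+"([^"]+)"   |
            gene_name\h+"([^"]+)"       |
            transcript_name\h+"([^"]+)" |
            \w+\h+\S+
        )
    )+
    $
/x;

# Hypothetical helper: returns a hash ref of the four fields (values are
# undef for fields missing from the line), or nothing if the line does
# not match at all.
sub extract_fields {
    my ($line) = @_;
    return unless $line =~ $fields_re;
    my %fields;
    @fields{qw(gene_id transcript_id gene_name transcript_name)} = ( $1, $2, $3, $4 );
    return \%fields;
}

my $sample = 'chr12 HAVANA gene 28133249 28581511 . + . gene_id "ENSG00000123106.9" gene_type "protein_coding" gene_status "KNOWN" gene_name "CCDC91" level 2 tag "ncRNA_host" havana_gene "OTTHUMG00000169141.2"';

if ( my $fields = extract_fields($sample) ) {
    for my $name (qw(gene_id transcript_id gene_name transcript_name)) {
        print "$name => $fields->{$name}\n" if defined $fields->{$name};
    }
}
```

This relies on Perl keeping a group's last captured value across repetitions of the quantified group, so $3 still holds the gene name after the later iterations have matched unrelated fields.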