This is part of the GenBank file I'm using as input:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
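I want to extract the journal name, for example J. Gen. Virol. (without the issue number and pages).
Here is my code. It produces no output, so I'd like to know what is wrong. I did use parentheses with $1, $2, etc., and although that worked, my instructor told me to try it without that method and to use substr instead.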
foreach my $line (@lines) {
    if ( $line =~ m/JOURNAL/g ) {
        $journal_line = $line;
        $character = substr( $line, $index, 2 );
        if ( $character =~ m/\s\d/ ) {
            print substr( $line, 12, $index - 13 );
            print "\n";
        }
        $index++;
    }
}
Answer 0 (score: 4)
Another approach is to use BioPerl to parse the GenBank file:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
    print $annotation->location . "\n";
}
If you save AC_000005.1 in a file named AC_000005.1.gb and run this script, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
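The strings printed above still carry the volume, issue, and pages. One way to reduce them to the journal name the question asks for is to cut each string at its first digit. This is only a sketch in plain Perl: the helper name `trim_to_journal_name` is made up for illustration, and it assumes journal names contain no digits. Entries such as "Submitted (...)" are not journal citations and would need separate handling.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of a location string that comes
# before its first digit, with trailing whitespace removed.
# Assumption: the journal name itself contains no digits.
sub trim_to_journal_name {
    my ($location) = @_;
    my $end = length $location;    # default: keep everything
    for my $ix (0 .. length($location) - 1) {
        if (substr($location, $ix, 1) =~ /\d/) {
            $end = $ix;            # position of the first digit
            last;
        }
    }
    my $name = substr($location, 0, $end);
    $name =~ s/\s+$//;             # drop the space before the volume number
    return $name;
}

print trim_to_journal_name('J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)'), "\n";
print trim_to_journal_name('Nucleic Acids Res. 9 (23), 6571-6589 (1981)'), "\n";
```

In the BioPerl loop, the same helper could be applied to the value of `$annotation->location` before printing.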
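Answer 1 (score: 1)
It is easier to use a single regular expression to match the whole JOURNAL line and use parentheses to capture the text that holds the journal information, rather than matching and then using substr: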
foreach my $line (@lines) {
    if ($line =~ /JOURNAL\s+(.+)/) {
        print "Journal information: $1\n";
    }
}
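The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters on the line.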
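If only the journal name is wanted, as in the question, the same idea can be narrowed with a non-greedy capture that stops before the first number. This is a minimal sketch: the helper name `journal_name_regex` and the sample line are illustrative, and it assumes the name contains no digits and is followed by whitespace and a volume number.

```perl
use strict;
use warnings;

# Hypothetical helper: return the journal name from a JOURNAL line,
# or undef if the line does not match.
# Assumption: the name has no digits and a volume number follows it.
sub journal_name_regex {
    my ($line) = @_;
    # (.+?) is non-greedy, so the capture ends at the first
    # run of whitespace that is followed by a digit
    if ($line =~ /JOURNAL\s+(.+?)\s+\d/) {
        return $1;
    }
    return;
}

my $sample = '  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)';
my $name = journal_name_regex($sample);
print "Journal name: $name\n" if defined $name;
```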
To get the text without using $1, I think you are trying to do something like this:
if ($line =~ /JOURNAL/) {
    # start just after the word JOURNAL, wherever it sits in the line
    my $ix = index($line, 'JOURNAL') + length('JOURNAL');
    # variable containing the journal name
    my $j_name;
    # while the journal name is not defined and there is text left...
    while (!defined $j_name && $ix < length($line)) {
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\s/) {
            # if it is whitespace, increase $ix by one
            $ix++;
        }
        else {
            # if it isn't whitespace, we've found the text!
            $j_name = substr($line, $ix);
        }
    }
    print "Journal information: $j_name\n" if defined $j_name;
}
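If you already know how many characters are in the left-hand column, you can use substr($line, 12) (or whatever the width is) to retrieve the substring of $line starting at character 12: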
foreach my $line (@lines) {
    if ($line =~ /JOURNAL/) {
        print "Journal information: " . substr($line, 12) . "\n";
    }
}
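When the label column has a known fixed width, `unpack` is another way to split a line into label and value without capture groups. The sketch below is not from the original answer: the helper name `split_genbank_line` is hypothetical, and the width of 12 is an assumption carried over from the substr($line, 12) example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its 12-character label column
# and the remaining value. The width of 12 is an assumption.
sub split_genbank_line {
    my ($line) = @_;
    # 'A12' reads up to 12 characters and strips trailing spaces;
    # 'A*' reads the rest of the line.
    my ($label, $value) = unpack('A12 A*', $line);
    $label =~ s/^\s+//;    # drop any indent before the keyword
    return ($label, $value);
}

my ($label, $value) =
    split_genbank_line('  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)');
print "Journal information: $value\n" if $label eq 'JOURNAL';
```

Comparing the label with `eq` also avoids matching lines that merely contain the word JOURNAL somewhere in their text.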
You can combine the two techniques to strip the issue number and date from the journal data:
if ($line =~ /JOURNAL/) {
    my $j_name;
    my $digit;
    my $indent = 12;    # the width of the left-hand column
    my $ix = $indent;   # we'll use this to track the characters in our loop
    # stop at the first digit, or at the end of the line if there is none
    while (!defined $digit && $ix < length($line)) {
        # starting with $ix = the length of the indent,
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\d/) {
            # if it is a digit, we've found the number of the journal
            # we can stop looping now. Whew!
            $digit = $ix;
            # set j_name
            # get a substring of $line starting at $indent going to $digit
            # (i.e. of length $digit - $indent)
            $j_name = substr($line, $indent, $digit - $indent);
        }
        $ix++;
    }
    print "Journal information: $j_name\n" if defined $j_name;
}
I think it would be easier to get the data from the PubMed API! ;)