I have some text from which I need to extract certain parts (see the script below).
The text is as follows (also on pastebin):
AceView: gene:1700049G17Rik, a comprehensive annotation of human, mouse and worm genes with mRNAs or ESTsAceView.
<META NAME="title"
CONTENT="
AceView: gene:1700049G17Rik a comprehensive annotation of human, mouse and worm genes with mRNAs or EST">
<META NAME="keywords"
CONTENT="
AceView, genes, Acembly, AceDB, Homo sapiens, Human,
nematode, Worm, Caenorhabditis elegans , WormGenes, WormBase, mouse,
mammal, Arabidopsis, gene, alternative splicing variant, structure,
sequence, DNA, EST, mRNA, cDNA clone, transcript, transcription, genome,
transcriptome, proteome, peptide, GenBank accession, dbest, RefSeq,
LocusLink, non-coding, coding, exon, intron, boundary, exon-intron
junction, donor, acceptor, 3'UTR, 5'UTR, uORF, poly A, poly-A site,
molecular function, protein annotation, isoform, gene family, Pfam,
motif ,Blast, Psort, GO, taxonomy, homolog, cellular compartment,
disease, illness, phenotype, RNA interference, RNAi, knock out mutant
expression, regulation, protein interaction, genetic, map, antisense,
trans-splicing, operon, chromosome, domain, selenocysteine, Start, Met,
Stop, U12, RNA editing, bibliography">
<META NAME="Description"
CONTENT= "
AceView offers a comprehensive annotation of human, mouse and nematode genes
reconstructed by co-alignment and clustering of all publicly available
mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable
up-to-date resource on the genes, their functions, alternative variants,
expression, regulation and interactions, in the hope to stimulate
further validating experiments at the bench
">
<meta name="author"
content="Danielle Thierry-Mieg and Jean Thierry-Mieg,
NCBI/NLM/NIH, mieg@ncbi.nlm.nih.gov">
<!--
var myurl="av.cgi?db=mouse" ;
var db="mouse" ;
var doSwf="s" ;
var classe="gene" ;
//-->
But I am stuck with the following script logic. What is the right way to achieve this?
#!/usr/bin/perl -w
my $INFILE_file_name = $file; # input file name
open ( INFILE, '<', $INFILE_file_name )
    or croak "$0 : failed to open input file $INFILE_file_name : $!\n";
my @allsum;
while ( <INFILE> ) {
    chomp;
    my $line = $_;
    my @temp1 = ();
    if ( $line =~ /^ AceView summary/ ) {
        print "$line\n";
        push @temp1, $line;
    }
    elsif( $line =~ /Please quote/) {
        push @allsum, [@temp1];
        @temp1 = ();
    }
    elsif ($line =~ /The closest human gene/) {
        push @allsum, $line;
    }
}
close ( INFILE ); # close input file
# Do something with @allsum
I need to process many files.
Answer 0 (score: 5)
You can use the range operator in scalar context to extract the whole paragraph:
while (<INFILE>) {
    chomp;
    if (/AceView summary/ .. /Please quote/) {
        print "$_\n";
    }
    print "$_\n" if /^The closest human gene/;
}
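Since the question mentions many files, here is a minimal sketch of how the same flip-flop could be wrapped in a sub and applied to each file named on the command line. The sub name `extract_sections` and the layout of the hash it returns are hypothetical choices of mine, not part of the question. The sketch assumes every summary block is closed by a "Please quote" line; the range operator keeps its state between calls, so an unterminated block in one file would carry over into the next.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to a list of lines and returns
# the summary block plus the "closest human gene" line, if there is one.
sub extract_sections {
    my ($lines) = @_;
    my (@summary, $closest);
    for (@$lines) {
        # True from the line matching the first pattern through the line
        # matching the second pattern, both inclusive.
        if (/AceView summary/ .. /Please quote/) {
            push @summary, $_;
        }
        $closest = $_ if /^The closest human gene/;
    }
    return { summary => \@summary, closest => $closest };
}

# Process every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "$0: cannot open $file: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;

    my $result = extract_sections(\@lines);
    print "== $file ==\n";
    print "$_\n" for @{ $result->{summary} };
    print "$result->{closest}\n" if defined $result->{closest};
}
```

Run it as, for example, `perl extract.pl *.txt` (the script name is only an illustration). To leave the two marker lines out of the result, the three-dot form `...` and an extra check on the matched line would be needed; the version above keeps them, like the loop in this answer.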
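Answer 1 (score: 4)
If I understand correctly, you are getting this information from http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=mouse&c=gene&a=fiche&l=1700049G17Rik, which presents me with one of the most godawful HTML messes I have seen (probably tied for first place with the output of the Medicare plan finders).
It is still no match for HTML::TokeParser::Simple, though: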
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('ace.html');
my ($summary, $closest_human);

# Find the <span class="hh3"> whose text is 'AceView summary'.
while ( my $tag = $parser->get_tag('span') ) {
    # Spans without a class attribute would otherwise trigger a warning.
    next unless ($tag->get_attr('class') // '') eq 'hh3';
    next unless $parser->get_text('/span') eq 'AceView summary';
    # The summary is the text up to the next <span>.
    $summary = $parser->get_text('span');
    $summary =~ s/^\s+//;
    $summary =~ s/\s*Please quote:.*\z//;
    last;
}

# Find the <b> element that introduces the closest human genes.
while ( my $tag = $parser->get_tag('b') ) {
    $closest_human = $parser->get_text('/b');
    next unless $closest_human eq 'The closest human genes';
    # Append the rest of the sentence, up to the next <br>.
    $closest_human .= $parser->get_text('br');
    last;
}

print "=== Summary ===\n\n$summary\n\n";
print "=== Closest Human Gene ==\n\n$closest_human\n";
Output (snipped):
=== Summary ===

Note that this locus is complex: it appears to produce several proteins with no sequence overlap. Expression: According to AceView, this gene is well expressed, ... Please see the Jackson Laboratory Mouse Genome Database/Informatics site MGI_1920680 for in depth functional annotation of this gene.

=== Closest Human Gene ==

The closest human genes, according to BlastP, are the AceView genes ZNF780AandZNF780B (e=10^-15,), ZNF766 (e=2 10^-15,), ZNF607andZNF781andZFP30 (e=2 10^-14,).
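Answer 2 (score: 1)
Off the top of my head, I would do the extraction part with a simple state machine. Start with $state = 0, set it to 1 on /AceView summary/, and set it back to 0 on /Please quote/. Then push $_ onto the output array if $state == 1.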
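A minimal sketch of that state machine, assuming the input has already been read into a list of chomped lines; the sub name `collect_summary` and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper implementing the state machine described above.
sub collect_summary {
    my (@lines) = @_;
    my $state = 0;    # 0 = outside the summary, 1 = inside it
    my @out;
    for my $line (@lines) {
        $state = 1 if $line =~ /AceView summary/;
        $state = 0 if $line =~ /Please quote/;
        push @out, $line if $state == 1;
    }
    return @out;
}

# Made-up sample input, only to show the call.
my @summary = collect_summary(
    'header',
    'AceView summary',
    'Expression: ...',
    'Please quote: ...',
    'footer',
);
print "$_\n" for @summary;
```

With the checks in this order, the "AceView summary" line is kept and the "Please quote" line is dropped; moving the push before the two state updates would do the opposite.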
But I like Eugene's answer better. This is Perl; there are many ways to skin your proverbial cat...