The project needs to grep data from a web page and generate an XML file from it. The output should look like this:
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="306" issue="1">
<Provider>Cadmus</Provider>
<IssueDate>January 1, 2014</IssueDate>
<PageRange>C1-C76</PageRange>
<TOC>
<TocSection>
<Heading>Editorial Focus</Heading>
<DOI>10.1152/ajpcell.00342.2013</DOI>
</TocSection>
<TocSection>
<Heading>Review</Heading>
<DOI>10.1152/ajpcell.00281.2013</DOI>
</TocSection>
<TocSection>
<Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading>
<DOI>10.1152/ajpcell.00156.2013</DOI>
<DOI>10.1152/ajpcell.00066.2013</DOI>
</TocSection>
<TocSection>
<Heading>Articles</Heading>
<DOI>10.1152/ajpcell.00130.2013</DOI>
<DOI>10.1152/ajpcell.00047.2013</DOI>
<DOI>10.1152/ajpcell.00070.2013</DOI>
<DOI>10.1152/ajpcell.00096.2013</DOI>
</TocSection>
<TocSection>
<Heading>Corrigendum</Heading>
<DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI>
</TocSection>
</TOC>
</MetaIssue>
The output I actually get is:
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="306" issue="1">
<Provider>Cadmus</Provider>
<IssueDate>January 1, 2014 </IssueDate>
<PageRange>C1-</PageRange>
<TOC>
<TocSection>
<Heading>Review</Heading>
<DOI>10.1152/ajpcell.00281.2013</DOI>
</TocSection>
<TocSection>
<Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading>
<DOI>10.1152/ajpcell.00156.2013</DOI>
</TocSection>
<TocSection>
<Heading>Articles</Heading>
<DOI>10.1152/ajpcell.00130.2013</DOI>
</TocSection>
<TocSection>
<Heading>Corrigendum</Heading>
<DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI>
</TocSection>
</TOC>
</MetaIssue>
The code I tried is:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $path1 = $ARGV[0];
open(F6, ">meta_issue.xml");
print "Enter the URL:";
my $url = <STDIN>;
chomp $url;
print "Enter the Volume Number:";
my $vol = <STDIN>;
chomp $vol;
print "Enter the Issue Number:";
my $iss = <STDIN>;
chomp $iss;
my $website_content = get($url);
print F6 "\<\!DOCTYPE MetaIssue SYSTEM \"http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd\">\n";
print F6 "<MetaIssue volume=\"$vol\" issue=\"$iss\">\n";
print F6 "<Provider>Cadmus</Provider>\n";
if ($website_content =~ m#<span class="highwire-cite-metadata-date">(.*?)</span>#s) {
    #<span class="highwire-cite-metadata-date">January 1, 2014 </span>
    print F6 "<IssueDate>$1</IssueDate>\n"; #<IssueDate>January 1, 2014</IssueDate>
}
if ($website_content =~ m#(<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)#gs) {
    #.*?(?!<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)$#gs) #<PageRange>C1-C76</PageRange>
    my $first = $2;
    print F6 "<PageRange>$2-</PageRange>\n";
}
print F6 "<TOC>\n";
while ($website_content =~ m#<h2 id=".*?" class=".*?">(.*?)</h2>#gs) {
    my $h = $1;
    print F6 "<TocSection>\n";
    print F6 "<Heading>$h</Heading>\n";
    if ( $website_content =~ m#(.*?<p><span class="label">DOI:</span>\s?(.*?)\n?</p>\s?</span>\s?\n?</div>.*?)#gs ) {
        my $doi = $1;
        my $doi1 = $2;
        print F6 "<DOI>$doi1</DOI>\n";
        print F6 "</TocSection>\n";
    }
}
print F6 "</TOC>\n</MetaIssue>\n";
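Note: each <Heading> may have one or more <DOI> values, and I am unable to retrieve all of them.
I am also unable to place the <DOI> values under the specific <Heading> they belong to.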
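I am unable to retrieve the last page number from
<span class="label">:</span>\s?(.*?)(-(.*?))?</span>
because the markup comes in variants such as </span> c14</span> or <span> c12-c14</span>. From either of these I need to grep the last number, c14.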
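One way to cope with both variants is to make the second half of the range optional and fall back to the single page when it is missing. The following is only a minimal sketch under that assumption; page_bounds and issue_range are hypothetical helper names, and the sample strings are illustrative rather than taken from the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the first and
# last page of a range such as "C12-C14", or the same page twice for "C14".
sub page_bounds {
    my ($text) = @_;
    return unless $text =~ /([A-Za-z]*\d+)\s*(?:-\s*([A-Za-z]*\d+))?/;
    my ($first, $last) = ($1, defined $2 ? $2 : $1);
    return ($first, $last);
}

# The issue range runs from the first page of the first article to the
# last page of the last article, so keep the first "first" and the last "last".
sub issue_range {
    my @page_texts = @_;
    my ($first_page, $last_page);
    for my $text (@page_texts) {
        my ($first, $last) = page_bounds($text) or next;
        $first_page = $first unless defined $first_page;
        $last_page  = $last;
    }
    return defined $first_page ? "$first_page-$last_page" : undef;
}

# Illustrative page strings in document order.
print issue_range(' C1-C11', ' C12-C14', ' C76'), "\n";
```

Because the last page is overwritten on every match, it does not matter whether the final article is a single page or a range.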
I run the code from cmd as follows:
D:\Code>Perl File name (Enter)
Enter the URl: http://ajpcell.physiology.org/content/306/1
Enter the Volume Number: 306
Enter the Issue Number: 1
For the URLs
1) http://ajpendo.physiology.org/content/283/5
2) http://ajpendo.physiology.org/content/280/1
no DOI is present, so in that case the output, instead of the
<DOI>$_</DOI> tag,
should be
<ResId type="publisher-id">$volume/$issue/$first_page</ResId>
where $first_page is specific to that particular article within the section.
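The desired fallback can be sketched as a per-article choice between the two tags. This is only an illustration under assumptions: toc_entry and toc_section are hypothetical names, each article is assumed to be a hash with doi and first_page keys, and the first pages in the sample data are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one TOC entry per article, preferring the DOI and
# falling back to a publisher id built from volume/issue/first page.
sub toc_entry {
    my ($volume, $issue, $article) = @_;
    return "<DOI>$article->{doi}</DOI>\n"
        if defined $article->{doi} and length $article->{doi};
    return qq{<ResId type="publisher-id">$volume/$issue/$article->{first_page}</ResId>\n};
}

# Hypothetical helper: wraps the entries of one section in <TocSection>.
sub toc_section {
    my ($volume, $issue, $heading, @articles) = @_;
    return join q(),
        "<TocSection>\n",
        "<Heading>$heading</Heading>\n",
        (map { toc_entry($volume, $issue, $_) } @articles),
        "</TocSection>\n";
}

# Sample data: the first pages are invented, and the second article
# deliberately has no DOI so that the fallback is exercised.
print toc_section(306, 1, 'Articles',
    { doi => '10.1152/ajpcell.00130.2013', first_page => 'C27' },
    { doi => undef,                        first_page => 'C38' },
);
```

Storing the first page next to each DOI is what makes the fallback possible; a single global $first_page only knows the first page of the whole issue.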
I added an "else {}" block in "sub retrieve_doi()" and an extra line in the "for {}" loop below, but I did not get the desired output.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use HTML::Parser;
use WWW::Mechanize;
my ($date, $first_page, $last_page, @toc);
sub get_date {
    my ($self, $tag, $attr) = @_;
    if ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-date' eq $attr->{class}
        and not defined $date
    ) {
        $self->handler(text => \&next_text_to_date, 'self, text');
    } elsif ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-pages' eq $attr->{class}
    ) {
        if (not defined $first_page) {
            $self->handler(text => \&parse_first_page, 'self, text');
        } else {
            $self->handler(text => \&parse_last_page, 'self, text');
        }
    } elsif ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-doi' eq $attr->{class}
    ) {
        $self->handler(text => \&retrieve_doi, 'self, text');
    } elsif ('div' eq $tag
        and $attr->{class}
        and $attr->{class} =~ /\bissue-toc-section\b/
    ) {
        $self->handler(text => \&next_text_to_toc, 'self, text');
    }
}
sub next_text_to_date {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}
sub parse_first_page {
    my ($self, $text) = @_;
    if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
        $first_page = $1;
        $self->handler(text => undef);
    }
}
sub parse_last_page {
    my ($self, $text) = @_;
    if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
        $last_page = $1;
        $self->handler(text => undef);
    }
}
sub next_text_to_toc {
    my ($self, $text) = @_;
    push @toc, [$text];
    $self->handler(text => undef);
}
sub retrieve_doi {
    my ($self, $text) = @_;
    if ('DOI:' ne $text)
    {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler(text => undef);
    }
    else #UPDATE
    {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler(text => undef);
    }
}
print STDERR 'Enter the URL: ';
chomp(my $url = <>);
my ($volume, $issue) = (split m(/), $url)[-2, -1];
my $p = 'HTML::Parser'->new( api_version => 3,
                             start_h => [ \&get_date, 'self, tagname, attr' ],
                           );
my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
$mech->get($url);
my $contents = $mech->content;
$p->parse($contents);
$p->eof;
my $toc;
for my $section (@toc) {
    $toc .= "<TocSection>\n";
    $toc .= "<Heading>".shift(@$section)."</Heading>\n";
    $toc .= join q(), map "<DOI>$_</DOI>\n", @$section;
    $toc .= join q(), map "<ResId type=\"publisher-id\">$volume/$issue/$first_page</ResId>\n", @$section; #UPDATE
    $toc .= "</TocSection>\n";
}
open (F6, ">meta_issue_$issue.xml");
print F6 <<"__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
<Provider>Cadmus</Provider>
<IssueDate>$date</IssueDate>
<PageRange>$first_page-$last_page</PageRange>
<TOC>
$toc</TOC>
</MetaIssue>
__HTML__
Please tell me how to update the code to get the desired output.
Answer 0 (score: 0)
Use a proper module to parse the HTML:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use HTML::Parser;
use WWW::Mechanize;
my ($date, $first_page, $last_page, @toc);
# General start-tag callback: recognises the elements that mark the data we
# need and registers a text handler to capture the text that follows them.
sub get_info {
    my ($self, $tag, $attr) = @_;
    if ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-date' eq $attr->{class}
        and not defined $date
    ) {
        $self->handler(text => \&next_text_to_date, 'self, text');
    } elsif ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-pages' eq $attr->{class}
    ) {
        # The first pages span gives the first page of the issue; every
        # later one overwrites $last_page, so the final value wins.
        if (not defined $first_page) {
            $self->handler(text => \&parse_first_page, 'self, text');
        } else {
            $self->handler(text => \&parse_last_page, 'self, text');
        }
    } elsif ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-doi' eq $attr->{class}
    ) {
        $self->handler(text => \&retrieve_doi, 'self, text');
    } elsif ('div' eq $tag
        and $attr->{class}
        and $attr->{class} =~ /\bissue-toc-section\b/
    ) {
        $self->handler(text => \&next_text_to_toc, 'self, text');
    }
}

# Store the trimmed date text, then remove the text handler.
sub next_text_to_date {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}

# Keep the part before the optional "-" of a page range.
sub parse_first_page {
    my ($self, $text) = @_;
    if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
        $first_page = $1;
        $self->handler(text => undef);
    }
}

# Keep the part after the optional "-" of a page range.
sub parse_last_page {
    my ($self, $text) = @_;
    if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
        $last_page = $1;
        $self->handler(text => undef);
    }
}

# Start a new TOC section; its first element is the heading text.
sub next_text_to_toc {
    my ($self, $text) = @_;
    push @toc, [$text];
    $self->handler(text => undef);
}

# Skip the "DOI:" label and append the DOI itself to the current section.
sub retrieve_doi {
    my ($self, $text) = @_;
    if ('DOI:' ne $text) {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler(text => undef);
    }
}

print STDERR 'Enter the URL: ';
chomp(my $url = <>);
# The volume and issue are the last two path components of the URL.
my ($volume, $issue) = (split m(/), $url)[-2, -1];
my $p = 'HTML::Parser'->new( api_version => 3,
                             start_h => [ \&get_info, 'self, tagname, attr' ],
                           );
my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
$mech->get($url);
my $contents = $mech->content;
$p->parse($contents);
$p->eof;

# Start with an empty string so the here-doc never interpolates undef.
my $toc = '';
for my $section (@toc) {
    $toc .= " <TocSection>\n";
    $toc .= " <Heading>" . shift(@$section) . "</Heading>\n";
    $toc .= join q(), map " <DOI>$_</DOI>\n", @$section;
    $toc .= " </TocSection>\n";
}
print << "__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
<Provider>Cadmus</Provider>
<IssueDate>$date</IssueDate>
<PageRange>$first_page-$last_page</PageRange>
<TOC>
$toc </TOC>
</MetaIssue>
__HTML__
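HTML::Parser is callback-based, which means you register subroutines that it runs whenever it encounters a given event in the parsed document. I use a general callback, get_info, which searches the HTML for the various markers of the information we need. Because we are often interested in something like "the nearest text after a given span", it just registers a new callback for text. For example, when a span with the class highwire-cite-metadata-date is found and $date has not been defined yet, it registers a new text handler that runs next_text_to_date. That handler just assigns the text to the $date variable and removes itself. I am not sure this is the "right" way to do it, but it works, at least in this case.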
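The handler swap described above can be seen in isolation in the following minimal sketch. It assumes HTML::Parser is installed; the HTML fragment is made up and only mimics the class name used on the real page, and on_start and on_text are hypothetical names standing in for get_info and next_text_to_date.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Made-up HTML fragment that only mimics the class name discussed above.
my $html = <<'__HTML__';
<div><span class="highwire-cite-metadata-date">January 1, 2014 </span>
<span class="other">ignored</span></div>
__HTML__

my $date;

# Start-tag callback: when the interesting span opens, arm a text handler.
sub on_start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'span'
        and ($attr->{class} // '') eq 'highwire-cite-metadata-date'
        and not defined $date
    ) {
        $self->handler(text => \&on_text, 'self, text');
    }
}

# One-shot text callback: store the trimmed text, then disarm itself,
# mirroring the handler(text => undef) call used in the answer.
sub on_text {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'self, tagname, attr' ],
);
$p->parse($html);
$p->eof;

print "$date\n";
```

Since the text handler removes itself after the first call, the text of the second span never reaches it.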
I used WWW::Mechanize so that I could specify the user agent. With the defaults of the simpler LWP::Simple, I did not get the whole HTML.
Building the output from a here-document smells. Switching to Template might be a good next step.
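A rough sketch of what that could look like with the Template Toolkit follows. It assumes the Template module from CPAN is installed and that each section is reshaped into a hash with heading and dois keys, which is not how @toc is stored in the script above. The volume, issue, date, page range and DOIs are taken from the expected output in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Template;

# The XML skeleton lives in one place; "-%]" removes the newline after a directive.
my $template = <<'__TT__';
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="[% volume %]" issue="[% issue %]">
<Provider>Cadmus</Provider>
<IssueDate>[% date %]</IssueDate>
<PageRange>[% first_page %]-[% last_page %]</PageRange>
<TOC>
[% FOREACH section IN toc -%]
<TocSection>
<Heading>[% section.heading %]</Heading>
[% FOREACH doi IN section.dois -%]
<DOI>[% doi %]</DOI>
[% END -%]
</TocSection>
[% END -%]
</TOC>
</MetaIssue>
__TT__

# Assumed data layout: one hash per section (hypothetical, see the lead-in).
my %vars = (
    volume     => 306,
    issue      => 1,
    date       => 'January 1, 2014',
    first_page => 'C1',
    last_page  => 'C76',
    toc        => [
        { heading => 'Review',
          dois    => ['10.1152/ajpcell.00281.2013'] },
        { heading => 'CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology',
          dois    => ['10.1152/ajpcell.00156.2013', '10.1152/ajpcell.00066.2013'] },
    ],
);

my $tt     = Template->new;
my $output = '';
$tt->process(\$template, \%vars, \$output) or die $tt->error, "\n";
print $output;
```

Keeping the markup in the template and the data in a plain structure means the parsing code no longer has to assemble XML strings by hand.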