Question

The project needs to grep online data and generate an XML file from it. This is how the output should look:

<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd"> 
<MetaIssue volume="306" issue="1"> 
  <Provider>Cadmus</Provider> 
  <IssueDate>January 1, 2014</IssueDate>  
  <PageRange>C1-C76</PageRange> 
  <TOC> 
    <TocSection> 
      <Heading>Editorial Focus</Heading> 
      <DOI>10.1152/ajpcell.00342.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Review</Heading> 
      <DOI>10.1152/ajpcell.00281.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading> 
      <DOI>10.1152/ajpcell.00156.2013</DOI> 
      <DOI>10.1152/ajpcell.00066.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Articles</Heading> 
      <DOI>10.1152/ajpcell.00130.2013</DOI> 
      <DOI>10.1152/ajpcell.00047.2013</DOI> 
      <DOI>10.1152/ajpcell.00070.2013</DOI> 
      <DOI>10.1152/ajpcell.00096.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Corrigendum</Heading> 
      <DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI> 
    </TocSection> 
  </TOC> 
</MetaIssue>

The output I get is:

<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd"> 
<MetaIssue volume="306" issue="1"> 
  <Provider>Cadmus</Provider> 
  <IssueDate>January 1, 2014 </IssueDate> 
  <PageRange>C1-</PageRange> 
  <TOC> 
    <TocSection> 
      <Heading>Review</Heading> 
      <DOI>10.1152/ajpcell.00281.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading> 
      <DOI>10.1152/ajpcell.00156.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Articles</Heading> 
      <DOI>10.1152/ajpcell.00130.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Corrigendum</Heading> 
      <DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI> 
    </TocSection> 
  </TOC> 
</MetaIssue>

The code I tried is:

#!/usr/bin/perl  
use strict;
use warnings;

use LWP::Simple;

my $path1 = $ARGV[0];
open(F6, ">meta_issue.xml");

print "Enter the URL:";
my $url = <STDIN>;
chomp $url;

print "Enter the Volume Number:";
my $vol = <STDIN>;
chomp $vol;

print "Enter the Issue Number:";
my $iss = <STDIN>;
chomp $iss;

my $website_content = get($url);

print F6 "\<\!DOCTYPE MetaIssue SYSTEM \"http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd\">\n";
print F6 "<MetaIssue volume=\"$vol\" issue=\"$iss\">\n";
print F6 "<Provider>Cadmus</Provider>\n";

if ($website_content =~ m#<span class="highwire-cite-metadata-date">(.*?)</span>#s) {
    #<span class="highwire-cite-metadata-date">January 1, 2014 </span>

  print F6 "<IssueDate>$1</IssueDate>\n";    #<IssueDate>January 1,         2014</IssueDate>
}

if ($website_content =~ m#(<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)#gs) {
    #.*?(?!<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)$#gs)  #<PageRange>C1-C76</PageRange>

  my $first = $2;
  print F6 "<PageRange>$2-</PageRange>\n";
}

print F6 "<TOC>\n";

while ($website_content =~ m#<h2 id=".*?" class=".*?">(.*?)</h2>#gs) {
  my $h = $1;
  print F6 "<TocSection>\n";
  print F6 "<Heading>$h</Heading>\n";

  if ( $website_content =~ m#(.*?<p><span class="label">DOI:</span>\s?(.*?)\n?</p>\s?</span>\s?\n?</div>.*?)#gs ) {
    my $doi  = $1;
    my $doi1 = $2;
    print F6 "<DOI>$doi1</DOI>\n";
    print F6 "</TocSection>\n";
  }
}

print F6 "</TOC>\n</MetaIssue>\n";

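Note: each <Heading> may have one or more <DOI> values, and I cannot retrieve all of them.

I cannot place the <DOI> values under their specific <Heading>.
I also cannot retrieve the last page number from
```
:\s?(.*?)(-(.*?))?
```
because the page ranges vary, for example c14 or c12-c14. In both of these cases I need to grep the last number, c14.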

I run the code from cmd as follows:

    D:\Code>Perl File name (Enter) 
    Enter the URl: http://ajpcell.physiology.org/content/306/1 
    Enter the Volume Number: 306 
    Enter the Issue Number: 1

Update

For the URLs

1) http://ajpendo.physiology.org/content/283/5

2) http://ajpendo.physiology.org/content/280/1

no DOI is present, so in that case the output, instead of the

      <DOI>$_</DOI> tag

should be

      <ResId type="publisher-id">$volume/$issue/$first_page</ResId>

where $first_page is specific to that particular section.

I added an "else {}" block in "sub retrieve_doi()" and an extra line in the "for {}" loop below, but I did not get the desired output.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use HTML::Parser;
    use WWW::Mechanize;

    my ($date, $first_page, $last_page, @toc);
    sub get_date {
        my ($self, $tag, $attr) = @_;
        if ('span' eq $tag
            and $attr->{class}
            and 'highwire-cite-metadata-date' eq $attr->{class}
            and not defined $date
           ) {
            $self->handler(text => \&next_text_to_date, 'self, text');

        } elsif ('span' eq $tag
                 and $attr->{class}
                 and 'highwire-cite-metadata-pages' eq $attr->{class}
                ) {
            if (not defined $first_page) {
                $self->handler(text => \&parse_first_page, 'self, text');
            } else {
                $self->handler(text => \&parse_last_page, 'self, text');
            }

        } elsif ('span' eq $tag
                 and $attr->{class}
                 and 'highwire-cite-metadata-doi' eq $attr->{class}
                ) {
            $self->handler(text => \&retrieve_doi, 'self, text');

        } elsif ('div' eq $tag
                 and $attr->{class}
                 and $attr->{class} =~ /\bissue-toc-section\b/
                ) {
            $self->handler(text => \&next_text_to_toc, 'self, text');
        }
    }


    sub next_text_to_date {
        my ($self, $text) = @_;
        $text =~ s/^\s+|\s+$//g;
        $date = $text;
        $self->handler(text => undef);
    }


    sub parse_first_page {
        my ($self, $text) = @_;
        if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
            $first_page = $1;
            $self->handler(text => undef);
        }
    }


    sub parse_last_page {
        my ($self, $text) = @_;
        if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
            $last_page = $1;
            $self->handler(text => undef);
        }
    }


    sub next_text_to_toc {
        my ($self, $text) = @_;
        push @toc, [$text];
        $self->handler(text => undef);
    }


    sub retrieve_doi {
        my ($self, $text) = @_;
        if ('DOI:' ne $text) {
            $text =~ s/^\s+|\s+$//g;
            push @{ $toc[-1] }, $text;
            $self->handler(text => undef);
        }
        else {        #UPDATE
            $text =~ s/^\s+|\s+$//g;
            push @{ $toc[-1] }, $text;
            $self->handler(text => undef);
        }
    }


    print STDERR 'Enter the URL: ';
    chomp(my $url = <>);
    my ($volume, $issue) = (split m(/), $url)[-2, -1];

    my $p = 'HTML::Parser'->new( api_version => 3,
                                 start_h => [ \&get_date, 'self, tagname, attr' ],
                               );

    my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
    $mech->get($url);
    my $contents = $mech->content;
    $p->parse($contents);
    $p->eof;

    my $toc;

    for my $section (@toc) {
        $toc .= "<TocSection>\n";
        $toc .= "<Heading>" . shift(@$section) . "</Heading>\n";
        $toc .= join q(), map "<DOI>$_</DOI>\n", @$section;
        $toc .= join q(), map "<ResId type=\"publisher-id\">$volume/$issue/$first_page</ResId>\n", @$section; #UPDATE
        $toc .= "</TocSection>\n";
    }

    open (F6, ">meta_issue_$issue.xml");

    print F6 <<"__HTML__";
    <!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
    <MetaIssue volume="$volume" issue="$issue">
    <Provider>Cadmus</Provider>
    <IssueDate>$date</IssueDate>
    <PageRange>$first_page-$last_page</PageRange>
    <TOC>
    $toc</TOC>
    </MetaIssue>
    __HTML__

Please tell me how to update the code to get the desired output.

Answer 1

Use a proper module to parse the HTML:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use HTML::Parser;
use WWW::Mechanize;

my ($date, $first_page, $last_page, @toc);
sub get_info {
    my ($self, $tag, $attr) = @_;
    if ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-date' eq $attr->{class}
        and not defined $date
       ) {
        $self->handler(text => \&next_text_to_date, 'self, text');

    } elsif ('span' eq $tag
             and $attr->{class}
             and 'highwire-cite-metadata-pages' eq $attr->{class}
            ) {
        if (not defined $first_page) {
            $self->handler(text => \&parse_first_page, 'self, text');
        } else {
            $self->handler(text => \&parse_last_page, 'self, text');
        }

    } elsif ('span' eq $tag
             and $attr->{class}
             and 'highwire-cite-metadata-doi' eq $attr->{class}
            ) {
        $self->handler(text => \&retrieve_doi, 'self, text');

    } elsif ('div' eq $tag
             and $attr->{class}
             and $attr->{class} =~ /\bissue-toc-section\b/
            ) {
        $self->handler(text => \&next_text_to_toc, 'self, text');
    }
}


sub next_text_to_date {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}


sub parse_first_page {
    my ($self, $text) = @_;
    if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
        $first_page = $1;
        $self->handler(text => undef);
    }
}


sub parse_last_page {
    my ($self, $text) = @_;
    if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
        $last_page = $1;
        $self->handler(text => undef);
    }
}


sub next_text_to_toc {
    my ($self, $text) = @_;
    push @toc, [$text];
    $self->handler(text => undef);
}


sub retrieve_doi {
    my ($self, $text) = @_;
    if ('DOI:' ne $text) {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler(text => undef);
    }
}


print STDERR 'Enter the URL: ';
chomp(my $url = <>);
my ($volume, $issue) = (split m(/), $url)[-2, -1];

my $p = 'HTML::Parser'->new( api_version => 3,
                             start_h => [ \&get_info, 'self, tagname, attr' ],
                           );

my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
$mech->get($url);
my $contents = $mech->content;
$p->parse($contents);
$p->eof;

my $toc = q();    # start empty so the heredoc never interpolates undef
for my $section (@toc) {
    $toc .= "    <TocSection>\n";
    $toc .= "      <Heading>" . shift(@$section) . "</Heading>\n";
    $toc .= join q(), map "      <DOI>$_</DOI>\n", @$section;
    $toc .= "    </TocSection>\n";
}

print << "__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
  <Provider>Cadmus</Provider>
  <IssueDate>$date</IssueDate>
  <PageRange>$first_page-$last_page</PageRange>
  <TOC>
$toc  </TOC>
</MetaIssue>
__HTML__
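
The script above emits only <DOI> elements. For the issues mentioned in the question's update, where no DOI is listed, one possible extension is to fall back to a <ResId> element when rendering each entry. The following is only a sketch of that rendering step: the record layout (`heading`, `articles`, `doi`, `first_page`), the helper name `render_section`, and the page values are assumptions made for illustration, and collecting a first page per article would still need an extra text handler in the parser.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical section records. The layout and the page values (C27, C36)
# are illustrative placeholders, not data taken from the site.
my @sections = (
    { heading  => 'Articles',
      articles => [
          { doi => '10.1152/ajpcell.00130.2013', first_page => 'C27' },
          { doi => undef,                        first_page => 'C36' },
      ],
    },
);

# Render one <TocSection>: use <DOI> when a DOI is known, otherwise fall
# back to <ResId> built from volume, issue and the article's first page.
sub render_section {
    my ($section, $volume, $issue) = @_;
    my $xml = "    <TocSection>\n"
            . "      <Heading>$section->{heading}</Heading>\n";
    for my $article (@{ $section->{articles} }) {
        $xml .= defined $article->{doi}
            ? "      <DOI>$article->{doi}</DOI>\n"
            : qq{      <ResId type="publisher-id">$volume/$issue/$article->{first_page}</ResId>\n};
    }
    return $xml . "    </TocSection>\n";
}

my $toc_xml = join q(), map { render_section($_, 306, 1) } @sections;
print $toc_xml;
```

Keeping the DOI and the first page together in one record per article avoids the problem in the question's attempt, where a single global $first_page was reused for every entry.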

Basic explanation:

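HTML::Parser is callback based, which means you register subroutines that it runs when it encounters given events in the parsed document. I use a general callback, get_info, which searches the HTML for the various indicators of the wanted information. Since we are often interested in something like "the nearest text after a given span", it usually just registers a new callback for text. For example, when a span with the class highwire-cite-metadata-date is found and $date has not been defined yet, it registers a new text handler that runs next_text_to_date. That handler just assigns the text to the $date variable and removes itself. I am not sure this is the "right" way to do it, but at least in this case it works.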
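
The handler-swapping idea can be tried in isolation on a small inline string. This is a minimal sketch: the HTML fragment is made up to imitate the site's markup, and the names `on_start` and `on_text` are chosen here for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Parser;

# Made-up fragment imitating the markup the script looks for.
my $html = '<div><span class="highwire-cite-metadata-date"> January 1, 2014 </span>'
         . '<span class="other">ignored</span></div>';

my $date;

# Start-tag callback: when the interesting span opens, arm a text handler.
sub on_start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'span'
        and ($attr->{class} // '') eq 'highwire-cite-metadata-date';
    $self->handler(text => \&on_text, 'self, text');
}

# Text callback: store the trimmed text, then remove itself so that
# later text nodes are ignored.
sub on_text {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'self, tagname, attr' ],
);
$parser->parse($html);
$parser->eof;

print "$date\n";
```

Because the text handler unregisters itself after the first text node, the content of the second span never reaches it.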

I used WWW::Mechanize to be able to specify the user agent. With the defaults of the simpler LWP::Simple, I did not get the whole HTML.
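
Setting the user agent can be wrapped in a small helper, sketched below. The names `new_client` and `fetch_html` are hypothetical, and the fetch itself is left commented out because it needs network access.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize;

# Build a client that identifies itself with the given User-Agent string.
# autocheck => 1 makes failed requests die instead of failing silently.
sub new_client {
    my ($agent) = @_;
    return WWW::Mechanize->new(agent => $agent, autocheck => 1);
}

# Fetch a page and return its HTML (hypothetical helper).
sub fetch_html {
    my ($url, $agent) = @_;
    my $mech = new_client($agent);
    $mech->get($url);
    return $mech->content;
}

# Example call, commented out because it requires network access:
# my $html = fetch_html('http://ajpcell.physiology.org/content/306/1', 'Mozilla');
```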

Producing the output from a hand-written heredoc template smells. Switching to the Template module might be a good step forward.
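
As a rough idea of what that could look like, here is a sketch using the Template Toolkit (the `Template` module from CPAN). The variable names passed to the template (`sections`, `heading`, `dois` and so on) are assumptions for this example, and the sample values are copied from the expected output in the question.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Template;

# Template text; a trailing "-" in "-%]" removes the newline after a directive.
my $template = <<'__TT__';
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="[% volume %]" issue="[% issue %]">
  <Provider>Cadmus</Provider>
  <IssueDate>[% date %]</IssueDate>
  <PageRange>[% first_page %]-[% last_page %]</PageRange>
  <TOC>
[% FOREACH section IN sections -%]
    <TocSection>
      <Heading>[% section.heading | html %]</Heading>
[% FOREACH doi IN section.dois -%]
      <DOI>[% doi | html %]</DOI>
[% END -%]
    </TocSection>
[% END -%]
  </TOC>
</MetaIssue>
__TT__

# Sample data taken from the expected output shown in the question.
my %vars = (
    volume     => 306,
    issue      => 1,
    date       => 'January 1, 2014',
    first_page => 'C1',
    last_page  => 'C76',
    sections   => [
        { heading => 'Review',
          dois    => ['10.1152/ajpcell.00281.2013'] },
        { heading => 'CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology',
          dois    => ['10.1152/ajpcell.00156.2013', '10.1152/ajpcell.00066.2013'] },
    ],
);

my $tt  = Template->new;
my $xml = '';
$tt->process(\$template, \%vars, \$xml) or die $tt->error, "\n";
print $xml;
```

The `html` filter escapes characters such as & and < in headings, which the plain heredoc version does not do.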
