Retrieve online data and generate XML output from it

Date: 2015-04-06 18:04:36

Tags: windows perl

The project needs to extract data from a web page and generate an XML file from it. This is what the output should look like:

<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd"> 
<MetaIssue volume="306" issue="1"> 
  <Provider>Cadmus</Provider> 
  <IssueDate>January 1, 2014</IssueDate>  
  <PageRange>C1-C76</PageRange> 
  <TOC> 
    <TocSection> 
      <Heading>Editorial Focus</Heading> 
      <DOI>10.1152/ajpcell.00342.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Review</Heading> 
      <DOI>10.1152/ajpcell.00281.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading> 
      <DOI>10.1152/ajpcell.00156.2013</DOI> 
      <DOI>10.1152/ajpcell.00066.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Articles</Heading> 
      <DOI>10.1152/ajpcell.00130.2013</DOI> 
      <DOI>10.1152/ajpcell.00047.2013</DOI> 
      <DOI>10.1152/ajpcell.00070.2013</DOI> 
      <DOI>10.1152/ajpcell.00096.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Corrigendum</Heading> 
      <DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI> 
    </TocSection> 
  </TOC> 
</MetaIssue>       

The output I actually get is:

<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd"> 
<MetaIssue volume="306" issue="1"> 
  <Provider>Cadmus</Provider> 
  <IssueDate>January 1, 2014 </IssueDate> 
  <PageRange>C1-</PageRange> 
  <TOC> 
    <TocSection> 
      <Heading>Review</Heading> 
      <DOI>10.1152/ajpcell.00281.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>CALL FOR PAPERS | Stem Cell Physiology and Pathophysiology</Heading> 
      <DOI>10.1152/ajpcell.00156.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Articles</Heading> 
      <DOI>10.1152/ajpcell.00130.2013</DOI> 
    </TocSection> 
    <TocSection> 
      <Heading>Corrigendum</Heading> 
      <DOI>10.1152/ajpcell.zh0-7419-corr.2014</DOI> 
    </TocSection> 
  </TOC> 
</MetaIssue> 

The code I tried is:

#!/usr/bin/perl  
use strict;
use warnings;

use LWP::Simple;

my $path1 = $ARGV[0];
open(F6, ">meta_issue.xml");

print "Enter the URL:";
my $url = <STDIN>;
chomp $url;

print "Enter the Volume Number:";
my $vol = <STDIN>;
chomp $vol;

print "Enter the Issue Number:";
my $iss = <STDIN>;
chomp $iss;

my $website_content = get($url);

print F6 "\<\!DOCTYPE MetaIssue SYSTEM \"http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd\">\n";
print F6 "<MetaIssue volume=\"$vol\" issue=\"$iss\">\n";
print F6 "<Provider>Cadmus</Provider>\n";

if ($website_content =~ m#<span class="highwire-cite-metadata-date">(.*?)</span>#s) {
    #<span class="highwire-cite-metadata-date">January 1, 2014 </span>

  print F6 "<IssueDate>$1</IssueDate>\n";    #<IssueDate>January 1,         2014</IssueDate>
}

if ($website_content =~ m#(<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)#gs) {
    #.*?(?!<span class="label">:</span>\s?(.*?)(-(.*?))?</span>)$#gs)  #<PageRange>C1-C76</PageRange>

  my $first = $2;
  print F6 "<PageRange>$2-</PageRange>\n";
}

print F6 "<TOC>\n";

while ($website_content =~ m#<h2 id=".*?" class=".*?">(.*?)</h2>#gs) {
  my $h = $1;
  print F6 "<TocSection>\n";
  print F6 "<Heading>$h</Heading>\n";

  if ( $website_content =~ m#(.*?<p><span class="label">DOI:</span>\s?(.*?)\n?</p>\s?</span>\s?\n?</div>.*?)#gs ) {
    my $doi  = $1;
    my $doi1 = $2;
    print F6 "<DOI>$doi1</DOI>\n";
    print F6 "</TocSection>\n";
  }
}

print F6 "</TOC>\n</MetaIssue>\n";

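Note: each <Heading> may have one or more <DOI> values, which I am unable to retrieve.

  1. I am unable to place the specific <DOI> values under their <Heading>.

  2. I am unable to retrieve the last occurrence of the number from

    <span class="label">:</span>\s?(.*?)(-(.*?))?</span>

    because there are variations such as </span> c14</span><span> c12-c14</span>. From these I need to grab the last number, c14.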

  3. I run the code from cmd as follows:

        D:\Code>Perl File name (Enter)
        Enter the URL: http://ajpcell.physiology.org/content/306/1
        Enter the Volume Number: 306 
        Enter the Issue Number: 1 
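
For point 2, one possible approach is to strip the tags, collect every page-like token with a global match in list context, and keep the final one. This is only a sketch: the helper name `last_page_label` and the sample markup are assumptions for illustration, and the real page may need a stricter pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last page label found in a snippet,
# e.g. "C14" from "C12-C14" or from a lone "C14".
sub last_page_label {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;          # drop tags, keep the text
    my @labels = $text =~ /\b([A-Za-z]*\d+)\b/g;  # every token such as C12 or 76
    return @labels ? $labels[-1] : undef;
}

print last_page_label('<span class="label">:</span> C12-C14</span>'), "\n";  # prints C14
print last_page_label('<span> C14</span>'), "\n";                            # prints C14
```

Taking the last element of the match list avoids having to encode the optional "first-" part of the range in the pattern itself.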
    

    Update

    At the URLs

    1) http://ajpendo.physiology.org/content/283/5

    2) http://ajpendo.physiology.org/content/280/1

    the DOI is not present, so in these cases, instead of the

          <DOI>$_</DOI> tag

    the output should be

          <ResId type="publisher-id">$volume/$issue/$first_page</ResId>

    where $first_page is specific to that particular section.

    I added an "else {}" block in "sub retrieve_doi()" and a line in the "for {}" loop below, but did not get the desired output.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use HTML::Parser;
        use WWW::Mechanize;

        my ($date, $first_page, $last_page, @toc);
        sub get_date {
            my ($self, $tag, $attr) = @_;
            if ('span' eq $tag
                and $attr->{class}
                and 'highwire-cite-metadata-date' eq $attr->{class}
                and not defined $date
               ) {
                $self->handler(text => \&next_text_to_date, 'self, text');

            } elsif ('span' eq $tag
                     and $attr->{class}
                     and 'highwire-cite-metadata-pages' eq $attr->{class}
                    ) {
                if (not defined $first_page) {
                    $self->handler(text => \&parse_first_page, 'self, text');
                } else {
                    $self->handler(text => \&parse_last_page, 'self, text');
                }

            } elsif ('span' eq $tag
                     and $attr->{class}
                     and 'highwire-cite-metadata-doi' eq $attr->{class}
                    ) {
                $self->handler(text => \&retrieve_doi, 'self, text');

            } elsif ('div' eq $tag
                     and $attr->{class}
                     and $attr->{class} =~ /\bissue-toc-section\b/
                    ) {
                $self->handler(text => \&next_text_to_toc, 'self, text');
            }
        }


        sub next_text_to_date {
            my ($self, $text) = @_;
            $text =~ s/^\s+|\s+$//g;
            $date = $text;
            $self->handler(text => undef);
        }


        sub parse_first_page {
            my ($self, $text) = @_;
            if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
                $first_page = $1;
                $self->handler(text => undef);
            }
        }


        sub parse_last_page {
            my ($self, $text) = @_;
            if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
                $last_page = $1;
                $self->handler(text => undef);
            }
        }


        sub next_text_to_toc {
            my ($self, $text) = @_;
            push @toc, [$text];
            $self->handler(text => undef);
        }


        sub retrieve_doi {
            my ($self, $text) = @_;
            if ('DOI:' ne $text) {
                $text =~ s/^\s+|\s+$//g;
                push @{ $toc[-1] }, $text;
                $self->handler(text => undef);
            } else {        #UPDATE
                $text =~ s/^\s+|\s+$//g;
                push @{ $toc[-1] }, $text;
                $self->handler(text => undef);
            }
        }


        print STDERR 'Enter the URL: ';
        chomp(my $url = <>);
        my ($volume, $issue) = (split m(/), $url)[-2, -1];

        my $p = 'HTML::Parser'->new( api_version => 3,
                                     start_h => [ \&get_date, 'self, tagname, attr' ],
                                   );

        my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
        $mech->get($url);
        my $contents = $mech->content;
        $p->parse($contents);
        $p->eof;

        my $toc;

        for my $section (@toc) {
            $toc .= "<TocSection>\n";
            $toc .= "<Heading>".shift(@$section)."</Heading>\n";
            $toc .= join q(), map "<DOI>$_</DOI>\n", @$section;
            $toc .= join q(), map "<ResId type=”publisher-id”>$volume/$issue/$first_page</ResId>\n", @$section; #UPDATE
            $toc .= "</TocSection>\n";
        }

        open (F6, ">meta_issue_$issue.xml");

        print F6 <<"__HTML__";
        <!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
        <MetaIssue volume="$volume" issue="$issue">
        <Provider>Cadmus</Provider>
        <IssueDate>$date</IssueDate>
        <PageRange>$first_page-$last_page</PageRange>
        <TOC>
        $toc</TOC>
        </MetaIssue>
        __HTML__
    

    Please tell me how to update the code to get the desired output.

1 answer:

Answer 0 (score: 0)

Use a proper module to parse the HTML:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use HTML::Parser;
use WWW::Mechanize;

my ($date, $first_page, $last_page, @toc);
sub get_info {
    my ($self, $tag, $attr) = @_;
    if ('span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-date' eq $attr->{class}
        and not defined $date
       ) {
        $self->handler(text => \&next_text_to_date, 'self, text');

    } elsif ('span' eq $tag
             and $attr->{class}
             and 'highwire-cite-metadata-pages' eq $attr->{class}
            ) {
        if (not defined $first_page) {
            $self->handler(text => \&parse_first_page, 'self, text');
        } else {
            $self->handler(text => \&parse_last_page, 'self, text');
        }

    } elsif ('span' eq $tag
             and $attr->{class}
             and 'highwire-cite-metadata-doi' eq $attr->{class}
            ) {
        $self->handler(text => \&retrieve_doi, 'self, text');

    } elsif ('div' eq $tag
             and $attr->{class}
             and $attr->{class} =~ /\bissue-toc-section\b/
            ) {
        $self->handler(text => \&next_text_to_toc, 'self, text');
    }
}


sub next_text_to_date {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler(text => undef);
}


sub parse_first_page {
    my ($self, $text) = @_;
    if ($text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/) {
        $first_page = $1;
        $self->handler(text => undef);
    }
}


sub parse_last_page {
    my ($self, $text) = @_;
    if ($text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/) {
        $last_page = $1;
        $self->handler(text => undef);
    }
}


sub next_text_to_toc {
    my ($self, $text) = @_;
    push @toc, [$text];
    $self->handler(text => undef);
}


sub retrieve_doi {
    my ($self, $text) = @_;
    if ('DOI:' ne $text) {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler(text => undef);
    }
}


print STDERR 'Enter the URL: ';
chomp(my $url = <>);
my ($volume, $issue) = (split m(/), $url)[-2, -1];

my $p = 'HTML::Parser'->new( api_version => 3,
                             start_h => [ \&get_info, 'self, tagname, attr' ],
                           );

my $mech = 'WWW::Mechanize'->new(agent => 'Mozilla');
$mech->get($url);
my $contents = $mech->content;
$p->parse($contents);
$p->eof;

my $toc;
for my $section (@toc) {
    $toc .= "    <TocSection>\n";
    $toc .= "      <Heading>" . shift(@$section) . "</Heading>\n";
    $toc .= join q(), map "      <DOI>$_</DOI>\n", @$section;
    $toc .= "    </TocSection>\n";
}

print << "__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
  <Provider>Cadmus</Provider>
  <IssueDate>$date</IssueDate>
  <PageRange>$first_page-$last_page</PageRange>
  <TOC>
$toc  </TOC>
</MetaIssue>
__HTML__

Basic explanation:

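HTML::Parser is callback-based, meaning you register subroutines to run when given events are encountered in the parsed document. I use one general callback, get_info, which looks in the HTML for the various markers of the information we need. Because we are usually interested in something like "the nearest text after a given span", it often just registers a new callback for text. For example, when a span with the class highwire-cite-metadata-date is found and $date has not been defined yet, it registers a new text handler that runs next_text_to_date. That handler just assigns the text to the $date variable and removes itself. I am not sure this is the "right" way to do it, but at least in this case it works.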
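
The handler-swapping idea can be tried in isolation on an inline string, without fetching anything. The following is a minimal sketch, not the code above: the class name `date`, the sub names `on_start` and `grab_text`, and the sample HTML are all made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @dates;

# Start-tag callback: when the span of interest opens, arm a text handler.
sub on_start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'span' and ($attr->{class} // '') eq 'date') {
        $self->handler(text => \&grab_text, 'self, text');
    }
}

# Text callback: store the trimmed text, then unregister itself so that
# later text in the document is ignored again.
sub grab_text {
    my ($self, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return unless length $text;    # skip whitespace-only chunks
    push @dates, $text;
    $self->handler(text => undef);
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'self, tagname, attr' ],
);
$p->parse('<p>ignored</p><span class="date"> January 1, 2014 </span><span>also ignored</span>');
$p->eof;

print "$dates[0]\n";    # prints January 1, 2014
```

Only the text directly following the matching span is captured, because the text handler exists only between the start tag that arms it and the first non-empty text chunk.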

I used WWW::Mechanize to be able to specify the user agent. With the defaults of the simpler LWP::Simple, I did not get the whole HTML.

The way the output is templated smells. Switching to Template might be a good step forward.
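
As a rough idea of what the Template (Template Toolkit) version could look like, here is a sketch that renders the same structure from a data hash. The layout of `%vars` (a `toc` list of hashes with `heading` and `dois` keys) is an assumption chosen for this example, and the sample values simply stand in for what the parser callbacks would collect.

```perl
use strict;
use warnings;
use Template;

# Sample data standing in for what the parser callbacks collect.
my %vars = (
    volume     => 306,
    issue      => 1,
    date       => 'January 1, 2014',
    first_page => 'C1',
    last_page  => 'C76',
    toc        => [
        { heading => 'Review',   dois => ['10.1152/ajpcell.00281.2013'] },
        { heading => 'Articles', dois => ['10.1152/ajpcell.00130.2013',
                                          '10.1152/ajpcell.00047.2013'] },
    ],
);

# The "| html" filter escapes &, <, > and " in the inserted values;
# "-%]" removes the newline after a directive to keep the output tidy.
my $template = <<'__TT__';
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="[% volume %]" issue="[% issue %]">
  <Provider>Cadmus</Provider>
  <IssueDate>[% date | html %]</IssueDate>
  <PageRange>[% first_page %]-[% last_page %]</PageRange>
  <TOC>
[% FOREACH section IN toc -%]
    <TocSection>
      <Heading>[% section.heading | html %]</Heading>
[% FOREACH doi IN section.dois -%]
      <DOI>[% doi | html %]</DOI>
[% END -%]
    </TocSection>
[% END -%]
  </TOC>
</MetaIssue>
__TT__

my $tt = Template->new or die Template->error;
my $xml = '';
$tt->process(\$template, \%vars, \$xml) or die $tt->error;
print $xml;
```

Keeping the markup in a template separates the document layout from the scraping logic, and the filter takes care of headings that contain characters such as `&`.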