HTML::TableExtract - script works with an HTML file but not with the corresponding URL

Date: 2016-01-04 15:25:04

Tags: html perl

I am using the following script, which takes as input the HTML page saved from this URL: http://omim.org/entry/600185

use HTML::TableExtract;

my $doc = 'OMIM_2.htm';
my $headers =  [ 'Phenotype', 'Inheritance' ];

my $table_extract = HTML::TableExtract->new(headers => $headers);

$table_extract->parse_file($doc);
my ($table) = $table_extract->tables;

for my $row ($table->rows) {
    foreach my $info (@$row) {
        if ($info =~ m/(\S+)/) {
            $info =~ s/^\s+(.+)\s+$/$1/;
            print $info."\t";
        }
    }
    print "\n";
}

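It does what I want: it extracts the "Phenotype" and "Inheritance" fields from the table. However, I would like to get this information directly from the URL, so I tried modifying the script: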

use HTML::TableExtract;

my $doc = 'http://omim.org/entry/600185';
my $headers =  [ 'Phenotype', 'Inheritance' ];

my $table_extract = HTML::TableExtract->new(headers => $headers);

$table_extract->parse($doc);
my ($table) = $table_extract->tables;

for my $row ($table->rows) {
    foreach my $info (@$row) {
        if ($info =~ m/(\S+)/) {
            $info =~ s/^\s+(.+)\s+$/$1/;
            print $info."\t";
        }
    }
    print "\n";
}

I am clearly making a mistake, because I get the following error:

Can't call method "rows" on an undefined value at Test_OMIM.perl line 11.

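Even more curiously, I get the same error if the file is named "OMIM_2.html" instead of "OMIM_2.htm". Is that logical?

Thanks for your help.

1 Answer:

Answer 0: (score: 3)

You are giving HTML::TableExtract a URL when it expects HTML. To download the HTML you can do this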

use strict;
use warnings FATAL => 'all';

use LWP::UserAgent;

my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://omim.org/entry/600185');
my $html     = $response->content;

print $html;

Output

Your client was identified as a crawler.


Please note:

- The robots.txt files disallows the crawling of the site except to Google, Bing 
  and Yahoo crawlers.

- The raw data is available via FTP on the http://omim.org/downloads link on the site.

- We have an API you can learn about at http://omim.org/api and http://omim.org/help/api, 
  this provides access to the data in XML, JSON, Python and Ruby formats.

- You should feel free to contact us at http://omim.org/contact to figure the best 
  approach to getting the data you need.
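
For pages that you are permitted to fetch, the HTML text held in $html is what HTML::TableExtract needs: parse() takes document content and parse_file() takes a file name, and neither accepts a URL. Here is a minimal sketch, assuming a made-up in-memory page (the table contents and the extract_rows helper are hypothetical, not real OMIM data). It also guards against the undefined $table that produced the error in the question:

```perl
use strict;
use warnings;

use HTML::TableExtract;

# Hypothetical stand-in for a downloaded page; not real OMIM data.
my $html = <<'HTML';
<html><body>
<table>
  <tr><th>Location</th><th>Phenotype</th><th>Inheritance</th></tr>
  <tr><td>1p1</td><td> Example phenotype </td><td> AD </td></tr>
</table>
</body></html>
HTML

# Returns the matching rows as array refs of trimmed cell strings,
# or an empty list when no table has the requested headers.
sub extract_rows {
    my ($content, $headers) = @_;

    my $te = HTML::TableExtract->new(headers => $headers);
    $te->parse($content);    # HTML text, not a file name or a URL
    $te->eof;                # tell the underlying HTML::Parser the input is complete

    my ($table) = $te->tables;
    return unless defined $table;    # avoids calling rows() on undef

    my @rows;
    for my $row ($table->rows) {
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;   # trim each cell
        push @rows, \@cells;
    }
    return @rows;
}

my @wanted = ('Phenotype', 'Inheritance');

print join("\t", @$_), "\n" for extract_rows($html, \@wanted);

# A URL string contains no table markup, so nothing matches.
my @none = extract_rows('http://omim.org/entry/600185', \@wanted);
print "no matching table\n" unless @none;
```

Checking whether a table was found before looping turns the "Can't call method "rows" on an undefined value" failure into an explicit, recoverable condition.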

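You are probably running into this because omim.org does not want you to download its HTML automatically; it wants you to use the raw data or the API instead. This is their robots.txt document, which all automated software is expected to read and obey voluntarily.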
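
A script can make that check itself before requesting a page. The following is only a sketch using WWW::RobotRules: the robots.txt text is a hypothetical example written for illustration (it is not copied from omim.org), and the agent names and the may_fetch helper are likewise assumptions. In a real script the text would be downloaded from the site's /robots.txt, or LWP::RobotUA could be used in place of LWP::UserAgent to apply the rules automatically.

```perl
use strict;
use warnings;

use WWW::RobotRules;

# Hypothetical robots.txt content for illustration only.
my $robots_txt = <<'ROBOTS';
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
ROBOTS

# Returns 1 if $agent may fetch $url under the given robots.txt text, else 0.
sub may_fetch {
    my ($agent, $robots_url, $robots_text, $url) = @_;

    my $rules = WWW::RobotRules->new($agent);
    $rules->parse($robots_url, $robots_text);

    # allowed() returns a negative value when no fresh rules are known;
    # treat that as "not allowed" here.
    return $rules->allowed($url) > 0 ? 1 : 0;
}

my $robots_url = 'http://omim.org/robots.txt';
my $page       = 'http://omim.org/entry/600185';

for my $agent ('MyScript/1.0', 'Googlebot/2.1') {
    my $verdict = may_fetch($agent, $robots_url, $robots_txt, $page)
        ? 'allowed'
        : 'disallowed';
    printf "%-15s %s\n", $agent, $verdict;
}
```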