Question

I am using the following script, which takes as input an HTML page saved from this URL: http://omim.org/entry/600185

use HTML::TableExtract;

my $doc = 'OMIM_2.htm';
my $headers =  [ 'Phenotype', 'Inheritance' ];

my $table_extract = HTML::TableExtract->new(headers => $headers);

$table_extract->parse_file($doc);
my ($table) = $table_extract->tables;

for my $row ($table->rows) {
    foreach $info (@$row) {
        if ($info =~ m/(\S+)/) {
             $info =~ s/^\s+(.+)\s+$/$1/;
             print $info."\t";
        }
     }
    print "\n";
}

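It does what I want: it extracts the "Phenotype" and "Inheritance" fields from the table. However, I would like to get this information directly from the URL, so I tried modifying the script: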

use HTML::TableExtract;

my $doc = 'http://omim.org/entry/600185';
my $headers =  [ 'Phenotype', 'Inheritance' ];

my $table_extract = HTML::TableExtract->new(headers => $headers);

$table_extract->parse($doc);
my ($table) = $table_extract->tables;

for my $row ($table->rows) {
    foreach $info (@$row) {
         if ($info =~ m/(\S+)/) {
             $info =~ s/^\s+(.+)\s+$/$1/;
             print $info."\t";
        }
     }
     print "\n";
}

I have clearly made a mistake, because I get the following error:

Can't call method "rows" on an undefined value at Test_OMIM.perl line 11.

More curiously, I also get this error if the file is named "OMIM_2.html" rather than "OMIM_2.htm". Is that logical?

Thanks for your help.

Answer 1

You are giving HTML::TableExtract a URL where it expects HTML. To download the HTML, you can do this:

use strict;
use warnings qw/ all FATAL /;

use LWP::UserAgent;

my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://omim.org/entry/600185');
my $html     = $response->content;

print $html;

Output

Your client was identified as a crawler.


Please note:

- The robots.txt files disallows the crawling of the site except to Google, Bing 
  and Yahoo crawlers.

- The raw data is available via FTP on the http://omim.org/downloads link on the site.

- We have an API you can learn about at http://omim.org/api and http://omim.org/help/api, 
  this provides access to the data in XML, JSON, Python and Ruby formats.

- You should feel free to contact us at http://omim.org/contact to figure the best 
  approach to getting the data you need.
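
For reference, `parse` expects the HTML text itself rather than a location. Given a URL string, it finds no table, `$table` stays undefined, and the call to `rows` fails with the error quoted in the question. The same undefined `$table` would be expected whenever no table matches, for instance if `parse_file` cannot open the file it was given. Below is a minimal sketch of passing a string to `parse` with a guard for the no-match case. The helper name `matching_rows` and the inline table are made up for illustration and are not taken from the real page.

```perl
use strict;
use warnings;

use HTML::TableExtract;

# Hypothetical helper: return the rows of the first table whose header row
# matches @$headers, or an empty list when no such table is found in $html.
sub matching_rows {
    my ($html, $headers) = @_;

    my $te = HTML::TableExtract->new(headers => $headers);
    $te->parse($html);    # parse() takes HTML text, not a URL or file name

    my ($table) = $te->tables;
    return unless defined $table;    # avoids calling rows() on undef

    return $table->rows;
}

# Made-up stand-in for real page content (an assumption, not OMIM data).
my $html = <<'HTML';
<table>
  <tr><th>Phenotype</th><th>Inheritance</th></tr>
  <tr><td> Example phenotype </td><td> AD </td></tr>
</table>
HTML

for my $row (matching_rows($html, ['Phenotype', 'Inheritance'])) {
    my @cells = map { defined $_ ? $_ : '' } @$row;
    s/^\s+|\s+$//g for @cells;    # trim leading and trailing whitespace
    print join("\t", @cells), "\n";
}
```

Keeping the parsing in a function that accepts a string means the HTML can come from any permitted source (a saved file, an API response) without changing the extraction logic.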

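Note that you are probably getting this response because omim.org does not want you to download its HTML automatically, and would rather you use the raw data or the API. The site's robots.txt file states this policy, and all automated software should voluntarily read and obey it.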
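
One way a script can honour robots.txt is to test a URL against the rules before fetching it. The sketch below uses the WWW::RobotRules module for that. The robots.txt text is a made-up example loosely modelled on the policy described in the response, not the site's actual file, and the helper name `may_fetch` and the agent string `my-script/0.1` are assumptions for illustration.

```perl
use strict;
use warnings;

use WWW::RobotRules;

# Hypothetical robots.txt content; a real script would download the file
# from the site instead of hard-coding it.
my $robots_txt = <<'ROBOTS';
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
ROBOTS

# Hypothetical helper: true if $agent may fetch $url according to the
# robots.txt text that was retrieved from $robots_url.
sub may_fetch {
    my ($agent, $robots_url, $rules_text, $url) = @_;

    my $rules = WWW::RobotRules->new($agent);
    $rules->parse($robots_url, $rules_text);

    return $rules->allowed($url);
}

my $ok = may_fetch(
    'my-script/0.1',
    'http://omim.org/robots.txt',
    $robots_txt,
    'http://omim.org/entry/600185',
);
print $ok ? "allowed\n" : "disallowed\n";
```

LWP::RobotUA, from the same family of modules as LWP::UserAgent, is meant to perform this kind of check automatically on each request, which may be more convenient than checking by hand.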

HTML::TableExtract - script works with an HTML file but not with the corresponding URL
