I am using the following script, which takes as input an HTML page saved from this URL: http://omim.org/entry/600185
use HTML::TableExtract;
my $doc = 'OMIM_2.htm';
my $headers = [ 'Phenotype', 'Inheritance' ];
my $table_extract = HTML::TableExtract->new(headers => $headers);
$table_extract->parse_file($doc);
my ($table) = $table_extract->tables;
for my $row ($table->rows) {
    foreach $info (@$row) {
        if ($info =~ m/(\S+)/) {
            $info =~ s/^\s+(.+)\s+$/$1/;
            print $info."\t";
        }
    }
    print "\n";
}
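It does what I want: it extracts the "Phenotype" and "Inheritance" fields from the table. However, I would like to get this information directly from the URL, so I tried modifying the script: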
use HTML::TableExtract;
my $doc = 'http://omim.org/entry/600185';
my $headers = [ 'Phenotype', 'Inheritance' ];
my $table_extract = HTML::TableExtract->new(headers => $headers);
$table_extract->parse($doc);
my ($table) = $table_extract->tables;
for my $row ($table->rows) {
    foreach $info (@$row) {
        if ($info =~ m/(\S+)/) {
            $info =~ s/^\s+(.+)\s+$/$1/;
            print $info."\t";
        }
    }
    print "\n";
}
I have clearly made a mistake, because I get the following error:
Can't call method "rows" on an undefined value at Test_OMIM.perl line 11.
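More interestingly, I also get this error when the file name in the script is "OMIM_2.html" instead of "OMIM_2.htm". Is there a logic to that?
Thank you for your help.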
Answer 0 (score: 3)
You are passing a URL to HTML::TableExtract when it expects HTML. To download the HTML you could do this:
use strict;
use warnings qw/ all FATAL /;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://omim.org/entry/600185');
my $html = $response->content;
print $html;
Output:
Your client was identified as a crawler.
Please note:
- The robots.txt files disallows the crawling of the site except to Google, Bing
and Yahoo crawlers.
- The raw data is available via FTP on the http://omim.org/downloads link on the site.
- We have an API you can learn about at http://omim.org/api and http://omim.org/help/api,
this provides access to the data in XML, JSON, Python and Ruby formats.
- You should feel free to contact us at http://omim.org/contact to figure the best
approach to getting the data you need.
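Note that you will probably get this response, because omim.org does not want you to download the HTML automatically and would rather you use the raw data or the API. This is their robots.txt document, which all automated software should voluntarily read and obey.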
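For a site that does permit automated access, the download step and the table extraction from the question can be combined. The following is only a minimal sketch: the helper names fetch_html and extract_rows are hypothetical and not part of either module. It guards against the case where no table matches the headers, which is what produces the "Can't call method "rows" on an undefined value" error. The same situation presumably arises when parse_file is given a file name that does not exist, since HTML::Parser's parse_file returns undef instead of dying when the file cannot be opened.

```perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::TableExtract;

# Hypothetical helper: download a page and return its decoded HTML,
# or die with the HTTP status line if the request failed.
sub fetch_html {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Hypothetical helper: return the rows of the first table whose header
# row matches @$headers, with blank cells dropped and the rest trimmed.
# Returns an empty list when no table matches, instead of calling
# ->rows on an undefined value.
sub extract_rows {
    my ($html, $headers) = @_;
    my $te = HTML::TableExtract->new(headers => $headers);
    $te->parse($html);    # parse() takes HTML text, not a URL or file name
    my ($table) = $te->tables;
    return unless $table;

    my @rows;
    for my $row ($table->rows) {
        my @cells;
        for my $cell (@$row) {
            next unless defined $cell && $cell =~ /\S/;
            (my $text = $cell) =~ s/^\s+|\s+$//g;    # trim both ends
            push @cells, $text;
        }
        push @rows, \@cells;
    }
    return @rows;
}
```

With these helpers, extract_rows(fetch_html($url), [ 'Phenotype', 'Inheritance' ]) returns a list of array references that can be printed tab-separated as in the question. For omim.org itself the request is likely to return the crawler notice shown above, so the FTP data or the API is the appropriate route there.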