My goal is to extract the links from the tables titled "Agonists", "Antagonists", and "Allosteric Regulators" on the following page:
http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1
I have been using HTML::TableExtract to extract the tables, but I can't get HTML::LinkExtor to retrieve the links in question. Here is the code I have so far:
use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html2/`;
foreach (@names)
{
    chomp ($_);
    my $te = HTML::TableExtract->new( headers => [ "Ligand",
                                                   "Sp.",
                                                   "Action",
                                                   "Affinity",
                                                   "Units",
                                                   "Reference" ] );
    my $le = HTML::LinkExtor->new();
    $te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
    my $output = $_;
    $output =~ s/\.html/\.txt/g;
    open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't reset";
    close RESET;
    #open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't append to file";
    foreach my $ts ($te->tables)
    {
        foreach my $row ($ts->rows)
        {
            $le->parse($row->[0]);
            for my $link_tag ( $le->links )
            {
                my %links = @$link_tag;
                print @$link_tag, "\n";
            }
        }
    }
    #print "Links extracted from $_\n";
}
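I have tried some example code from other threads on this site (Perl parse links from HTML Table), to no avail. I'm not sure whether this is a parsing problem or a table-recognition problem. Any help would be greatly appreciated. Thanks!
Answer 0 (score: 3)
Try this as a base script (you only need to adapt it to get the links):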
use warnings; use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
use WWW::Mechanize;
use utf8;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
my $m = WWW::Mechanize->new( autocheck => 1, quiet => 0 );
$m->agent_alias("Linux Mozilla");
$m->cookie_jar({});
my $te = HTML::TableExtract->new(
    headers => [
        "Ligand",
        "Sp.",
        "Action",
        "Affinity",
        "Units",
        "Reference"
    ]
);
$te->parse(
    $m->get("http://tinyurl.com/jvwov9m")->content
);
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
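Answer 1 (score: 2)
You didn't describe what the problem is... what exactly isn't working? What does $row->[0] contain? Part of the problem may be that HTML::TableExtract by default returns only the "visible" text of each cell, not the raw HTML, so there are no <a> tags left for HTML::LinkExtor to find. You may want to use the keep_html option of HTML::TableExtract.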
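To illustrate the keep_html suggestion, here is a minimal sketch rather than a drop-in fix. The sample HTML, its /example/... hrefs, and the ligand_links helper are hypothetical stand-ins for one of the saved pages; the real markup may differ. It also unpacks each LinkExtor result as a tag name followed by attribute/value pairs, since the first element of each result is the tag name, not part of the attribute hash.

```perl
use strict;
use warnings;
use HTML::TableExtract;
use HTML::LinkExtor;

# Hypothetical sample standing in for one saved page (assumption: real markup may differ).
my $html = <<'HTML';
<table>
<tr><th>Ligand</th><th>Sp.</th><th>Action</th><th>Affinity</th><th>Units</th><th>Reference</th></tr>
<tr><td><a href="/example/ligand-1">ligand A</a></td><td>H</td><td>Full agonist</td><td>7.0</td><td>pKi</td><td>1</td></tr>
<tr><td><a href="/example/ligand-2">ligand B</a></td><td>R</td><td>Antagonist</td><td>8.1</td><td>pKi</td><td>2</td></tr>
</table>
HTML

# ligand_links is a hypothetical helper: it returns the href of every <a>
# found in the first ("Ligand") column of each matching table.
sub ligand_links {
    my ($html) = @_;

    my $te = HTML::TableExtract->new(
        headers   => [ 'Ligand', 'Sp.', 'Action', 'Affinity', 'Units', 'Reference' ],
        keep_html => 1,    # keep the raw cell HTML so the <a> tags survive
    );
    $te->parse($html);
    $te->eof;

    my @urls;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            my $cell = $row->[0];
            next unless defined $cell;

            # A fresh parser per cell keeps parser state isolated between rows.
            my $le = HTML::LinkExtor->new;
            $le->parse($cell);
            $le->eof;

            for my $link ($le->links) {
                # Each result is [ tag, attr => value, ... ].
                my ($tag, %attr) = @$link;
                push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
            }
        }
    }
    return @urls;
}

print "$_\n" for ligand_links($html);
```

For the saved files, the same idea should work with $te->parse_file($path) in place of $te->parse($html). If the pages use relative hrefs, passing a base URL as the second argument to HTML::LinkExtor->new (for example HTML::LinkExtor->new(undef, $base)) makes it return absolute URLs.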