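I'm new to Perl, and I'm enjoying finding out what I can do with it, along with the support and documentation for all these great libraries. However, I've run into a problem with a script I'm working on. Before I added HTML::TagFilter, I used line 63 (print FH $tree->as_HTML) to print the HTML content I was after to a file. I was specifically looking at everything inside the body tag. Now I want to print only the p tags, h tags, and img tags, without any attributes. When I run my code, the files are created in the correct directory, but each file contains a printed hash object (HTML::Element=HASH(0x3a104c8)) instead of HTML.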
use open qw(:locale);
use strict;
use warnings qw(all);
use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI::Split qw/ uri_split uri_join /;
use HTML::TagFilter;

my @links;
open(FH, "<", "index/site-index.txt")
    or die "Failed to open file: $!\n";
while(<FH>) {
    chomp;
    push @links, $_;
}
close FH;

my $dir = "";
while($dir eq ""){
    print "What is the name of the site we are working on? ";
    $dir = <STDIN>;
    chomp $dir;
}

#make directory to store files
mkdir($dir);

my $entities = "";
my $indent_char = "\t";
my $filter = HTML::TagFilter->new(
    allow => {
        p   => { none => [] },
        h1  => { none => [] },
        h2  => { none => [] },
        h3  => { none => [] },
        h4  => { none => [] },
        h5  => { none => [] },
        h6  => { none => [] },
        img => { none => [] },
    },
    log_rejects => 1,
    strip_comments => 1
);

foreach my $url (@links){
    #print $url;
    my ($filename) = $url =~ m#([^/]+)$#;
    #print $filename;
    $filename =~ tr/=/_/;
    $filename =~ tr/?/_/;
    #print "\n";
    my $currentfile = $dir . '/' . $filename . '.html';
    print "Preparing " . $currentfile . "\n" . "\n";
    open (FH, '>', $currentfile)
        or die "Failed to open file: $!\n";
    my $tree = HTML::TreeBuilder->new_from_url($url);
    $tree->parse($url);
    $tree = $tree->look_down('_tag', 'body');
    if($tree){
        $tree->dump; # a method we inherit from HTML::Element
        print FH $filter->filter($tree);
        #print FH $tree->as_HTML($entities, $indent_char), "\n";
    } else{
        warn "No body tag found";
    }
    print "File " . $currentfile . " completed.\n" . "\n";
    close FH;
}
Why is this happening, and how can I print the actual content I'm looking for?
Thanks.
Answer 0 (score: 1)
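$filter->filter() expects HTML as a string. An HTML::TreeBuilder object is not HTML; it is a subclass of HTML::Element, and look_down() returns an HTML::Element as well. That is what you are seeing in your output: when you treat such a reference as a string, you get the string representation of the object. HTML::Element=HASH(0x7f81509ab6d8) means an object of class HTML::Element, implemented as a HASH structure, followed by the memory address of that object.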
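The stringification described above can be reproduced with core Perl alone. The following is a minimal sketch, not taken from the original answer: My::Element is a hypothetical stand-in class that, like HTML::Element, is a blessed hash reference, and the printed address will differ on every run.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical stand-in class (an assumption for illustration only).
package My::Element;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;   # the object is a blessed hash reference
}

sub as_text {
    my ($self) = @_;
    return $self->{text};
}

package main;

my $obj = My::Element->new(text => 'Hello');

# Interpolating the object yields "Class=HASH(0xADDRESS)", not its content.
my $as_string = "$obj";
print "$as_string\n";

print blessed($obj), "\n";   # class name: My::Element
print reftype($obj), "\n";   # underlying structure: HASH
print $obj->as_text, "\n";   # a method call is needed to get real content
```

Passing the object where a string is expected, as in $filter->filter($tree), triggers the same interpolation, so the filter only ever sees the Class=HASH(...) text.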
You can solve all of this by passing the filter the HTML serialized from the element that look_down returned:
print FH $filter->filter($tree->as_HTML);
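To see the fix in isolation, here is a small sketch that serializes the body element before filtering it. It assumes HTML::TreeBuilder and HTML::TagFilter are installed from CPAN. The inline $html sample and the reduced allow list are made up for illustration, and the tree is built with new_from_content rather than new_from_url so that it runs without network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::TagFilter;

# Made-up sample document (an assumption, standing in for a fetched page).
my $html = '<html><body><p class="intro">Hello</p><div>World</div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $body = $tree->look_down(_tag => 'body');   # an HTML::Element, not a string

# Same style of allow list as in the question, reduced to p for brevity.
my $filter = HTML::TagFilter->new(
    allow          => { p => { none => [] } },
    strip_comments => 1,
);

# as_HTML turns the element into an HTML string, which is what filter() expects.
my $filtered = $filter->filter($body->as_HTML);
print $filtered, "\n";

$tree->delete;   # release the tree once the string has been produced
```

In the question's loop, the same idea also works with the existing arguments, for example $filter->filter($tree->as_HTML($entities, $indent_char)).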