HTML::TagFilter returns an HTML::Element HASH object

Date: 2015-11-19 20:55:33

Tags: html perl

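I'm new to Perl, and I've been digging into what I can do with it and into the support and documentation for all these great libraries. However, I've run into a problem with a script I'm working on. Before I brought in HTML::TagFilter, I used line 63 (print FH $tree->as_HTML) to print the HTML content I was looking for to a file; specifically, I was taking everything inside the body tag. Now I only want to print the p tags, h tags, and img tags, without any attributes. When I run my code, the files are created in the correct directory, but each file contains only a stringified hash object (HTML::Element=HASH(0x3a104c8)).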

use open qw(:locale);
use strict;
use warnings qw(all);

use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI::Split qw/ uri_split uri_join /;
use HTML::TagFilter;

my @links;

open(FH, "<", "index/site-index.txt")
    or die "Failed to open file: $!\n";
while(<FH>) { 
    chomp; 
    push @links, $_;
} 
close FH;

my $dir = "";
while($dir eq ""){
print "What is the name of the site we are working on? ";
$dir = <STDIN>;
chomp $dir; 
}

#make directory to store files
mkdir($dir);

my $entities = "";
my $indent_char = "\t";
my $filter = HTML::TagFilter->new(
    allow=>{ p => { none => [] }, h1 => { none => [] }, h2 => { none => [] }, h3 => { none => [] }, h4 => { none => [] }, h5 => { none => [] }, h6 => { none => [] }, img => { none => [] },  },
    log_rejects => 1,
    strip_comments => 1
    );

 foreach my $url (@links){

    #print $url;

    my ($filename) = $url =~ m#([^/]+)$#;

    #print $filename;
    $filename =~ tr/=/_/;
    $filename =~ tr/?/_/;
    #print "\n";

    my $currentfile = $dir . '/' . $filename . '.html';

    print "Preparing " . $currentfile . "\n" . "\n";

    open (FH, '>', $currentfile)
        or die "Failed to open file: $!\n";


    my $tree = HTML::TreeBuilder->new_from_url($url);
    $tree->parse($url);
    $tree = $tree->look_down('_tag', 'body');
    if($tree){
        $tree->dump; # a method we inherit from HTML::Element
        print FH $filter->filter($tree);
        #print FH $tree->as_HTML($entities, $indent_char), "\n";
    } else{
        warn "No body tag found";
    }

    print "File " . $currentfile . " completed.\n" . "\n";

    close FH;

}

Why is this happening, and how can I print the actual content I'm looking for?

Thanks.

1 Answer:

Answer 0 (score: 1)

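$filter->filter() expects a string of HTML. An HTML::TreeBuilder object is not HTML; it is a subclass of HTML::Element, and look_down() likewise returns an HTML::Element. That is what you see in your output: when you treat an object reference as a string, you get the string representation of the reference. HTML::Element=HASH(0x7f81509ab6d8) means an object of class HTML::Element that is implemented as a hash, followed by that object's memory address.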
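The stringification described above can be reproduced with core Perl alone. This is a minimal sketch: My::Element is a hypothetical stand-in class written for illustration, not part of HTML::Element, and its as_HTML method only mimics the idea of serializing an object to markup.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, used only to illustrate stringification.
package My::Element;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;    # the object is a blessed hash reference
}

sub as_HTML {
    my ($self) = @_;
    return '<' . $self->{tag} . '>' . $self->{text} . '</' . $self->{tag} . '>';
}

package main;

my $node = My::Element->new(tag => 'p', text => 'hello');

# Using the reference as a string yields "Class=HASH(0xADDRESS)".
my $as_string = "$node";

# Calling a serializing method yields the content instead.
my $as_html = $node->as_HTML;

print "$as_string\n";
print "$as_html\n";
```

The first line printed has the same Class=HASH(0x...) shape as the text found in the output files, with an address that varies between runs; the second is the markup itself.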

You can fix this by calling the filter with the HTML of the element that look_down returned:

         print FH $filter->filter($tree->as_HTML);
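For context, here is a small self-contained sketch of that fix. It assumes HTML::TreeBuilder 5 and HTML::TagFilter are installed from CPAN; the inline $html string is a hypothetical stand-in for a fetched page, and the allow list is shortened to p for brevity.

```perl
use strict;
use warnings;

use HTML::TreeBuilder 5 -weak;    # same import as in the question's script
use HTML::TagFilter;

# Hypothetical inline page, standing in for a fetched URL.
my $html = '<html><body><div class="box"><p class="x">Hello</p></div></body></html>';

my $filter = HTML::TagFilter->new(
    allow => { p => { none => [] } },    # same allow-list form as the question
);

my $tree = HTML::TreeBuilder->new_from_content($html);
my $body = $tree->look_down('_tag', 'body');

my $filtered = '';
if ($body) {
    # Serialize the element first, so filter() receives a string of HTML
    # rather than an object reference.
    $filtered = $filter->filter($body->as_HTML);
    print $filtered, "\n";
} else {
    warn "No body tag found";
}
```

The exact markup printed depends on how as_HTML serializes the tree, so treat the output as illustrative; the point is that filter() now sees HTML text and no longer produces HTML::Element=HASH(...).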