I have an .html file full of links, and I want to extract the domain names without the http:// (i.e. just the hostname part of each link, e.g. blah.com), list them, and remove duplicates.
Here's what I've come up with so far — I think the problem is the way I'm trying to pass the $tree data:
#!/usr/local/bin/perl -w
use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI;

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    my $u1 = URI->new($tree);
    print "host: ", $u1->host, "\n";
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    # Now that we're done with it, we must destroy it.
    # $tree = $tree->delete; # Not required with weak references
}
Answer 0 (score: 4)
Personally, I would use Mojo::DOM, and extract the domains with the URI module:
use Mojo::DOM;
use URI;
use List::AllUtils qw/uniq/;

my @domains = sort +uniq
    map eval { URI->new( $_->{href} )->authority } // (),
    Mojo::DOM->new( $html_code )->find("a[href]")->each;
(P.S. The exception handling around ->authority is there because some URIs, such as mailto:s, will croak here.)
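To illustrate why hrefs like mailto: need that guard, here is a small Python sketch (the hrefs are made-up examples): a mailto: URI simply has no authority/host component, so extracting the host from every href blindly would yield empty or bogus entries. Filtering those out and deduplicating via a set gives the same result as the Perl uniq/sort pipeline.

```python
from urllib.parse import urlparse

# Hypothetical hrefs as they might come out of an <a href="..."> scrape.
hrefs = [
    "http://blah.com/page",
    "https://example.org/x",
    "mailto:someone@example.org",  # no authority component at all
    "http://blah.com/other",       # duplicate host
]

# Keep only hrefs that actually have a host (netloc); dedupe with a set, then sort.
hosts = sorted({urlparse(h).netloc for h in hrefs if urlparse(h).netloc})
print(hosts)  # ['blah.com', 'example.org']
```

Unlike Perl's URI, urlparse never dies on an odd scheme — it just returns an empty netloc — so the `if` filter plays the role of the eval {} // () idiom above.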
Answer 1 (score: 2)
Here's another option:
use strict;
use warnings;
use Regexp::Common qw/URI/;
use URI;

my %hosts;

while (<>) {
    $hosts{ URI->new($1)->host }++ while /$RE{URI}{-keep}/g;
}

print "$_\n" for keys %hosts;
Command-line usage: perl script.pl htmlFile1 [htmlFile2 ...] [>outFile]
You can pass the script multiple html files. The optional last argument redirects the output to a file.
Partial output, using the cnn.com home page as the html source:
www.huffingtonpost.com
a.visualrevenue.com
earlystart.blogs.cnn.com
reliablesources.blogs.cnn.com
insideman.blogs.cnn.com
cnnphotos.blogs.cnn.com
cnnpresents.blogs.cnn.com
i.cdn.turner.com
www.stylelist.com
js.revsci.net
z.cdn.turner.com
www.cnn.com
...
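The same scan-the-raw-text idea can be sketched in Python (with a deliberately simplified URL pattern — Regexp::Common's URI grammar is far stricter, and the html here is a made-up snippet): find every http(s) URL in the input, pull out the host, and count occurrences in a dict, just as the Perl script does with %hosts.

```python
import re
from urllib.parse import urlparse

# Hypothetical html input standing in for a scraped page.
html = '''<a href="http://www.cnn.com/">CNN</a>
<a href="http://earlystart.blogs.cnn.com/x">blog</a>
<a href="http://www.cnn.com/us">US</a>'''

# Crude URL matcher; good enough for a sketch, not a full URI grammar.
url_re = re.compile(r'https?://[^\s"\'<>]+')

# Count each host, mirroring $hosts{ ... }++ in the Perl version.
hosts = {}
for url in url_re.findall(html):
    host = urlparse(url).netloc
    hosts[host] = hosts.get(host, 0) + 1

print(sorted(hosts))  # ['earlystart.blogs.cnn.com', 'www.cnn.com']
```

Because the hosts are dict keys, duplicates collapse automatically; printing the keys gives the deduplicated list the question asked for.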
Hope this helps!