Spider a website and retrieve all links containing a keyword

Time: 2014-12-15 20:22:17

Tags: bash copy wget

How do I make a Bash script that copies all the links on a website (without downloading the site itself)? It only needs to collect every link and save them to a txt file.

I tried this code:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

Example: the site contains download links (e.g. to dlink.com), so I only want to copy every link that contains dlink.com and save them to a txt file.

I have googled around, but nothing I found was of any use.
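Roughly, what I am trying to do looks like this (an untested sketch assuming GNU wget and grep; wget --spider writes its log to stderr, so it has to be redirected before grep sees anything, and the URL pattern is only a guess):

# Untested sketch: crawl one level deep, keep only URLs mentioning dlink.com
wget --spider --force-html -r -l1 http://somesite.com 2>&1 \
  | grep -oE 'https?://[^ ]+' \
  | grep -i 'dlink\.com' \
  | sort -u > links.txt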

1 Answer:

Answer 0 (score: 2)

In Perl, using a proper parser:

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
my ($url, $f, $p, $res);

if(@ARGV) { 
    $url = $ARGV[0]; }
else {
    print "Enter an URL : ";
    $url = <>;
    chomp($url);
}

my @array = ();
sub callback {
   my($tag, %attr) = @_;
   return if $tag ne 'a';  # we only look closer at <a href ...>
   push(@array, $attr{href}) if defined $attr{href} && $attr{href} =~ /dlink\.com/i;
}

# Make the parser.  Unfortunately, we don’t know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all URLs to absolute ones
my $base = $res->base;
@array = map { url($_, $base)->abs } @array;

# Print them out
print join("\n", @array), "\n";
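If you save the script as, say, extract_links.pl (the file name is just an example), you get the txt file the question asks for by redirecting its output:

perl extract_links.pl http://somesite.com > links.txt

Compared to grepping wget's log, HTML::LinkExtor pulls the hrefs out of the parsed HTML itself, and URI::URL resolves relative links against the response's base URL before they are printed.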