Perl crawler throws "XPath failure" even though the XPath is valid

Asked: 2014-10-30 17:17:10

Tags: perl xpath web-crawler

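My Perl crawler throws an "XPath failure" error even when the XPath is valid. Sometimes I get empty crawled content even though the content exists on the page. When I process 100 URLs, I get the XPath failure for 3 to 5 of them. I increased the timeout to 660 seconds, but the error persists. If I run the script again, the same URLs are crawled correctly.

Here is the crawler part of my code; the input (url, xpath) is read from a database.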

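use LWP::Simple;
use LWP::UserAgent;    # needed for LWP::UserAgent->new below
use File::Compare;
use HTML::TreeBuilder::XPath;
use HTML::Strip;       # needed for HTML::Strip->new below
use Encode qw(decode_utf8 encode_utf8);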

my $url    = $results->{url};
my $xpath  = $results->{xpath};
my $region = $results->{region};

my $ua = LWP::UserAgent->new( agent => "Mozilla/30.0" );
$ua->timeout(660);
$ua->env_proxy;

my $req = HTTP::Request->new( GET => "$url" );
my $res = $ua->request($req);
delete $ENV{HTTP_PROXY};
my $error = $res->status_line;

if ( $res->is_success )    # any 2xx status, not only 200
{
    $htmlcreation = "$competitor.html";
    my $xp = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # Evaluate the XPath once and reuse the result
    my $raw_html = $xp->findnodes_as_string($xpath);
    $xp->delete;           # free the parse tree

    if ( length $raw_html ) {

        my $hs                 = HTML::Strip->new();
        my $clean_text_content = decode_utf8( $hs->parse( encode_utf8($raw_html) ) );
        $hs->eof;

        my $fileper = "/var/www/auto-eu/folder_history/crawl_for_$date/$htmlcreation";
        if ( open my $html_crawl, '>:encoding(cp1252)', $fileper ) {
            print {$html_crawl} $clean_text_content;    # write crawled content under the competitor name
            close $html_crawl;
            chmod( 0777, $fileper );
        } else {
            warn "cannot open $fileper: $!";
        }
    } else {
        print "xpath failed\n";
    }

} else {
    print "service unavailable: $error\n";
}
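
Since a second run always succeeds, one thing I could try is retrying a URL when the XPath comes back empty instead of giving up on the first miss. Below is a minimal sketch of that idea, not a confirmed fix. The helper name `retry_until_nonempty` and its arguments are hypothetical and not part of my script, and the demo uses a simulated flaky source rather than a real HTTP request.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: call $attempt up to
# $max_tries times and return the first non-empty string it produces.
# Sleeps $delay seconds between tries; returns undef if every try is empty.
sub retry_until_nonempty {
    my ( $attempt, $max_tries, $delay ) = @_;

    for my $try ( 1 .. $max_tries ) {
        # eval keeps one dying attempt from aborting the whole crawl
        my $result = eval { $attempt->($try) };
        warn "try $try died: $@" if $@;

        return $result if defined $result && length $result;
        sleep $delay if $delay && $try < $max_tries;
    }
    return;
}

# Demo with a simulated flaky source: empty on the first two tries.
my $calls = 0;
my $html  = retry_until_nonempty(
    sub { $calls++; return $calls < 3 ? '' : '<div>price</div>' },
    5,    # max tries
    0,    # no delay in the demo
);
print "got '$html' after $calls tries\n";
```

In the crawler, the callback would perform the `$ua->request($req)` call, build the tree, and return the result of `findnodes_as_string($xpath)`. Printing `$res->status_line` and the length of `$res->decoded_content` on each miss might also show whether the server sent a short or different page on the failing requests.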

0 answers:

No answers