Question

I'm trying to use a simple Perl script to grab the headline from a Washington Post news page:

#! /usr/bin/env perl

use strict;
use warnings;
use LWP::Simple;
use Web::Scraper;

my $url = 'https://www.washingtonpost.com/outlook/why-trump-is-flirting-with-abandoning-fox-news-for-one-america/2019/10/11/785fa156-eba4-11e9-85c0-85a098e47b37_story.html';
my $scraper = scraper {
    process '//h1[@data-qa="headline"]', 'headline' => 'TEXT';
};


my $html = get($url);
print $html;
my $res = $scraper->scrape ($html);

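The problem is that even when I fetch exactly the same URL, it only works about half the time. The rest of the time, the returned source is formatted completely differently.

Maybe this is an anti-scraping measure aimed at unknown user agents? I'm not sure, but that's what happens.

Is there a simple workaround, such as accepting cookies?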

Answer 1

Change $scraper to the following so that it also works with the other version of the source:

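my $scraper = scraper {
    # Variant served some of the time: <h1 data-qa="headline">
    process '//h1[@data-qa="headline"]', 'headline' => 'TEXT';
    # Variant served the rest of the time: <h1 itemprop="headline">
    process '//h1[@itemprop="headline"]', 'headline2' => 'TEXT';
};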

Either headline or headline2 will be populated, depending on which version of the page was returned.
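Since only one of the two fields is filled on any given fetch, the caller can fall back from one to the other. The following is a minimal sketch of that idea, not a definitive implementation: the two HTML strings are made-up stand-ins for the two page variants (so the example runs without network access), and `extract_headline` is a hypothetical helper name that does not appear in the original code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;

# Made-up HTML fragments standing in for the two variants of the page.
my @variants = (
    '<html><body><h1 data-qa="headline">Variant A title</h1></body></html>',
    '<html><body><h1 itemprop="headline">Variant B title</h1></body></html>',
);

my $scraper = scraper {
    process '//h1[@data-qa="headline"]',  'headline'  => 'TEXT';
    process '//h1[@itemprop="headline"]', 'headline2' => 'TEXT';
};

# Hypothetical helper: returns whichever headline field was populated,
# or undef if neither XPath expression matched.
sub extract_headline {
    my ($html) = @_;
    my $res = $scraper->scrape($html);
    return $res->{headline} // $res->{headline2};
}

for my $html (@variants) {
    my $title = extract_headline($html);
    print defined $title ? "$title\n" : "no headline found\n";
}
```

In the real script, the string returned by get($url) would be passed to extract_headline in place of the sample fragments.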
