Perl: scraping a website, and how to download PDF files from it using Selenium::Chrome

Date: 2021-07-15 18:47:37

Tags: selenium perl selenium-webdriver selenium-chromedriver

I'm working on scraping a website with Selenium::Chrome in Perl, and I'd like to know how to download all the PDF files from 2017 through 2021 and store them in a folder, starting from this page: https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021 . This is what I have so far:

use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path qw(make_path remove_tree);
use LWP::Simple;


my $collection_name = "mre_zen_test3";
make_path("$collection_name");

#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;

#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);

#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");

This script only saves the full page source of the site. I hope someone here can help me out and show me how. Thanks a lot.

1 Answer:

Answer 0 (score: 5)

Here is some working code, to help get you going:

use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # only convenience

use Selenium::Chrome;

my $base_url = q(https://www.fda.gov/drugs/)
    . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);

my $show = 1;  # to see navigation. set to false for headless operation
    
# A little demo of how to set some browser options
my %chrome_capab = do {
    my @cfg = ($show) 
        ? ('window-position=960,10', 'window-size=950,1180')
        : 'headless';
    'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};

my $drv = Selenium::Chrome->new( %chrome_capab );

my @years = 2017..2021;
foreach my $year (@years) {
    my $url = $base_url . "untitled-letters-$year";

    $drv->get($url);

    say "\nPage title: ", $drv->get_title;
    sleep 1 if $show;

    my $elem = $drv->find_element(
        q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
    );
    sleep 1 if $show;
    
    # Downloading the file is surprisingly not simple with Selenium (see text)
    # But as we found the link we can get its url and then use Selenium-provided 
    # user-agent (it's LWP::UserAgent)
    my $href = $elem->get_attribute('href');
    say "pdf's url: $href";

    my $response = $drv->ua->get($href);
    die $response->status_line if not $response->is_success;

    say "Downloading 'Content-Type': ", $response->header('Content-Type'); 
    my $filename = "download_$year.pdf";
    say "Save as $filename";
    path($filename)->spew( $response->decoded_content );
}

This takes shortcuts, switches approaches, and sidesteps some issues (which would need addressing for a fuller use of this useful tool). It downloads one pdf from each page; to download all of them we need to change the XPath expression used to locate them:

my @hrefs = 
    map { $_->get_attribute('href') } 
    $drv->find_elements(
        # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
        q{//li[contains(text(), '(PDF)')]}
      . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]} 
    );

Now loop over the links, forming the filenames more carefully, and download each one as in the program above. I can fill the gaps further if needed.
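That loop might be sketched as follows. This is a sketch under assumptions, not tested against the live site: it reuses `$drv`, `@hrefs`, and `$year` from the snippets above, and the `name_from_href` helper is hypothetical, guessing that the hrefs look like `/media/NNNNNN/download` so the numeric id can serve as a unique filename component:

```perl
use warnings;
use strict;
use feature 'say';
use Path::Tiny;

# Hypothetical helper: derive a filesystem-safe name from an href,
# assuming the "/media/NNNNNN/download" URL shape seen in the XPath above
sub name_from_href {
    my ($href, $year) = @_;
    my ($id) = $href =~ m{/media/(\d+)/download};
    return defined $id ? "letter_${year}_${id}.pdf" : "letter_${year}.pdf";
}

# Sketch of the download loop: $drv, @hrefs, and $year come from the
# surrounding program in the answer
for my $href (@hrefs) {
    # hrefs on the page may be relative, so prepend the host if needed
    my $url = $href =~ m{^https?://} ? $href : "https://www.fda.gov$href";

    my $response = $drv->ua->get($url);
    if (not $response->is_success) {
        warn "Failed to fetch $url: ", $response->status_line;
        next;
    }

    my $filename = name_from_href($href, $year);
    say "Saving $filename";
    path($filename)->spew( $response->decoded_content );
}
```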

The code puts the pdf files on disk, in its working directory. Please review it before running, to make sure that nothing gets overwritten!

See Selenium::Remote::Driver to get started.


Note: Selenium isn't needed for this particular task; it's all straight-up HTTP requests, with no JavaScript involved. So LWP::UserAgent or Mojo would do it. But I take it that you want to learn how to use Selenium, since it is often needed and is useful.
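To illustrate that point, here is a minimal Selenium-free sketch for one year's page using LWP::UserAgent. It is untested against the live site, and the regex that pulls out the links is a rough assumption about the page markup (a real parser such as Mojo::DOM or HTML::TreeBuilder would be more robust):

```perl
use warnings;
use strict;
use feature 'say';
use LWP::UserAgent;
use Path::Tiny;

my $ua = LWP::UserAgent->new(timeout => 30);

my $url = 'https://www.fda.gov/drugs/'
    . 'warning-letters-and-notice-violation-letters-pharmaceutical-companies/'
    . 'untitled-letters-2021';

my $resp = $ua->get($url);
die $resp->status_line if not $resp->is_success;

# Extract the "/media/NNNNNN/download" links with a regex -- crude but
# serviceable here; this assumes double-quoted href attributes
my @hrefs = $resp->decoded_content =~ m{href="(/media/\d+/download)"}g;

for my $href (@hrefs) {
    my $pdf = $ua->get("https://www.fda.gov$href");
    if (not $pdf->is_success) {
        warn "Failed to fetch $href: ", $pdf->status_line;
        next;
    }
    my ($id) = $href =~ m{/media/(\d+)/};
    path("letter_$id.pdf")->spew( $pdf->decoded_content );
    say "Saved letter_$id.pdf";
}
```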