Saving PDF files with WWW::Mechanize corrupts them

Time: 2014-07-19 20:31:17

Tags: perl web-scraping www-mechanize

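I'm trying to write a script that logs in to Bank of America and downloads PDF statements. I've worked out all the hard parts, but I'm stuck on saving the PDF files. I've tried both the ':content_file' => "some file path" method and $mech->save_content("same file path"). Normally either of these works fine (even for PDFs). A typical BoA PDF statement is 4 pages long and about 400k in size.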

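If I use the former method, the file is truncated to 33k and cannot be opened with Preview on the Mac (though I can see the PDF header and the EPS binary gibberish in Sublime). If I use the latter method, the file is saved with 95 extra bytes (compared with downloading it in Chrome), which somehow breaks the second page (of 4). The only obvious difference is that the file downloaded by Mechanize has an extra line containing the character "0" and some newlines at the end. diff reports "Binary files 2014-06-19 Statement.pdf and eStmt_2014-06-19.pdf differ". I don't know how to pin down the remaining 92 bytes of difference.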

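Oh, found something: with save_content(), every few hundred lines in the PDF I get a newline, the string "8000", and another trailing newline... and then the binary data starts again. Not sure what that is. There appear to be 10 such instances (which accounts for another 50 extra bytes).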
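
One way to locate the extra bytes is to compare the two files as raw bytes rather than relying on diff. The following is only a minimal sketch using core Perl: the helper names (slurp_raw, first_difference, marker_offsets) are made up for illustration, and the pattern assumes the stray lines consist solely of hexadecimal digits between newlines, as "8000" and "0" do. It can also flag legitimate PDF lines that happen to look the same (for example a numeric offset on its own line), so treat its output as a hint only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file as raw bytes (no newline translation, no decoding).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Return the offset of the first differing byte, or -1 if both are identical.
sub first_difference {
    my ($left, $right) = @_;
    return -1 if $left eq $right;
    # XOR of equal bytes is NUL, so the leading run of NULs is the common prefix.
    my $xor = $left ^ $right;
    $xor =~ /^\0*/;
    my $prefix  = $+[0];
    my $shorter = length($left) < length($right) ? length($left) : length($right);
    return $prefix < $shorter ? $prefix : $shorter;
}

# List [offset, text] pairs for every line made up only of hex digits.
sub marker_offsets {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /\r?\n([0-9A-Fa-f]{1,8})(?=\r?\n)/g) {
        push @found, [ $-[0], $1 ];
    }
    return @found;
}

# Hypothetical usage: perl compare.pl mechanize.pdf chrome.pdf
if (@ARGV == 2) {
    my ($saved, $reference) = map { slurp_raw($_) } @ARGV;
    printf "first difference at byte %d\n", first_difference($saved, $reference);
    printf "marker-like line at byte %d: %s\n", @$_ for marker_offsets($saved);
}
```

Run against the Mechanize copy and the Chrome copy, this prints the first offset where the files diverge and every marker-like line in the first file.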

Does anyone know what could be going on here?

I have the following code:

#!/usr/bin/perl
use strict;

use WWW::Mechanize;
use Date::Parse;
use DateTime;
use File::Path;

########################################################################################################################
#                Change only the configuration settings in this section, nothing above or below it.                    #
########################################################################################################################

# Credentials
my $username = "someusername";
my $password = "somepassword";

# Enclose value in double quotes, folders with spaces in the name are ok.
my $root_folder = "/Users/john/Documents/Important/Credit Card Statements";

########################################################################################################################
########################################################################################################################

# Suddenly web robot.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Mac Safari');

# First we have to log in.
$mech->get("https://www.bankofamerica.com/");

# Login, blah.
$mech->submit_form(
  form_name => 'frmSignIn',
  fields  => { Access_ID => $username },
);

# Dumb thing uses a meta refresh...
$mech->follow_link(url_regex => qr/signOn\.go/);

# This is what they call two factor authentication. Heh.
$mech->submit_form(
  form_name => 'ConfirmSitekeyForm',
  fields  => { password => $password },
);

# Just the single account for now... maybe make this a loop later?
#for my $link ($mech->find_all_links(url_regex => qr/redirect\.go.+?target=acctDetails/)) {
$mech->follow_link(url_regex => qr/redirect\.go.+?target=acctDetails/);

# We need the last four digits, easiest here.
my ($fourdigits) = $mech->content() =~ /<span class="bold TL_NPI_AcctName">.+? - (\d{4})</;

# Go to the account details page... 
$mech->follow_link(url_regex => qr/redirect\.go.+?target=statements/);

# Now we need to select which documents we want...
# I'm assuming that you're running this daily in cron. Therefore, we're only going to search the last 60 days.
my $mech2 = $mech->clone();

$mech2->submit_form(
  form_name => 'statementsAndDocTab',
  fields  => { docItemSelected   => 'All',
               dateRangeSelected => '60D',
               selectedDocCode   => 'All',
               selectedDateRange => '60D',
             },
);

# These are nasty javascripty links. I think I have to post to this damn thing, to get a pdf response back. Need to
# regex-loop.
my $page = $mech2->content();
while ($page =~ /id="hidden-documentId\d+" value="(\d+)" name="statement-name".+?onclick="docInboxModuleAccountSkin.downloadLayerSubmit\(this,'downloadPdf','(.+?)', '(.+?)','([0-9\/]+)','(.+?)'/gs) {
    my $documentId = $1;
    my $actionurl = "https://secure.bankofamerica.com" . $2 . "&nocache=" . sprintf("%05d", int(rand(100000)));
    my $docName = $3;
    my $boadate = $4;
    my $documentTypeId = $5;
    my $year = DateTime->from_epoch(epoch => str2time($boadate))->year;
    my $date = DateTime->from_epoch(epoch => str2time($boadate))->ymd;

    # There are more than just statements here. What do we name the files?
    my $filename;
    if    ($docName =~ m/Change in Terms/i) { $filename = "$date Change in Terms.pdf"; }
    elsif ($docName =~ m/Statement/i)       { $filename = "$date Statement.pdf"; }
    else                                    { $filename = "$date Unknown.pdf"; }

    # We may need to create a folder for the year...
    File::Path::make_path("$root_folder/Bank of America - $fourdigits/$year");

    # Get the file.
    unless (-f "$root_folder/Bank of America - $fourdigits/$year/$filename") {
        my $pdf = $mech2->clone();
        # Normally we'd just do $pdf->get(), but we need to do a submit_form. Unfortunately, the form doesn't exist,
        # javascript creates it in place. Ugh.
        $pdf->post( $actionurl,
         #           ':content_file' => "$root_folder/Bank of America - $fourdigits/$year/$filename",
                    [ documentId     => $documentId,
                      menu           => 'downloadPdf',
                      viewDownload   => 'downloadPdf',
                      date           => $boadate,
                      docName        => $docName,
                      documentTypeId => $documentTypeId,
                      version        => '',
                    ],
        );

        $pdf->save_content("$root_folder/Bank of America - $fourdigits/$year/$filename");

        # Let's do a notification...
        #system("/usr/local/bin/terminal-notifier -message \"Bank of America document dated $date has been downloaded.\" -title \"Statement Retrieved\" ");
    }
}

1 Answer:

Answer 0: (score: 1)

From a quick look at the save_content method in the WWW::Mechanize documentation, it may be worth trying:

$mech->save_content( $filename, binary => 1 );

The problem you describe sounds like the kind you get when binary data is saved in ASCII mode.
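
To illustrate what saving binary data in text mode can do in general, here is a minimal sketch in core Perl. It does not use WWW::Mechanize and says nothing about its internals. The helper name write_bytes and the sample payload are made up for illustration. The same bytes are written once through the :raw layer and once through the :crlf layer, which rewrites every "\n" as "\r\n" and so lengthens the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $bytes to $path through the given PerlIO layer and return the file size.
sub write_bytes {
    my ($path, $layer, $bytes) = @_;
    open my $fh, ">$layer", $path or die "Cannot open $path: $!";
    print {$fh} $bytes;
    close $fh or die "Cannot close $path: $!";
    return -s $path;
}

# A tiny stand-in for PDF data: 20 bytes, three of which are newlines.
my $payload = "%PDF-1.4\n\x00\x01\xFF\nbinary\n";

my (undef, $raw_path)  = tempfile(UNLINK => 1);
my (undef, $crlf_path) = tempfile(UNLINK => 1);

my $raw_size  = write_bytes($raw_path,  ':raw',  $payload);
my $crlf_size = write_bytes($crlf_path, ':crlf', $payload);

printf "payload: %d bytes, raw: %d bytes, crlf: %d bytes\n",
    length($payload), $raw_size, $crlf_size;
```

If the raw copy matches the payload length and the other copy is longer, newline translation is the culprit. This kind of translation only alters newlines, so it would not by itself explain extra "8000" lines.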