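I'm trying to write a script that logs in to Bank of America and downloads PDF statements. I've worked out all the hard parts, but I'm hung up on saving the PDF files. I've tried both the ':content_file' => "some file path" approach and $mech->save_content("same file path"). Normally either of these works fine (even for PDFs). A typical BoA PDF statement is 4 pages long and about 400k in size.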
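If I use the former method, it truncates the file to 33k, and the file can't be opened in Preview on the Mac (though I can see the PDF header and EPS binary gibberish in Sublime). If I use the latter method, it saves the file with 95 extra bytes (compared to downloading it in Chrome), which somehow messes up the second page (of 4). The only obvious difference is that the file downloaded by Mechanize has an extra line containing the character "0" and some newlines at the end. diff reports "Binary files 2014-06-19 Statement.pdf and eStmt_2014-06-19.pdf differ". I don't know how to pin down the remaining 92 bytes of difference.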
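Oh, found something: with save_content(), every few hundred lines in the PDF I get a newline, the string "8000", and another trailing newline... and then the binary starts again. Not sure what that is. There appear to be 10 such instances (which accounts for another 50 of the extra bytes).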
Does anyone know what might be going on here?
I have the following code:
#!/usr/bin/perl
use strict;
use WWW::Mechanize;
use Date::Parse;
use DateTime;
use File::Path;
########################################################################################################################
# Change only the configuration settings in this section, nothing above or below it. #
########################################################################################################################
# Credentials
my $username = "someusername";
my $password = "somepassword";
# Enclose value in double quotes, folders with spaces in the name are ok.
my $root_folder = "/Users/john/Documents/Important/Credit Card Statements";
########################################################################################################################
########################################################################################################################
# Suddenly web robot.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Mac Safari');
# First we have to log in.
$mech->get("https://www.bankofamerica.com/");
# Login, blah.
$mech->submit_form(
    form_name => 'frmSignIn',
    fields    => { Access_ID => $username },
);
# Dumb thing uses a meta refresh...
$mech->follow_link(url_regex => qr/signOn\.go/);
# This is what they call two factor authentication. Heh.
$mech->submit_form(
    form_name => 'ConfirmSitekeyForm',
    fields    => { password => $password },
);
# Just the single account for now... maybe make this a loop later?
#for my $link ($mech->find_all_links(url_regex => qr/redirect\.go.+?target=acctDetails/)) {
$mech->follow_link(url_regex => qr/redirect\.go.+?target=acctDetails/);
# We need the last four digits, easiest here.
my ($fourdigits) = $mech->content() =~ /<span class="bold TL_NPI_AcctName">.+? - (\d{4})</;
# Go to the account details page...
$mech->follow_link(url_regex => qr/redirect\.go.+?target=statements/);
# Now we need to select which documents we want...
# I'm assuming that you're running this daily in cron. Therefore, we're only going to search the last 60 days.
my $mech2 = $mech->clone();
$mech2->submit_form(
    form_name => 'statementsAndDocTab',
    fields    => {
        docItemSelected   => 'All',
        dateRangeSelected => '60D',
        selectedDocCode   => 'All',
        selectedDateRange => '60D',
    },
);
# These are nasty javascripty links. I think I have to post to this damn thing, to get a pdf response back. Need to
# regex-loop.
my $page = $mech2->content();
while ($page =~ /id="hidden-documentId\d+" value="(\d+)" name="statement-name".+?onclick="docInboxModuleAccountSkin.downloadLayerSubmit\(this,'downloadPdf','(.+?)', '(.+?)','([0-9\/]+)','(.+?)'/gs) {
    my $documentId = $1;
    my $actionurl = "https://secure.bankofamerica.com" . $2 . "&nocache=" . sprintf("%05d", int(rand(100000)));
    my $docName = $3;
    my $boadate = $4;
    my $documentTypeId = $5;
    my $year = DateTime->from_epoch(epoch => str2time($boadate))->year;
    my $date = DateTime->from_epoch(epoch => str2time($boadate))->ymd;
    # There are more than just statements here. What do we name the files?
    my $filename;
    if ($docName =~ m/Change in Terms/i) { $filename = "$date Change in Terms.pdf"; }
    elsif ($docName =~ m/Statement/i) { $filename = "$date Statement.pdf"; }
    else { $filename = "$date Unknown.pdf"; }
    # We may need to create a folder for the year...
    File::Path::make_path("$root_folder/Bank of America - $fourdigits/$year");
    # Get the file.
    unless (-f "$root_folder/Bank of America - $fourdigits/$year/$filename") {
        my $pdf = $mech2->clone();
        # Normally we'd just do $pdf->get(), but we need to do a submit_form. Unfortunately, the form doesn't exist,
        # javascript creates it in place. Ugh.
        $pdf->post( $actionurl,
            # ':content_file' => "$root_folder/Bank of America - $fourdigits/$year/$filename",
            [ documentId     => $documentId,
              menu           => 'downloadPdf',
              viewDownload   => 'downloadPdf',
              date           => $boadate,
              docName        => $docName,
              documentTypeId => $documentTypeId,
              version        => '',
            ],
        );
        $pdf->save_content("$root_folder/Bank of America - $fourdigits/$year/$filename");
        # Let's do a notification...
        #system("/usr/local/bin/terminal-notifier -message \"Bank of America document dated $date has been downloaded.\" -title \"Statement Retrieved\" ");
    }
}
Answer 0 (score: 1)
From a quick look at the save_content method in the WWW::Mechanize documentation, it may be worth trying:
$mech->save_content( $filename, binary => 1 );
The problem you describe resembles the kind of corruption you get when binary data is saved in ASCII (text) mode.
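To make the "binary mode" idea concrete, here is a minimal sketch using only core Perl. It writes a byte string through a ':raw' filehandle, so no I/O layer can translate line endings or encodings, and then compares the size on disk with the size in memory. As far as I can tell from the WWW::Mechanize documentation, binary => 1 sets up a raw filehandle in a similar way, but I have not verified that against the installed version. The save_raw helper and the stand-in payload are hypothetical names made up for this illustration; they are not part of WWW::Mechanize or of the script in the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not a WWW::Mechanize API): write $bytes to $path
# through a ':raw' layer so nothing rewrites newlines or encodings.
sub save_raw {
    my ($path, $bytes) = @_;
    open(my $fh, '>:raw', $path) or die "Cannot open $path: $!";
    print {$fh} $bytes           or die "Cannot write $path: $!";
    close($fh)                   or die "Cannot close $path: $!";
    return -s $path;    # size on disk, in bytes
}

# Stand-in payload: a PDF-style header, every possible byte value,
# and a CRLF that a text-mode layer might otherwise touch.
my $payload = "%PDF-1.4\n" . pack('C*', 0 .. 255) . "\r\n%%EOF\n";

# Temporary file so the sketch does not touch any real statement folder.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $written = save_raw($tmp_path, $payload);
print "payload bytes: ", length($payload), ", bytes on disk: $written\n";
```

If the two numbers match here but a real download still differs from the browser's copy, the extra bytes are probably being introduced before the write step, which would be worth checking separately.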