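I am trying to run a Perl script (in a Windows cmd window), but it always stops working at a certain point. How can I find out why it does not continue?
Here is the script. The last thing I can see is "get_html_source()" on line 37.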
#!/usr/bin/perl
# Perl script that scrapes the members of the Hellenic Parliament
# Created by Kostas Ntonas, 03 May 2013 - http://ntonas.gr
# http://deixto.blogspot.gr/2013/05/scraping-members-of-greek-parliament.html
use strict;
use warnings;
use utf8;
use IO::File;
use POSIX qw(tmpnam);
use DEiXToBot;
use WWW::Selenium;
my $agent = DEiXToBot->new(); # create the DEiXToBot agent object
# launch a Firefox instance
my $sel = WWW::Selenium->new(
    host        => "localhost",
    port        => 4444,
    browser     => "*firefox",
    browser_url => "http://www.hellenicparliament.gr/"
);
$sel->start;
for my $i (1..30) {
    my $url = "http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia?pageNo=$i";
    $sel->open($url);
    $sel->wait_for_page_to_load(5000);
    $sel->pause(1);
    print "$i) $url\n";
    my $content = $sel->get_html_source();
    my ($fh,$name); # create a temporary file containing the page's source code
    do { $name = tmpnam() } until $fh = IO::File->new($name, O_RDWR|O_CREAT|O_EXCL);
    binmode( $fh, ':utf8' );
    print $fh $content;
    close $fh;
    $agent->get("file://$name"); # load the temporary file/page with the DEiXToBot agent using the file:// scheme
    unlink $name; # delete the temporary file, it is not needed any more
    if (! $agent->success) { die "Could not fetch the temp file!\n"; }
    $agent->build_dom();
    $agent->load_pattern('C:\Users\XXX\Documents\Privat\MyCase3\Deixto Patterns\parliament_CVs.xml');
    $agent->extract_content();
    if (! $agent->hits) {
        die "Could not find any MPs/ records!\n";
    }
    else {
        for my $record ($agent->records) {
            my @rec = @$record;
            my $party;
            my $logo = $rec[0];
            # deduce the party name from the logo in the first column of the table
            if ($logo=~m#ND_Logo#) { $party = "N.D. (New Democracy)"; }
            elsif ($logo=~m#COALITION#) { $party = "SYRIZA Unitary Social Front"; }
            elsif ($logo=~m#PASOK#) { $party = "PA.SO.K. (Panhellenic Socialist Movement)"; }
            elsif ($logo=~m#ANEKS_ELL#) { $party = "ANEXARTITOI ELLINES (Independent Hellenes)"; }
            elsif ($logo=~m#xrisi#) { $party = "LAIKOS SYNDESMOS - CHRYSI AVGI (People's Association - Golden Dawn)"; }
            elsif ($logo=~m#small#) { $party = "DHM.AR (Democratic Left)"; }
            elsif ($logo=~m#KKE#) { $party = "K.K.E. (Communist Party of Greece)"; }
            elsif ($logo=~m#INDEPENDENT#) { $party = "INDEPENDENT"; }
            else { die "$logo => Unknown logo!\n"; }
            $rec[0] = $party;
            $rec[3]=~s#\s+# #g; # replace whitespace characters with a single space
            # append the data in a tab delimited text file
            open my $fh,">>:utf8","MPs.txt";
            print $fh join("\t",@rec)."\n";
            close $fh;
        }
    }
}
$sel->stop;
Answer 0 (score: 0)
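Do you know that the code is dying in get_html_source, or is it actually dying before or after that call (for example in the call to tmpnam, which appears to be missing a semicolon)?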
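One way to narrow this down is to wrap each suspicious call so that it reports when it starts, when it finishes, and any error it dies with. The following is only a minimal sketch: trace_step is a hypothetical helper name, and the two steps use stand-in code instead of the real Selenium and tmpnam calls so that the example runs on its own.

```perl
use strict;
use warnings;

$| = 1; # unbuffered STDOUT, so earlier progress output is not stuck in a buffer

# Hypothetical helper: run one step, report before and after, and catch die()
sub trace_step {
    my ($label, $code) = @_;
    print STDERR "[start] $label\n";
    my @result = eval { $code->() };
    if ($@) {
        print STDERR "[died ] $label: $@";
        return;
    }
    print STDERR "[done ] $label\n";
    return wantarray ? @result : $result[0];
}

# Stand-ins for the real calls, used here only to show both outcomes
my $content = trace_step('get_html_source', sub { "<html></html>" });
my $name    = trace_step('tmpnam', sub { die "simulated failure\n" });
```

In the script from the question, the stand-in bodies would become the real calls, for example sub { $sel->get_html_source() }. A step that prints "[start]" but neither "[done ]" nor "[died ]" is the one that hangs.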
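Another comment: this seems to be intended only to scrape the list of MPs and their parties. If you look at the page source, there is a large chunk of base-64 encoded text that seems to hold all the data you need. So you may find it faster to load the page, decode that chunk, and get everything you need from it.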
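As a rough sketch of that idea, the core MIME::Base64 module can decode such a chunk once it has been pulled out of the page source. Everything about the page here is an assumption: the hidden field name "blob" and its payload are made up for illustration, and the real chunk may well decode to binary data that needs further parsing.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64 encode_base64);

# Hypothetical page fragment: the field name "blob" and its payload are made up
my $payload = encode_base64("Name\tParty\n", '');
my $html    = qq{<input type="hidden" name="blob" value="$payload" />};

# Pull the encoded value out of the (hypothetical) hidden field
my ($encoded) = $html =~ /name="blob"\s+value="([^"]+)"/
    or die "No encoded block found\n";

# Decode it back into the original bytes
my $decoded = decode_base64($encoded);
print $decoded;
```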
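Answer 1 (score: 0)
The tmpnam function is provided by the POSIX Perl module. It should work on most Unix/Linux variants, but it appears to be broken on Windows. I suggest replacing the "problematic" line containing the tmpnam call with the following: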
use File::Temp qw/ tempfile /;
($fh,$name) = tempfile();
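Hopefully this change fixes the problem and lets the script run to completion.
This is also what the Perl documentation for tmpnam (http://perldoc.perl.org/POSIX.html) recommends: "For security reasons, which are probably detailed in your system's documentation for the C library tmpnam() function, this interface should not be used; instead see File::Temp".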
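For context, here is a self-contained sketch of how the temporary-file part of the loop might look with File::Temp. The $content string is a stand-in for the value returned by get_html_source(), and the '.html' suffix is an assumption added for readability, not something the original script requires.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the page source returned by get_html_source()
my $content = "<html><body>\x{0391}\x{03B8}\x{03AE}\x{03BD}\x{03B1}</body></html>";

# tempfile() creates and opens the file in one step and returns (handle, path)
my ($fh, $name) = tempfile( SUFFIX => '.html' );
binmode( $fh, ':encoding(UTF-8)' );
print {$fh} $content;
close $fh or die "Could not close $name: $!";

print "Wrote ", -s $name, " bytes to $name\n";

# ... the agent would load the file here ...

# delete the temporary file once it is no longer needed
unlink $name or warn "Could not delete $name: $!";
```

Because tempfile() returns an already opened handle, the do { ... } until loop around IO::File->new is no longer needed.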