我有一个运行的LWP :: UserAgent,应该应用于以下网址:
http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503
这与许多类似的目标一起运行,请看以下结尾:
html?show_school=5503
html?show_school=9002
html?show_school=5512
我想使用LWP :: UserAgent:
来做到这一点for my $i (0..10000)
{ $ua->get(' [here the URL should be applied] ', id => 21, extern_uid => $i);
# process reply }
在任何情况下,使用这样的循环来完成这种工作是一种方法。我想LWP的API并不是要取代核心Perl的功能,我可以使用Perl循环来查询多个URL。
由于必须应用循环而无法运行的代码:
#use strict;
use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;
# first get a list of all schools
my ($url = '[here the url should be applied] =',id);
for my $id (0..10000) {
$ua->get(' [here the url should be applied ] ', id => 21, extern_uid => $i);
# process reply
}
#my $request = POST $url,
# [
# Schulsuche=> "Ergebnisse anzeigen",
# order => "schule_ort",
# schulname => undef,
# schulort => undef,
# typid => "11",
# verbinder => "AND"
# ];
my $ua = LWP::UserAgent->new;
print "getting all schools - this could take some time\n";
my $response = $ua->request($request);
# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";
# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];
# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;
my $sth = $dbh->prepare(<<sqlend);
insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
values ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend
# now loop over ids
for my $id (@ids_to_do) {
# get detail information for id
my $res = $ua->get("[url]=> &gid=$id");
# parse the response
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($res->content);
my $xpath = q|//div[@id='MCinhview']//div[@class='contentitem']//table|;
my ($adress_table, $tel_table) = $tree->findnodes($xpath);
my ($adr) = $adress_table->find("td");
my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];
my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");
$sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
$dbh->commit;
$tree->delete;
print "$name done\n";
}
星期日25日星期日更新:我已经应用了OmnipotentEntity的建议。
#!/usr/bin/perl -W
use strict;
use warnings; # give out some warnings if something does not run well
use diagnostics; # tell me when something is wrong
use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;
# first get a list of all schools
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");
#pretending to be firefox on linux.
for my $i (0..10000) {
my $request = HTTP::Request->new(GET => sprintf(" here to put the URL into =%d", $i));
$request->header('Accept' => 'text/html');
my $response = $ua->request($request);
if ($response->is_success) {
$pagecontent = $response -> content;
}
# now we can do whatever with the $pagecontent
}
my $request = POST $url,
[
order => "schule_ort",
schulname => undef,
Basisdaten => undef,
Profil => undef,
Schulort => undef,
typid => "11",
Fax =>
Homepage => undef,
verbinder => "AND"
];
print "getting all schools - this could take some time\n";
my $response = $ua->request($request);
# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";
# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];
# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;
my $sth = $dbh->prepare(<<sqlend);
insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
values ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend
# now loop over ids
for my $id (@ids_to_do) {
# get detail information for id
my $res = $ua->get(" here to put the URL into => &gid=$id");
# parse the response
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($res->content);
my $xpath = q|//div[@id='MCinhview']//div[@class='floatbox']//table|;
my ($adress_table, $tel_table) = $tree->findnodes($xpath);
my ($adr) = $adress_table->find("td");
my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];
my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");
$sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
$dbh->commit;
$tree->delete;
print "$name done\n";
}
我想循环结果,因此我尝试应用相应的URL,但是我遇到了一堆错误:
suse-linux:/usr/perl # perl perl_mecha_example_two.pl Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24. Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29. Execution of perl_mecha_example_two.pl aborted due to compilation errors (#1) (F) You've said "use strict" or "use strict vars", which indicates that all variables must either be lexically scoped (using "my" or "state"), declared beforehand using "our", or explicitly qualified to say which package the global variable is in (using "::"). Uncaught exception from user code: Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24. Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29. Execution of perl_mecha_example_two.pl aborted due to compilation errors. at perl_mecha_example_two.pl line 86
现在是调试部分。我该怎么改变?如何以正确的方式应用网址?
当我使用strict时,在声明之前我不允许使用变量。通常的解决方法是预先my
,例如第一次出现时my $url
和my $pagecontent
。
答案 0 :(得分:3)
这很简单:
#!/usr/bin/perl -W
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); #pretending to be firefox on linux.
for my $i (0..10000) {
my $req = HTTP::Request->new(GET => sprintf("http://path/to/url?=%d", $i));
$req->header('Accept' => 'text/html');
my $res = $ua->request($req);
if ($res->is_success) {
$pagecontent = $res -> content;
}
# Do whatever with the $pagecontent
}
这假设您要获取所有10000页。如果你只想获取特定的那些,那么你应该尝试将这些数字放在一个数组中,然后用于散步该数组,而不是1..10000