关于此Target-Url德国学校数据库http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=1750
这个目标网址有一个网格(或表格) - 我有一个运行良好的代码 - 但它无法处理网格......
它会像以下那样吐出数据......
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
这很糟糕 - 我需要一个分隔符 - 我怎么能在结果中得到一些分隔符 - 或者 逗号或标签....?
这里是代码:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
btw - 请参阅分隔符的一个示例......:
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
注意:这篇文章来自用户,它给了我一些关于这个帖子的一些很好的提示。看到 HTML::TableExtract: how to run the right argument [see live example]
嗯 - 我想将这些想法迁移到上面提到的蜘蛛和解析器中。这有可能。你可以给我一些提示,如何添加一些代码行 - 这样结果就会在csv-formate中吐出......!
任何和所有帮助将不胜感激..
零
更新:根据muu的想法太短了......:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Text::CSV
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
$csv->combine(@$row);
print OUTFILE $csv->string();
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup {
for ( @_ ) {
s/\s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
更新:我做了一些试验:
马丁@ SUSE Linux的:〜/ perl的&GT; perl的 perl_bayern_newstack.pl无与伦比的(in 正则表达式;标记为&lt; - HERE in m /^.*?(Treffer)(d +)( - )(d +)(&lt ;- 这里
这里有什么问题......?
答案 0 :(得分:2)
您应该使用Text::CSV来构建输出,这样可以省去尝试手动处理嵌入式引号和分隔符的麻烦。因此,使用您想要的选项创建Text::CSV
的实例并替换它:
print OUTFILE "@$row\n";
与
$csv->combine(@$row);
print OUTFILE $csv->string();
$csv
是Text::CSV
个实例。
此外,您的cleanup
功能有点破碎,它应如下所示:
sub cleanup {
for ( @_ ) {
s/\s+/ /g;
}
}
当您粘贴代码时,丢失的\
可能只是一个错字。但
不要使用原型,他们不会做你认为他们做的事情。你说的地方:
sub cleanup() {
只是说
sub cleanup {
并且不要使用前导符号调用函数(即不要&workDir();
,只需要workDir();
),除非你知道它的副作用是什么。