I have a problem that I hope you can help with.
I have two text files with the following contents:
FILE1.TXT
http://www.dog.com/
http://www.cat.com/
http://www.antelope.com/
FILE2.TXT
1
2
Barry
The output below is what I have managed to produce correctly:
http://www.dog.com/1
http://www.dog.com/2
http://www.dog.com/Barry
http://www.cat.com/1
http://www.cat.com/2
http://www.cat.com/Barry
http://www.antelope.com/1
http://www.antelope.com/2
http://www.antelope.com/Barry
The code that does this:
open my $animalUrls, '<', 'FILE1.txt' or die "Can't open: $!";
open my $directory, '<', 'FILE2.txt' or die "Can't open: $!";
my @directory = <$directory>; #each line of the file into an array
close $directory or die "Can't close: $!";
my @newListOfUrls;
while (my $line = <$animalUrls>) {
    chomp $line;
    print $line.$_ foreach (@directory);
    push (@newListOfUrls, $line.$_) foreach (@directory); #put each new url into array
}
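Now here is where I am stuck:
I need to get the content length of each original URL (File1.txt) and compare the content length of every new URL with that of its corresponding original URL, to see whether they are the same or different.
The code to get the content length: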
print $mech->response->header('Content-Length'); #returns the content length
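The problem is how to compare each new URL with the correct original URL (that is, without accidentally comparing the content length of http://www.cat.com/Barry with the content length of http://www.dog.com/). Should I use a hash? How would I go about it?
Many thanks for your help with this.
Answer 0 (score: 3)
You should use a hash. I would change your input code to build a more complex data structure, because that makes the task easier.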
open my $animalUrls, '<', 'FILE1.txt' or die "Can't open: $!";
open my $directory, '<', 'FILE2.txt' or die "Can't open: $!";
my @directory = <$directory>; #each line of the file into an array
close $directory or die "Can't close: $!";
my $newURLs;
while ( my $baseURL = <$animalUrls> ) {
    chomp $baseURL;
    SUBDIR: foreach my $subdir (@directory) {
        chomp $subdir;
        next SUBDIR if $subdir eq "";
        # put each new url into arrayref
        push( @{ $newURLs->{$baseURL} }, $baseURL . $subdir );
    }
}
We can now take advantage of this. Assuming we have already set up Mechanize:
foreach my $url ( keys %{$newURLs} ) {
    # first get the base URL and save its content length
    $mech->get($url);
    my $content_length = $mech->response->header('Content-Length');
    # now iterate all the 'child' URLs
    foreach my $child_url ( @{ $newURLs->{$url} } ) {
        # get the content
        $mech->get($child_url);
        # compare
        if ( $mech->response->header('Content-Length') != $content_length ) {
            print "$child_url: different content length: $content_length vs "
                . $mech->response->header('Content-Length') . "!\n";
        }
    }
}
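A response does not always carry a Content-Length header (chunked responses omit it, for example). In that case header() returns undef, and the numeric != comparison above warns under `use warnings`. Here is a minimal sketch of an undef-safe comparison; the helper name lengths_differ is hypothetical and not part of WWW::Mechanize:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the two lengths differ, treating a
# missing (undef) Content-Length as a distinct value of its own.
sub lengths_differ {
    my ( $base_length, $child_length ) = @_;

    # both missing: nothing to tell them apart
    return 0 if !defined $base_length && !defined $child_length;

    # exactly one missing: count that as a difference
    return 1 if !defined $base_length || !defined $child_length;

    return $base_length != $child_length ? 1 : 0;
}

print lengths_differ( 56244, 56249 ) ? "different\n" : "same\n";
print lengths_differ( undef, 156 )   ? "different\n" : "same\n";
```

In the loop above, the condition could then be written as lengths_differ( $content_length, $mech->response->header('Content-Length') ).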
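By moving the fetch-and-compare code into the place where the data structure is built, you could even do this without the second set of foreach loops.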
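A minimal sketch of that single-pass variant follows. To keep it runnable without network access, the length lookup is passed in as a code reference. The name compare_children and the sample lengths are hypothetical; in real use the callback would wrap $mech->get and the Content-Length header lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: walk base URLs and sub-directories once, fetching
# each length through the supplied callback and comparing on the spot.
sub compare_children {
    my ( $bases, $subdirs, $length_of ) = @_;
    my @different;

    foreach my $base_url ( @{$bases} ) {
        my $base_length = $length_of->($base_url);

        foreach my $subdir ( @{$subdirs} ) {
            next if $subdir eq "";
            my $child_url    = $base_url . $subdir;
            my $child_length = $length_of->($child_url);

            # "// -1" maps a missing length to a sentinel (an assumption)
            push @different, $child_url
                if ( $child_length // -1 ) != ( $base_length // -1 );
        }
    }
    return @different;
}

# Made-up lengths standing in for real HTTP responses
my %fake = (
    'http://www.dog.com/'      => 100,
    'http://www.dog.com/1'     => 100,
    'http://www.dog.com/Barry' => 250,
);
my @diff = compare_children(
    ['http://www.dog.com/'], [ '1', 'Barry' ], sub { $fake{ $_[0] } },
);
print "$_\n" for @diff;
```

Because the base length is fetched just before its children, a child can never be compared against the wrong base URL.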
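If you are not familiar with references, have a look at perlreftut. What we are doing here is creating a hash with one key per base URL and storing an array of all the generated child URLs under it. If you dump the final $newURLs with Data::Dumper, it will look like this: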
$VAR1 = {
    'http://www.dog.com/' => [
        'http://www.dog.com/1',
        'http://www.dog.com/2',
    ],
    'http://www.cat.com/' => [
        'http://www.cat.com/1',
        'http://www.cat.com/2',
    ],
};
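For reference, a dump like the one above can be produced with a few lines. This sketch builds a small structure by hand rather than from the files. Setting $Data::Dumper::Sortkeys is optional and only makes the key order stable:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # sort hash keys for repeatable output

my $newURLs;
foreach my $baseURL ( 'http://www.dog.com/', 'http://www.cat.com/' ) {
    foreach my $subdir ( '1', '2' ) {
        # autovivification creates the array reference on first push
        push @{ $newURLs->{$baseURL} }, $baseURL . $subdir;
    }
}

print Dumper($newURLs);
```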
Edit: I have updated the code. I used these files to test it:
URLS:
http://www.stackoverflow.com/
http://www.superuser.com/
Directories:
faq
questions
/
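Answer 1 (score: 1)
This code seems to do what you need. It stores all the URLs in @urls and prints the content length as it fetches each one. I don't know what you need the length data for afterwards, but I have stored the length of each response in the hash %lengths to associate it with its URL.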
use 5.010;
use strict;
use warnings;
use LWP::UserAgent;
STDOUT->autoflush;
my @urls;
open my $fh, '<', 'FILE1.txt' or die $!;
while (my $base = <$fh>) {
    chomp $base;
    push @urls, $base;
    open my $fh, '<', 'FILE2.txt' or die $!;
    while (my $path = <$fh>) {
        chomp $path;
        push @urls, $base.$path;
    }
}
my $ua = LWP::UserAgent->new;
my %lengths;
for my $url (@urls) {
    my $resp = $ua->get($url);
    my $length = $resp->header('Content-Length');
    $lengths{$url} = $length;
    printf "%s -- %s\n", $url, $length // 'undef';
}
Output
http://www.dog.com/ -- undef
http://www.dog.com/1 -- 56244
http://www.dog.com/2 -- 56244
http://www.dog.com/Barry -- 56249
http://www.cat.com/ -- 156
http://www.cat.com/1 -- 11088
http://www.cat.com/2 -- 11088
http://www.cat.com/Barry -- 11088
http://www.antelope.com/ -- undef
http://www.antelope.com/1 -- undef
http://www.antelope.com/2 -- undef
http://www.antelope.com/Barry -- undef
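To answer the original comparison question with this approach, each generated URL also needs to remember its base URL. Here is a minimal sketch, assuming a second hash %base_of filled while the URLs are built (a hypothetical addition, not part of the code above). The lengths are copied from the sample output rather than fetched:

```perl
use strict;
use warnings;

# Lengths copied from the sample output above (a subset)
my %lengths = (
    'http://www.cat.com/'      => 156,
    'http://www.cat.com/1'     => 11088,
    'http://www.dog.com/'      => undef,
    'http://www.dog.com/Barry' => 56249,
);

# Hypothetical lookup: generated URL => the base URL it was built from
my %base_of = (
    'http://www.cat.com/1'     => 'http://www.cat.com/',
    'http://www.dog.com/Barry' => 'http://www.dog.com/',
);

my @report;
for my $url ( sort keys %base_of ) {
    my $base = $base_of{$url};
    my ( $got, $want ) = map { $_ // 'undef' } @lengths{ $url, $base };

    # string comparison so that a missing length does not warn
    my $verdict = $got eq $want ? 'same' : 'different';
    push @report, "$url ($got) vs $base ($want): $verdict";
}
print "$_\n" for @report;
```

In the fetching code, %base_of could be filled with $base_of{ $base . $path } = $base next to the push into @urls.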