I've noticed a delay of about 120 seconds between the response_data and response_done events in WWW::Mechanize when fetching a given https site. I checked with an ordinary web browser and don't experience this slowness, so I suspect I must be doing something wrong. Here is what I did to trace the events (for some reason use LWP::Debug qw(+) did nothing):
use WWW::Mechanize;
use Time::HiRes qw(gettimeofday);
use IO::Handle;

my $mech = WWW::Mechanize->new(
    timeout     => 3,
    autocheck   => 1,  # check success of each query
    stack_depth => 0,  # no keeping history
    keep_alive  => 50, # connection pool
);
$mech->agent_alias('Windows IE 6');

open my $debugfile, '>', 'traffic.txt';
$debugfile->autoflush(1);

$mech->add_handler( request_send => sub {
    my $cur_time = gettimeofday();
    my $req      = shift;
    print $debugfile "\n$cur_time === BEGIN HTTP REQUEST ===\n";
    print $debugfile $req->dump();
    print $debugfile "\n$cur_time === END HTTP REQUEST ===\n";
    return;
} );

$mech->add_handler( response_header => sub {
    my $cur_time = gettimeofday();
    my $res      = shift;
    print $debugfile "\n$cur_time === GOT RESPONSE HDRS ===\n";
    print $debugfile $res->dump();
    return;
} );

$mech->add_handler( response_data => sub {
    my $cur_time       = gettimeofday();
    my $res            = shift;
    my $content_length = length( $res->content );
    print $debugfile "$cur_time === Got response data chunk resp size = $content_length ===\n";
    return;
} );

$mech->add_handler( response_done => sub {
    my $cur_time = gettimeofday();
    my $res      = shift;
    print $debugfile "\n$cur_time === BEGIN HTTP RESPONSE ===\n";
    print $debugfile $res->dump();
    print $debugfile "\n=== END HTTP RESPONSE ===\n";
    return;
} );
Here is an excerpt of the trace (URLs and cookies obfuscated):
1347463214.24724 === BEGIN HTTP REQUEST ===
GET https://...
Accept-Encoding: gzip
Referer: https://...
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Cookie: ...
Cookie2: $Version="1"
(no content)
1347463214.24724 === END HTTP REQUEST ===
1347463216.13134 === GOT RESPONSE HDRS ===
HTTP/1.1 200 OK
Date: Wed, 12 Sep 2012 15:20:08 GMT
Accept-Ranges: bytes
...
Server: Lotus-Domino
Content-Length: 377806
Content-Type: application/octet-stream
Last-Modified: Fri, 07 Sep 2012 06:25:33 GMT
Client-Peer: ...
Client-Response-Num: 1
Client-SSL-Cert-Issuer: ...
Client-SSL-Cert-Subject: ...
Client-SSL-Cipher: DES-CBC3-SHA
Client-SSL-Socket-Class: IO::Socket::SSL
(no content)
1347463216.48305 === Got response data chunk resp size = 4096 ===
1347463337.98131 === BEGIN HTTP RESPONSE ===
HTTP/1.1 200 OK
Date: Wed, 12 Sep 2012 15:20:08 GMT
Accept-Ranges: bytes
...
Server: Lotus-Domino
Content-Length: 377806
Content-Type: application/octet-stream
Last-Modified: Fri, 07 Sep 2012 06:25:33 GMT
Client-Date: Wed, 12 Sep 2012 15:22:17 GMT
Client-Peer: ...
Client-Response-Num: 1
Client-SSL-Cert-Issuer: ...
Client-SSL-Cert-Subject: ...
Client-SSL-Cipher: DES-CBC3-SHA
Client-SSL-Socket-Class: IO::Socket::SSL
PK\3\4\24\0\6\0\10\0\0\0!\0\x88\xBC\21Xi\2\0\0\x84\22\0\0\23\0\10\2[Content_Types].xml \xA2...
(+ 377294 more bytes not shown)
=== END HTTP RESPONSE ===
You can see the 121.5-second gap between the 'Got response data chunk' and 'BEGIN HTTP RESPONSE' messages. It feels as if LWP::UserAgent sometimes hangs for two minutes after having received the whole data. Any clue where this could come from?

Edit: here is a Wireshark screenshot: I receive the FIN/ACK message after 120 seconds...

Thanks
Answer 0 (score: 3)
I think your transaction may actually be taking that long. The LWP::UserAgent documentation says this about the response_data handler:

    [The handler] needs to return a TRUE value to be called again for subsequent chunks of the same request.

So, because your handler returns nothing, you are tracing only the first data packet that comes back.

According to your output, the first 4KB of data arrive within 2.2 seconds, or about 2KB per second. The entire body is 369KB long, so you can expect about 92 more packets, and at 2KB per second the transfer would take around three minutes. You are getting your response within two minutes, so I think your timings are reasonable.
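To illustrate the point, here is a minimal sketch (the URL is a placeholder, not from the question) of a response_data handler that returns a true value, so LWP keeps invoking it for every subsequent chunk rather than only the first one:

```perl
use strict;
use warnings;
use WWW::Mechanize;
use Time::HiRes qw(gettimeofday);

my $mech = WWW::Mechanize->new( timeout => 3 );

$mech->add_handler( response_data => sub {
    # Handler arguments per the LWP::UserAgent docs:
    my ( $res, $ua, $handler, $data ) = @_;
    printf STDERR "%.5f === chunk received, %d bytes buffered so far ===\n",
        scalar gettimeofday(), length $res->content;
    return 1;    # TRUE, so this handler fires again for the next chunk
} );

$mech->get('https://example.com/');
```

With the return 1 in place, one trace line should appear per received chunk, which makes the actual transfer rate visible instead of hiding everything after the first packet.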
Answer 1 (score: 3)
Thanks to Borodin's answer I found the solution to my problem: I modified the response_data event handler this way:
if ( $res->header('Content-Length') == length( $res->content ) ) {
    die "OK"; # Got whole data, not waiting for server to end the communication channel.
}
return 1; # In other cases make sure the handler is called for subsequent chunks
Then, in the caller, I ignore the error if the X-Died header equals OK.
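On the calling side this could look like the following sketch (the URL is hypothetical, and the $mech object from the question is assumed to be in scope). When a handler dies, LWP catches the exception and records its message in the X-Died response header, so the caller can distinguish the deliberate "OK" abort from a real failure:

```perl
my $res  = $mech->get('https://example.com/download');
my $died = $res->header('X-Died');

# X-Died carries the die() message plus "at ... line ..." context,
# so match only its beginning.
if ( defined $died && $died !~ /^OK\b/ ) {
    die "Request aborted in handler: $died";    # a genuine handler error
}
# Otherwise: either no handler died, or it died with "OK",
# meaning the full body was already received.
```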
Answer 2 (score: 2)
I know this is old now, but I recently had the same problem. It occurred only when the size of the unencrypted HTTPS response, including the headers, was exactly 1024 bytes. Benoit appeared to have a 4096-byte response, so multiples of 1024 may be significant. I had no control of the server, so I could neither produce test responses of arbitrary length nor reproduce the problem on any other server. The occurrence at 1024 bytes was repeatable, though.
Looking around the LWP code (v6.05), I found that sysread is asked to read 1024 bytes at a time. So the first call returns all 1024 bytes. It is then immediately called a second time, and rather than returning 0 to indicate that there is no more data, it returns undef to indicate an error, with errno set to EAGAIN, meaning that more data is expected but not yet available. That leads to a select on the socket, which hangs because no more data is coming. It takes 120 seconds to time out, after which the data we already have is returned, and that happens to be the correct result. So there is no error, just a very long delay.
I wasn't well placed to use Benoit's solution. Instead, my workaround was to extend the HTTPS handling code to check for the situation above and return 0 instead of undef:
package LWP::Protocol::https::Socket;

sub sysread {
    my $self   = shift;
    my $result = $self->SUPER::sysread(@_);

    # If we get undef back then some error occurred. If it's EAGAIN
    # then that ought to mean that there is more data to read but
    # it's not available yet. We suspect the error may be false.
    # $_[2] is the offset, so if it's defined and non-zero we have
    # some data in the buffer.
    # $_[0] is the buffer, so check it for an entire HTTP response,
    # including the headers and the body. If the length specified
    # by Content-Length is exactly the length of the body we have in
    # the buffer, then take that as being complete and return a length
    # here instead. Since it's unlikely that anything was read, the
    # buffer will not have increased in size and the result will be zero
    # (which was the expected result anyway).
    if (   !defined($result)
        && $!{EAGAIN}
        && $_[2]
        && $_[0] =~ /^HTTP\/\d+\.\d+\s+\d+\s+.*\s+content-length\s*:\s*(\d+).*?\r?\n\r?\n(.*)$/si
        && length($2) == $1 )
    {
        return length($_[0]) - $_[2];    # bufferlen - offset
    }
    return $result;
}
Answer 3 (score: 1)
Alan, I got the same behaviour on my system, for content lengths of 1024, 2048, 3072 bytes, and so on.

The fix for this issue is to upgrade Net::HTTP to version 6.09 or later.
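A quick way to check whether the installed Net::HTTP already carries the fix (this one-off script is a sketch; the 6.09 threshold is the version named above):

```perl
use strict;
use warnings;
use Net::HTTP;

# Net::HTTP exposes its version in the package variable $VERSION.
print "Net::HTTP version: $Net::HTTP::VERSION\n";
print $Net::HTTP::VERSION >= 6.09
    ? "OK: recent enough for the 1024-byte-boundary fix\n"
    : "Too old: consider upgrading, e.g. with 'cpan Net::HTTP'\n";
```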