Cannot download with wget (wget retries "indefinitely")

Time: 2013-02-20 17:01:15

Tags: wget web-crawler

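I want to crawl the site http://docbao.com.vn/ with wget, but wget always prints the message

HTTP request sent, awaiting response... No data received. Retrying.

For example, when I fetch the category page http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec, which lists all the pages in that category, the result is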

congnh@congnh-pc:~/Source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" -O -
--2013-02-20 23:53:16--  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:17--  (try: 2)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:19--  (try: 3)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:22--  (try: 4)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:27--  (try: 5)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:32--  (try: 6)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:38--  (try: 7)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:45--  (try: 8)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:53--  (try: 9)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
...

Why does wget retry "indefinitely"? Or what is the problem? Thanks.

1 answer:

Answer 0: (score: 0)

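Sorry to state the obvious, but: wget retries because it receives no data. It sends the HTTP headers, and the remote host then closes the connection immediately. My guess is that this non-standard behavior comes from a server-side misconfiguration, possibly a deliberate one.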
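The retry behavior in the log can be sketched in a few lines. wget's manual documents a default of 20 tries with a linear backoff between them (`--tries` and `--waitretry`), so the retries are bounded rather than truly unlimited. The function below is only an illustration of that idea, not wget's actual code: `fetch_with_retries`, its parameters, and the injected `fetch` callable are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[], bytes],
    tries: int = 5,
    max_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns data, giving up after `tries` attempts.

    An empty body or an OSError (which covers dropped connections and
    urllib's URLError) counts as a failed attempt. The wait grows by one
    second per failure, capped at max_wait (linear backoff).
    """
    last_error: Exception = RuntimeError("no data received")
    for attempt in range(1, tries + 1):
        try:
            body = fetch()
            if body:
                return body
            last_error = RuntimeError("no data received")
        except OSError as exc:
            last_error = exc
        if attempt < tries:
            sleep(min(attempt, max_wait))
    raise RuntimeError(f"giving up after {tries} tries") from last_error
```

With wget itself, the same bound is set on the command line, for example `--tries=3`.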

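After some tinkering, I found that the content is served as soon as you signal that you can handle gzip-encoded responses. You can do this by adding --header="accept-encoding: gzip" to the wget command. For crawling with wget this creates a new problem, because wget cannot recurse into gzip-compressed content. You need to write a script that handles this case, or use another tool that can process such content.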
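A minimal sketch of such a script, using only the Python standard library, might look like the following. It sends the same Accept-Encoding header, decompresses the body when the server marks it as gzip, and collects the links that a crawler would follow next. The helper names (`decode_body`, `extract_links`, `fetch`, `LinkParser`) are made up for this example, and the UTF-8 default charset is an assumption; whether the server still responds this way has not been verified.

```python
import gzip
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def decode_body(raw, content_encoding):
    """Decompress the body if the Content-Encoding header says gzip."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def extract_links(html_bytes, base_url, charset="utf-8"):
    """Return all link targets on the page, resolved against base_url."""
    parser = LinkParser(base_url)
    parser.feed(html_bytes.decode(charset, errors="replace"))
    return parser.links


def fetch(url, timeout=30):
    """Request url while announcing gzip support, and return the decoded body."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        return decode_body(raw, response.headers.get("Content-Encoding"))
```

Calling `extract_links(fetch(url), url)` on the category URL from the question would give the list of pages to visit next; a real crawler would also need to track visited URLs and limit its depth.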

Side note: be aware that not all websites allow their content to be crawled. Check their terms of service before you do.