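I want to crawl the site http://docbao.com.vn/ with wget, but wget always reports:
HTTP request sent, awaiting response... No data received. Retrying.
For example, when I fetch the category page http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec, the result is: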
congnh@congnh-pc:~/Source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" -O -
--2013-02-20 23:53:16-- http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:17-- (try: 2) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:19-- (try: 3) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:22-- (try: 4) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:27-- (try: 5) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:32-- (try: 6) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:38-- (try: 7) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:45-- (try: 8) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2013-02-20 23:53:53-- (try: 9) http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
...
Why does wget keep retrying, seemingly without limit? What is the actual problem?
Thanks,
Cong
Answer 0 (score: 0)
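Sorry for stating the obvious, but wget retries because it receives no data. It sends the HTTP request headers, and the remote host then closes the connection immediately without sending a response. My guess is that this non-standard behavior comes from the server-side configuration, either a misconfiguration or something deliberate.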
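After some experimenting, I found that the content is served once you signal that you can handle gzip-encoded responses. You can do this by adding --header="accept-encoding: gzip" to the wget command. For crawling with wget this creates a new problem, because wget cannot recurse into gzip-compressed content: it saves the compressed bytes as-is and never sees the links inside. You will need to write a script that handles this case, or use another tool that can deal with such content.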
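As a rough sketch of what such a script might look like, the shell functions below fetch a page with the gzip header, decompress it, and list the links it contains so they can be fetched in a follow-up pass. The function names fetch_gzip and extract_links are hypothetical, the code assumes the server really does return a gzip-compressed body, and the link extraction is deliberately naive (it only matches double-quoted href attributes).

```shell
#!/bin/sh
# Hypothetical helper: fetch one URL while advertising gzip support,
# then decompress the body to stdout.
# Assumption: the response body is gzip-encoded.
fetch_gzip() {
    wget -q --header="accept-encoding: gzip" -O - "$1" | gunzip -c
}

# Hypothetical helper: read HTML on stdin and print the value of every
# double-quoted href attribute, one per line, with duplicates removed.
extract_links() {
    grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//' | sort -u
}

# Usage (needs network access):
#   fetch_gzip "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" | extract_links
```

The extracted values may be relative paths, so a real crawler would still have to resolve them against the page URL and keep track of pages it has already visited.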
Side note: be aware that not all sites allow their content to be crawled. Check their terms of service before you do this.