Question

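I am using Python's urllib2 to download pages from the web. I am not setting a user agent or anything like that. I am getting the sample errors below. Can someone tell me a simple way to avoid them?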

http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code:  403


http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code:  503

http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code:  500


http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason:  timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
    File "download.py", line 43, in <module>
    localFile.write(response.read())
    File "/usr/lib/python2.6/socket.py", line 327, in read
    data = self._sock.recv(rbufsize)
    File "/usr/lib/python2.6/httplib.py", line 517, in read
    return self._read_chunked(amt)
    File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
    raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)

Thanks,
Bala

Answer 1

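Many web resources require some kind of cookie or other authentication before they can be accessed, and your 403 status code is most likely the result of that.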
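As a minimal sketch of that idea, the snippet below builds an opener that keeps cookies between requests and sends an explicit User-Agent header. It is written for Python 3, where urllib2's functionality lives in urllib.request; the name make_opener and the User-Agent string are hypothetical, and there is no guarantee that this alone gets past a particular site's 403.

```python
import http.cookiejar
import urllib.request

# Hypothetical User-Agent string; substitute one that identifies your client.
USER_AGENT = "Mozilla/5.0 (compatible; example-downloader/0.1)"


def make_opener(user_agent=USER_AGENT):
    """Return an opener that stores cookies and sends a User-Agent header."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" header, which some sites reject.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# Usage (performs a real network request, so it is left commented out):
# html = opener.open("http://www.rottentomatoes.com/m/foxy_brown/", timeout=30).read()
```

Cookies set by one response are sent back on later requests made through the same opener, so reuse one opener for the whole download loop.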

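A 503 error tends to mean that you are requesting resources from the server too quickly in a loop, and that you need to wait a while before attempting another access.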
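One way to wait between attempts is sketched below, again for Python 3. The function name fetch_with_retry, its parameters, and the doubling delay are illustrative assumptions rather than anything the server prescribes; only 503 responses are retried, and any other error is raised straight away.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, open_url=urllib.request.urlopen, retries=3,
                     delay=5.0, sleep=time.sleep):
    """Fetch url, waiting and retrying when the server answers 503."""
    for attempt in range(retries + 1):
        try:
            return open_url(url).read()
        except urllib.error.HTTPError as err:
            # Only 503 is retried; give up once the retries are used up.
            if err.code != 503 or attempt == retries:
                raise
            # Double the wait after each failed attempt.
            sleep(delay * 2 ** attempt)
```

The open_url and sleep parameters exist only so that a different opener (or a stub) can be passed in; calling fetch_with_retry(url) with the defaults uses urllib.request.urlopen and time.sleep.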

The page in the 500 example doesn't even seem to exist...

The URL in the timeout example probably shouldn't have the "!!" at the end; I could only load the resource without it.
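If the "!!" is an artifact of how the URL list was produced rather than part of the address (an assumption), it can be stripped before the request is made. The helper name strip_trailing_bangs is hypothetical.

```python
def strip_trailing_bangs(url):
    """Remove any run of '!' characters from the end of url."""
    return url.rstrip("!")


cleaned = strip_trailing_bangs(
    "http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!"
)
```

Note that this would also change a URL that legitimately ends in "!", so it is only appropriate when the suffix is known to be noise.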

I suggest you read up on HTTP status codes.
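As a quick reference, Python 3's standard library can print the standard reason phrase for each of the codes in the question. The helper name describe_status is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a standard HTTP status code."""
    status = HTTPStatus(code)
    return "%d %s" % (status.value, status.phrase)


for code in (403, 500, 503):
    print(describe_status(code))
# 403 Forbidden
# 500 Internal Server Error
# 503 Service Unavailable
```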

Answer 2

For more complex tasks, you may want to consider using mechanize, twill, or even Selenium or Windmill, which support more elaborate scenarios, including cookies and JavaScript.

For arbitrary websites, using only urllib2 can be tricky (signed cookies, anyone?).

Python urllib2: how to avoid errors - need help
