I am trying to fetch pages from this URL using httplib2 and urllib2, but neither library works; every attempt ends with this exception.
content = conn.request(uri="http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1129, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 901, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 871, in _conn_request
response = conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1027, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
The HTTP headers look like this:
http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902
GET /news/news_print.asp?artice_id=20110727092902 HTTP/1.1
Host: www.zdnet.co.kr
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ko-kr,ko;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: RMID=7d83495d4f336fe0; __utma=37206251.1552605885.1328771258.1328771258.1329070845.2; __utmz=37206251.1328771258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASPSESSIONIDCSQCQTDD=BCLEHPPDEPHEBJDLCFNDMKDN; __utmc=37206251; ASPSESSIONIDSSQCQQCB=MJPLMOJAFPDFCLONCANBIKHN; _EXEN=2
X-FireLogger: 1.2
HTTP/1.1 200 OK
Date: Mon, 13 Feb 2012 18:02:56 GMT
Content-Length: 19158
Content-Type: text/html;charset=UTF-8; Charset=UTF-8
Set-Cookie: ASPSESSIONIDSQSDQRDB=NGAIFHKAGDIOGEMANAOLLKKF; path=/
Cache-Control: private
Any clues?
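Answer 0 (score: 4)
For everyone who ends up here with a similar problem after installing httplib2 0.8:
Version 0.8 has a regression in connection handling related to HTTP keep-alive. See the bug report: https://code.google.com/p/httplib2/issues/detail?id=250
A fix for this issue exists, but it has not been released yet. Until then, just use httplib2 0.7.7.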
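A minimal sketch of how one might check whether the installed httplib2 is the release named above. The helper name `is_affected_version` is hypothetical, and the check assumes a plain dotted numeric version string; it only compares against 0.8 and makes no claim about other releases.

```python
def is_affected_version(version_string):
    """Return True only when the version string is exactly 0.8 (or 0.8.0).

    Hypothetical helper: assumes a purely numeric dotted version such as
    "0.7.7" or "0.8"; anything else is reported as not matching.
    """
    try:
        parts = [int(piece) for piece in version_string.strip().split(".")]
    except ValueError:
        return False
    # Drop trailing zeros so "0.8" and "0.8.0" compare equal.
    while parts and parts[-1] == 0:
        parts.pop()
    return parts == [0, 8]


if __name__ == "__main__":
    try:
        import httplib2
    except ImportError:
        print("httplib2 is not installed")
    else:
        if is_affected_version(httplib2.__version__):
            print("httplib2 0.8 detected; consider using 0.7.7 instead")
        else:
            print("httplib2 %s is not the 0.8 release" % httplib2.__version__)
```

Running the file directly prints a hint based on `httplib2.__version__`; the helper can also be called on any version string.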
Answer 1 (score: 3)
This works fine for me:
import urllib2
opener = urllib2.build_opener()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}
opener.addheaders = headers.items()
response = opener.open("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
print response.headers
print response.read()
The site drops every request that does not carry a User-Agent string.
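The same idea can be wrapped in a small helper that always sends a browser-like User-Agent and treats a missing status line as a failed fetch. This is only a sketch: the names `build_request`, `fetch`, and `BROWSER_USER_AGENT` are made up for illustration, and the import fallback is an assumption so that the code runs on both Python 2 (as used in this thread) and Python 3.

```python
try:  # Python 2, as used in this thread
    import urllib2 as urlrequest
    from httplib import BadStatusLine
except ImportError:  # Python 3 equivalents
    import urllib.request as urlrequest
    from http.client import BadStatusLine

# Illustrative default copied from the request headers in the question.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_USER_AGENT, timeout=10):
    """Return the response body, or None if the server sends no status line."""
    try:
        response = urlrequest.urlopen(build_request(url, user_agent),
                                      timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except BadStatusLine:
        # The server closed the connection without a valid HTTP status line.
        return None
```

Calling `fetch("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")` would then return the page body, or `None` if the request is still dropped.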
Answer 2 (score: 2)
In my code, when I use
from urllib2 import urlopen
content = urlopen(page).read()
the exception is raised. However, when I use
import urllib
content = urllib.urlopen(page).read()
everything works.
Maybe this will help you.
Answer 3 (score: 1)
It looks like this page does not accept your user agent. You can change it like this:
>>> import urllib2
>>> user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>>> headers = { 'User-Agent' : user_agent }
>>> r = urllib2.Request('http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902', headers=headers)
>>> fd = urllib2.urlopen(r)
>>> fd.read()[:20]
'<!DOCTYPE html PUBLI'