I'm using Python with urllib2 to fetch a web page, but the read() method never returns.
Here is the code I'm using:
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=headers)
f_webpage = urllib2.urlopen(request)
html = f_webpage.read() # <- does not return
The last time I ran the script was about a month ago, and it worked fine then.
Note that the same script works for pages in other categories on the Edmonton Craigslist site, for example http://edmonton.en.craigslist.ca/act/ or http://edmonton.en.craigslist.ca/eve/.
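One common reason read() never returns is that the server accepts the connection but then stalls mid-response. A general workaround (not something the question itself mentions) is to set a socket timeout, which turns an indefinite hang into a catchable exception; urllib2.urlopen also accepts a per-call timeout argument since Python 2.6:

```python
import socket

# Make every blocking socket operation give up after 10 seconds,
# so a stalled read() raises socket.timeout instead of hanging forever.
socket.setdefaulttimeout(10)

# Equivalently, per call: urllib2.urlopen(request, timeout=10)
```

This does not fix whatever the server is doing, but it lets the script fail fast and report an error instead of blocking forever.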
Answer 0 (score: 1)
As requested in the comments :)

Install requests:

$ pip install requests

Then use requests as follows:
>>> import requests
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = requests.get(url, headers=headers)
>>> request.ok
True
>>> request.text # content in string, similar to .read() in question
...
...
Disclaimer: technically this isn't an answer to the OP's question, but it does solve the OP's problem, since urllib2 has known issues and the requests library was created to address exactly this kind of problem.
Answer 1 (score: 0)
It returns (or, more precisely, errors out) just fine for me:
>>> import urllib2
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = urllib2.Request(url, headers=headers)
>>> f_webpage = urllib2.urlopen(request)
>>> html = f_webpage.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
Craigslist has probably detected that you are a scraper and is refusing to serve you the actual page.
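If the resets are only intermittent, a small retry helper can keep the script alive (this is my own sketch, not part of the original answer; if Craigslist has blocked you outright, retrying will not help):

```python
import time

def retry(func, attempts=3, delay=1.0, exceptions=(IOError,)):
    """Call func(), retrying with a fixed delay on the given exceptions."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay)

# Demo with a stand-in function that fails twice, then succeeds:
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError("Connection reset by peer")
    return "page contents"

print(retry(flaky, attempts=3, delay=0.01))  # prints "page contents"
```

In the original script this would wrap the urlopen/read pair, e.g. retry(lambda: urllib2.urlopen(request).read()).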
Answer 2 (score: 0)
I ran into a similar problem. Part of the error message:
  File "C:\Python27\lib\socket.py", line 380, in read
    data = self._sock.recv(left)
  File "C:\Python27\lib\httplib.py", line 573, in read
    s = self.fp.read(amt)
  File "C:\Python27\lib\socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 10054]
I solved it by reading the buffer in small chunks instead of reading everything at once:
def readBuf(fsrc, length=16*1024):
    result = ''
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        else:
            result += buf
    return result
Then, instead of html = f_webpage.read(), you can use html = readBuf(f_webpage) to fetch the page.
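The chunked-read idea above can be exercised without a live socket by reading from an in-memory file object instead (a sketch of mine, adapted to bytes so it works on Python 3 as well; io.BytesIO stands in for the HTTP response):

```python
import io

def readBuf(fsrc, length=16 * 1024):
    """Read fsrc to the end in fixed-size chunks instead of one big recv."""
    result = b''
    while True:
        buf = fsrc.read(length)
        if not buf:  # empty read signals end of stream
            break
        result += buf
    return result

# In-memory payload larger than one 16 KiB chunk, so the loop runs
# more than once before hitting end-of-stream:
data = b'x' * 40000
assert readBuf(io.BytesIO(data)) == data
```

Reading in bounded chunks means each recv asks the socket for a modest amount of data, which can avoid the stall seen with a single large read().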