I'm trying to fetch some data from a website, but it returns an IncompleteRead error. The data I want to fetch is a huge set of nested links. I did some research online and found that this may be caused by a server-side error (a chunked transfer encoding finishing before reaching the expected size). I also found a workaround at this link, but I'm not sure how to apply it to my case. Below is the code I'm working with:
import urllib2
import urlparse
import mechanize
from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)
for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me. Thanks.
Answer 0 (score: 19)
The link you included in your question is simply a wrapper over urllib's read() function that catches any incomplete-read exceptions for you. If you don't want to implement the whole patch, you can always wrap your read in a try/except block. For example:
import httplib

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    page = e.partial
For Python 3:
from urllib import request
import http.client

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
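For reference, the core of the linked patch is a monkey-patch that wraps httplib's read so partial data is returned instead of raising. A minimal Python 2 sketch of that idea (the helper name here is illustrative, not necessarily the one from the link):

import httplib

def patch_http_response_read(func):
    # Illustrative sketch: wrap HTTPResponse.read so an IncompleteRead
    # yields the partial data instead of propagating the exception.
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)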
Answer 1 (score: 7)
What I found in my case: sending the requests over HTTP/1.0 solved the problem. Adding this fixed it:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
After I make the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunked responses, but for some reason this web server doesn't, so we perform the request over HTTP/1.0.
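On Python 3, where httplib became http.client, the same private class attributes appear to exist, so the equivalent of this trick would presumably be:

import http.client

# Assumed Python 3 equivalent: force HTTP/1.0 via the same private,
# undocumented attributes on http.client.HTTPConnection.
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'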
Answer 2 (score: 1)
What worked for me was catching IncompleteRead as an exception and harvesting the data you managed to read in each iteration by putting it into a loop like the one below (note: I'm using Python 3.4.1, and the urllib library changed between 2.7 and 3.4):
import json
import urllib.request
import http.client

def call_rest(url, data):
    # wrapped in a function so the return statements below are valid
    try:
        requestObj = urllib.request.urlopen(url, data)
        responseJSON = ""
        while True:
            try:
                responseJSONpart = requestObj.read()
            except http.client.IncompleteRead as icread:
                # keep the partial data and continue reading
                responseJSON = responseJSON + icread.partial.decode('utf-8')
                continue
            else:
                responseJSON = responseJSON + responseJSONpart.decode('utf-8')
                break
        return json.loads(responseJSON)
    except Exception as RESTex:
        print("Exception occurred making REST call: " + RESTex.__str__())
Answer 3 (score: 1)
You can use requests instead of urllib2. requests is based on urllib3, so it rarely has this kind of problem. Put it in a loop that tries 3 times and it will be much more robust. You can use it this way:
import inspect
import sys
import time
import requests

msg = None
for i in [1, 2, 3]:
    try:
        # self.crawling holds the URL (this snippet comes from a class method)
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg:
            break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
    time.sleep(10 * (i - 1))
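If you'd rather not hand-roll the loop, requests can also delegate retries to urllib3 through a mounted HTTPAdapter. A minimal sketch, reusing the URL from the question:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Let urllib3 retry up to 3 times with backoff instead of looping manually.
session = requests.Session()
adapter = HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1))
session.mount('http://', adapter)
session.mount('https://', adapter)
r = session.get('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands', timeout=30)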
Answer 4 (score: 0)
I found that my virus scanner/firewall was causing this problem: specifically, the "Online Shield" component of AVG.
Answer 5 (score: 0)
I tried all of these solutions and none of them worked for me. Actually, what did work was that instead of using urllib, I just used http.client (Python 3):
import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')
This works perfectly every time, whereas urllib returned an incomplete-read exception every time.
Answer 6 (score: 0)
I just added one more exception to get past this problem, like this:
import logging
import requests

try:
    r = requests.get(url, timeout=timeout)
except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
    logging.error("There is an error: %s" % e)
Answer 7 (score: 0)
Python 3, just for reference:
from urllib import request
import http.client

url = 'http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brand'

def download(url, file_path):
    # wrapped in a function so the return below is valid; file_path
    # is assumed to be supplied by the caller
    try:
        response = request.urlopen(url)
        file = response.read()
    except http.client.IncompleteRead as e:
        file = e.partial
    except Exception as result:
        print("Unknown error: " + str(result))
        return
    # save file
    with open(file_path, 'wb') as f:
        print("save -> %s " % file_path)
        f.write(file)