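I'm writing a Google Analytics scraper for research purposes. This is my first post, so apologies in advance if the formatting is wrong. This may be a noob question (I have already tried the suggestions in 25491872), but here goes anyway.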
Python version: 2.7.6
requests version: 2.2.1
etree.LXML_VERSION: (3, 3, 3, 0)
lxml is raising a ParserError:
lxml.etree.ParserError: Document is empty
The failing part of the code looks like this:
import csv
import linecache

import requests
from lxml import html

user_agent={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36'}
# row_count and i (the current line number in sourcelist.csv) are set earlier in the script
for _ in range(row_count):
    url="http://"+linecache.getline("sourcelist.csv", i).rstrip('\n') #the error only occurs on a specific webpage: http://www.mixednews.ru
    # the User-Agent value was missing in the post; assumed to come from user_agent above
    headers = {"User-Agent": user_agent["User-Agent"], "Accept-Encoding": "gzip, deflate"}
    print url
    try:
        page=requests.get(url, headers=headers, timeout=10)
        print page.headers
        print repr(page)
    except requests.exceptions.Timeout:
        i=i+1
        fields=[url,["TimeOut"]]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "timeout"
        continue
    except requests.exceptions.ConnectionError: #HTTPSConnectionPool:
        i=i+1
        fields=[url, "URL not valid"]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "URL not valid"
        continue
    except requests.exceptions.ContentDecodingError:
        i=i+1
        fields=[url, "Host server error"]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "Host server error"
        continue
    tree=html.fromstring(page.content) #error
page.headers and repr(page) return:
CaseInsensitiveDict({'wp-super-cache': 'Served supercache file from PHP', 'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding, Cookie', 'server': 'nginx', 'connection': 'keep-alive', 'cache-control': 'max-age=3, must-revalidate', 'date': 'Sun, 12 Mar 2017 13:36:05 GMT', 'content-type': 'text/html; charset=UTF-8'})
<Response [200]>
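Since the status is 200 but lxml still complains that the document is empty, one check I can add before parsing is whether the response body actually contains anything. As far as I understand, lxml reports "Document is empty" when it is handed input with no parsable content. The sketch below is only a diagnostic, not a fix: `body_summary` is a hypothetical helper name, and the idea that the body might be empty or whitespace-only is an assumption I have not confirmed.

```python
from __future__ import print_function


def body_summary(content):
    """Return (length, is_blank) for a raw response body.

    is_blank is True when the body is None, empty, or whitespace only.
    body_summary is a hypothetical helper, not part of requests or lxml.
    """
    if content is None:
        return 0, True
    return len(content), len(content.strip()) == 0


# Hypothetical usage inside the loop, just before html.fromstring(page.content):
#     length, is_blank = body_summary(page.content)
#     if is_blank:
#         print("blank body for", url, "status", page.status_code, "bytes", length)
if __name__ == "__main__":
    for sample in (b"", b"  \n", b"<html></html>"):
        print(repr(sample), body_summary(sample))
```

If `is_blank` comes back True for http://www.mixednews.ru, the problem would be in what the server sends (or how it is decoded) rather than in lxml itself.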
Any ideas why this is happening, and what can be done to fix it?