lxml.etree.ParserError: Document is empty

Time: 2017-03-12 14:17:36

Tags: python python-requests lxml

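I am writing a Google Analytics scraper for research purposes. This is my first post, so apologies in advance if the formatting is off. This may be a noob question (I have already tried the suggestions in 25491872), but here it goes anyway.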

Python version: 2.7.6

Requests version: 2.2.1

etree.LXML_VERSION: (3, 3, 3, 0)

lxml is returning a ParserError:

  

lxml.etree.ParserError: Document is empty

The failing part of the code looks like this:

import csv
import linecache

import requests
from lxml import html

user_agent={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36'}

for _ in range(row_count):  
    url="http://"+linecache.getline("sourcelist.csv", i).rstrip('\n') #the error only occurs on a specific webpage: http://www.mixednews.ru

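    # Reuse the User-Agent string defined above and ask for a compressed response.
    headers = {"User-Agent": user_agent["User-Agent"], "Accept-Encoding": "gzip, deflate"}
    print url
    try:
        page=requests.get(url, headers=headers, timeout=10)

        print page.headers
        print repr(page)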

    except requests.exceptions.Timeout:
        i =i+1
        fields=[url,["TimeOut"]]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "timeout"
        continue

    except requests.exceptions.ConnectionError: #HTTPSConnectionPool:
        i=i+1
        fields=[url, "URL not valid"]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "URL not valid"
        continue

    except requests.exceptions.ContentDecodingError:
        i=i+1
        fields=[url, "Host server error"]
        with open ("output.csv", "ab") as o:
            writer=csv.writer(o)
            writer.writerows([fields])
        print "Host server error"
        continue        
    tree=html.fromstring(page.content) #error
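
The three except branches above write their rows to output.csv in exactly the same way, so that step could be pulled into one helper. The following is only a sketch: the name append_row is hypothetical and not part of the original script, and the version check is there because Python 2's csv module expects a binary file while Python 3 expects a text file opened with newline="".

```python
import csv
import sys


def append_row(path, row):
    """Append one row to a CSV file (hypothetical helper, not in the original script)."""
    # Python 2's csv module wants a binary file handle;
    # Python 3 wants a text handle opened with newline="".
    if sys.version_info[0] < 3:
        handle = open(path, "ab")
    else:
        handle = open(path, "a", newline="")
    with handle:
        csv.writer(handle).writerow(row)
```

With such a helper, each branch would reduce to a call like append_row("output.csv", [url, "TimeOut"]) followed by continue.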

page.headers and repr(page) return:

  

CaseInsensitiveDict({'wp-super-cache': 'Served supercache file from PHP', 'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding, Cookie', 'server': 'nginx', 'connection': 'keep-alive', 'cache-control': 'max-age=3, must-revalidate', 'date': 'Sun, 12 Mar 2017 13:36:05 GMT', 'content-type': 'text/html; charset=UTF-8'})

     

<Response [200]>
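
A 200 status does not guarantee that the body contains anything, and lxml typically reports "Document is empty" when the input it receives has no parseable content. One check that might narrow this down is to look at page.content before parsing it. The sketch below assumes the cause is an empty or whitespace-only body, which the output above does not confirm. The helper names is_blank_body and empty_body_row are hypothetical.

```python
def is_blank_body(content):
    """Return True when a response body has nothing for an HTML parser to read."""
    return content is None or len(content.strip()) == 0


def empty_body_row(url, status_code, content):
    """Build an output row for an unusable body, in the same [url, reason]
    shape the except branches write for timeouts and connection errors."""
    size = len(content or b"")
    return [url, "Empty body (status %s, %d bytes)" % (status_code, size)]
```

In the loop, the check would sit just before html.fromstring(page.content): if is_blank_body(page.content) is true, write empty_body_row(url, page.status_code, page.content) to the output file and continue instead of parsing. If the body turns out not to be blank, the assumption is wrong and the cause lies elsewhere.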

Any ideas as to why this is happening, and what can be done to fix it?

0 answers:

No answers yet