How to handle and pass over lxml.etree.ParserError in html.fromstring (lxml)

Time: 2018-10-23 12:17:11

Tags: python exception-handling python-requests html-parsing lxml

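[Sorry if this question is naive. I am new to Python and requests.] I have tried many approaches, but I still cannot find a way to catch lxml.etree.ParserError raised by html.fromstring and skip past it.

My code is as follows: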

    from lxml import html
    import requests
    import time
    import csv
    import pandas as pd

    start = time.time()
    data = {}
    data['webid'] = []
    data['name'] = []
    filename = "xxx"+ str(N) + ".csv"

    for i in range(N):
            url = "http://xxx."+str(i)+".html"
            print(url)
            try:
                    page = requests.get(url,timeout=120)
                    try:
                            tree = html.fromstring(page.content)
                            name = tree.xpath('//h2[starts-with(text(),"Name")]/text()')
                            data['webid'].append(i)
                            data['name'].append(name)

                    except (html.ParseError, ParseError):
                            continue
            except requests.exceptions.RequestException as e:
                    continue

            dataframe = pd.DataFrame(data)
            dataframe.to_csv(filename, index=False, sep='|')

    print("took", time.time() - start, "seconds.")

The error shown is:

    Traceback (most recent call last):
      File "xxx/xxx.py", line 43, in <module>
        tree = html.fromstring(page.content)
      File "\AppData\Local\Programs\Python\Python37-32\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
        doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
      File "\AppData\Local\Programs\Python\Python37-32\lib\site-packages\lxml\html\__init__.py", line 765, in document_fromstring
        "Document is empty")
    lxml.etree.ParserError: Document is empty

I tried several exception names in the except clause, such as lxml.etree.ParserError, etree.ParserError, tree.ParserError, and html.ParserError, but none of them worked, and they produced a new error:

    Traceback (most recent call last):
      File "D:/Google/Research2017/t/schoolpc/v241.py", line 66, in <module>
        except (lxml.etree.ParserError, ParseError):
    NameError: name 'lxml' is not defined

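Is there a way to catch the ParserError and continue the loop without raising an error? Thank you very much!

Many thanks to @BoboDarph. The solution is as follows: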

    from lxml.etree import ParseError
    from lxml.etree import ParserError
    ...except (ParserError, ParseError)

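"The NameError is caused by how and what you import. Your import statement imports one specific module of lxml (html), so you cannot refer to lxml.etree.ParserError or ParseError, because those names were never imported. Import the exceptions with a separate import statement (something like from lxml.etree import ParseError)." (comment by BoboDarph)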
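To make the fix concrete, here is a minimal sketch of the same idea without the network request. The helper name `extract_names` and the sample HTML are hypothetical and used only for illustration; the XPath expression is the one from the question. It assumes that an empty document is what triggers the error, as in the traceback above.

```python
from lxml import html
# Import the exception classes by name so the except clause can refer to them.
from lxml.etree import ParseError, ParserError


def extract_names(content):
    """Return the matching <h2> text nodes, or None if lxml cannot parse content.

    `extract_names` is a hypothetical helper, not part of lxml.
    """
    try:
        tree = html.fromstring(content)
    except (ParserError, ParseError):
        # For example "Document is empty": signal the caller to skip this page.
        return None
    return tree.xpath('//h2[starts-with(text(),"Name")]/text()')


# An empty body, like the one that caused the traceback in the question.
print(extract_names(b""))
# A small hand-written page (made up for this example).
print(extract_names(b"<html><body><h2>Name: Alice</h2></body></html>"))
```

In the original loop, the same except clause with `continue` in place of `return None` should skip the unparseable page and move on to the next URL.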

0 answers:

No answers