Avoiding try/except when web scraping HTML

Asked: 2019-03-13 19:58:14

Tags: python html beautifulsoup

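I've recently been writing a web scraper and found myself nesting try/except blocks and relying on errors to drive parts of my code, as in the following two snippets: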

try:
    reg_title = soup.find('p', {'class': "regnumber-e"}).text
except AttributeError:
    try:
        reg_title = soup.find('p', {'class': "regtitle-e"}).text
    except AttributeError:
        reg_title = soup.find('p', {'class': "Yregnumber-e"}).text

if soup.find_all('p', {'class': "Notice"}):
    try:
        #More code
    except IndexError:
        #More code
        continue
elif (soup.find_all('p', {'class': "ConsolidationPeriod-e"}) or
      soup.find_all('p', {'class': "ConsolidationPeriod"})):
    try:
        text = soup.find('p', {'class': "ConsolidationPeriod-e"}).text
    except AttributeError:
        text = soup.find('p', {'class': "ConsolidationPeriod"}).text
elif soup.find('p', {'class': "Notice-e"}):
    #More code
    continue
else:
    continue

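Obviously I've cut out parts of the code, but the specific code isn't important here. More generally, this sets off my code-smell alarm, and I feel there must be a better way to navigate differing HTML tags when web scraping. Any ideas?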

1 answer:

Answer 0 (score: 0)

Couldn't you just wrap all of the code in a single try/except that catches multiple exceptions? Like:

try:
    # All your code
    # For example
    # if soup.find_all('p', {'class': "Notice"}):
    #      ...
    # else:
    #      ...
except (AttributeError, IndexError) as e:
    continue

For the parts where you want to get the text, I think a single test is enough.

Like:

if soup.find('p', {'class': "ConsolidationPeriod-e"}):
    text = soup.find('p', {'class': "ConsolidationPeriod-e"}).get_text()
else:
    text = soup.find('p', {'class': "ConsolidationPeriod"}).text

Or:

if soup.find('p', {'class': "regnumber-e"}):
    reg_title = soup.find('p', {'class': "regnumber-e"}).get_text()
elif soup.find('p', {'class': "regtitle-e"}):
    reg_title = soup.find('p', {'class': "regtitle-e"}).get_text()
else:
    reg_title = soup.find('p', {'class': "Yregnumber-e"}).get_text()
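
Note that the if/elif versions above look up each element twice, and the final else branch still raises AttributeError if the last fallback class is missing too. One way to avoid both is a small helper that tries each class in order. This is only a sketch: `first_text` and its `default` parameter are hypothetical names, not part of BeautifulSoup. It relies only on `find` returning None when nothing matches.

```python
def first_text(soup, tag, class_names, default=None):
    """Return the text of the first `tag` whose class matches one of
    `class_names`, tried in the order given.

    Hypothetical helper (not a BeautifulSoup API). `soup` is expected to
    behave like a BeautifulSoup object: `find` returns None on no match.
    """
    for class_name in class_names:
        element = soup.find(tag, {'class': class_name})
        if element is not None:
            # Found a match: stop here so earlier classes take priority.
            return element.get_text()
    # None of the candidate classes were present on the page.
    return default
```

With that in place, the first snippet from the question could become `reg_title = first_text(soup, 'p', ["regnumber-e", "regtitle-e", "Yregnumber-e"])`, followed by an explicit `if reg_title is None:` check to decide whether to skip the page, instead of letting an exception do it.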