Skipping 404 errors with BeautifulSoup

Time: 2016-10-20 17:26:44

Tags: python beautifulsoup

I am trying to scrape some URLs with BeautifulSoup. The URLs I am scraping come from a Google Analytics API call, and some of them do not work properly, so I need a way to skip them.

I tried adding this:

except urllib2.HTTPError:
continue

But I got the following syntax error:

    except urllib2.HTTPError:
         ^
SyntaxError: invalid syntax

Here is my full code:

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'
def print_results(results):
  # Print data nicely for the user.

  if results:
    for row in results.get('rows'):
      rawdata.append(row[0])
  else:
    print 'No results found'

  urllist = [mystring + x for x in rawdata]

  for row in urllist:  
            # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(row)
    except urllib2.HTTPError:
    continue
    soup = BeautifulSoup(page, 'html.parser')

                # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
      continue
    share = name_box.text.strip() # strip() is used to remove starting and trailing

    # save the data in tuple
    sharelist.append((row,share))

  print(sharelist)

5 Answers:

Answer 0 (score: 2)

Your except statement is not preceded by a try statement. You should use the following pattern:

try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

Also pay attention to indentation. The code executed under the try clause must be indented, and so must the except clause.
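
For instance, a minimal sketch of how this pattern sits inside the loop from the question (the rest of the loop body continues at the same level as the try):

for row in urllist:
    try:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue  # skip this URL and move on to the next one
    soup = BeautifulSoup(page, 'html.parser')
    # ... rest of the loop body stays at this level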

Answer 1 (score: 2)

Two errors:
1. No try statement
2. Missing indentation

Use this:

for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue

Answer 2 (score: 1)

If you want to catch only 404s, you need to check the returned code or re-raise the error; otherwise you will catch and ignore more than just 404s:

import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin


def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]
    for url in urllist:
        # query the website and return the html to the variable 'page'
        try: # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404: # check the return code
                continue
            raise # if other than 404, raise the error

        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing

        # save the data in tuple
        sharelist.append((url, share))

    print(sharelist)
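
As a standalone sketch of that pattern (the URL below is hypothetical), catching only 404 and re-raising everything else looks like this:

import urllib2

try:
    page = urllib2.urlopen('http://www.konbini.com/missing-page')  # hypothetical URL
except urllib2.HTTPError as e:
    if e.code == 404:  # e.code holds the HTTP status, the same value as e.getcode()
        print 'skipping 404'
    else:
        raise  # any error other than 404 propagates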

Answer 3 (score: 0)

The syntax error comes from the fact that you have an except clause without a matching try statement.

try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

Answer 4 (score: 0)

As others have already mentioned:

  1. The try statement is missing.
  2. Proper indentation is missing.
  3. You should use an IDE or editor so that you do not run into problems like this; there are several good IDEs and editors available.

In any case, here is the code after adding the try and fixing the indentation:

    import urllib2
    from bs4 import BeautifulSoup

    rawdata = []
    urllist = []
    sharelist = []
    mystring = 'http://www.konbini.com'


    def print_results(results):
        # Print data nicely for the user.
        if results:
            for row in results.get('rows'):
                rawdata.append(row[0])
        else:
            print 'No results found'
        urllist = [mystring + x for x in rawdata]
        for row in urllist:
            # query the website and return the html to the variable 'page'
            try:
                page = urllib2.urlopen(row)
            except urllib2.HTTPError:
                continue

            soup = BeautifulSoup(page, 'html.parser')
            # Take out the <div> of name and get its value
            name_box = soup.find(attrs={'class': 'nb-shares'})
            if name_box is None:
                continue
            share = name_box.text.strip()  # strip() is used to remove starting and trailing

            # save the data in tuple
            sharelist.append((row, share))

        print(sharelist)
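
For reference, a hypothetical call showing the input shape print_results expects (a dict resembling a Google Analytics API response, where each row's first column is a URL path; the data below is made up for illustration):

    # hypothetical input, shaped like results.get('rows') expects
    fake_results = {'rows': [['/page-one'], ['/page-two']]}
    print_results(fake_results)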