I am trying to scrape some URLs with BeautifulSoup. The URLs I am scraping come from a Google Analytics API call, and some of them don't work properly, so I need to find a way to skip them.
Here is my initial script, which works fine when there are no bad URLs:
import urllib2
from bs4 import BeautifulSoup

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.blablabla.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace
        # save the data in a tuple
        sharelist.append((row, share))

    print(sharelist)
Based on a Stack Overflow answer, I have these lines to handle the bad URLs:
if name_box is None:
    continue
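This check only skips pages that load but have no nb-shares element. A URL that fails to load at all would make urlopen itself raise, so presumably a try/except is needed as well; a minimal sketch of that pattern (not currently in my script), using the same loop variable:

    try:
        page = urllib2.urlopen(row)
    except urllib2.URLError:
        # also covers urllib2.HTTPError (a subclass); skip URLs that fail to load
        continue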
Then I added these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
at the top of my script, to handle this error: 'ascii' codec can't encode character u'\u200b' in position 22: ordinal not in range(128).
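From what I have read, the reload(sys) / setdefaultencoding hack is discouraged; the error apparently comes from implicitly encoding a unicode value (here containing u'\u200b', a zero-width space) to ASCII. A hypothetical alternative would be to encode explicitly at the point where the text is used, e.g.:

    share = name_box.text.strip()
    share = share.encode('utf-8')  # encode explicitly instead of relying on the default codec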
But now my script gives me an empty object.
Here is my final script:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import urllib2
from bs4 import BeautifulSoup

{...my api call here...}

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.blablabla.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace
        # save the data in a tuple
        sharelist.append((row, share))

    print(sharelist)
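To narrow down why sharelist prints empty, I could presumably add a couple of diagnostic prints inside print_results (a sketch only, not part of my script):

    print(len(rawdata))   # 0 here would mean the API call returned no rows
    print(len(urllist))   # confirms the URLs were actually built
    # a counter incremented before each `continue` would show how many pages were skipped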