I am trying to scrape some URLs with BeautifulSoup. The URLs I am scraping come from a Google Analytics API call, and some of them don't work properly, so I need to find a way to skip them.
Here is my initial script, which works fine when there are no bad URLs:
import urllib2
from bs4 import BeautifulSoup

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.blablabla.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace
        # save the data in a tuple
        sharelist.append((row, share))

    print(sharelist)
Based on a Stack Overflow answer, I have these lines to handle the bad URLs:
if name_box is None:
    continue
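This check only skips pages that load but have no nb-shares element. A URL that fails to load at all would make urlopen itself raise, so presumably a try/except is needed as well; a minimal sketch of that pattern (not currently in my script), using the same loop variable:

    try:
        page = urllib2.urlopen(row)
    except urllib2.URLError:
        # also covers urllib2.HTTPError (a subclass); skip URLs that fail to load
        continue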
Then I added these lines:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
at the top of my script, to handle this error: 'ascii' codec can't encode character u'\u200b' in position 22: ordinal not in range(128).
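From what I have read, the reload(sys) / setdefaultencoding hack is discouraged; the error apparently comes from implicitly encoding a unicode value (here containing u'\u200b', a zero-width space) to ASCII. A hypothetical alternative would be to encode explicitly at the point where the text is used, e.g.:

    share = name_box.text.strip()
    share = share.encode('utf-8')  # encode explicitly instead of relying on the default codec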
But now my script gives me an empty object.
Here is my final script:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import urllib2
from bs4 import BeautifulSoup

{...my api call here...}

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.blablabla.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

    urllist = [mystring + x for x in rawdata]

    for row in urllist:
        # query the website and return the html to the variable 'page'
        page = urllib2.urlopen(row)
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace
        # save the data in a tuple
        sharelist.append((row, share))

    print(sharelist)
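To narrow down why sharelist prints empty, I could presumably add a couple of diagnostic prints inside print_results (a sketch only, not part of my script):

    print(len(rawdata))   # 0 here would mean the API call returned no rows
    print(len(urllist))   # confirms the URLs were actually built
    # a counter incremented before each `continue` would show how many pages were skipped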