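While scraping web pages, I ran into 'ascii' encoding problems, for example:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128)
I came across this controversial workaround, which some people say is dangerous: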
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
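See here: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
I am using Beautiful Soup, and my application indexes text collected in different languages, such as German and French as well as English.
This is the snippet that produces the occasional error: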
for page in pages:
    try:
        c = urllib2.urlopen(page)
    except:
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())
Traceback:
    soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
What is the safest way to scrape my data here without resorting to that workaround?
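For context, the alternative I have seen suggested to changing the default encoding is to decode the downloaded bytes explicitly and to encode explicitly only when writing output. Below is a minimal sketch of that idea, not a confirmed fix for the traceback above. The helper name decode_page, its parameters declared_charset and fallbacks, and the choice of fallback encodings are all assumptions made for illustration.

```python
# Hypothetical helper: turn raw page bytes into unicode text explicitly,
# instead of relying on the interpreter's default 'ascii' codec.
def decode_page(raw_bytes, declared_charset=None, fallbacks=("utf-8", "latin-1")):
    # Try the charset the server declared first (if any), then the fallbacks.
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw_bytes.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: move on to the next candidate.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw_bytes.decode("utf-8", "replace")


# Example bytes standing in for c.read(): German text encoded as UTF-8
# and French text encoded as Latin-1.
german = u"Gr\u00fc\u00dfe".encode("utf-8")
french = u"caf\u00e9".encode("latin-1")

text_de = decode_page(german)
text_fr = decode_page(french)

# Encode explicitly at the output boundary (file, terminal, database).
output_bytes = text_fr.encode("utf-8")
```

The decoded text could then be handed to BeautifulSoup in place of the raw c.read() result, so that every conversion between bytes and text is visible in the code rather than implicit.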