Question

While scraping web data, I run into 'ascii' encoding problems, for example:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128)
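A minimal sketch of how this error arises, assuming the scraped text contains non-ASCII characters such as German umlauts. The sample string and the helper name `try_ascii` are made up for illustration. In Python 2 the same failure is often triggered implicitly, for example when a unicode string is mixed with byte strings or written to an ASCII-only output:

```python
# Hypothetical helper: reports whether text survives an ASCII encode.
def try_ascii(text):
    try:
        text.encode("ascii")
        return "ok"
    except UnicodeEncodeError as exc:
        # exc.start and exc.end delimit the offending character range
        return "failed at %d-%d" % (exc.start, exc.end - 1)

# Illustrative sample text ("Gruesse aus Koeln" written with umlauts and sharp s)
sample = u"Gr\u00fc\u00dfe aus K\u00f6ln"

result_plain = try_ascii(u"Hello")   # pure ASCII input
result_german = try_ascii(sample)    # contains non-ASCII characters
```

Pure ASCII text encodes without trouble, while the German sample fails at the first character outside the 0-127 range.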

I came across this controversial workaround, which some people say is dangerous:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

See: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

I am using Beautiful Soup, and my application indexes text collected in different languages, such as German and French as well as English.

This is the snippet that occasionally produces the error:

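import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3, per the traceback

# pages: a list of URLs to index (defined elsewhere)
for page in pages:
    try:
        c = urllib2.urlopen(page)
    except Exception:  # avoid a bare except, which also swallows Ctrl-C
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())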

Traceback:

    soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)

What is the safest way to scrape my data here without resorting to that workaround?
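For context, one commonly suggested alternative to changing the interpreter-wide default is to decode the downloaded bytes to Unicode explicitly. The following is only a sketch under stated assumptions: the function name `to_unicode`, its `declared` parameter, and the fallback order (declared charset, then UTF-8, then Latin-1) are illustrative choices, not something Beautiful Soup or urllib2 prescribes:

```python
def to_unicode(raw, declared=None):
    """Decode raw bytes to text, trying the declared charset first."""
    # Assumed fallback order: declared charset (if any), then UTF-8
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate
            continue
    # Latin-1 maps every byte to a code point, so this never raises,
    # although the result may be wrong for other encodings
    return raw.decode("latin-1")

# Illustrative inputs standing in for page bodies in two encodings
utf8_bytes = u"Gr\u00fc\u00dfe".encode("utf-8")
latin1_bytes = u"Fran\u00e7ais".encode("latin-1")

text_a = to_unicode(utf8_bytes)
text_b = to_unicode(latin1_bytes)
text_c = to_unicode(latin1_bytes, declared="iso-8859-1")
```

If the server sends a charset in its Content-Type header, that value could be passed as `declared`. The decoded text would then be encoded explicitly, for example with `text.encode('utf-8')`, only at the point where bytes are actually required.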

Python - Setting encoding for scraping in many languages

0 answers: