Python - Setting the encoding for scraping in many languages

Time: 2016-11-16 23:58:19

Tags: python encoding web-scraping

While scraping web data, I ran into 'ascii' encoding errors such as:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128)
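
For reference, an error of this kind typically appears when non-ASCII text is encoded, often implicitly, with the ASCII codec. A minimal reproduction follows; the sample string is an illustrative assumption and is not taken from the scraped pages:

```python
# Minimal reproduction; the sample string is an illustrative assumption.
text = u"Stra\xdfe und Caf\xe9"  # "Straße und Café" as a Unicode string

try:
    # Encoding with the ASCII codec fails on any code point above 127.
    text.encode("ascii")
    ascii_failed = False
    error_message = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    error_message = str(exc)

# Encoding explicitly with UTF-8 works for ordinary text in any language.
utf8_bytes = text.encode("utf-8")
```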

I came across this controversial solution, which some people say is dangerous:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

See: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
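
A commonly suggested alternative to changing the process-wide default is to keep text as Unicode internally and encode it explicitly at the point of output. Below is a minimal sketch, assuming the indexed text ends up in a file; the helper names `save_text` and `load_text`, the sample string, and the file path are all hypothetical:

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write Unicode text to path, encoding it as UTF-8 explicitly."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3,
    # so no process-wide default has to be changed.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def load_text(path):
    """Read the file back, decoding it as UTF-8 explicitly."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Usage with an illustrative French/German sample (hypothetical data).
sample = u"d\xe9j\xe0 vu, Gr\xfc\xdfe"
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
save_text(demo_path, sample)
round_tripped = load_text(demo_path)
```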

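I am using Beautiful Soup, and my application indexes text collected in different languages, such as German and French as well as English.

This is the snippet that occasionally raises the error: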

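import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3, per the traceback path

for page in pages:
    try:
        c = urllib2.urlopen(page)
    except Exception:
        # Skip pages that cannot be fetched and move on to the next one.
        print "Could not open %s" % page
        continue
    # The raw response bytes are handed straight to the parser.
    soup = BeautifulSoup(c.read())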

Traceback:

    soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
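
The snippet passes the raw bytes from c.read() straight to the parser. One way to make the encoding explicit is to decode those bytes first, using the charset named in the Content-Type header when there is one and falling back to UTF-8 otherwise. This is only a sketch under those assumptions, not a confirmed fix for the traceback shown here; `decode_body` and its parameters are hypothetical names, and the UTF-8 fallback is a guess that may not suit every site:

```python
import re

# Matches e.g. "text/html; charset=ISO-8859-1" (case-insensitive).
_CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE)


def decode_body(raw_bytes, content_type=None, fallback="utf-8"):
    """Return raw_bytes decoded to Unicode text.

    content_type is the value of the HTTP Content-Type header, if any.
    """
    candidates = []
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            candidates.append(match.group(1))
    candidates.append(fallback)

    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: try the next candidate.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw_bytes.decode(fallback, "replace")
```

With Python 2's urllib2, the header value could probably be obtained with c.info().getheader("Content-Type"), and the decoded text passed to BeautifulSoup in place of c.read(). Pages that declare their charset only in a meta tag would still need separate handling.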

What is the safest way to scrape my data here without just papering over the problem?

0 answers:

No answers yet