Here is what I did:
>>> soup = BeautifulSoup(html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>>
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>>
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>
How can I remove the troublesome Unicode characters from the HTML?
Or is there a cleaner solution?
Answer 0 (score: 10)
Try it this way:
soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
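A minimal sketch of what the 'ignore' error handler does, assuming the page arrived as raw bytes. The sample bytes and the name `raw` are made up for illustration, and the snippet uses Python 3 syntax (the sessions in this thread are Python 2). Note that 'ignore' only drops bytes that are not valid UTF-8; valid non-ASCII characters such as ® stay in the text.

```python
# Hypothetical page bytes: a valid UTF-8 "®" (0xc2 0xae) plus one stray,
# invalid 0xae byte on its own.
raw = b"<p>Brand\xc2\xae and stray \xae byte</p>"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD, so the loss stays visible.
replaced = raw.decode("utf-8", "replace")

# ascii() escapes non-ASCII characters, so printing is safe on any console.
print(ascii(ignored))
print(ascii(replaced))
```

If silently losing bytes is a concern, 'replace' may be the safer choice, since the replacement character shows where the input was damaged.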
Answer 1 (score: 2)
The error you are seeing comes from repr(soup) trying to mix Unicode strings and bytestrings. Mixing the two frequently leads to errors.
Compare
>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
and
>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
Here is an example with classes:
>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
Something similar happens with BeautifulSoup:
>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)
Workaround:
>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
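The workaround amounts to converting explicitly instead of letting the interpreter fall back to ASCII. A small sketch of that idea without BeautifulSoup: the `markup` string below is only a stand-in for `unicode(soup)`, and the two extra error handlers are standard codec options, not anything specific to BeautifulSoup.

```python
# Stand-in for unicode(soup); the name and value are illustrative only.
markup = u"<p>\u00a9</p>"

# Explicit UTF-8 encoding, as in soup.encode('utf-8') above.
utf8_bytes = markup.encode("utf-8")

# ASCII-only alternatives that never raise UnicodeEncodeError:
escaped = markup.encode("ascii", "backslashreplace")     # Python-style escapes
char_refs = markup.encode("ascii", "xmlcharrefreplace")  # HTML character references

print(repr(utf8_bytes))
print(repr(escaped))
print(repr(char_refs))
```

For HTML output, 'xmlcharrefreplace' may be the most convenient of the three, since a browser renders the character reference as the original character.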
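Answer 2 (score: 1)
First of all, the "troublesome" Unicode characters may well be letters of some language. But assuming you do not need to care about non-English characters, you can use a Python library to convert Unicode to ASCII. Take a look at the answers to this question: How do I convert a file's format from Unicode to ASCII using Python?
The accepted answer looks like a good solution (one I did not know about beforehand).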
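One common way to do such a conversion is sketched below, on the assumption that dropping characters is acceptable. It is not necessarily the approach from the linked answer, and the helper name `to_ascii` is hypothetical. The idea is to decompose characters with `unicodedata.normalize` and then discard whatever still does not fit into ASCII.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of a Unicode string (lossy)."""
    # NFKD splits accented letters into a base letter plus combining marks
    # and expands compatibility characters such as the trademark sign.
    decomposed = unicodedata.normalize("NFKD", text)
    # Everything that is still non-ASCII (combining marks, symbols) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Caf\u00e9 \u00ae"))   # accent removed, "®" dropped
print(to_ascii(u"Brand\u2122"))        # trademark sign expanded to "TM"
```

This keeps accented Latin letters readable, but text in non-Latin scripts would disappear entirely, which matches the caveat at the start of this answer.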
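Answer 3 (score: 0)
I ran into the same problem and spent hours on it. Note that the error occurs whenever the interpreter has to display the content: the interpreter tries to convert it to ASCII, and that conversion is what fails. Take a look at the top answer to:
UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2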
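The point that the failure happens at display time, not at parse time, can be reproduced without BeautifulSoup. In this sketch an in-memory stream with an ASCII encoding stands in for the console; the names `ascii_console` and `tolerant_console` are made up for illustration.

```python
import io

text = u"amazon\u00ae"  # contains the same character as the original error

# A text stream that can only emit ASCII, like a console with an ASCII locale.
ascii_console = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_console.write(text)
    ascii_console.flush()
    failed = False
except UnicodeEncodeError:
    failed = True  # the string itself is fine; only writing it out fails

# The same stream type with a forgiving error handler escapes the character.
buffer = io.BytesIO()
tolerant_console = io.TextIOWrapper(buffer, encoding="ascii",
                                    errors="backslashreplace")
tolerant_console.write(text)
tolerant_console.flush()
shown = buffer.getvalue()

print(failed)
print(repr(shown))
```

The string is held in memory without any problem in both cases; only the stream's encoding decides whether writing it out raises an error.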