I am writing a simple Python program that retrieves information from a website. The problem is that some words contain special characters such as "°", "Ψ" and many more.
Here is my code:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('http://www.samplepage.sample').read()
soup = BeautifulSoup(r, "lxml")
text = soup.find_all("a", class_="some_class")
for word in text:
    word = word.get_text()
    word = word.encode('utf-8')
    print word
The output should be "°", but what I actually get is a garbled version of it.
If I try to encode it as ASCII, I get the classic UnicodeEncodeError:
for word in text:
    word = word.get_text()
    word = word.encode('ascii')
    print word
>>> UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Any ideas?
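Answer 0 (score: 0)
This is probably because the string was decoded with the wrong codec.
Try printing the string as it is first. Before encoding it as UTF-8, you need to decode the byte string with the correct codec. That gives you a Unicode object, which you can print and which should display correctly.
If the text contains a special character outside the ASCII range, you need a Unicode object to display it correctly.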
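To illustrate the point about characters outside the ASCII range, here is a minimal sketch. The sample string is hypothetical (it is not taken from the page in the question), and the snippet is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: the degree sign (U+00B0) has no ASCII representation,
# but it encodes to UTF-8 and decodes back without loss.
text = u"25 \u00b0C"  # hypothetical sample text, not from the real page

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # Same error as in the question: ordinal not in range(128)
    ascii_ok = False

utf8_bytes = text.encode("utf-8")        # Unicode object -> bytes
round_trip = utf8_bytes.decode("utf-8")  # bytes -> Unicode object
```

After this runs, `ascii_ok` is `False`, while `round_trip` equals the original `text`, which is why the Unicode object (rather than ASCII) is the thing to work with.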
Try the following:
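# Assumes word is a byte string whose actual encoding is Latin-1
new_word = word.decode('latin-1')  # bytes -> Unicode object
print new_word
word = new_word.encode('utf-8')  # re-encode only if UTF-8 bytes are needed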
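Why the choice of codec matters can be shown with a small, self-contained sketch. It assumes the bytes in question are UTF-8 (the real encoding of the page in the question is not known), and the variable names are illustrative only. It should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: decoding UTF-8 bytes with the wrong codec produces garbled text.
raw = u"\u00b0".encode("utf-8")  # the degree sign as UTF-8: two bytes, C2 B0

wrong = raw.decode("latin-1")    # each byte becomes its own character
right = raw.decode("utf-8")      # the two bytes become one character

# If text was already mis-decoded as Latin-1, reversing that step
# recovers the original bytes, which can then be decoded correctly.
repaired = wrong.encode("latin-1").decode("utf-8")
```

Here `wrong` holds two characters (U+00C2 followed by U+00B0) while `right` holds the single degree sign, so it may be worth checking which encoding the page actually declares before picking a codec to decode with.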