How do I exclude characters that are not in the BMP in Python?

Time: 2018-07-20 15:48:08

Tags: python python-3.x utf-8 urllib

This is an application that takes a term, scrapes Urban Dictionary, and returns the first meaning on the page. This is my code so far:

import re
import urllib.request

term = input('Enter a word: ')
url = "https://www.urbandictionary.com/define.php?term=" + term

rawData = urllib.request.urlopen(url).read()
decodedData = rawData.decode("utf-8")

x = re.search('div class="meaning"', rawData)
start = x.start()
end = x.end()
result = rawData[start:end]
print(result)

But I get the error below:

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    print(decodedData)
  File "~\Python\Python35-32\lib\idlelib\PyShell.py", line 1344, in write
    return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 95889-95889: Non-BMP character not supported in Tk

How do I exclude the characters that can't be decoded?

1 Answer:

Answer 0: (score: 1)

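OK, to solve your problem you just need to actually use the decoded data. Right now you decode the data, but then you go on to use rawData, which is still bytes; passing bytes to re.search together with a str pattern raises a TypeError.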

import re
import urllib.request

term = input('Enter a word: ')
url = "https://www.urbandictionary.com/define.php?term=" + term

rawData = urllib.request.urlopen(url).read()
decodedData = rawData.decode("utf-8")

x = re.search('div class="meaning"', decodedData)
start = x.start()
end = x.end()
result = decodedData[start:end]
print(result)

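That should do it. If it doesn't work, please post an example word that throws the error. (By the way, this code won't produce the output you want: the pattern matches only the literal text div class="meaning", so that literal text is all that gets printed.)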
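As for the title question, the traceback comes from printing decodedData in IDLE, whose Tk-based shell in that Python version could not display characters above U+FFFF. One possible workaround is to filter those characters out before printing. The following is only a minimal sketch: the helper name strip_non_bmp and its replacement parameter are made up for illustration and are not part of any library.

```python
def strip_non_bmp(text, replacement=""):
    """Return text with every character above U+FFFF swapped for replacement.

    Hypothetical helper for illustration; the Basic Multilingual Plane (BMP)
    covers code points U+0000 through U+FFFF.
    """
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


# U+1F600 is outside the BMP, so it is dropped from the string.
sample = "grinning \U0001F600 face"
cleaned = strip_non_bmp(sample)
print(cleaned)
```

In the code above you would call print(strip_non_bmp(decodedData)) instead of print(decodedData). Passing replacement="\ufffd" keeps a visible placeholder where a character was removed, which may be preferable if you want to see that something was dropped.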