Question

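So I'm using BeautifulSoup. It gives me the text of some HTML nodes, but those nodes contain Unicode characters, which are converted to escape sequences in the resulting string.

For example, for an HTML element containing 50 €, BeautifulSoup (via soup.find("h2").text) returns the string 50\u20ac, which is only readable in the Python console. When written to a JSON file it becomes unreadable.

Note: I save to JSON with the following code:

with open('file.json', 'w') as fp:
    json.dump(fileToSave, fp)

How can I convert these Unicode characters back to UTF-8, or to anything that makes them readable again?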

Answer 1

For Python 2.7, I think you can use codecs together with json.dump(obj, fp, ensure_ascii=False). For example:

import codecs
import json

with codecs.open(filename, 'w', encoding='utf-8') as fp:
    # obj is a 'unicode' which contains "50 €"
    json.dump(obj, fp, ensure_ascii=False)

Answer 2

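A small demo using Python 3. If you don't dump to JSON with ensure_ascii=False, non-ASCII characters are written to the JSON as Unicode escape codes. This doesn't affect the ability to load the JSON, but it makes the .json file itself less readable.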

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
...  json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z

Contents of out.json (UTF-8 encoded):

"50€"
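
To make the difference concrete without touching the file system, here is a minimal sketch that compares the two modes using json.dumps. The variable names (text, escaped, readable) are only illustrative and are not part of the original answer.

```python
import json

text = "50\u20ac"  # the same string as in the demo: 50 followed by the euro sign

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(text)

# With ensure_ascii=False the euro sign is kept as a literal character.
readable = json.dumps(text, ensure_ascii=False)

# escaped  holds "50\u20ac" with a literal backslash in the JSON text
# readable holds "50€"

# Both JSON texts decode back to the same Python string.
assert json.loads(escaped) == json.loads(readable) == text
```

So the escaped form is still valid JSON and loses no information; ensure_ascii=False only changes how the file looks when opened in an editor.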

Answer 3

Try the following:

utf8string = <unicodestring>.encode("utf-8")
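
As a rough sketch of what this does in Python 3 (an assumption, since the answer does not state a Python version): .encode("utf-8") returns a bytes object rather than a readable string, and json.dumps serializes str rather than bytes, so the encoding step would have to come after the JSON text is produced. The names below are illustrative only.

```python
import json

text = "50\u20ac"  # 50 followed by the euro sign

# Encoding yields bytes; the euro sign becomes the three bytes E2 82 AC.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"50\xe2\x82\xac"
assert utf8_bytes.decode("utf-8") == text  # round-trips back to the same str

# To get UTF-8 encoded JSON, dump to a str first, then encode the result.
json_bytes = json.dumps(text, ensure_ascii=False).encode("utf-8")
```

Bytes produced this way would be written to a file opened in binary mode ('wb'); opening the file in text mode with encoding='utf-8' and passing the str to json.dump achieves the same result.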

Convert Python-escaped Unicode sequences to UTF-8
