Question

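I am parsing some JSON (specifically the Amazon review files that Amazon makes publicly available). I parse the file line by line, convert it to a Pandas DataFrame, and insert it into SQL on the fly. I noticed something strange. I open the JSON file as UTF-8, and when I open the file in Notepad I don't see any odd symbols or anything like that. For example, here is a substring of a review: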

The temperature control doesn’t hold to as tight a temperature as some of the others reported.

But when I parse it and inspect the contents of the string:

The temperature control doesn\xe2\x80\x99t hold to as tight a temperature as some of the others reported.

Why does this happen, and how can I read it correctly?

My current code is as follows:

def parseJSON(path):
  g = io.open(path,'r',encoding='utf8')
  for l in g:
      yield eval(l)



for l in parseJSON(r"reviews.json"):
    for review in l["reviews"]:
        df = {}
        df[l["url"]] = review["review"]
        dfInsert = pd.DataFrame( list(df.items()), columns = ["url", "Review"])

A subset of the file that fails is here: http://www.filedropper.com/subset

Answer 1

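First of all, you should not use eval to parse text from an untrusted (online) source. If the data is JSON, use a JSON parser. That is why JSON was invented: to provide safe serialization and deserialization.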
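As a minimal sketch of the difference in Python 3 (the sample line is made up for illustration and only mirrors the field names used in the question), json.loads maps JSON-only literals such as true and null to Python values and keeps the curly apostrophe intact, whereas eval cannot even evaluate the same text:

```python
import json

# Hypothetical sample line, not taken from the real review file.
line = '{"url": "http://example.com/item", "verified": true, "reviews": [{"review": "doesn’t hold", "rating": null}]}'

record = json.loads(line)
review_text = record["reviews"][0]["review"]

# JSON true/null become Python True/None.
print(record["verified"], record["reviews"][0]["rating"])
print(repr(review_text))

# eval treats the text as Python source, where `true` is an undefined name.
# (Only done here because the literal above is trusted.)
try:
    eval(line)
    eval_error = None
except NameError as exc:
    eval_error = exc
print("eval failed:", eval_error)
```

Beyond failing on valid JSON, eval would also execute any Python expression embedded in a line, which json.loads never does.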

In your case, use json.load() from the standard json module:

import io
import json

def parseJSON(path):
    # utf-8-sig strips a leading BOM if one is present
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        return json.load(f)

Because your JSON file contains a BOM, you should use a codec that knows how to strip it, namely utf-8-sig.
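To see what the codec choice changes, here is a small Python 3 sketch that works on in-memory bytes rather than the real file (the sample payload is an assumption). It decodes the same BOM-prefixed bytes with both codecs:

```python
import codecs
import json

# Made-up payload with a UTF-8 BOM prepended, standing in for the file contents.
raw = codecs.BOM_UTF8 + '{"url": "http://example.com/a"}'.encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as the character U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped during decoding

print(repr(with_bom[0]))
print(repr(without_bom[0]))

# The json module refuses text that still starts with the BOM character.
try:
    json.loads(with_bom)
    bom_error = None
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    bom_error = exc
print("json.loads on BOM-prefixed text:", bom_error)

record = json.loads(without_bom)
```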

If your file contains one JSON object per line, you can read it like this:

def parseJSON(path):
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            yield json.loads(line)
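Building on that, the loop from the question could be flattened into one row per review. This is only a sketch in Python 3: the two sample lines are invented, iter_rows is a hypothetical helper name, and the "url", "reviews" and "review" keys are taken from the question's code rather than from the actual file.

```python
import io
import json

# Made-up two-line sample standing in for the real file.
sample = io.StringIO(
    '{"url": "http://example.com/a", "reviews": [{"review": "doesn’t hold"}, {"review": "works fine"}]}\n'
    '{"url": "http://example.com/b", "reviews": [{"review": "too loud"}]}\n'
)

def iter_rows(lines):
    """Yield one (url, review) tuple per review, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for review in record["reviews"]:
            yield record["url"], review["review"]

rows = list(iter_rows(sample))
print(rows)
```

If pandas is installed, the resulting list can be passed to pd.DataFrame(rows, columns=["url", "Review"]) in a single call instead of building a one-row DataFrame per review.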

Now to answer why you see doesn\xe2\x80\x99t instead of doesn’t. If you decode the bytes \xe2\x80\x99 as UTF-8, you get:

>>> '\xe2\x80\x99'.decode('utf8')
u'\u2019'

Which Unicode code point is that?

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
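The session above is Python 2. For reference, a Python 3 sketch of the same check, where bytes and text are separate types and the decode step is therefore explicit:

```python
import unicodedata

raw = b"\xe2\x80\x99"       # the three bytes seen in the parsed string
char = raw.decode("utf-8")  # one Unicode character
name = unicodedata.name(char)

print(len(raw), len(char))
print(hex(ord(char)), name)
```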

OK, so what happens when you eval() it in Python 2? Well, first, note that Unicode is not a first-class citizen in Python 2 strings (Python 3 fixes this).

So eval tries to parse the string (a sequence of bytes in Python 2) as a Python expression:

>>> eval('"’"')
'\xe2\x80\x99'

Note that (in my console, which uses UTF-8) even though I typed ’, it is represented as a sequence of 3 bytes.

It doesn't even help to say that it should be unicode:

>>> eval('u"’"')
u'\xe2\x80\x99'

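What does help is telling Python how to interpret the sequence of bytes that follows in the source/string, i.e. which encoding it is in (see PEP-263):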

>>> eval('# encoding: utf-8\nu"’"')
u'\u2019'
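The u'\xe2\x80\x99' result from the eval without an encoding declaration is consistent with each UTF-8 byte having been read as its own code point, which is what a Latin-1 decode would do. A Python 3 sketch of that misreading, and of how to undo it if such strings have already been stored (this illustrates the effect and makes no claim about the interpreter's internals):

```python
original = "doesn\u2019t"

utf8_bytes = original.encode("utf-8")   # the apostrophe becomes 3 bytes
misread = utf8_bytes.decode("latin-1")  # each byte turned into one character

print(repr(misread))

# Reversing the two steps recovers the intended text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repr(repaired))
```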

