Question

I have a JSON file that contains text like this:

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

My simple question is: how do I convert (not remove) these \u codes into spaces, apostrophes, etc.?

Input: a text file containing .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to the line break)At the moment you can't use our ...

Python code:

def TEST():
        export= requests.get('https://sample.uk/', auth=('user', 'pass')).text

        with open("TEST.json",'w') as file:
            file.write(export.decode('utf8'))

What I have tried:

Using .json()
Various combinations of .encode().decode(), etc.

Edit 1

When I upload this file to BigQuery, I get the Â symbol.

Larger sample:

{
    "xxxx1": "...You don\u2019t nee...",
    "xxxx2": "...Gu\u00e9rer...",
    "xxxx3": "...boost.\u00a0Sit back an....",
    "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
    "xxxx5": "\u00a0\n\u00a0",
    "xxxx6": "It was Christmas Eve babe\u2026",
    "xxxx7": "It\u2019s xxx xxx\u2026"
}

Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w") as file:
        json.dump(x,file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)



if __name__ == '__main__':
    load()

Answer 1

A hacky approach is to strip the outer layer of encoding:

import re
# Assume export is a bytes-like object
# A raw bytes pattern avoids the invalid "\u" escape warning in a plain bytes literal
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)

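This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without any further decoding!) yields a valid UTF-8 JSON file.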

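This will of course break if the file contains genuine escaped Unicode characters in the range the pattern matches (\u0080 to \u00ff), for example \u00e9 for an accented "e".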
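As a minimal sketch of both behaviours described above, the snippet below applies the same substitution to two made-up byte strings. The helper name unescape_high_bytes and both sample inputs are assumptions for illustration, not part of the original answer.

```python
import json
import re

# Matches one escaped byte in the range 0x80-0xff, e.g. \u00c2 or \u00a0.
ESCAPED_HIGH_BYTE = re.compile(rb'\\u00([89a-f][0-9a-f])', re.IGNORECASE)


def unescape_high_bytes(raw: bytes) -> bytes:
    """Replace each \\u00XX escape (XX >= 80) with the single raw byte 0xXX."""
    return ESCAPED_HIGH_BYTE.sub(
        lambda m: bytes.fromhex(m.group(1).decode('ascii')), raw
    )


# Hypothetical mangled input: the UTF-8 bytes of U+00A0 written as two escapes.
mangled = rb'{"a": "and voila!\u00c2\u00a0At the moment"}'
fixed = unescape_high_bytes(mangled)
data = json.loads(fixed.decode('utf-8'))

# Hypothetical genuine escape: \u00e9 really means "é" here, so turning it
# into the lone byte 0xe9 produces invalid UTF-8.
genuine = rb'{"b": "Gu\u00e9rer"}'
try:
    unescape_high_bytes(genuine).decode('utf-8')
    genuine_ok = True
except UnicodeDecodeError:
    genuine_ok = False
```

Here data["a"] holds a single real no-break space, while the second input no longer decodes as UTF-8, which is the breakage the answer warns about.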

Answer 2

I put together this crude UTF-8 unmangler, which appears to fix your messed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        # Not mangled UTF-8 (e.g. a genuine \u00e9): keep the original escapes.
        print("Could not decode buffer: %s" % buffer)
        return escaped

Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

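It uses a regular expression to pick the hex sequences out of the string, converts them to individual bytes, and decodes those bytes as UTF-8.

For the sample string above (I have included the 3-byte character € as a test), this prints: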

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in "Parsed data" is caused by the way Python outputs dicts to the console; it is still the actual non-breaking space.
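A small sketch of that display difference, using a made-up string: printing a dict shows the repr() of each value, and repr() escapes U+00A0 as \xa0 even though the string holds just one character there.

```python
text = "voila!\u00a0At"        # one real NO-BREAK SPACE between "!" and "A"
data = {"some_key": text}

print(data)                    # {'some_key': 'voila!\xa0At'}
shown = repr(text)             # the escaped form used in the dict display
print(len(text))               # 9: the no-break space counts as one character
```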

Answer 3

As you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.

Let me use a full example:

import json

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

I first load it with json:

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

OK, this looks like a UTF-8 string that was wrongly decoded as Latin1. Let's do the reverse operation:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our

OK, it is fine now, and we can convert it back to a correct JSON string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

which represents a correctly encoded NO-BREAK SPACE (U+00A0).
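To see why the Latin1 round trip works, the damage can be reproduced on purpose. This is only a sketch with a made-up string, assuming the text was mangled exactly once in the way described above.

```python
original = "voila!\u00a0At"

# Simulate the damage: the UTF-8 bytes are read as if they were Latin1.
mangled = original.encode("utf8").decode("latin1")

# Undo it by running the same two steps in reverse.
restored = mangled.encode("latin1").decode("utf8")
```

The intermediate value mangled contains Â followed by the no-break space, which is the Â symbol reported in the question.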

TL;DR: what you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
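One possible way to fill in the "identify the string values" step is to walk the parsed structure recursively. The helper name fix_mojibake and the sample input are assumptions for illustration. The sketch also assumes that a string which cannot make the Latin1/UTF-8 round trip was never mangled, so it is left unchanged.

```python
import json


def fix_mojibake(value):
    """Recursively undo UTF-8-read-as-Latin1 damage in a parsed JSON value."""
    if isinstance(value, str):
        try:
            return value.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Genuine text such as "é" or "’" fails the round trip: keep it.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        # Only values are repaired; keys are passed through untouched.
        return {key: fix_mojibake(item) for key, item in value.items()}
    return value


# Hypothetical input mixing one mangled value with genuine escapes.
raw = r'{"a": "voila!\u00c2\u00a0At", "b": "Gu\u00e9rer", "c": ["don\u2019t", 1]}'
fixed = fix_mojibake(json.loads(raw))
request = json.dumps(fixed)
```

A string that mixes mangled and genuine characters fails the round trip as a whole and stays as it was, so this only helps when each value is either fully mangled or fully clean.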

File contains \u00c2\u00a0, convert to characters
