File contains \u00c2\u00a0, convert to characters

Time: 2019-07-09 14:53:34

Tags: python python-3.x python-2.7 decode encode

I have a JSON file that contains text like this:

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

My simple question is: how do I convert (not remove) these \u codes into spaces, apostrophes, etc.?

Input: a text file containing .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to the line break)At the moment you can't use our ...

Python code:

def TEST():
        export= requests.get('https://sample.uk/', auth=('user', 'pass')).text

        with open("TEST.json",'w') as file:
            file.write(export.decode('utf8'))

Things I have tried:

  • Using .json()
  • Every different way of combining .encode().decode(), etc.

Edit 1

When I upload this file to BigQuery, I get the Â symbol.

Larger sample:

{
    "xxxx1": "...You don\u2019t nee...",
    "xxxx2": "...Gu\u00e9rer...",
    "xxxx3": "...boost.\u00a0Sit back an....",
    "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
    "xxxx5": "\u00a0\n\u00a0",
    "xxxx6": "It was Christmas Eve babe\u2026",
    "xxxx7": "It\u2019s xxx xxx\u2026"
}

Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w") as file:
        json.dump(x,file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)



if __name__ == '__main__':
    load()

3 answers:

Answer 0: (score: 2)

A hacky approach is to strip the outer layer of encoding:

import re
# Assume export is a bytes-like object
# Raw bytes pattern: a literal backslash, "u00", then one hex byte in the range 80-ff
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)

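This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) yields a valid UTF-8 JSON file.

Of course, this breaks if the file also contains genuine escaped Unicode characters in the \u0080 to \u00ff range (for example \u00e9 for an accented "e").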
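As a minimal sketch of both cases, assuming two small hand-made byte strings (the `unescape_utf8_bytes` helper name and both sample payloads are made up for illustration), the substitution repairs a double-encoded no-break space but leaves invalid UTF-8 behind when it meets a genuine \u00e9:

```python
import re

def unescape_utf8_bytes(data):
    """Hypothetical helper: turn \\u00XX escapes (XX in 80-ff) into raw bytes."""
    return re.sub(
        rb'\\u00([89a-f][0-9a-f])',
        lambda m: bytes.fromhex(m.group(1).decode()),
        data,
        flags=re.IGNORECASE,
    )

# Double-encoded no-break space: the two escapes stand for the UTF-8 bytes C2 A0.
mangled = rb'{"a": "voila!\u00c2\u00a0At"}'
fixed = unescape_utf8_bytes(mangled)
print(ascii(fixed.decode('utf8')))

# A genuine escape for U+00E9 becomes the lone byte E9, which is not valid UTF-8.
genuine = rb'{"b": "Gu\u00e9rer"}'
broken = unescape_utf8_bytes(genuine)
try:
    broken.decode('utf8')
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```

If the data mixes both kinds of escapes, as the larger sample in the question does, this byte-level shortcut is probably not safe on its own.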

Answer 1: (score: 2)

I put together this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

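    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. a genuine \u00e9): keep the original escapes
        # instead of implicitly returning None, which would drop the matched text.
        print("Could not decode buffer: %s" % buffer)
        return escaped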

Usage:

broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

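It uses a regular expression to pick up the hex sequences from the string, converts them to individual bytes, and decodes them as UTF-8.

For the sample string above (I have included a 3-byte character, €, as a test), it prints: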

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in "Parsed data" is caused by the way Python prints dicts to the console; it is still the actual non-breaking space.
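A quick way to check this, sketched with a made-up string: printing a dict shows the `repr()` of its values, which escapes the no-break space, while the character stored in the string is still U+00A0.

```python
import unicodedata

text = "voila!\u00a0At"            # made-up sample containing one no-break space

print(repr(text))                  # repr() escapes the character as \xa0
print(text)                        # printing the str writes the character itself
print(unicodedata.name(text[6]))   # name of the character at index 6
```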

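Answer 2: (score: 1)

Since you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.

Let me use a full example: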

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

Let me first load it with json:

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

OK, this looks like a UTF-8 string that was wrongly decoded as Latin1. Let us do the reverse operation:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our
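To see why the reverse operation works, here is a small sketch of how the damage probably arose, using a made-up string: the UTF-8 bytes of a no-break space (C2 A0) were read as Latin1, where each byte becomes its own character.

```python
original = "voila!\u00a0At"      # made-up text with one no-break space

# Mis-decoding the UTF-8 bytes as Latin1 turns C2 A0 into two characters.
mangled = original.encode('utf8').decode('latin1')
print(ascii(mangled))            # 'voila!\xc2\xa0At'

# Reversing the two steps recovers the original text.
repaired = mangled.encode('latin1').decode('utf8')
print(repaired == original)      # True
```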

OK, it is fine now, and we can convert it back to a proper JSON string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

which represents a correctly encoded NO-BREAK SPACE (U+00A0).

TL;DR: what you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
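One possible way to fill in the "identify the string values" step is to walk the parsed structure recursively. This is only a sketch under assumptions: `fix_text`, `fix_strings` and the `raw` sample are hypothetical names and data, and any string that cannot be round-tripped (a genuine \u00e9 or \u2019, as in the larger sample from the question) is deliberately left unchanged.

```python
import json

def fix_text(value):
    """Undo the Latin1 mis-decoding; return the value unchanged if it does not apply."""
    try:
        return value.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value

def fix_strings(node):
    """Apply fix_text to every string value in nested dicts and lists."""
    if isinstance(node, dict):
        return {key: fix_strings(val) for key, val in node.items()}
    if isinstance(node, list):
        return [fix_strings(item) for item in node]
    if isinstance(node, str):
        return fix_text(node)
    return node

# Made-up payload mixing a mangled value with correctly escaped ones.
raw = '{"a": "voila!\\u00c2\\u00a0At", "b": ["Gu\\u00e9rer", "don\\u2019t"], "n": 1}'
data = fix_strings(json.loads(raw))
print(json.dumps(data))
```

Note that a single string containing both mangled and genuine non-ASCII characters is left untouched as a whole, and dict keys are not converted, so real data may need a more careful strategy.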