JSON load fails on invalid escape characters

Time: 2016-06-26 14:46:55

Tags: python json python-3.x

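I receive a large data file from an external service in which each line is a JSON object. However, it contains a number of hex escapes such as \xef, \xa0 and \xa9, as well as some Unicode escapes such as \u2022. I am basically reading the file like this: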

with open(filename,'r') as fh:
    for line in fh:
        attr = json.loads(line)

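I tried passing utf-8 and latin-1 as the encoding to open(), but json.loads still fails. If I remove the invalid characters, loading works, but I don't want to lose any data. What is the recommended way to fix this?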

Sample repr(line):

'{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'

json.loads fails at the \xf6 in item_name with Invalid \escape: line 1 column 105 (char 104).
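
The failure can be reproduced without the full record. JSON string escapes are limited to \", \\, \/, \b, \f, \n, \r, \t and \uXXXX, so \xNN is rejected while the equivalent \u00NN is accepted. A minimal sketch, assuming Python 3.5 or later (where json.JSONDecodeError exists); the names bad, good, error and parsed are illustrative only:

```python
import json

# Raw strings: each contains a literal backslash, exactly as in the file.
bad = r'{"manufacturer":"J H P\xf6lking"}'     # \x is not a JSON escape
good = r'{"manufacturer":"J H P\u00f6lking"}'  # same character as a valid JSON escape

try:
    json.loads(bad)
    error = None
except json.JSONDecodeError as exc:
    error = exc  # the parser rejects the \x sequence

parsed = json.loads(good)
print(error)
print(parsed["manufacturer"])
```

Because the \x sequences are plain ASCII text in the file (a backslash, an x and two hex digits), changing the encoding argument of open() has no effect on them.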

1 answer:

Answer 0: (score: 0)

You can repair the JSON string with a regular expression, as @Martijn mentioned. Here is a verbose example.

import re
import json

s = '{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'

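# Match a literal \xNN sequence (backslash, 'x', two hex digits), which JSON does not allow.
xinvalid = re.compile(r'\\x([0-9a-fA-F]{2})')

def fix_xinvalid(m):
    # Replace the escape with the character it encodes, e.g. \xf6 -> 'ö'.
    return chr(int(m.group(1), 16))

def fix(s):
    # Apply the replacement to every \xNN sequence in the string.
    return xinvalid.sub(fix_xinvalid, s)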

print(json.loads(fix(s)))

Output:

{'recommended_browse_nodes': '361208011', 'bullet_point': [], 'product_description': [], 'brand_name': 'Citygate', 'size_name': '42 EU', 'material_composition': ['Leather'], 'product_type': 'SHOES', 'item_name': ['Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)'], 'department_name': 'Women’s', 'manufacturer': 'J H Pölking GmbH & Co KG'}
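
One caveat with substituting the decoded character directly: if the file ever contains \x22 (a double quote) or \x5c (a backslash), the result is no longer valid JSON, and a sequence such as \\x41 (an escaped backslash followed by the text x41) would be rewritten by mistake. A possible variant, offered as a sketch rather than a definitive fix, rewrites \xNN as the equivalent \u00NN escape and lets json.loads do the decoding. The names escape_pattern, to_unicode_escape, loads_lenient and sample_lines are hypothetical, and the sketch assumes each \xNN stands for the code point U+00NN (the same assumption chr() makes above).

```python
import json
import re

# Match any backslash pair so that valid escapes (including \\) are consumed
# as a unit; only the \xNN form is captured for rewriting.
escape_pattern = re.compile(r'\\(?:x([0-9a-fA-F]{2})|.)')

def to_unicode_escape(m):
    hex_digits = m.group(1)
    if hex_digits is None:
        return m.group(0)          # some other escape: leave it untouched
    return '\\u00' + hex_digits    # \xf6 becomes \u00f6, which JSON accepts

def loads_lenient(line):
    # Hypothetical helper: repair \xNN escapes, then parse as usual.
    return json.loads(escape_pattern.sub(to_unicode_escape, line))

# Illustrative inputs (raw strings, so the backslashes are literal).
sample_lines = [
    r'{"manufacturer":"J H P\xf6lking GmbH & Co KG"}',
    r'{"note":"say \x22hi\x22"}',
    r'{"path":"C:\\x41"}',
]
records = [loads_lenient(line) for line in sample_lines]
print(records)
```

In the reading loop from the question, this would mean calling loads_lenient(line) in place of json.loads(line).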