Question

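I get a large data file from an external service in which every line is a JSON object. However, it contains several hex escapes such as \xef, \xa0 and \xa9, as well as some Unicode escapes such as \u2022. I'm basically reading the file like this: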

import json

with open(filename, 'r') as fh:
    for line in fh:
        attr = json.loads(line)

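I tried passing utf-8 and latin-1 as the encoding to open, but json.loads still fails. Loading works if I remove the invalid characters, but I don't want to lose any data. What is the recommended way to fix this?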

Sample repr(line):

'{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'

json.loads fails on the \xf6 in item_name with Invalid \escape: line 1 column 105 (char 104).
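The failure can be reproduced in isolation. This is a minimal sketch assuming Python 3; the two one-field strings are made up for illustration and only differ in escape style. JSON string escapes are limited to \", \\, \/, \b, \f, \n, \r, \t and \uXXXX, so \u2019 parses while \xf6 is rejected.

```python
import json

# Hypothetical one-field lines; only the escape style differs.
bad = '{"manufacturer":"J H P\\xf6lking"}'    # \xf6 is not a valid JSON escape
good = '{"department_name":"Women\\u2019s"}'  # \u2019 is a valid JSON escape

parsed_good = json.loads(good)  # parses without error

try:
    json.loads(bad)
    failed = False
except json.JSONDecodeError as exc:
    # The decoder reports an invalid escape at the position of the backslash
    failed = True
    error_position = exc.pos
```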

Answer 1

You can fix the JSON string with a regular expression, as @Martijn mentioned. Here is a verbose example.

import re
import json

s = '{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'

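# Matches \xNN escapes, which JSON does not allow
xinvalid = re.compile(r'\\x([0-9a-fA-F]{2})')

def fix_xinvalid(m):
    # Replace the escape with the character it denotes, e.g. \xf6 -> ö
    return chr(int(m.group(1), 16))

def fix(s):
    # Apply the replacement to every \xNN escape in the string
    return xinvalid.sub(fix_xinvalid, s)

print(json.loads(fix(s)))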

And the output:

{'recommended_browse_nodes': '361208011', 'bullet_point': [], 'product_description': [], 'brand_name': 'Citygate', 'size_name': '42 EU', 'material_composition': ['Leather'], 'product_type': 'SHOES', 'item_name': ['Citygate  960561 Ankle Boots Womens  Gray Grau (anthrazit 9) Size: 8 (42 EU)'], 'department_name': 'Women’s', 'manufacturer': 'J H Pölking GmbH & Co KG'}
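One caveat with substituting the raw character: an escape such as \x22 (a double quote) or \x5c (a backslash) would produce a string that is no longer valid JSON. A possible variant, sketched here as an assumption rather than as part of the original answer, rewrites each \xNN as the equivalent JSON escape \u00NN and leaves the decoding to json.loads. The names X_ESCAPE and x_to_u and the "note" field are hypothetical. The pattern also skips an escaped backslash followed by a literal x, as in \\x41.

```python
import json
import re

# Match \xNN only when it follows an even number of backslashes, so that
# an escaped backslash followed by "x41" (written \\x41) is left alone.
X_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})')

def x_to_u(line):
    # Rewrite each \xNN escape as the JSON escape \u00NN,
    # keeping any preceding pairs of backslashes unchanged.
    return X_ESCAPE.sub(lambda m: m.group(1) + '\\u00' + m.group(2), line)

# Hypothetical line: "note" is an invented field that contains \x22
sample = '{"manufacturer":"J H P\\xf6lking GmbH & Co KG","note":"5\\x22 heel"}\n'
result = json.loads(x_to_u(sample))
print(result)
```

When reading the file, the same function can be applied to each line before parsing, as in attr = json.loads(x_to_u(line)).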
