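I get a large data file from an external service in which every line is a JSON object. However, it contains a number of hex escapes such as (\xef, \xa0, \xa9), along with some Unicode escapes such as (\u2022). I am basically reading the file like this: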
import json

with open(filename, 'r') as fh:
    for line in fh:
        attr = json.loads(line)
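I tried passing both utf-8 and latin-1 as the encoding to open, but json.loads still fails. If I strip the invalid characters, loading works, but I don't want to lose any data. What is the recommended way to fix this?
Sample repr(line):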
'{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate 960561 Ankle Boots Womens Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'
json.loads fails at the \xf6 in item_name with Invalid \escape: line 1 column 105 (char 104).
Answer 0 (score: 0)
You can repair the JSON string with a regular expression, as @Martijn mentioned. Here is a verbose example.
import re
import json
s = '{"product_type":"SHOES","recommended_browse_nodes":"361208011","item_name":["Citygate 960561 Ankle Boots Womens Gray Grau (anthrazit 9) Size: 8 (42 EU)"],"product_description":[],"brand_name":"Citygate","manufacturer":"J H P\\xf6lking GmbH & Co KG","bullet_point":[],"department_name":"Women\\u2019s","size_name":"42 EU","material_composition":["Leather"]}\n'
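# Matches a literal \xNN sequence, which JSON does not accept as an escape.
xinvalid = re.compile(r'\\x([0-9a-fA-F]{2})')

def fix_xinvalid(m):
    # Replace the escape with the character whose code point is 0xNN.
    return chr(int(m.group(1), 16))

def fix(s):
    return xinvalid.sub(fix_xinvalid, s)

print(json.loads(fix(s)))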
Output:
{'recommended_browse_nodes': '361208011', 'bullet_point': [], 'product_description': [], 'brand_name': 'Citygate', 'size_name': '42 EU', 'material_composition': ['Leather'], 'product_type': 'SHOES', 'item_name': ['Citygate 960561 Ankle Boots Womens Gray Grau (anthrazit 9) Size: 8 (42 EU)'], 'department_name': 'Women’s', 'manufacturer': 'J H Pölking GmbH & Co KG'}
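One caveat with substituting the raw character: if the file ever contains something like \x22 (a double quote) or \x0a (a newline), the substituted character itself would break the JSON, and an already-escaped backslash followed by x41 would be rewritten by mistake. A possible variant, sketched below, rewrites each \xNN as the JSON-legal \u00NN and skips sequences whose backslash is escaped. The names fix_x_escapes and parse_lines are made up for this sketch, and it assumes, as the code above does, that \xNN stands for code point U+00NN rather than for a raw UTF-8 byte.

```python
import json
import re

# A run of backslashes followed by xNN. The length of the run tells us
# whether the final backslash is itself escaped.
_X_ESCAPE = re.compile(r'(\\+)x([0-9a-fA-F]{2})')


def _to_unicode_escape(m):
    slashes, hex_digits = m.group(1), m.group(2)
    if len(slashes) % 2 == 0:
        # Even count: every backslash is escaped, so "xNN" is plain text.
        return m.group(0)
    # Odd count: the last backslash starts an invalid \xNN escape.
    # Rewrite it as \u00NN, which JSON accepts.
    return slashes + 'u00' + hex_digits


def fix_x_escapes(line):
    """Return line with every \\xNN escape rewritten as \\u00NN."""
    return _X_ESCAPE.sub(_to_unicode_escape, line)


def parse_lines(lines):
    """Yield one parsed JSON object per input line (hypothetical helper)."""
    for line in lines:
        yield json.loads(fix_x_escapes(line))


# Usage with the file from the question (filename is assumed to exist):
# with open(filename, 'r', encoding='utf-8') as fh:
#     for attr in parse_lines(fh):
#         ...
```

Because the JSON parser does the decoding, quotes and control characters written as \xNN come through as data instead of corrupting the line.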