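I have a JSON that I convert to a dictionary, and I am trying to build a dataframe from that dictionary. The problem is that it is nested on multiple levels and the data is inconsistent.
For example: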
d = """[
{
"id": 51,
"kits": [
{
"id": 57,
"kit": "KIT1182A",
"items": [
{
"id": 254,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 258,
"product": {
"name": "Separator Sheet",
"short_code": "FSS001",
"priceperunit": 170,
"volumetric_weight": 0.9
},
"quantity": 18
}
],
"quantity": 5
},
{
"id": 58,
"kit": "KIT1182B",
"items": [
{
"id": 259,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 260,
"product": {
"name": "Plastic Sidewall",
"short_code": "PS001",
"priceperunit": 1250,
"volumetric_weight": 16.1
},
"quantity": 5
},
{
"id": 261,
"product": {
"name": "Plastic Lid",
"short_code": "PL001",
"priceperunit": 1250,
"volumetric_weight": 9.7
},
"quantity": 5
}
],
"quantity": 7
}
],
"warehouse": "Yantraksh Logistics Private limited_GGNPC1",
"receiver_client": "Lumax Cornaglia Auto Tech Private Limited",
"transport_by": "Kiran Roadways",
"transaction_type": "Return",
"transaction_date": "2020-08-13T04:34:11.678000Z",
"transaction_no": 1180,
"is_delivered": false,
"driver_name": "__________",
"driver_number": "__________",
"lr_number": 0,
"vehicle_number": "__________",
"freight_charges": 0,
"vehicle_type": "Part Load",
"remarks": "0",
"flow": 36,
"owner": 2
} ]"""
I want to convert it to a dataframe like the one below:
transaction_no is_delivered flow transaction_date receiver_client warehouse kits quantity product1 quantity1 product2 quantity2 product3 quantity3
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182A 5 PP001 5 FSS001 18 NaN NaN
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182B 7 PP001 5 PS001 5 PL001 7.0
Or some better way to display it.
What I did:
import json
import pandas as pd

data = json.loads(d)
result_dataframe = pd.DataFrame(data)
l = ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse','kits'] #fields that I need
result_dataframe = result_dataframe[l]
result_dataframe.to_csv("out.csv")
What I tried:
def flatten(input_dict, separator='_', prefix=''):
    output_dict = {}
    for key, value in input_dict.items():
        if isinstance(value, dict) and value:
            deeper = flatten(value, separator, prefix+key+separator)
            output_dict.update({key2: val2 for key2, val2 in deeper.items()})
        elif isinstance(value, list) and value:
            for index, sublist in enumerate(value, start=1):
                if isinstance(sublist, dict) and sublist:
                    deeper = flatten(sublist, separator, prefix+key+separator+str(index)+separator)
                    output_dict.update({key2: val2 for key2, val2 in deeper.items()})
                else:
                    output_dict[prefix+key+separator+str(index)] = value
        else:
            output_dict[prefix+key] = value
    return output_dict
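But it puts all the values in a single row. How can I split them by kit and get the result above?
Answer 0 (score: 6)
Data transformations like the one above are very common, and pandas provides many tools to help with them.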
import json
import pandas as pd

data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta=['transaction_no', 'is_delivered', 'flow', 'transaction_date', 'receiver_client', 'warehouse'])  # line 1
df = df.explode('items')  # line 2
df[['product_code', 'product_quantity']] = df['items'].apply(lambda x: pd.Series([x['product']['short_code'], x['quantity']]))  # line 3
df = df.drop(columns=['items'])  # line 4
This gives you:
id kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product_code product_quantity
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 FSS001 18
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PS001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PL001 5
The trick is really just pd.json_normalize (line 1). On its own it creates a dataframe that is already very close to what you asked for:
id kit items quantity transaction_no is_delivered flow transaction_date receiver_client warehouse
0 57 KIT1182A [{'id': 254, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 258, 'product': {'name': 'Separator Sheet', 'short_code': 'FSS001', 'priceperunit': 170, 'volumetric_weight': 0.9}, 'quantity': 18}] 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1
1 58 KIT1182B [{'id': 259, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 260, 'product': {'name': 'Plastic Sidewall', 'short_code': 'PS001', 'priceperunit': 1250, 'volumetric_weight': 16.1}, 'quantity': 5}, {'id': 261, 'product': {'name': 'Plastic Lid', 'short_code': 'PL001', 'priceperunit': 1250, 'volumetric_weight': 9.7}, 'quantity': 5}] 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1
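The items column holds all the product data for a kit as a list of dictionaries. It could be expanded into columns in a way similar to line 3, but I strongly advise against that; I explain why below. So line 2 explodes each row according to the number of items in the kit, line 3 extracts product_code and product_quantity, and line 4 drops the raw items column.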
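To make the explode step concrete, here is a minimal sketch on a toy frame. The data and the names toy and long_df are made up for illustration; only the shape (one list of item dictionaries per kit) mirrors the question.

```python
import pandas as pd

# Hypothetical toy data: one row per kit, each holding a list of item dicts.
toy = pd.DataFrame({
    "kit": ["A", "B"],
    "items": [
        [{"product": {"short_code": "PP001"}, "quantity": 5}],
        [{"product": {"short_code": "PS001"}, "quantity": 2},
         {"product": {"short_code": "PL001"}, "quantity": 3}],
    ],
})

# explode() repeats each kit row once per element of its list.
long_df = toy.explode("items").reset_index(drop=True)

# Pull the two fields of interest out of each item dict.
long_df["product_code"] = long_df["items"].map(lambda it: it["product"]["short_code"])
long_df["product_quantity"] = long_df["items"].map(lambda it: it["quantity"])
long_df = long_df.drop(columns=["items"])

print(long_df)
```

Kit A contributes one row and kit B two, so the result has three rows, each with its own product_code and product_quantity.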
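So why shouldn't you have a variable number of columns? You never know in advance how many items a kit will contain, so you end up fiddling around to get at the values in those variable columns. That is even worse than keeping the information in a dictionary.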
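As a rough illustration of why the one-row-per-item layout is easier to work with, the sketch below rebuilds just the kit, product_code and product_quantity columns from the output shown earlier and runs two typical aggregations. The names long_df, totals and items_per_kit are chosen here for illustration.

```python
import pandas as pd

# The item-level columns from the long-format result shown earlier.
long_df = pd.DataFrame({
    "kit": ["KIT1182A", "KIT1182A", "KIT1182B", "KIT1182B", "KIT1182B"],
    "product_code": ["PP001", "FSS001", "PP001", "PS001", "PL001"],
    "product_quantity": [5, 18, 5, 5, 5],
})

# Total quantity per product across all kits: one groupby, no column juggling.
totals = long_df.groupby("product_code")["product_quantity"].sum()

# Number of distinct products in each kit.
items_per_kit = long_df.groupby("kit")["product_code"].nunique()

print(totals)
print(items_per_kit)
```

With product1/quantity1, product2/quantity2, ... columns, the same questions would require looping over an unknown number of column pairs.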
To get the result in exactly the layout you asked for, run the following:
data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'] )
tmp = df['items'].apply(lambda it: [{'product'+str(indx+1):x['product']['short_code'], 'quantity'+str(indx+1):x['quantity']} for indx,x in enumerate(it)])
tmp = tmp.apply(lambda x : {k:el[k] for el in x for k in el})
tmp = pd.DataFrame.from_records(tmp)
df = pd.concat([df, tmp], axis=1)
df = df.drop(columns=['items', 'id'])
With the data you have stored online, the result is:
kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product1 quantity1 product2 quantity2 product3 quantity3 product4 quantity4 product5 quantity5
0 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS002 5 PL001 5 FIN1182A 30 FSS001 18
1 KIT1182B 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS001 5 PL001 5 FIN1182B 20 FSS001 25
2 KIT1151 14 1179 False 1 2020-08-11T04:31:31.245000Z Mahindra & Mahindra_Kandivali Yantraksh Logistics Private limited_GGNPC1 PP001 14 PS001 14 PL001 14 FIN1151A 28 FSS001 42
3 KIT1151 15 1178 False 32 2020-08-10T04:30:12.022000Z Mahindra Vehicle Manufacturers Pune Yantraksh Logistics Private limited_GGNPC1 PP001 15 PS001 15 PL001 15 FIN1151A 29 FSS001 43
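If the long table already exists, the wide layout can also be derived from it instead of from the raw dictionaries. This is only a sketch of one possible approach, not part of the original answer: it numbers the items within each kit with cumcount and then pivots. The column names product and quantity are chosen to match the headers requested in the question, and the helper column n is hypothetical.

```python
import pandas as pd

# Long-format item rows, using the values from the question's sample data.
long_df = pd.DataFrame({
    "kit": ["KIT1182A", "KIT1182A", "KIT1182B", "KIT1182B", "KIT1182B"],
    "product": ["PP001", "FSS001", "PP001", "PS001", "PL001"],
    "quantity": [5, 18, 5, 5, 5],
})

# Number the items inside each kit: 1, 2, 3, ...
long_df["n"] = long_df.groupby("kit").cumcount() + 1

# One row per kit, one column per (field, item number) pair.
wide = long_df.pivot(index="kit", columns="n", values=["product", "quantity"])

# Flatten the two-level column index: ("product", 1) -> "product1".
wide.columns = [f"{name}{n}" for name, n in wide.columns]
wide = wide.reset_index()

print(wide)
```

Kits with fewer items than the largest kit get NaN in the unused columns, which is exactly the irregularity the long format avoids.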
Answer 1 (score: 1)
Using json_normalize you can get this:
import json
import pandas as pd

data = json.loads(d)
df = pd.json_normalize(data,
                       record_path=['kits', 'items'],
                       meta=[
                           ['kits', 'kit'],
                           ['id'],
                           ['kits', 'quantity'],
                           ['warehouse'],
                           ['receiver_client']
                       ],
                       meta_prefix='top')
print(df)
id quantity product.name ... topkits.quantity topwarehouse topreceiver_client
0 254 5 Plastic Pallet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
1 258 18 Separator Sheet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
2 259 5 Plastic Pallet ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
3 260 5 Plastic Sidewall ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
4 261 5 Plastic Lid ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
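The generated column names (product.name, topkits.quantity, ...) can be made easier to work with through the sep and record_prefix parameters of json_normalize. The sketch below uses a trimmed-down, hypothetical version of the question's data, and the item_ prefix is an arbitrary choice.

```python
import pandas as pd

# Trimmed-down sample with the same nesting as the question's JSON.
data = [{
    "transaction_no": 1180,
    "kits": [
        {
            "kit": "KIT1182A",
            "quantity": 5,
            "items": [
                {"quantity": 5, "product": {"short_code": "PP001"}},
                {"quantity": 18, "product": {"short_code": "FSS001"}},
            ],
        },
    ],
}]

df = pd.json_normalize(
    data,
    record_path=["kits", "items"],          # one row per item
    meta=["transaction_no", ["kits", "kit"], ["kits", "quantity"]],
    record_prefix="item_",                  # prefix for item-level columns
    sep="_",                                # join nested keys with "_" instead of "."
)

print(df.columns.tolist())
```

Prefixing the item-level columns keeps the item quantity and the kit quantity apart, and underscore-separated names can be used with attribute access and query().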