Question

我有一个JSON，我将其转换为字典并尝试使用该字典制作数据框。问题是它是多个嵌套的，并且数据不一致

例如

d = """[
      {
        "id": 51,
        "kits": [
            {
                "id": 57,
                "kit": "KIT1182A",
                "items": [
                    {
                        "id": 254,
                        "product": {
                            "name": "Plastic Pallet",
                            "short_code": "PP001",
                            "priceperunit": 2500,
                            "volumetric_weight": 21.34
                        },
                        "quantity": 5
                    },
                    {
                        "id": 258,
                        "product": {
                            "name": "Separator Sheet",
                            "short_code": "FSS001",
                            "priceperunit": 170,
                            "volumetric_weight": 0.9
                        },
                        "quantity": 18
                    }
                ],
                "quantity": 5
            },                                     #end of kit
            {
                "id": 58,
                "kit": "KIT1182B",
                "items": [
                    {
                        "id": 259,
                        "product": {
                            "name": "Plastic Pallet",
                            "short_code": "PP001",
                            "priceperunit": 2500,
                            "volumetric_weight": 21.34
                        },
                        "quantity": 5
                    },
                    {
                        "id": 260,
                        "product": {
                            "name": "Plastic Sidewall",
                            "short_code": "PS001",
                            "priceperunit": 1250,
                            "volumetric_weight": 16.1
                        },
                        "quantity": 5
                    },
                    {
                        "id": 261,
                        "product": {
                            "name": "Plastic Lid",
                            "short_code": "PL001",
                            "priceperunit": 1250,
                            "volumetric_weight": 9.7
                        },
                        "quantity": 5
                    }
                   
                ],
                "quantity": 7
            }                                    #end of kit
        ],
        "warehouse": "Yantraksh Logistics Private limited_GGNPC1",
        "receiver_client": "Lumax Cornaglia Auto Tech Private Limited",
        "transport_by": "Kiran Roadways",
        "transaction_type": "Return",
        "transaction_date": "2020-08-13T04:34:11.678000Z",
        "transaction_no": 1180,
        "is_delivered": false,
        "driver_name": "__________",
        "driver_number": "__________",
        "lr_number": 0,
        "vehicle_number": "__________",
        "freight_charges": 0,
        "vehicle_type": "Part Load",
        "remarks": "0",
        "flow": 36,
        "owner": 2
    } ]"""

我想将其转换为如下所示的数据框：

transaction_no  is_delivered    flow    transaction_date    receiver_client warehouse   kits    quantity    product1    quantity1   product2    quantity2   product3    quantity3
1180       False    36  2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited   Yantraksh Logistics Private limited_GGNPC1  KIT1182A    5   PP001   5   FSS001  18  NaN NaN
1180       False    36  2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited   Yantraksh Logistics Private limited_GGNPC1  KIT1182B    7   PP001   5   PS001   5   PL001   7.0

或以更好的方式显示它：

我所做的：

data = json.loads(d)
result_dataframe = pd.DataFrame(data)
l = ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse','kits']  #fields that I need
result_dataframe = result_dataframe[l]
result_dataframe.to_csv("out.csv")

我尝试过：

def flatten(input_dict, separator='_', prefix=''):
output_dict = {}
for key, value in input_dict.items():
    if isinstance(value, dict) and value:
        deeper = flatten(value, separator, prefix+key+separator)
        output_dict.update({key2: val2 for key2, val2 in deeper.items()})
    elif isinstance(value, list) and value:
        for index, sublist in enumerate(value, start=1):
            if isinstance(sublist, dict) and sublist:
                deeper = flatten(sublist, separator, prefix+key+separator+str(index)+separator)
                output_dict.update({key2: val2 for key2, val2 in deeper.items()})
            else:
                output_dict[prefix+key+separator+str(index)] = value
    else:
        output_dict[prefix+key] = value
return output_dict

但是它在一行中给出了所有值，如何基于工具包将它们分开并得到结果？

Answer 1

像上面这样的数据转换非常普遍。熊猫提供了许多工具来帮助您完成此任务。

data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'])  # line 1
df = df.explode('items')  # line 2
df[['product_code', 'product_quantity']] = df['items'].apply(lambda x: pd.Series([x['product']['short_code'], x['quantity']]))  # line 3
df.drop(columns=['items'])  # line 4

将为您提供

  id       kit  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse product_code  product_quantity
0  57  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001        5               
0  57  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  FSS001       18              
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001        5               
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PS001        5               
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PL001        5

实际上，技巧仅在pd.json_normalize（第1行）中。它将创建一个非常类似于您所要求的数据框：

   id       kit                                                                                                                                                                                                                                                                                                                                                                                                                             items  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse
0  57  KIT1182A  [{'id': 254, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 258, 'product': {'name': 'Separator Sheet', 'short_code': 'FSS001', 'priceperunit': 170, 'volumetric_weight': 0.9}, 'quantity': 18}]                                                                                                                                          5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1
1  58  KIT1182B  [{'id': 259, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 260, 'product': {'name': 'Plastic Sidewall', 'short_code': 'PS001', 'priceperunit': 1250, 'volumetric_weight': 16.1}, 'quantity': 5}, {'id': 261, 'product': {'name': 'Plastic Lid', 'short_code': 'PL001', 'priceperunit': 1250, 'volumetric_weight': 9.7}, 'quantity': 5}]  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1

第items列在字典中包含产品的所有数据。可以按照与第3行类似的方式扩展它，但是我强烈建议不要这样做。稍后我将解释原因。因此，第2行根据套件中的项目数量爆炸每行。第三行提取prod_name和prod_quantity，最后在最后一行删除原始数据。

那么，为什么不应该有可变数量的表？您将永远不会知道每个套件中有多少物品。您将摆弄以获得这些变量列的值。这甚至比在字典中包含信息还要糟糕。

更新

要以您要求的确切方式获得结果，只需运行以下命令：

data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'] )
tmp = df['items'].apply(lambda it: [{'product'+str(indx+1):x['product']['short_code'], 'quantity'+str(indx+1):x['quantity']} for indx,x in enumerate(it)])
tmp = tmp.apply(lambda x :  {k:el[k] for el in x for k in el})
tmp = pd.DataFrame.from_records(tmp)
df = pd.concat([df, tmp], axis=1)
df = df.drop(columns=['items', 'id'])

根据您在线存储的数据，结果是：

        kit  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse product1  quantity1 product2  quantity2 product3  quantity3  product4  quantity4 product5  quantity5
0  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001    5          PS002    5          PL001    5          FIN1182A  30         FSS001   18       
1  KIT1182B  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001    5          PS001    5          PL001    5          FIN1182B  20         FSS001   25       
2  KIT1151   14        1179           False        1    2020-08-11T04:31:31.245000Z  Mahindra & Mahindra_Kandivali              Yantraksh Logistics Private limited_GGNPC1  PP001    14         PS001    14         PL001    14         FIN1151A  28         FSS001   42       
3  KIT1151   15        1178           False        32   2020-08-10T04:30:12.022000Z  Mahindra Vehicle Manufacturers Pune        Yantraksh Logistics Private limited_GGNPC1  PP001    15         PS001    15         PL001    15         FIN1151A  29         FSS001   43

Answer 2

使用json_normalize可以得到此信息：

data = json.loads(d)
df = pd.json_normalize(data,
                       record_path=['kits', 'items'],
                       meta=[
                         ['kits', 'kit'], 
                         ['id'], 
                         ['kits', 'quantity'],
                         ['warehouse'],
                         ['receiver_client']
                      ],
                       meta_prefix='top')
print(df)


    id  quantity      product.name  ... topkits.quantity                                topwarehouse                         topreceiver_client
0  254         5    Plastic Pallet  ...                5  Yantraksh Logistics Private limited_GGNPC1  Lumax Cornaglia Auto Tech Private Limited
1  258        18   Separator Sheet  ...                5  Yantraksh Logistics Private limited_GGNPC1  Lumax Cornaglia Auto Tech Private Limited
2  259         5    Plastic Pallet  ...                7  Yantraksh Logistics Private limited_GGNPC1  Lumax Cornaglia Auto Tech Private Limited
3  260         5  Plastic Sidewall  ...                7  Yantraksh Logistics Private limited_GGNPC1  Lumax Cornaglia Auto Tech Private Limited
4  261         5       Plastic Lid  ...                7  Yantraksh Logistics Private limited_GGNPC1  Lumax Cornaglia Auto Tech Private Limited

展平嵌套字典并将其转换为列

2 个答案:

更新