How do I create a given JSON format from a pandas DataFrame?

Time: 2020-03-18 09:37:27

Tags: python json pandas csv dictionary

The data looks like this (posted as an image in the original question):

The desired JSON format looks like this:

    {
        "DataExtractName": "SalesDataExtract",
        "BusinessName": {
            "InvoiceDate": {
                "SourceSystem": {
                    "MYSQL": "Invc_Dt",
                    "CSV": "Invc_Date"
                },
                "DataType": {
                    "MYSQL": "varchar",
                    "CSV": "string"
                }
            },
            "Description": {
                "SourceSystem": {
                    "MYSQL": "Prod_Desc",
                    "CSV": "Prod_Descr"
                },
                "DataType": {
                    "MYSQL": "varchar",
                    "CSV": "string"
                }
            }
        }
    },
    {
        "DataExtractName": "DateDataExtract",
        "BusinessName": {
            "InvoiceDate": {
                "SourceSystem": {
                    "MYSQL": "Date"
                },
                "DataType": {
                    "MYSQL": "varchar"
                }
            }
        }
    }

How can I achieve this with a pandas DataFrame? Or do I need to write a script to build data like this?

Note

I tried using:

  1. df.to_json
  2. df.to_dict
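
For context, here is a minimal sketch of why these two calls alone fall short. The column names are an assumption (the data was only posted as an image); they match the ones used in the answer. Both calls return one flat record per row, with none of the nesting the desired format needs.

```python
import pandas as pd

# Two sample rows; the column names are assumed, not taken from the image.
df = pd.DataFrame(
    [
        ['InvoiceDate', 'MYSQL', 'Invc_Dt', 'varchar', 'SalesDataExtract'],
        ['InvoiceDate', 'CSV', 'Invc_Date', 'string', 'SalesDataExtract'],
    ],
    columns=['BusinessName', 'SourceSystem', 'FunctionalName', 'DataType', 'DataExtractName'],
)

# One flat dictionary per row: no grouping by extract or business name.
records = df.to_dict(orient='records')
print(records[0])

# to_json with the same orient serializes those flat rows as a JSON string.
flat_json = df.to_json(orient='records')
print(flat_json)
```

Getting from these flat rows to the nested layout takes an explicit grouping step, which is what the answer addresses.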

1 answer:

Answer 0: (score: 2)

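With this much nesting, you should use marshmallow. It was built with use cases like yours in mind. Take a look at the excellent documentation (https://marshmallow.readthedocs.io/en/stable/); the basic usage is all you need.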

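It is a lot of code, but explicit is better than clever. I am sure a shorter solution exists, but it would probably be hard to maintain. I also had to build your DataFrame myself; next time, please provide it in a data format.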

import pandas as pd
import marshmallow as ma
from pprint import pprint  # marshmallow's own pprint helper is deprecated since 3.7

# build test data
df = pd.DataFrame.from_records([
    ['InvoiceDate', 'MYSQL', 'Invc_Dt', 'varchar', 'SalesDataExtract'],
    ['InvoiceDate', 'CSV', 'Invc_Date', 'string', 'SalesDataExtract'],
    ['Description', 'MYSQL', 'Prod_Desc', 'varchar', 'SalesDataExtract'],
    ['Description', 'CSV', 'Prod_Descr', 'string', 'SalesDataExtract'],
    ['InvoiceDate', 'MYSQL', 'Date', 'varchar', 'DateDataExtract'],
])
df.columns = ['BusinessName', 'SourceSystem', 'FunctionalName', 'DataType', 'DataExtractName']


# define marshmallow schemas
class SourceSystemTypeSchema(ma.Schema):
    MYSQL = ma.fields.String()
    CSV = ma.fields.String()

class DataTypeSchema(ma.Schema):
    MYSQL = ma.fields.String()
    CSV = ma.fields.String()

class InvoiceDateSchema(ma.Schema):
    # the field must be called SourceSystem, otherwise it is dropped on dump
    SourceSystem = ma.fields.Nested(SourceSystemTypeSchema())
    DataType = ma.fields.Nested(DataTypeSchema())

class DescriptionSchema(ma.Schema):
    SourceSystem = ma.fields.Nested(SourceSystemTypeSchema())
    DataType = ma.fields.Nested(DataTypeSchema())

class BusinessNameSchema(ma.Schema):
    InvoiceDate = ma.fields.Nested(InvoiceDateSchema())
    Description = ma.fields.Nested(DescriptionSchema())

class DataSchema(ma.Schema):
    DataExtractName = ma.fields.String()
    BusinessName = ma.fields.Nested(BusinessNameSchema())

# building json
result = []

mask_business_name_invoicedate = df.BusinessName == 'InvoiceDate'
mask_business_name_description = df.BusinessName == 'Description'

# unique() keeps the order of first appearance, which set() does not guarantee
for data_extract_name in df['DataExtractName'].unique():
    mask_data_extract_name = df.DataExtractName == data_extract_name

    # you need these two helper dictionaries, keyed by column name and then by SourceSystem
    dict_invoice_date = df[mask_data_extract_name & mask_business_name_invoicedate].set_index('SourceSystem').to_dict(orient='dict')
    dict_description = df[mask_data_extract_name & mask_business_name_description].set_index('SourceSystem').to_dict(orient='dict')

    # all dictionaries are defined, so you can use your schemas;
    # a business name without rows in this extract is left out
    business_name = {}
    if dict_invoice_date['FunctionalName']:
        business_name['InvoiceDate'] = InvoiceDateSchema().dump({
            'SourceSystem': dict_invoice_date['FunctionalName'],
            'DataType': dict_invoice_date['DataType'],
        })
    if dict_description['FunctionalName']:
        business_name['Description'] = DescriptionSchema().dump({
            'SourceSystem': dict_description['FunctionalName'],
            'DataType': dict_description['DataType'],
        })
    data = DataSchema().dump({'DataExtractName': data_extract_name, 'BusinessName': business_name})

    # end result
    result.append(data)

Now,

pprint(result)

returns

[{'BusinessName': {'Description': {'DataType': {'CSV': 'string',
                                                'MYSQL': 'varchar'},
                                   'SourceSystem': {'CSV': 'Prod_Descr',
                                                    'MYSQL': 'Prod_Desc'}},
                   'InvoiceDate': {'DataType': {'CSV': 'string',
                                                'MYSQL': 'varchar'},
                                   'SourceSystem': {'CSV': 'Invc_Date',
                                                    'MYSQL': 'Invc_Dt'}}},
  'DataExtractName': 'SalesDataExtract'},
 {'BusinessName': {'InvoiceDate': {'DataType': {'MYSQL': 'varchar'},
                                   'SourceSystem': {'MYSQL': 'Date'}}},
  'DataExtractName': 'DateDataExtract'}]
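
The answer notes that a shorter solution probably exists. One possible sketch, using only pandas `groupby` and the standard `json` module instead of marshmallow, is given here. It is not the answerer's method; the column names are the ones assumed in the test data, and the values follow the desired JSON from the question.

```python
import json

import pandas as pd

# Same assumed test data as in the answer.
df = pd.DataFrame(
    [
        ['InvoiceDate', 'MYSQL', 'Invc_Dt', 'varchar', 'SalesDataExtract'],
        ['InvoiceDate', 'CSV', 'Invc_Date', 'string', 'SalesDataExtract'],
        ['Description', 'MYSQL', 'Prod_Desc', 'varchar', 'SalesDataExtract'],
        ['Description', 'CSV', 'Prod_Descr', 'string', 'SalesDataExtract'],
        ['InvoiceDate', 'MYSQL', 'Date', 'varchar', 'DateDataExtract'],
    ],
    columns=['BusinessName', 'SourceSystem', 'FunctionalName', 'DataType', 'DataExtractName'],
)

result = []
# sort=False keeps the groups in order of first appearance.
for extract_name, extract_rows in df.groupby('DataExtractName', sort=False):
    business = {}
    for business_name, rows in extract_rows.groupby('BusinessName', sort=False):
        # Index by source system so each column becomes {system: value}.
        by_system = rows.set_index('SourceSystem')
        business[business_name] = {
            'SourceSystem': by_system['FunctionalName'].to_dict(),
            'DataType': by_system['DataType'].to_dict(),
        }
    result.append({'DataExtractName': extract_name, 'BusinessName': business})

# Serialize the list of nested dictionaries as a JSON array.
json_text = json.dumps(result, indent=4)
print(json_text)
```

This version handles any business name or source system that appears in the data without declaring it first. The trade-off is that nothing checks the structure, which the marshmallow schemas do explicitly.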