How to parse nested JSON into CSV

Time: 2019-10-24 21:03:36

Tags: python json csv parsing

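I have a new project in which I get JSON data from a REST API. I am trying to parse this data into a pipe-delimited CSV so it can be imported into our legacy software. I can't seem to get all of the key/value pairs parsed properly. This is my first exposure to JSON; I have tried many things, but I only get some of it right at a time.

I have used Python and can get some of the items I need, but not the whole JSON tree: it comes in as a list that also contains dictionaries and lists. I know my code is incomplete; I am just looking for someone to point me in the right direction on which tools in Python can get the job done.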

import json
import csv

with open('tenants.json') as access_json:
    read_content = json.load(access_json)


for rm_access in read_content:
    rm_data = rm_access

print(rm_data)
contacts_data = rm_data['Contacts']
leases_data = rm_data['Leases']
udfs_data = rm_data['UserDefinedValues']

for contacts_access in contacts_data:
    rm_contacts = contacts_access

Update:

import pandas as pd

with open('tenants.json') as access_json:
    read_content = json.load(access_json)

for rm_access in read_content:
    rm_data = rm_access

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 150)

TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
    rm_data = rm_access

    TenantID.append(rm_data["TenantID"])
    TenantDisplayID.append(rm_data["TenantDisplayID"])
    Name.append(rm_data["Name"])
    FirstName.append(rm_data["FirstName"])
    LastName.append(rm_data["LastName"])
    WebMessage.append(rm_data["WebMessage"])
    Comment.append(rm_data["Comment"])
    RentDueDay.append(rm_data["RentDueDay"])
    RentPeriod.append(rm_data["RentPeriod"])
#    FirstContact.append(rm_data["FirstContact"])
    PropertyID.append(rm_data["PropertyID"])
    PostingStartDate.append(rm_data["PostingStartDate"])
    CreateDate.append(rm_data["CreateDate"])
    CreateUserID.append(rm_data["CreateUserID"])
    UpdateUserID.append(rm_data["UpdateUserID"])
    Contacts.append(rm_data["Contacts"])


df = pd.DataFrame({"TenantID": TenantID, "TenantDisplayID": TenantDisplayID, "Name": Name,
                   "FirstName": FirstName, "LastName": LastName, "WebMessage": WebMessage,
                   "Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
                   "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
                   "CreateDate": CreateDate, "CreateUserID": CreateUserID,
                   "UpdateUserID": UpdateUserID, "Contacts": Contacts})

print(df)

Here is a sample of the file:

[
  {
    "TenantID": 115,
    "TenantDisplayID": 115,
    "Name": "Jane Doe",
    "FirstName": "Jane",
    "LastName": "Doe",
    "WebMessage": "",
    "Comment": "",
    "RentDueDay": 1,
    "RentPeriod": "Monthly",
    "FirstContact": "2015-11-01T15:30:00",
    "PropertyID": 17,
    "PostingStartDate": "2010-10-01T00:00:00",
    "CreateDate": "2014-04-16T13:35:37",
    "CreateUserID": 1,
    "UpdateDate": "2017-03-22T11:31:48",
    "UpdateUserID": 1,
    "Contacts": [
      {
        "ContactID": 128,
        "FirstName": "Jane",
        "LastName": "Doe",
        "MiddleName": "",
        "IsPrimary": true,
        "DateOfBirth": "1975-02-27T00:00:00",
        "FederalTaxID": "111-11-1111",
        "Comment": "",
        "Email": "jane.doe@mail.com",
        "License": "ZZT4532",
        "Vehicle": "BMW 3 Series",
        "IsShowOnBill": true,
        "Employer": "REW",
        "ApplicantType": "Applicant",
        "CreateDate": "2014-04-16T13:35:37",
        "CreateUserID": 1,
        "UpdateDate": "2017-03-22T11:31:48",
        "AnnualIncome": 0.0,
        "UpdateUserID": 1,
        "ParentID": 115,
        "ParentType": "Tenant",
        "PhoneNumbers": [
          {
            "PhoneNumberID": 286,
            "PhoneNumberTypeID": 2,
            "PhoneNumber": "703-555-5610",
            "Extension": "",
            "StrippedPhoneNumber": "7035555610",
            "IsPrimary": true,
            "ParentID": 128,
            "ParentType": "Contact"
          }
        ]
      }
    ],
    "UserDefinedValues": [
      {
        "UserDefinedValueID": 1,
        "UserDefinedFieldID": 4,
        "ParentID": 115,
        "Name": "Emerg Contact Name",
        "Value": "Terry Harper",
        "UpdateDate": "2016-01-22T15:41:53",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      },
      {
        "UserDefinedValueID": 174,
        "UserDefinedFieldID": 5,
        "ParentID": 115,
        "Name": "Emerg Contact Phone",
        "Value": "703-555-3568",
        "UpdateDate": "2016-01-22T15:42:03",
        "FieldType": "Text",
        "UpdateUserID": 2,
        "CreateUserID": 2
      }
    ],
    "Leases": [
      {
        "LeaseID": 115,
        "TenantID": 115,
        "UnitID": 181,
        "PropertyID": 17,
        "MoveInDate": "2010-10-01T00:00:00",
        "SortOrder": 1,
        "CreateDate": "2014-04-16T13:35:37",
        "UpdateDate": "2017-03-22T11:31:48",
        "CreateUserID": 1,
        "UpdateUserID": 1
      }
    ],
    "Addresses": [
      {
        "AddressID": 286,
        "AddressTypeID": 1,
        "Address": "14393 Montgomery Road Lot #102\r\nCincinnati, OH 45122",
        "Street": "14393 Montgomery Road Lot #102",
        "City": "Cincinnati",
        "State": "OH",
        "PostalCode": "45122",
        "IsPrimary": true,
        "ParentID": 115,
        "ParentType": "Tenant"
      }
    ],
    "OpenReceivables": [],
    "Status": "Current"
  },

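Not all tenants have all of the elements, which makes this tricky as well.

I need the data with TenantID, TenantDisplayID, etc. as the top-level fields. I also need the data from the Contacts, PhoneNumbers, Leases, etc. values. Each row should be static, so if a tag is missing I want to put Null or None in its place. It would look like TenantID | TenantDisplayID | Name ... etc., so that every row has the same number of fields.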

2 answers:

Answer 0: (score: 0)

Something like this should work:

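import json

import pandas as pd

# Load the API response saved to disk (same file as in the question).
with open('tenants.json') as access_json:
    read_content = json.load(access_json)

pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100000)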
TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_data in read_content:
    print(rm_data)
    # .get() returns None for keys a tenant lacks, so every list keeps
    # the same length and the DataFrame below can still be built.
    TenantID.append(rm_data.get("TenantID"))
    TenantDisplayID.append(rm_data.get("TenantDisplayID"))
    Name.append(rm_data.get("Name"))
    FirstName.append(rm_data.get("FirstName"))
    LastName.append(rm_data.get("LastName"))
    WebMessage.append(rm_data.get("WebMessage"))
    Comment.append(rm_data.get("Comment"))
    RentDueDay.append(rm_data.get("RentDueDay"))
    RentPeriod.append(rm_data.get("RentPeriod"))
    FirstContact.append(rm_data.get("FirstContact"))
    PropertyID.append(rm_data.get("PropertyID"))
    PostingStartDate.append(rm_data.get("PostingStartDate"))
    CreateDate.append(rm_data.get("CreateDate"))
    CreateUserID.append(rm_data.get("CreateUserID"))
    UpdateUserID.append(rm_data.get("UpdateUserID"))
    Contacts.append(rm_data.get("Contacts"))


df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name": Name,
                   "FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,
                   "Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
                   "FirstContact": FirstContact, "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
                   "CreateDate": CreateDate, "CreateUserID": CreateUserID,"UpdateUserID": UpdateUserID,
                   "Contacts": Contacts})

print(df)
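
The question also asks for pipe-delimited output in which every row has the same number of fields. A minimal sketch of that step using only the standard library, not part of the answer above: the column list is copied from the sample record, while the function name write_tenants and the second sample tenant are hypothetical. csv.DictWriter writes restval for any key a tenant lacks and, with extrasaction="ignore", skips nested keys such as Contacts.

```python
import csv
import io

# Column order for the export; taken from the sample record in the question.
FIELDS = [
    "TenantID", "TenantDisplayID", "Name", "FirstName", "LastName",
    "WebMessage", "Comment", "RentDueDay", "RentPeriod", "FirstContact",
    "PropertyID", "PostingStartDate", "CreateDate", "CreateUserID",
    "UpdateDate", "UpdateUserID",
]


def write_tenants(tenants, out):
    """Write one pipe-delimited row per tenant to the file-like object `out`."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        delimiter="|",
        restval="None",          # written when a tenant lacks a key
        extrasaction="ignore",   # skip keys not in FIELDS, e.g. "Contacts"
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(tenants)


# Hypothetical data: the second tenant is missing most keys.
sample = [
    {"TenantID": 115, "Name": "Jane Doe", "RentDueDay": 1, "Contacts": []},
    {"TenantID": 116, "Name": "John Roe"},
]
buffer = io.StringIO()
write_tenants(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("tenants.csv", "w", newline="") in place of the StringIO buffer. The nested Contacts, Leases and similar lists would still need their own handling.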

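Answer 1: (score: 0)

The general problem

The problem with this task (and other similar ones) is not just how to create an algorithm. I am sure that, in theory, you could solve this with a (not so small) number of nested for loops. The problem is organizing the code so that it does not give you a headache: so that you can fix bugs easily, write unit tests, understand the code just by reading it (six months from now), and change it easily when you need to. I do not know anyone who does not make mistakes when wrapping their head around a deeply nested structure. And chasing bugs in code that is heavily nested because it mirrors the nested structure of the data can be quite frustrating.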

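The quick (and most probably best) solution

Rely on packages that were made for your exact use case, such as

https://github.com/cwacek/python-jsonschema-objects

If you have a formal definition of the API schema, you can use packages built for that. For example, if your API has a Swagger schema definition, you can use swagger-py (https://github.com/digium/swagger-py) to turn the JSON response into Python objects.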

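The principled solution: object-oriented programming and recursion

Even if there are libraries for your concrete use case, I would like to explain the principle of how to deal with "that kind" of task:

Object-oriented programming is a good way to organize code for this type of problem, and the principle of recursion makes the nesting much easier to handle. It also makes the code easier to adapt if the JSON schema of your API response changes for any reason (for example, an update of the API). For your case, I would suggest creating something like the following: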

class JsonObject:
    """Parent Class for any Object that will be retrieved from the JSON
    and potentially has nested JsonObjects inside.

    This class takes care of parsing the json into python Objects and deals
    with the recursion into the nested structures."""

    primitives = []
    json_objects = {
        # For each class, this dict defines all the "embedded" classes which
        # live directly "under" that class in the nested JSON. It will have the
        # following structure:

        # attribute_name : class

        # In your case the JSON schema does not have any "single" objects
        # in the nesting structure, but only lists of nested objects. The
        # dict is still included to demonstrate how you would handle single
        # "embedded" objects if there were any.
    }
    json_object_lists = {
        # For each class, this dict defines all the "embedded" subclasses which
        # are provided in a list "under" that class in the nested JSON.
        # It will have the following structure:

        # attribute_name : class
    }

    @classmethod
    def from_dict(cls, d: dict) -> "JsonObject":
        instance = cls()

        for attribute in cls.primitives:
            # Here we just parse all the primitives; a missing key becomes None
            setattr(instance, attribute, d.get(attribute))

        for attribute, klass in cls.json_object_lists.items():
            # Here we parse all lists of embedded JSON Objects
            nested_objects = []
            for nested_dict in d.get(attribute) or []:
                nested_objects.append(klass.from_dict(nested_dict))

            setattr(instance, attribute, nested_objects)

        for attribute, klass in cls.json_objects.items():
            # Here we parse all "single" embedded JSON Objects
            nested_dict = d.get(attribute)
            setattr(
                instance,
                attribute,
                klass.from_dict(nested_dict) if nested_dict is not None else None,
            )

        return instance

    def to_csv(self) -> str:
        pass

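Since you did not explain exactly how you want to create the CSV from the JSON, I did not implement that method and left it to you. It is also not needed to explain the overall approach.

Now that we have the generic parent class that all of our specific classes inherit from, we can apply recursion to our problem. We only need to define the concrete structures according to the JSON schema we want to parse. I derived the following from your sample, but you can easily change whatever you need: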
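
For illustration only, here is one way such a to_csv method could look. This is a sketch under assumptions, not the intended implementation: it emits a single pipe-delimited line with the object's own primitives, writes the text None for unset attributes, and ignores nested objects. The helper to_row and the shortened Lease class are hypothetical.

```python
import csv
import io


class JsonObject:
    # Names of the primitive attributes, in output order.
    primitives = ()

    def to_row(self) -> list:
        """Return this object's primitive values in declaration order."""
        return [getattr(self, name, None) for name in self.primitives]

    def to_csv(self) -> str:
        """Render the primitives as one pipe-delimited line (no header)."""
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
        # Unset values are written as the literal text "None".
        writer.writerow(["None" if v is None else v for v in self.to_row()])
        return buffer.getvalue().rstrip("\n")


class Lease(JsonObject):
    # Shortened field list, for the example only.
    primitives = ("LeaseID", "TenantID", "UnitID")


lease = Lease()
lease.LeaseID = 115
lease.TenantID = 115
# UnitID is deliberately left unset to show the placeholder.
print(lease.to_csv())
```

Going through the csv module rather than str.join means values that contain the delimiter or a line break (the Address field in the sample has one) are quoted automatically.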

class Address(JsonObject):
    primitives = [
        "AddressID",
        "AddressTypeID",
        "Address",
        "Street",
        "City",
        "State",
        "PostalCode",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}


class Lease(JsonObject):
    primitives = [
        "LeaseID",
        "TenantID",
        "UnitID",
        "PropertyID",
        "MoveInDate",
        "SortOrder",
        "CreateDate",
        "UpdateDate",
        "CreateUserID",
        "UpdateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class UserDefinedValue(JsonObject):
    primitives = [
        "UserDefinedValueID",
        "UserDefinedFieldID",
        "ParentID",
        "Name",
        "Value",
        "UpdateDate",
        "FieldType",
        "UpdateUserID",
        "CreateUserID",
    ]

    json_objects = {}
    json_object_lists = {}


class PhoneNumber(JsonObject):
    primitives = [
        "PhoneNumberID",
        "PhoneNumberTypeID",
        "PhoneNumber",
        "Extension",
        "StrippedPhoneNumber",
        "IsPrimary",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {}

class Contact(JsonObject):
    primitives = [
        "ContactID",
        "FirstName",
        "LastName",
        "MiddleName",
        "IsPrimary",
        "DateOfBirth",
        "FederalTaxID",
        "Comment",
        "Email",
        "License",
        "Vehicle",
        "IsShowOnBill",
        "Employer",
        "ApplicantType",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "AnnualIncome",
        "UpdateUserID",
        "ParentID",
        "ParentType",
    ]

    json_objects = {}
    json_object_lists = {
        "PhoneNumbers": PhoneNumber,
    }


class Tenant(JsonObject):
    primitives = [
        "TenantID",
        "TenantDisplayID",
        "Name",
        "FirstName",
        "LastName",
        "WebMessage",
        "Comment",
        "RentDueDay",
        "RentPeriod",
        "FirstContact",
        "PropertyID",
        "PostingStartDate",
        "CreateDate",
        "CreateUserID",
        "UpdateDate",
        "UpdateUserID",
        "OpenReceivables",  # Maybe this is also a nested Object? Not clear from your sample.
        "Status",
    ]

    json_object_lists = {
        "Contacts": Contact,
        "UserDefinedValues": UserDefinedValue,
        "Leases": Lease,
        "Addresses": Address,
    }

    json_objects = {}

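You can probably imagine the "beauty" (or at least the order) of this approach: with this structure we can handle any level of nesting in the API's JSON response without additional headaches. Our code does not deepen its indentation level, because we have moved the nasty nesting into the recursive definition of JsonObject's from_dict method. That is why it is now much easier to identify bugs or apply changes to our code.

To finally parse the JSON into our objects, you would do something like the following: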
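
The same recursive idea also works without classes. As a minimal sketch (the function name flatten and the dotted key format are assumptions made for this example), one function can call itself for every dict and list it meets, so the code stays at one indentation level no matter how deep the data goes:

```python
def flatten(value, prefix=""):
    """Recursively flatten nested dicts and lists into one flat dict.

    Keys are joined with dots and list positions become numeric parts,
    e.g. Contacts.0.PhoneNumbers.0.PhoneNumber.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # A primitive: drop the trailing dot and store the value.
        flat[prefix[:-1]] = value
    return flat


# A cut-down version of the sample record from the question.
tenant = {
    "TenantID": 115,
    "Contacts": [
        {"ContactID": 128, "PhoneNumbers": [{"PhoneNumber": "703-555-5610"}]}
    ],
    "OpenReceivables": [],
}
flat = flatten(tenant)
for key, value in flat.items():
    print(key, "=", value)
```

Note that empty lists such as OpenReceivables produce no keys at all in this sketch, so a fixed column list is still needed to give every CSV row the same width.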

from_json
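
A minimal sketch of that parsing step, assuming the response is a JSON list of tenants as in the sample file. To keep it self-contained, the block repeats a compact version of the parent class with shortened field lists; the helper name tenants_from_json and the second tenant in the sample data are hypothetical.

```python
import json


class JsonObject:
    primitives = ()
    json_object_lists = {}

    @classmethod
    def from_dict(cls, d):
        instance = cls()
        for name in cls.primitives:
            # Missing keys become None.
            setattr(instance, name, d.get(name))
        for name, klass in cls.json_object_lists.items():
            # Recurse into each nested list; a missing list becomes [].
            children = [klass.from_dict(child) for child in d.get(name) or []]
            setattr(instance, name, children)
        return instance


class PhoneNumber(JsonObject):
    primitives = ("PhoneNumberID", "PhoneNumber")


class Contact(JsonObject):
    primitives = ("ContactID", "Email")
    json_object_lists = {"PhoneNumbers": PhoneNumber}


class Tenant(JsonObject):
    primitives = ("TenantID", "Name")
    json_object_lists = {"Contacts": Contact}


def tenants_from_json(json_string):
    """Parse the API response (a JSON list of tenants) into Tenant objects."""
    return [Tenant.from_dict(d) for d in json.loads(json_string)]


response = """
[{"TenantID": 115, "Name": "Jane Doe",
  "Contacts": [{"ContactID": 128, "Email": "jane.doe@mail.com",
                "PhoneNumbers": [{"PhoneNumberID": 286,
                                  "PhoneNumber": "703-555-5610"}]}]},
 {"TenantID": 116, "Name": "John Roe"}]
"""
tenants = tenants_from_json(response)
print(tenants[0].Contacts[0].PhoneNumbers[0].PhoneNumber)
print(tenants[1].Contacts)
```

The second tenant has no Contacts key at all, yet it still ends up with an empty Contacts list, which is what makes fixed-width rows possible later on.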

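An important final note: this is just the basic principle

My code example is only a very brief introduction to the idea of using objects and recursion to deal with an overwhelming (and nasty) nesting of a structure. The code has some flaws. For example, one should avoid defining mutable class variables. And of course the whole code should validate the data it gets from the API. You might also want to add the type of each attribute and represent it correctly in the Python objects (your sample has, for example, integers, datetimes and strings).

I really only wanted to show you the principle of object-oriented programming here.

I did not take the time to test my code, so there may be some bugs left. Again, I just wanted to demonstrate the principle.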
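
As a sketch of what typed, validated attributes could look like, here is a standard-library dataclass for a cut-down Lease. The choice of required fields and the error messages are assumptions made for this example, not something the API documents.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Lease:
    LeaseID: int
    TenantID: int
    MoveInDate: Optional[datetime] = None

    @classmethod
    def from_dict(cls, d: dict) -> "Lease":
        # Assumed to be required: fail loudly if missing or of the wrong type.
        for key in ("LeaseID", "TenantID"):
            if not isinstance(d.get(key), int):
                raise ValueError(f"{key} must be an integer, got {d.get(key)!r}")
        raw_date = d.get("MoveInDate")
        return cls(
            LeaseID=d["LeaseID"],
            TenantID=d["TenantID"],
            # The sample uses ISO 8601 timestamps such as 2010-10-01T00:00:00.
            MoveInDate=datetime.fromisoformat(raw_date) if raw_date else None,
        )


lease = Lease.from_dict(
    {"LeaseID": 115, "TenantID": 115, "MoveInDate": "2010-10-01T00:00:00"}
)
print(lease.MoveInDate.year)
```

Declaring fields on a frozen dataclass also sidesteps the mutable class variables mentioned above, because the field list lives in the class definition rather than in a shared list.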