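Parsing nested JSON into pandas DataFrames

Time: 2019-03-18 00:18:57

Tags: python json pandas

I am reading data from a legacy system that holds stock earnings data. The data is exported as JSON by modules such as this earnings module.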

earnings_dict = {
 "earningsChart": {
      "quarterly": [
           {
                "date": "1Q2018",
                "actual": {
                     "raw": 0.12,
                     "fmt": "0.12"
                },
                "estimate": {
                     "raw": 0.05,
                     "fmt": "0.05"
                }
           },
           {
                "date": "2Q2018",
                "actual": {
                     "raw": 0.21,
                     "fmt": "0.21"
                },
                "estimate": {
                     "raw": 0.19,
                     "fmt": "0.19"
                }
           },
           {
                "date": "3Q2018",
                "actual": {
                     "raw": 0.16,
                     "fmt": "0.16"
                },
                "estimate": {
                     "raw": 0.21,
                     "fmt": "0.21"
                }
           },
           {
                "date": "4Q2018",
                "actual": {
                     "raw": 0.07,
                     "fmt": "0.07"
                },
                "estimate": {
                     "raw": 0.14,
                     "fmt": "0.14"
                }
           }
      ],
      "currentQuarterEstimate": {
           "raw": 0.15,
           "fmt": "0.15"
      },
      "currentQuarterEstimateDate": "1Q",
      "currentQuarterEstimateYear": 2019,
      "earningsDate": [
           {
                "raw": 1556496000,
                "fmt": "2019-04-29"
           },
           {
                "raw": 1556841600,
                "fmt": "2019-05-03"
           }
      ]
 },
 "financialsChart": {
      "yearly": [
           {
                "date": 2015,
                "revenue": {
                     "raw": 74977000,
                     "fmt": "74.98M",
                     "longFmt": "74,977,000"
                },
                "earnings": {
                     "raw": -15668000,
                     "fmt": "-15.67M",
                     "longFmt": "-15,668,000"
                }
           },
           {
                "date": 2016,
                "revenue": {
                     "raw": 105586000,
                     "fmt": "105.59M",
                     "longFmt": "105,586,000"
                },
                "earnings": {
                     "raw": -8281000,
                     "fmt": "-8.28M",
                     "longFmt": "-8,281,000"
                }
           },
           {
                "date": 2017,
                "revenue": {
                     "raw": 143803000,
                     "fmt": "143.8M",
                     "longFmt": "143,803,000"
                },
                "earnings": {
                     "raw": 9716000,
                     "fmt": "9.72M",
                     "longFmt": "9,716,000"
                }
           },
           {
                "date": 2018,
                "revenue": {
                     "raw": 190071000,
                     "fmt": "190.07M",
                     "longFmt": "190,071,000"
                },
                "earnings": {
                     "raw": 19967000,
                     "fmt": "19.97M",
                     "longFmt": "19,967,000"
                }
           }
      ],
      "quarterly": [
           {
                "date": "1Q2018",
                "revenue": {
                     "raw": 42340000,
                     "fmt": "42.34M",
                     "longFmt": "42,340,000"
                },
                "earnings": {
                     "raw": 4320000,
                     "fmt": "4.32M",
                     "longFmt": "4,320,000"
                }
           },
           {
                "date": "2Q2018",
                "revenue": {
                     "raw": 47240000,
                     "fmt": "47.24M",
                     "longFmt": "47,240,000"
                },
                "earnings": {
                     "raw": 7474000,
                     "fmt": "7.47M",
                     "longFmt": "7,474,000"
                }
           },
           {
                "date": "3Q2018",
                "revenue": {
                     "raw": 50126000,
                     "fmt": "50.13M",
                     "longFmt": "50,126,000"
                },
                "earnings": {
                     "raw": 5524000,
                     "fmt": "5.52M",
                     "longFmt": "5,524,000"
                }
           },
           {
                "date": "4Q2018",
                "revenue": {
                     "raw": 50365000,
                     "fmt": "50.37M",
                     "longFmt": "50,365,000"
                },
                "earnings": {
                     "raw": 2649000,
                     "fmt": "2.65M",
                     "longFmt": "2,649,000"
                }
           }
      ]
 },
 "financialCurrency": "USD"}

As you can see, the JSON nests some metadata at the top level of the dict, which is easy to read with something like pandas.io.json.json_normalize.

import pandas as pd
df = pd.io.json.json_normalize(earnings_dict)  # pandas 1.0 and later expose this as pd.json_normalize

df
Out[13]: 
  earningsChart.currentQuarterEstimate.fmt  ...                             financialsChart.yearly
0                                     0.15  ...  [{'date': 2015, 'revenue': {'raw': 74977000, '...

[1 rows x 9 columns]

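However, it does not expand the nested lists of dicts that hold the yearly and quarterly earnings data. For example, the quarterly and yearly lists are simply added to the DataFrame as lists of dicts.

I imagine this was originally several SQL tables linked by foreign keys.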

I have read the json_normalize documentation, but I can't work out how to use the record_path and meta parameters to parse this dict.
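For reference, here is a minimal sketch of how record_path and meta seem to fit this structure. It assumes pandas 1.0 or later (where the function is available as pd.json_normalize) and uses `sample`, a trimmed two-quarter copy of the dict above, so that it runs on its own. record_path is the list of keys leading to the list of records, and each meta entry is a key (or a list of keys) whose value gets repeated on every row.

```python
import pandas as pd

# Trimmed copy of the structure above (two quarters only); the name
# `sample` is just for this sketch.
sample = {
    "earningsChart": {
        "quarterly": [
            {"date": "1Q2018",
             "actual": {"raw": 0.12, "fmt": "0.12"},
             "estimate": {"raw": 0.05, "fmt": "0.05"}},
            {"date": "2Q2018",
             "actual": {"raw": 0.21, "fmt": "0.21"},
             "estimate": {"raw": 0.19, "fmt": "0.19"}},
        ],
        "currentQuarterEstimateYear": 2019,
    },
    "financialCurrency": "USD",
}

# record_path walks down to the list of records: one row per quarter.
# meta lists values to repeat on each row; a nested value is given as a
# list of keys and ends up in a column named with the "." separator.
quarterly = pd.json_normalize(
    sample,
    record_path=["earningsChart", "quarterly"],
    meta=["financialCurrency", ["earningsChart", "currentQuarterEstimateYear"]],
)

print(quarterly.columns.tolist())
```

The nested "actual" and "estimate" dicts inside each record are flattened into columns such as actual.raw and estimate.fmt, while the meta values are attached to every row.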

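I think I can use json_normalize to build DataFrames even from dicts nested several levels deep. It looks like I need at least five of them: one for the metadata and at least four for the quarterly and yearly tables.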
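One way to get those five DataFrames might look like the sketch below. It is an illustration under assumptions, not a definitive answer: it assumes pandas 1.0 or later, uses a trimmed copy of the data named `sample`, and the table names in `record_paths` are made up for this example. Each nested list becomes its own table through record_path, and the metadata table is whatever remains after the list-valued columns are dropped from the fully flattened dict.

```python
import pandas as pd

# Trimmed copy of the structure above; `sample` is a name used only here.
sample = {
    "earningsChart": {
        "quarterly": [
            {"date": "1Q2018",
             "actual": {"raw": 0.12, "fmt": "0.12"},
             "estimate": {"raw": 0.05, "fmt": "0.05"}},
        ],
        "currentQuarterEstimate": {"raw": 0.15, "fmt": "0.15"},
        "currentQuarterEstimateDate": "1Q",
        "currentQuarterEstimateYear": 2019,
        "earningsDate": [
            {"raw": 1556496000, "fmt": "2019-04-29"},
            {"raw": 1556841600, "fmt": "2019-05-03"},
        ],
    },
    "financialsChart": {
        "yearly": [
            {"date": 2015,
             "revenue": {"raw": 74977000, "fmt": "74.98M", "longFmt": "74,977,000"},
             "earnings": {"raw": -15668000, "fmt": "-15.67M", "longFmt": "-15,668,000"}},
            {"date": 2016,
             "revenue": {"raw": 105586000, "fmt": "105.59M", "longFmt": "105,586,000"},
             "earnings": {"raw": -8281000, "fmt": "-8.28M", "longFmt": "-8,281,000"}},
        ],
        "quarterly": [
            {"date": "1Q2018",
             "revenue": {"raw": 42340000, "fmt": "42.34M", "longFmt": "42,340,000"},
             "earnings": {"raw": 4320000, "fmt": "4.32M", "longFmt": "4,320,000"}},
        ],
    },
    "financialCurrency": "USD",
}

# Hypothetical table names mapped to the key path of each nested list.
record_paths = {
    "earnings_quarterly": ["earningsChart", "quarterly"],
    "earnings_dates": ["earningsChart", "earningsDate"],
    "financials_yearly": ["financialsChart", "yearly"],
    "financials_quarterly": ["financialsChart", "quarterly"],
}

# One DataFrame per nested list, with inner dicts flattened to dotted columns.
tables = {
    name: pd.json_normalize(sample, record_path=path)
    for name, path in record_paths.items()
}

# Metadata: flatten everything, then drop the columns whose cell is still a list.
flat = pd.json_normalize(sample)
list_cols = [col for col in flat.columns if isinstance(flat[col].iloc[0], list)]
tables["metadata"] = flat.drop(columns=list_cols)

for name, frame in tables.items():
    print(name, frame.shape)
```

Keeping the frames in a dict keyed by table name makes it straightforward to loop over them later, for example when writing each one to a database table.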

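Bonus:

How would you store this? Would you put it in a NoSQL database or keep it in SQL? My requirement is low-load, lightweight analysis that needs some views and charts built with pandas and matplotlib.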
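If it stays in SQL, one lightweight option to consider is SQLite through the standard-library sqlite3 module together with DataFrame.to_sql and pandas.read_sql. The sketch below is only an illustration of that idea: the table name, the underscore column names, and the in-memory database are assumptions made for the example, and the values are the quarterly "raw" figures from the data above.

```python
import sqlite3

import pandas as pd

# Hypothetical flattened quarterly table; values copied from the data above,
# with dotted column names replaced by underscores for easier SQL.
quarterly = pd.DataFrame(
    {
        "date": ["1Q2018", "2Q2018", "3Q2018", "4Q2018"],
        "actual_raw": [0.12, 0.21, 0.16, 0.07],
        "estimate_raw": [0.05, 0.19, 0.21, 0.14],
    }
)

# In-memory database for the sketch; pass a file path instead to persist it.
con = sqlite3.connect(":memory:")
quarterly.to_sql("earnings_quarterly", con, index=False, if_exists="replace")

# Read back only the quarters where actual earnings beat the estimate.
beats = pd.read_sql(
    "SELECT date, actual_raw - estimate_raw AS surprise "
    "FROM earnings_quarterly "
    "WHERE actual_raw > estimate_raw "
    "ORDER BY date",
    con,
)
con.close()

print(beats)
```

The result of read_sql is an ordinary DataFrame, so it can be handed to matplotlib or DataFrame.plot for charting.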

Thanks for your help!

0 answers:

No answers