How to flatten nested JSON into a pandas DataFrame

Time: 2019-11-15 05:25:43

Tags: python json python-3.x pandas dataframe

How can I flatten JSON into a pd.DataFrame like the following:

class_id|id |schedule_id |schedule_date |lesson_price |status
    1   | 3 |    1       | 2017-07-11   |   USD 25    | ONGOING
    1   | 3 |    2       | 2016-09-24   |   USD 15    | OPEN REGISTRATION
    1   | 4 |    1       | 2016-12-17   |   USD 19    | ONGOING
    1   | 4 |    2       | 2015-11-12   |   USD 29    | ONGOING
    1   | 4 |    3       | 2015-11-10   |   USD 14    | ON SCHEDULE
    2   | 1 |    1       | 2017-05-21   |   USD 50    | CANCELLED
    2   | 2 |    1       | 2017-06-04   |   USD10     | FINISHED
    2   | 2 |    2       | 2018-03-01   |   USD12     | CLOSED

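from the JSON

I have tried following this reference, but it gives me only 2 rows, grouped by class_id.

How can I show all the schedule data from the lesson objects together with class_id and id, as in the desired DataFrame?

1 Answer:

Answer 0 (score: 0)

The difficulty with this data structure comes from having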

{
  "lesson3": {
    "id": 3,
    "schedule": [
      {
        "schedule_id": "1",
        "schedule_date": "2017-07-11",
        "lesson_price": "USD 25",
        "status": "ONGOING"
      },
      {
        "schedule_id": "2",
        "schedule_date": "2016-09-24",
        "lesson_price": "USD 15",
        "status": "OPEN REGISTRATION"
      }
    ]
  }
}

instead of having

{
  "name": "lesson3",
  "id": 3,
  "schedule": [
    {
      "schedule_id": "1",
      "schedule_date": "2017-07-11",
      "lesson_price": "USD 25",
      "status": "ONGOING"
    },
    {
      "schedule_id": "2",
      "schedule_date": "2016-09-24",
      "lesson_price": "USD 15",
      "status": "OPEN REGISTRATION"
    }
  ]
}

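But most of the time we have no control over the data we receive. So we have to get rid of the lesson1, lesson2, ... keys and move the objects one level up.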
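As a minimal sketch of what "moving the objects up" means, the snippet below turns the keyed form into the record form using plain Python. The variable names `keyed` and `lessons` are made up for illustration, and the sample values are copied from the JSON shown above.

```python
# Shape taken from the first example above: the lesson name is a dict key.
keyed = {
    "lesson3": {
        "id": 3,
        "schedule": [
            {"schedule_id": "1", "schedule_date": "2017-07-11",
             "lesson_price": "USD 25", "status": "ONGOING"},
            {"schedule_id": "2", "schedule_date": "2016-09-24",
             "lesson_price": "USD 15", "status": "OPEN REGISTRATION"},
        ],
    }
}

# Turn each key/value pair into one record and keep the key as a "name" field.
lessons = [{"name": name, **body} for name, body in keyed.items()]

print(lessons[0]["name"], lessons[0]["id"], len(lessons[0]["schedule"]))
```

Each element of `lessons` now has the same set of fields, which is the shape that tabular tools handle most easily.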

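Solution

import requests

# `url` is the endpoint that returns the JSON from the question (not shown here)
data = requests.get(url).json()

Extract the different lessons

# One entry per lesson, keeping the parent class_id; the lessonN key itself is dropped
data_ = [{'class_id': c['class_id'], 'lessons': v} for c in data['class'] for v in c['data'].values()]

The data now looks like this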

[
  {
    "class_id": "1",
    "lessons": {
      "id": 3,
      "schedule": [
        {
          "schedule_id": "1",
          "schedule_date": "2017-07-11",
          "lesson_price": "USD 25",
          "status": "ONGOING"
        },
        {
          "schedule_id": "2",
          "schedule_date": "2016-09-24",
          "lesson_price": "USD 15",
          "status": "OPEN REGISTRATION"
        }
      ]
    }
  },
  ...
]

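Now we can read it into a pandas DataFrame using json_normalize

from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize

df = json_normalize(data_, record_path=['lessons', 'schedule'], meta=['class_id', ['lessons', 'id']])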
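For readers who want to try the steps without the original endpoint, here is a self-contained sketch. The overall layout of `data` (a top-level "class" list whose items hold "class_id" and a "data" dict keyed by lesson name) is an assumption inferred from the list comprehension above, not the actual response; the values are taken from the tables in this post. It uses `pd.json_normalize`, which is available in pandas 1.0 and later.

```python
import pandas as pd

# Assumed input layout, reduced to three schedule entries.
data = {
    "class": [
        {"class_id": "1", "data": {
            "lesson3": {"id": 3, "schedule": [
                {"schedule_id": "1", "schedule_date": "2017-07-11",
                 "lesson_price": "USD 25", "status": "ONGOING"},
                {"schedule_id": "2", "schedule_date": "2016-09-24",
                 "lesson_price": "USD 15", "status": "OPEN REGISTRATION"},
            ]},
        }},
        {"class_id": "2", "data": {
            "lesson1": {"id": 1, "schedule": [
                {"schedule_id": "1", "schedule_date": "2017-05-21",
                 "lesson_price": "USD 50", "status": "CANCELLED"},
            ]},
        }},
    ]
}

# Drop the lessonN keys: one record per lesson, tagged with its class_id.
data_ = [
    {"class_id": c["class_id"], "lessons": lesson}
    for c in data["class"]
    for lesson in c["data"].values()
]

# One row per schedule entry; class_id and lessons.id are repeated as metadata.
df = pd.json_normalize(
    data_,
    record_path=["lessons", "schedule"],
    meta=["class_id", ["lessons", "id"]],
)
print(df)
```

`record_path` points at the list that should become rows, while `meta` lists the parent fields to copy onto each of those rows.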

Output

  schedule_id schedule_date lesson_price             status class_id lessons.id
0           1    2017-07-11       USD 25            ONGOING        1          3
1           2    2016-09-24       USD 15  OPEN REGISTRATION        1          3
2           1    2016-12-17       USD 19            ONGOING        1          4
3           2    2015-11-12       USD 29            ONGOING        1          4
4           3    2015-11-10       USD 14        ON SCHEDULE        1          4
5           1    2017-05-21       USD 50          CANCELLED        2          1
6           1    2017-06-04        USD10           FINISHED        2          2
7           5    2018-03-01        USD12             CLOSED        2          2
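The output has the right rows, but the column names and order differ from the table requested in the question (`lessons.id` instead of `id`, and the metadata columns come last). One possible follow-up, sketched here on a two-row stand-in for `df` built from the first rows of the output, is to rename and reorder the columns. The names `wanted` and `result` are illustrative.

```python
import pandas as pd

# Stand-in for the normalized frame: first two rows of the output above.
df = pd.DataFrame({
    "schedule_id": ["1", "2"],
    "schedule_date": ["2017-07-11", "2016-09-24"],
    "lesson_price": ["USD 25", "USD 15"],
    "status": ["ONGOING", "OPEN REGISTRATION"],
    "class_id": ["1", "1"],
    "lessons.id": [3, 3],
})

# Column order requested in the question.
wanted = ["class_id", "id", "schedule_id", "schedule_date", "lesson_price", "status"]

# Rename the metadata column, then select the columns in the desired order.
result = df.rename(columns={"lessons.id": "id"})[wanted]
print(result)
```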