Question

This is the JSON data I currently have. I need to load it into a Pandas DataFrame in a usable shape.

{
  "lab1": {
    "co2": [
      9.559335530495726
    ],
    "occupancy": [
      4
    ],
    "temperature": [
      21.033629524242304
    ],
    "time": "2020-09-15T16:15:35.565629"
  }
}
{
  "class1": {
    "co2": [
      24.168445969175817
    ],
    "occupancy": [
      15
    ],
    "temperature": [
      26.176607611778156
    ],
    "time": "2020-09-15T16:15:36.027525"
  }
}
{
  "office": {
    "co2": [
      6.633787232630541
    ],
    "occupancy": [
      1
    ],
    "temperature": [
      27.727982558797844
    ],
    "time": "2020-09-15T16:15:36.608386"
  }
}

I tried json_normalize, but I don't understand how to normalize this JSON data.

with open('data.json','r') as f:
    data = json.loads(f.read())
    # Normalizing data
    data1 = pd.json_normalize(data, record_path =['Results'])
    # Saving to CSV format 
    multiple_level_data.to_csv('multiplelevel_normalized_data.csv', index=False)

When I run this code, I get the following error:

JSONDecodeError                           Traceback (most recent call last)
      1 with open('data.json','r') as f:
----> 2     data = json.loads(f.read())

JSONDecodeError: Extra data: line 14 column 2 (char 240)

Answer 1

You can use pandas read_json.

First use a regular expression to remove every '[' and ']' from the data. Then save the result as a JSON file.

import pandas as pd
pd.read_json(r'Path where you saved the JSON file/filename.json')
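
The bracket-removal step can be sketched as follows. This is only an illustration under stated assumptions: the `raw` string is a hypothetical sample in which the three readings have already been merged into one valid JSON object (the file in the question holds three separate objects, which read_json cannot parse as-is), and `orient='index'` is my choice so that each location becomes a row. Stripping brackets with a regex is only safe when no string value contains '[' or ']'.

```python
import io
import re

import pandas as pd

# Hypothetical sample: the three readings merged into ONE valid JSON object.
raw = '''
{
  "lab1":   {"co2": [9.559335530495726],  "occupancy": [4],
             "temperature": [21.033629524242304], "time": "2020-09-15T16:15:35.565629"},
  "class1": {"co2": [24.168445969175817], "occupancy": [15],
             "temperature": [26.176607611778156], "time": "2020-09-15T16:15:36.027525"},
  "office": {"co2": [6.633787232630541],  "occupancy": [1],
             "temperature": [27.727982558797844], "time": "2020-09-15T16:15:36.608386"}
}
'''

# Remove every '[' and ']' so each one-element list becomes a plain scalar.
cleaned = re.sub(r'[\[\]]', '', raw)

# orient='index' treats the outer keys (locations) as row labels.
df = pd.read_json(io.StringIO(cleaned), orient='index')
print(df)
```

Reading from a `StringIO` buffer stands in for the "save it as a JSON file" step; passing a file path to read_json would work the same way.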

Answer 2

Here is an approach that does not need regular expressions.

import pandas as pd

data = [
    {'lab1': {'co2': [9.559335530495726],
              'occupancy': [4],
              'temperature': [21.033629524242304],
              'time': '2020-09-15T16:15:35.565629'}},
    {'class1': {'co2': [24.168445969175817],
                'occupancy': [15],
                'temperature': [26.176607611778156],
                'time': '2020-09-15T16:15:36.027525'}},
    {'office': {'co2': [6.633787232630541],
                'occupancy': [1],
                'temperature': [27.727982558797844],
                'time': '2020-09-15T16:15:36.608386'}}
]

Now iterate over the list of dictionaries. Use explode() to flatten the lists.

df = list()
for d in data:
    for key, values in d.items():
        t = (pd.json_normalize(values)
              .explode('co2')
              .explode('occupancy')
              .explode('temperature')
              .assign(location=key)
             )
        df.append(t)

df = pd.concat(df)
print(df)

       co2 occupancy temperature                        time location
0  9.55934         4     21.0336  2020-09-15T16:15:35.565629     lab1
0  24.1684        15     26.1766  2020-09-15T16:15:36.027525   class1
0  6.63379         1      27.728  2020-09-15T16:15:36.608386   office
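
To see what explode() contributes in the loop above, here is a small sketch on a made-up frame (the `demo` data is hypothetical and not part of the question). A cell that holds a list is expanded into one row per element, and each new row keeps the index label of the row it came from.

```python
import pandas as pd

# Hypothetical frame: one row whose 'co2' cell holds a list of three readings.
demo = pd.DataFrame({'co2': [[1.0, 2.0, 3.0]], 'location': ['lab1']})

# explode() emits one row per list element; the other columns are repeated.
flat = demo.explode('co2')
print(flat)
```

With the one-element lists in the question's data, explode() simply unwraps each list into a scalar. If several columns held longer lists, chaining explode() calls would multiply the rows, so that case needs more care.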

The original question did not state an expected result, but this data frame will support many kinds of further analysis.

Answer 3

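@Fareed Ahmad posted a larger data set.

First, we create two functions: 1) one that converts the Gist file into a sequence of packets; and 2) one that converts a packet into a data frame: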

import json
import pandas as pd
import requests

def resp_to_packets(resp_text):
    ''' Convert file to list of packets.'''
    packet = ''
    for line in resp_text.split('\n'):
        if line.startswith('}{'):
            packet += '}'
            yield json.loads(packet)
            packet = '{'
        else:
            packet += line + '\n'
    yield json.loads(packet)

    
def packet_to_df(packet):
    ''' Convert packet to data frame.'''
    df = list()
    for key, values in packet.items():
        t = (pd.json_normalize(values)
              .explode('co2')
              .explode('occupancy')
              .explode('temperature')
              .assign(location=key)
             )
        df.append(t)
    return pd.concat(df, ignore_index=True)

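# 'datetime64[ns]' names the unit explicitly; recent pandas versions reject a unit-less 'datetime64'.
dtypes = {'co2': float, 'occupancy': int, 'temperature': float,
          'time': 'datetime64[ns]', 'location': str}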

Second, run the pipeline, which concatenates the data frames and converts the types:

url = 'https://gist.githubusercontent.com/Fareed99/45b8a39a7e4493243ec973fa73f2b92b/raw/4b2eb95c24c95d40733f06d13dbf0356c4520e99/data.json'
r = requests.get(url)
assert r.ok

packets = resp_to_packets(r.text)
dfs     = (packet_to_df(packet) for packet in packets)
df      = pd.concat(dfs, ignore_index=True).astype(dtype=dtypes)

print(df.tail())

           co2  occupancy  temperature                       time location
476  45.237285         15    27.364173 2020-09-15 20:37:29.252201   class1
477   5.565177          4    21.033565 2020-09-15 20:37:29.667347     lab1
478  10.799228          1    21.014435 2020-09-15 20:37:30.689885     lab1
479  36.989700         20    27.059197 2020-09-15 20:37:33.467733   class1
480   1.836340          2    23.021893 2020-09-15 20:37:35.853943   office
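
As an alternative to splitting on lines that start with '}{', the packets could also be separated with the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value and reports where it stopped. This is a sketch, not part of the answer above: the helper name `iter_json_objects` and the `sample` string are made up for illustration. It also shows why the question's `json.loads` call failed with "Extra data": that function expects exactly one document in the string.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # Parse one value starting at pos; the returned pos is where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: three objects back to back, with and without a newline between them.
sample = ('{"lab1": {"occupancy": [4]}}\n'
          '{"class1": {"occupancy": [15]}}{"office": {"occupancy": [1]}}')

packets = list(iter_json_objects(sample))
print([list(p) for p in packets])
```

Each yielded dictionary has the same shape as the packets consumed by packet_to_df, so the rest of the pipeline would not need to change.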

How do I convert nested JSON data into a Pandas DataFrame?
