标准化熊猫数据框中的列

时间:2019-08-05 04:55:18

标签: pandas

我可以使用此代码从json文件导入数据...

import requests
from pandas.io.json import json_normalize

url = "https://datameetgeobk.s3.amazonaws.com/image_list.json"

resp = requests.get(url=url)
df = json_normalize(resp.json()['Images'])
df.head()

但是“ BlockDeviceMappings”列实际上是一个列表,每个项目都有DeviceName和Ebs参数,它们是字符串和字典。如何进一步规范我的数据框,以将所有详细信息包括在单独的列中?


我的屏幕截图与答案中显示的截图不匹配。 Ebs列(左数第二个)是字典。

enter image description here

1 个答案:

答案 0 :(得分:1)

import requests
from pandas.io.json import json_normalize

url = "https://datameetgeobk.s3.amazonaws.com/image_list.json"

resp = requests.get(url=url)
resp = resp.json()

到目前为止,您拥有什么:

df = json_normalize(resp['Images'])

BlockDeviceMappings强制转换为所有列

inner_keys = [x for x in resp['Images'][0].keys() if x != 'BlockDeviceMappings']

df_bdm = json_normalize(resp['Images'], record_path=['BlockDeviceMappings'], meta=inner_keys, errors='ignore')

分隔bdm_df

bdm_df = json_normalize(resp['Images'], record_path=['BlockDeviceMappings'])

毫无疑问,您为什么df有39995个条目,而bdm_df有131691个条目。这是因为BlockDeviceMappingslist中的dicts,长度不同:

bdm_len = [len(x) for x in df.BlockDeviceMappings]
max(bdm_len)
>>> 31

示例BlockDeviceMappings条目:

[{'DeviceName': '/dev/sda1',
  'Ebs': {'DeleteOnTermination': True,
   'SnapshotId': 'snap-0aac2591b85fe677e',
   'VolumeSize': 80,
   'VolumeType': 'gp2',
   'Encrypted': False}},
 {'DeviceName': 'xvdb',
  'Ebs': {'DeleteOnTermination': True,
   'SnapshotId': 'snap-0bd8d7828225924a7',
   'VolumeSize': 80,
   'VolumeType': 'gp2',
   'Encrypted': False}}]

df_bdm.head()

enter image description here