I have a dataframe obtained from an API. The resulting dataframe has a dictionary in each of several columns, and I want to extract the information inside them. Here is a sample of my dataframe:
How can I get the values in the yellow column? And how can I save this dataframe as a CSV?
Thanks in advance for your help! Here is the code that fetches the dataframe from the API:
r = rq.get('https://api.tfl.gov.uk/Road/A2%2C%20A406%2C%20A1%2C%20A12%2C%20A13/Disruption?app_id=XXXXXXXXXX&app_key=XXXXXXXXX')
r = r.text
df7 = pd.read_json(r)
df7
Answer 0 (score: 0)
import numpy as np

columns = ['geography', 'geometry']
for col in columns:
    d = df7.loc[0, col]
    for key in d.keys():
        # one new column per dict key; leave NaN cells as NaN
        # (isinstance is used because identity checks against np.nan are unreliable)
        df7[key + '_' + col] = df7[col].apply(lambda x: x[key] if isinstance(x, dict) else np.nan)
Try it for all the columns by substituting each column's name for geography.
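As a sketch of that loop on a small invented frame (the column names and values here are illustrative, not the real API fields):

```python
import numpy as np
import pandas as pd

# Invented stand-in for df7: one dict column with a NaN cell
df = pd.DataFrame({
    "id": [1, 2, 3],
    "geography": [
        {"type": "Point", "crs": "wgs84"},
        np.nan,
        {"type": "LineString", "crs": "wgs84"},
    ],
})

col = "geography"
d = df.loc[0, col]
for key in d.keys():
    # one new column per key found in the first row's dict
    df[key + "_" + col] = df[col].apply(
        lambda x: x[key] if isinstance(x, dict) else np.nan)

print(df[["type_geography", "crs_geography"]])
```

Note that this only looks at the keys of the first row's dict, so it assumes every dict in the column has the same keys.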
Answer 1 (score: 0)
Here is a solution that works regardless of whether the dictionaries have different keys, or the columns contain a mix of dictionaries and other types.
Note: I removed the API key from the request; you will need to add it back.
from functools import partial
import itertools
import pandas as pd
import requests as rq

api_key = ""
url = "https://api.tfl.gov.uk/Road/A2%2C%20A406%2C%20A1%2C%20A12%2C%20A13/Disruption?app_id=XXXXXX&app_key={}".format(api_key)
r = rq.get(url)
r = r.text
df7 = pd.read_json(r)
output_path = "disruptions.csv"

def keys_if_dict(element):
    if isinstance(element, dict):
        return list(element.keys())
    return list()

def value_for_key(element, key):
    if isinstance(element, dict) and key in element:
        return element[key]
    return None

def handle_dicts_in_column(df, column_name):
    column = df[column_name]
    if any(map(lambda x: isinstance(x, dict), column)):
        # this column has dictionaries in it
        column_dict_keys = set(itertools.chain.from_iterable(column.transform(keys_if_dict)))
        for dict_key in column_dict_keys:
            column_name_from_dict_key = "{}_{}".format(column_name, dict_key)
            while column_name_from_dict_key in df.columns:
                column_name_from_dict_key += "(dup)"
            df[column_name_from_dict_key] = column.transform(partial(value_for_key, key=dict_key))
            # recurse if the extracted values are themselves dicts
            if any(map(lambda x: isinstance(x, dict), df[column_name_from_dict_key])):
                handle_dicts_in_column(df, column_name_from_dict_key)

for column_name in df7.columns:
    handle_dicts_in_column(df7, column_name)

df7.to_csv(output_path)
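To see the recursion at work without the API, here is a self-contained run on an invented two-level dict (using .apply, which behaves the same as .transform here since both apply the function element-wise):

```python
from functools import partial
import itertools
import pandas as pd

def keys_if_dict(element):
    return list(element.keys()) if isinstance(element, dict) else []

def value_for_key(element, key):
    if isinstance(element, dict) and key in element:
        return element[key]
    return None

def handle_dicts_in_column(df, column_name):
    column = df[column_name]
    if any(isinstance(x, dict) for x in column):
        # union of all keys seen in this column's dicts
        dict_keys = set(itertools.chain.from_iterable(column.apply(keys_if_dict)))
        for dict_key in dict_keys:
            new_name = "{}_{}".format(column_name, dict_key)
            while new_name in df.columns:
                new_name += "(dup)"
            df[new_name] = column.apply(partial(value_for_key, key=dict_key))
            # recurse into nested dicts
            if any(isinstance(x, dict) for x in df[new_name]):
                handle_dicts_in_column(df, new_name)

# Invented sample: "geometry" mixes a two-level dict and None
toy = pd.DataFrame({
    "road": ["A2", "A406"],
    "geometry": [{"type": "Point", "crs": {"name": "wgs84"}}, None],
})
for name in list(toy.columns):
    handle_dicts_in_column(toy, name)
print(sorted(toy.columns))
```

The nested crs dict produces a second-level column geometry_crs_name via the recursive call.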
Answer 2 (score: 0)
import pandas as pd # version 0.25
from pandas.io.json import json_normalize
df = pd.read_json(r)
DataFrame.explode: moves each item of a list into its own row. recurringSchedules is either nan or a list of dicts:
df = df.explode('recurringSchedules')
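A minimal sketch of what the explode step does, on invented data:

```python
import numpy as np
import pandas as pd

# A recurringSchedules-style column: one cell holds a list of dicts, one is nan
df = pd.DataFrame({
    "road": ["A2", "A406"],
    "recurringSchedules": [[{"startTime": "07:00"}, {"startTime": "16:00"}], np.nan],
})

# each list element gets its own row; nan cells stay as a single nan row
df = df.explode("recurringSchedules")  # requires pandas >= 0.25
print(df)
```

The two-item list becomes two rows (with the other columns repeated), so the row count grows accordingly.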
Replace nan: rows of geography and recurringSchedules that are nan are each replaced with an appropriate dict whose values are np.NaN, as shown by geo_json and recur_sched_json below. Using json_normalize on a nan (a float, not a dict or list) throws an AttributeError:
import numpy as np

geo_json = {"type": np.NaN, "coordinates": np.NaN, "crs": {"type": np.NaN, "properties": {"name": np.NaN}}}
recur_sched_json = {'$type': np.NaN, 'startTime': np.NaN, 'endTime': np.NaN}

def replace_nan(df_row, dict_nan: dict) -> dict:
    # a nan cell is a float, not a dict; swap in the template dict
    if not isinstance(df_row, dict):
        return dict_nan
    return df_row

df.geometry = df.geometry.apply(lambda x: replace_nan(x, geo_json))
df.recurringSchedules = df.recurringSchedules.apply(lambda x: replace_nan(x, recur_sched_json))
DataFrame.explode() changes the number of rows in the DataFrame from 13 to 24:
df.reset_index(drop=True, inplace=True)
json_normalize: each key of the dicts gets its own column:
df_dict = dict()
for x in df.keys():
    try:
        y = json_normalize(df[x])
        y.columns = [f'{x}.{col}' for col in y.columns]
        df_dict[x] = y
    except AttributeError:
        df_dict[x] = df[x]

df_new = pd.concat([df_dict[x] for x in df_dict.keys()], axis=1)
The json_normalized columns (e.g. geography, geometry, recurringSchedules) each get their own DataFrame in df_dict, which can be accessed as shown below; df_new combines all the columns into a single DataFrame:
df_dict['geography']
df_new.to_csv('geo.csv', sep=',', index=False)
geometry.coordinates was not exploded because it consists of nested lists, with lengths as follows:
[28, 1, 96, 65, nan, 1, nan, 50, 1, 1, 1, 1, 1, 1, 144, 144, 144, 144, 144, nan, 596, 596, 596, 52]
(e.g., one cell contains 596 nested lists)
geography.coordinates was not exploded either; however, each of its row values is a single list.