I have a column containing JSON-structured data. My df looks like this:
ClientToken Data
7a9ee887-8a09-ff9592e08245 [{"summaryId":"4814223456","duration":952,"startTime":1587442919}]
bac49563-2cf0-cb08e69daa48 [{"summaryId":"4814239586","duration":132,"startTime":1587443876}]
I want to expand it into:
ClientToken summaryId duration startTime
7a9ee887-8a09-ff9592e08245 4814223456 952 1587442919
bac49563-2cf0-cb08e69daa48 4814239586 132 1587443876
Any ideas?
Answer 0 (score: 1)
You can try:
df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1]))))
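Explanation:
1. First, apply a function to the Data column that performs the following steps:
  1.1. The "Data" content is wrapped in a list, and it is a string, so we can remove the [] manually with x[1:-1] (dropping the first and last characters).
  1.2. The "Data" column is a string, whereas we actually want a JSON object, so it needs to be converted. One solution is to use the json.loads() function from the json module. The code becomes json.loads(x[1:-1]).
  1.3. Then convert the dict to a pd.Series using pd.Series(json.loads(x[1:-1])).
2. Use join to add these new columns to the existing dataframe. Also, you will notice that I use double [] to select the "ClientToken" column as a dataframe.
Code + illustration: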
import pandas as pd
import json
# step 1.1
print(df.Data.apply(lambda x: x[1:-1]))
# 0 {"summaryId":"4814223456","duration":952,"star...
# 1 {"summaryId":"4814239586","duration":132,"star...
# Name: Data, dtype: object
# step 1.2
print(df.Data.apply(lambda x: json.loads(x[1:-1])))
# 0 {'summaryId': '4814223456', 'duration': 952, '...
# 1 {'summaryId': '4814239586', 'duration': 132, '...
# Name: Data, dtype: object
# step 1.3
print(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1]))))
# summaryId duration startTime
# 0 4814223456 952 1587442919
# 1 4814239586 132 1587443876
# step 2
print(df[["ClientToken"]].join(df.Data.apply(lambda x: pd.Series(json.loads(x[1:-1])))))
# ClientToken summaryId duration startTime
# 0 7a9ee887-8a09-ff9592e08245 4814223456 952 1587442919
# 1 bac49563-2cf0-cb08e69daa48 4814239586 132 1587443876
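To run the steps above end to end, here is a minimal self-contained sketch. The construction of df is an assumption on my part (the question only shows the printed frame), and the names expanded and result are just illustrative. As in the answer, it only holds when every Data string contains exactly one object.

```python
import json
import pandas as pd

# Sample frame rebuilt from the question (assumed construction);
# the Data column holds JSON *strings*, not parsed objects.
df = pd.DataFrame({
    "ClientToken": ["7a9ee887-8a09-ff9592e08245", "bac49563-2cf0-cb08e69daa48"],
    "Data": [
        '[{"summaryId":"4814223456","duration":952,"startTime":1587442919}]',
        '[{"summaryId":"4814239586","duration":132,"startTime":1587443876}]',
    ],
})

# Steps 1.1-1.3: strip the surrounding [ ], parse the object, expand to columns.
expanded = df["Data"].apply(lambda x: pd.Series(json.loads(x[1:-1])))

# Step 2: double brackets keep ClientToken as a DataFrame so .join is available.
result = df[["ClientToken"]].join(expanded)
print(result)
```

Stripping the brackets by slicing is fragile: a string with leading whitespace or more than one object inside the list will make json.loads raise an error.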
Edit 1:
It seems that in some rows the list in Data contains multiple dicts. In that case, you can try:
df[["ClientToken"]].join(df.Data.apply(lambda x: [pd.Series(y)
for y in json.loads(x)]) \
.explode() \
.apply(pd.Series))
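For the multi-dict case, the following is a sketch of a closely related variation rather than the exact expression above: it parses the whole list, explodes it to one dict per row, and builds the columns from the list of dicts. The second object in the first row is made-up sample data, and the variable names are hypothetical.

```python
import json
import pandas as pd

# Assumed sample: the first row carries two objects (the second one is invented).
df = pd.DataFrame({
    "ClientToken": ["7a9ee887-8a09-ff9592e08245", "bac49563-2cf0-cb08e69daa48"],
    "Data": [
        '[{"summaryId":"4814223456","duration":952,"startTime":1587442919},'
        ' {"summaryId":"4814229999","duration":100,"startTime":1587443000}]',
        '[{"summaryId":"4814239586","duration":132,"startTime":1587443876}]',
    ],
})

# Parse each string into a list of dicts, then emit one dict per row.
# explode() repeats the original index label for every element of a list.
exploded = df["Data"].apply(json.loads).explode()

# Build the new columns, keeping the repeated index so the join can align on it.
expanded = pd.DataFrame(exploded.tolist(), index=exploded.index)

# Each ClientToken is repeated once per dict found in its row.
result = df[["ClientToken"]].join(expanded).reset_index(drop=True)
print(result)
```

Because the join aligns on the index, a ClientToken whose list holds two dicts appears on two output rows.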
Answer 1 (score: 0)
An alternative using defaultdict and ast.literal_eval:
from collections import defaultdict
import ast
import pandas as pd

d = defaultdict(list)
# iterate through the Data column and append each value to the list for its key
for ent in df.Data:
    for entry in ast.literal_eval(ent):
        for k, v in entry.items():
            d[k].append(v)
# concat to ClientToken column
pd.concat([df.ClientToken, pd.DataFrame(d)], axis=1)
ClientToken summaryId duration startTime
0 7a9ee887-8a09-ff9592e08245 4814223456 952 1587442919
1 bac49563-2cf0-cb08e69daa48 4814239586 132 1587443876
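One caveat that may be worth checking against your real data: ast.literal_eval parses Python literals, not JSON. It handles the sample rows, but it rejects JSON-only tokens such as true, false and null. The payload below is a hypothetical example, not data from the question.

```python
import ast
import json

# Hypothetical payload containing JSON-only literals (not from the question).
payload = '[{"summaryId":"4814223456","active":true,"note":null}]'

# json.loads understands JSON literals and maps them to Python values.
parsed = json.loads(payload)
print(parsed)

# ast.literal_eval accepts only Python literals, so `true` and `null` are rejected.
try:
    ast.literal_eval(payload)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
print(literal_eval_ok)
```

Note also that pd.concat with axis=1 aligns on the index, so the defaultdict approach assumes exactly one dict per row. With more than one, the extra rows would no longer line up with ClientToken.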