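I have a DataFrame in which one column holds JSON strings representing dictionaries, and I need to expand the JSON into separate columns. Example: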
   c1  c2
0  a1  {'x1': 1, 'x3': 3, 'x2': 2}
1  a2  {'x1': 21, 'x3': 23, 'x2': 22}
should become:
   c1    x1    x2    x3
0  a1   1.0   2.0   3.0
1  a2  21.0  22.0  23.0
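My question is very similar to this thread, except that I have strings rather than dictionaries (although the strings evaluate to dictionaries), and the simple, optimized solution proposed there does not work in my case. I have a working solution, but it is clearly inefficient. Here is my code, together with the solution proposed in that thread: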
import json
import pandas as pd
def expandFeatures(df, columnName):
    """Expands column 'columnName', which contains a dictionary in form of a json string, into N single columns, each containing a single feature"""
    # get names of new columns from the first row
    features = json.loads(df.iloc[0].loc[columnName])
    featureNames = list(features.keys())
    featureNames.sort()
    # add new columns (empty values)
    newCols = list(df.columns) + featureNames
    df = df.reindex(columns=newCols, fill_value=0.0)
    # fill in the values of the new columns
    for index, row in df.iterrows():
        features = json.loads(row[columnName])
        for key, val in features.items():
            df.at[index, key] = val
    # remove column 'columnName'
    return df.drop(columns=[columnName])

def expandFeatures1(df, columnName):
    return df.drop(columnName, axis=1).join(pd.DataFrame(df[columnName].values.tolist()))

df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'], ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                       columns=['c1', 'c2'])
df_dict = pd.DataFrame([['a1', {'x1': 1, 'x2': 2, 'x3': 3}], ['a2', {'x1': 21, 'x2': 22, 'x3': 23}]],
                       columns=['c1', 'c2'])
# correct result, but inefficient
print("expandFeatures, df_json")
df = df_json.copy()
print(df)
df = expandFeatures(df, 'c2')
print(df)
# this gives an error because expandFeatures expects a string, not a dictionary
# print("expandFeatures, df_dict")
# df = df_dict.copy()
# print(df)
# df = expandFeatures(df, 'c2')
# print(df)
# WRONG, doesn't expand anything
print("expandFeatures1, df_json")
df = df_json.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)
# correct and efficient, but not my use case (I have strings not dicts)
print("expandFeatures1, df_dict")
df = df_dict.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)
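I am sure there is some obvious way to make this code more efficient and closer to the one-liner suggested in the other thread, but I cannot see it myself. Thanks in advance for any help.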
Answer 0 (score: 0)
If the JSON strings are also valid Python dictionary literals, you can parse them with ast.literal_eval:
import pandas as pd
from ast import literal_eval
df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'],
                        ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                       columns=['c1', 'c2'])
# parse each string into a dict, then build one column per key
expanded = pd.DataFrame(df_json["c2"].apply(literal_eval).to_list())
print(pd.concat([df_json, expanded], axis=1).drop("c2", axis=1))
#
   c1  x1  x2  x3
0  a1   1   2   3
1  a2  21  22  23
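As a possible variant of the same idea, the strings can be parsed with json.loads instead, which also accepts JSON-only tokens such as true, false and null that ast.literal_eval rejects. The sketch below is only an illustration: the helper name expand_json_column is hypothetical (it is not from the question or the linked thread), and it assumes every row holds a valid JSON object whose keys do not clash with existing column names.

```python
import json
import pandas as pd


def expand_json_column(df, column_name):
    """Expand a column of JSON object strings into one column per key.

    Hypothetical helper for illustration; assumes each value is a valid
    JSON object and that its keys do not collide with existing columns.
    """
    # Parse every string exactly once.
    parsed = df[column_name].map(json.loads)
    # Reuse the original index so that join() aligns rows correctly
    # even when the index is not 0..n-1.
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)
    return df.drop(columns=[column_name]).join(expanded)


df_json = pd.DataFrame(
    [['a1', '{"x1": 1, "x2": 2, "x3": 3}'],
     ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
    columns=['c1', 'c2'])

result = expand_json_column(df_json, 'c2')
print(result)
```

Passing index=df.index is a deliberate choice here: without it, the new frame gets a fresh 0..n-1 index, and the join would misalign rows on a filtered or re-indexed DataFrame.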