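I have a JSON object in a pandas dataframe column that I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components: the first is the key in the dictionary, and the second is a substring of the value stored under "key" in that dictionary.
This code works, but it is very slow when run on a large dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see anything that is not sensible/efficient/pythonic. I'm still a beginner. Thank you.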
# Import libraries
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json
# Set some display options
pd.set_option('max_colwidth',150)
# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},\
'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',\
1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',\
2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',\
3: '[]',\
4: None}})
display(df)
# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()
# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i,df.columns.get_loc("ColB")]) and len(df.iloc[i,df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i,df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        # (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
        dfTemp = pd.concat([dfTemp, y])
# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df,dfTemp, how = 'left', left_index = True, right_index = True)
print("Processed df:")
display(df)
Answer 0 (score: 4)
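First, a piece of general advice about pandas: if you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
With that in mind, we can rewrite your current procedure using the pandas 'apply' method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):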
def do_the_thing(row):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(row)
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # We don't need to re-index (the original code did this so the merge would work):
        # y.index = [df.index[i]]
        # We don't need to append to a temp df either; just return the single row as a Series
        return y.iloc[0]
    else:
        # Empty Series with an explicit dtype, so rows without data contribute only NaNs
        return pd.Series(dtype=object)

df2 = df.merge(df.ColB.apply(do_the_thing), how = 'left', left_index = True, right_index = True)
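Note that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we are able to merge, which is nice.
I think that answers your question in terms of speeding it up and making it a bit more idiomatic.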
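If building a small dataframe with pd.read_json for every cell turns out to be the expensive part, one possible alternative is to parse each cell with json.loads and collect plain dicts, then build all the extra columns in one go. This is only a sketch and has not been benchmarked here; flatten_cell is a hypothetical helper name, and unlike pd.read_json it leaves the values as strings ("8" rather than 8), so a later conversion step may be needed.

```python
import json
import pandas as pd

def flatten_cell(cell):
    """Turn one JSON string from ColB into a flat {column_name: value} dict."""
    # Null cells (None/NaN) contribute no columns
    if pd.isnull(cell):
        return {}
    flat = {}
    for record in json.loads(cell):
        # The text after the last "=" in the "key" field becomes the column suffix
        suffix = record["key"].split("=")[-1]
        for name, value in record.items():
            if name != "key":
                flat[f"{name}_{suffix}"] = value
    return flat

# Same example data as in the question
df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        '[]',
        None,
    ],
})

# One dict per row, turned into a dataframe that shares the original index
extra = pd.DataFrame([flatten_cell(c) for c in df["ColB"]], index=df.index)
# Join on the index to attach the new columns to the original rows
result = df.join(extra)
```

Rows whose ColB is '[]' or null simply end up with NaN in every new column, which matches the left merge in the original code.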
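I think you should consider, however, what you want to do with this data structure, and how you might better structure what you're doing.
Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values, for whatever purpose, this will cause you pain.
Are all the entries in ColB important? Could you get away with keeping just the first one? Do you need to know the index of a certain valA value?
These are questions you should ask yourself; then decide on a structure that will allow you to do whatever analysis you need without having to check a bunch of arbitrary things.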
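As one illustration of such a structure (an assumption about what might suit the analysis, not the only option), the JSON can be unpacked into a "long" table with one row per value and a fixed set of columns. The column names row, key, field and value are made up for this sketch.

```python
import json
import pandas as pd

# Same example data as in the question
df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        '[]',
        None,
    ],
})

records = []
for row_id, cell in df["ColB"].items():
    # Skip null cells; empty arrays naturally add nothing
    if pd.isnull(cell):
        continue
    for record in json.loads(cell):
        key = record["key"].split("=")[-1]
        for name, value in record.items():
            if name != "key":
                records.append({"row": row_id, "key": key, "field": name, "value": value})

# Fixed four-column layout, however many dictionaries each cell holds
long_df = pd.DataFrame(records, columns=["row", "key", "field", "value"])

# Example query: which original rows contain valA == "28"?
matches = long_df[(long_df["field"] == "valA") & (long_df["value"] == "28")]
```

With this layout the number of columns no longer depends on the data, and a question such as "which row has a given valA value" becomes an ordinary filter; matches["row"] can then be used to look the rows up in the original df.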