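I have a series of nested Pandas DataFrames containing several (hundreds of) arrays, and I want to average each variable over different nesting levels.
The variable mydatadf contains a very simple example that is representative of my actual data.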
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

mydata = dict()
participant = ['participantA', 'participantB']
for p in participant:
    ses = dict()
    session = ['ses_1', 'ses_2']
    for s in session:
        series = dict()
        sets = ['s_1', 's_2', 's_3']  # renamed from `set` to avoid shadowing the builtin
        for se in sets:
            reps = dict()
            rep = ['r_1', 'r_2', 'r_3', 'r_4', 'r_5']
            for r in rep:
                # two variables per repetition, 1000 samples each
                variables = {'var1': np.sin(np.random.rand(1000)*2),
                             'var2': np.sin(np.random.rand(1000)*2)}
                reps[r] = variables
            series[se] = reps
        ses[s] = series
    mydata[p] = ses
mydatadf = pd.DataFrame(mydata)
How can I efficiently average (for example) var1 over the nesting levels reps, series, ses and/or participant?
Ultimately, I want to plot all var1 objects and highlight the averaged data, at any desired nesting level, in a different color.
for p in mydatadf.keys():
    for ses in mydatadf[p].keys():
        for se in mydatadf[p][ses].keys():
            for rep in mydatadf[p][ses][se].keys():
                data = mydatadf[p][ses][se][rep]['var1']
                plt.plot(data)
plt.show()
Answer 0 (score: 1)
You can always flatten the DataFrame and do a standard groupby operation (I don't know whether it is the best option, but it works):
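# pd.json_normalize is the pandas >= 1.0 name; older versions exposed it as pd.io.json.json_normalize
df = pd.json_normalize(mydata)  # one-row DataFrame whose column names are the dot-joined key paths
df_flat = pd.DataFrame(df.T.index.str.split('.').tolist()).assign(values=df.T.values)
df_flat.head(3)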
>>            0      1    2    3     4                                             values
0  participantA  ses_1  s_1  r_1  var1  [0.7267196257553268, 0.9822775511169437, 0.991...
1  participantA  ses_1  s_1  r_1  var2  [0.6633676714415264, 0.2823588336690545, 0.977...
2  participantA  ses_1  s_1  r_2  var1  [0.2211576389168905, 0.9399581790280525, 0.645...
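Splitting dotted column names breaks if any key itself contains a dot. As an alternative sketch (not part of the original answer), the same flat table can be built by walking the nested dict directly. The helper flatten, the column names in levels, and the reduced stand-in for mydata are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def flatten(nested, path=()):
    """Yield (key_path, leaf) pairs from an arbitrarily nested dict (hypothetical helper)."""
    for key, value in nested.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value

# Reduced stand-in for `mydata` with the same nesting order:
# participant -> session -> set -> repetition -> variable
rng = np.random.default_rng(0)
mydata = {
    p: {s: {se: {r: {v: np.sin(rng.random(1000) * 2) for v in ('var1', 'var2')}
                 for r in ('r_1', 'r_2')}
            for se in ('s_1', 's_2')}
        for s in ('ses_1', 'ses_2')}
    for p in ('participantA', 'participantB')
}

# Hypothetical names for the five nesting levels
levels = ['participant', 'ses', 'series', 'rep', 'var']

# One row per leaf array; the array itself is stored in the 'values' column
rows = [dict(zip(levels, path), values=leaf) for path, leaf in flatten(mydata)]
df_named = pd.DataFrame(rows)
```

With named columns, a grouping such as df_named.groupby(['participant', 'var']) reads more clearly than grouping by the integer positions 0 and 4.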
Edit: group and apply a function (e.g. the mean):
# In this case I choose column 4, which holds the variable name ('var1' / 'var2').
# You can rename the columns with df_flat.rename(columns=...).
# Note that I use np.hstack because each group is an array of arrays.
column = 4
df_flat.groupby(column)['values'].apply(lambda x: np.hstack(x).mean())
>> 4
var1    0.707803
var2    0.707821
Name: values, dtype: float64
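Note that np.hstack(x).mean() collapses each group to a single number. For plotting an averaged trace you probably want an element-wise mean that keeps the sample axis. The following is a minimal sketch of that variant, assuming a long-format table with named columns; the column names, the helper mean_trace and the 4-sample toy arrays are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Toy long-format table: one row per array, 4 samples per array to keep it small
rng = np.random.default_rng(0)
records = [
    {'participant': p, 'ses': s, 'rep': r, 'var': v, 'values': rng.random(4)}
    for p in ('participantA', 'participantB')
    for s in ('ses_1', 'ses_2')
    for r in ('r_1', 'r_2', 'r_3')
    for v in ('var1', 'var2')
]
df_long = pd.DataFrame(records)

def mean_trace(arrays):
    """Stack equal-length 1-D arrays row-wise and average across rows (hypothetical helper)."""
    return np.vstack(arrays.tolist()).mean(axis=0)

# Average over sessions and repetitions: one trace per (participant, variable)
per_participant = {
    key: mean_trace(group)
    for key, group in df_long.groupby(['participant', 'var'])['values']
}

# Average over everything except the variable name: one trace per variable
per_var = {
    key: mean_trace(group)
    for key, group in df_long.groupby('var')['values']
}
```

Each entry is a full-length array, so something like plt.plot(per_participant[('participantA', 'var1')], color='red', linewidth=2) can be drawn on top of the individual traces to highlight the average at that level. Changing the list of grouping columns selects the nesting level to average over.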