Consider the following df:
In [448]: complex_dataframe = pd.DataFrame({'stat_A': [120, 121, 122, 123],
...: 'group_id_A': [1, 1, 1, 2],
...: 'level_A': [1, 2, 2, 1],
...: 'stat_B': [220, 221, 222, 223],
...: 'group_id_B': [1, 1, 1, 2],
...: 'level_B': [1, 1, 2, 2],
...: 'stat_C': ['aa', 'ab', 'aa', 'ab'],
...: 'measure_avg_A': [10.5, 11, 20, 12],
...: 'measure_sum_B': [10, 20, 30, 40]}
...: )
In [449]: complex_dataframe
Out[449]:
stat_A group_id_A level_A stat_B group_id_B level_B stat_C measure_avg_A measure_sum_B
0 120 1 1 220 1 1 aa 10.5 10
1 121 1 2 221 1 1 ab 11.0 20
2 122 1 2 222 1 2 aa 20.0 30
3 123 2 1 223 2 2 ab 12.0 40
A variable here is "complex" when it has all 3 columns: stat_, group_id_ and level_; a variable that has only a stat_ column is simple. So A and B above are complex variables, C is a simple one, and the columns starting with measure_ are just values.
The use case:
I need to group by all the group_id_ columns plus the simple columns. In the case above, the group-by keys are group_id_A, group_id_B and stat_C.
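That rule (group by every group_id_ column plus the stat_ column of each simple variable) can be derived from the column names alone; a minimal sketch, assuming the naming convention described above:

```python
from collections import Counter
import pandas as pd

complex_dataframe = pd.DataFrame({
    'stat_A': [120, 121, 122, 123], 'group_id_A': [1, 1, 1, 2], 'level_A': [1, 2, 2, 1],
    'stat_B': [220, 221, 222, 223], 'group_id_B': [1, 1, 1, 2], 'level_B': [1, 1, 2, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
    'measure_avg_A': [10.5, 11, 20, 12], 'measure_sum_B': [10, 20, 30, 40],
})

# count how many non-measure columns share each variable suffix (A, B, C)
non_measure = [c for c in complex_dataframe.columns if not c.startswith('measure_')]
suffix_counts = Counter(c.rsplit('_', 1)[-1] for c in non_measure)
complex_vars = [s for s, n in suffix_counts.items() if n == 3]  # stat_/group_id_/level_
simple_vars = [s for s, n in suffix_counts.items() if n == 1]   # stat_ only
grouped_cols = [f'group_id_{s}' for s in complex_vars] + [f'stat_{s}' for s in simple_vars]
print(grouped_cols)  # ['group_id_A', 'group_id_B', 'stat_C']
```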
Expected output: one row per group, with each measure_ column holding the matching stat/level/measure rows for that group.
I have coded this using multiple loops:
from collections import Counter

# columns that are not measure values
cols_without_measures = complex_dataframe.columns[
    ~complex_dataframe.columns.str.startswith("measure_")].tolist()
# variable suffix of each column (A, B, C)
suffixes = [i.split('_')[-1] for i in cols_without_measures]
counter = Counter(suffixes)
complex_cols = [k for k, v in counter.items() if v == 3]
simple_cols = [k for k in counter if k not in complex_cols]
grouped_cols = ['group_id_' + i for i in complex_cols] + ['stat_' + i for i in simple_cols]

grp = complex_dataframe.groupby(grouped_cols)
pieces = []
for k, v in grp:
    temp = v.loc[:, ~v.columns.isin(grouped_cols)]
    stat_df = temp.loc[:, ~temp.columns.str.startswith('measure_')]
    measure_df = temp.filter(like='measure_', axis=1)
    out_df = v[grouped_cols].head(1)
    for i in measure_df.columns:
        m = stat_df.copy()
        m[i] = measure_df[i]
        out_df[i] = [m.to_dict()]
    pieces.append(out_df)
complex_df = pd.concat(pieces)  # DataFrame.append is deprecated
Is there a better way to solve this, perhaps something vectorized?
Answer 0 (score: 2)
You can do this with groupby in pandas, but it relies on a lambda, so it is not a fully vectorized solution.
df_complex = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C']).apply(
    lambda x: pd.Series({
        'measure_avg_A': x[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_avg_A']].to_dict(),
        'measure_sum_B': x[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_sum_B']].to_dict()
    })).reset_index()
You can then query the frame as needed:
pd.DataFrame(df_complex.at[0, 'measure_avg_A'])
Output
stat_A level_A stat_B level_B measure_avg_A
0 120 1 220 1 10.5
2 122 2 222 2 20.0
Answer 1 (score: 2)
Use df.groupby with GroupBy.apply, then concatenate the results on axis 1 with pd.concat:
def d(x):
    return x.to_dict()

g = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])
one = g[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_avg_A']].apply(d)
two = g[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_sum_B']].apply(d)
out = pd.concat([one, two], axis=1)
out.columns = ['measure_avg_A', 'measure_sum_B']
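For reference, a self-contained version of the snippet above (a sketch using a lambda instead of the named helper); it also shows that any of the stored dicts converts straight back into a DataFrame:

```python
import pandas as pd

complex_dataframe = pd.DataFrame({
    'stat_A': [120, 121, 122, 123], 'group_id_A': [1, 1, 1, 2], 'level_A': [1, 2, 2, 1],
    'stat_B': [220, 221, 222, 223], 'group_id_B': [1, 1, 1, 2], 'level_B': [1, 1, 2, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
    'measure_avg_A': [10.5, 11, 20, 12], 'measure_sum_B': [10, 20, 30, 40],
})

g = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])
one = g[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_avg_A']].apply(lambda x: x.to_dict())
two = g[['stat_A', 'level_A', 'stat_B', 'level_B', 'measure_sum_B']].apply(lambda x: x.to_dict())
out = pd.concat([one, two], axis=1)
out.columns = ['measure_avg_A', 'measure_sum_B']

# rebuild the first group's nested dict as a DataFrame
rebuilt = pd.DataFrame(out['measure_avg_A'].iat[0])
print(rebuilt)
```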
Borrowing some of the display code from anky :P
display = print
final = out
measure_avg_A measure_sum_B
group_id_A group_id_B stat_C
1 1 aa {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,... {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,...
ab {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_... {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_...
2 2 ab {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_... {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_...
stat_A level_A stat_B level_B measure_avg_A
0 120 1 220 1 10.5
2 122 2 222 2 20.0
stat_A level_A stat_B level_B measure_avg_A
1 121 2 221 1 11.0
stat_A level_A stat_B level_B measure_sum_B
0 120 1 220 1 10
2 122 2 222 2 30
stat_A level_A stat_B level_B measure_sum_B
1 121 2 221 1 20
Answer 2 (score: 1)
Not sure whether this is better, but you can try it and let me know:
measure_cols = [*complex_dataframe.columns[complex_dataframe.columns
                                           .str.contains("measure")]]
# grouped_cols as derived in the question
grouped_cols = ['group_id_A', 'group_id_B', 'stat_C']
u = complex_dataframe.set_index(grouped_cols)
final = pd.concat([u[u.columns.difference(measure_cols, sort=False).union([i], sort=False)]
                     .groupby(grouped_cols).apply(lambda x: x.reset_index(drop=True).to_dict())
                     .rename(i) for i in measure_cols], axis=1).reset_index()
display(final)
display(pd.DataFrame(final['measure_avg_A'].iat[0]))
display(pd.DataFrame(final['measure_avg_A'].iat[1]))
display(pd.DataFrame(final['measure_sum_B'].iat[0]))
display(pd.DataFrame(final['measure_sum_B'].iat[1]))
Answer 3 (score: 0)
If you are happy to access the grouped DataFrame through more logical indexing, you can do the following:
df_grouped = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])
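To see which key tuples exist before fetching one, a small sketch: `.groups` maps each key tuple to the row labels belonging to that group (only the grouping columns are reproduced here).

```python
import pandas as pd

complex_dataframe = pd.DataFrame({
    'group_id_A': [1, 1, 1, 2],
    'group_id_B': [1, 1, 1, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
})
df_grouped = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])
# .groups maps each key tuple to the row labels of that group
print(list(df_grouped.groups))  # [(1, 1, 'aa'), (1, 1, 'ab'), (2, 2, 'ab')]
```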
You then access each group by the tuple of its grouped values:
df_grouped.get_group((1, 1, 'ab'))
which gives you the DataFrame:
stat_A group_id_A level_A stat_B group_id_B level_B stat_C measure_avg_A measure_sum_B
1 121 1 2 221 1 1 ab 11.0 20
You can iterate over the groups with:
for key, item in df_grouped:
    print(key, "\n")
    print(df_grouped.get_group(key), "\n\n")