Vectorizing code in pandas by avoiding loops

Date: 2020-09-18 11:44:18

Tags: python python-3.x pandas dataframe

Consider the following df:

In [448]: complex_dataframe = pd.DataFrame({'stat_A': [120, 121, 122, 123],
     ...:                                       'group_id_A': [1, 1, 1, 2],
     ...:                                       'level_A': [1, 2, 2, 1],
     ...:                                       'stat_B': [220, 221, 222, 223],
     ...:                                       'group_id_B': [1, 1, 1, 2],
     ...:                                       'level_B': [1, 1, 2, 2],
     ...:                                       'stat_C': ['aa', 'ab', 'aa', 'ab'],
     ...:                                       'measure_avg_A': [10.5, 11, 20, 12],
     ...:                                       'measure_sum_B': [10, 20, 30, 40]}
     ...:                                      )

In [449]: complex_dataframe
Out[449]: 
   stat_A  group_id_A  level_A  stat_B  group_id_B  level_B stat_C  measure_avg_A  measure_sum_B
0     120           1        1     220           1        1     aa           10.5             10
1     121           1        2     221           1        1     ab           11.0             20
2     122           1        2     222           1        2     aa           20.0             30
3     123           2        1     223           2        2     ab           12.0             40

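Here, variables that have all 3 columns (stat_, group_id_ and level_) are complex columns, and variables that have only a stat_ column are simple columns.

So, in the frame above, A and B are complex columns. C is a simple column, and the columns starting with measure_ are just values.

The use case is:

I need to group by all group_id_ columns and all simple columns. In the case above, the grouping is on group_id_A, group_id_B and stat_C.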
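The rule described above can be sketched as a small helper that works on column names alone. This is only an illustration: the function name `grouping_columns` and the prefix tuple are assumptions made for this sketch, not part of the original question.

```python
# Hypothetical helper: derive the group-by keys from the column names alone.
PREFIXES = ('stat_', 'group_id_', 'level_')


def grouping_columns(columns):
    # Map each variable suffix (A, B, C, ...) to the prefixes seen for it
    prefixes_by_suffix = {}
    for col in columns:
        for prefix in PREFIXES:
            if col.startswith(prefix):
                prefixes_by_suffix.setdefault(col[len(prefix):], set()).add(prefix)

    keys = []
    for suffix, found in prefixes_by_suffix.items():
        if found == set(PREFIXES):
            # Complex variable: group on its group_id_ column
            keys.append('group_id_' + suffix)
        elif found == {'stat_'}:
            # Simple variable: group on its stat_ column
            keys.append('stat_' + suffix)
    return keys


columns = ['stat_A', 'group_id_A', 'level_A',
           'stat_B', 'group_id_B', 'level_B',
           'stat_C', 'measure_avg_A', 'measure_sum_B']
print(grouping_columns(columns))
```

Columns starting with measure_ match none of the three prefixes, so they are ignored when the keys are built.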

Expected output:

(image of the expected output; not preserved)

where the measure_A column looks like this:

(image of the measure_A cell contents; not preserved)

I have coded this using multiple loops.

from collections import Counter

import pandas as pd

# Keep the non-measure columns and reduce each name to its suffix (A, B, C)
cols_without_measures = complex_dataframe.loc[:, ~complex_dataframe.columns.str.startswith("measure_")].columns.tolist()
cols_without_measures = [i.split('_')[-1] for i in cols_without_measures]
counter = Counter(cols_without_measures)

# A suffix that occurs 3 times (stat_, group_id_, level_) is a complex column
complex_cols = [k for k, v in counter.items() if v == 3]
simple_cols = list(set(list(counter.keys())).symmetric_difference(set(complex_cols)))
grouped_cols = ['group_id_' + i for i in complex_cols] + ['stat_' + i for i in simple_cols]

# The original used self.df_in here, which is undefined outside its class
grp = complex_dataframe.groupby(grouped_cols)
pieces = []

for k, v in grp:
    temp = v.loc[:, ~v.columns.isin(grouped_cols)]
    stat_df = temp.loc[:, ~temp.columns.str.startswith('measure_')]
    measure_df = temp.filter(like='measure_', axis=1)
    out_df = v[grouped_cols].head(1).copy()  # copy so the assignment below is safe
    for i in measure_df.columns:
        m = stat_df.copy()
        m[i] = measure_df[i]
        out_df[i] = [m.to_dict()]
    pieces.append(out_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
complex_df = pd.concat(pieces)

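Is there a better way to approach this? Perhaps by vectorizing it in some way.

4 answers:

Answer 0: (score: 2)

You can do this with groupby in pandas, but it relies on a lambda, so it is not a fully vectorized solution.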

df_complex = complex_dataframe.groupby(['group_id_A','group_id_B','stat_C']).apply(
    lambda x: pd.Series({
        'measure_avg_A': x[['stat_A','level_A','stat_B','level_B','measure_avg_A']].to_dict(),
        'measure_sum_B': x[['stat_A','level_A','stat_B','level_B','measure_sum_B']].to_dict()
    })).reset_index()

You can then query the dataframe as needed:

pd.DataFrame(df_complex.at[0, 'measure_avg_A'])

Output

   stat_A  level_A  stat_B  level_B  measure_avg_A
0     120        1     220        1           10.5
2     122        2     222        2           20.0
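The groupby call above spells out every column name by hand. A minimal sketch of the same idea with the column lists derived from the frame is shown below. The names `group_keys`, `detail_cols` and `pack_measures` are assumptions introduced for this sketch, and the function is still applied once per group, so this is no more vectorized than the lambda version.

```python
import pandas as pd

complex_dataframe = pd.DataFrame({'stat_A': [120, 121, 122, 123],
                                  'group_id_A': [1, 1, 1, 2],
                                  'level_A': [1, 2, 2, 1],
                                  'stat_B': [220, 221, 222, 223],
                                  'group_id_B': [1, 1, 1, 2],
                                  'level_B': [1, 1, 2, 2],
                                  'stat_C': ['aa', 'ab', 'aa', 'ab'],
                                  'measure_avg_A': [10.5, 11, 20, 12],
                                  'measure_sum_B': [10, 20, 30, 40]})

# Grouping keys as stated in the question
group_keys = ['group_id_A', 'group_id_B', 'stat_C']
# Measure columns are detected by prefix; everything else is a detail column
measure_cols = [c for c in complex_dataframe.columns if c.startswith('measure_')]
detail_cols = [c for c in complex_dataframe.columns
               if c not in group_keys and c not in measure_cols]


def pack_measures(group):
    # One dict per measure: the detail columns plus that single measure column
    return pd.Series({m: group[detail_cols + [m]].to_dict() for m in measure_cols})


# Selecting the columns before apply keeps the grouping keys out of each group
df_complex = (complex_dataframe
              .groupby(group_keys)[detail_cols + measure_cols]
              .apply(pack_measures)
              .reset_index())

# Rebuild the first group's frame from its packed dict
print(pd.DataFrame(df_complex.at[0, 'measure_avg_A']))
```

Adding a new measure_ column to the input then needs no change to the grouping code.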

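Answer 1: (score: 2)

Use df.groupby together with GroupBy.apply (df here is the question's complex_dataframe), and concatenate the results along axis 1 with pd.concat: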

def d(x):
    return x.to_dict()
g = df.groupby(['group_id_A','group_id_B','stat_C'])
one = g[['stat_A','level_A','stat_B','level_B','measure_avg_A']].apply(d)
two = g[['stat_A','level_A','stat_B','level_B','measure_sum_B']].apply(d)

out = pd.concat([one, two], axis=1)
out.columns = ['measure_avg_A', 'measure_sum_B']

Borrowing some code from anky :P

display = print
final = out 

                                                                  measure_avg_A                                      measure_sum_B
group_id_A group_id_B stat_C                                                                                                      
1          1          aa      {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,...  {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,...
                      ab      {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_...  {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_...
2          2          ab      {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_...  {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_...

   stat_A  level_A  stat_B  level_B  measure_avg_A
0     120        1     220        1           10.5
2     122        2     222        2           20.0

   stat_A  level_A  stat_B  level_B  measure_avg_A
1     121        2     221        1           11.0

   stat_A  level_A  stat_B  level_B  measure_sum_B
0     120        1     220        1             10
2     122        2     222        2             30

   stat_A  level_A  stat_B  level_B  measure_sum_B
1     121        2     221        1             20

Answer 2: (score: 1)

Not sure whether this is any better, but you can try it and let me know:

measure_cols = [*complex_dataframe.columns[complex_dataframe.columns
                                           .str.contains("measure")]]
u = complex_dataframe.set_index(grouped_cols)

final=pd.concat([u[u.columns.difference(measure_cols,sort=False).union([i],sort=False)]
        .groupby(grouped_cols).apply(lambda x: x.reset_index(drop=True).to_dict())
        .rename(i)   for i in measure_cols],axis=1).reset_index()

display(final)
display(pd.DataFrame(final['measure_avg_A'].iat[0]))
display(pd.DataFrame(final['measure_avg_A'].iat[1]))
display(pd.DataFrame(final['measure_sum_B'].iat[0]))
display(pd.DataFrame(final['measure_sum_B'].iat[1]))


Answer 3: (score: 0)

If you are happy to access the grouped DataFrame through a more logical index, you can do the following:

df_grouped = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])

You then access each group by the tuple of values it was grouped on:

df_grouped.get_group((1, 1, 'ab'))

which gives you the dataframe

   stat_A  group_id_A  level_A  stat_B  group_id_B  level_B stat_C  measure_avg_A  measure_sum_B
1     121           1        2     221           1        1     ab           11.0             20

You can iterate over the groups with:

for key, item in df_grouped:
    print(key, "\n")
    print(df_grouped.get_group(key), "\n\n")
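Iterating over a GroupBy object already yields each sub-frame, so the second lookup through get_group in the loop above is not strictly needed. A minimal sketch, assuming you want to keep every group in a plain dict keyed by its tuple (the name `groups` is hypothetical):

```python
import pandas as pd

complex_dataframe = pd.DataFrame({'stat_A': [120, 121, 122, 123],
                                  'group_id_A': [1, 1, 1, 2],
                                  'level_A': [1, 2, 2, 1],
                                  'stat_B': [220, 221, 222, 223],
                                  'group_id_B': [1, 1, 1, 2],
                                  'level_B': [1, 1, 2, 2],
                                  'stat_C': ['aa', 'ab', 'aa', 'ab'],
                                  'measure_avg_A': [10.5, 11, 20, 12],
                                  'measure_sum_B': [10, 20, 30, 40]})

df_grouped = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])

# Each iteration yields (key tuple, sub-frame); collect them in a dict
groups = {key: frame for key, frame in df_grouped}

# Look a group up by the tuple of values it was grouped on
print(groups[(1, 1, 'ab')])
```

The dict holds one sub-frame per group, so this trades memory for repeated lookups. For a single lookup, get_group on the GroupBy object is enough.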