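Group rows by cumulative sum

Time: 2020-03-04 19:47:32

Tags: python pandas pandas-groupby cumulative-sum data-wrangling

I'm working on a problem that involves grouping rows by the cumulative sum of an attribute (after sorting). I'm new to Python, though, and don't know how to approach it. Please advise; any help is appreciated.

Here is my input, a pandas DataFrame that I built. As you can see, neither key nor group is sorted.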

key  group v1  v2
1_A   1    22  4
1_A  -1    10  11
1_B   2    15  9
1_B   6    1   2
1_A   2    33  43
1_A   5    50  22
1_A   3    5   122
1_B   1    30  8
1_A   4    1   2

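For the processing, I need the cumulative sum of v1 in group order, computed over rows that share the same key. So I think I should sort the table first, but I'm not sure; suggestions are welcome. If the table does need sorting first, the new table looks like this. Basically, I put rows with the same key together and sort those rows by group.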

key  group v1  v2
1_A  -1    10  11
1_A   1    22  4
1_A   2    33  43
1_A   3    5   122
1_A   4    1   2
1_A   5    50  22
1_B   1    30  8
1_B   2    15  9
1_B   6    1   2
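
One way the sorting step and the per-key running total might be written is sketched below. It assumes a recent pandas version where `sort_values` is available; the names `df_sorted` and `cum_v1` are only illustrative and do not appear in the question.

```python
import pandas as pd

# Same rows as the unsorted input table above.
df = pd.DataFrame(
    [["1_A", 1, 22, 4], ["1_A", -1, 10, 11], ["1_B", 2, 15, 9],
     ["1_B", 6, 1, 2], ["1_A", 2, 33, 43], ["1_A", 5, 50, 22],
     ["1_A", 3, 5, 122], ["1_B", 1, 30, 8], ["1_A", 4, 1, 2]],
    columns=["key", "group", "v1", "v2"],
)

# Sort by key first, then by group within each key.
df_sorted = df.sort_values(["key", "group"]).reset_index(drop=True)

# Running total of v1 within each key, in group order (illustrative column).
df_sorted["cum_v1"] = df_sorted.groupby("key")["v1"].cumsum()
print(df_sorted)
```

The plain running total does not reset at the threshold, so it only confirms the ordering; the binning rule still needs a separate pass.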

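Here is the output I want. Essentially, I need the cumulative sum of v1 in group order; once the cumulative sum reaches a threshold (say 30), accumulation stops and restarts from the next row. This continues until the last row with the same key is reached. Finally, if the last bin sums to less than 30, it is merged with the previous bin, as shown for 1_B, where groups 2 and 6 together total only 16 (< 30), so they need to be merged with group 1.

Note that the bin numbers may differ from the ones I show here. Any labeling works as long as rows that belong together get the same bin number. For example, you could replace 1, 2, 3 with A, B, C, or 3, 2, 1, or A100, B201, M434.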

key  group v1  v2    bin       sum_v1    sum_v2
1_A  -1    10  11    1         32        15
1_A   1    22  4     1         32        15
1_A   2    33  43    2         33        43
1_A   3    5   122   3         56        146
1_A   4    1   2     3         56        146
1_A   5    50  22    3         56        146
1_B   1    30  8     1         46        19
1_B   2    15  9     1         46        19
1_B   6    1   2     1         46        19
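
The binning rule described above can be sketched for a single key as a plain Python function. This is a minimal sketch, not the method used in the answer: `assign_bins` is a hypothetical helper, and it assumes the v1 values are already in group order.

```python
def assign_bins(values, thresh):
    """Return one bin label per value, following the rule described above."""
    bins = []
    current_bin = 1
    running = 0
    for v in values:
        # Start a new bin once the current bin's running sum has reached thresh.
        if bins and running >= thresh:
            current_bin += 1
            running = 0
        running += v
        bins.append(current_bin)
    # If the last bin fell short of thresh, fold it into the previous bin.
    if current_bin > 1 and running < thresh:
        bins = [b - 1 if b == current_bin else b for b in bins]
    return bins


print(assign_bins([10, 22, 33, 5, 1, 50], 30))  # v1 values for key 1_A
print(assign_bins([30, 15, 1], 30))             # v1 values for key 1_B
```

A key whose only bin stays below the threshold keeps that single bin, since there is nothing to merge it into.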

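Edit: I have now posted a complete solution below as an answer. Enjoy.

1 answer:

Answer 0 (score: 1)

I came up with a solution. The task as a whole confused me, but once I realized it could be broken into small pieces, I was able to solve those smaller tasks one at a time. The process was not hard; planning was the hard part. So I'm sharing my result here in case anyone runs into the same confusion (I noticed the two favorite stars, which means someone is interested). Voilà!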


import pandas as pd
# Input rows from the question, deliberately unsorted.
data = [['1_A',1, 22, 4],['1_A', -1, 10, 11 ],['1_B',2, 15, 9],['1_B',6, 1, 2],['1_A',2, 33, 43 ],['1_A',5, 50, 22 ],['1_A',3, 5 , 122],['1_B',1, 30, 8],['1_A',4, 1 , 2]]
df_1 = pd.DataFrame(data, columns = ['key', 'group', 'v1', 'v2'])
# DataFrame.sort was removed from pandas; sort_values is its replacement.
df_2 = df_1.sort_values(['key', 'group'])
# Pass 1: walk the sorted rows and start a new bin once the running v1 sum has reached thresh.
def f1(df, thresh):
    myList = [] 
    bin = 0     
    sum_v1 = 0     
    sum_v2 = 0   
    new_df = pd.DataFrame(columns = ['key', 'group', 'v1', 'v2', 'sum_v1', 'sum_v2', 'bin']) 
    for i, (key, group, v1, v2) in df.iterrows(): 
        if key not in myList:
            myList.append(key) 
            bin = 1
            sum_v1 = v1
            sum_v2 = v2
        else:
            if sum_v1 < thresh:
                bin += 0
                sum_v1 += v1
                sum_v2 += v2
            else:
                bin += 1
                sum_v1 = v1
                sum_v2 = v2
        new_df.loc[i, ['key']] = key
        new_df.loc[i, ['group']] = group
        new_df.loc[i, ['v1']] = v1
        new_df.loc[i, ['v2']] = v2
        new_df.loc[i, ['sum_v1']] = sum_v1
        new_df.loc[i, ['sum_v2']] = sum_v2
        new_df.loc[i, ['bin']] = bin
    return new_df

new_df_2 = f1(df_2, 30)
df_3 = new_df_2.groupby(['key', 'bin']).agg({'v1': "sum", 'v2': "sum"}).reset_index()
df_3.rename(columns={'v2': 'a_c_sum_v2', 'v1': 'a_c_sum_v1'}, inplace=True)
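# Pass 2: walk each key's bins in reverse so that a short last bin is merged into the bin before it.
def f2(df, thresh):
    df_tmp = df.sort_values(['key', 'bin'], ascending=[True, False])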
    myList = [] 
    bin_d = 0 
    sum_v1_d = 0   
    sum_v2_d = 0  
    new_df = pd.DataFrame(columns = ['key', 'bin', 'a_c_sum_v1', 'a_c_sum_v2', 'sum_v1_d', 'sum_v2_d', 'bin_d']) 
    for i, (key, bin, v1, v2) in df_tmp.iterrows(): 
        if key not in myList:
            myList.append(key) 
            bin_d = 1
            sum_v1_d = v1
            sum_v2_d = v2
        else:
            if sum_v1_d < thresh:
                bin_d += 0
                sum_v1_d += v1
                sum_v2_d += v2
            else:
                bin_d += 1
                sum_v1_d = v1
                sum_v2_d = v2
        new_df.loc[i, ['key']] = key
        new_df.loc[i, ['bin']] = bin
        new_df.loc[i, ['a_c_sum_v1']] = v1
        new_df.loc[i, ['a_c_sum_v2']] = v2
        new_df.loc[i, ['sum_v1_d']] = sum_v1_d
        new_df.loc[i, ['sum_v2_d']] = sum_v2_d
        new_df.loc[i, ['bin_d']] = bin_d
    return new_df

new_df_3 = f2(df_3, 30)
df_4 = new_df_3.groupby(['key', 'bin_d']).agg({'a_c_sum_v1': "sum", 'a_c_sum_v2': "sum"}).reset_index()
df_4.rename(columns={'a_c_sum_v2': 'sum_v2', 'a_c_sum_v1': 'sum_v1'}, inplace=True)
# Map the merged bin labels (bin_d) and their sums back onto the original rows.
m_1 = pd.merge(new_df_3[['key', 'bin', 'bin_d']], df_4[['key', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'bin_d'], right_on=['key', 'bin_d'], how='left')
m_2 = pd.merge(new_df_2[['key', 'group', 'bin']], m_1[['key', 'bin', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'bin'], right_on=['key', 'bin'], how='left')
m_3 = pd.merge(df_1[['key', 'group', 'v1', 'v2']], m_2[['key', 'group', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'group'], right_on=['key', 'group'], how='left')
m_3.rename(columns={'bin_d': 'bin'}, inplace=True)
print(m_3.sort_values(['key', 'group']))
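
The three merges above attach the per-bin totals back to the rows. As a possible alternative, once every row has a bin label, `groupby(...).transform("sum")` can produce the same totals without a merge. The sketch below starts from the desired-output table in the question rather than from `m_3`, so the `bin` column here is copied from that table, not computed.

```python
import pandas as pd

# Rows and bin labels copied from the desired-output table in the question.
df = pd.DataFrame(
    {
        "key": ["1_A"] * 6 + ["1_B"] * 3,
        "group": [-1, 1, 2, 3, 4, 5, 1, 2, 6],
        "v1": [10, 22, 33, 5, 1, 50, 30, 15, 1],
        "v2": [11, 4, 43, 122, 2, 22, 8, 9, 2],
        "bin": [1, 1, 2, 3, 3, 3, 1, 1, 1],
    }
)

# transform("sum") returns one value per original row, so no merge is needed.
sums = df.groupby(["key", "bin"])[["v1", "v2"]].transform("sum")
df["sum_v1"] = sums["v1"]
df["sum_v2"] = sums["v2"]
print(df)
```

Because the totals are keyed on (key, bin), the result is the same whatever labels the bins carry, which matches the note in the question that bin numbers are arbitrary.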