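Split a DataFrame into groups, apply a calculation, and recombine into one DataFrame

Asked: 2015-04-21 20:49:26

Tags: python python-2.7 pandas dataframe

I'm having trouble splitting up my DataFrame, applying a calculation to each part, and putting it back together.

This is what my DataFrame looks like: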

Priority    ID      Name    Coverage    Group
1           1000    Name 1  33         Group A
2           1001    Name 2  67         Group A
3           1002    Name 3  100        Group A
4           1003    Name 4  33         Group B
5           1004    Name 5  67         Group B
6           1005    Name 6  100        Group B
7           1006    Name 7  33         Group C
8           1007    Name 8  67         Group C
9           1008    Name 9  100        Group C
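
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the integer dtypes and the variable name `df` are assumptions, and the column names are taken from the table header.

```python
import pandas as pd

# Sample data copied from the table above; integer dtypes are assumed.
df = pd.DataFrame(
    {
        'Priority': list(range(1, 10)),
        'ID': list(range(1000, 1009)),
        'Name': ['Name {}'.format(i) for i in range(1, 10)],
        'Coverage': [33, 67, 100] * 3,
        'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
    },
    columns=['Priority', 'ID', 'Name', 'Coverage', 'Group'],
)
```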

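I'm trying to create a new "Effective Coverage" column, because my current "Coverage" column is a cumulative value within each "Group". For example, "Name 3" is part of "Group A" and actually has a coverage of 33 (100 - 67).

The final result I'm hoping to achieve is: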

Priority    ID  Name    Coverage    Group   Effective Coverage
1          1000 Name 1  33          Group A 33
2          1001 Name 2  67          Group A 34
3          1002 Name 3  100         Group A 33
4          1003 Name 4  33          Group B 33
5          1004 Name 5  67          Group B 34
6          1005 Name 6  100         Group B 33
7          1006 Name 7  33          Group C 33
8          1007 Name 8  67          Group C 34
9          1008 Name 9  100         Group C 33

Here is what I have done so far:

for group in groups:

    effective_coverage = [df[df['group']==group].coverage.iloc[0]]

    for i in range(1,len(df[df['group']==group].placementID)):
        ecov = df[df['group']==group].coverage.iloc[i] - df[df['group']==group].coverage.iloc[i-1]
        effective_coverage.append(ecov)

    effective_coverage = pd.Series(effective_coverage, name='effective_coverage')

    print effective_coverage

    df[df['group']==group] = df[df['group']==group].join(effective_coverage)
    print df[df['group']==group]

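I know the logic I use to calculate the effective coverage is correct, because for each group it prints the correct effective coverage values of 33, 34, 33.

However, when I try to join the effective coverage Series and print the DataFrame for one of the groups, it only returns: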

Priority    ID  Name    Coverage    Group
1          1000 Name 1  33          Group A
2          1001 Name 2  67          Group A
3          1002 Name 3  100         Group A

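and my newly calculated effective coverage has not been joined.

Any ideas? I'm a total Python noob, so I'd love to hear about more elegant ways to accomplish this, if anyone has one.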

3 Answers:

Answer 0: (score: 1)

You can write a custom split_cumsum function that computes Effective Coverage for a single group:

In [33]: def split_cumsum(grp):
   .....:     grp['Effective Coverage'] = grp['Coverage']
   .....:     grp['Effective Coverage'][1:] = np.diff(grp['Coverage'])
   .....:     return grp

Then apply split_cumsum over df.groupby('Group'):

In [34]: df.groupby('Group').apply(split_cumsum)
Out[34]:
   Priority    ID    Name  Coverage    Group  Effective Coverage
0         1  1000  Name 1        33  Group A                  33
1         2  1001  Name 2        67  Group A                  34
2         3  1002  Name 3       100  Group A                  33
3         4  1003  Name 4        33  Group B                  33
4         5  1004  Name 5        67  Group B                  34
5         6  1005  Name 6       100  Group B                  33
6         7  1006  Name 7        33  Group C                  33
7         8  1007  Name 8        67  Group C                  34
8         9  1008  Name 9       100  Group C                  33
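
The chained assignment `grp['Effective Coverage'][1:] = ...` in split_cumsum may trigger a SettingWithCopyWarning on some pandas versions. A possible variant of the same np.diff idea is sketched below. It is an adaptation, not the code from this answer: it assumes the function receives only the Coverage Series of each group and is run through transform rather than apply, and it uses a trimmed-down df with just the two columns it needs.

```python
import numpy as np
import pandas as pd

# Reduced sample frame (only the columns needed here); values from the question.
df = pd.DataFrame({
    'Coverage': [33, 67, 100] * 3,
    'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
})

def split_cumsum(coverage):
    # coverage is the cumulative 'Coverage' Series of a single group.
    values = coverage.to_numpy()
    effective = values.copy()          # first row keeps its cumulative value
    effective[1:] = np.diff(values)    # later rows: difference to the previous row
    return pd.Series(effective, index=coverage.index)

# transform returns a result aligned with the original index of df.
df['Effective Coverage'] = df.groupby('Group')['Coverage'].transform(split_cumsum)
```

Because the function builds a new array instead of writing into a slice of the group, no view of the original frame is modified.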

Answer 1: (score: 0)

Alternatively, you can use diff on the groups:

In [53]: df['Effective Coverage'] = df.groupby('Group')['Coverage'].diff()

In [54]: df
Out[54]:
   Priority    ID    Name  Coverage    Group  Effective Coverage
0         1  1000  Name 1        33  Group A                 NaN
1         2  1001  Name 2        67  Group A                  34
2         3  1002  Name 3       100  Group A                  33
3         4  1003  Name 4        33  Group B                 NaN
4         5  1004  Name 5        67  Group B                  34
5         6  1005  Name 6       100  Group B                  33
6         7  1006  Name 7        33  Group C                 NaN
7         8  1007  Name 8        67  Group C                  34
8         9  1008  Name 9       100  Group C                  33

Then fill the NaN values with the values from Coverage:

In [55]: df['Effective Coverage'] = df['Effective Coverage'].fillna(df['Coverage'])

In [56]: df
Out[56]:
   Priority    ID    Name  Coverage    Group  Effective Coverage
0         1  1000  Name 1        33  Group A                  33
1         2  1001  Name 2        67  Group A                  34
2         3  1002  Name 3       100  Group A                  33
3         4  1003  Name 4        33  Group B                  33
4         5  1004  Name 5        67  Group B                  34
5         6  1005  Name 6       100  Group B                  33
6         7  1006  Name 7        33  Group C                  33
7         8  1007  Name 8        67  Group C                  34
8         9  1008  Name 9       100  Group C                  33
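
The two steps above can also be chained into one expression. The sketch below assumes the Coverage values are integers, so that casting back with astype(int) is safe (diff yields floats because of the NaN in the first row of each group); it uses a reduced df with only the needed columns.

```python
import pandas as pd

# Reduced sample frame (only the columns needed here); values from the question.
df = pd.DataFrame({
    'Coverage': [33, 67, 100] * 3,
    'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
})

# Per-group difference; the first row of each group is NaN and is
# filled with that row's own Coverage value (fillna aligns on the index).
df['Effective Coverage'] = (
    df.groupby('Group')['Coverage']
      .diff()
      .fillna(df['Coverage'])
      .astype(int)  # assumes integer coverage values
)
```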

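Answer 2: (score: 0)

If the Coverage column is a cumulative total, then the maximum value of the column within a group is that group's total. I have changed your coverage values so you can see what the groupby is doing. The per-group maximum is then joined back onto the original DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Priority': np.arange(1, 10),
    'ID': np.arange(1000, 1009),
    'Name': ['Name {0}'.format(i) for i in np.arange(1, 10)],
    'Coverage': [33, 67, 100, 11, 22, 33, 67, 124, 200],
    'Group': ['Group {0}'.format(i) for i in ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']],
})[['Priority', 'ID', 'Name', 'Coverage', 'Group']]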

df2 = df.join(df.groupby('Group').Coverage.max(), on='Group', rsuffix='_max')

Then you just add a new column that divides each Coverage value by its group maximum:

df2['Effective Coverage'] = df2.Coverage.divide(df2.Coverage_max)

>>> df2
   Priority    ID    Name  Coverage    Group  Coverage_max  Effective Coverage
0         1  1000  Name 1        33  Group A           100            0.330000
1         2  1001  Name 2        67  Group A           100            0.670000
2         3  1002  Name 3       100  Group A           100            1.000000
3         4  1003  Name 4        11  Group B            33            0.333333
4         5  1004  Name 5        22  Group B            33            0.666667
5         6  1005  Name 6        33  Group B            33            1.000000
6         7  1006  Name 7        67  Group C           200            0.335000
7         8  1007  Name 8       124  Group C           200            0.620000
8         9  1008  Name 9       200  Group C           200            1.000000
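
The same ratio can probably be computed without the intermediate join by broadcasting the group maximum with transform('max'). This is a sketch under the assumption that the helper column Coverage_max is not needed afterwards; the name group_max is introduced here for illustration only. As in this answer, the result is each row's share of its group total, not the row-to-row difference.

```python
import pandas as pd

# Reduced sample frame using the modified coverage values from this answer.
df = pd.DataFrame({
    'Coverage': [33, 67, 100, 11, 22, 33, 67, 124, 200],
    'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
})

# transform('max') repeats each group's maximum on every row of that group.
group_max = df.groupby('Group')['Coverage'].transform('max')

# Cumulative coverage as a fraction of the group total.
df['Effective Coverage'] = df['Coverage'] / group_max
```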