I'm having trouble splitting my DataFrame, applying a calculation, and putting it back together.
This is my DataFrame:
Priority ID Name Coverage Group
1 1000 Name 1 33 Group A
2 1001 Name 2 67 Group A
3 1002 Name 3 100 Group A
4 1003 Name 4 33 Group B
5 1004 Name 5 67 Group B
6 1005 Name 6 100 Group B
7 1006 Name 7 33 Group C
8 1007 Name 8 67 Group C
9 1008 Name 9 100 Group C
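I'm trying to create a new "Effective Coverage" column, because my current "Coverage" column is cumulative within each "Group". For example, "Name 3" is part of "Group A" and actually has a coverage of 33 (100 - 67).
The end result I'm hoping for is: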
Priority ID Name Coverage Group Effective Coverage
1 1000 Name 1 33 Group A 33
2 1001 Name 2 67 Group A 34
3 1002 Name 3 100 Group A 33
4 1003 Name 4 33 Group B 33
5 1004 Name 5 67 Group B 34
6 1005 Name 6 100 Group B 33
7 1006 Name 7 33 Group C 33
8 1007 Name 8 67 Group C 34
9 1008 Name 9 100 Group C 33
Here is what I've done so far:
for group in groups:
    effective_coverage = [df[df['group']==group].coverage.iloc[0]]
    for i in range(1,len(df[df['group']==group].placementID)):
        ecov = df[df['group']==group].coverage.iloc[i] - df[df['group']==group].coverage.iloc[i-1]
        effective_coverage.append(ecov)
    effective_coverage = pd.Series(effective_coverage, name='effective_coverage')
    print(effective_coverage)
    df[df['group']==group] = df[df['group']==group].join(effective_coverage)
    print(df[df['group']==group])
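I know the logic I'm using to compute the effective coverage is correct, because for each group it prints the correct effective coverage values 33, 34, 33.
However, when I try to join the effective coverage Series and print the DataFrame for one of the groups, it only returns: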
Priority ID Name Coverage Group
1 1000 Name 1 33 Group A
2 1001 Name 2 67 Group A
3 1002 Name 3 100 Group A
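and my newly computed effective coverage has not been joined.
Any ideas? I'm a complete Python noob, so I'd also love to hear about a more elegant way to do this, if there is one.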
Answer 0: (score: 1)
You can write a custom split_cumsum function that computes Effective Coverage:
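In [33]: def split_cumsum(grp):
   .....:     grp = grp.copy()  # work on a copy instead of a view of the group
   .....:     # prepend=0 keeps the first value; later rows become row-to-row differences
   .....:     grp['Effective Coverage'] = np.diff(grp['Coverage'], prepend=0)
   .....:     return grp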
Then apply split_cumsum over df.groupby('Group'):
In [34]: df.groupby('Group', group_keys=False).apply(split_cumsum)
Out[34]:
Priority ID Name Coverage Group Effective Coverage
0 1 1000 Name 1 33 Group A 33
1 2 1001 Name 2 67 Group A 34
2 3 1002 Name 3 100 Group A 33
3 4 1003 Name 4 33 Group B 33
4 5 1004 Name 5 67 Group B 34
5 6 1005 Name 6 100 Group B 33
6 7 1006 Name 7 33 Group C 33
7 8 1007 Name 8 67 Group C 34
8 9 1008 Name 9 100 Group C 33
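For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the same idea. It is a variation rather than the exact code above: it uses transform on the Coverage column instead of apply on the whole group, and the helper name undo_cumsum is made up for this illustration. It assumes rows are already sorted by Priority within each group.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    'Priority': range(1, 10),
    'ID': range(1000, 1009),
    'Name': ['Name {0}'.format(i) for i in range(1, 10)],
    'Coverage': [33, 67, 100] * 3,
    'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
})


def undo_cumsum(coverage):
    """Hypothetical helper: turn one group's cumulative values into per-row values."""
    # prepend=0 makes the first difference equal to the first value itself.
    return pd.Series(np.diff(coverage, prepend=0), index=coverage.index)


# transform keeps the original row index, so the result lines up with df directly.
df['Effective Coverage'] = df.groupby('Group')['Coverage'].transform(undo_cumsum)
print(df)
```

Because transform returns a result aligned to the original index, there is no need to stitch the groups back together by hand.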
Answer 1: (score: 0)
Alternatively, you can use diff on the groups:
In [53]: df['Effective Coverage'] = df.groupby('Group')['Coverage'].diff()
In [54]: df
Out[54]:
Priority ID Name Coverage Group Effective Coverage
0 1 1000 Name 1 33 Group A NaN
1 2 1001 Name 2 67 Group A 34
2 3 1002 Name 3 100 Group A 33
3 4 1003 Name 4 33 Group B NaN
4 5 1004 Name 5 67 Group B 34
5 6 1005 Name 6 100 Group B 33
6 7 1006 Name 7 33 Group C NaN
7 8 1007 Name 8 67 Group C 34
8 9 1008 Name 9 100 Group C 33
Then fill the NaN values using the Coverage column:
In [55]: df['Effective Coverage'] = df['Effective Coverage'].fillna(df['Coverage'])
In [56]: df
Out[56]:
Priority ID Name Coverage Group Effective Coverage
0 1 1000 Name 1 33 Group A 33
1 2 1001 Name 2 67 Group A 34
2 3 1002 Name 3 100 Group A 33
3 4 1003 Name 4 33 Group B 33
4 5 1004 Name 5 67 Group B 34
5 6 1005 Name 6 100 Group B 33
6 7 1006 Name 7 33 Group C 33
7 8 1007 Name 8 67 Group C 34
8 9 1008 Name 9 100 Group C 33
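One detail worth knowing: diff introduces NaN, so the new column ends up as floating point even after fillna. The two steps can be chained and cast back to integers, as in the sketch below. It uses a cut-down two-group copy of the data and assumes Coverage only ever holds whole numbers; the intermediate name step is just for illustration.

```python
import pandas as pd

# Cut-down version of the question's data (two groups are enough to show the idea).
df = pd.DataFrame({
    'Coverage': [33, 67, 100, 33, 67, 100],
    'Group': ['Group A'] * 3 + ['Group B'] * 3,
})

# Row-to-row difference within each group; the first row of each group is NaN.
step = df.groupby('Group')['Coverage'].diff()

# Fill those NaNs with the row's own Coverage, then cast back to integers.
df['Effective Coverage'] = step.fillna(df['Coverage']).astype(int)
print(df)
```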
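Answer 2: (score: 0)
If the Coverage column is a cumulative total, then the maximum value of the column within each group is that group's total. I've changed your coverage values so you can see what the groupby is doing, and then joined the result back to the original DataFrame: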
import numpy as np
import pandas as pd

df = pd.DataFrame({'Priority': np.arange(1, 10),
                   'ID': np.arange(1000, 1009),
                   'Name': ['Name {0}'.format(i) for i in np.arange(1, 10)],
                   'Coverage': [33, 67, 100, 11, 22, 33, 67, 124, 200],
                   'Group': ['Group {0}'.format(i) for i in ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']]})[['Priority', 'ID', 'Name', 'Coverage', 'Group']]
# Look up each group's maximum Coverage and attach it to every row of that group.
df2 = df.join(df.groupby('Group').Coverage.max(), on='Group', rsuffix='_max')
Then you just add a new column to compute the effective coverage:
df2['Effective Coverage'] = df2.Coverage.divide(df2.Coverage_max)
>>> df2
Priority ID Name Coverage Group Coverage_max Effective Coverage
0 1 1000 Name 1 33 Group A 100 0.330000
1 2 1001 Name 2 67 Group A 100 0.670000
2 3 1002 Name 3 100 Group A 100 1.000000
3 4 1003 Name 4 11 Group B 33 0.333333
4 5 1004 Name 5 22 Group B 33 0.666667
5 6 1005 Name 6 33 Group B 33 1.000000
6 7 1006 Name 7 67 Group C 200 0.335000
7 8 1007 Name 8 124 Group C 200 0.620000
8 9 1008 Name 9 200 Group C 200 1.000000
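Note that the column above is still cumulative: it shows each row's running share of its group's total. If a per-row share is what you are after, one possible extension is sketched below. It swaps the join for transform('max'), which broadcasts the group maximum back onto every row, and then differences the shares within each group. The names group_max, cum_share and the 'Row Share' column are invented for this example.

```python
import pandas as pd

# Same modified Coverage values as in this answer.
df = pd.DataFrame({
    'Coverage': [33, 67, 100, 11, 22, 33, 67, 124, 200],
    'Group': ['Group A'] * 3 + ['Group B'] * 3 + ['Group C'] * 3,
})

# Broadcast each group's maximum (its cumulative total) back onto its rows.
group_max = df.groupby('Group')['Coverage'].transform('max')

# Cumulative share of the group total, matching the column shown above.
cum_share = df['Coverage'] / group_max

# Per-row share: difference within each group, keeping the first row's own share.
df['Row Share'] = cum_share.groupby(df['Group']).diff().fillna(cum_share)
print(df)
```

With this, the shares within each group add up to 1.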