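I am trying to add a column to a Pandas GroupBy DataFrame that has a MultiIndex. The column is the difference between the max and the mean of a common key after grouping.
This is the input DataFrame: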
   Main  Reads  Test  Subgroup
0     1      5    54         1
1     2      2    55         1
2     1     10    56         2
3     2     20    57         3
4     1      7    58         3
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Main': [1, 2, 1, 2, 1], 'Reads': [5, 2, 10, 20, 7],\
'Test':range(54,59), 'Subgroup':[1,1,2,3,3]})
result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean]})
Executing result gives:
              Reads
               amax mean
Main Subgroup
1    1            5    5
     2           10   10
     3            7    7
2    1            2    2
     3           20   20
Next, I compute the diff column with the following:
result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
But this is the output:
/home/userd/test.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/
...stable/indexing.html#indexing-view-versus-copy
...result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
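I want the diff column to sit at the same level as amax and mean.
Is there a way to add a column to the innermost (bottom) level of the column index of a MultiIndex GroupBy() result in Pandas?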
Answer 0: (score: 3)
You can address a MultiIndex column with a tuple:
result[('Reads','diff')] = result[('Reads','amax')] - result[('Reads','mean')]
You get:
              Reads
               amax mean diff
Main Subgroup
1    1            5    5    0
     2           10   10    0
     3            7    7    0
2    1            2    2    0
     3           20   20    0
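For context on why the tuple form works where the chained form does not: result['Reads'] returns a separate DataFrame, so assigning into result['Reads']['diff'] modifies that temporary object rather than result itself. A single tuple key is one indexing operation on result, so the new column lands directly under Reads. Below is a minimal self-contained sketch of that idea. It assumes a pandas version that accepts the string names 'max' and 'mean' in agg, which means the columns are labelled max and mean rather than amax and mean.

```python
import pandas as pd

df = pd.DataFrame({'Main': [1, 2, 1, 2, 1], 'Reads': [5, 2, 10, 20, 7],
                   'Test': range(54, 59), 'Subgroup': [1, 1, 2, 3, 3]})

# String aggregator names are an assumption made for this sketch; they yield
# the column labels ('Reads', 'max') and ('Reads', 'mean').
result = df.groupby(['Main', 'Subgroup']).agg({'Reads': ['max', 'mean']})

# One tuple key = one assignment on `result` itself, so no intermediate copy
# is involved and the new column is appended under the 'Reads' level.
result[('Reads', 'diff')] = result[('Reads', 'max')] - result[('Reads', 'mean')]

print(result)
```

Every (Main, Subgroup) pair in this sample has exactly one row, which is why max equals mean and diff comes out as zero everywhere.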
Answer 1: (score: 2)
Try this:
In [8]: result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean, lambda x: x.max()-x.mean()]})
In [9]: result
Out[9]:
              Reads
               amax mean <lambda>
Main Subgroup
1    1            5    5        0
     2           10   10        0
     3            7    7        0
2    1            2    2        0
     3           20   20        0
In [10]: result = result.rename(columns={'<lambda>':'diff'})
In [11]: result
Out[11]:
              Reads
               amax mean diff
Main Subgroup
1    1            5    5    0
     2           10   10    0
     3            7    7    0
2    1            2    2    0
     3           20   20    0
Answer 2: (score: 2)
# You can use a lambda to build diff directly.
df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean,lambda x: np.max(x)-np.mean(x)]}).rename(columns={'<lambda>':'diff'})
Out[2360]:
              Reads
               amax mean diff
Main Subgroup
1    1            5    5    0
     2           10   10    0
     3            7    7    0
2    1            2    2    0
     3           20   20    0
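One caveat with the lambda approaches: the auto-generated label for a lambda column may differ between pandas versions, so it is worth checking result.columns before relying on rename(columns={'<lambda>':'diff'}). A possible alternative that avoids the generated name altogether is named aggregation, sketched below. It assumes pandas 0.25 or later. The variable name `flat` and the pd.concat step that restores the top-level Reads label are choices made for this sketch, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({'Main': [1, 2, 1, 2, 1], 'Reads': [5, 2, 10, 20, 7],
                   'Test': range(54, 59), 'Subgroup': [1, 1, 2, 3, 3]})

# Named aggregation on the 'Reads' series: each keyword becomes a column
# label, so the lambda result is called 'diff' from the start.
flat = df.groupby(['Main', 'Subgroup'])['Reads'].agg(
    amax='max',
    mean='mean',
    diff=lambda s: s.max() - s.mean(),
)

# Wrap the flat columns under a top-level 'Reads' key to get the same
# two-level column layout as in the question.
result = pd.concat({'Reads': flat}, axis=1)

print(result)
```

The trade-off is one extra step to rebuild the column MultiIndex, in exchange for not depending on how pandas labels anonymous functions.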