Question

I have a DataFrame:

import numpy as np
import pandas as pd

temp = pd.DataFrame({'Y': ['A', 'B', 'B', 'A', 'B'],
                     'Z': [10, 5, 6, np.nan, 12]})

I set Y as the index, then compute the size and the count per group:

temp.sort_values('Y', inplace=True)  # DataFrame.sort was removed from pandas; sort_values replaces it
temp.set_index('Y', inplace=True, drop=False)
temp.sort_index(inplace=True)

temp['n_obs'] = temp.groupby(by='Y')['Z'].transform('size')
temp['valid'] = temp.groupby(by='Y')['Z'].transform('count')

This produces:

   Y     Z  n_obs  valid
Y                       
A  A  10.0    2.0    1.0
A  A   NaN    2.0    1.0
B  B   5.0    3.0    3.0
B  B   6.0    3.0    3.0
B  B  12.0    3.0    3.0

Now I want to divide valid by n_obs within each group:

temp['New'] = temp.groupby(by='Y').apply(lambda x: (x['valid'] / x['n_obs']))

But I get this error:

Exception: cannot handle a non-unique multi-index!

Any ideas?

Answer 1

I think you can use reset_index twice:

temp.sort_values('Y', inplace=True)
temp.set_index('Y', inplace=True, drop=False)
temp.sort_index(inplace=True)

temp['n_obs'] = temp.groupby(by='Y')['Z'].transform('size')
temp['valid'] = temp.groupby(by='Y')['Z'].transform('count')

temp.reset_index(drop=True, inplace=True)

# Parentheses let the method chain span several lines.
temp['New'] = (temp.groupby(by='Y')
                   .apply(lambda x: (x['valid'] / x['n_obs']))
                   .reset_index(drop=True, level=0))
print(temp)
   Y     Z  n_obs  valid  New
0  A  10.0    2.0    1.0  0.5
1  A   NaN    2.0    1.0  0.5
2  B   5.0    3.0    3.0  1.0
3  B   6.0    3.0    3.0  1.0
4  B  12.0    3.0    3.0  1.0

But if you omit the groupby and simply divide the columns, the result appears to be the same:

temp.sort_values('Y', inplace=True)
temp.set_index('Y', inplace=True, drop=False)
temp.sort_index(inplace=True)

temp['n_obs'] = temp.groupby(by='Y')['Z'].transform('size')
temp['valid'] = temp.groupby(by='Y')['Z'].transform('count')


temp['New'] = temp['valid'] / temp['n_obs']
print(temp)
   Y     Z  n_obs  valid  New
Y                            
A  A  10.0    2.0    1.0  0.5
A  A   NaN    2.0    1.0  0.5
B  B   5.0    3.0    3.0  1.0
B  B   6.0    3.0    3.0  1.0
B  B  12.0    3.0    3.0  1.0
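
Because n_obs and valid are already constant within each group, the same ratio can probably be computed straight from the original frame, without setting Y as the index at all. A minimal sketch, assuming only the example data from the question:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'Y': ['A', 'B', 'B', 'A', 'B'],
                     'Z': [10, 5, 6, np.nan, 12]})

grouped = temp.groupby('Y')['Z']

# 'count' skips NaN while 'size' counts every row in the group, so the
# quotient is the share of non-missing Z values per group.
temp['New'] = grouped.transform('count') / grouped.transform('size')
print(temp)
```

transform returns a result aligned to the original rows, so no index manipulation is needed before the assignment.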
