I have a DataFrame:
import pandas as pd
df = pd.DataFrame({'Continent':['North America','North America','North America','Europe','Europe','Europe','Europe'],
'Country': ['US','Canada','Mexico','France','Germany','Spain','Italy'],
'Status': ['Member','Non-Member','Non-Member','Member','Non-Member','Member','Non-Member'],
'Units': [27,5,4,10,15,8,8]})
print(df)
       Continent  Country      Status  Units
0  North America       US      Member     27
1  North America   Canada  Non-Member      5
2  North America   Mexico  Non-Member      4
3         Europe   France      Member     10
4         Europe  Germany  Non-Member     15
5         Europe    Spain      Member      8
6         Europe    Italy  Non-Member      8
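I need to add two columns holding per-continent summary statistics: one with the total units for member countries and one with the total units for non-member countries,
so that the final output looks like this: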
       Continent  Member Units  Non-Member Units  Country      Status  Units
0  North America            27                 9       US      Member     27
1  North America            27                 9   Canada  Non-Member      5
2  North America            27                 9   Mexico  Non-Member      4
3         Europe            18                23   France      Member     10
4         Europe            18                23  Germany  Non-Member     15
5         Europe            18                23    Spain      Member      8
6         Europe            18                23    Italy  Non-Member      8
It seems I need groupby, but I can't figure out how to take the groupby values and insert them back into the DataFrame as new columns.
summary_stats = df.groupby(['Continent','Status'])['Units'].sum()
print(summary_stats)
Continent      Status
Europe         Member        18
               Non-Member    23
North America  Member        27
               Non-Member     9
Name: Units, dtype: int64
I also tried it without groupby:
df['Member Units'] = df['Units'][df['Status'] == 'Member'].sum()
df['Non-Member Units'] = df['Units'][df['Status'] == 'Non-Member'].sum()
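But this isn't broken down by continent, so it just sums all members and all non-members across the whole DataFrame.
Any help would be greatly appreciated!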
Answer 0 (score: 2)
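I think you first need groupby with transform and sum to create a new Series, all_sum. Then I think the best option is numpy.where: if the row is a member, take the value from the Series, otherwise use 0. The non-member column works the same way:
all_sum = df.groupby(['Continent','Status'])['Units'].transform('sum')
print(all_sum)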
0    27
1     9
2     9
3    18
4    23
5    18
6    23
dtype: int64
import numpy as np
df['Member Units'] = np.where(df['Status'] == 'Member', all_sum, 0)
df['Non-Member Units'] = np.where(df['Status'] != 'Member', all_sum, 0)
print(df)
       Continent  Country      Status  Units  Member Units  Non-Member Units
0  North America       US      Member     27            27                 0
1  North America   Canada  Non-Member      5             0                 9
2  North America   Mexico  Non-Member      4             0                 9
3         Europe   France      Member     10            18                 0
4         Europe  Germany  Non-Member     15             0                23
5         Europe    Spain      Member      8            18                 0
6         Europe    Italy  Non-Member      8             0                23
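The output above leaves 0 in every row whose own status does not match the column, whereas the question asks for the continent-wide total on every row. A minimal sketch of one way to get that, not the answerer's method: zero out the rows that should not count, then sum per continent with transform. The helper name is_member is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

# Boolean mask (illustrative name): True for member countries.
is_member = df['Status'] == 'Member'

# Replace the units of rows that should not count with 0, then sum per
# continent and broadcast that total back onto every row of the continent.
df['Member Units'] = (
    df['Units'].where(is_member, 0).groupby(df['Continent']).transform('sum')
)
df['Non-Member Units'] = (
    df['Units'].where(~is_member, 0).groupby(df['Continent']).transform('sum')
)
print(df)
```

Because transform returns a result aligned to the original index, no merge or reordering step is needed afterwards.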
Answer 1 (score: 1)
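Once you have summary_stats, I think you can do this:
df['Member Units'] = summary_stats.loc[list(zip(df['Continent'].values, df['Status'].values))].to_numpy()
The reason you need to zip the Series values is that df['Continent'] returns a Series with its own index, which you don't want here. In Python 3, zip returns an iterator, so it is wrapped in list; to_numpy() drops the MultiIndex of the lookup result so that the values are assigned by position.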
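Note that the lookup above gives each row the total for its own Continent/Status pair, not a fixed Member or Non-Member total. A hedged sketch of a variation that yields the columns from the question: slice summary_stats down to one status with xs, then map continents onto that slice. The names member_totals and non_member_totals are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()

# Select one status from the MultiIndex; what remains is a Series
# indexed by Continent only (illustrative variable names).
member_totals = summary_stats.xs('Member', level='Status')
non_member_totals = summary_stats.xs('Non-Member', level='Status')

# map looks up each row's continent in the per-continent totals.
df['Member Units'] = df['Continent'].map(member_totals)
df['Non-Member Units'] = df['Continent'].map(non_member_totals)
print(df)
```

If a continent had no countries of a given status, map would leave NaN in that column for its rows, which may need a fillna(0) depending on the data.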
Answer 2 (score: 0)
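Since you already have summary_stats, you can reshape it and then use merge():
summary = summary_stats.reset_index().pivot(index='Continent', columns='Status', values='Units')
summary = summary.reset_index()  # turn the Continent index back into a column so merge can use it
df = df.merge(summary, on='Continent')
Then rename the columns as needed.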
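For reference, the reshape-and-merge idea can be sketched end to end with the renaming included. This version swaps in unstack and add_suffix in place of pivot and a manual rename; that is a substitution chosen for brevity, not what the answer used, and the name result is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()

# Move the Status level into columns, rename them to "<status> Units",
# and turn the Continent index into an ordinary column for merging.
summary = (
    summary_stats.unstack('Status')
    .add_suffix(' Units')
    .reset_index()
)

# A left merge keeps every original row and its order.
result = df.merge(summary, on='Continent', how='left')
print(result)
```

The new columns land at the end of the frame; reordering them to match the layout in the question is a matter of selecting the columns in the desired order.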