对数据进行分组后,我想从仅包含一个观察值的结果组中删除值低于某个阈值。
初始数据:
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
'City' :['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
'Sales' : [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
现在对数据进行分组:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
现在我无法弄清楚的部分是如何放弃只有一个城市的省份(或者一般是N个观察点),总销售量低于10个。预期产量应该是:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
即。 MN / Winnipeg和QC / Montreal已从结果中消失。理想情况下,他们不会完全消失,而是组合成一个名为“其他”的新组,但这可能是另一个问题的原因。
答案 0 :(得分:3)
你可以这样做:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
现在我们将为那些拥有多个城市的省份创建一个面具':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
总销售额大于或等于10的城市获胜:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
答案 1 :(得分:2)
我对所给出的任何答案都不满意,所以我一直在努力,直到我找到以下解决方案:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x)>1 or x.Sales > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1