I have a pandas DataFrame like the one below:
>>> df.head()
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
2 35002 26000 OPTIDX NIFTY XX 1609425000 NIFTY20DEC10400CE
3 35003 26000 OPTIDX NIFTY XX 1609425000 NIFTY20DEC10400PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
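I want to group by column 5 in sorted order and then return the first n groups, where n can be given as a variable.
I did df.sort_values(5).groupby([5])
and I get <pandas.core.groupby.DataFrameGroupBy object at 0x2afc8d0>
How can I get all the rows in the first two groups? In the sample df above, group 1 is 1499351400, group 2 is 1499956200, and group 3 is 1609425000.
Expected output when the number of groups required is 2: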
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
Update 1: After trying @jezrael's answer:
>>> k2=k1[k1.groupby(5).ngroup() < 2]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/python/2.7/lib/python2.7/site-packages/pandas/core/groupby.py", line 529, in __getattr__
(type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'
Additional: can this be done without pandas (pure Python only)? I may not always have access to a machine with pandas. Thanks.
Answer 0 (score: 1)
Use ngroup (new in pandas 0.20.2) with boolean indexing:
df = df.sort_values(5)
print (df.groupby(5).ngroup())
0 0
1 0
4 1
2 2
3 2
dtype: int64
df = df[df.groupby(5).ngroup() < 2]
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
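Since the question asks for n as a variable, the filter above can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the name first_n_groups and the rebuilt sample frame are assumptions for illustration. It relies on groupby sorting the group keys by default, so ngroup numbers the groups in ascending key order even without a prior sort_values.

```python
import pandas as pd

# Sample data rebuilt from the question; columns are the integers 0..6.
rows = [
    [35000, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499351400, "BANKNIFTY1770621000CE"],
    [35001, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499351400, "BANKNIFTY1770621000PE"],
    [35002, 26000, "OPTIDX", "NIFTY", "XX", 1609425000, "NIFTY20DEC10400CE"],
    [35003, 26000, "OPTIDX", "NIFTY", "XX", 1609425000, "NIFTY20DEC10400PE"],
    [35004, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499956200, "BANKNIFTY1771321100CE"],
]
df = pd.DataFrame(rows)


def first_n_groups(frame, col, n):
    """Return all rows whose value in `col` is among the n smallest distinct values."""
    # groupby sorts keys by default, so ngroup() labels groups 0, 1, 2, ...
    # in ascending key order, regardless of row order.
    group_number = frame.groupby(col).ngroup()
    return frame[group_number < n]


result = first_n_groups(df, 5, 2)
print(result)
```

The rows come back in their original order; chain .sort_values(5) on the result if sorted output is needed.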
For older versions of pandas, use a small hack: the group codes are hidden in the grouper.group_info attribute, so select the first array with [0]:
df = df.sort_values(5)
print (df.groupby([5]).grouper.group_info)
(array([0, 0, 2, 2, 1], dtype=int64), array([0, 1, 2]), 3)
print (df.groupby([5]).grouper.group_info[0])
[0 0 2 2 1]
df = df[df.groupby([5]).grouper.group_info[0] < 2]
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
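Alternative solution with factorize:
import pandas as pd
# factorize numbers values in order of first appearance, so sort first
df = df.sort_values(5)
df = df[pd.factorize(df[5])[0] < 2]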
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
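The factorize approach depends on row order, because factorize assigns codes in order of first appearance. As a hedged sketch (the Series below is rebuilt from column 5 of the question, and the variable names are illustrative), passing sort=True to pd.factorize yields codes in ascending value order without reordering the rows:

```python
import pandas as pd

# Column 5 of the sample frame, in its original (unsorted) row order.
col5 = pd.Series([1499351400, 1499351400, 1609425000, 1609425000, 1499956200])

# Default: codes follow order of first appearance, so 1609425000 gets code 1.
codes_appearance, _ = pd.factorize(col5)

# sort=True: codes follow ascending value order, so 1499956200 gets code 1.
codes_sorted, uniques = pd.factorize(col5, sort=True)

print(codes_appearance)  # [0 0 1 1 2]
print(codes_sorted)      # [0 0 2 2 1]

n = 2
selected = col5[codes_sorted < n]
print(selected.index.tolist())  # [0, 1, 4]
```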
Answer 1 (score: 1)
If you can't use ngroup, you can rank the elements with method='dense' and then use that to index into df:
In [24]: df.loc[df[5].rank(method='dense') <= 2]
Out[24]:
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
This works because rank(method='dense') gives us the rank of each value in sorted order, with no gaps between groups:
In [25]: df[5].rank(method='dense')
Out[25]:
0 1.0
1 1.0
2 3.0
3 3.0
4 2.0
Name: 5, dtype: float64
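To see why the dense method matters here, it may help to compare it with another tie-handling method. The snippet below is an illustrative sketch on a Series rebuilt from column 5 of the question (the variable names are assumptions): with method='min', ties leave gaps in the ranks, so a threshold of n no longer corresponds to the first n groups.

```python
import pandas as pd

# Column 5 of the sample frame, in its original row order.
col5 = pd.Series([1499351400, 1499351400, 1609425000, 1609425000, 1499956200])

dense = col5.rank(method="dense")  # 1, 1, 3, 3, 2: consecutive group numbers
minimum = col5.rank(method="min")  # 1, 1, 4, 4, 3: ties leave gaps

n = 2
print(col5[dense <= n].index.tolist())    # [0, 1, 4] -> first two groups
print(col5[minimum <= n].index.tolist())  # [0, 1]    -> only the first group
```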
(P.S. By an odd coincidence, I added both ngroup and method='dense', so this question made me happy. :-)
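The question also asks whether this can be done without pandas. Neither answer covers that, so the following is only a rough sketch under stated assumptions: the rows are plain Python lists (for example, as they might come from csv.reader), and first_n_groups_plain is a hypothetical helper name.

```python
# Rows as plain lists, rebuilt from the sample in the question.
rows = [
    [35000, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499351400, "BANKNIFTY1770621000CE"],
    [35001, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499351400, "BANKNIFTY1770621000PE"],
    [35002, 26000, "OPTIDX", "NIFTY", "XX", 1609425000, "NIFTY20DEC10400CE"],
    [35003, 26000, "OPTIDX", "NIFTY", "XX", 1609425000, "NIFTY20DEC10400PE"],
    [35004, 26009, "OPTIDX", "BANKNIFTY", "XX", 1499956200, "BANKNIFTY1771321100CE"],
]


def first_n_groups_plain(rows, key_index, n):
    """Return rows whose key is among the n smallest distinct keys, ordered by key."""
    # The n smallest distinct key values.
    wanted = set(sorted({row[key_index] for row in rows})[:n])
    # sorted() is stable, so rows with equal keys keep their input order.
    return sorted(
        (row for row in rows if row[key_index] in wanted),
        key=lambda row: row[key_index],
    )


selected_rows = first_n_groups_plain(rows, 5, 2)
for row in selected_rows:
    print(row)
```

This keeps everything in memory, which should be fine for small files; for large inputs a different approach may be needed.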