I want to count the rows in each group of a column, and then show only the groups whose count exceeds a given number.
Consider this example:
import pandas as pd
df = pd.DataFrame(
{
'ColA': 'A A A B B C C C C D E E F F F F F F F G G H'.split(),
'ColB': '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2'.split()
}
)
If I group it like this:
print df.groupby(['ColA']).agg(['count'])
I get this output:
      ColB
     count
ColA
A        3
B        2
C        4
D        1
E        2
F        7
G        2
H        1
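Now, if I only want to show the rows above where the count is greater than 2, how do I do that? I want to exclude B, D, E, G, and H.
I tried the following two lines, and both raise the same error: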
print df.loc[df.groupby(['ColA']).agg(['count']) > 2]
print df.loc[df.groupby(['ColA']).agg(['count'])['ColB'] > 2]
Error:
Traceback (most recent call last):
File "C:/scratches/scratch_3", line 11, in <module>
print df.loc[df.groupby(['ColA']).agg(['count'])['ColB'] > 2]
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1189, in __getitem__
return self._getitem_axis(key, axis=0)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1321, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
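The answer provided by PabTorre does not seem to work with newer versions of pandas. I am using 0.16.2.
When I use that answer, I get the following error on this line: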
print df_count[df_count.values>2]
Traceback (most recent call last):
File "C:/scratches/scratch_3", line 10, in <module>
print df_count[df_count.values>2]
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 1791, in __getitem__
return self._getitem_array(key)
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 1833, in _getitem_array
return self.take(indexer, axis=0, convert=False)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 1358, in take
convert=True, verify=True)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 3275, in take
axis=axis, allow_dups=True)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 3162, in reindex_indexer
for blk in self.blocks]
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 857, in take_nd
allow_fill=True, fill_value=fill_value)
File "C:\Anaconda\lib\site-packages\pandas\core\common.py", line 844, in take_nd
func(arr, indexer, out, fill_value)
File "pandas\src\generated.pyx", line 5779, in pandas.algos.take_2d_axis1_object_object (pandas\algos.c:107426)
File "stringsource", line 614, in View.MemoryView.memoryview_cwrapper (pandas\algos.c:187433)
File "stringsource", line 321, in View.MemoryView.memoryview.__cinit__ (pandas\algos.c:184022)
ValueError: buffer source array is read-only
Answer 0 (score: 2)
The problem with your last query:
print df.loc[df.groupby(['ColA']).agg(['count']) > 2]
is that df.loc[] expects a boolean mask with 22 entries, one per row of df. Instead, it gets one with only 8 entries, one per group:
>>> df.groupby(['ColA']).agg(['count']) > 2
       ColB
      count
ColA
A      True
B     False
C      True
D     False
E     False
F      True
G     False
H     False
So it doesn't know how to line the two up.
But there is a solution. :)
First, let's assign the aggregated df to a new object.
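To make the mismatch concrete, here is a small sketch that rebuilds the example frame and compares the two sizes. The name `group_mask` is only illustrative, and the code uses `print()` with parentheses so that it also runs on Python 3.

```python
import pandas as pd

df = pd.DataFrame({
    'ColA': 'A A A B B C C C C D E E F F F F F F F G G H'.split(),
    'ColB': '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2'.split(),
})

# The comparison is evaluated on the aggregated frame, not on df itself.
group_mask = df.groupby(['ColA']).agg(['count']) > 2

print(len(df))            # rows in the original frame
print(group_mask.shape)   # rows = number of groups, columns = aggregated columns
print(list(group_mask.index))  # the mask is indexed by group label, not by row number
```

The mask is indexed by the group labels and is itself a DataFrame, so it cannot be used directly as a row selector for `df`.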
>>> df_count = df.groupby(['ColA']).agg(['count']).ColB
>>> df_count.columns=['ColB']
Then we can easily filter it:
>>> df_count[df_count.ColB.values>2]
      ColB
ColA
A        3
C        4
F        7
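As a variation on the step above, the counts can also be kept as a plain Series and filtered with a boolean Series instead of a raw `.values` array. This is a sketch, not a confirmed fix for the read-only buffer error reported in the question; the names `counts` and `big_groups` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ColA': 'A A A B B C C C C D E E F F F F F F F G G H'.split(),
    'ColB': '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2'.split(),
})

# Selecting the column before aggregating yields a Series indexed by ColA.
counts = df.groupby('ColA')['ColB'].count()

# The mask shares the index of counts, so the two align label by label.
big_groups = counts[counts > 2]
print(big_groups)
```

Because the mask and the data share the same index here, no renaming of columns is needed before filtering.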
Then we can use the filtered df to go back and filter the original df:
>>> df_filtered=df_count[df_count.ColB.values>2]
>>> df[df.ColA.isin(df_filtered.index)]
   ColA ColB
0     A    1
1     A    2
2     A    3
5     C    6
6     C    7
7     C    8
8     C    9
12    F    3
13    F    4
14    F    5
15    F    6
16    F    7
17    F    8
18    F    9
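For comparison, the same row selection can be sketched in one step with two other groupby methods, `filter` and `transform`. These are alternatives to the approach above rather than part of it, and whether they behave identically on pandas 0.16.2 is an assumption; the names `kept_filter`, `group_sizes`, and `kept_transform` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ColA': 'A A A B B C C C C D E E F F F F F F F G G H'.split(),
    'ColB': '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2'.split(),
})

# Option 1: keep every row whose group has more than 2 members.
kept_filter = df.groupby('ColA').filter(lambda g: len(g) > 2)

# Option 2: broadcast each group's count back onto its rows,
# which gives a mask with one entry per row of df.
group_sizes = df.groupby('ColA')['ColB'].transform('count')
kept_transform = df[group_sizes > 2]

print(kept_filter)
print(kept_transform)
```

Both options build a selector that has one entry per row of `df`, which is what the original `df.loc[...]` attempts were missing.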