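I get an error when calling count() on a groupby over multiple columns. Here is my DataFrame, together with a simple example that just labels the distinct 'b' and 'c' groups.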
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 2, (4, 4)),
                  columns=['a', 'b', 'c', 'd'])
df['gr'] = df.groupby(['b', 'c']).grouper.group_info[0]
print(df)
   a  b  c  d  gr
0  0  1  0  0   1
1  1  1  1  0   2
2  0  0  1  0   0
3  1  1  1  1   2
However, when I change the example slightly so that it calls count() instead of grouper.group_info[0], an error is raised.
df = pd.DataFrame(np.random.randint(0, 2, (4, 4)),
                  columns=['a', 'b', 'c', 'd'])
df['gr'] = df.groupby(['b', 'c']).count()
print(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-70-a46f632214e1> in <module>()
1 df = pd.DataFrame(np.random.randint(0,2,(4,4)),
2 columns=['a', 'b', 'c', 'd'])
----> 3 df['gr'] = df.groupby(['b', 'c']).count()
4 print df
C:\Python27\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
2036 else:
2037 # set column
-> 2038 self._set_item(key, value)
2039
2040 def _setitem_slice(self, key, value):
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
2082 ensure homogeneity.
2083 """
-> 2084 value = self._sanitize_column(key, value)
2085 NDFrame._set_item(self, key, value)
2086
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
2110 value = value.values.copy()
2111 else:
-> 2112 value = value.reindex(self.index).values
2113
2114 if is_frame:
C:\Python27\lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, method, level, fill_value, limit, copy)
2527 if index is not None:
2528 frame = frame._reindex_index(index, method, copy, level,
-> 2529 fill_value, limit)
2530
2531 return frame
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit)
2606 limit=None):
2607 new_index, indexer = self.index.reindex(new_index, method, level,
-> 2608 limit=limit)
2609 return self._reindex_with_indexers(new_index, indexer, None, None,
2610 copy, fill_value)
C:\Python27\lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit)
2181 else:
2182 # hopefully?
-> 2183 target = MultiIndex.from_tuples(target)
2184
2185 return target, indexer
C:\Python27\lib\site-packages\pandas\core\index.pyc in from_tuples(cls, tuples, sortorder, names)
1803 tuples = tuples.values
1804
-> 1805 arrays = list(lib.tuples_to_object_array(tuples).T)
1806 elif isinstance(tuples, list):
1807 arrays = list(lib.to_object_array_tuples(tuples).T)
C:\Python27\lib\site-packages\pandas\lib.pyd in pandas.lib.tuples_to_object_array (pandas\lib.c:42342)()
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
Answer 0 (score: 8)
Evaluate df.groupby(['b', 'c']).count() in an interactive session:
In [150]: df.groupby(['b', 'c']).count()
Out[150]:
     a  b  c  d
b c
0 0  1  1  1  1
  1  1  1  1  1
1 1  2  2  2  2
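This is an entire DataFrame. It is probably not what you want to assign to a new column of df. (In fact, you cannot assign a DataFrame to a single column of a DataFrame, which is why an exception is raised, albeit a cryptic one.)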
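To make the mismatch concrete, here is a minimal sketch that compares the result of count() with the original frame. The data is hard-coded (an assumption made for reproducibility, not the asker's random data), and the exact columns of the result vary by pandas version, so only the index structure is inspected.

```python
import pandas as pd

# Fixed illustrative data, so the result does not depend on a random seed.
df = pd.DataFrame({'a': [1, 1, 1, 0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1],
                   'd': [0, 1, 1, 0]})

counts = df.groupby(['b', 'c']).count()

# The result is a DataFrame with one row per (b, c) group,
# indexed by a two-level MultiIndex built from the group keys.
print(type(counts).__name__, len(counts), counts.index.nlevels)

# df itself has one row per observation and a single-level integer index.
print(type(df).__name__, len(df), df.index.nlevels)
```

The two objects differ in both row count and index type, so the result cannot be aligned with df's rows as a single column.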
If you want to create a new column that holds the number of rows in each group, you could use
df['gr'] = df.groupby(['b', 'c'])['a'].transform('count')
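One caveat worth checking: 'count' counts the non-null values of the selected column, so it equals the number of rows only when column 'a' has no missing values. The sketch below uses hypothetical data with a NaN added (not part of the original example) to contrast it with 'size', which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the NaN in 'a' is added only to show the difference.
df = pd.DataFrame({'a': [1.0, np.nan, 1.0, 0.0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1]})

# Non-null values of 'a' per (b, c) group, broadcast back to each row.
by_count = df.groupby(['b', 'c'])['a'].transform('count')

# Rows per (b, c) group, broadcast back to each row.
by_size = df.groupby(['b', 'c'])['a'].transform('size')

print(by_count.tolist())
print(by_size.tolist())
```

If the column may contain missing values and the goal is a row count, 'size' is probably the safer choice.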
For example,
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 2, (4, 4)),
columns=['a', 'b', 'c', 'd'])
print(df)
# a b c d
# 0 1 1 0 0
# 1 1 1 1 1
# 2 1 0 0 1
# 3 0 1 1 0
df['gr'] = df.groupby(['b', 'c'])['a'].transform('count')
df['comp_ids'] = df.groupby(['b', 'c']).grouper.group_info[0]
print(df)
yields
   a  b  c  d  gr  comp_ids
0  1  1  0  0   1         1
1  1  1  1  1   2         2
2  1  0  0  1   1         0
3  0  1  1  0   2         2
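Note that df.groupby(['b', 'c']).grouper.group_info[0] does not return the number of rows in each group. Instead, it returns, for each row, an integer label identifying the group that row belongs to.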
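In recent pandas releases the .grouper attribute is deprecated, so the same per-row labels can instead be obtained with the public ngroup() method. The sketch below is a suggested alternative, not part of the original answer; it hard-codes the values printed in the example above rather than relying on the random seed.

```python
import pandas as pd

# Same values as the seeded example above, written out explicitly.
df = pd.DataFrame({'a': [1, 1, 1, 0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1],
                   'd': [0, 1, 1, 0]})

# Compute both results before adding any columns to df.
group_sizes = df.groupby(['b', 'c'])['a'].transform('count')
group_labels = df.groupby(['b', 'c']).ngroup()

df['gr'] = group_sizes         # number of rows in each row's group
df['comp_ids'] = group_labels  # integer label of each row's group
print(df)
```

With the default sort=True, ngroup() numbers the groups in sorted key order, so for this data the labels match the comp_ids column shown above.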