Question

[Pandas version 0.24.2; Python 3.6] A test DataFrame with one categorical column...

import pandas as pd
test = pd.concat([pd.Series(['A','B','C'], name='C1').astype(
    pd.api.types.CategoricalDtype(categories=['C','B','A'], ordered=True)),
                  pd.Series(range(3), name='C2')], axis=1).sort_values('C1')
test

This produces a DataFrame in which column C1 is categorical.

A simple groupby on column C1, calling size() on each group, preserves the categorical nature of that column in the index...

test.groupby('C1').size().index
CategoricalIndex(['C', 'B', 'A'], categories=['C', 'B', 'A'], ordered=True, name='C1', dtype='category')

But a more advanced use of groupby (with apply), which computes a Series of results for each group, somehow loses the categorical nature of the index...

test.groupby('C1').apply(lambda g: pd.Series({'size':len(g)})).index
Index(['C', 'B', 'A'], dtype='object', name='C1')

The key difference seems to be the call to pd.Series rather than the call to apply itself...

test.groupby('C1').apply(len).index
CategoricalIndex(['C', 'B', 'A'], categories=['C', 'B', 'A'], ordered=True, name='C1', dtype='category')

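To me at least, it seems odd that the payload of the apply makes a difference to the index of the result. I would have thought the index was defined by the groupby alone.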

Why does the groupby index lose its Categorical dtype when apply is used to produce a Series for each group?
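
For anyone hitting the same thing, here is a minimal workaround sketch. It does not explain the cause. It assumes the goal is only to get a CategoricalIndex back on the result, and it passes observed=False explicitly, which is my addition and not part of the code above. The names cat_type, restored and via_size are made up for illustration.

```python
import pandas as pd

# Same ordered categorical dtype as in the question.
cat_type = pd.api.types.CategoricalDtype(categories=['C', 'B', 'A'], ordered=True)

test = pd.concat([pd.Series(['A', 'B', 'C'], name='C1').astype(cat_type),
                  pd.Series(range(3), name='C2')], axis=1).sort_values('C1')

# Option 1: run the apply as before, then cast the index back to the
# categorical dtype. If the index is already categorical, the cast
# leaves it unchanged.
restored = test.groupby('C1', observed=False).apply(
    lambda g: pd.Series({'size': len(g)}))
restored.index = restored.index.astype(cat_type)

# Option 2: when only the group size is needed, skip apply altogether
# and turn the result of size() into a one-column DataFrame.
via_size = test.groupby('C1', observed=False).size().to_frame('size')

print(restored.index)
print(via_size.index)
```

Option 1 works for any per-group Series, while option 2 only covers the size case shown in the question.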

0 answers: