Question

I can't seem to get a simple dtype check working with pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])

We can see that the dtype of the categorical column is 'category':

df.cat_column.dtype
Out[20]: category

Normally we can do a dtype check by comparing against the name of the dtype:

df.x.dtype == 'float64'
Out[21]: True

But this doesn't seem to work when checking whether the x column is categorical:

df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

Is there any way to do these kinds of checks in pandas v0.15+?

Answer 1

Use the name attribute for the comparison. It should always work because it is just a string:

>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'

>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'

So, to sum up, you can end up with a simple, straightforward function:

def is_categorical(array_like):
    return array_like.dtype.name == 'category'
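
As a quick check of the helper above, here is a minimal sketch that applies it to a float Series, a categorical Series, and a plain NumPy array. The sample data is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd


def is_categorical(array_like):
    # dtype.name is a plain string, so this comparison never raises
    return array_like.dtype.name == 'category'


floats = pd.Series(np.linspace(0, 50, 6))
cats = pd.Series(list('abcdef'), dtype='category')

print(is_categorical(floats))               # False
print(is_categorical(cats))                 # True
print(is_categorical(np.array([1, 2, 3])))  # False
```

Because only strings are compared, the same function works for anything that exposes a dtype attribute.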

Answer 2

First, the string representation of the dtype is 'category' and not 'categorical', so this works:

In [41]: df.cat_column.dtype == 'category'
Out[41]: True

But indeed, as you noticed, this comparison raises a TypeError for other dtypes, so you would have to wrap it in a try .. except .. block.
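
A minimal sketch of such a wrapper might look like the following. The helper name dtype_equals is hypothetical (it is not part of pandas or NumPy), and whether the bare comparison raises at all may depend on the NumPy version in use; the wrapper returns False in either case.

```python
import numpy as np
import pandas as pd


def dtype_equals(dtype, name):
    """Hypothetical helper: compare a dtype to a dtype name without raising."""
    try:
        return bool(dtype == name)
    except TypeError:
        # Raised when the dtype cannot interpret the string, as in the question
        return False


df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'cat_column': pd.Categorical(list('abcdef')),
})

print(dtype_equals(df.x.dtype, 'category'))           # False
print(dtype_equals(df.cat_column.dtype, 'category'))  # True
```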

Other ways to check, using pandas internals:

In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True

In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True

For non-categorical columns, these statements return False instead of raising an error. For example:

In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False

For much older versions of pandas, replace pd.api.types in the snippets above with pd.core.common.

Answer 3

In my pandas version (v1.0.3), a shorter version of joris' answer is available.

import pandas as pd

df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})

print(isinstance(df.noncat.dtype, pd.CategoricalDtype))  # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype))   # True

print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ))  # True

Answer 4

Putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:

df['column'].name in df.select_dtypes(include='category').columns

Thanks to @Jeff.
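
To make the select_dtypes() approach concrete, here is a small sketch that collects the names of all categorical columns once and then tests membership. The DataFrame mirrors the one in the question, with fixed letters instead of a random sample; the variable name categorical_columns is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6),
    'cat_column': pd.Categorical(list('abcdef')),
})

# Names of all columns whose dtype is categorical
categorical_columns = list(df.select_dtypes(include='category').columns)
print(categorical_columns)  # ['cat_column']

print('cat_column' in categorical_columns)  # True
print('x' in categorical_columns)           # False
```

This is convenient when the real goal is to find every categorical column rather than to test a single one.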

Answer 5

I ran into this thread looking for exactly the same functionality, and found another option in the pandas documentation.

The canonical way to check whether a pandas DataFrame column is a categorical Series should be the following:

hasattr(column_to_check, 'cat')

So, following the example given in the original question, this would be:

hasattr(df.cat_column, 'cat')  # True
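
As a hedged illustration of how this check behaves, the sketch below runs it on a categorical column, a float column, and a bare pd.Categorical. The data is invented for the example. The .cat accessor exists only on a Series with a category dtype, so the check is meant for Series, not for arbitrary array-likes.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0.0, 10.0, 20.0],
    'cat_column': pd.Categorical(['a', 'b', 'c']),
})

# The .cat accessor is only available on Series with a category dtype
print(hasattr(df['cat_column'], 'cat'))  # True
print(hasattr(df['x'], 'cat'))           # False

# A bare pd.Categorical is not a Series and has no .cat accessor,
# so this check reports False for it even though its dtype is category
print(hasattr(pd.Categorical(['a', 'b']), 'cat'))  # False
```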

Check if a dataframe column is Categorical
