Question

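I am iterating over the columns of a DataFrame, and I want different behaviour when a column is categorical.

The following check works for a Series with category dtype, but raises an error when the Series has object dtype: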

if series.dtype == 'category':
    # do something

This works for categorical columns, but when the dtype is object it raises the following error:

Traceback (most recent call last):
  File "", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "", line 54, in run_data_template_task
    data_template.run(data_bundle, columns=columns)
  File "", line 531, in run
    self.to_parquet(data_bundle, columns=columns)
  File "", line 195, in to_parquet
    df = self.parse_df(df, columns=columns, overwrite_columns=overwrite_columns)
  File "", line 378, in parse_df
    df[col.name] = parse_series_with_nans(df[col.name], 'str')
  File "", line 369, in parse_series_with_nans
    if series.dtype == 'category':
TypeError: data type "category" not understood

On the other hand, using:

if series.dtype is 'category':
    # do something

returns False even when the dtype is category (which makes sense, since the two are clearly not the same object).

Reproducible example:

         df = pd.DataFrame({'category_column': ['a', 'b', 'c'], 'other_column': [1, 2, 3]})
         df['category_column'] = df['category_column'].astype('category')
         df['category_column'].dtype is 'category'
Out[46]: False
         df['category_column'].dtype == 'category'
Out[47]: True
         df['other_column'].dtype == 'category'
Traceback (most recent call last):
  File "", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-c6cc61c458d0>", line 1, in <module>
    d['other_column'].dtype == 'category'
TypeError: data type "category" not understood

Answer 1

df['category_column'].dtype is 'category'

is False because the two operands are not the same object.

On the other hand,

df['category_column'].dtype == 'category'

is True because, as the pandas documentation puts it:

All instances of CategoricalDtype compare equal to the string 'category'.

（https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#equality-semantics）

See also: Understanding Python's "is" operator
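The difference between identity and equality described above can be sketched as follows. This is only an illustration built on the question's example DataFrame; the variable names `label`, `same_object`, and `equal_value` are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({'category_column': ['a', 'b', 'c'], 'other_column': [1, 2, 3]})
df['category_column'] = df['category_column'].astype('category')

dtype = df['category_column'].dtype
label = 'category'

# `is` asks whether both names refer to the very same object.
# A CategoricalDtype instance is never the str object 'category'.
same_object = dtype is label

# `==` asks the dtype whether it considers itself equal to the string.
equal_value = dtype == label

print(same_object, equal_value)
```

The first result is False and the second is True, which matches Out[46] and Out[47] in the question.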

Answer 2

The dtype of a Series is actually a complex object, and comparing it to a string may or may not give the result you expect. Just look at your example:

>>> print(repr(df.category_column.dtype))
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> print(repr(df.other_column.dtype))
dtype('int64')

That is enough to show that they are not string values!

If you need a simple comparison, use their name attribute, which really is a string:

>>> df['category_column'].dtype.name == 'category'
True
>>> df['other_column'].dtype.name == 'category'
False
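Applied to the loop from the question, the name-based check might look like the sketch below. The helper `is_categorical` and the extra `object_column` are hypothetical and added only for illustration. The `isinstance` variant at the end is an alternative that does not come from either answer. Whether a direct `dtype == 'category'` comparison raises on a non-categorical column seems to depend on the installed pandas and NumPy versions, so comparing the name sidesteps that question.

```python
import pandas as pd


def is_categorical(series: pd.Series) -> bool:
    """Hypothetical helper: True when the Series has a categorical dtype."""
    # dtype.name is a plain str for NumPy dtypes and pandas extension dtypes
    # alike, so this is an ordinary string comparison.
    return series.dtype.name == 'category'


df = pd.DataFrame({
    'category_column': ['a', 'b', 'c'],
    'other_column': [1, 2, 3],
    # Assumed extra column, forced to object dtype for the demonstration.
    'object_column': pd.Series(['x', 'y', 'z'], dtype=object),
})
df['category_column'] = df['category_column'].astype('category')

# Collect the names of the columns that need the categorical code path.
categorical_columns = [name for name in df.columns if is_categorical(df[name])]
print(categorical_columns)

# Alternative: check the type of the dtype object instead of its name.
by_type = [name for name in df.columns
           if isinstance(df[name].dtype, pd.CategoricalDtype)]
```

Both lists should contain only 'category_column', and neither check raises for the integer or object columns.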

Pandas: checking whether a column is categorical
