Question

I start with a pandas DataFrame in which every column is categorical.

Suppose I have (1):

A    B     C
-------------
3    Z     M
O    X     T
4    A     B

I filter the DataFrame like this: df[df['B'] != "X"], which gives me result (2):

A    B     C
-------------
3    Z     M
4    A     B

In (1): df['B'].cat.categories  # would equal ['Z', 'X', 'A']

In (2): df['B'].cat.categories  # still equal to ['Z', 'X', 'A']

After a filtering operation like this, how do I update the categories of every column in the DataFrame?

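Bonus: to clean up the index after filtering, reset it (pass drop=True so the old index is discarded instead of being added as a column):

df.reset_index(drop=True)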

Answer 1

Call remove_unused_categories on the columns after filtering.

As piRSquared points out, given that every column has a categorical dtype, you can do this concisely:

df = df.query('B != "X"').apply(lambda s: s.cat.remove_unused_categories())

This loops over the columns after filtering. The same thing, step by step:

print(df)
#   A  B  C
#0  3  Z  M
#1  O  X  T
#2  4  A  B

df['B'].cat.categories
#Index(['A', 'X', 'Z'], dtype='object')

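# .copy() makes the filtered frame independent, so the assignments below
# do not trigger a SettingWithCopyWarning
df = df[df['B'] != 'X'].copy()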

# Update all category columns
for col in df.dtypes.loc[lambda x: x == 'category'].index:
    df[col] = df[col].cat.remove_unused_categories()

df['B'].cat.categories
#Index(['A', 'Z'], dtype='object')

df['C'].cat.categories
#Index(['B', 'M'], dtype='object')
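
If the frame mixes categorical and non-categorical columns, the apply one-liner above would fail, because the .cat accessor only exists on categorical Series. Below is a minimal sketch of a reusable helper that touches only the categorical columns and also resets the index. The name drop_unused_categories and the extra column n are hypothetical, chosen for illustration; they are not part of pandas or of the question.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from every
    categorical column and a fresh 0..n-1 index.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = df.copy()
    # select_dtypes picks out only the categorical columns.
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    # drop=True discards the old index instead of adding it as a column.
    return out.reset_index(drop=True)


df = pd.DataFrame(
    {"A": ["3", "O", "4"], "B": ["Z", "X", "A"], "C": ["M", "T", "B"]}
).astype("category")
df["n"] = [1, 2, 3]  # a non-categorical column, left untouched

filtered = drop_unused_categories(df[df["B"] != "X"])
print(filtered["B"].cat.categories.tolist())  # ['A', 'Z']
print(filtered.index.tolist())                # [0, 1]
```

Returning a copy rather than modifying the input keeps the original frame, with its full category lists, available if it is still needed.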

Answer 2

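pandas stores the categories separately from the values and does not remove them when they are no longer used. If you want to remove them, you can use this method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.remove_unused_categories.html#pandas.Series.cat.remove_unused_categories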
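
To see why the categories survive filtering, it may help to look at how a categorical Series is laid out: one integer code per row, plus a separate lookup table of categories. Here is a small sketch using the values of column B from the question (the variable names s, kept and trimmed are illustrative):

```python
import pandas as pd

s = pd.Series(["Z", "X", "A"], dtype="category")

# Each row stores an integer code; the labels live in a separate index.
print(s.cat.categories.tolist())  # ['A', 'X', 'Z']
print(s.cat.codes.tolist())       # [2, 1, 0]

# Filtering drops rows (codes) but leaves the lookup table alone.
kept = s[s != "X"]
print(kept.cat.categories.tolist())  # ['A', 'X', 'Z']

# remove_unused_categories rebuilds the table from the values still present.
trimmed = kept.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())  # ['A', 'Z']
print(trimmed.cat.codes.tolist())       # [1, 0]
```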

Filter a pandas categorical DataFrame by column value, then update its categories
