Question

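I have a dataset with more than 200 numeric variables (type: int). A few of these variables are actually categorical, taking values such as (0, 1) or (0, 1, 2, 3, 4).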

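I need to identify these categorical variables and convert them to dummy variables. Identifying and converting them by hand takes a lot of time. Is there an easy way to do this?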

Answer 1

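You can either declare specific variables to be categorical yourself, or treat a variable as categorical based on how many unique values it has. For example, if a variable only takes the unique values [-2, 4, 56], it can be treated as categorical.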

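import pandas as pd
import numpy as np

# `train` is assumed to be an existing DataFrame; skip the id and target columns
col = [c for c in train.columns if c not in ['id', 'target']]
# Count the distinct values in each remaining column
numclasses = []
for c in col:
    numclasses.append(len(np.unique(train[c])))

threshold = 10
# Columns with fewer than `threshold` distinct values are treated as categorical
categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])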

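Every unique value in every variable treated as categorical will produce a new column. If you do not want to end up with many dummy columns later, use a small threshold.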
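
As a rough sketch of the dummy-encoding step described above, the following applies pandas.get_dummies to the columns detected as categorical. The toy `train` DataFrame, its column names, and the threshold of 6 are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `train` DataFrame
train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "flag": [0, 1, 0, 1, 1, 0],          # two distinct values
    "grade": [0, 1, 2, 3, 4, 2],         # five distinct values
    "amount": [10, 25, 37, 48, 59, 61],  # six distinct values
    "target": [0, 1, 0, 1, 0, 1],
})

threshold = 6  # assumed cut-off for this toy example
candidates = [c for c in train.columns if c not in ("id", "target")]
categorical_variables = [c for c in candidates if train[c].nunique() < threshold]

# One indicator column is created per distinct value of each categorical column
dummies = pd.get_dummies(train, columns=categorical_variables)
```

Here `flag` and `grade` fall below the threshold, so they are replaced by seven indicator columns in total, while `amount` is left unchanged. A larger threshold would pull in more columns and widen the result accordingly.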

Answer 2

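Use the nunique() function to get the number of unique values in each column, then filter the columns. Use your best judgment to choose the value of threshold. Finally, convert the selected features to the category type.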

category_features = []
threshold = 10
for each in df.columns:
    if df[each].nunique() < threshold:
        category_features.append(each)

for each in category_features:
    df[each] = df[each].astype('category')
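
The same filter can probably be written without explicit loops, since DataFrame.nunique() returns the distinct-value count for every column at once. A short sketch, using a hypothetical `df` whose column names are made up for illustration:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only
df = pd.DataFrame({
    "binary": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "level": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1],
    "measurement": list(range(100, 112)),  # 12 distinct values
})

threshold = 10
counts = df.nunique()  # Series mapping column name -> number of distinct values
category_features = counts[counts < threshold].index.tolist()

# Convert all selected columns in one call via a column -> dtype mapping
df = df.astype({name: "category" for name in category_features})
```

Passing a dict to astype leaves every column not named in the mapping with its original dtype.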

Answer 3

Possible duplicate: What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

That post has many answers, and any of them may help you. Take a look.
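
A fixed threshold ignores how many rows the dataset has. One possible alternative heuristic, shown here purely as an illustrative assumption and not as a summary of the linked post, compares the number of distinct values with the number of non-null rows. The function name `looks_categorical` and the 5% cut-off are hypothetical.

```python
import pandas as pd

def looks_categorical(series: pd.Series, max_ratio: float = 0.05) -> bool:
    """Return True when distinct values are rare relative to non-null rows."""
    non_null = series.count()
    if non_null == 0:
        return False
    return bool(series.nunique() / non_null < max_ratio)

# Hypothetical data: 100 rows, one repeated 4-level code and one value unique per row
df = pd.DataFrame({
    "level": [i % 4 for i in range(100)],
    "reading": list(range(100)),
})
flags = {name: looks_categorical(df[name]) for name in df.columns}
```

With this toy data, `level` is flagged because 4 distinct values out of 100 rows is below the cut-off, while `reading` is not. The cut-off would need tuning for any real dataset.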
