Question

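I am training a sklearn.tree.DecisionTreeClassifier. I start out with a pandas.core.frame.DataFrame. Some columns of this data frame are strings that really should be categorical. For example, 'Color' is one such column and has values such as 'Black', 'White', 'Red', and so on. So I convert this column to the category type: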

data['Color'] = data['Color'].astype('category')

This works fine. Now I split the data frame using sklearn.cross_validation.train_test_split, as follows:

X = data.drop(['OutcomeType'], axis=1)
y = data['OutcomeType']
X_train, X_test, y_train, y_test = train_test_split(X, y)

Now X_train is of type numpy.ndarray. However, the 'Color' values are no longer categorical; they are back to being strings.

So when I make the following call:

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X_train, y_train)

I get the following error:

ValueError: could not convert string to float: Black

What do I need to do to make this work?

Answer 1

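If you want to convert a categorical column to integers, you can use data.Color.cat.codes. This uses data.Color.cat.categories to perform the mapping: the i-th element of the categories array is mapped to the integer i.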
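A minimal sketch of that mapping, assuming a small made-up data frame (the values and the Color_code column name are illustrative and do not come from the question's data):

```python
import pandas as pd

# Hypothetical toy data standing in for the question's data frame.
data = pd.DataFrame({"Color": ["Black", "White", "Red", "Black"]})
data["Color"] = data["Color"].astype("category")

# For string values, the inferred categories are the sorted unique values.
categories = list(data["Color"].cat.categories)

# Each value is replaced by the position of its category in `categories`.
codes = data["Color"].cat.codes.tolist()

# Keep the codes in a new numeric column (hypothetical name) for the classifier.
data["Color_code"] = data["Color"].cat.codes

print(categories)
print(codes)
```

Note that the integer codes impose an arbitrary order on the colors. A tree can still split on them, but the ordering itself carries no meaning.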

Answer 2

As ayhan said, the solution is to create dummy features from your 'Color' variable (this is very common with decision trees / RF).

You can use something like the following:

import pandas as pd

def feature_to_dummy(df, column, drop=False):
    ''' take a series from a dataframe,
        convert it to dummy columns and name them like feature_value
        - df is a dataframe
        - column is the name of the column to be transformed
        - if drop is true, the series is removed from the dataframe'''
    tmp = pd.get_dummies(df[column], prefix=column, prefix_sep='_')
    # join_axes was removed in pandas 1.0; tmp already shares df's index
    df = pd.concat([df, tmp], axis=1)
    if drop:
        del df[column]
    return df

See pandas.get_dummies.


Example

    df
    Out[1]:
       color
    0    red
    1  black
    2  green

    df_dummy = feature_to_dummy(df, 'color', drop=True)
    df_dummy
    Out[2]:
       color_black  color_green  color_red
    0            0            0          1
    1            1            0          0
    2            0            1          0
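To connect this back to the question, here is a rough end-to-end sketch that dummy-encodes 'Color' before splitting and fitting. The toy data and the 'Age' column are made up for illustration. It also assumes a scikit-learn version where train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; 'Color' and 'OutcomeType' follow the question,
# the 'Age' column and all values are invented for this sketch.
data = pd.DataFrame({
    "Color": ["Black", "White", "Red", "Black", "White", "Red", "Black", "Red"],
    "Age": [1, 2, 3, 4, 5, 6, 7, 8],
    "OutcomeType": ["Adoption", "Transfer", "Adoption", "Transfer",
                    "Adoption", "Transfer", "Adoption", "Transfer"],
})

# Replace the string column with one 0/1 column per color.
X = pd.get_dummies(data.drop(["OutcomeType"], axis=1), columns=["Color"], dtype=int)
y = data["OutcomeType"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# All features are numeric now, so fit no longer hits the string-to-float error.
clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(list(X.columns))
```

Encoding before the split keeps the dummy columns identical in the training and test sets.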

sklearn DecisionTreeClassifier with strings that should be treated as categorical
