Question

I know I can use categorical_column_with_identity to turn a categorical feature into a series of one-hot features.

For instance, if my vocabulary is ["ON", "OFF", "UNKNOWN"]:
"OFF"-> [0, 1, 0]

categorical_column = tf.feature_column.categorical_column_with_identity('column_name', num_buckets=3)
feature_column = tf.feature_column.indicator_column(categorical_column)

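However, I actually have a 1-dimensional array of categorical features. I would like to turn it into a 2-dimensional array of one-hot features: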

["OFF", "ON", "OFF", "UNKNOWN", "ON"]
->
[[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

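Unlike every other feature column, categorical_column_with_identity does not seem to have a shape attribute, and I did not find any help through Google or the docs.

Do I have to give up on categorical_column_with_identity and create the 2D array myself via numerical_column?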

Answer 1

Based on the comments, I'm not sure this functionality is available in TensorFlow. But with Pandas you have a simple solution via pd.get_dummies:

import pandas as pd

L = ['OFF', 'ON', 'OFF', 'UNKNOWN', 'ON']

res = pd.get_dummies(L)

print(res)

   OFF  ON  UNKNOWN
0    1   0        0
1    0   1        0
2    1   0        0
3    0   0        1
4    0   1        0

For performance, or if you only need a NumPy array, you can use LabelBinarizer from sklearn.preprocessing:

from sklearn.preprocessing import LabelBinarizer

LB = LabelBinarizer()

res = LB.fit_transform(L)

print(res)

array([[1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0]])
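Note that both outputs above order the columns alphabetically (OFF, ON, UNKNOWN), while the vocabulary in the question is ["ON", "OFF", "UNKNOWN"]. If the column order matters, one possible sketch with plain NumPy is to compare every value against the vocabulary through broadcasting. The names `vocabulary`, `features`, and `one_hot` are illustrative only:

```python
import numpy as np

# Vocabulary order from the question: column 0 = "ON", 1 = "OFF", 2 = "UNKNOWN".
vocabulary = np.array(["ON", "OFF", "UNKNOWN"])
features = np.array(["OFF", "ON", "OFF", "UNKNOWN", "ON"])

# features[:, None] has shape (5, 1) and vocabulary has shape (3,), so the
# comparison broadcasts to a (5, 3) boolean array, which is then cast to 0/1.
one_hot = (features[:, None] == vocabulary).astype(int)

print(one_hot)
```

With this approach a value that is not in the vocabulary yields an all-zero row rather than an error, which may or may not be what you want.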

Answer 2

A couple of options for binary encoding:

import tensorflow as tf
test = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
encoding = {x:idx for idx, x in enumerate(sorted(set(test)))}
test = [encoding[x] for x in test]
print(tf.keras.utils.to_categorical(test, num_classes=len(encoding)))

>>>[[1. 0. 0.]
    [0. 1. 0.]
    [1. 0. 0.]
    [0. 0. 1.]
    [0. 1. 0.]]

Or with scikit-learn, as in the other answer:

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
transformed_label = encoder.fit_transform(["OFF", "ON", "OFF", "UNKNOWN", "ON"])
print(transformed_label)

>>>[[1 0 0]
    [0 1 0]
    [1 0 0]
    [0 0 1]
    [0 1 0]]
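To see which column corresponds to which label, and to get back from one-hot rows to the original strings, the fitted LabelBinarizer exposes classes_ and inverse_transform. A minimal sketch, with `labels`, `encoded`, and `decoded` as illustrative names:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
encoder = LabelBinarizer()
encoded = encoder.fit_transform(labels)

# classes_ lists the labels in column order (sorted), so column 0 is "OFF".
print(encoder.classes_)

# inverse_transform maps each one-hot row back to its original label.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```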

Answer 3

You can use a dict as a map like this:

categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_features = []

# Named one_hot_map rather than map to avoid shadowing the built-in map().
one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}

for val in categorical_features:
    one_hot_features.append(one_hot_map[val])

Or with a list comprehension:

categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]

one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}
one_hot_features = [one_hot_map[f] for f in categorical_features]

This should give you what you want.
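If writing the mapping out by hand becomes tedious for larger vocabularies, the rows can be derived from a vocabulary list instead. This is only a sketch; the helper name `one_hot_encode` is hypothetical and not part of any library:

```python
def one_hot_encode(values, vocabulary):
    """Return one one-hot row per value, with columns in vocabulary order."""
    index_of = {label: i for i, label in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index_of[value]] = 1  # raises KeyError for labels outside the vocabulary
        rows.append(row)
    return rows


categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_features = one_hot_encode(categorical_features, ["ON", "OFF", "UNKNOWN"])
print(one_hot_features)
```

Failing loudly on an unexpected label is a deliberate choice here; a dedicated "UNKNOWN" fallback could be used instead if that fits the data better.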

Specifying shape for a categorical feature column?
