Question: How can a one-hot encoding be stored as an object?

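First, some background on my model architecture.

The inputs to my Keras model are very simple:

Categorical variable A
Categorical variable B
Numeric input C, in the range [0,1]

The model has a single output:

A number in [0,1]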

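When training the model, my input data is a dataframe read from a SQL database with pd.read_sql(). I one-hot encode categorical variables A and B (located in col1 and col2, respectively, of the dataframe original_data) with the following function:

import numpy as np
from keras import utils as np_utils

def preprocess_categorical_features(self):
    col1 = np_utils.to_categorical(np.copy(self.original_data.CURRENT_RTIF.values))
    col2 = np_utils.to_categorical(np.copy(self.original_data.NEXT_RTIF.values))
    cat_input_data = np.append(col1, col2, axis=1)
    return cat_input_data

Later, when I need to make predictions with the model, the input data comes from a live RabbitMQ feed in the form of a dictionary. That RabbitMQ data has to go through its own (different) preprocess_categorical_features() function.

This brings me to my question: how do I make sure the one-hot encoding is exactly the same whether I am preprocessing data from the database or from the RabbitMQ feed?

The one-hot encoding of A applied to the database data and the one applied to the RabbitMQ data must be identical, for example: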

|---------------------|------------------|
|          A          | One-Hot-Encoding |
|---------------------|------------------|
|       "coconut"     |      <0,1,0,0>   |
|---------------------|------------------|
|       "apple"       |      <1,0,0,0>   |
|---------------------|------------------|
|       "quince"      |      <0,0,0,1>   |
|---------------------|------------------|
|       "plum"        |      <0,0,1,0>   |
|---------------------|------------------|

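Is there a way to save the encoding as a dataframe, a NumPy ndarray, or a dictionary so that I can pass it from the function that preprocesses the training data to the function that preprocesses the live input data? I am open to using a library other than Keras for the OHE, but I would like to know whether there is a way to do this with the Keras to_categorical function I am currently using.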

Answer 1

I decided not to use Keras's utils.to_categorical and to use sklearn.preprocessing.OneHotEncoder instead. This lets me declare a one-hot encoder object, self.encoder, when processing the training data:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

class TrainingData:
    def preprocess_categorical_features(self):
        # declare OneHotEncoder object to save for later
        # (scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False)
        self.encoder = OneHotEncoder(sparse_output=False)

        # fit encoder to data
        self.encoder.fit(self.original_data.CURRENT_RTIF.values.reshape(-1, 1))

        # perform one-hot-encoding on columns 1 and 2 of the training data
        col1 = self.encoder.transform(self.original_data.CURRENT_RTIF.values.reshape(-1, 1))
        col2 = self.encoder.transform(self.original_data.NEXT_RTIF.values.reshape(-1, 1))

        # return one-hot-encoded data as a numpy ndarray
        cat_input_data = np.append(col1, col2, axis=1)
        return cat_input_data
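
One caveat with the method above: the encoder is fit on CURRENT_RTIF alone, and with the default settings transform() raises a ValueError for any category it did not see during fit. If NEXT_RTIF can contain values that never occur in CURRENT_RTIF, one possible workaround is to fit on the values of both columns. The sketch below assumes the two columns share one vocabulary; the toy arrays are hypothetical stand-ins for the real columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns standing in for CURRENT_RTIF and NEXT_RTIF.
current = np.array(["apple", "coconut", "apple"])
next_ = np.array(["plum", "apple", "quince"])  # "plum" and "quince" never occur in `current`

# Fit on the values of both columns so that every category gets its own slot.
encoder = OneHotEncoder()
encoder.fit(np.concatenate([current, next_]).reshape(-1, 1))

# The default output is a SciPy sparse matrix; .toarray() converts it to a dense ndarray.
col1 = encoder.transform(current.reshape(-1, 1)).toarray()
col2 = encoder.transform(next_.reshape(-1, 1)).toarray()
cat_input_data = np.append(col1, col2, axis=1)

print(encoder.categories_[0])  # the learned categories, in sorted order
print(cat_input_data.shape)
```
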

Later, I can reuse that encoder, passing it as the training_data_ohe_encoder argument, in the method that processes the input data needed to make a prediction:

class LiveData:
    def preprocess_categorical_features(self, training_data_ohe_encoder):
        # notice the training_data_ohe_encoder parameter; this is the
        # encoder attribute from the TrainingData class.

        # one-hot-encode the live data using the training_data_ohe_encoder encoder
        col1 = training_data_ohe_encoder.transform(np.copy(self.preprocessed_data.CURRENT_RTIF.values).reshape(-1, 1))
        col2 = training_data_ohe_encoder.transform(np.copy(self.preprocessed_data.NEXT_RTIF.values).reshape(-1, 1))

        # return one-hot-encoded data as a numpy ndarray
        cat_input_data = np.append(col1, col2, axis=1)
        return cat_input_data
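
If training and live prediction run in separate processes, the fitted encoder cannot simply be passed as an argument; it has to be saved and loaded again. A minimal sketch of one way to do that with pickle follows. The category values are hypothetical, and handle_unknown="ignore" is an optional choice (not part of the answer above) that encodes unseen live values as an all-zero row instead of raising.

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training side: fit once on hypothetical category values, then serialize.
train_values = np.array(["coconut", "apple", "quince", "plum"]).reshape(-1, 1)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)
blob = pickle.dumps(encoder)  # pickle.dump(encoder, file) would write it to disk instead

# Prediction side: restore the encoder and apply it to live values.
restored = pickle.loads(blob)
live = np.array([["plum"], ["mango"]])  # "mango" was never seen during training
encoded = restored.transform(live).toarray()
print(encoded)
```

A pickled estimator is generally only safe to load with the same scikit-learn version that wrote it, so the training and prediction environments should be kept in sync.
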
