Question

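The basic task I have at hand is:

a) Read some tab-delimited data.

b) Do some basic preprocessing.

c) For each categorical column, use LabelEncoder to create a mapping. It looks something like this: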

from sklearn import preprocessing

# Converting categorical data: one LabelEncoder per categorical column
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()

for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])

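where df is a pandas DataFrame and categorical_list is the list of column headers that need to be converted.

d) Train a classifier and save it to disk using pickle.

e) Now, in a different program, the saved model is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used to convert the categorical data.

h) The model is used to predict.

Now my question is: will step g) work correctly?

As the documentation for LabelEncoder says: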

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

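So will each entry hash to exactly the same value every time?

If not, what is a good way to handle this? Is there a way to retrieve the encoder's mappings? Or is there an altogether different approach than LabelEncoder?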

Answer 1

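According to the LabelEncoder implementation, the pipeline you describe will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values.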
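To make that condition concrete, here is a minimal sketch. The city names and the `mapping` helper are made up for illustration. It relies on the fact that a fitted LabelEncoder stores the sorted unique labels in `classes_` and uses each label's position there as its code.

```python
from sklearn.preprocessing import LabelEncoder


def mapping(values):
    """Fit a fresh LabelEncoder and return its label -> code mapping (hypothetical helper)."""
    enc = LabelEncoder().fit(values)
    # The code of a label is its index in the sorted classes_ array.
    return {str(label): int(code) for code, label in enumerate(enc.classes_)}


# Same set of unique values, different order and counts: identical mapping.
codes_train = mapping(["paris", "tokyo", "amsterdam", "paris"])
codes_same = mapping(["tokyo", "amsterdam", "paris"])

# One value ("amsterdam") is missing at test time: the codes shift.
codes_missing = mapping(["tokyo", "paris"])

print(codes_train)
print(codes_same)
print(codes_missing)
```

So no hashing is involved: refitting gives the same codes only when the unique values match, which is why reusing the training encoders is the safer route.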

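There is a way to reuse the LabelEncoders you obtained during training. LabelEncoder has only one attribute, namely classes_. You can save it to disk and restore it later, like this:

Train: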

import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

Test:

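import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ was saved as an object array
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`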

This seems more efficient than refitting on the same data.

Answer 2

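What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col, and then reusing the same objects to transform the same categorical column col in the validation dataset. Basically, you have one label encoder object per categorical column.

So, call fit() on the training data and pickle the objects/models corresponding to each column in the training DataFrame X_train.
For each column col in the validation set X_cv, load the corresponding object/model and apply the transformation through its transform function: transform(X_cv[col]).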
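The steps above could be sketched roughly as follows. The data, the column names, and the file path are placeholders, and storing all encoders in one dictionary (rather than one pickle file per column) is a small variation chosen for brevity.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and validation frames with two categorical columns.
X_train = pd.DataFrame({"city": ["paris", "tokyo", "paris"],
                        "tier": ["gold", "basic", "gold"]})
X_cv = pd.DataFrame({"city": ["tokyo", "paris"],
                     "tier": ["gold", "gold"]})
categorical_cols = ["city", "tier"]

# Training program: fit one encoder per column and pickle them together.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_cols}
path = os.path.join(tempfile.gettempdir(), "label_encoders.pkl")  # placeholder path
with open(path, "wb") as f:
    pickle.dump(encoders, f)

# Other program: load the fitted encoders and only transform, never refit.
with open(path, "rb") as f:
    loaded = pickle.load(f)
for col in categorical_cols:
    X_cv[col] = loaded[col].transform(X_cv[col])

print(X_cv)
```

Note that transform() raises a ValueError if the validation data contains a label that was not seen during fit, so unseen categories have to be handled before this step.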

Answer 3

For me, the easiest way was to export the LabelEncoder for each column as a .pkl file. You have to export each column's encoder after calling fit_transform().

For example:

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# Export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)

Then, in the testing project, you can load the LabelEncoder object and apply transform() directly:

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# Load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])

Answer 4

After you have encoded your values with the "le" object, you can do the following:

encoding = {}
for i in list(le.classes_):
    encoding[i]=le.transform([i])[0]

You will get an "encoding" dictionary holding the encodings for later use. For example, with pandas you can export this dictionary to a CSV file.
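A minimal sketch of that CSV round trip, assuming made-up labels and using an in-memory buffer in place of a real file; the column names "label" and "code" are arbitrary choices for this example:

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["paris", "tokyo", "amsterdam"])  # hypothetical labels

# Same idea as the loop above, with plain Python types so it serializes cleanly.
encoding = {str(label): int(code)
            for label, code in zip(le.classes_, le.transform(le.classes_))}

# Export to CSV; the StringIO buffer stands in for a file on disk.
buffer = io.StringIO()
pd.Series(encoding, name="code").rename_axis("label").to_csv(buffer)

# Later, possibly in another program: read the mapping back and apply it.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="label")["code"].to_dict()
encoded = pd.Series(["tokyo", "amsterdam"]).map(restored)

print(restored)
print(encoded.tolist())
```

With Series.map, a label that is missing from the dictionary becomes NaN instead of raising an error, so it is worth checking the result for missing values.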

Answer 5

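As I found no other post about nominal/categorical encoding, I expanded on the solutions above and am sharing my approach for OrdinalEncoder (which is probably what the author wanted anyway).

I did the following with OrdinalEncoder (but it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

1. Create a dictionary of encoders
2. Save it using numpy
3. Load it using numpy
4. Iterate over the dictionary and apply the transform to each column

Note: np stands for numpy.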

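# ------- steps 1 and 2: in the file/cell where the encoding is exported
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder_dict = dict()

for nom in nominal_columns:
    enc = OrdinalEncoder()  # a fresh encoder for each column
    enc = enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]

# np.save pickles the dictionary, so loading it requires allow_pickle=True
np.save('FILE_NAME.npy', encoder_dict)

# ------- steps 3 and 4: in the file where the encoding is imported
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()

# Setting categories_ by hand bypasses fit() and relies on scikit-learn
# internals, so it may not work the same way in every version.
for nom in encoder_dict:
    if nom in df.columns:
        enc.categories_ = encoder_dict[nom]
        df[[nom]] = enc.transform(df[[nom]])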
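As a possible variation on the same idea (not the approach used in this answer), the saved categories can be passed to OrdinalEncoder through its `categories` constructor parameter, so that no fitted attribute has to be assigned by hand. The sketch below uses made-up data and column names, and JSON instead of a .npy file.

```python
import json

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

nominal_columns = ["city", "tier"]  # hypothetical column names
df_train = pd.DataFrame({"city": ["paris", "tokyo", "oslo"],
                         "tier": ["gold", "basic", "gold"]})
df_test = pd.DataFrame({"city": ["tokyo", "oslo"],
                        "tier": ["gold", "gold"]})

# Exporting side: learn the categories and serialize them as plain lists.
fitted = OrdinalEncoder().fit(df_train[nominal_columns])
saved = json.dumps({col: [str(c) for c in cats]
                    for col, cats in zip(nominal_columns, fitted.categories_)})

# Importing side: rebuild an encoder with the saved categories fixed.
categories = json.loads(saved)
enc = OrdinalEncoder(categories=[categories[col] for col in nominal_columns])
# fit() does not relearn the categories here; it uses the ones passed in.
enc.fit(df_test[nominal_columns])
encoded = enc.transform(df_test[nominal_columns])
df_encoded = pd.DataFrame(encoded, columns=nominal_columns, index=df_test.index)

print(df_encoded)
```

Because the categories are given explicitly, the codes match the training-time codes even though the test data contains only a subset of the values.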

Answer 6

If you are already saving the model with pickle, I would do the same for the preprocessing tools.

One way is to combine everything into a single class:

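import pickle
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        # One encoder per categorical column, stored on the instance
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)

my_classifier = MyClassifier()
my_classifier.fit()

# `file` must be a file object opened in binary write mode
pickle.dump(my_classifier, file)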

Note: for input categories, you may want to use OrdinalEncoder instead of LabelEncoder.
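One way to get a similar effect without writing a custom class is scikit-learn's own Pipeline, which bundles the encoder and the model into a single picklable object. This is a swapped-in technique rather than the class shown above, and the data, column names, and `n_neighbors=1` setting are assumptions made only to keep the sketch small.

```python
import pickle

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with two categorical input columns.
X_train = pd.DataFrame({"city": ["paris", "tokyo", "oslo"],
                        "tier": ["gold", "basic", "gold"]})
y_train = [1, 0, 1]

# Training program: the encoder and the classifier are fitted together.
pipeline = make_pipeline(OrdinalEncoder(), KNeighborsClassifier(n_neighbors=1))
pipeline.fit(X_train, y_train)
blob = pickle.dumps(pipeline)  # pickle.dump to a file works the same way

# Other program: one load restores both the categories and the model.
restored = pickle.loads(blob)
X_new = pd.DataFrame({"city": ["tokyo", "oslo"], "tier": ["basic", "gold"]})
predictions = restored.predict(X_new)

print(predictions.tolist())
```

Since the fitted encoder travels inside the pickled pipeline, the prediction program never has to refit or separately load any encoder.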

Using scikit-learn's LabelEncoder correctly across multiple programs
