ID | Animal | Age | Habitat
0 | Fish | 2 | Sea
1 | Hawk | 1 | Mountain
2 | Fish | 3 | Sea
3 | Snake | 4 | Forest

If I apply one-hot encoding, it generates the following matrix:

ID | Animal_Fish | Animal_Hawk | Animal_Snake | Age | ...
0 | 1 | 0 | 0 | 2 | ...
1 | 0 | 1 | 0 | 1 | ...
2 | 1 | 0 | 0 | 3 | ...
3 | 0 | 0 | 1 | 4 | ...

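Most of the time this works nicely. But what if my test set contains fewer (or more) features than the training set? What if my test set doesn't contain "Fish"? Encoding it would generate one category fewer.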
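To make the problem concrete, here is a minimal sketch. It assumes pandas and `pd.get_dummies` (no library is named above), and the test rows are made up for illustration.

```python
import pandas as pd

# Training data mirroring the table above (Habitat omitted for brevity)
train = pd.DataFrame({"Animal": ["Fish", "Hawk", "Fish", "Snake"],
                      "Age": [2, 1, 3, 4]})

# Made-up test data that contains no "Fish" rows
test = pd.DataFrame({"Animal": ["Hawk", "Snake"],
                     "Age": [5, 2]})

# Encode each set independently, as one would do naively
train_encoded = pd.get_dummies(train, columns=["Animal"])
test_encoded = pd.get_dummies(test, columns=["Animal"])

# Columns produced for train but not for test
missing_in_test = sorted(set(train_encoded.columns) - set(test_encoded.columns))
print(missing_in_test)  # ['Animal_Fish']
```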

Can you help me figure out how to handle this kind of problem?

Thanks

Answer 1

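It sounds like your train and test sets are completely separate. Here is a minimal example of how to automatically add the "missing" features to a given dataset: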

import numpy as np
import pandas as pd

# Made-up training dataset
train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})

# Made-up test dataset (notice how two classes from train are missing entirely)
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# Discrete column to be one-hot encoded
col = 'animal'

# Create dummy variables for each level of `col`
train_animal_dummies = pd.get_dummies(train[col], prefix=col)
train = train.join(train_animal_dummies)

test_animal_dummies = pd.get_dummies(test[col], prefix=col)
test = test.join(test_animal_dummies)

# Find the difference in columns between the two datasets.
# This works in the trivial case, but if you want to limit it to just one feature
# use this: f = lambda c: col in c; feature_difference = set(filter(f, train)) - set(filter(f, test))
feature_difference = set(train) - set(test)

# Create a zero-filled matrix where the number of rows equals the number
# of rows in `test` and the number of columns equals the number of missing
# categories (i.e. the set difference between the relevant `train` and `test` columns).
# Reusing `test.index` keeps the join below aligned even if the index is not 0..n-1.
feature_difference_df = pd.DataFrame(data=np.zeros((test.shape[0], len(feature_difference))),
                                     columns=list(feature_difference),
                                     index=test.index)

# Add the "missing" features back to `test`
test = test.join(feature_difference_df)

test goes from this:

   age animal  animal_dog  animal_fish
0   15   fish         0.0          1.0
1   62   fish         0.0          1.0
2    1    dog         1.0          0.0

to this:

   age animal  animal_dog  animal_fish  animal_cat  animal_bear
0   15   fish         0.0          1.0         0.0          0.0
1   62   fish         0.0          1.0         0.0          0.0
2    1    dog         1.0          0.0         0.0          0.0

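Assuming each row (each animal) can only be one animal, it's fine to append an animal_bear feature (a sort of "is-a-bear" test/feature), because if there were any bears in test, that information would already have been captured in the animal column.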

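As a rule of thumb, it's a good idea to try to account for all possible features (for example, all possible values of animal) when building/training a model. As mentioned in the comments, some methods handle missing data better than others, but if you can cover everything from the outset, that's probably the better route. That gets hard if you accept free-text input, since the number of possible inputs is never-ending.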
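One way to account for every level up front, sketched here under the assumption that the full list of animals is known in advance: declare the column as a categorical with fixed categories, and `pd.get_dummies` then emits a column for each category whether or not it appears in the data. `ALL_ANIMALS` and `encode_animals` are made-up names for illustration.

```python
import pandas as pd

# Hypothetical full list of levels, assumed to be known before training
ALL_ANIMALS = ["bear", "cat", "dog", "fish"]


def encode_animals(df, col="animal"):
    """One-hot encode `col`, emitting one column per level in ALL_ANIMALS."""
    dtype = pd.CategoricalDtype(categories=ALL_ANIMALS)
    # Values outside ALL_ANIMALS become NaN and get all-zero dummy rows
    fixed = df[col].astype(dtype)
    return df.join(pd.get_dummies(fixed, prefix=col))


test = pd.DataFrame({"animal": ["fish", "fish", "dog"],
                     "age": [15, 62, 1]})
encoded = encode_animals(test)
print(list(encoded.columns))
```

Applying the same function to train and test gives both the same dummy columns in the same order, so no patching is needed afterwards.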

Answer 2

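The train set determines which features you can use for recognition. If you're lucky, your recognizer will simply ignore unknown features (I believe NaiveBayes does); otherwise you'll get an error. So save the set of feature names you created during training, and use them during testing/recognition.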
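A rough sketch of that idea, assuming the features live in a pandas DataFrame (this answer does not name a library, so that part is an assumption): keep the encoded column names from training, then reindex the encoded test set against them. The "owl" row is made up to show an unseen category being dropped.

```python
import pandas as pd

train = pd.DataFrame({"animal": ["cat", "dog", "fish", "bear"],
                      "age": [12, 31, 12, 90]})
# Made-up test set: no "cat" or "bear", plus an "owl" never seen in training
test = pd.DataFrame({"animal": ["fish", "dog", "owl"],
                     "age": [15, 1, 3]})

train_encoded = pd.get_dummies(train, columns=["animal"])
# Save this list alongside the trained model
feature_names = list(train_encoded.columns)

test_encoded = pd.get_dummies(test, columns=["animal"])
# Adds train-only columns filled with 0, drops test-only columns,
# and puts the columns in the same order as in training
test_aligned = test_encoded.reindex(columns=feature_names, fill_value=0)
print(list(test_aligned.columns))
```

Note that the "owl" row ends up with zeros in every animal column, which may or may not be what the model should see for an unknown category.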

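Some recognizers treat a missing binary feature the same as a zero value. I believe this is what NLTK's NaiveBayesClassifier does, but other engines might have different semantics. So for binary present/absent features, I would write my feature extraction function so that it always puts the same keys in the feature dictionary.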
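Such an extraction function might look like the sketch below. `KNOWN_ANIMALS` and `extract_features` are hypothetical names, and the vocabulary is assumed to be fixed at training time.

```python
# Hypothetical vocabulary of binary features, fixed at training time
KNOWN_ANIMALS = ("bear", "cat", "dog", "fish")


def extract_features(animal):
    """Return a feature dict with the same keys for every input."""
    return {f"is_{name}": animal == name for name in KNOWN_ANIMALS}


print(extract_features("dog"))
print(extract_features("owl"))  # unseen value: same keys, all False
```

Because absent features are written out explicitly as False, the result does not depend on how a particular classifier treats missing keys.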

machine learning - Test set has fewer features than the train set