Question

I have 3 sets of data (training, validation and testing), and when I run:

    training_x = pd.get_dummies(training_x, columns=['a', 'b', 'c'])

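It gives me a certain number of features. But when I run it on the validation data, it gives me a different number, and the same happens for the test data. Is there any way to normalize (wrong word, I know) across all datasets so that the number of features is aligned?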

Answer 1

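Dummies should be created before splitting the dataset into train, test, or validation.

Suppose I have train and test data frames as follows: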

import pandas as pd
# six distinct values, matching the six dummy columns in the output below
train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
test = pd.DataFrame([7, 8], columns=['A'])

#creating dummy for train 
pd.get_dummies(train, columns= ['A'])

Output:
   A_1  A_2  A_3  A_4  A_5  A_6
0    1    0    0    0    0    0
1    0    1    0    0    0    0
2    0    0    1    0    0    0
3    0    0    0    1    0    0
4    0    0    0    0    1    0
5    0    0    0    0    0    1



# creating dummies for test data
pd.get_dummies(test, columns = ['A'])
    A_7  A_8
0    1    0
1    0    1

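So the dummies for categories 7 and 8 exist only in the test set, and the two results end up with different features. Instead, concatenate the frames first and encode them once: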

final_df = pd.concat([train, test]) 

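# columns= is needed here: A is numeric, and by default get_dummies
# only encodes non-numeric (object, string, category) columns
dummy_created = pd.get_dummies(final_df, columns=['A'])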

# now you can split it into train and test 
from sklearn.model_selection import train_test_split
train_x, test_x = train_test_split(dummy_created, test_size=0.33)

Now train and test will have the same features.
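Note that `train_test_split` reshuffles the rows, so the original train/test membership is lost. If the splits already exist and must be kept, one possible variation (a sketch, not part of the answer above) is to tag each frame with `keys=` when concatenating and select the splits back afterwards. The names `combined`, `encoded`, `train_x` and `test_x` are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"A": [1, 2, 3]})
test = pd.DataFrame({"A": [7, 8]})

# Stack both frames, tagging each row with the split it came from.
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once, so every category seen in either split gets a column.
encoded = pd.get_dummies(combined, columns=["A"])

# Recover the original splits by selecting on the outer index level.
train_x = encoded.loc["train"]
test_x = encoded.loc["test"]

print(list(train_x.columns))  # ['A_1', 'A_2', 'A_3', 'A_7', 'A_8']
print(train_x.shape, test_x.shape)
```

Both frames now carry the same five columns, and the rows stay in the split they started in.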

Answer 2

A simple solution is to align your validation and test sets with the training set after applying the dummies function. Here is how:

# Pandas encoding the data, dummies function creates different feature for each dataset
train = pd.get_dummies(train)
valid = pd.get_dummies(valid)
test = pd.get_dummies(test)

# Align the number of features across validation and test sets based on train dataset
train, valid = train.align(valid, join='left', axis=1)
train, test = train.align(test, join='left', axis=1)
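To make the effect of `join='left'` concrete, here is a small sketch on made-up data (the `color` column and its values are assumptions for illustration). Columns that exist only in the validation set are dropped, and columns that exist only in the training set are added to the validation set as all-missing columns.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
valid = pd.DataFrame({"color": ["red", "purple"]})

train_d = pd.get_dummies(train)  # color_blue, color_green, color_red
valid_d = pd.get_dummies(valid)  # color_purple, color_red

# Keep exactly the training columns on both sides.
train_a, valid_a = train_d.align(valid_d, join="left", axis=1)

print(list(valid_a.columns))              # same columns as train_a
print("color_purple" in valid_a.columns)  # False: unseen category dropped
print(valid_a["color_blue"].isna().all()) # True: added column is all NaN
```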

Answer 3

You can convert the columns that need to be turned into dummy variables to the category dtype:

df.col_1=df.col_1.astype('category')
df1=df.iloc[:1,:].copy()
df2=df.drop(df1.index)
pd.get_dummies(df1,columns=['col_1'])
Out[701]: 
      col_2 col3  col_1_A  col_1_D  col_1_G  col_1_J
index                                               
0         B    C        1        0        0        0
# columns for categories missing from this subset still appear, filled with zeros
pd.get_dummies(df2,columns=['col_1'])
Out[702]: 
      col_2 col3  col_1_A  col_1_D  col_1_G  col_1_J
index                                               
1         E    F        0        1        0        0
2         H    I        0        0        1        0
3         K    L        0        0        0        1
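The session above does not show how `df` was built. A self-contained version might look like the following, where the frame contents are reconstructed from the printed output and are therefore an assumption.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
df = pd.DataFrame({
    "col_1": ["A", "D", "G", "J"],
    "col_2": ["B", "E", "H", "K"],
    "col3": ["C", "F", "I", "L"],
})

# The category dtype remembers all four levels, even in a subset.
df["col_1"] = df["col_1"].astype("category")

df1 = df.iloc[:1].copy()  # first row only
df2 = df.drop(df1.index)  # remaining rows

d1 = pd.get_dummies(df1, columns=["col_1"])
d2 = pd.get_dummies(df2, columns=["col_1"])

print(list(d1.columns))
print(list(d1.columns) == list(d2.columns))  # True
```

This works because both subsets were sliced from a frame whose column already had the category dtype. For frames loaded separately, the same list of categories would have to be applied to each one, for example through a shared `pd.CategoricalDtype`.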

Answer 4

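As already stated, you should normally do the one-hot encoding before splitting. But there is another problem. One day you will surely want to apply your trained ML model to data in the wild, meaning data you have never seen before, and you will need to apply exactly the same dummy transformation as when you trained the model. You may then have to deal with two cases: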

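A. the new data contains categories that were not in your training data, and
B. a category no longer appears in your dataset, but your model was trained with it.

In case A you should ignore the value, since your model most likely cannot deal with a category it was never trained on. In case B you should still generate these empty categories, so that the data to be predicted has the same structure as the training set. Note that the pandas method does not generate dummy variables for these categories, so there is no guarantee that your prediction data gets the same structure as your training data, and your model will most likely not be applicable to it.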

You can solve this by using the sklearn equivalent of get_dummies (with just a little more work), like this:


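import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# create some example data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})

# create a one hot encoder to create the dummies and fit it to the data
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(df[['x']])

# now let's simulate the two situations A and B
df.loc[1, 'x'] = 1
# DataFrame.append was removed in pandas 2.0, so concat is used instead
df = pd.concat([df, pd.DataFrame([dict(x=5, y=5)])], ignore_index=True)

# the actual feature generation is done in a separate step
# (.toarray() replaces the sparse=False argument, which newer sklearn versions renamed)
tr = ohe.transform(df[['x']]).toarray()

# if you need the columns in your existing data frame, you can glue them together
df2 = pd.DataFrame(tr, columns=['oh1', 'oh2', 'oh3'], index=df.index)
result = pd.concat([df, df2], axis='columns')

With sklearn's OneHotEncoder you can separate the identification of the categories from the actual one-hot encoding (the creation of the dummies). You can also save the fitted one-hot encoder and apply it later when the model is used. Note the handle_unknown option, which tells the one-hot encoder to ignore any unknown value it encounters later instead of raising an error.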
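If you would rather stay in pandas, a different technique from the encoder above is to remember the training columns and reindex new data to them. The sketch below assumes a hypothetical `train_columns` object that you store alongside the model. Categories unseen in training are dropped, and categories missing from the new data come back as zero-filled columns.

```python
import pandas as pd

train = pd.DataFrame({"x": [1, 2, 3]})
train_d = pd.get_dummies(train, columns=["x"])
train_columns = train_d.columns  # hypothetical: saved together with the model

# New data: 5 was never seen in training, 2 and 3 no longer appear.
new = pd.DataFrame({"x": [1, 1, 5]})
new_d = pd.get_dummies(new, columns=["x"]).reindex(
    columns=train_columns, fill_value=0
)

print(list(new_d.columns))  # ['x_1', 'x_2', 'x_3']
```

The row holding the unseen value 5 ends up with zeros in every dummy column, which mirrors what `handle_unknown='ignore'` does.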

Answer 5

Quoted from Kaggle: Link

Don't forget to add fill_value=0 to avoid NaN in the test set...
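A short sketch of the difference, using made-up data (the `color` column is an assumption for illustration): without `fill_value`, the columns added to the test set are filled with NaN, while `fill_value=0` fills them with zeros.

```python
import pandas as pd

train = pd.get_dummies(pd.DataFrame({"color": ["red", "green", "blue"]}))
test = pd.get_dummies(pd.DataFrame({"color": ["red", "purple"]}))

# Default: columns missing from test are added as NaN.
_, test_nan = train.align(test, join="left", axis=1)

# With fill_value=0 the added columns hold zeros instead.
_, test_zero = train.align(test, join="left", axis=1, fill_value=0)

print(int(test_nan.isna().sum().sum()))   # number of NaN cells without fill_value
print(int(test_zero.isna().sum().sum()))  # 0
```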

How to align pandas get_dummies across training / validation / testing?
