Say I have the following data:
import pandas as pd
data = {
'Reference': [1, 2, 3, 4, 5],
'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
'Mileage': [35000, 45000, 121000, 35000, 181000],
'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)
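I want to one-hot encode the two columns "Brand" and "Town" so that I can train a classifier (with Scikit-Learn, for example) and predict the year.
Once the classifier is trained, I will want to predict the year for new input data (not used in training), so I will need to re-apply the same one-hot encoding. For example: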
new_data = {
'Reference': [6, 7],
'Brand': ['Volvo', 'Audi'],
'Town': ['Stockholm', 'Munich']
}
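In this situation, what is the best way to one-hot encode these 2 columns of a Pandas DataFrame, given that several columns need to be encoded and that the same encoding must be applied to new data later?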
This is a follow-up question to How to re-use LabelBinarizer for input prediction in SkLearn
Answer 0 (score: 3)
Demo:
from sklearn.preprocessing import LabelBinarizer
from collections import defaultdict
d = defaultdict(LabelBinarizer)
In [7]: cols2bnrz = ['Brand','Town']
In [8]: df[cols2bnrz].apply(lambda x: d[x.name].fit(x))
Out[8]:
Brand LabelBinarizer(neg_label=0, pos_label=1, spars...
Town LabelBinarizer(neg_label=0, pos_label=1, spars...
dtype: object
In [10]: new = pd.DataFrame({
...: 'Reference': [6, 7],
...: 'Brand': ['Volvo', 'Audi'],
...: 'Town': ['Stockholm', 'Munich']
...: })
In [11]: new
Out[11]:
Brand Reference Town
0 Volvo 6 Stockholm
1 Audi 7 Munich
In [12]: pd.DataFrame(d['Brand'].transform(new['Brand']), columns=d['Brand'].classes_)
Out[12]:
Audi Volkswagen Volvo
0 0 0 1
1 1 0 0
In [13]: pd.DataFrame(d['Town'].transform(new['Town']), columns=d['Town'].classes_)
Out[13]:
Berlin Munich Stockholm
0 0 0 1
1 0 1 0
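To feed a classifier, the per-column results above would still need to be joined into one feature table. The following is only a sketch of one way to do that, not part of the original answer: the helper name `encode` and the `Column_value` naming scheme are assumptions made for illustration.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
})
new = pd.DataFrame({
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})

cols2bnrz = ['Brand', 'Town']

# One LabelBinarizer per column, fitted on the training data only
d = defaultdict(LabelBinarizer)
for col in cols2bnrz:
    d[col].fit(df[col])


def encode(frame):
    """Hypothetical helper: replace each categorical column with its binarized columns."""
    parts = [frame.drop(columns=cols2bnrz)]
    for col in cols2bnrz:
        parts.append(pd.DataFrame(
            d[col].transform(frame[col]),
            columns=[f"{col}_{cls}" for cls in d[col].classes_],
            index=frame.index,
        ))
    return pd.concat(parts, axis=1)


encoded_train = encode(df)
encoded_new = encode(new)  # same columns, same order as encoded_train
```

One caveat: when a column has exactly two distinct values, LabelBinarizer returns a single 0/1 column rather than two, so the column naming in this sketch would need adjusting for that case.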
Answer 1 (score: 1)
You can use the get_dummies function that pandas provides to convert the categorical values.
Like this:
import pandas as pd
data = {
'Reference': [1, 2, 3, 4, 5],
'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
'Mileage': [35000, 45000, 121000, 35000, 181000],
'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)
train = pd.concat([df.get(['Mileage','Reference','Year']),
pd.get_dummies(df['Brand'], prefix='Brand'),
pd.get_dummies(df['Town'], prefix='Town')],axis=1)
For the test data, you can do:
new_data = {
'Reference': [6, 7],
'Brand': ['Volvo', 'Audi'],
'Town': ['Stockholm', 'Munich']
}
test = pd.DataFrame(new_data)
test = pd.concat([test.get(['Reference']),
pd.get_dummies(test['Brand'], prefix='Brand'),
pd.get_dummies(test['Town'], prefix='Town')],axis=1)
# Find the training columns that are missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set, filled with 0
# (this also adds 'Mileage' and 'Year', which new_data does not contain)
for c in missing_cols:
    test[c] = 0
# Put the test columns in the same order as in the training set
test = test[train.columns]
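The add-missing-columns-then-reorder step can probably be written more compactly with DataFrame.reindex. This is a sketch of that alternative rather than the answer's own code: the names `train_X` and `test_X` are made up here, and it assumes only 'Reference' plus the encoded columns are used as features, since new_data has no 'Mileage' or 'Year'.

```python
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})
test = pd.DataFrame({
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})

cat_cols = ['Brand', 'Town']

# Encode the training features; the target 'Year' is kept separate
train_X = pd.get_dummies(df[['Reference'] + cat_cols], columns=cat_cols)
train_y = df['Year']

# Encode the test data, then align it to the training columns:
# columns missing from the test set are created and filled with 0,
# columns unseen during training are dropped, and the order is matched.
test_X = pd.get_dummies(test, columns=cat_cols).reindex(
    columns=train_X.columns, fill_value=0
)
```

Because reindex drops any column that is not in `train_X`, a category that appears only in the new data is silently ignored instead of raising an error.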