One-hot encoding: missing columns

Time: 2017-07-30 14:15:42

Tags: python scikit-learn one-hot-encoding

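I have a training set of 1,000,000 records and a test set of 100 records. To build a recommender system, I created two data frames organized like this: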

[in]print(training_df.head(n=5))

[out]                     product_id
transaction_id                      
0000001                   [P06, P09]
0000002         [P01, P05, P06, P09]
0000003                   [P01, P06]
0000004                   [P01, P09]
0000005                   [P06, P09]

I then use sklearn to create a matrix with the product_id values as columns and transaction_id as rows (the index).

Here is the code:

# Create a matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# pop() removes 'product_id' from training_df and returns it for encoding
training_df1 = training_df.join(pd.DataFrame(mlb.fit_transform(training_df.pop('product_id')),
                                             columns=mlb.classes_,
                                             index=training_df.index))

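The product_id values range from P01 to P10. The problem is that P04 and P08 never appear in the training data, so training_df1 has only 8 columns instead of 10. How can I add the two missing columns and fill them with 0 for every transaction?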

1 answer:

Answer 0: (score: 2)

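When initializing MultiLabelBinarizer, you can pass the predefined product IDs P01-P10 as classes, so the output always includes all of those categories as columns, even the ones that never occur in the data: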

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Build the full list of known product IDs, P01 through P10
product_ids = ['P{:02d}'.format(i+1) for i in range(10)]
print(product_ids)
# ['P01', 'P02', 'P03', 'P04', 'P05', 'P06', 'P07', 'P08', 'P09', 'P10']

# Fixing the classes up front guarantees one column per product ID
mlb = MultiLabelBinarizer(classes=product_ids)
training_df.join(pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                              columns=mlb.classes_,
                              index=training_df.index))
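
For readers who want to try this without the original data, here is a minimal self-contained sketch. The five sample transactions are copied from the head() output in the question; the variable name `encoded` is just an illustrative choice, not something from the original post.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data mirroring the first five rows shown in the question
training_df = pd.DataFrame(
    {"product_id": [["P06", "P09"],
                    ["P01", "P05", "P06", "P09"],
                    ["P01", "P06"],
                    ["P01", "P09"],
                    ["P06", "P09"]]},
    index=pd.Index(["0000001", "0000002", "0000003", "0000004", "0000005"],
                   name="transaction_id"),
)

# Full list of known product IDs, including ones absent from the data
product_ids = ["P{:02d}".format(i) for i in range(1, 11)]

mlb = MultiLabelBinarizer(classes=product_ids)
encoded = pd.DataFrame(mlb.fit_transform(training_df["product_id"]),
                       columns=mlb.classes_,
                       index=training_df.index)

# P04 and P08 are present as columns and contain only zeros
print(encoded[["P04", "P08"]])
```

The resulting frame has ten columns in the order given by `product_ids`, regardless of which products actually occur in the transactions.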


To get back only the matrix, without the original product_id column:

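# Use the columns= keyword; the positional axis argument (drop('product_id', 1))
# is no longer accepted by recent pandas versions
training_df.drop(columns='product_id').join(
    pd.DataFrame(mlb.fit_transform(training_df['product_id']), columns=mlb.classes_, index=training_df.index)
)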
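
Since the question also mentions a separate test set, the same binarizer can be reused there so that both matrices share an identical column layout. The following is a rough sketch under assumptions: the question does not show the test frame, so `test_df` and its two transactions are made up for illustration, as are the names `train_matrix` and `test_matrix`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ["P{:02d}".format(i) for i in range(1, 11)]

# Hypothetical training and test frames with the same structure as in the question
training_df = pd.DataFrame(
    {"product_id": [["P06", "P09"], ["P01", "P05", "P06", "P09"]]},
    index=pd.Index(["0000001", "0000002"], name="transaction_id"),
)
test_df = pd.DataFrame(
    {"product_id": [["P04", "P08"], ["P01"]]},
    index=pd.Index(["9000001", "9000002"], name="transaction_id"),
)

# Fit once with the fixed class list, then reuse the fitted object
mlb = MultiLabelBinarizer(classes=product_ids)
train_matrix = pd.DataFrame(mlb.fit_transform(training_df["product_id"]),
                            columns=mlb.classes_, index=training_df.index)
test_matrix = pd.DataFrame(mlb.transform(test_df["product_id"]),
                           columns=mlb.classes_, index=test_df.index)

# Both frames now have the same ten columns in the same order
print(list(train_matrix.columns) == list(test_matrix.columns))
```

Because the classes are fixed at construction time, products that appear only in the test set (here P04 and P08) still land in their own columns.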
