Question

我正在处理包含流派作为特征的电影数据集。数据集中的示例可能同时属于多个流派。因此，它们包含一个类型标签列表。

数据看起来像这样-

    movieId                                         genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]

我想向量化此功能。我已经尝试过 LabelEncoder 和 OneHotEncoder ，但是它们似乎无法直接处理这些列表。

我可以手动将其矢量化，但是我还有其他类似的功能，其中包含太多的类别。对于那些我更喜欢直接使用 FeatureHasher 类的方法。

是否有某种方法可以使这些编码器类在这种功能上工作？还是有更好的方法来表示这样的功能，从而使编码更容易？我很欢迎任何建议。

Answer 1

This SO question有一些令人印象深刻的答案。在您的示例数据上，Teoretic的最后答案（使用sklearn.preprocessing.MultiLabelBinarizer比Paulo Alves的解决方案快14倍（并且两者都比公认的答案快！）：

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)

# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0

每个示例使用多个类别对分类特征进行编码-sklearn

1 个答案: