Question

I am trying to encode a data frame like the following:

A    B          C
2    'Hello'    ['we', 'are', 'good']
1    'All'      ['hello', 'world']

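As you can see, I can label-encode the string values in the second column, but I can't figure out how to encode the third column, which contains lists of strings whose lengths differ. Even if I one-hot encode it, I get an array, and I don't know how to merge that array with the encoded values of the other columns. Please suggest a good technique.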

Answer 1

Assume we have the following DF:

In [31]: df
Out[31]:
   A      B                C
0  2  Hello  [we, are, good]
1  1    All   [hello, world]

Let's use sklearn.feature_extraction.text.CountVectorizer:

In [32]: from sklearn.feature_extraction.text import CountVectorizer

In [33]: vect = CountVectorizer()

In [34]: X = vect.fit_transform(df.C.str.join(' '))

In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2

In [36]: df
Out[36]:
   A      B                C  are  good  hello  we  world
0  2  Hello  [we, are, good]    1     1      0   1      0
1  1    All   [hello, world]    0     0      1   0      1

Alternatively, you can use sklearn.preprocessing.MultiLabelBinarizer, as @VivekKumar suggested in this comment:

In [56]: from sklearn.preprocessing import MultiLabelBinarizer

In [57]: mlb = MultiLabelBinarizer()

In [58]: X = mlb.fit_transform(df.C)

In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))

In [60]: df
Out[60]:
   A      B                C  are  good  hello  we  world
0  2  Hello  [we, are, good]    1     1      0   1      0
1  1    All   [hello, world]    0     0      1   0      1
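
The question also asks how to combine the encoded list column with the other encoded columns. One possible way, sketched below, is to label-encode B with LabelEncoder (the question mentions label encoding, but this particular combination is an assumption and not part of the original answer) and then concatenate everything column-wise. The names b_codes, c_flags, and encoded are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Toy frame mirroring the example above.
df = pd.DataFrame({
    "A": [2, 1],
    "B": ["Hello", "All"],
    "C": [["we", "are", "good"], ["hello", "world"]],
})

# Integer-code the plain string column B (classes are sorted, so "All" -> 0, "Hello" -> 1).
b_codes = pd.Series(LabelEncoder().fit_transform(df["B"]), index=df.index, name="B")

# One 0/1 indicator column per distinct item found in the lists of column C.
mlb = MultiLabelBinarizer()
c_flags = pd.DataFrame(mlb.fit_transform(df["C"]), columns=mlb.classes_, index=df.index)

# Place the numeric column, the coded column, and the indicators side by side.
encoded = pd.concat([df[["A"]], b_codes, c_flags], axis=1)
print(encoded)
```

Because every piece shares the index of df, pd.concat with axis=1 lines the rows up without any extra merging logic. Whether integer codes are appropriate for B depends on the downstream model; a one-hot encoding of B may be a better fit when the categories have no natural order.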

scikit-learn: one-hot encoding of list values
