I have a list of sentences, and I want to one-hot encode each complete sentence over the words it contains, so that each sentence becomes one row.
For example,
sentences = [
"python, java",
"linux, windows, ubuntu",
"java, linux, ubuntu, windows",
"performance, python, mac"
]
I want output like this:
java linux mac performance python ubuntu windows
0 1 0 0 0 1 0 0
1 0 1 0 0 0 1 1
2 1 1 0 0 0 1 1
3 0 0 1 1 1 0 0
My attempt:
I tried converting the sentences to a Series and then calling get_dummies, but that gives me one row per word rather than one row per sentence.
print(pd.get_dummies(pd.Series(sum([tag.split(', ') for tag in sentences], []))))
Output:
java linux mac performance python ubuntu windows
0 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 0 0 1
4 0 0 0 0 0 1 0
5 1 0 0 0 0 0 0
6 0 1 0 0 0 0 0
7 0 0 0 0 0 1 0
8 0 0 0 0 0 0 1
9 0 0 0 1 0 0 0
10 0 0 0 0 1 0 0
11 0 0 1 0 0 0 0
How can I solve this?
Answer 0 (score: 4)
Use MultiLabelBinarizer with a list comprehension that splits each sentence:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform([x.split(', ') for x in sentences]),
                  columns=mlb.classes_)
print(df)
java linux mac performance python ubuntu windows
0 1 0 0 0 1 0 0
1 0 1 0 0 0 1 1
2 1 1 0 0 0 1 1
3 0 0 1 1 1 0 0
Another solution uses Series.str.get_dummies:
print(pd.Series(sentences).str.get_dummies(', '))
java linux mac performance python ubuntu windows
0 1 0 0 0 1 0 0
1 0 1 0 0 0 1 1
2 1 1 0 0 0 1 1
3 0 0 1 1 1 0 0
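As an aside, the original get_dummies attempt can also be repaired directly: instead of flattening all words into one long Series, keep the original sentence index, explode to one word per row, one-hot encode, and collapse back with a groupby. This is a sketch (not part of the original answer) and assumes pandas 0.25+ for Series.explode:

```python
import pandas as pd

sentences = [
    "python, java",
    "linux, windows, ubuntu",
    "java, linux, ubuntu, windows",
    "performance, python, mac",
]

# explode() keeps the original index, so each word row remembers which
# sentence it came from; groupby(level=0).max() merges the per-word
# dummy rows back into one row per sentence.
s = pd.Series(sentences).str.split(', ').explode()
df = pd.get_dummies(s).groupby(level=0).max().astype(int)
print(df)
```

The .astype(int) at the end keeps the output as 0/1 integers on newer pandas versions, where get_dummies returns booleans by default.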
The performance differs:
sentences = sentences * 1000
In [166]: %%timeit
...: mlb = MultiLabelBinarizer()
...: df = pd.DataFrame(mlb.fit_transform([x.split(', ') for x in sentences]),columns=mlb.classes_)
...:
8.06 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [167]: %%timeit
...: pd.Series(sentences).str.get_dummies(', ')
...:
105 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
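For much larger inputs, MultiLabelBinarizer can also produce a sparse result, which avoids materializing a dense 0/1 matrix when the vocabulary is large. A sketch (an extra option, not benchmarked above), using sklearn's sparse_output flag and pandas' sparse accessor:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "python, java",
    "linux, windows, ubuntu",
    "java, linux, ubuntu, windows",
    "performance, python, mac",
]

# sparse_output=True makes fit_transform return a SciPy sparse matrix;
# DataFrame.sparse.from_spmatrix wraps it without densifying.
mlb = MultiLabelBinarizer(sparse_output=True)
mat = mlb.fit_transform([x.split(', ') for x in sentences])
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=mlb.classes_)
print(df)
```

The resulting DataFrame behaves like the dense one for lookups, but stores only the nonzero entries.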