This is my data:
No  Text
1   You are smart
2   You are beautiful
My expected output:
No  Text               You  are  smart  beautiful
1   You are smart      1    1    1      0
2   You are beautiful  1    1    0      1
Answer 0 (score: 3)
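For an nltk-based solution, use word_tokenize to get a list of words for each row, then MultiLabelBinarizer to build the indicator columns, and finally join to add them back to the original DataFrame: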
import pandas as pd
from nltk import word_tokenize
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Tokenize each row's Text into a list of words
s = df['Text'].apply(word_tokenize)
# One 0/1 column per distinct word, aligned on the original index
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)
   No  Text               You  are  beautiful  smart
0  1   You are smart      1    1    0          1
1  2   You are beautiful  1    1    1          0
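
Note that the word columns come out sorted (capitals first) because MultiLabelBinarizer sorts its classes, so the order differs from the expected output. A minimal sketch of one way to keep first-appearance order is to pass an explicit `classes` list. The sample DataFrame is rebuilt here from the question, and plain whitespace splitting is an assumption standing in for word_tokenize so that the example runs without nltk data; the names `tokens`, `vocab`, `indicators`, and `result` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data rebuilt from the question
df = pd.DataFrame({"No": [1, 2], "Text": ["You are smart", "You are beautiful"]})

# Assumption: whitespace splitting stands in for nltk's word_tokenize
tokens = df["Text"].str.split()

# Vocabulary in order of first appearance (dict keys preserve insertion order)
vocab = list(dict.fromkeys(word for row in tokens for word in row))

# Passing classes= fixes the column order instead of the default sorted order
mlb = MultiLabelBinarizer(classes=vocab)
indicators = pd.DataFrame(
    mlb.fit_transform(tokens), columns=mlb.classes_, index=df.index
)
result = df.join(indicators)
print(result)
```

With real text that contains punctuation, word_tokenize and whitespace splitting will produce different tokens, so swap the tokenizer back in if that matters.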
For a pure pandas solution, use str.get_dummies + join:
df = df.join(df['Text'].str.get_dummies(sep=' '))
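
As a self-contained sketch of the pandas-only approach, the snippet below rebuilds the sample data from the question and then reorders the dummy columns by first appearance so they match the expected output. The reordering step and the names `dummies`, `order`, and `result` are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"No": [1, 2], "Text": ["You are smart", "You are beautiful"]})

# One 0/1 column per space-separated word in Text
dummies = df["Text"].str.get_dummies(sep=" ")

# Reorder the columns by first appearance across all rows
order = list(dict.fromkeys(" ".join(df["Text"]).split()))
result = df.join(dummies[order])
print(result)
```

This splits on single spaces only, so punctuation stays attached to words; the nltk variant is the better fit when proper tokenization is needed.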