How to tokenize words in a pandas DataFrame

Time: 2018-02-28 14:45:30

Tags: pandas scikit-learn nltk tokenize

This is my data:

No  Text                    
1   You are smart
2   You are beautiful

My expected output:

No  Text                   You    are  smart  beautiful                 
1   You are smart            1      1      1          0
2   You are beautiful        1      1      0          1

1 answer:

Answer 0 (score: 3)

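For an nltk solution, use word_tokenize to get a list of words for each row, then MultiLabelBinarizer to turn those lists into 0/1 indicator columns, and finally join to attach them to the original DataFrame: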

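import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize  # requires NLTK's Punkt tokenizer data to be downloaded

mlb = MultiLabelBinarizer()
# Tokenize each row's text into a list of words
s = df['Text'].apply(word_tokenize)
# One 0/1 column per distinct token, joined back onto the original frame
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)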
   No               Text  You  are  beautiful  smart
0   1      You are smart    1    1          0      1
1   2  You are beautiful    1    1          1      0
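
As a self-contained illustration of the same idea, here is a minimal sketch that rebuilds the question's sample data and runs the binarizer end to end. It assumes plain whitespace splitting (`str.split`) in place of `word_tokenize`, so it runs without NLTK's tokenizer data; for text with punctuation the two would tokenize differently. The names `tokens`, `indicators` and `result` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data copied from the question.
df = pd.DataFrame({"No": [1, 2], "Text": ["You are smart", "You are beautiful"]})

# Assumption: whitespace splitting stands in for nltk's word_tokenize here.
tokens = df["Text"].str.split()

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix with one column per distinct token;
# mlb.classes_ holds the tokens in sorted order.
indicators = pd.DataFrame(
    mlb.fit_transform(tokens), columns=mlb.classes_, index=df.index
)

# Attach the indicator columns to the original frame.
result = df.join(indicators)
print(result)
```

The indicator columns come out in sorted order (uppercase before lowercase), which is why the output above lists `beautiful` before `smart` rather than following the order in the question.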

For a pure pandas solution, use str.get_dummies + join:

df = df.join(df['Text'].str.get_dummies(sep=' '))
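
The one-liner above can be tried on the question's sample data with the sketch below. The DataFrame construction and the name `result` are assumptions added for illustration; only the `str.get_dummies(sep=' ')` and `join` calls come from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"No": [1, 2], "Text": ["You are smart", "You are beautiful"]})

# Split each string on single spaces and build one 0/1 column per distinct word.
dummies = df["Text"].str.get_dummies(sep=" ")

# Attach the indicator columns to the original frame.
result = df.join(dummies)
print(result)
```

Because the split is on a literal space, this variant treats punctuation as part of a word (for example `smart.` and `smart` would be separate columns), so it fits best when the text is already clean.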