How to merge CountVectorizer features with the original data

Time: 2018-08-02 03:58:13

Tags: python pandas dataframe scikit-learn countvectorizer

Here is my dataset, the DataFrame a:

        body                                            customer_id   name
14828   Thank you to apply to us.                       5458          Sender A
23117   Congratulation your application is accepted.    5136          Sender B
23125   Your OTP will expire in 10 minutes.             5136          Sender A

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
b = a['body']
vect = CountVectorizer()
vect.fit(b)
X_vect=vect.transform(b)
pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names())

The output is:

    10  application apply ... your
0   0   0           1         0
1   0   1           0         1
2   1   0           0         1

What I need is:

        body                                            customer_id   name       10  application apply ... your
14828   Thank you to apply to us.                       5458          Sender A   0   0           1         0
23117   Congratulation your application is accepted.    5136          Sender B   0   1           0         1
23125   Your OTP will expire in 10 minutes.             5136          Sender A   1   0           0         1

How should I do this? I would still like to use CountVectorizer so that I can modify the function in the future.

1 answer:

Answer 0: (score: 2)

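You can pass index to the DataFrame constructor so that the count rows carry the same labels as a (14828, 23117, 23125) instead of 0, 1, 2, and then join the result to the original df. join aligns on the index and uses a left join by default: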

b = pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names(), index= a.index)
a = a.join(b)
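
To see why the index argument matters, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and compares a join with and without it. The names misaligned, aligned, counts and get_names are made up for this illustration. It also assumes you may be on a newer scikit-learn: get_feature_names() was removed in version 1.2 in favour of get_feature_names_out(), so the sketch picks whichever one exists.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample frame rebuilt from the question; note the index is not 0..n-1.
a = pd.DataFrame(
    {
        "body": [
            "Thank you to apply to us.",
            "Congratulation your application is accepted.",
            "Your OTP will expire in 10 minutes.",
        ],
        "customer_id": [5458, 5136, 5136],
        "name": ["Sender A", "Sender B", "Sender A"],
    },
    index=[14828, 23117, 23125],
)

vect = CountVectorizer()
X_vect = vect.fit_transform(a["body"])

# Use get_feature_names_out() when available (scikit-learn >= 1.0),
# otherwise fall back to the older get_feature_names().
get_names = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
columns = list(get_names())

# Without index=, the counts get a fresh 0..2 index that matches no row of a,
# so the left join fills every count column with NaN.
misaligned = a.join(pd.DataFrame(X_vect.toarray(), columns=columns))

# With index=a.index, each count row lines up with the row it came from.
counts = pd.DataFrame(X_vect.toarray(), columns=columns, index=a.index)
aligned = a.join(counts)

print(misaligned["apply"].isna().all())
print(aligned[["body", "apply", "your"]])
```

One thing to watch for with real data: if the vocabulary happens to contain a token equal to an existing column name (for example "name"), join will complain about overlapping columns unless you pass lsuffix or rsuffix.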

Or use merge, which needs more parameters: it does not match on the index unless told to, and its default is an inner join:

a = a.merge(b, left_index=True, right_index=True, how='left')
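
The difference between the two calls can be checked with a small pandas-only sketch. The counts frame b here is written by hand for two tokens and only stands in for the CountVectorizer output; partial, joined, merged, left and inner are names invented for the example.

```python
import pandas as pd

a = pd.DataFrame(
    {
        "body": [
            "Thank you to apply to us.",
            "Congratulation your application is accepted.",
            "Your OTP will expire in 10 minutes.",
        ],
        "customer_id": [5458, 5136, 5136],
    },
    index=[14828, 23117, 23125],
)

# Hand-written counts for two tokens (an assumption for illustration only).
b = pd.DataFrame({"apply": [1, 0, 0], "your": [0, 1, 1]}, index=a.index)

# join and the fully spelled-out merge give the same frame.
joined = a.join(b)
merged = a.merge(b, left_index=True, right_index=True, how="left")
same = joined.equals(merged)

# A bare merge looks for shared column names; there are none here, so it fails.
try:
    a.merge(b)
    merge_needs_keys = False
except pd.errors.MergeError:
    merge_needs_keys = True

# If some rows have no counts, how='left' keeps them (filled with NaN),
# while the default inner join drops them.
partial = b.iloc[:2]
left = a.merge(partial, left_index=True, right_index=True, how="left")
inner = a.merge(partial, left_index=True, right_index=True)

print(same, merge_needs_keys, len(left), len(inner))
```

When the counts frame is built with index=a.index, every row has a match, so inner and left give the same result. how='left' only makes a difference if rows of a might be missing from b.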