How to compute cosine similarity between two DataFrames using `sklearn`

Time: 2018-07-02 13:13:45

Tags: python pandas text scikit-learn cosine-similarity

To give you some context, here is the code I used previously:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect the text column into a plain list of documents
sms = df['text'].values.tolist()

# Build the TF-IDF matrix (rows: documents, columns: vocabulary terms)
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(sms)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()  # get_feature_names() on scikit-learn older than 1.0
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

# Pairwise cosine similarity between every pair of documents
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df.columns = similarity_df.columns.map(str)
similarity_df

The output is:

           0           1           2           3    
0   1.000000    0.000000    0.038781    0.108865    
1   0.000000    1.000000    0.018147    0.000000    
2   0.038781    0.018147    1.000000    0.038326    
3   0.108865    0.000000    0.038326    1.000000

I want to switch to comparing two DataFrames:

id  text
0   "Daei rumah Indri jam berpa?Nyasar gak de,hhehhee\nSkrang sama sapa k'bogor?Orang rumah apa temen SMA "
1   'Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??'
2  'Biarin .. Km ga kenal cowonya mas.. Hehe \nKm di cikeas dari jam brp?? Kok ga sms .. Knp baru sms. Wkwkw'
3  'Wkkwkkwkk.....Asem di tanya bilang kepo\nIya sapa Ade sayang,\nMasih di cekias de sama ibu,'

and the second DataFrame:

df2

Id  text
A   udh smpe kantor
B   ga kenal cowonya mas

What should I do?
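One possible direction, offered as a minimal sketch and not a confirmed answer: `cosine_similarity` accepts two matrices, so each row of the first DataFrame can be compared against each row of `df2`, provided both are vectorized with the same fitted `TfidfVectorizer`. The name `df1`, the choice to fit on the combined text of both frames, and the use of the `A`/`B` ids as the index of `df2` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the question; "df1" is a hypothetical name
# for the first DataFrame.
df1 = pd.DataFrame({
    "text": [
        "Daei rumah Indri jam berpa?Nyasar gak de,hhehhee\n"
        "Skrang sama sapa k'bogor?Orang rumah apa temen SMA ",
        "Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??",
        "Biarin .. Km ga kenal cowonya mas.. Hehe \n"
        "Km di cikeas dari jam brp?? Kok ga sms .. Knp baru sms. Wkwkw",
        "Wkkwkkwkk.....Asem di tanya bilang kepo\n"
        "Iya sapa Ade sayang,\nMasih di cekias de sama ibu,",
    ]
})
df2 = pd.DataFrame(
    {"text": ["udh smpe kantor", "ga kenal cowonya mas"]},
    index=["A", "B"],  # assumption: the Id column is used as the index
)

# Fit one vectorizer on the text of both frames so that the two
# TF-IDF matrices share the same vocabulary (same columns).
tv = TfidfVectorizer(use_idf=True)
tv.fit(pd.concat([df1["text"], df2["text"]]))

m1 = tv.transform(df1["text"])  # shape: (rows of df1, vocabulary size)
m2 = tv.transform(df2["text"])  # shape: (rows of df2, vocabulary size)

# Entry (i, j) is the cosine similarity between row i of df1 and row j of df2.
similarity_df = pd.DataFrame(
    cosine_similarity(m1, m2),
    index=df1.index,
    columns=df2.index,
)
print(similarity_df.round(3))
```

The result has one row per row of `df1` and one column per row of `df2`. Fitting only on `df1` and then calling `transform` on `df2` would also work; in that case, words that appear only in `df2` are ignored.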

0 answers:

No answers yet