To give you some context, here is the code I used before:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
sms = df['text'].values.tolist()  # collect the messages into a list named sms, which fit_transform uses
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(sms)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df.columns = similarity_df.columns.map(str)
similarity_df
The output is:
0 1 2 3
0 1.000000 0.000000 0.038781 0.108865
1 0.000000 1.000000 0.018147 0.000000
2 0.038781 0.018147 1.000000 0.038326
3 0.108865 0.000000 0.038326 1.000000
Now I want to do the same comparison across two data frames. The first one is:
id text
0 "Daei rumah Indri jam berpa?Nyasar gak de,hhehhee\nSkrang sama sapa k'bogor?Orang rumah apa temen SMA "
1 'Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??'
2 'Biarin .. Km ga kenal cowonya mas.. Hehe \nKm di cikeas dari jam brp?? Kok ga sms .. Knp baru sms. Wkwkw'
3 'Wkkwkkwkk.....Asem di tanya bilang kepo\nIya sapa Ade sayang,\nMasih di cekias de sama ibu,'
and the second data frame is:
df2
Id text
A udh smpe kantor
B ga kenal cowonya mas
How should I go about this?
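
One possible way to do this, as a rough sketch rather than a definitive answer: fit the TfidfVectorizer on the first data frame only, reuse that fitted vocabulary to transform the second data frame, and then pass both matrices to cosine_similarity. The data frames below are rebuilt from the samples shown above, and the column names (id, Id, text) are assumed to match the real data. The vectorizer is left at its default min_df and max_df here, which is an assumption on my part.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the question; replace with the real data frames.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "text": [
        "Daei rumah Indri jam berpa?Nyasar gak de,hhehhee\nSkrang sama sapa k'bogor?Orang rumah apa temen SMA ",
        "Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??",
        "Biarin .. Km ga kenal cowonya mas.. Hehe \nKm di cikeas dari jam brp?? Kok ga sms .. Knp baru sms. Wkwkw",
        "Wkkwkkwkk.....Asem di tanya bilang kepo\nIya sapa Ade sayang,\nMasih di cekias de sama ibu,",
    ],
})
df2 = pd.DataFrame({
    "Id": ["A", "B"],
    "text": ["udh smpe kantor", "ga kenal cowonya mas"],
})

# Learn the vocabulary and IDF weights from the first data frame only.
tv = TfidfVectorizer(use_idf=True)
tv_matrix = tv.fit_transform(df["text"])

# Project the second data frame into the same feature space.
# transform (not fit_transform) keeps the columns aligned with tv_matrix.
tv_matrix2 = tv.transform(df2["text"])

# Rows are df2 entries, columns are df entries.
similarity_matrix = cosine_similarity(tv_matrix2, tv_matrix)
similarity_df = pd.DataFrame(
    similarity_matrix,
    index=df2["Id"].tolist(),
    columns=df["id"].tolist(),
)
print(similarity_df.round(3))
```

The result has one row per entry in df2 and one column per row of df, so similarity_df.loc["A"].idxmax() gives the id of the closest message for A. Words in df2 that never occur in df are ignored by transform, so a df2 phrase with no shared vocabulary will score 0 everywhere.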