Question

Starting from the code below:

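import math

# Each document is a dict mapping a feature (term) to its weight or count.
def dot(docA,docB):
    # Sparse dot product: only terms present in docA can contribute.
    the_sum=0
    for (key,value) in docA.items():
        the_sum+=value*docB.get(key,0)
    return the_sum

def cos_sim(docA,docB):
    # Cosine similarity: dot product divided by the product of the two norms.
    sim=dot(docA,docB)/(math.sqrt(dot(docA,docA)*dot(docB,docB)))
    return sim

def doc_freq(doclist):
    # Number of documents in which each feature occurs at least once.
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
    return df

def idf(doclist):
    # Inverse document frequency: log(N / df) for each feature.
    N=len(doclist)
    return {feat:math.log(N/v) for feat,v in doc_freq(doclist).items()}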


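# Note: doc_freq counts how many documents contain each term, so despite the
# "tf" names these are document frequencies, not per-document term frequencies.
tf_med=doc_freq(bow_collections["medline"])
tf_wsj=doc_freq(bow_collections["wsj"])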

idf_med=idf(bow_collections["medline"])
idf_wsj=idf(bow_collections["wsj"])

print(tf_med)
print(idf_med)

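So I finally managed to get this far, but I can't find any information on what I need to do next in Python. I'm sure the math is out there, but I don't want to spend hours trying to work out what it means. Just as a quick sanity check, this is what I get from tf_med: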

{'NUM': 37, 'early': 3, 'case': 3, 'organ': 1, 'transplantation': 1, 'section': 1, 
'healthy': 1, 'ovary': 1, 'fertile': 1, 'woman': 1, 'unintentionally': 1, 
'unknowingly': 1, 'subjected': 1, 'oophorectomy': 1, 'described': 4, .... , }

And this is what I get from idf_med:

{'NUM': 0.3011050927839216, 'early': 2.8134107167600364, 'case': 2.8134107167600364, 
'organ': 3.912023005428146, 'transplantation': 3.912023005428146, 'section': 
3.912023005428146, 'healthy': 3.912023005428146, 'ovary': 3.912023005428146, 'fertile': 
3.912023005428146, .... , }

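Now, though, I have no idea how to combine the two to get TF-IDF, and from that the average cosine similarity. I know they need to be multiplied, but how exactly do I do that?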

Answer 1

You can use scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
text1 ='eat big yellow bananas'
text2 ='eat big yellow potatos'
tfidf_vectorizer = TfidfVectorizer()
# One row per text, one column per vocabulary term, TF-IDF weighted.
tfidf_matrix = tfidf_vectorizer.fit_transform([text1,text2])
# Similarity of the first text to every text (itself included).
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
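
If you would rather stay with the dictionary-based functions from the question, here is a minimal sketch of the multiplication step. It assumes each document in the collection is a dict mapping a term to its count in that document (that per-document count is the TF; the tf_med shown in the question is a document frequency over the whole collection). The helpers tfidf and average_cos_sim and the toy docs list are made up for illustration, and "average cosine similarity" is taken here to mean the mean over all unordered pairs of documents, which may not be the definition you need.

```python
import math
from itertools import combinations

def dot(doc_a, doc_b):
    # Sparse dot product over the terms of doc_a.
    return sum(value * doc_b.get(key, 0) for key, value in doc_a.items())

def cos_sim(doc_a, doc_b):
    norm = math.sqrt(dot(doc_a, doc_a) * dot(doc_b, doc_b))
    # Guard against all-zero vectors (e.g. every term has idf 0).
    return dot(doc_a, doc_b) / norm if norm else 0.0

def idf(doclist):
    # log(N / df), same definition as in the question.
    n = len(doclist)
    df = {}
    for doc in doclist:
        for feat in doc:
            df[feat] = df.get(feat, 0) + 1
    return {feat: math.log(n / v) for feat, v in df.items()}

def tfidf(doc, idf_weights):
    # Hypothetical helper: weight each term count by that term's idf.
    return {feat: count * idf_weights.get(feat, 0.0)
            for feat, count in doc.items()}

def average_cos_sim(vectors):
    # Hypothetical helper: mean cosine similarity over all unordered pairs.
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cos_sim(a, b) for a, b in pairs) / len(pairs)

# Toy collection (made up): each document is a term -> count dict.
docs = [
    {"organ": 2, "case": 1},
    {"organ": 1, "ovary": 3},
    {"case": 2, "early": 1},
]

idf_weights = idf(docs)
tfidf_docs = [tfidf(doc, idf_weights) for doc in docs]
avg = average_cos_sim(tfidf_docs)
print(avg)
```

With the data from the question, docs would be bow_collections["medline"], provided that collection really is a list of such dicts.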

I have computed TF and IDF, but how do I get TF-IDF?
