How to calibrate tf-idf in a different way

Time: 2018-07-01 04:08:18

Tags: python nlp tf-idf

I have two types of documents: one set is labeled and the other is not. I want to use the labeled documents to compute tf, and use the other, unlabeled documents to compute idf.
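To state the intended score explicitly (this is what I am trying to compute): for a word w in a labeled document d,

    tfidf(w, d) = tf(w, d) * idf(w)
    idf(w)      = log(N_unlabeled / (1 + df_unlabeled(w)))

where tf(w, d) is the relative frequency of w within d, N_unlabeled is the number of unlabeled documents, and df_unlabeled(w) is the number of unlabeled documents that contain w. My attempt with gensim and some hand-written functions is below.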

from gensim import corpora, models, similarities

# Build the vocabulary from the unlabeled data: one comma-separated document per line.
dictionary = corpora.Dictionary(line.lower().split(',') for line in open('data/unannotated.csv', encoding='utf-8'))
# Drop tokens that appear in fewer than 5 documents or in more than 50% of them; keep at most 100000 tokens.
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

This unannotated file is the unlabeled data, with one document per line. From this data I get the following dictionary:

{'get': 353867,
 'http': 351618,
 'u': 324711,
 'one': 291526,
 'go': 279237,
 'know': 265001,
 'good': 249368,
 'say': 236003,
 'like': 225619,
 'qatar': 223010,
 'time': 191195,
 'would': 188766,
 'think': 187128,
 'people': 182873,
 'make': 180120,
 'need': 166287,
 'take': 157106,
...
}
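For reference, a word → document-frequency mapping like the one above can be read straight off the gensim Dictionary; a minimal sketch, assuming the counts shown are per-word document frequencies over the unlabeled corpus:

# Sketch: number of unlabeled documents containing each kept token.
doc_freqs = {token: dictionary.dfs[token_id]
             for token, token_id in dictionary.token2id.items()}
num_unlabeled_docs = dictionary.num_docs  # total number of unlabeled documents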

I use the following code to compute tf-idf, but it gives the same answer for every pair: 0.9999999999999946

import math
from collections import Counter

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def TFCalculator(question, word):
    # term frequency of `word` inside the token list `question`
    if word not in question:
        return 0
    count = dict(Counter(question))
    q_len = len(question)
    return float(count[word]) / float(q_len)

def n_containing(unannotated, word):
    # document frequency of `word`, looked up in the dictionary shown above
    return float(unannotated.get(word, 0))

def IDFCalculator(unannotated, word):
    # idf: log of the vocabulary size divided by (1 + document frequency)
    return math.log(float(len(unannotated.keys())) / (1.0 + n_containing(unannotated, word)))

def tfidf(stem, unannotated):
    # build one tf-idf vector for the token list `stem`
    tfidfVector_ques = range(len(unannotated))
    for word in stem:
        tf = TFCalculator(stem, word)
        idf = IDFCalculator(unannotated, word)
        tfidf = tf * idf
        try:
            tfidfVector_ques[list(unannotated).index(word)] = tfidf
        except:
            pass
    return sparse.csr_matrix(tfidfVector_ques)

def cosine_similarities(question, comment, unannotated):
    X_tfidf = tfidf(question, unannotated)
    Y_tfidf = tfidf(comment, unannotated)
    similarities = cosine_similarity(X_tfidf, Y_tfidf)
    return similarities
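
I suspect the constant 0.9999999999999946 comes from tfidfVector_ques = range(len(unannotated)): the vector starts out as 0, 1, 2, ..., so any two vectors share that huge common baseline and their cosine similarity is always about 1. Below is a sketch of what I think the computation should look like, assuming zero-initialised vectors and an idf based on the number of unlabeled documents (dictionary.num_docs) rather than on the vocabulary size; is this the right way to combine tf from the labeled text with idf from the unlabeled corpus?

import math
from collections import Counter

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_vector(tokens, doc_freqs, num_unlabeled_docs, vocab_index):
    # tf comes from the labeled token list, idf from the unlabeled corpus counts.
    vec = np.zeros(len(vocab_index))              # start from all zeros, not range(...)
    counts = Counter(tokens)
    for word, count in counts.items():
        if word not in vocab_index:
            continue                              # skip words outside the unlabeled vocabulary
        tf = count / len(tokens)                  # relative frequency within this document
        idf = math.log(num_unlabeled_docs / (1.0 + doc_freqs[word]))
        vec[vocab_index[word]] = tf * idf
    return sparse.csr_matrix(vec)

def cosine_sim(question_tokens, comment_tokens, doc_freqs, num_unlabeled_docs):
    # fixed word -> column mapping so both vectors line up
    vocab_index = {word: i for i, word in enumerate(doc_freqs)}
    x = tfidf_vector(question_tokens, doc_freqs, num_unlabeled_docs, vocab_index)
    y = tfidf_vector(comment_tokens, doc_freqs, num_unlabeled_docs, vocab_index)
    return cosine_similarity(x, y)[0, 0]

Here doc_freqs would be the word → document-frequency dictionary shown above and num_unlabeled_docs would be dictionary.num_docs; tfidf_vector, cosine_sim and vocab_index are just names I made up for this sketch.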

0 Answers:

No answers yet