Computing the cosine similarity between all texts in a corpus

Asked: 2016-04-27 10:54:23

Tags: python tf-idf corpus cosine-similarity

I have a set of documents stored in a JSON file, one JSON object per line. I retrieve them with the following code so that they end up in the variable data:

import json

# each line of the file is a separate JSON object
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
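
This assumes every line of SDM_2015.json is a standalone JSON object with at least a body field; a hypothetical line (the id and title fields are made up for illustration) might look like:

{"id": 1, "title": "Some title", "body": "Full text of the document ..."}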

Gathering all the texts into a corpus is done as follows:

corpus = []
# each entry pairs a document's body with the next document's body
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
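
Note that each corpus entry concatenates one document's body with the next, so corpus ends up with len(data) - 1 entries. If the intent is simply one entry per document, a simpler sketch would be:

corpus = [doc['body'] for doc in data]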

Fairly straightforward manipulation up to this point. To build the tf-idf matrix, I use the following lines of code to remove stop words and punctuation, stem each term, and tokenize the data.

import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()

# translation map that strips punctuation characters
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First, a function that stems each token
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Builds on the first function: lower-cases the text, strips punctuation (using the map above), tokenizes, and stems
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## Lastly, the vectorizer ties the steps above together and adds stop-word removal

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
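
To sanity-check the tokenizer in isolation (a quick sketch; it assumes NLTK's punkt models have been installed via nltk.download('punkt')):

print(normalize("Testing the stemmed tokens, quickly!"))
# expected output: ['test', 'the', 'stem', 'token', 'quickli']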

I then try to apply this to the corpus, for example:

tfidf = vectorizer.fit_transform(corpus)

# cosine similarity between the first two corpus entries
print(((tfidf * tfidf.T).A)[0, 1])
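
Since TfidfVectorizer L2-normalizes each row by default, tfidf * tfidf.T is the matrix of pairwise cosine similarities. An equivalent, more explicit sketch using scikit-learn's pairwise helper:

from sklearn.metrics.pairwise import cosine_similarity

# dense matrix of pairwise cosine similarities between all corpus entries
sim = cosine_similarity(tfidf)
print(sim[0, 1])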

But nothing is printed when I run it. Any ideas on how to proceed?

Kind regards

0 Answers:

No answers.