I have a set of documents stored in a JSON file. Along those lines, I retrieve them with the following code so that they end up stored in the variable data:
import json
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
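For reference, this assumes the file is in JSON Lines format: one JSON object per line, each containing (among possibly other fields) a 'body' key, since that is the field accessed later. A minimal, hypothetical example of parsing one such line:

```python
import json

# A made-up line of the kind the file is assumed to contain:
# one JSON object per line, each with a 'body' field.
line = '{"id": 1, "body": "Stochastic gradient descent converges quickly."}'

record = json.loads(line)
print(record['body'])  # the text that later ends up in the corpus
```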
Merging all the texts into a single corpus is done as follows:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
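As a toy check of what that loop produces (note that it concatenates each document's body with the next one's, so n documents yield n − 1 corpus entries):

```python
# Made-up stand-in for the parsed JSON records
data = [{'body': 'first '}, {'body': 'second '}, {'body': 'third '}]

corpus = []
for i in range(len(data) - 1):
    # Each entry pairs a document's body with the following one's
    corpus.append(data[i]['body'] + data[i + 1]['body'])

print(corpus)  # ['first second ', 'second third ']
```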
Fairly straightforward manipulation up to this point. To build the tf-idf matrix, I use the following lines of code to remove stop words and punctuation, stem each term, and tokenize the data.
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function, which creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Function that, incorporating the first one, converts all words to lower case and removes the punctuation mapped above
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, a vectorizer that combines all the previous steps plus stop-word removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
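As a quick sanity check on the punctuation handling alone (leaving NLTK's tokenizer and stemmer aside), the translation map built above can be exercised with the standard library only, on a made-up sentence:

```python
import string

# Same mapping as above: every punctuation character maps to None (i.e. is deleted)
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

text = "Hello, world! Isn't tf-idf neat?"
cleaned = text.lower().translate(remove_punctuation_map)
print(cleaned)  # "hello world isnt tfidf neat"
```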
I then try to apply this to the corpus, e.g.:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
but nothing happens. Any ideas on how to proceed?
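For context on what that last line is meant to compute: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the dot product of two rows of the tf-idf matrix is their cosine similarity. A pure-Python illustration of that identity, with made-up vectors standing in for the real tf-idf rows:

```python
import math

# Two made-up term-weight vectors (stand-ins for rows of the tf-idf matrix)
a = [3.0, 1.0, 0.0]
b = [1.0, 2.0, 2.0]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Dot product of the L2-normalized vectors
dot = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Cosine similarity of the original vectors, computed directly
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
)

assert abs(dot - cosine) < 1e-12
print(round(dot, 4))  # 0.527
```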
Kind regards