I am trying to build a tf-idf vectorizer from scratch. I have computed tf and idf, but I am having trouble computing tf-idf. Here is the code:
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
# Split each document in the corpus into words
document = []
for doc in corpus:
    document.append(doc.split())
# Count, for each word, how many documents contain it (document frequency)
word_freq = {}  # maps word -> set of indices of the documents containing it
for i in range(len(document)):
    tokens = document[i]
    for w in tokens:
        try:
            word_freq[w].add(i)  # word already seen: record this document index
        except KeyError:
            word_freq[w] = {i}  # first occurrence: start a new set
for val in word_freq:
    word_freq[val] = len(word_freq[val])  # number of documents that contain the word
# Calculating term frequency
def tf(document):
    tf_dict = {}
    for word in document:
        if word in tf_dict:
            tf_dict[word] += 1
        else:
            tf_dict[word] = 1
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / len(document)
    return tf_dict
tfDict = [tf(i) for i in document]
# Calculate inverse document frequency
def IDF():
    idfDict = {}
    for word in word_freq:
        idfDict[word] = 1 + math.log((1 + len(corpus)) / (1 + word_freq[word]))
    return idfDict
idfDict = IDF()
# Calculating TF-IDF
def TF_IDF():
    tfIdfDict = {}
    for i in tfDict:
        for j in i:
            tfIdfDict[j] = tfDict[i][j] * idfDict[j]

    return tfIdfDict

TF_IDF()
The problem is this line in the TF_IDF function:
tfIdfDict[j] = tfDict[i][j] * idfDict[j]
The error raised is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-7b7d174d2ce3> in <module>()
8 return tfIdfDict
9
---> 10 TF_IDF()
<ipython-input-38-7b7d174d2ce3> in TF_IDF()
4 for i in tfDict:
5 for j in i:
----> 6 tfIdfDict[j] = tfDict[i][j] * idfDict[j]
7
8 return tfIdfDict
TypeError: list indices must be integers or slices, not dict
I understand where the problem is and why it happens, but I cannot find a solution. Please help. Thanks!
Answer 0 (score: 0)
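tfDict is a list of dictionaries. When you write for i in tfDict, i is one of the elements of tfDict (a dictionary), not an integer index. That is perfectly fine until you evaluate tfDict[i][j], because tfDict[i] expects i to be an integer index rather than the element itself.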
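To make the distinction concrete, here is a small sketch with a made-up list of dictionaries (tf_list and its values are illustrative, not the question's data). It shows that a plain for loop yields the dictionaries themselves, that using one of them as a list index raises the TypeError from the question, and that enumerate is one way to get an integer index if the position is really needed.

```python
# Hypothetical stand-in for tfDict: a list of per-document dictionaries.
tf_list = [{"this": 0.5, "is": 0.5}, {"document": 1.0}]

# Iterating over a list yields its elements, here the dictionaries.
element_types = [type(item).__name__ for item in tf_list]

# Using an element as a list index reproduces the TypeError from the question.
try:
    tf_list[tf_list[0]]
    raised = False
except TypeError:
    raised = True

# enumerate() yields (index, element) pairs when the position is needed too.
pairs = [(idx, sorted(item)) for idx, item in enumerate(tf_list)]

print(element_types)  # ['dict', 'dict']
print(raised)         # True
print(pairs)          # [(0, ['is', 'this']), (1, ['document'])]
```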
Solution: use i[j] instead of tfDict[i][j].
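One thing to watch for after that change: tfIdfDict is keyed only by word, so each document overwrites the scores written by the previous one for any word they share. If per-document scores are what you want, one option is to build one dictionary per document. The following is a minimal, self-contained sketch under that assumption; compute_tf, compute_idf and compute_tf_idf are illustrative names that do not appear in the question, and the IDF uses the same smoothed formula as the question.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
documents = [doc.split() for doc in corpus]

def compute_tf(tokens):
    """Relative frequency of each word within one tokenized document."""
    counts = {}
    for word in tokens:
        counts[word] = counts.get(word, 0) + 1
    return {word: n / len(tokens) for word, n in counts.items()}

def compute_idf(docs):
    """Smoothed IDF, 1 + ln((1 + N) / (1 + df)), as in the question."""
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    n_docs = len(docs)
    return {w: 1 + math.log((1 + n_docs) / (1 + df)) for w, df in doc_freq.items()}

def compute_tf_idf(tf_list, idf):
    """One {word: tf * idf} dictionary per document."""
    return [{word: tf * idf[word] for word, tf in tf_doc.items()} for tf_doc in tf_list]

tf_list = [compute_tf(tokens) for tokens in documents]
idf = compute_idf(documents)
tf_idf = compute_tf_idf(tf_list, idf)

for row in tf_idf:
    print(row)
```

These are raw, unnormalized scores. If the goal is to match scikit-learn's TfidfVectorizer, note that with its default settings it uses raw counts for tf and L2-normalizes each row, so its numbers will differ from these.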