I'm currently working on a small project and have drawn a blank. I have the following code to calculate term frequency:

from Bag import *

words = ['the', 'new', 'the', 'shiny', 'new', 'car', 'went', 'through', 'the', 'tunnel']
carDoc = Bag()
for word in words:
    carDoc.add(word)

def tf(word, carDoc):
    if word != "" and carDoc.size() > 0:
        return carDoc.count(word) / carDoc.size()

I also have the following inverse document frequency code:

from Bag import *
from math import log

carDoc1 = Bag()
for word in ['the', 'car']:
    carDoc1.add(word)
carDoc2 = Bag()
for word in ['the', 'shiny', 'new']:
    carDoc2.add(word)
allCarDocs = [carDoc1, carDoc2]

def idf(word, carDocs):
    total = len(allCarDocs)
    wordIsIn = 0
    for docs in allCarDocs:
        if docs.contains(word):
            wordIsIn = wordIsIn + 1
    return log(total / (1 + wordIsIn))

carDoc1 = Bag()
for word in ['the', 'car']:
    carDoc1.add(word)
carDoc2 = Bag()
for word in ['the', 'shiny', 'new']:
    carDoc2.add(word)
allCarDocs = [carDoc1, carDoc2]

def tf_idf(word, documents):
    return tf(word, carDoc) * idf(word, allCarDocs)

The error I get is that carDoc is not defined.
The tf and idf code both work as I intended, but I keep getting this error whenever I implement the tf_idf function. Any help with getting tf-idf working for this example would be appreciated.
Answer 0 (score: 0)
def tf_idf(word, documents):
    return tf(word, carDoc) * idf(word, allCarDocs)

If your function takes (word, documents), where do you expect it to get carDoc and allCarDocs from?
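
One way to act on that is to pass everything tf_idf needs through its parameters. The following is only a sketch under assumptions: the asker's Bag module isn't shown, so the Bag class here is a hypothetical stand-in exposing the four methods the question uses (add, count, size, contains), and make_bag, car_doc and all_car_docs are names made up for illustration. It also assumes Python 3, where / is true division.

```python
from collections import Counter
from math import log


class Bag:
    """Hypothetical stand-in for the asker's Bag class (a multiset of words)."""

    def __init__(self):
        self._counts = Counter()

    def add(self, item):
        self._counts[item] += 1

    def count(self, item):
        return self._counts[item]

    def size(self):
        # Total number of items, counting repeats.
        return sum(self._counts.values())

    def contains(self, item):
        return self._counts[item] > 0


def make_bag(words):
    # Illustrative helper: build a Bag from a list of words.
    bag = Bag()
    for word in words:
        bag.add(word)
    return bag


def tf(word, doc):
    # Share of the document's words equal to `word`; 0.0 if nothing to count.
    if word != "" and doc.size() > 0:
        return doc.count(word) / doc.size()
    return 0.0


def idf(word, docs):
    # Reads the `docs` parameter instead of a module-level list.
    containing = sum(1 for doc in docs if doc.contains(word))
    return log(len(docs) / (1 + containing))


def tf_idf(word, doc, docs):
    # The document and the collection both arrive as arguments.
    return tf(word, doc) * idf(word, docs)


car_doc = make_bag(['the', 'new', 'the', 'shiny', 'new', 'car',
                    'went', 'through', 'the', 'tunnel'])
all_car_docs = [make_bag(['the', 'car']), make_bag(['the', 'shiny', 'new'])]

print(tf_idf('tunnel', car_doc, all_car_docs))
```

Note that with the 1 + count term kept from the question's formula, a word that appears in every document (such as 'the' here) gets a negative idf, so its tf-idf comes out below zero.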