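My goal is to print the number of times a word occurs in each file of a list of files. The problem is that my code counts the occurrence as 1 even when the word appears several times on the same line.
For example: like like like like
The output is 1 instead of 4.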
import os
import math
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
folderpath = "C:\\Users\\user\\Desktop\\Documents"
word = input("Choose a word : ")
for (path, dirs, files) in os.walk(folderpath, topdown=True):
    for file in files:
        counter = 0
        idf = 0
        filepath = os.path.join(path, file)
        with open(filepath, 'r') as f:
            info = f.readlines()
            for line in f:
                if word in str(info).casefold() and word not in stopwords:
                    for line in info:
                        if word in line:
                            counter=counter+1
                            idf = 1 + math.log10(counter)
        weight = idf * counter
        print("The tf in" + " " + os.path.splitext(file)[0] + " "+ "is :" + " " + " " + str(counter))
        print ("The idf is" + ":" + " "+ str(idf))
        print("The weight is"+":" + " " + str(weight))
        print(" ")
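The result is:
the name of the document and the term frequency,
then the inverse document frequency,
and the weight.
What I expect:
the same result, with one difference.
The term frequency (the counter of occurrences) must be the number of times the word occurs in the file, but it is actually the number of lines that contain the word, because of this line: if word in line adds 1 to the counter regardless of how many times the word occurs on that line.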
Answer 0 (score: 3)
I think you are running into the problem because of this:
if word in str(info).casefold() and word not in stopwords:
    for line in info:
        if word in line:
            counter=counter+1
            idf = 1 + math.log10(counter)
This only adds 1 to your counter for each matching line, no matter how many times the word appears on it.
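A minimal sketch of the difference, using a made-up sample line: the `in` test only answers whether the word is present, while `str.count` reports how many non-overlapping times it appears.

```python
# Hypothetical sample line, used only for illustration.
line = "like like like like"
word = "like"

# Membership test: True or False, so a per-line counter grows by at most 1.
per_line_hit = 1 if word in line else 0

# str.count returns the number of non-overlapping occurrences in the line.
occurrences = line.count(word)

print(per_line_hit, occurrences)
```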
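I think you would be better off running re.findall on each line and then adding the number of results it returns to your counter.
See the code below. It is not a complete solution, but I think you can see how to plug it into your code.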
import re
Mylist = ("like like like like like like", "right ike left like herp derp")  # This is in place of your files.
word = "like"  # word to look for
counter = 0
for line in Mylist:  # in your code this would be "for line in f:"
    search = re.findall(word, line)  # use re.findall to find all instances of your word in the given line.
    for match in search:  # then count every match returned by re.findall for that line into your counter.
        counter = counter + 1
print(counter)
This code prints
7
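One caveat: `re.findall(word, text)` treats `word` as a regular expression and also matches substrings, so `like` would be found inside `likely`. If whole-word, case-insensitive counting is what you want (an assumption about the goal, not something the question states), a sketch with a hypothetical helper `count_whole_word` could look like this:

```python
import re

def count_whole_word(word, text):
    """Count case-insensitive whole-word occurrences of `word` in `text`.

    Hypothetical helper for illustration; not part of the original code.
    """
    # re.escape keeps characters such as "." or "+" in the word literal;
    # \b anchors the match at word boundaries.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_whole_word("like", "Like likely like unlike"))
```

Here only `Like` and the standalone `like` are counted; `likely` and `unlike` are skipped because of the word boundaries.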
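There is a further optimization: since you are using re.findall, you do not need to read the file line by line. You can search the whole file at once, like this.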
with open(filepath, 'r') as f:
    info = f.read()
    search = re.findall(word, info)
    for i in search:
        counter = counter + 1
This should return the same value, with one less level of looping.
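Putting the pieces together, here is one possible sketch of how the whole-file count might slot into the `os.walk` loop from the question. The function names `term_frequency` and `report` are hypothetical, the files are assumed to be plain text readable with the default encoding, and the `idf` value simply reuses the question's own formula `1 + log10(counter)`, guarded so that a count of 0 does not raise an error.

```python
import math
import os
import re

def term_frequency(filepath, word):
    """Return how many times `word` occurs in the file (case-insensitive).

    Hypothetical helper; matches substrings, like the re.findall call above.
    """
    with open(filepath, "r") as f:
        text = f.read().casefold()
    # re.escape makes the search word literal rather than a regex pattern.
    return len(re.findall(re.escape(word.casefold()), text))

def report(folderpath, word):
    """Yield (name, tf, idf, weight) for every file under `folderpath`."""
    for path, dirs, files in os.walk(folderpath, topdown=True):
        for file in files:
            tf = term_frequency(os.path.join(path, file), word)
            # Same formula as in the question; skip log10 when tf is 0.
            idf = 1 + math.log10(tf) if tf > 0 else 0
            yield os.path.splitext(file)[0], tf, idf, idf * tf
```

It could then be used as `for name, tf, idf, weight in report(folderpath, word): print(name, tf, idf, weight)`, with `folderpath` and `word` defined as in the question.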