How to count the occurrences of a word within a line of a file

Time: 2019-02-11 21:37:27

Tags: python python-3.x python-2.7

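My goal is to print the number of times a word occurs in each file of a folder, but my code counts the word only once per line, even when it appears several times on that line.
For example: like like like like
The output is 1 instead of 4.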

import os
import math
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords= set(stopwords.words('english'))
folderpath = "C:\\Users\\user\\Desktop\\Documents"
word = input("Choose a word : ")

for(path, dirs, files) in os.walk(folderpath, topdown=True):

    for file in files:

        counter = 0
        idf = 0
        filepath = os.path.join(path, file)
        with open(filepath, 'r') as f:

            info = f.readlines()
            for line in f:

                if word in str(info).casefold() and word not in stopwords:

                    for line in info:

                        if word in line:


                            counter=counter+1 
                            idf = 1 + math.log10(counter)    



        weight = idf * counter

        print("The tf in" + " " + os.path.splitext(file)[0] + " "+ "is :" + " " +  " " +  str(counter))
        print ("The idf is" + ":" + " "+ str(idf))
        print("The weight is"+":" + " " + str(weight))
        print(" ")

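The result is:
the name of the document and the term frequency,
then the inverse document frequency and the weight.
But what I expected:
The term frequency (the counter of occurrences) should be the number of times the word occurs in the file. Instead, it is the number of lines that contain the word, because the counter is incremented by 1 whenever the word is in a line, regardless of how many times it occurs there.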

1 Answer:

Answer 0 (score: 3)

I think you are running into the problem because of this part:

            if word in str(info).casefold() and word not in stopwords:

                for line in info:

                    if word in line:


                        counter=counter+1 
                        idf = 1 + math.log10(counter)

This only adds 1 to your counter for each matching line, no matter how many times the word appears on that line.
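To make the difference concrete, here is a minimal sketch comparing the two ways of counting. The `lines` list is made-up sample data standing in for one file, and `str.count` is used here only as a simple reference count of non-overlapping occurrences.

```python
# Hypothetical sample data standing in for the lines of one file.
lines = ["like like like like\n", "nothing here\n", "I like it\n"]
word = "like"

# Membership test: contributes at most 1 per line, however often the word repeats.
matching_lines = sum(1 for line in lines if word in line)

# str.count: contributes every non-overlapping occurrence found on each line.
occurrences = sum(line.count(word) for line in lines)

print(matching_lines)  # 2
print(occurrences)     # 5
```

The first number is what the `if word in line:` test produces; the second is the count the question is asking for.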

I think you would be better off calling re.findall on each line and then adding the number of matches it returns to your counter.

See the code below. It is not a complete solution, but I think you can see how to plug it into your code.

import re

Mylist = ("like like like like like like", "right ike left like herp derp") # This is in place of your files.

word = "like" # word to look for
counter = 0

for line in Mylist: # in your code this would be "for line in f:"
    search = re.findall(word, line) # use re.findall to find all instances of your word in the given line.
    for match in search: # then count every match that re.findall returned for that line into your counter.
        counter = counter + 1

print(counter)

This code prints

7
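One caveat worth noting: re.findall treats the search word as a regular expression and matches it anywhere, so "like" is also found inside "likely" or "unlike". If only whole words should count, something along these lines might work. This is a sketch, not part of the original answer: `count_whole_word` is a hypothetical helper name, the case-insensitive flag is my assumption, and `\b` works as expected only when the word starts and ends with a letter, digit, or underscore.

```python
import re

def count_whole_word(word, text):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    # re.escape keeps characters such as "." or "+" in the word literal;
    # \b anchors the match at word boundaries so "like" skips "likely".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "Like it or not, it is likely that I like unlike things."

substring_hits = len(re.findall("like", sample))  # matches inside likely, like, unlike
whole_word_hits = count_whole_word("like", sample)  # matches Like and like only

print(substring_hits)   # 3
print(whole_word_hits)  # 2
```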

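There is also a further optimization: because you are using re.findall, you do not need to read the file line by line. You can search the whole file at once, like this.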

    with open(filepath, 'r') as f:
        info = f.read()
        search = re.findall(word, info)
        for i in search:
            counter = counter + 1

This should return the same value with one less level of looping.
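Putting the pieces together, a per-file count over a folder could look roughly like the sketch below. The function names `count_in_file` and `count_in_folder` are hypothetical, UTF-8 encoding is assumed, and `casefold` is applied to both the word and the text to mirror the case-insensitive check in the question.

```python
import os
import re

def count_in_file(filepath, word):
    """Return how many times `word` occurs in the file at `filepath`."""
    with open(filepath, "r", encoding="utf-8") as f:  # encoding is an assumption
        text = f.read()  # read the whole file once, as a single string
    # casefold both sides to ignore case; re.escape treats the word literally.
    return len(re.findall(re.escape(word.casefold()), text.casefold()))

def count_in_folder(folderpath, word):
    """Map each file path under `folderpath` to its occurrence count."""
    counts = {}
    for path, dirs, files in os.walk(folderpath):
        for name in files:
            filepath = os.path.join(path, name)
            counts[filepath] = count_in_file(filepath, word)
    return counts
```

Each value in the returned dictionary could then take the place of `counter` in the question's loop. Note that `math.log10(counter)` raises a ValueError when the count is 0, so a file without the word needs to be handled separately before computing the idf.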