I'm new to Python, and I'm trying to write a basic filter for a text file and then count the frequency of the words found in the filtered lines. I'm also trying to apply a stopword list to those lines. This is what I have so far:
import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0

# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []

# apply stopword list to items in list containing lines matching term
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)

print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)
This runs fine (though it's probably quite messy/inefficient), but the filtered word counts it returns seem far too high — roughly ten times what they should actually be.
I'm just wondering whether anyone can spot something I'm missing and point me toward the right fix. Any help is really appreciated, thanks!
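For comparison, here is a minimal, self-contained sketch of what I think the stopword step is meant to do: test each individual word against the stopword list, rather than appending a whole line once per stopword check. The sample lines and the hard-coded stop set are illustrative stand-ins (the real code would use stopwords.words("english") and the lines matched from the file):

```python
import re
from collections import Counter

# Stand-in for stopwords.words("english"), so this snippet runs without nltk.
stop = {"the", "a", "is", "in", "of", "and"}

# Stand-in for the lines collected in the `empty` list.
lines = [
    "the cat sat in the hat",
    "a cat and a dog",
]

# Keep each non-stopword token exactly once per occurrence, then count.
kept_words = []
for line in lines:
    for word in re.findall(r"\w+", line.lower()):
        if word not in stop:
            kept_words.append(word)

wordcount = Counter(kept_words)
print(wordcount)
```

Because each word is appended at most once per occurrence, the counts stay proportional to the text instead of being multiplied by the size of the stopword list.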