Counting words in a basic Python filter script?

Asked: 2015-11-25 22:21:08

Tags: python counter nltk

I'm new to Python, and I'm trying to write a basic filter for a text file and then count the frequency of the words found in the filtered lines. I'm also trying to apply a stopword list to them. So far I have this:

import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)  
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0


# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []


# apply stopword list to items in list containing lines matching term 
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)

print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)

This works fine (though it's probably very messy/inefficient), but it seems to return a filtered word count that is far too high, in fact about ten times too high.

I was just wondering if anyone could spot something I'm missing and point me in the right direction. Any help would be really appreciated, thanks!
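For what it's worth, the inflated counts most likely come from the nested stopword loop: `stoplist.append(y)` runs once for every stopword that is *not* a substring of the line, so each matching line gets appended many times over before the words are counted. A minimal Python 3 sketch of the intended logic, filtering word by word instead (the small hard-coded `stop` set and the `lines` sample stand in for `nltk.corpus.stopwords.words("english")` and the input file):

```python
import re
from collections import Counter

stop = {"the", "a", "is", "in", "and"}  # stand-in for the NLTK stopword list

lines = ["the cat is on the mat", "a dog and a cat"]  # stand-in for the file
term = "cat"

# keep each matching line exactly once
matching = [line for line in lines if term in line]

# tokenize the matching lines and drop stopwords token by token,
# rather than appending whole lines per absent stopword
kept = [w for line in matching
          for w in re.findall(r"\w+", line)
          if w not in stop]

wordcount = Counter(kept)
print(wordcount)
print("Number of times %s appears in text is: %d" % (term, len(matching)))
```

With the sample input above, `wordcount` counts `cat` twice and never counts the stopwords, and each line contributes its words only once.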

0 answers:

There are no answers yet