I'm new to Python, and I'm trying to write a basic filter for a text file and then count the frequency of the words found in the filtered lines. I'm also trying to apply a stopword list to those lines. This is what I have so far:
import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0

# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []

# apply stopword list to items in list containing lines matching term
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)

print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)
This runs fine (though it's probably quite messy/inefficient), but the filtered word counts it returns seem far too high — roughly ten times what they should actually be.
I'm just wondering whether anyone can spot something I'm missing and point me toward the right fix. Any help is really appreciated, thanks!
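For comparison, here is a minimal, self-contained sketch of what I think the stopword step is meant to do: test each individual word against the stopword list, rather than appending a whole line once per stopword check. The sample lines and the hard-coded stop set are illustrative stand-ins (the real code would use stopwords.words("english") and the lines matched from the file):

```python
import re
from collections import Counter

# Stand-in for stopwords.words("english"), so this snippet runs without nltk.
stop = {"the", "a", "is", "in", "of", "and"}

# Stand-in for the lines collected in the `empty` list.
lines = [
    "the cat sat in the hat",
    "a cat and a dog",
]

# Keep each non-stopword token exactly once per occurrence, then count.
kept_words = []
for line in lines:
    for word in re.findall(r"\w+", line.lower()):
        if word not in stop:
            kept_words.append(word)

wordcount = Counter(kept_words)
print(wordcount)
```

Because each word is appended at most once per occurrence, the counts stay proportional to the text instead of being multiplied by the size of the stopword list.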