I have a vocabulary file containing words that I need to find in other text documents. I need to count how many times each word occurs, if at all. For example:
vocabulary.txt:
thought
await
thorough
away
red
test.txt:
I thought that if i await thorough enough, my thought would take me away.
Away I thought the thought.
In the end, I should see 4 instances of "thought", 1 of "await", 2 of "away", 1 of "thorough", and 0 of "red". I tried it this way:
for vocabLine in vocabOutFile:
    wordCounter = 0
    print >> sys.stderr, "Vocab word:", vocabLine
    for line in testFile:
        print >> sys.stderr, "Line 1 :", line
        if vocabLine.rstrip('\r\n') in line.rstrip('\r\n'):
            print >> sys.stderr, "Vocab word is in line"
            wordCounter = wordCounter + line.count(vocabLine)
            print >> sys.stderr, "Word counter", wordCounter
    testFile.seek(0, 0)
I had a hunch that the carriage-return characters in the vocab file were preventing the words from being recognized, because during debugging I determined it was only matching whichever word happened to sit at the end of a line. However, even after using rstrip(), the counts still come out wrong. Once all of this works, I also have to remove from the vocabulary any word that does not occur more than 2 times.
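The underlying pitfall can be demonstrated in isolation. This is a minimal sketch (Python 3 syntax, using one hypothetical line from test.txt) showing why `in` and `str.count` are substring checks, not word checks:

```python
# One line from the test file above
line = "Away I thought the thought."

# str.count matches substrings, not whole words, and is case-sensitive
print(line.count("thought"))   # 2
print("away" in line)          # False: "Away" is capitalized
print("thought\n" in line)     # False: a trailing newline breaks the match
```

This is why the vocab word must be stripped of its newline before comparing, and why case and word boundaries still need handling even after rstrip().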
What am I doing wrong?
Thanks!
Answer 0 (score: 2)
Use regex and collections.Counter:
import re
from collections import Counter
from itertools import chain

with open("voc") as v, open("test") as test:
    # create a set of words from the vocabulary file
    words = set(line.strip().lower() for line in v)
    # find words in the test file using regex
    words_test = [re.findall(r'\w+', line) for line in test]

# create a Counter of words found in the test file that are in the vocab set
counter = Counter(word.lower() for word in chain(*words_test)
                  if word.lower() in words)
for word in words:
    print word, counter[word]
Output:
thought 4
away 2
await 1
red 0
thorough 1
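The same approach can be sketched in Python 3 as a self-contained, runnable version (the in-memory StringIO stand-ins for the voc and test files are an assumption for illustration, not part of the original answer):

```python
import re
from collections import Counter
from io import StringIO

# Hypothetical in-memory stand-ins for vocabulary.txt and test.txt
vocab_file = StringIO("thought\nawait\nthorough\naway\nred\n")
test_file = StringIO(
    "I thought that if i await thorough enough, my thought would take me away.\n"
    "Away I thought the thought.\n"
)

# set of lowercased vocabulary words, newlines stripped
words = set(line.strip().lower() for line in vocab_file)

# count only the test-file words that appear in the vocabulary set
counter = Counter(
    w.lower()
    for line in test_file
    for w in re.findall(r"\w+", line)
    if w.lower() in words
)

for word in sorted(words):
    print(word, counter[word])
```

Because `counter` is a Counter, looking up an absent word (here "red") returns 0 rather than raising a KeyError, which matches the output shown above.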
Answer 1 (score: 2)
Making a dictionary from the vocabulary is a good idea:
vocab_counter = {vocabLine.strip().lower(): 0 for vocabLine in vocabOutFile}
Then scan testFile just once (more efficient), incrementing the count for each word:
for line in testFile:
    for word in re.findall(r'\w+', line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1
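This fragment can be rounded out into a runnable Python 3 sketch, with hypothetical in-memory lists standing in for vocabOutFile and testFile, plus the final step the question asks for: dropping vocabulary words that occur no more than 2 times (read here as "keep count > 2"):

```python
import re

# Hypothetical data standing in for vocabOutFile and testFile
vocab_lines = ["thought\n", "await\n", "thorough\n", "away\n", "red\n"]
test_lines = [
    "I thought that if i await thorough enough, my thought would take me away.\n",
    "Away I thought the thought.\n",
]

# dict mapping each lowercased vocab word to a running count of 0
vocab_counter = {line.strip().lower(): 0 for line in vocab_lines}

# single pass over the test file, counting whole words only
for line in test_lines:
    for word in re.findall(r'\w+', line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1

# keep only words that occur more than 2 times
frequent = {w: n for w, n in vocab_counter.items() if n > 2}
print(frequent)  # {'thought': 4}
```

Lowercasing the whole line before `re.findall` handles the "Away"/"away" mismatch that tripped up the original substring approach.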