I have a vocabulary file containing words that I need to find in other text documents. I need to count how many times each word occurs, if at all. For example:
vocabulary.txt:
thought
await
thorough
away
red
test.txt:
I thought that if i await thorough enough, my thought would take me away.
Away I thought the thought.
In the end, I should see 4 instances of "thought", 1 of "await", 2 of "away", 1 of "thorough", and 0 of "red". I tried it this way:
for vocabLine in vocabOutFile:
    wordCounter = 0
    print >> sys.stderr, "Vocab word:", vocabLine
    for line in testFile:
        print >> sys.stderr, "Line 1 :", line
        if vocabLine.rstrip('\r\n') in line.rstrip('\r\n'):
            print >> sys.stderr, "Vocab word is in line"
            wordCounter = wordCounter + line.count(vocabLine)
            print >> sys.stderr, "Word counter", wordCounter
    testFile.seek(0, 0)
I had a hunch that the carriage-return characters in the vocab file were preventing the words from being recognized, because during debugging I determined it was only matching whichever word happened to sit at the end of a line. However, even after using rstrip(), the counts still come out wrong. Once all of this works, I also have to remove from the vocabulary any word that does not occur more than 2 times.
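The underlying pitfall can be demonstrated in isolation. This is a minimal sketch (Python 3 syntax, using one hypothetical line from test.txt) showing why `in` and `str.count` are substring checks, not word checks:

```python
# One line from the test file above
line = "Away I thought the thought."

# str.count matches substrings, not whole words, and is case-sensitive
print(line.count("thought"))   # 2
print("away" in line)          # False: "Away" is capitalized
print("thought\n" in line)     # False: a trailing newline breaks the match
```

This is why the vocab word must be stripped of its newline before comparing, and why case and word boundaries still need handling even after rstrip().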
What am I doing wrong?
Thanks!
Answer 0 (score: 2)
Use regex and collections.Counter:
import re
from collections import Counter
from itertools import chain

with open("voc") as v, open("test") as test:
    # create a set of words from the vocabulary file
    words = set(line.strip().lower() for line in v)
    # find words in the test file using regex
    words_test = [re.findall(r'\w+', line) for line in test]

# create a Counter of words found in the test file that are in the vocab set
counter = Counter(word.lower() for word in chain(*words_test)
                  if word.lower() in words)
for word in words:
    print word, counter[word]
Output:
thought 4
away 2
await 1
red 0
thorough 1
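The same approach can be sketched in Python 3 as a self-contained, runnable version (the in-memory StringIO stand-ins for the voc and test files are an assumption for illustration, not part of the original answer):

```python
import re
from collections import Counter
from io import StringIO

# Hypothetical in-memory stand-ins for vocabulary.txt and test.txt
vocab_file = StringIO("thought\nawait\nthorough\naway\nred\n")
test_file = StringIO(
    "I thought that if i await thorough enough, my thought would take me away.\n"
    "Away I thought the thought.\n"
)

# set of lowercased vocabulary words, newlines stripped
words = set(line.strip().lower() for line in vocab_file)

# count only the test-file words that appear in the vocabulary set
counter = Counter(
    w.lower()
    for line in test_file
    for w in re.findall(r"\w+", line)
    if w.lower() in words
)

for word in sorted(words):
    print(word, counter[word])
```

Because `counter` is a Counter, looking up an absent word (here "red") returns 0 rather than raising a KeyError, which matches the output shown above.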
Answer 1 (score: 2)
Making a dictionary from the vocabulary is a good idea:
vocab_counter = {vocabLine.strip().lower(): 0 for vocabLine in vocabOutFile}
Then scan testFile just once (more efficient), incrementing the count for each word:
for line in testFile:
    for word in re.findall(r'\w+', line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1
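This fragment can be rounded out into a runnable Python 3 sketch, with hypothetical in-memory lists standing in for vocabOutFile and testFile, plus the final step the question asks for: dropping vocabulary words that occur no more than 2 times (read here as "keep count > 2"):

```python
import re

# Hypothetical data standing in for vocabOutFile and testFile
vocab_lines = ["thought\n", "await\n", "thorough\n", "away\n", "red\n"]
test_lines = [
    "I thought that if i await thorough enough, my thought would take me away.\n",
    "Away I thought the thought.\n",
]

# dict mapping each lowercased vocab word to a running count of 0
vocab_counter = {line.strip().lower(): 0 for line in vocab_lines}

# single pass over the test file, counting whole words only
for line in test_lines:
    for word in re.findall(r'\w+', line.lower()):
        if word in vocab_counter:
            vocab_counter[word] += 1

# keep only words that occur more than 2 times
frequent = {w: n for w, n in vocab_counter.items() if n > 2}
print(frequent)  # {'thought': 4}
```

Lowercasing the whole line before `re.findall` handles the "Away"/"away" mismatch that tripped up the original substring approach.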