How to count word occurrences without being limited to exact matches

Time: 2019-06-06 20:41:34

Tags: python python-3.x

I have a file with the following content.

Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response

I have a Python script that counts how many times a particular word occurs in the file. Here is the script.

#!/usr/bin/env python

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

Since Hello occurs 4 times, I expect the output to be 4. But the output I get is 2. Here is the script's output:

Number of words in /path/to/file.txt is: 2

I more or less understand that Hello; is not counted as Hello, because that token is not exactly the word being searched for.

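Question:
Is there a way to make my script pick up Hello even when it is followed by a comma, semicolon, or period? I am looking for a simple technique that does not require searching for a substring again inside each word that was found.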

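3 answers:

Answer 0: (score: 1)

A regular expression would be a better tool here, since you want to ignore punctuation. It could be done with some clever filtering and the .count() method, but this is simpler: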

import re
...
search_string = "Hello"
with open(filename, 'r') as file:
    filetext = file.read()
occurrences = len(re.findall(search_string, filetext))

print("Number of words in " + filename + " is: " + str(occurrences))

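If you want it to be case-insensitive, you can change search_string accordingly. The pattern below accepts Hello or hello; passing re.IGNORECASE to re.findall would ignore case for every letter: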

search_string = r"[Hh]ello"

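Or, if you explicitly want the word Hello and not something like aHello or Hellon, you can match the \b word boundary before and after it (see the documentation for more fun tricks):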

search_string = r"\bHello\b"
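
The pieces of this answer can be combined into one helper. The following is only a sketch: the function name count_word and its ignore_case parameter are made up for illustration, and it assumes the search word consists of ordinary word characters, so that \b behaves as expected at both ends.

```python
import re

def count_word(text, word, ignore_case=False):
    """Count whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape keeps any regex metacharacters in the word literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))

# The sample file content from the question.
sample = (
    "Someone says; Hello; Someone responded Hello back\n"
    "Someone again said; Hello; No response\n"
    "Someone again said; Hello waiting for response\n"
)

print(count_word(sample, "Hello"))  # 4
print(count_word("hello, Hello. HELLO; Helloing", "hello", ignore_case=True))  # 3
```

Trailing punctuation does not matter because \b matches between the final o and the ; or , that follows it, while Helloing is rejected because there is no boundary between o and i.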

Answer 1: (score: 1)

You can use a regular expression together with Counter from the collections module:

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

import re
from collections import Counter
from pprint import pprint

c = Counter()
re.sub(r'\b\w+\b', lambda r: c.update((r.group(0), )), txt)
pprint(c)

Prints:

Counter({'Someone': 4,
         'Hello': 4,
         'again': 2,
         'said': 2,
         'response': 2,
         'says': 1,
         'responded': 1,
         'back': 1,
         'No': 1,
         'waiting': 1,
         'for': 1})
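
The same tally can probably be built without the side-effecting lambda inside re.sub. As a sketch of that alternative (the variable name counts is chosen only for illustration), re.findall returns the list of words and Counter counts them directly:

```python
import re
from collections import Counter

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

# re.findall returns every run of word characters as a list of strings,
# so punctuation such as ';' never becomes part of a word.
counts = Counter(re.findall(r"\w+", txt))

print(counts["Hello"])    # 4
print(counts["Someone"])  # 4
```

Looking up a single word afterwards is a plain dictionary access, and a word that never occurs simply gives 0.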

Answer 2: (score: 1)

You can find the answer using a regular expression.

import re
filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"


with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            b = re.search(r'\bHello;?\b', i)
            if b:
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

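This checks specifically for "Hello" or "Hello;" in the file. You can extend the regular expression to cover other needs (such as lowercase).

It will ignore things like "Helloing", which other examples here might count.

If you would rather not use regular expressions, you can check whether the word matches once its last character is removed, like this: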

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string) or (i[:-1] == search_string and i[-1] == ';'):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))
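
The check above only handles a trailing semicolon. One way to also cover commas and periods without regular expressions, sketched here under the assumption that punctuation only appears at the start or end of a token, is to strip punctuation from each token before comparing. The function name count_word_plain is hypothetical.

```python
import string

def count_word_plain(lines, word):
    """Count tokens equal to `word` after stripping surrounding punctuation."""
    count = 0
    for line in lines:
        for token in line.split():
            # str.strip removes any leading/trailing characters found in
            # string.punctuation, e.g. ',', ';' or '.'.
            if token.strip(string.punctuation) == word:
                count += 1
    return count

# The sample file content from the question, one string per line.
sample_lines = [
    "Someone says; Hello; Someone responded Hello back",
    "Someone again said; Hello; No response",
    "Someone again said; Hello waiting for response",
]

print(count_word_plain(sample_lines, "Hello"))  # 4
```

An open file object can be passed in place of sample_lines, since iterating over it yields lines in the same way.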