Question

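I'm trying to write a Python function that counts occurrences of a specific word in a string.

My regex pattern fails when the word I want to count is repeated several times in a row. Otherwise, the pattern seems to work fine.

Here is my function: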

import re

def word_count(word, text):
    return len(re.findall('(^|\s|\b)'+re.escape(word)+'(\,|\s|\b|\.|$)', text, re.IGNORECASE))

When I test it with an arbitrary string:

>>> word_count('Linux', "Linux, Word, Linux")
2

When the word I want to count is adjacent to itself:

>>> word_count('Linux', "Linux Linux")
1

Answer 1

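The problem is in your regex. It uses 2 capture groups, and re.findall returns the capture groups whenever the pattern contains any. They need to be changed to non-capturing groups using (?:...).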
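To illustrate the point about groups, here is a small sketch. The patterns are simplified stand-ins, not the asker's exact regex. With capturing groups, re.findall returns one tuple of group contents per match; with non-capturing groups it returns the matched text. The number of matches is the same either way, so this affects what is returned rather than the count.

```python
import re

text = "Linux, Word, Linux"

# With capturing groups, findall returns one tuple of group contents per match.
with_groups = re.findall(r'(^|\s)Linux(,|$)', text)

# With non-capturing groups, findall returns the full matched text instead.
without_groups = re.findall(r'(?:^|\s)Linux(?:,|$)', text)

print(with_groups)     # [('', ','), (' ', '')]
print(without_groups)  # ['Linux,', ' Linux']
```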

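Also, there is no reason to use (^|\s|\b): \b, the word boundary, already covers all of those cases, and it is zero-width, so it consumes no characters.

Likewise, (\,|\s|\b|\.|$) can be changed to \b.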
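As far as the posted code goes, a likely reason the original pattern returns 1 for "Linux Linux" is sketched below. The pattern is written as a normal (non-raw) string, so Python turns '\b' into a backspace character before the regex engine sees it. The trailing \s alternative then consumes the space between the two words, which leaves nothing for the leading group to match in front of the second word. The two helper names are made up for this illustration.

```python
import re

# In a normal string literal '\b' is the backspace character (0x08);
# in a raw string it is a backslash followed by 'b', which the regex
# engine interprets as a word boundary.
assert '\b' == '\x08'
assert r'\b' == '\\b'

def count_with_backspace(word, text):
    # Same shape as the question's pattern, with the backspace written out explicitly.
    pattern = '(^|\\s|\x08)' + re.escape(word) + '(,|\\s|\x08|\\.|$)'
    return len(re.findall(pattern, text, re.IGNORECASE))

def count_with_boundary(word, text):
    # Same shape, but built from raw strings so \b really is a word boundary.
    pattern = r'(^|\s|\b)' + re.escape(word) + r'(,|\s|\b|\.|$)'
    return len(re.findall(pattern, text, re.IGNORECASE))

print(count_with_backspace('Linux', "Linux Linux"))  # 1
print(count_with_boundary('Linux', "Linux Linux"))   # 2
```

Using raw strings (the r prefix) for regex patterns avoids this class of surprise.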

So you can use:

import re

def word_count(word, text):
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.I))

This gives:

>>> word_count('Linux', "Linux, Word, Linux")
2
>>> word_count('Linux', "Linux Linux")
2
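One caveat, offered as a hedged aside: \b only marks a position between a word character and a non-word character, so the \b-based pattern can miss words that begin or end with punctuation, such as C++. If that matters, a possible variant uses lookarounds instead. The name word_count_lookaround is hypothetical and not part of the answer above.

```python
import re

def word_count_lookaround(word, text):
    # (?<!\w) and (?!\w) require that no word character touches the match,
    # which also works when `word` itself starts or ends with punctuation.
    pattern = r'(?<!\w)' + re.escape(word) + r'(?!\w)'
    return len(re.findall(pattern, text, re.IGNORECASE))

print(word_count_lookaround('Linux', "Linux Linux"))       # 2
print(word_count_lookaround('C++', "C++ and c++, not C"))  # 2
```

For plain alphanumeric words such as Linux, this behaves the same as the \b version.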

Answer 2

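I'm not sure this is 100% what you're after, because I don't understand why the word is passed to the function when you are just looking for words that repeat in the string. So maybe consider...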

import re

pattern = r'\b(\w+)( \1\b)+'

def word_count(text):
    split_words = text.split(' ')
    count = 0
    for split_word in split_words:
        count = count + len(re.findall(pattern, text, re.IGNORECASE))
    return count

word_count('Linux Linux Linux Linux')

Output:

Maybe this helps.
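For reference, here is a sketch of what the backreference pattern matches on its own: \b(\w+)( \1\b)+ finds a word followed by one or more space-separated copies of itself, so each match covers a whole run of repeats. Counting the words in each run with re.finditer, rather than looping over the split text, might look like this. The name count_repeated_runs is hypothetical, and this is only one way to read the intent.

```python
import re

# A word, then one or more repetitions of " <same word>".
pattern = re.compile(r'\b(\w+)( \1\b)+', re.IGNORECASE)

def count_repeated_runs(text):
    # Map each repeated word (lowercased) to the total length of its runs.
    counts = {}
    for match in pattern.finditer(text):
        word = match.group(1).lower()
        run_length = len(match.group(0).split(' '))
        counts[word] = counts.get(word, 0) + run_length
    return counts

print(count_repeated_runs('Linux Linux Linux Linux'))  # {'linux': 4}
print(count_repeated_runs('Linux Linux word Word'))    # {'linux': 2, 'word': 2}
```

Words that appear only once, or whose repeats are not adjacent, are not matched by this pattern at all.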

Update: based on the comments below...

def word_count(word, text):
    count = text.count(word)
    return count

word_count('Linux', "Linux, Word, Linux")

Output:
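One thing to keep in mind about the text.count(word) version, shown as a quick sketch: str.count counts non-overlapping substring occurrences and is case-sensitive, so its result can differ from a whole-word, case-insensitive regex count. The helper names and the sample string below are made up for illustration.

```python
import re

def count_substring(word, text):
    # Plain substring counting: case-sensitive, ignores word boundaries.
    return text.count(word)

def count_whole_word(word, text):
    # Whole-word, case-insensitive counting with word boundaries.
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.IGNORECASE))

sample = "Linux, linux, Linuxes, Linuxes"
print(count_substring('Linux', sample))   # 3: "Linux" plus the start of each "Linuxes"
print(count_whole_word('Linux', sample))  # 2: "Linux" and "linux"
```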
