Python regular expression to count multiple matching strings in a sentence

Date: 2019-02-02 19:54:47

Tags: python regex

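I am trying to count how often patterns generated from the string hello awesome world occur in a large text. The patterns are generated by permuting the words and replacing one word in between with *. In this example I use only 4 patterns to keep things simple. I'm not very familiar with regex, so my code doesn't match everything I need. I could probably work it out soon, but I'm not sure whether it will scale well once I feed it real data.

How can I fix my code, and is there a better/faster way to achieve my goal? Below is my code, with explanations.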

import re
from collections import Counter


# Input text. Could consist of hundreds of thousands of sentences.
txt = """
Lorèm ipsum WORLD dolor AWESOME sit amèt, consectetur adipiscing elit. 
Duis id AWESOME HELLO lorem metus. Pràesent molestie malesuada finibus. 
Morbi non èx a WORLD HELLO AWESOME erat bibendum rhoncus. Quisque sit 
ametnibh cursus, tempor mi et, sodàles neque. Nunc dapibus vitae ligula at porta. 
Quisque sit amet màgna eù sem sagittis dignissim et non leo. 
Quisque WORLD, AWESOME dapibus et vèlit tristique tristique. Sed 
efficitur dui tincidunt, aliquet lèo eget, pellentesque felis. Donec 
venenatis elit ac aliquet varius. Vestibulum ante ipsum primis in faucibus
orci luctus et ultrices posuere cubilia Curae. Vestibulum sed ligula 
gravida, commodo neque at, mattis urna. Duis nisl neque, sollicitudin nec 
mauris sit amet, euismod semper massa. Curabitur sodales ultrices nibh, 
ut ultrices ante maximus sed. Donec rutrum libero in turpis gravida 
dignissim. Suspendisse potenti. Praesent eu tempor quam, id dictum felis. 
Nullam aliquam molestie tortor, at iaculis metus volutpat et. In dolor 
lacus, AWESOME sip HELLO volutpat ac convallis non, pulvinar eu massa.
"""

txt = txt.lower()

# Patterns generated from a 1-8 word input string. Could also consist of hundreds of 
# thousands of patterns
patterns = [
    'world',
    'awesome',
    'awesome hello', 
    'world hello awesome',
    'world (.*?) awesome'   # '*' - represents any word between
]

regex = '|'.join(patterns)
result = re.findall(regex, txt)
counter = Counter(result)
print(counter)
# >>> Counter({'awesome': 5, 'world': 3})

# For some reason I can't get strings with more than one word to match

# Expected output
found_pattern_counts = {
    'world': 3,
    'awesome': 5,
    'awesome hello': 1, 
    'world hello awesome': 1,
    'world * awesome': 2
}

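2 Answers:

Answer 0: (score: 1)

You weren't using the wildcard correctly. I fixed it, and it now behaves as you described; you can wrap this operation in a separate function if you like: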

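import re

patterns = [
    'world',
    'awesome',
    'awesome hello',
    'world hello awesome',
    'world (.*?) awesome',  # lazy wildcard: any text between the two words
]

# Count each pattern separately so overlapping patterns are all counted.
# txt is the lowercased input text from the question.
result = {}
for pattern in patterns:
    rex = re.compile(pattern)
    result[pattern] = len(rex.findall(txt))

print(result)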
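
One caveat with the snippet above: `.*?` matches any run of characters on the same line, not exactly one word, and the result is keyed by the regex instead of by `world * awesome`. Here is a minimal sketch of one way around that, assuming `*` stands for exactly one whitespace-delimited word and that the pattern words are plain alphanumeric. The helper names `to_regex` and `count_patterns` and the `sample` text are made up for illustration:

```python
import re


def to_regex(pattern):
    """Turn 'world * awesome' into a compiled regex; '*' means exactly one word."""
    parts = [r"\S+" if word == "*" else re.escape(word) for word in pattern.split()]
    # \b keeps 'world' from matching inside a longer word such as 'underworld'.
    # \s+ between words also lets a match span a line break.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")


def count_patterns(text, patterns):
    """Count each pattern on its own, so overlapping patterns are all counted."""
    text = text.lower()
    return {p: len(to_regex(p).findall(text)) for p in patterns}


sample = "Lorem WORLD dolor AWESOME sit. Duis AWESOME HELLO lorem. Non WORLD HELLO AWESOME erat."
print(count_patterns(sample, ["world", "awesome", "awesome hello", "world * awesome"]))
# {'world': 2, 'awesome': 3, 'awesome hello': 1, 'world * awesome': 2}
```

Counting each pattern separately matters because a single alternation such as `world|awesome|world hello awesome` consumes the text left to right without overlaps, so the shorter alternative `world` always wins over `world hello awesome` at the same position.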

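Answer 1: (score: 0)

You could look into

re.finditer()

Iterators save you a lot of resources if you don't need all of the data at once (which you rarely do). That way you don't have to hold as much information in memory. See: Do iterators save memory in Python?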
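
As a rough sketch of that idea, the count can be accumulated from `re.finditer()` without building the full list of matches. The function name `count_matches` and the sample text are hypothetical:

```python
import re


def count_matches(pattern, text):
    """Count non-overlapping matches lazily, one match object at a time."""
    return sum(1 for _ in re.finditer(pattern, text))


text = "world dolor awesome, then awesome hello, then world hello awesome"
print(count_matches(r"awesome", text))            # 3
print(count_matches(r"world \S+ awesome", text))  # 2
```

Note that the text itself still has to be in memory; the saving is only in not materializing the list of matches that `re.findall()` would return.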