Question

How can I get all possible overlapping matches in a Python string that has multiple possible start and end points?

I tried the regex module instead of the default re module so that I could use the overlapped=True parameter, but some matches are still missing.

Here is a simple illustration of the problem:

Find all possible substrings of the string axaybzb that start with a and end with b.

I tried the following code:

import regex

print(regex.findall(r'a\w+b','axaybzb', overlapped=False))

['axaybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=False))

['axayb']

print(regex.findall(r'a\w+b','axaybzb', overlapped=True))

['axaybzb', 'aybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=True))

['axayb', 'ayb']

The expected output is:

['axayb', 'axaybzb', 'ayb', 'aybzb']

Answer 1

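Regular expressions are not the right tool here. I suggest:

Find all indices of the start character in the input string
Find all indices of the end character in the input string
Build every substring from those indices

Code: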

def find(text, ch):                  # 'text' avoids shadowing the built-in str
    for i, ltr in enumerate(text):
        if ltr == ch:
            yield i

s = "axaybzb"
startChar = 'a'
endChar = 'b'

startCharList = list(find(s,startChar))
endCharList = list(find(s,endChar))

output = []
for u in startCharList:
    for v in endCharList:
        if u <= v:
            output.append(s[u:v+1])
print(output)

Output:

$ python substring.py 
['axayb', 'axaybzb', 'ayb', 'aybzb']
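
The same index-based idea can be wrapped in a single reusable function. This is only a minimal sketch: the name substrings_between and its parameters are made up for illustration and are not part of the original answer.

```python
def substrings_between(text, start_char, end_char):
    """Return every slice of text that starts with start_char and ends with end_char."""
    # Positions where a substring may begin and end.
    starts = [i for i, ch in enumerate(text) if ch == start_char]
    ends = [i for i, ch in enumerate(text) if ch == end_char]
    # Pair each start with every end that is not to its left.
    return [text[u:v + 1] for u in starts for v in ends if u <= v]

print(substrings_between("axaybzb", "a", "b"))
# ['axayb', 'axaybzb', 'ayb', 'aybzb']
```

Because the condition is u <= v, passing the same character as both start and end also yields single-character slices.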

Answer 2

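With a simple pattern like yours, you can generate every slice of consecutive characters in the string and test whether each slice fully matches the regex: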

import re

def findall_overlapped(r, s):
  res = []                     # Resulting list
  reg = r'^{}$'.format(r)      # Regex must match full string
  for q in range(len(s)):      # Iterate over all chars in a string
    for w in range(q,len(s)):  # Iterate over the rest of the chars to the right
        cur = s[q:w+1]         # Currently tested slice
        if re.match(reg, cur): # If there is a full slice match
            res.append(cur)    # Append it to the resulting list
  return res

rex = r'a\w+b'
print(findall_overlapped(rex, 'axaybzb'))
# => ['axayb', 'axaybzb', 'ayb', 'aybzb']

See the Python demo.

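Warning: this approach will not work if your pattern checks left or right context, that is, if it has lookaheads or lookbehinds at either end, because that context is lost when you iterate over slices of the string.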
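
A possible variant of the slice-testing approach uses the standard library's Pattern.fullmatch with its pos and endpos arguments instead of wrapping the pattern in ^ and $. It is a sketch, not the answerer's code, and the function name findall_overlapped_fullmatch is hypothetical. One benefit is that patterns containing a top-level alternation such as a|ab are tested as a whole, whereas '^{}$'.format(r) would turn them into ^a|ab$.

```python
import re

def findall_overlapped_fullmatch(pattern, text):
    """Collect every substring of text that the whole pattern matches."""
    compiled = re.compile(pattern)
    matches = []
    for start in range(len(text)):                 # candidate start index
        for end in range(start + 1, len(text) + 1):  # candidate end index (exclusive)
            # fullmatch succeeds only if text[start:end] matches in its entirety
            if compiled.fullmatch(text, start, end):
                matches.append(text[start:end])
    return matches

print(findall_overlapped_fullmatch(r'a\w+b', 'axaybzb'))
# ['axayb', 'axaybzb', 'ayb', 'aybzb']
```

Like the original, this tests every slice, so the number of regex calls grows quadratically with the string length. The re documentation notes that pos and endpos are not completely equivalent to slicing, so patterns with anchors or lookarounds may behave differently from the slice-based version and should be checked separately.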
