Question

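I searched Google for my use case but found nothing useful.

I'm no regular expression expert, so I would appreciate any help from the community.

Problem:

Given a text file, I want to use a regular expression to capture the longest string between two substrings (a prefix and a suffix). Note that both substrings always appear at the start of a line in the text. See the examples below.

Substrings:

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']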

Example 1:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....

Expected result:

Item 1a ....
....
....
....
....

Why?

Because the prefix Item 1a and the suffix Item 2b enclose the longest string of all the prefix-suffix pairs in the text.

Example 2:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
.... Item 1 ....
Item 2
Item 1a .... ....
....
....
.... Item 2b
....

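Expected result:

Item 1 ....
....
....

Why?

Because this is the longest string between a prefix-suffix pair in which both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a - Item 2b), but since Item 2b is not at the start of a line, that longer sequence cannot be considered.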

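What I tried with regex:

I tried the regex below for every prefix-suffix pair from the lists above, but it did not work.

import re

# One pattern per (prefix, suffix) pair, anchored at the start of a line
regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTILINE)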

What I tried without regex (Python string functions):

def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match ):
                    longest_match = match
    return longest_match

Am I missing something?

Answer 1

You need to

- Build a regex that matches a string from the leftmost starting delimiter to the nearest trailing delimiter (see Match text between two strings with regular expression)
- Make sure the delimiters are matched at line start positions only
- Make sure . matches line break characters, with re.DOTALL or an equivalent option (see Python regex, matching pattern over multiple lines)
- Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
- Find all matches in the text (see How can I find all matches to a regular expression in Python?)
- Get the longest one (see Python's most efficient way to choose longest string in list?).

Python demo:

import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))

Output:

Item 1a ....
....
....
....
....
Item 2
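
The output above ends in Item 2 rather than Item 2b because the alternation tries its alternatives from left to right, so Item 2 succeeds before Item 2b is ever attempted. A minimal sketch of that effect on a single test line (the names `line`, `plain` and `bounded` are illustrative, not from the question):

```python
import re

line = "Item 2b ...."

# Without a word boundary the first alternative that fits wins.
plain = re.match(r"(?:Item 2|Item 2a|Item 2b)", line).group()

# With \b, "Item 2" is rejected because a word character ("b") follows it,
# so the engine moves on to the later alternatives.
bounded = re.match(r"(?:Item 2|Item 2a|Item 2b)\b", line).group()

print(repr(plain))    # 'Item 2'
print(repr(bounded))  # 'Item 2b'
```

This is why the word-boundary version of the pattern is probably the safer choice when one delimiter is a prefix of another.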

The regex looks like

(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))

With word boundaries

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))
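
The leading (?sm) in these patterns is the inline spelling of the flags passed to re.findall in the demo. A small sketch, assuming a shortened pattern and a made-up three-line text, suggests both spellings give the same result:

```python
import re

# Hypothetical sample text, much shorter than the one in the question.
text = "Item 1 a\nb\nItem 2 c"

# Flags written inline at the very start of the pattern...
inline = re.findall(r"(?sm)(?=^(Item 1\b.*?^Item 2\b))", text)

# ...or passed as an argument, as in the demo.
flagged = re.findall(r"(?=^(Item 1\b.*?^Item 2\b))", text, re.S | re.M)

print(inline == flagged)  # True
```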


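Details

(?sm) - the re.S and re.M flags
(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any position immediately followed by this sequence of patterns:
- ^ - start of a line
- ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this is the value re.findall returns)
- (?:Item 1|Item 1a|Item 1b) - any of the alternatives (it probably makes sense to add a \b word boundary after the closing parenthesis here)
- .*? - any 0 or more characters, as few as possible
- ^ - start of a line
- (?:Item 2|Item 2a|Item 2b) - any of the alternatives from the list (again, adding a \b word boundary after the closing parenthesis probably makes sense).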
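
The expected results in the question leave out the suffix line itself, while the demo's Group 1 includes the matched suffix. One possible adaptation is to close the capturing group before the suffix, so the suffix is only asserted by the lookahead. The sketch below is an assumption-laden variant, not the original demo: `longest_between` is a hypothetical helper name, re.escape is added on the assumption that the delimiters are literal text, and \b assumes every delimiter ends in a word character.

```python
import re

def longest_between(text, prefixes, suffixes):
    """Return the longest text that starts at a line-initial prefix and
    stops just before the nearest line-initial suffix ('' if none)."""
    pre = "|".join(map(re.escape, prefixes))
    suf = "|".join(map(re.escape, suffixes))
    # Group 1 ends before the suffix line; the suffix is checked, not captured.
    rx = r"(?=^((?:{})\b.*?)^(?:{})\b)".format(pre, suf)
    matches = re.findall(rx, text, re.S | re.M)
    # Drop the line break that precedes the suffix line.
    return max(matches, key=len, default="").rstrip("\n")

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
sample = "Item 1 ....\nItem 2 ....\nItem 1a ....\n....\nItem 2b ...."
print(longest_between(sample, prefixes, suffixes))
```

For the short sample this prints the two lines starting at Item 1a. When no prefix-suffix pair is found, the function returns an empty string instead of raising.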

Regex for the longest matching sequence between two strings
