Question

In a translation-testing application (in Python), I want a regular expression that accepts either of these two strings:

a = "I want the red book"
b = "the book which I want is red"

So far I am using something like this:

^(the book which )*I want (is |the )red (book)*$

This accepts both string a and string b. But it also accepts a string that contains neither of the two optional substrings:

sub1 = (the book which )
sub2 = (book)

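How can I indicate that one of these two substrings must be present, even though they are not adjacent?

I realize that in this example it would be very easy to avoid the problem by testing for longer alternatives separated by the "or" operator |. This is a simplified example of a problem that is hard to avoid with the actual user input I am working with.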

Answer 1

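How can I indicate that one of these two substrings must be present, even though they are not adjacent?

I think this is your core problem.

The solution is two regular expressions. Why do people feel that once they have written import re, the regular expression has to be a one-liner?

First test for the first substring with one regular expression, then test for the other substring with a second one. Combine the two results logically.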
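A minimal sketch of that idea, assuming the strings from the question. The names SHAPE, HAS_PREFIX, HAS_SUFFIX and accepts are hypothetical, and the extra SHAPE check for the overall sentence structure is an assumption added here, not part of the answer itself:

```python
import re

# Assumed structural check: both optional parts are allowed here.
SHAPE = re.compile(r"^(the book which )?I want (is |the )red( book)?$")
# One regex per optional substring, as the answer suggests.
HAS_PREFIX = re.compile(r"^the book which ")
HAS_SUFFIX = re.compile(r" book$")


def accepts(sentence):
    """Return True if the sentence has the right shape and at least
    one of the two optional substrings is present."""
    if not SHAPE.search(sentence):
        return False
    # Combine the two independent results logically.
    return bool(HAS_PREFIX.search(sentence)) or bool(HAS_SUFFIX.search(sentence))


if __name__ == "__main__":
    for s in ("I want the red book",
              "the book which I want is red",
              "I want the red"):
        print(s, "->", accepts(s))
```

Replacing `or` with `!=` on the two booleans would require exactly one of the substrings instead of at least one.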

Answer 2

This seems like a problem that might be solved better with difflib.SequenceMatcher than with regular expressions.
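As a rough sketch of what that could look like, the snippet below scores a candidate against the two reference sentences from the question. The helper names and the idea of taking the best score are assumptions for illustration; any acceptance threshold would have to be tuned against real user input:

```python
from difflib import SequenceMatcher

# Reference translations taken from the question.
REFERENCES = ("I want the red book", "the book which I want is red")


def similarity(candidate, reference):
    """Similarity ratio in [0, 1] between two strings (character based)."""
    return SequenceMatcher(None, candidate, reference).ratio()


def best_similarity(candidate):
    """Highest similarity of the candidate to any reference sentence."""
    return max(similarity(candidate, ref) for ref in REFERENCES)


if __name__ == "__main__":
    for s in ("I want the red book", "I want the red"):
        print(s, "->", round(best_similarity(s), 3))
```

An exact match scores 1.0, while a sentence missing both optional substrings scores lower, so a cutoff below 1.0 decides how much deviation is tolerated.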

However, a regular expression that works for the specific example in the original question is as follows:

^(the book which )*I want (is |the )red((?(1)(?: book)*| book))$

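This fails for the string "I want the red" (which lacks both of the required substrings "the book which " and " book"). It uses the (?(id/name)yes-pattern|no-pattern) syntax, which allows alternatives based on whether a previously matched group exists.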
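The same conditional syntax also accepts a group name instead of a number. The following is a small sketch, with the group name prefix and the pattern variable chosen here purely for illustration; unlike the pattern above, this variant allows the trailing " book" only when the prefix is absent:

```python
import re

# (?(prefix)|...) : if the named group "prefix" matched, require nothing
# more (empty yes-pattern); otherwise require " book" (no-pattern).
PATTERN = re.compile(
    r"^(?P<prefix>the book which )?I want (?:is |the )red(?(prefix)| book)$"
)


def matches(sentence):
    """Return True if the sentence matches PATTERN."""
    return PATTERN.search(sentence) is not None


if __name__ == "__main__":
    for s in ("I want the red book",
              "the book which I want is red",
              "I want the red",
              "the book which I want is red book"):
        print(s, "->", matches(s))
```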

Answer 3

import re

# regx1: accepts the prefix or a trailing " book", but not both.
regx1 = re.compile(r'^(the book which )*I want (is |the )red'
                   r'((?(1)|(?: book)))$')

# regx2: at least one is required; " book" may also follow the prefix.
regx2 = re.compile(r'^(the book which )*I want (is |the )red'
                   r'((?(1)(?: book)*|(?: book)))$')

for x in ("I want the red book",
          "the book which I want is red",
          "I want the red",
          "the book which I want is red book"):
    print(x)
    for regx in (regx1, regx2):
        m = regx.search(x)
        print(m.groups() if m else 'No match')
    print()

Result

I want the red book
(None, 'the ', ' book')
(None, 'the ', ' book')

the book which I want is red
('the book which ', 'is ', '')
('the book which ', 'is ', '')

I want the red
No match
No match

the book which I want is red book
No match
('the book which ', 'is ', ' book')

Edit

Your regex

^(the book which )*I want (is |the )red (book)*$

does not match correctly because of the space it requires after "red" in every sentence.

It must be

'^(the book which )*I want (is |the )red( book)*$'
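A quick check, sketched here with hypothetical variable names, shows the difference: the original pattern demands a space after "red" and so rejects string b, while the corrected pattern accepts both strings. Note that the corrected pattern still accepts "I want the red", which is the problem the conditional patterns above address:

```python
import re

# Pattern from the question: the space after "red" is mandatory.
ORIGINAL = re.compile(r"^(the book which )*I want (is |the )red (book)*$")
# Corrected pattern: the space belongs to the optional " book" group.
FIXED = re.compile(r"^(the book which )*I want (is |the )red( book)*$")

a = "I want the red book"
b = "the book which I want is red"

if __name__ == "__main__":
    for name, s in (("a", a), ("b", b)):
        print(name, "original:", bool(ORIGINAL.search(s)),
              "fixed:", bool(FIXED.search(s)))
```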

Regex - how to allow non-adjacent alternatives?
